Tony Mullen

 

Special Lecture in Computer Science A:

Natural Language Processing in Python

 

This course builds on much of the material covered in last term’s Introduction to NLP, but with an added emphasis on practical implementation. The course will also serve as an introduction to programming in Python. Topics in NLP covered will include tokenization, part-of-speech tagging, syntax and parsing, and statistical and probabilistic methodologies. Students will make use of the Natural Language Toolkit (NLTK) in Python and will be expected to do a good deal of programming.

 

Required texts are “Speech and Language Processing: An Introduction to Natural Language Processing, Computational Linguistics and Speech Recognition” by Daniel Jurafsky and James Martin, and the NLTK tutorials.

 

Recommended texts include Dive Into Python by Mark Pilgrim and ‰‚߂ĂÌPython\Object]oriented programming by Mark Lutz and David Ascher.

 

A strongly recommended resource is the Python Tutorial by Guido van Rossum.

 

Here is a regular expression tutorial.

And here is some information in Japanese.

 

Week 1

● Read NLTK Tutorial: Preface and NLTK Tutorial: Introduction to Natural Language Processing.

Run the three graphical demonstrations here:

http://nltk.sourceforge.net/getting_started.html

● In the recursive descent and shift-reduce parsers, you can edit the grammar and the text to be parsed under “edit”. Edit the grammar to include “friend” as a noun, “bit” as a verb, and “my” as a determiner.

Parse the sentence “a dog bit my friend in the park” in each parser and take screenshots of the correct parse tree.

● Read Chapter 2 of “Dive Into Python” (or the first 2 chapters of the introductory Python textbook of your choice).

If you plan to work on your own computer during this course, please try to install NLTK.

Check your work here.
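
For the Week 1 parsing exercise, the effect of adding lexical rules to a grammar can also be sketched in plain Python, outside the graphical demos. This is only an illustrative sketch: the phrase-structure rules below are an assumption and are not the grammar that ships with the NLTK demos, and the functions `parse`, `expand`, and `parses` are hypothetical helpers written for this example. The words “friend”, “bit”, and “my” are added as a noun, a verb, and a determiner, as the exercise asks.

```python
# Assumed toy grammar; the NLTK demo's own rules may differ.
GRAMMAR = {
    "S": [["NP", "VP"]],
    "NP": [["Det", "N", "PP"], ["Det", "N"]],
    "VP": [["V", "NP", "PP"], ["V", "NP"]],
    "PP": [["P", "NP"]],
    "Det": [["a"], ["the"], ["my"]],         # "my" added as a determiner
    "N": [["dog"], ["park"], ["friend"]],    # "friend" added as a noun
    "V": [["saw"], ["bit"]],                 # "bit" added as a verb
    "P": [["in"]],
}


def parse(symbol, tokens, start):
    """Yield (tree, end) for every way `symbol` can cover tokens[start:end]."""
    if symbol not in GRAMMAR:
        # Anything without a rule is treated as a terminal word.
        if start < len(tokens) and tokens[start] == symbol:
            yield symbol, start + 1
        return
    for rhs in GRAMMAR[symbol]:
        for children, end in expand(rhs, tokens, start):
            yield (symbol, children), end


def expand(rhs, tokens, start):
    """Yield (child trees, end) for a sequence of right-hand-side symbols."""
    if not rhs:
        yield [], start
        return
    for tree, mid in parse(rhs[0], tokens, start):
        for rest, end in expand(rhs[1:], tokens, mid):
            yield [tree] + rest, end


def parses(sentence):
    """Return every tree rooted in S that covers the whole sentence."""
    tokens = sentence.split()
    return [tree for tree, end in parse("S", tokens, 0) if end == len(tokens)]


trees = parses("a dog bit my friend in the park")
print(len(trees))  # 2 under this assumed grammar
```

Under these assumed rules the sentence has two parses, because “in the park” can attach either to the verb phrase or to “my friend”. That ambiguity is worth keeping in mind when deciding which tree is the correct one to capture.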

 

Week 2

● Understand the examples in the NLTK tutorial “Tokenizing Text and Classifying Words”.

● Read Chapter 3 of “Dive Into Python” or the chapter on native data structures (lists, tuples, dictionaries) in your preferred Python textbook.

● Create a “tuple” of five country names and a “list” of their capital cities. Using two variables to represent these structures, create a “dictionary” which associates the countries in the tuple with the cities in the list, using a single line of code. Cut and paste the sequence of commands from IDLE into an email.

- Hint: you will probably want to use the dict() and zip() functions, which you can find out about in the Data Structures chapter of the Python Tutorial.

● Do Exercises 1 and 2 in "Tokenizing Text and Classifying Words" of the tutorial. For the text, please use this text.

Check your work here.
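
As a reminder of how dict() and zip() work together, here is a minimal sketch using unrelated data, so that the Week 2 dictionary exercise itself is left to you. The variable names `fruits`, `colours`, and `fruit_colours` are made up for this illustration.

```python
# Hypothetical example data, unrelated to the exercise.
fruits = ("apple", "banana", "cherry")    # a tuple
colours = ["red", "yellow", "dark red"]   # a list

# zip() pairs items by position; dict() turns each pair into a key/value entry.
fruit_colours = dict(zip(fruits, colours))

print(fruit_colours["banana"])  # yellow
```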

 

Week 3

● Computer room tutorial session.

Work on Week 2 exercises.

 

Week 4

● Review Objects and Classes/Frequency Distributions.

● Do Exercises 3, 4 and 5 in "Tokenizing Text and Classifying Words" of the tutorial.

Solution to exercise 3.

Solution to exercise 4

Solution to exercise 5
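
As a reminder of what a frequency distribution records, here is a minimal sketch in plain Python. It uses collections.Counter from the standard library rather than NLTK's own frequency distribution class, and the sample sentence and whitespace tokenization are assumptions made for illustration.

```python
from collections import Counter

# Assumed sample text; any tokenized text would do.
text = "the dog saw the cat and the cat saw the dog"
tokens = text.split()  # naive whitespace tokenization

# A frequency distribution maps each distinct token to its count.
freq = Counter(tokens)

# Print the three most frequent tokens with their counts.
for word, count in freq.most_common(3):
    print(word, count)
```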

 

Week 5

● Understand all examples in NLTK Tutorial: Tagging.

● Do Exercises 1 and 2 in NLTK Tutorial: Tagging.

(Partial) solution to exercise 1

 

Week 6

● Computer room tutorial session.

Work on Week 4 and 5 exercises.

 

Week 7

● Review Chapter 9 in Jurafsky and Martin.

● Read NLTK Tutorial: Chunking and understand the examples.

● Do Exercises 1, 2, and 3 in NLTK Tutorial: Chunking.

 

Week 8

● Computer room tutorial session.

Work on exercises.

 

Week 9

● Do Exercises 9.1, 9.3 and 9.5 in Jurafsky and Martin.

● Read Chapter 10 through Section 10.3.

● In the NLTK parsing demo, create a simple grammar for Japanese that accepts:

               inu ga neko o mita

               watashi ga hon o yonda

               watashi ha inu o mita

               neko ha inu ga mita

      But not:

               *inu o neko o mita

               *watashi ga hon ga yonda

● Discuss the problems. What sentences are allowed which shouldn't be? Why?

● Consider how a grammar might be made to account for these phenomena using a "postpositional phrase" constituent consisting of an NP followed by a postposition. What difficulties come up?
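
One way to explore these questions is to check a candidate grammar against the two sentence lists automatically. The sketch below is written in plain Python rather than in the NLTK parsing demo. Its grammar is a deliberately naive reading of the postpositional-phrase idea, assumed here purely for illustration and not offered as a model answer, and the functions `ends` and `accepts` are hypothetical helpers.

```python
# Naive assumed grammar: a sentence is two postpositional phrases and a verb.
GRAMMAR = {
    "S": [["PP", "PP", "V"]],
    "PP": [["NP", "P"]],
    "NP": [["N"]],
    "N": [["inu"], ["neko"], ["watashi"], ["hon"]],
    "P": [["ga"], ["o"], ["ha"]],
    "V": [["mita"], ["yonda"]],
}


def ends(symbol, tokens, start):
    """Return the positions where `symbol` can end if it begins at `start`."""
    if symbol not in GRAMMAR:
        # Anything without a rule is treated as a terminal word.
        matches = start < len(tokens) and tokens[start] == symbol
        return {start + 1} if matches else set()
    result = set()
    for rhs in GRAMMAR[symbol]:
        positions = {start}
        for part in rhs:
            positions = {e for p in positions for e in ends(part, tokens, p)}
        result |= positions
    return result


def accepts(sentence):
    """True if the grammar derives the whole sentence from S."""
    tokens = sentence.split()
    return len(tokens) in ends("S", tokens, 0)


SHOULD_ACCEPT = [
    "inu ga neko o mita",
    "watashi ga hon o yonda",
    "watashi ha inu o mita",
    "neko ha inu ga mita",
]
SHOULD_REJECT = [
    "inu o neko o mita",
    "watashi ga hon ga yonda",
]

for sentence in SHOULD_ACCEPT + SHOULD_REJECT:
    print(accepts(sentence), sentence)
```

With this naive grammar every sentence is accepted, including the two that should be rejected, because nothing in the rules restricts which postpositions may occur together. Seeing why that happens is a useful starting point for the discussion.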