Tony Mullen


Special Lecture in Computer Science A:

Natural Language Processing in Python


This course builds on much of the material covered in last term's Introduction to NLP, but with an added emphasis on practical implementation. The course will also serve as an introduction to programming in Python. Topics in NLP covered will include tokenization, part-of-speech tagging, syntax and parsing, and statistical and probabilistic methodologies. Students will make use of the Natural Language Toolkit (NLTK) in Python and will be expected to do a good deal of programming.


Required texts include Speech and Language Processing: An Introduction to Natural Language Processing, Computational Linguistics and Speech Recognition by Daniel Jurafsky and James Martin and the NLTK tutorials.


Recommended texts include Dive Into Python by Mark Pilgrim and Python: Object-oriented programming (Japanese-language edition) by Mark Lutz and David Ascher.


A strongly recommended resource is the Python Tutorial by Guido van Rossum.


Here is a regular expression tutorial.

And here is some information in Japanese.


Week 1

Read NLTK Tutorial: Preface and NLTK Tutorial: Introduction to Natural Language Processing

Run the three graphical demonstrations here:

In the recursive descent and the shift-reduce parser, you can edit the grammar and the text to be parsed under "edit". Edit the grammar to include "friend" as a noun, "bit" as a verb, and "my" as a determiner.

Parse the sentence "a dog bit my friend in the park" in each parser and take screenshots of the correct parse tree.
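To get a feel for what the recursive descent demo does with the edited grammar, here is a minimal sketch in plain Python. It is not the NLTK demo or its API. The grammar rules, the `GRAMMAR` dictionary, and the function names `parse`, `expand`, and `parse_sentence` are all assumptions made for illustration; the grammar loaded in the graphical demo may differ.

```python
# A toy context-free grammar, written as: symbol -> list of alternative right-hand sides.
# "friend", "bit" and "my" are the lexical entries the exercise asks you to add.
# The phrase-structure rules are assumed for this sketch.
GRAMMAR = {
    "S": [["NP", "VP"]],
    "NP": [["Det", "N", "PP"], ["Det", "N"]],
    "VP": [["V", "NP", "PP"], ["V", "NP"]],
    "PP": [["P", "NP"]],
    "Det": [["a"], ["the"], ["my"]],
    "N": [["dog"], ["park"], ["friend"]],
    "V": [["bit"]],
    "P": [["in"]],
}


def parse(symbol, tokens, start):
    """Yield (tree, end) for every way `symbol` can cover tokens[start:end]."""
    if symbol not in GRAMMAR:
        # Terminal symbol: it must match the next word exactly.
        if start < len(tokens) and tokens[start] == symbol:
            yield symbol, start + 1
        return
    for production in GRAMMAR[symbol]:
        yield from expand(symbol, production, tokens, start)


def expand(symbol, production, tokens, start):
    """Try to match each child of one production, left to right."""
    partial = [([], start)]  # (children found so far, next token position)
    for child in production:
        partial = [
            (children + [subtree], end)
            for children, pos in partial
            for subtree, end in parse(child, tokens, pos)
        ]
    for children, end in partial:
        yield (symbol, *children), end


def parse_sentence(sentence):
    """Return every parse tree of category S that covers the whole sentence."""
    tokens = sentence.split()
    return [tree for tree, end in parse("S", tokens, 0) if end == len(tokens)]


trees = parse_sentence("a dog bit my friend in the park")
for tree in trees:
    print(tree)
```

Under this assumed grammar the sentence gets more than one tree, because "in the park" can attach either to the noun phrase "my friend" or to the verb phrase. Which attachment counts as the correct parse is the question the exercise asks you to think about.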

Read Chapter 2 of "Dive Into Python" (or the first 2 chapters of the introductory Python textbook of your choice).

If you plan to work on your own computer during this course, please try to install NLTK.

Check your work here.


Week 2

Understand the examples in the NLTK tutorial "Tokenizing Text and Classifying Words".

Read Chapter 3 of "Dive Into Python" or the chapter on native data structures (lists, tuples, dictionaries) in your preferred Python textbook.

Create a "tuple" of five country names, and a "list" of their capital cities. Using two variables to represent these structures, create a "dictionary" which associates the countries in the tuple with the cities in the list, using a single line of code. Cut and paste the sequence of commands from IDLE into an email.

Hint: you will probably want to use the dict() and the zip() functions, which you can find out about in the Data Structures chapter of the Python Tutorial.
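As an illustration of how dict() and zip() work together, here is a short sketch on different data from the exercise: words paired with part-of-speech labels. The variable names and the data are made up for this example.

```python
# A tuple of words and a list of their part-of-speech labels, in the same order.
words = ("dog", "bit", "my", "friend", "the")
tags = ["noun", "verb", "determiner", "noun", "determiner"]

# zip() pairs the items position by position; dict() turns the pairs into a mapping.
word_to_tag = dict(zip(words, tags))

print(word_to_tag["bit"])
```

Note that zip() stops at the end of the shorter sequence, so the two structures should have the same length.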

Do Exercises 1 and 2 in "Tokenizing Text and Classifying Words" of the tutorial.  For the text, please use this text.

Check your work here.


Week 3

Computer room tutorial session.

Work on Week 2 exercises.


Week 4

Review Objects and Classes/Frequency Distributions.

Do Exercises 3, 4 and 5 in "Tokenizing Text and Classifying Words" of the tutorial.

Solution to exercise 3.

Solution to exercise 4

Solution to exercise 5


Week 5

Understand all examples in NLTK Tutorial: Tagging.

In NLTK Tutorial: Tagging, do exercises 1 and 2.

(Partial) solution to exercise 1


Week 6

Computer room tutorial session.

Work on Week 4 and 5 exercises.


Week 7

Review Chapter 9 in Jurafsky & Martin

Read NLTK Tutorial: Chunking and understand the examples

Do exercises 1, 2, and 3 in NLTK Tutorial: Chunking


Week 8

Computer room tutorial session.

Work on exercises.


Week 9

Do exercises 9.1, 9.3 and 9.5 in Jurafsky and Martin

Read chapter 10 through 10.3

In the NLTK parsing demo, create a simple grammar for Japanese which is capable of accepting:

               inu ga neko o mita

               watashi ga hon o yonda

               watashi ha inu o mita

               neko ha inu ga mita

      But not:

               *inu o neko o mita

               *watashi ga hon ga yonda

Discuss the problems.  What sentences are allowed which shouldn't be?  Why?

Consider how a grammar might be made to account for these phenomena using a "postpositional phrase" constituent consisting of an NP followed by a postposition.  What difficulties come up?
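As a starting point for the discussion, here is a minimal sketch in plain Python (not the NLTK parsing demo) of the most naive "postpositional phrase" grammar: S -> PP PP V and PP -> N P, with every particle in the single category P. With this lexicon those rules generate exactly the tag pattern N P N P V, so the recognizer simply checks for that pattern. The lexicon, the category names, and the function `accepts` are assumptions made for this illustration.

```python
# Assumed lexicon: every noun is N, every particle is P, every verb is V.
LEXICON = {
    "inu": "N", "neko": "N", "watashi": "N", "hon": "N",
    "ga": "P", "o": "P", "ha": "P",
    "mita": "V", "yonda": "V",
}


def accepts(sentence):
    """Recognize S -> PP PP V with PP -> N P, i.e. the tag pattern N P N P V."""
    tags = [LEXICON.get(word) for word in sentence.split()]
    return tags == ["N", "P", "N", "P", "V"]


wanted = [
    "inu ga neko o mita",
    "watashi ga hon o yonda",
    "watashi ha inu o mita",
    "neko ha inu ga mita",
]
unwanted = [
    "inu o neko o mita",
    "watashi ga hon ga yonda",
]

# Sentences the grammar should reject but does not.
overgenerated = [s for s in unwanted if accepts(s)]

for s in wanted + unwanted:
    print(s, "->", accepts(s))
```

Because the rules never look at which particle appears, this grammar accepts the starred sentences along with the good ones. Any fix has to record the particle somewhere, for example by splitting PP into several categories, and that is where the difficulties the exercise asks about begin.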