Special Lecture in Computer Science A:
Natural Language Processing in Python
This course builds on much of the material covered in last term's Introduction to NLP,
but with an added emphasis on practical implementation. The course will also serve as an
introduction to programming in Python.
Topics in NLP covered will include tokenization, part-of-speech tagging,
syntax and parsing, and statistical and probabilistic methodologies. Students will make use of the Natural Language Toolkit (NLTK) in
Python and will be expected to do a good deal of programming.
Required texts include Speech and Language Processing: An Introduction to Natural
Language Processing, Computational Linguistics and Speech Recognition by Daniel Jurafsky
and James Martin, and the NLTK tutorials.
Recommended texts
include Dive Into Python by Mark
Pilgrim and ‰‚߂ĂÌPython\Object]oriented programming by Mark Lutz and
David Ascher.
A strongly recommended resource is the Python Tutorial by Guido van Rossum.
Here is a regular expression tutorial.
And here is some information in Japanese.
Week 1
● Read NLTK Tutorial: Preface and NLTK Tutorial: Introduction to Natural Language Processing
Run the three graphical demonstrations here:
http://nltk.sourceforge.net/getting_started.html
● In the recursive descent and the shift-reduce parsers, you can edit the grammar and the
text to be parsed under "edit". Edit the grammar to include "friend" as a noun, "bit" as
a verb, and "my" as a determiner. Parse the sentence "a dog bit my friend in the park"
in each parser and take screenshots of the correct parse tree.
● Read Chapter 2 of "Dive Into Python" (or the first 2 chapters of the introductory
Python textbook of your choice).
If you plan to work on your own computer during this course, please try to install NLTK.
Week 2
● Understand the examples in the NLTK tutorial "Tokenizing Text and Classifying Words".
● Read Chapter 3 of "Dive Into Python" or the chapter on native data structures
(lists, tuples, dictionaries) in your preferred Python textbook.
● Create a "tuple" of five country names and a "list" of their capital cities.
Using two variables to represent these structures, create a "dictionary"
which associates the countries in the tuple with the cities in the list, using
a single line of code. Cut and paste the sequence of commands from IDLE into an email.
-Hint: you will probably want to use the dict()
and the zip() functions, which you can find out about in the Data Structures chapter of the
Python Tutorial.
● Do Exercises 1 and 2 in "Tokenizing Text and Classifying Words" of the tutorial.
For the text, please use this text.
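The dict() and zip() hint for the dictionary exercise can be illustrated with a minimal sketch. It deliberately uses fruit and colour data rather than countries and capitals, and the variable names are made up for this example.

```python
# A tuple of keys and a list of values, held in two variables.
# (Illustrative data only; the exercise itself asks for countries and capitals.)
fruits = ("apple", "banana", "cherry")
colours = ["red", "yellow", "dark red"]

# zip() pairs the items position by position; dict() turns the pairs into a mapping.
fruit_colours = dict(zip(fruits, colours))

print(fruit_colours["banana"])
```

zip() stops at the shorter of its two inputs, so the tuple and the list should have the same length.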
Week 3
● Computer room tutorial session.
Work on Week 2 exercises.
Week 4
● Review Objects and Classes/Frequency Distributions.
● Do Exercises 3, 4 and 5 in "Tokenizing Text and Classifying Words" of the tutorial.
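As a rough illustration of what a frequency distribution over tokens looks like, the sketch below uses only the Python standard library (re and collections.Counter) rather than NLTK's own classes, which may behave differently. The tokenize function, its regular expression and the sample sentence are assumptions made for this example.

```python
import re
from collections import Counter


def tokenize(text):
    """Split text into lowercase word tokens, keeping simple contractions whole."""
    return re.findall(r"[a-z0-9]+(?:'[a-z]+)?", text.lower())


sample = "The dog bit my friend. My friend didn't bite the dog."
tokens = tokenize(sample)

# Counter maps each distinct token to the number of times it occurs.
freq = Counter(tokens)

for word, count in freq.most_common(3):
    print(word, count)
```

A real tokenizer has to make more decisions than this one does (hyphens, numbers with decimal points, abbreviations), which is part of what the tutorial exercises explore.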
Week 5
● Understand all examples in NLTK Tutorial: Tagging
● In NLTK Tutorial: Tagging do exercises 1 and 2
(Partial) solution to exercise 1
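One simple idea in part-of-speech tagging is to give each word the tag it received most often in hand-tagged training data. The sketch below shows that baseline in plain Python; it is not the NLTK tagger API and not a solution to the tutorial exercises. The function names, the tiny training set and the default tag "NN" are assumptions made for illustration.

```python
from collections import Counter, defaultdict


def train_most_frequent_tag(tagged_words):
    """Map each word to the tag it was given most often in the training data."""
    counts = defaultdict(Counter)
    for word, tag in tagged_words:
        counts[word.lower()][tag] += 1
    return {word: tags.most_common(1)[0][0] for word, tags in counts.items()}


def tag_words(words, model, default="NN"):
    """Tag each word from the model; unseen words get the default tag."""
    return [(word, model.get(word.lower(), default)) for word in words]


# Hypothetical hand-tagged data: "bit" appears twice as a verb, once as a noun.
training = [
    ("the", "DT"), ("dog", "NN"), ("bit", "VBD"), ("bit", "VBD"),
    ("bit", "NN"), ("my", "PRP$"), ("a", "DT"),
]
model = train_most_frequent_tag(training)
tagged = tag_words("My dog bit a cat".split(), model)
print(tagged)
```

Because the model ignores context, it always tags "bit" as a verb here, even in a phrase like "a bit", which is the kind of error that context-sensitive taggers try to reduce.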
Week 6
● Computer room tutorial session.
Work on Week 4 and 5 exercises.
Week 7
● Review Chapter 9 in Jurafsky & Martin
● Read NLTK Tutorial: Chunking and understand the examples
● Do exercises 1, 2, and 3 in NLTK Tutorial: Chunking
Week 8
● Computer room tutorial session.
Work on exercises.
Week 9
● Do exercises 9.1, 9.3 and 9.5 in Jurafsky and Martin
● Read Chapter 10 through 10.3
● In the NLTK parsing demo, create a simple grammar for Japanese which is capable of
accepting:
– inu ga neko o mita
– watashi ga hon o yonda
– watashi ha inu o mita
– neko ha inu ga mita
But not:
– *inu o neko o mita
– *watashi ga hon ga yonda
● Discuss the problems. What sentences are allowed which shouldn't be? Why?
● Consider how a grammar might be made to account for these phenomena using a
"postpositional phrase" constituent consisting of an NP followed by a postposition.
What difficulties come up?
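To make the overgeneration question concrete, here is a minimal sketch of one possible toy grammar, written as a plain Python dictionary with a naive top-down recognizer. It does not use the NLTK parsing demo or its grammar format. The category names (SUBJ, OBJ, TOP), the rules and the function names are assumptions chosen for illustration, and other grammars would serve equally well.

```python
# Each nonterminal maps to a list of productions; any symbol without an
# entry in the dictionary is treated as a terminal (a word).
GRAMMAR = {
    "S": [["SUBJ", "OBJ", "V"], ["TOP", "OBJ", "V"], ["TOP", "SUBJ", "V"]],
    "SUBJ": [["N", "ga"]],
    "OBJ": [["N", "o"]],
    "TOP": [["N", "ha"]],
    "N": [["inu"], ["neko"], ["watashi"], ["hon"]],
    "V": [["mita"], ["yonda"]],
}


def parse_ends(symbol, words, start):
    """Yield every position at which `symbol` can end if it begins at `start`."""
    if symbol not in GRAMMAR:  # terminal: must match the next word exactly
        if start < len(words) and words[start] == symbol:
            yield start + 1
        return
    for production in GRAMMAR[symbol]:
        positions = [start]
        for part in production:
            positions = [end for pos in positions
                         for end in parse_ends(part, words, pos)]
        yield from positions


def accepts(sentence):
    """Return True if the whole sentence can be derived from S."""
    words = sentence.split()
    return len(words) in parse_ends("S", words, 0)


for s in ["inu ga neko o mita", "inu o neko o mita", "hon ga inu o yonda"]:
    print(s, accepts(s))
```

This grammar accepts the four target sentences and rejects the two starred ones, but it also accepts "hon ga inu o yonda" ("the book read the dog"), because nothing in the rules restricts which nouns may combine with which verbs.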