User:Alvations/NLTK cheatsheet/CorporaReaders

Here's a tutorial on how to use several corpora that you might be interested in.

Downloading Corpora
First, if you're unsure of which corpus you want to use, it is recommended that you download all the corpora you find in nltk.download(). Follow the instructions on http://nltk.org/book/ch01.html#fig-nltk-downloader, in brief:

$ python
Python 2.7.4 (default, Apr 19 2013, 18:28:01)
[GCC 4.7.3] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> import nltk
>>> nltk.download()
 * 1) Then a window will pop up; simply select the corpora you want.
 * 2) Press the download button, sit and wait, then close the window (or skip the GUI with the one-liner sketched below).
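If you would rather not click through the GUI, the downloader can also be driven non-interactively. A minimal sketch; passing 'all' fetches every corpus and model, which takes a while and several gigabytes of disk:

>>> import nltk
>>> nltk.download('all')  # downloads everything to ~/nltk_data by default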

Book Corpora
The introductory tutorial for NLTK gives a very nice description and use of the book corpora in NLTK, so I shall simply point you to the page: http://nltk.org/book/ch01.html. However, you might be wondering: what if I have a text file and I want to use the same concordance functions as shown in NLTK? Try the following:

import os
from nltk.corpus import PlaintextCorpusReader
from nltk.text import Text

text1 = """This is a story about a foo bar. Foo likes to go to the bar and his last name is also bar. At home, he kept a lot of gold chocolate bars."""
text2 = """One day, foo went to the bar in his neighborhood and was shot down by a sheep, a blah blah black sheep."""
texts = [text1, text2]

corpusdir = 'mycorpus/' # the directory where you keep the corpus
if not os.path.isdir(corpusdir):
    os.mkdir(corpusdir)

# 1) Create an example corpus and output it as text files.
for i, t in enumerate(texts):
    outfilename = 'text'+str(i)+'.txt'
    print>>open(corpusdir+outfilename,'w'), t

# 2) Read the example corpus into NLTK's corpus reader class.
mycorpus = PlaintextCorpusReader(corpusdir, '.*')

# 3) Read NLTK's corpus into NLTK's Text class.
mytext = Text(mycorpus.words())
mytext.concordance('foo')
 * 1) For example, create an example corpus and output it as text files.
 * 2) Read the example corpus into NLTK's corpus class.
 * 3) Read the NLTK corpus into NLTK's Text class, where your book-like concordance search is available (more Text helpers are sketched below).
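Once your corpus is wrapped in a Text object, the other book-style helpers from http://nltk.org/book/ch01.html work on it too. A minimal sketch, reusing the mytext object from above:

mytext.similar('bar')      # words that appear in similar contexts to 'bar'
mytext.collocations()      # frequent bigram collocations in the corpus
print mytext.count('foo')  # raw frequency of the token 'foo'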

Penn Treebank
Everybody likes/hates the Penn Treebank (PTB) for some reason or another. BTW, the Penn Treebank is not exactly a corpus but a collection of corpora that was extended with Part-of-Speech tags and Treebank parses. SADLY, the PTB is NOT FREE, but a sample of the English PTB is available from nltk.download().

Here's how to access the different pieces of information that one would need from the PTB with the NLTK interface.

from nltk.corpus import treebank
from nltk.corpus import treebank_chunk

# 1) To get the available files in the PTB sample.
ptbfiles = [i for i in treebank.fileids()]

# 2) To get bracketed trees from the PTB; note that each tree is a sentence.
for pf in ptbfiles:
    for tree in treebank.parsed_sents(pf):
        print tree

# 3) To get chunks from the PTB.
for pf in ptbfiles:
    for chunk_sent in treebank_chunk.chunked_sents(pf):
        chunks = [i for i in chunk_sent.subtrees(filter=lambda t: t.node == 'NP')]
        for i in chunks:
            print i

# 4) To just get the sentences and the POS.
for pf in ptbfiles:
    for tree in treebank.parsed_sents(pf):
        print " ".join(tree.leaves()) # Just the sentence in a single string
        print tree.pos() # Sentence and POS in a list of tuples [(word,pos)...(word,pos)]
 * 1) To get the available files in the PTB sample.
 * 2) To get bracketed trees from the PTB; note that each tree is a sentence.
 * 3) To get chunks from the PTB.
 * 4) To just get the sentences and the POS (the flat annotations are also sketched below).
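If you only need the flat annotations rather than full trees, the treebank reader also exposes them directly. A minimal sketch using the standard reader methods ('wsj_0001.mrg' is the first fileid in the NLTK sample):

from nltk.corpus import treebank
print treebank.words('wsj_0001.mrg')        # just the tokens
print treebank.sents('wsj_0001.mrg')        # tokenized sentences
print treebank.tagged_words('wsj_0001.mrg') # (word, pos) tuples
print treebank.tagged_sents('wsj_0001.mrg') # one list of (word, pos) per sentence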

For the people who paid for the PTB and want the NLTK corpus reader to read all the PTB files instead of just the sample, you could either:

Look for the directory where the PTB sample is stored, then copy all the *.mrg files from the PTB into that directory, and the code above will work just fine:

$ python
>>> import nltk
>>> from nltk.corpus import treebank
>>> print len(treebank.fileids())
199 # some small number, i forgot how many...
>>> nltk.data.find('corpora/treebank/combined')
FileSystemPathPointer('/home/alvas/nltk_data/corpora/treebank/combined')
>>> exit()
$ find /home/alvas/savedcorpora/penntreebank_in_wsj_format/2.0/combined/wsj/ -type f -name "*.mrg" -exec cp \{\} /home/alvas/nltk_data/corpora/treebank/combined \;
$ python
>>> from nltk.corpus import treebank
>>> print len(treebank.fileids())
2312

Or skip the index of fileids given in the default NLTK sample and load the files from your own directory:

import os, glob
from nltk.corpus import treebank

penndir = '/home/alvas/savedcorpora/penntreebank_in_wsj_format/2.0/combined/wsj/'
for subdir in os.listdir(penndir):
    for pf in glob.glob(os.path.join(penndir, subdir, '*.mrg')):
        for tree in treebank.parsed_sents(pf):
            print tree
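Alternatively, instead of routing absolute paths through the sample reader, you can point a fresh reader at your own PTB directory. A minimal sketch, where penndir and the wsj_*.mrg naming are assumptions about your local copy:

from nltk.corpus.reader import BracketParseCorpusReader

penndir = '/home/alvas/savedcorpora/penntreebank_in_wsj_format/2.0/combined/wsj/'
# Match every *.mrg file in the subdirectories under penndir.
ptb = BracketParseCorpusReader(penndir, r'.*/wsj_.*\.mrg')
print len(ptb.fileids())
for tree in ptb.parsed_sents(ptb.fileids()[0]):
    print tree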

Senseval 2 Corpus
from nltk.corpus import senseval

# 1) This shows which files from senseval are in NLTK; sadly there are only 4.
print senseval.fileids()

# 2) This yields the line-by-line XML version of the senseval data;
#    note that the "\n" are also in each line.
print senseval.raw()

# 3) Most probably the individual instances are what you wanted to get.
for id in senseval.fileids():
    # This accesses all instances from each file.
    insts = senseval.instances(id)
    # Looping through the instances.
    for i in insts:
        print i
        # A SensevalInstance returns (word, position, context, senses),
        # so you can access each variable as such:
        print i.word, i.position, i.senses
        print i.context
 * 1) This shows what files from senseval are in NLTK. Sadly there are only 4.
 * 2) This yields the line-by-line XML version of the senseval data; note the "\n" are also in each line.
 * 3) Most probably the individual instances are what you wanted to get (a sense-counting example follows below).
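For example, to see how the senses are distributed in one of the four files, you can tally the senses attribute across instances. A minimal sketch ('hard.pos' is one of the fileids in the NLTK sample):

from collections import Counter
from nltk.corpus import senseval

# Each instance's senses attribute is a tuple of sense labels.
sense_counts = Counter(sense for inst in senseval.instances('hard.pos')
                       for sense in inst.senses)
print sense_counts.most_common()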

SemCor
SemCor is made up of 3 subcorpora from the Brown Corpus; 2 of the subcorpora have all content words tagged (i.e. /brown1/tagfiles/ and /brown2/tagfiles/), while 1 of them has only the verbs tagged (/brownv/tagfiles/).

It is indeed quite a pain to read the original SemCor corpus from Rada Mihalcea's website: the XML is ill-formed and cannot be easily read with BeautifulSoup, attribute values lack quotation marks, and, worst of all, special characters such as !@#$%^&*;? are left unescaped.

I have made several attempts to write and rewrite readers for SemCor, but the simplest, most foolproof and most elegant way is to use the corrected XML available in NLTK. Reading SemCor from NLTK is not unlike reading the PTB corpus. Here goes...

For the experts who just want to deal with the well-formed XML files:

$ python
>>> import nltk
>>> nltk.data.find('corpora/semcor/')
ZipFilePathPointer('/home/alvas/nltk_data/corpora/semcor.zip', 'semcor/')
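From there, one way to get at the word-level annotations is to parse a file's XML directly. A minimal sketch with ElementTree (the fileid 'brown1/tagfiles/br-a01.xml' and the wf attribute names are assumptions about the SemCor archive layout):

import xml.etree.ElementTree as ET
import nltk

# Locate one tagfile inside the semcor data that NLTK downloaded.
path = nltk.data.find('corpora/semcor/brown1/tagfiles/br-a01.xml')
tree = ET.parse(path.open())

# Each <wf> element carries a token plus its sense annotations.
for wf in tree.getroot().iter('wf'):
    print wf.text, wf.attrib.get('lemma'), wf.attrib.get('lexsn')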