User:Alvations/NLTK cheatsheet

NLTK (Natural Language Toolkit) (http://nltk.org/) is a nifty library for human language analysis (aka Computational Linguistics/Natural Language Processing). It is written in and for Python by the reputed computational/field linguist Steven Bird. This user page is set up to answer some hiccups that new NLTK users will chance upon, especially when using the WordNet modules. I am using an Ubuntu 12.04 LTS distro, so most of my solutions to troubleshoot the hiccups are in Ubuntu's context. I also have another user page for a Python-related cheatsheet.

How to install NLTK?
The main nltk page has a simple installation guide (see http://nltk.org/install.html). Note that NLTK requires Python versions 2.6-2.7.

How to check NLTK's version?
Firstly, open the Python interpreter, then import nltk and check nltk.__version__; voila, the version number pops up!

```python
$ python
Python 2.7.3 (default, Aug 1 2012, 05:14:39)
[GCC 4.6.3] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> import nltk
>>> nltk.__version__
'2.0.4'
```

How do I download the corpora and additional packages from nltk?
Although NLTK provides the basic tools for language processing, resources like corpora, dictionaries, treebanks, grammars and pre-trained language models are often necessary to process the training/testing data. To access these resources through NLTK, use the download() function:

```python
>>> import nltk
>>> nltk.__version__
'2.0.4'
>>> nltk.download()
showing info http://nltk.github.com/nltk_data/
```

How do I update nltk to the latest version?
I suggest pip install, to ensure that NLTK stays in sync with your Python distribution. To make sure that NLTK dependencies like PyYAML or NumPy are also updated, I would redo the installation process from the NLTK installation guide. To simply update NLTK, try this (the -U flag tells pip to upgrade an already-installed package):

```
$ sudo pip install -U nltk
```

Corpus Readers
From experience, I have tried to recode different corpus readers to read corpora for NLP, but none of my readers have come close to the elegance of Steven Bird's in NLTK. So here are my attempts to first go through all the pre-coded corpus readers, then extend them to take in more corpora in a more robust way. Go to NLTK Corpora Readers

How do I convert from the Princeton sense-key format (used in SemCor) to the offset-pos format (e.g. 01234567-x)?
The Princeton sense-key format is also used in SemCor (e.g. bus%1:06:00::), but I often find the offset-pos format (e.g. 01234567-x) more palatable. Here is a short piece of code to switch between the formats (Author: FrancisBond):

```python
>>> import nltk
>>> from nltk.corpus import wordnet as ewn
>>> def sc2ss(lemma, sensekey, senseno):
...     ### Look up a synset given the information from SemCor
...     ### Assuming it is the same WN version (e.g. 3.0)
...     p = ['', 'n', 'v', 'a', 'r', 's']  ## pos mapping
...     return ewn.synset('%s.%s.%02d' % (lemma, p[int(sensekey[0])], int(senseno)))
>>> ss = sc2ss('live', '2:42:06::', '2')
>>> print ss, ss.definition, ss.lexname, '(%08d-%s)' % (ss.offset, ss.pos)
```
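The two notations that sc2ss bridges can be illustrated without loading WordNet at all: the first digit of a SemCor sense key indexes into the POS list, and the offset-pos form is just zero-padded string formatting. A minimal stand-alone sketch (the helper names here are my own, not NLTK's):

```python
# The first digit of a sense key like "2:42:06::" indexes this list
# to give the WordNet POS letter ('2' -> 'v' for verb).
POS_MAP = ['', 'n', 'v', 'a', 'r', 's']

def sensekey_pos(sensekey):
    """POS letter encoded in a SemCor sense key."""
    return POS_MAP[int(sensekey[0])]

def offset_pos(offset, pos):
    """Format a synset offset and POS letter as '01234567-x'."""
    return '%08d-%s' % (offset, pos)

print(sensekey_pos('2:42:06::'))  # v
print(offset_pos(1234567, 'v'))   # 01234567-v
```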

How do I get the corpus instances from Senseval through NLTK?
(source: NLTK-user google grp)

```python
>>> import nltk  # alternatively: from nltk.corpus import senseval
>>> print nltk.corpus.senseval.fileids()  # shows which senseval files are in NLTK; sadly there are only 4
>>> print nltk.corpus.senseval.raw()  # yields the line-by-line XML version of the senseval data; note the "\n" at the end of each line
>>> for id in nltk.corpus.senseval.fileids():  # most probably the individual instances are what you want
...     insts = nltk.corpus.senseval.instances(id)  # this accesses all instances from each file
...     for i in insts:  # looping through the instances
...         print i  # a SensevalInstance returns (word, position, context, senses),
...                  # so you can access each variable as such:
...         print i.word, i.position, i.senses
...         print i.context
```
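The fields the loop above reads can be illustrated without downloading the corpus; a stand-in namedtuple mimicking the (word, position, context, senses) shape of a SensevalInstance, with made-up example values:

```python
from collections import namedtuple

# Stand-in for nltk's SensevalInstance, only to show the attributes
# accessed above; the values below are invented, not corpus data.
Instance = namedtuple('Instance', ['word', 'position', 'context', 'senses'])

inst = Instance(word='hard-a',
                position=2,
                context=[('it', 'PRP'), ('is', 'VBZ'), ('hard', 'JJ')],
                senses=('HARD1',))

print(inst.word, inst.position, inst.senses)
```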

How to get the Adverbial forms of Adjectives (quick => quickly)
Although it is rare, WordNet has a relation called pertainym. It connects the relevant adjective to its adverbial form.

```python
>>> from nltk.corpus import wordnet as wn
>>> for ss in wn.all_synsets():  # loop through all synsets in WordNet
...     for l in ss.lemmas:  # loop through the possible lemmas in that synset
...         x = l.pertainyms()  # access the lemma's pertainyms
...         if len(x) > 0:
...             print str(ss.offset)+"-"+ss.pos, l, x
```

= HPSG related =

How to read ERG *.tdl into python ?
```python
def readTDL(tdlfile):
    obj, temp = [], []
    for line in open(tdlfile):
        if line[0] == ";":  # skip TDL comment lines
            continue
        temp.append(line)
        if line[-3:] == "].\n":  # end of a TDL object
            obj.append("".join(temp).strip())
            temp = []
    return obj
```

Get a dictionary from ERG lexicon.tdl
```python
def vocab2lemmas(vocab):
    lemmas = set()
    for word in vocab:
        lemma = word.split()[0].rpartition("_")[0]
        if lemma == "":  # identifier has no underscore suffix
            lemma = word.split()[0]
        lemmas.add(lemma)
    return lemmas
```

```python
vocab = readTDL('lexicon.tdl')
lemmas = vocab2lemmas(vocab)
```
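To see what vocab2lemmas does without the actual lexicon.tdl on disk, here is the same lemma-stripping logic applied to one entry at a time; the ERG-style entry strings below are made up for illustration, and the helper name is my own:

```python
# Same splitting logic as vocab2lemmas: take the TDL identifier
# (first whitespace-separated token) and drop its trailing _<suffix>.
def lemma_of(entry):
    ident = entry.split()[0]
    lemma = ident.rpartition("_")[0]
    return lemma if lemma else ident  # no underscore: keep the identifier

print(lemma_of("abandon_v1 := v_np*_le &"))  # abandon
print(lemma_of("aardvark := n_-_c_le &"))    # aardvark
```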

Getting idioms from ERG idioms.mtr
```python
idioms = [i.split()[0] for i in readTDL('idioms.mtr')]
```

Getting HPSG parses from ACE
```python
import os

def installACE():
    os.system("wget -P ~/ http://sweaglesw.org/linguistics/ace/download/ace-0.9.16-x86-64.tar.gz")
    os.system("tar -zxvf ~/ace-0.9.16-x86-64.tar.gz -C ~/ace-0.9.16")
    os.system("wget -P ~/ http://sweaglesw.org/linguistics/ace/download/erg-1212-x86-64-0.9.16.dat.bz2")
    os.system("bzip2 -dc ~/erg-1212-x86-64-0.9.16.dat.bz2 > ~/ace-0.9.16/erg-1212-x86-64-0.9.16.dat")

def aceParse(sent, onlyMRS=False, parameters=""):
    if onlyMRS == True:
        parameters += " -T"
    return [p.strip() for p in os.popen("echo "+sent+" | ~/ace-0.9.16/ace -g ~/ace-0.9.16/erg-1212-x86-64-0.9.16.dat"+parameters) if p.strip() != ""][1:]

# installACE()  # if you have not installed ACE, uncomment this line

sentence = "This is a foo bar sentence."
parse_outputs = aceParse(sentence)
for po in parse_outputs:
    print po
```
 * If you have not installed ACE, uncomment and run installACE() first.
 * TODO: have a prettify function to make the ACE output humanly readable
 * TODO: add ACE parameter functions
 * TODO: properly get the MRS outputs into Python variables
 * TODO: a proper ace object so that I can have ace.install, ace.parse, parse.prettify
 * TODO: ace.update
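Since aceParse just shells out through os.popen, the command string it runs can be built and inspected on its own; a minimal sketch, where the helper name is my own and the paths and the -T flag mirror the snippet above:

```python
# Assemble the shell pipeline that aceParse pipes through os.popen;
# ace_command is a hypothetical helper, not part of ACE itself.
def ace_command(sent, grammar="~/ace-0.9.16/erg-1212-x86-64-0.9.16.dat", onlyMRS=False):
    params = " -T" if onlyMRS else ""  # -T restricts output to MRS, as above
    return 'echo %s | ~/ace-0.9.16/ace -g %s%s' % (sent, grammar, params)

print(ace_command("This is a foo bar sentence.", onlyMRS=True))
```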

= Simple NLP examples =

Movie Review classifier
http://stackoverflow.com/questions/21107075/classification-using-movie-review-corpus-in-nltk-python

Numpy array usage
http://stackoverflow.com/questions/27027680/working-out-word-document-vectors-from-nested-dictionary