Udapi Tutorial

by Martin Popel


Udapi is an API and framework for processing Universal Dependencies (UD). It is available for Python, Perl and Java. This tutorial uses the Python version and assumes Linux with Bash and Python 3.3 or higher.

You can download my slides about UD and Udapi.

Step 1: Install Udapi

Follow the instructions at https://github.com/udapi/udapi-python.
Solution
pip3 install --user --upgrade git+https://github.com/udapi/udapi-python.git
export PATH="$HOME/.local/bin/:$PATH"
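
To check that the installation worked, a small Python sketch like the following may help. It only uses the standard library and assumes that the package installs an importable module named udapi and a command named udapy (the names used throughout this tutorial). The function name udapi_status is made up for this example.

```python
import importlib.util
import shutil


def udapi_status(module="udapi", command="udapy"):
    """Report whether the module is importable and the command is on PATH."""
    return {
        # find_spec returns None when a top-level module cannot be found
        "module_importable": importlib.util.find_spec(module) is not None,
        # shutil.which returns the full path, or None if the command is not on PATH
        "command_path": shutil.which(command),
    }


if __name__ == "__main__":
    print(udapi_status())
```

If command_path is None although the module is importable, the export PATH line above was probably not applied in the current shell.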

Step 2: Download sample data

Download and extract ud20sample.tgz. It contains just 100 sentences for each language plus two bigger files (train.conllu and dev.conllu) for English and Czech. For the full UD v2.0 release, go to Lindat.
Solution
wget http://ufal.mff.cuni.cz/~popel/udapi/ud20sample.tgz
tar -xf ud20sample.tgz
cd sample
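
As a quick sanity check of the extracted data, the number of sentences in a file can be counted without Udapi. This is a minimal sketch that relies only on the CoNLL-U convention that sentences are separated by blank lines. The function name count_sentences is made up for this example.

```python
def count_sentences(conllu_text):
    """Count sentences, i.e. maximal runs of non-blank lines."""
    count = 0
    in_sentence = False
    for line in conllu_text.splitlines():
        if line.strip():
            if not in_sentence:
                # first non-blank line after a blank one starts a new sentence
                count += 1
                in_sentence = True
        else:
            in_sentence = False
    return count
```

For example, count_sentences(open("UD_English/sample.conllu", encoding="utf-8").read()) should match the number of sentences stated above for the sample files.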

Step 3: Browse your favorite language

Use the udapy commands from my slides.
Solution
cat UD_Ancient_Greek/sample.conllu | udapy -T | less -R

The -R option tells less to display colors (instead of their ANSI codes). Type q to exit.

The -T option prints the trees in text mode; it is a shortcut for udapy write.TextModeTrees color=1. Run udapy --help to see other useful shortcuts, e.g.

cat UD_English/sample.conllu | udapy -H > en.html
will create an HTML version that you can open in any modern browser. -HA will also include all the nodes' attributes in the HTML output.
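
To see where the tree structure comes from, here is a simplified sketch that builds an indented tree from the HEAD and DEPREL columns of one CoNLL-U sentence. It is not Udapi's implementation and its output format differs from what udapy -T prints. The function names and the three-word sample sentence are made up for illustration.

```python
# A made-up sentence in CoNLL-U format (10 tab-separated columns per word).
SAMPLE = "\n".join([
    "# text = Dogs bark loudly",
    "1\tDogs\tdog\tNOUN\t_\t_\t2\tnsubj\t_\t_",
    "2\tbark\tbark\tVERB\t_\t_\t0\troot\t_\t_",
    "3\tloudly\tloudly\tADV\t_\t_\t2\tadvmod\t_\t_",
])


def parse_sentence(block):
    """Return (id, form, head, deprel) tuples for the basic words of one sentence."""
    words = []
    for line in block.splitlines():
        if not line or line.startswith("#"):
            continue  # skip blank lines and comment lines
        cols = line.split("\t")
        if not cols[0].isdigit():
            continue  # skip multiword token ranges (1-2) and empty nodes (1.1)
        words.append((int(cols[0]), cols[1], int(cols[6]), cols[7]))
    return words


def render(words, head=0, depth=0):
    """Return indented lines; each word is listed beneath its head (0 = root)."""
    lines = []
    for word_id, form, word_head, deprel in words:
        if word_head == head:
            lines.append("  " * depth + f"{form} ({deprel})")
            lines.extend(render(words, word_id, depth + 1))
    return lines


if __name__ == "__main__":
    print("\n".join(render(parse_sentence(SAMPLE))))
```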

Step 4: Find out what the discourse deprel (dependency relation) means

Option A: search the documentation.

Solution

See the documentation of the discourse deprel.

Option B: browse UD_English/dev.conllu as in the previous step and find the occurrences of discourse.

Solution
udapy -T < UD_English/dev.conllu | less -R
Now use the regex search built into less: type /discourse and press Enter, then press n to jump to the next occurrence.

Option C: extract all word forms and UPOS tags of nodes annotated with the discourse deprel in UD_English/dev.conllu. Hints: use udapy util.Eval node='PYTHON_CODE' and replace PYTHON_CODE with code that uses node.deprel, node.form and node.upos. The standard Unix way of frequency analysis is sort | uniq -c | sort -rn.

Solution
udapy util.Eval node='if node.deprel == "discourse": print(node.form, node.upos)' < UD_English/dev.conllu > disc.txt
cat disc.txt | sort | uniq -c | sort -rn | less
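
For comparison, the same frequency analysis can be sketched in plain Python without Udapi. This assumes the standard CoNLL-U layout of 10 tab-separated columns, where FORM is column 2, UPOS is column 4 and DEPREL is column 8. The function name discourse_counts is made up for this example.

```python
from collections import Counter


def discourse_counts(conllu_text, deprel="discourse"):
    """Count (form, upos) pairs of words with the given deprel, most frequent first."""
    counts = Counter()
    for line in conllu_text.splitlines():
        if not line or line.startswith("#"):
            continue  # skip sentence separators and comment lines
        cols = line.split("\t")
        # cols[1] = FORM, cols[3] = UPOS, cols[7] = DEPREL
        if len(cols) == 10 and cols[7] == deprel:
            counts[(cols[1], cols[3])] += 1
    # plays the role of sort | uniq -c | sort -rn
    return counts.most_common()
```

Calling discourse_counts(open("UD_English/dev.conllu", encoding="utf-8").read()) returns a list of ((form, upos), count) pairs, which can be compared with the output of the shell pipeline above.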