The URDU.KON-TB Treebank

This project covers the development of the URDU.KON-TB treebank for the South Asian language Urdu, together with its annotation scheme, evaluation and annotation guidelines. A reliable treebank and parser will have a significant impact on the state of the art in automatic Urdu language processing. The project started in 2011 and is still in progress. Its by-products to date include a semi-semantic part-of-speech (POS) tagset, a semi-semantic syntactic tagset, a functional tagset, annotation guidelines, a grammar that encodes sufficient information for parsing the morphologically rich language Urdu, and a POS-tagged corpus that can be used to train POS taggers. These resources will be enhanced further and can be used for natural language processing tasks such as probabilistic parsing, training of POS taggers, disambiguation of spoken sentences, grammar development and language identification, as well as a source for linguistic inquiry, psychological modeling, pattern matching, etc. The resources developed to date have been published and will be made available in the respective sections below.
Corpus Collection
The raw corpus used for the URDU.KON-TB treebank contains 1400 sentences collected from the Urdu Wikipedia and the Jang newspaper. It covers local and international news, social stories, sports, culture, finance, religion, travel, etc. An ongoing effort will add 600 sentences, increasing the size of the corpus from 1400 to 2000. Corpus updates will be provided soon.
Annotation Scheme
The hierarchical annotation scheme combines phrase structure with hyper dependency structure. A semi-semantic part-of-speech tagset, a semi-semantic syntactic tagset and a functional tagset were designed and then further revised during the annotation of the raw corpus. The sentences were annotated manually. Because it includes morphological, part-of-speech, syntactic, semantic, clausal, grammatical and miscellaneous features, the annotation scheme is linguistically rich. This annotation resulted in the URDU.KON-TB treebank. The published work regarding the annotation scheme is as follows:

Annotation Evaluation
Krippendorff’s α coefficient was selected to evaluate the annotation scheme. It is a statistical measure of inter-annotator agreement. One hundred sentences were randomly selected from the URDU.KON-TB treebank and given to five trained annotators for annotation. The annotated sentences were then evaluated using Krippendorff’s α. The α values obtained for part-of-speech, syntactic and functional annotation were 0.964, 0.817 and 0.806, respectively. All three values lie in the range of perfect agreement. The published work regarding annotation evaluation will be provided here soon.
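For readers unfamiliar with the measure, the following is a minimal sketch of how Krippendorff’s α can be computed for categorical labels. It assumes the nominal variant of the coefficient, α = 1 − D_o / D_e, where D_o is the observed and D_e the expected disagreement; the page does not state which variant or which software the project used. The function name and the tags "NN" and "VB" are hypothetical and are not taken from the URDU.KON-TB tagsets.

```python
from collections import Counter
from itertools import permutations


def krippendorff_alpha_nominal(units):
    """Krippendorff's alpha for nominal labels (illustrative sketch).

    units: list of items; each item is the list of labels assigned to it
    by the annotators, with None marking a missing annotation.
    """
    # Coincidence matrix: every ordered pair of labels given to the same
    # item by two different annotators, weighted by 1 / (m - 1), where m
    # is the number of labels available for that item.
    coincidences = Counter()
    for labels in units:
        values = [v for v in labels if v is not None]
        m = len(values)
        if m < 2:
            continue  # items with fewer than two labels carry no pairs
        for a, b in permutations(values, 2):
            coincidences[(a, b)] += 1.0 / (m - 1)

    # Marginal totals per label and the total number of pairable values.
    totals = Counter()
    for (a, _b), weight in coincidences.items():
        totals[a] += weight
    n = sum(totals.values())
    if n <= 1:
        raise ValueError("not enough pairable annotations")

    # Nominal distance: 0 for identical labels, 1 for different labels.
    observed = sum(w for (a, b), w in coincidences.items() if a != b)
    expected = sum(
        totals[a] * totals[b] for a in totals for b in totals if a != b
    ) / (n - 1)
    if expected == 0:
        raise ValueError("alpha is undefined when only one label occurs")
    return 1.0 - observed / expected


# Toy data: three tokens, two annotators, hypothetical tags.
toy = [["NN", "NN"], ["VB", "VB"], ["NN", "VB"]]
print(krippendorff_alpha_nominal(toy))
```

In a setting like the one described above, each item would be one annotated token or constituent and each inner list would hold the five annotators' labels for it. A value of 1 indicates complete agreement and a value near 0 indicates agreement no better than chance.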
Annotation Guidelines
The annotation guidelines devised in the development of the URDU.KON-TB treebank were revised during and after the annotation evaluation. The updated version will be provided here soon.