Unsupervised Morpheme Analysis -- Morpho Challenge 2009

This page belongs to the previous challenge, Morpho Challenge 2009. The current challenge is Morpho Challenge 2010.

Results

The full evaluation reports and the descriptions of the participating methods have been published at the Workshop.

Competition 1 - Comparison to Linguistic Morphemes

The segmentation with the highest F-measure is the best. The winner is selected separately for each language.
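
As a reminder of how the ranking score combines its two components, the balanced F-measure is the harmonic mean of precision and recall. The sketch below shows only that final step; the function name is illustrative, and how precision and recall themselves are computed against the linguistic reference is described in the evaluation reports, not here.

```python
def f_measure(precision, recall):
    """Balanced F-measure: the harmonic mean of precision and recall."""
    if precision + recall == 0.0:
        # Avoid division by zero when both components are zero.
        return 0.0
    return 2.0 * precision * recall / (precision + recall)


if __name__ == "__main__":
    # A method with high recall but low precision is penalised
    # more strongly than by an arithmetic mean.
    print(f_measure(0.9, 0.3))
```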

Arabic (non-vowelized)
Arabic (vowelized)
English
Finnish
German
Turkish

The reference methods are the Morfessor Baseline and Morfessor Categories-MAP algorithms (see Creutz and Lagus, 2007) and "letters", which simply segments each word into the letters it consists of. The letters method gives the best recall available for any method that is based solely on segmentation.
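
The letters reference method is simple enough to sketch in a few lines. The function name below is illustrative and not part of the challenge tooling.

```python
def segment_to_letters(word):
    """Split a word into its individual letters, one segment per letter."""
    return list(word)


if __name__ == "__main__":
    for word in ["talo", "house"]:
        # Print the word followed by its letter-level segmentation.
        print(word, " ".join(segment_to_letters(word)))
```

Because every possible segment boundary is proposed, no boundary in the reference can be missed, which is why recall is maximal for a purely segmentation-based method.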

Competition 2 - Information Retrieval

In Competition 2, the morpheme analyses were compared by using them in an Information Retrieval (IR) task with three languages: English, German and Finnish. The experiments were performed by replacing the words in the corpus and the queries by the submitted morpheme analyses. The evaluation criterion was Mean Average Precision.
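
For readers unfamiliar with the evaluation criterion, the following is a minimal sketch of Mean Average Precision in its usual (non-interpolated) form. The function names and the input layout are assumptions made for illustration; the official scores were produced by the organisers' IR setup, not by this code.

```python
def average_precision(ranked_ids, relevant_ids):
    """Average of the precision values at each rank where a relevant document appears."""
    relevant = set(relevant_ids)
    if not relevant:
        return 0.0
    hits = 0
    precision_sum = 0.0
    for rank, doc_id in enumerate(ranked_ids, start=1):
        if doc_id in relevant:
            hits += 1
            precision_sum += hits / rank
    # Relevant documents that were never retrieved contribute zero.
    return precision_sum / len(relevant)


def mean_average_precision(runs):
    """Mean of average precision over queries; runs is a list of (ranked_ids, relevant_ids)."""
    if not runs:
        return 0.0
    return sum(average_precision(r, rel) for r, rel in runs) / len(runs)
```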

If the participant did not submit segmentations for the Competition 2 wordlist, the evaluation was performed with the smaller Competition 1 lists. The additional words in the IR task were then indexed as such without analysis for those participants.
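
The replacement step and its fallback can be sketched as follows. This is an illustration only: the function name and the dictionary layout of the submitted analyses are assumptions, not the format used in the challenge.

```python
def to_index_terms(tokens, analyses):
    """Replace each token by its submitted morphemes; keep unanalysed tokens as such."""
    terms = []
    for token in tokens:
        # Words missing from the submitted word list are indexed unchanged.
        terms.extend(analyses.get(token, [token]))
    return terms


if __name__ == "__main__":
    # Hypothetical analyses for a two-word vocabulary.
    analyses = {"talossa": ["talo", "ssa"], "kissat": ["kissa", "t"]}
    print(to_index_terms(["kissat", "ovat", "talossa"], analyses))
```

The same function would be applied to both the documents and the queries, so that they share one index vocabulary.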

The IR experiments were performed using the freely available LEMUR toolkit version 4.4. The popular Okapi BM25 ranking function was used. Okapi BM25 does not perform well with indexes that have many very common terms, so an automatic stoplist was used to overcome this. Any term with a collection frequency higher than 75000 (Finnish) or 150000 (German and English) was added to the stoplist and thus excluded from the corpus.
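
The automatic stoplist can be sketched as a simple frequency cut-off. The thresholds come from the text above; the function names and the representation of the corpus as lists of terms are assumptions, and the actual filtering was done within the LEMUR-based setup rather than with this code.

```python
from collections import Counter

# Collection frequency thresholds stated for the challenge.
THRESHOLDS = {"finnish": 75000, "german": 150000, "english": 150000}


def build_stoplist(documents, threshold):
    """Return the set of terms whose collection frequency exceeds the threshold."""
    counts = Counter(term for doc in documents for term in doc)
    return {term for term, freq in counts.items() if freq > threshold}


def remove_stopped_terms(documents, stoplist):
    """Drop every stoplisted term from each document."""
    return [[term for term in doc if term not in stoplist] for doc in documents]
```

Collection frequency here counts every occurrence of a term in the corpus, not the number of documents that contain it.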

The results:

Finnish
German
English

Competition 3 - Statistical Machine Translation

In Competition 3, the morpheme analyses were compared by using them in two machine translation (MT) tasks: German to English and Finnish to English. The experiments were performed by replacing the words in the source language side of the parallel corpus by the submitted morpheme analyses. The target language side (English) was not modified. The final translations were obtained by Minimum Bayes Risk (MBR) combination with a standard word-based translation model. The performance was measured with BLEU scores.
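
To give a rough idea of what MBR selection does, the sketch below picks, from a pool of candidate translations, the one with the highest expected similarity to all candidates. It is a generic illustration under loud assumptions: the similarity function is a unigram F1 standing in for a sentence-level BLEU-like gain, the candidate probabilities are hypothetical, and none of this reproduces the system actually used in the challenge.

```python
from collections import Counter


def unigram_f1(hypothesis, reference):
    """Unigram overlap F1; an assumed stand-in for a BLEU-like gain function."""
    hyp, ref = Counter(hypothesis.split()), Counter(reference.split())
    overlap = sum((hyp & ref).values())
    if overlap == 0:
        return 0.0
    precision = overlap / sum(hyp.values())
    recall = overlap / sum(ref.values())
    return 2.0 * precision * recall / (precision + recall)


def mbr_select(candidates):
    """Pick the candidate with the highest expected gain.

    candidates is a list of (translation, probability) pairs, for example
    pooled from a morpheme-based and a word-based model.
    """
    best, best_gain = None, -1.0
    for hypothesis, _ in candidates:
        gain = sum(prob * unigram_f1(hypothesis, other) for other, prob in candidates)
        if gain > best_gain:
            best, best_gain = hypothesis, gain
    return best
```

Note that the selected translation need not be the single most probable candidate; a hypothesis that agrees with many others can win instead.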

Finnish
German

Proceedings

The proceedings are available in CLEF working notes.
The manuscripts can be downloaded from here.

Reference methods

To study how the different morpheme analyses performed in the IR tasks, we attempted the same tasks with different reference methods. This also showed us whether unsupervised morpheme analysis (or even a supervised one) could really be useful in the IR tasks compared to simple word-based indexing.

Morfessor Categories-MAP: The same Morfessor Categories-MAP as described in Competition 1 was used for the unsupervised morpheme analysis. The stem vs. suffix tags were kept, but did not receive any special treatment in the indexing as we wanted to keep the IR evaluation as unsupervised as possible.
Morfessor Baseline: All the words were simply split into smaller pieces without any morpheme analysis. This means that the obtained subword units were directly used as index terms as such. This was performed using the Morfessor Baseline algorithm as in Morpho Challenge 2005. We expected that this would not be optimal for IR, but because the unsupervised morpheme analysis is such a difficult task, this simple method would probably do quite well.
dummy: No words were split and no morpheme analysis was provided, except that hyphens were replaced by spaces so that hyphenated words were indexed as separate words (changed from last year). This means that words were used directly as index terms without any stemming or tags. We expected that although morpheme analysis should provide helpful information for IR, not all the submissions would be able to beat this brute force baseline. However, if some morpheme analysis method consistently beat this baseline in all languages and tasks, it would mean that the method was probably useful in a language- and task-independent way.
grammatical: The words were analysed using the gold standard analyses in each language that were utilised as the "ground truth" in Competition 1. Besides the stems and suffixes, the gold standard analyses typically contain all kinds of grammatical tags, which we decided to simply include as index terms as well. For many words the gold standard analyses included several alternative interpretations, all of which were included in the indexing. However, we also tried the method adopted in the morpheme segmentation for Morpho Challenge 2005, in which only the first interpretation of each word is applied. This was called "grammatical first", whereas the default was called "grammatical all". Words that were not in the gold standard segmentation were indexed as such. Because our gold standards are quite small, 60k (English) - 600k (Finnish), compared to the number of words that the unsupervised methods can analyse, we did not expect "grammatical" to perform particularly well, even though it would probably capture enough useful indexing features to beat the "dummy" method, at least.
snowball: No real morpheme analysis was performed, but the words were stemmed by the stemming algorithms provided by the Snowball libstemmer library. The Porter stemming algorithm was used for English. The Finnish and German stemmers were used for the other languages. Hyphenated words were first split into parts that were then stemmed separately. Stemming is expected to perform very well for English but not necessarily for the other languages, because it is harder to find good stems for them.
TWOL: A two-level morphological analyzer was used to find the normalized forms of the words. These forms were then used as index terms. Some words may have several alternative normalized forms, and two cases were studied similarly to the grammatical case: either all alternatives were used ("all") or only the first one ("first"). Compound words were split into parts. Words not recognized by the analyzer were indexed as such. A German analyzer was not available.
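
Two of the reference methods above are easy to illustrate: the "dummy" hyphen handling and the "all" versus "first" treatment of alternative analyses used by "grammatical" and TWOL. The sketch below is an assumption-laden illustration; the function names, the dictionary layout and the tag labels in the example are hypothetical and do not reflect the actual gold standard format.

```python
def dummy_terms(text):
    """Index words as such, after replacing hyphens by spaces."""
    return text.replace("-", " ").split()


def analysed_terms(word, analyses, mode="all"):
    """Return index terms from alternative analyses of a word.

    analyses maps a word to a list of alternative analyses, each a list of
    terms. mode "all" uses every alternative, "first" only the first one.
    Words without an analysis are indexed as such.
    """
    alternatives = analyses.get(word)
    if not alternatives:
        return [word]
    selected = alternatives if mode == "all" else alternatives[:1]
    return [term for analysis in selected for term in analysis]


if __name__ == "__main__":
    # Hypothetical analyses with made-up tag labels.
    gold = {"saw": [["saw_N"], ["see_V", "+PAST"]]}
    print(dummy_terms("state-of-the-art method"))
    print(analysed_terms("saw", gold, mode="all"))
    print(analysed_terms("saw", gold, mode="first"))
```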