Unsupervised Morpheme Analysis -- Morpho Challenge 2008

This is a page of the previous Morpho Challenge 2008. The current challenge is Morpho Challenge 2009.

Results

The full evaluation reports and the descriptions of the participating methods have been published at the Workshop.

Competition 1

The segmentation with the highest F-measure is the best. The winner is selected separately for each language.
These results are preliminary. Download the report manuscript on the final results from here.

Arabic
English
Finnish
German
Turkish

Competition 2

In Competition 2, the organizers applied the analyses provided by the participants in information retrieval experiments. The words in the queries and source documents were replaced by the corresponding morpheme analyses provided by the participants, and the search was then based on morphemes instead of words. The evaluation was perfomed using a state-of-the-art retrieval method (the latest version of the freely available LEMUR toolkit. The evaluation criterion was Uninterpolated Average Precision. The evaluation was performed only for the parcipitants that submitted the segmentation for the extended wordlists gathered from the IR evaluation corpora. Okapi (BM25) term weighting was used for all index terms excluding an automatic stoplist consisting of terms that have collection frequency higher than 75000 (Finnish) or 150000 (German and English). The stoplist was used with the Okapi weighting, because it did not perform well with the indexes that had many very common terms.

Download the report manuscript on the final results from here.

Finnish
German
English

Proceedings

The proceedings are available in CLEF working notes.
The manuscripts and abstracts can be downloaded from here.

Reference methods

To study how the different morpheme analysis performed in the IR tasks, we attempted the same tasks with different reference methods. This also revealed us whether the unsupervised morpheme analysis (or even a supervised one) could really be useful in the IR tasks compared to simple word based indexing.

Morfessor Categories-Map: The same Morfessor Categories-Map as described in Competition 1 was used for the unsupervised morpheme analysis. The stem vs. suffix tags were kept, but did not receive any special treatment in the indexing as we wanted to keep the IR evaluation as unsupervised as possible.
Morfessor Baseline: All the words were simply split into smaller pieces without any morpheme analysis. This means that the obtained subword units were directly used as index terms as such. This was performed using the Morfessor Baseline algorithm as in Morpho Challenge 2005. We expected that this would not be optimal for IR, but because the unsupervised morpheme analysis is such a difficult task, this simple method would probably do quite well.
dummy: No words were split nor any morpheme analysis provided except hyphens were replaced by spaces so that hyphenated words were indexed as separate words (changed from last year). This means words were directly used as index terms as such without any stemming or tags. We expected that although the morpheme analysis should provide helpful information for IR, all the submissions would not probably be able to beat this brute force baseline. However, if some morpheme analysis method would consistently beat this baseline in all languages and task, it would mean that the method were probably useful in a language and task independent way.
grammatical: The words were analysed using the gold standard in each language that were utilised as the "ground truth" in the Competition 1. Besides the stems and suffixes, the gold standard analyses typically consist of all kinds of grammatical tags which we decided to simply include as index terms, as well. For many words the gold standard analyses included several alternative interpretations that were all included in the indexing. However, we decided to also try the method adopted in the morpheme segmentation for Morpho Challenge 2005 that only the first interpretation of each word is applied. This was here called "grammatical first" whereas the default was called "grammatical all". Words that were not in the gold standard segmentation were indexed as such. Because our gold standards are quite small, 60k (English) - 600k (Finnish), compared to the amount of words that the unsupervised methods can analyse, we did not expect ``grammatical'' to perform particularly well, even though it would probably capture some useful indexing features to beat the "dummy" method, at least.
snowball: No real morpheme analysis was performed, but the words were stemmed by stemming algorithms provided by snowball libstemmer library. Porter stemming algoritm was used for English. Finnish and German stemmers were used for the other languages. Hyphenated words were first split to parts that were then stemmed separately. Stemming is expected to perform very well for English but not necessarily for the other languages because it is harder to find good stems.
TWOL: Two-level morphological analyzer was used to find the normalized forms of the words. These forms were then used as index terms. Some words may have several alternative normalized forms and two cases were studied similarly to the grammatical case. Either all alternatives were used ("all") or only the first one ("first"). Compound words were split to parts. Words not recognized by the analyzer were indexed as such. German analyzer was not available.