Results
This page contains only the result tables from the latest evaluation runs.
The full evaluation report was published at the Workshop.
Competition 1
The segmentation with the highest F-measure is the winner, and the winner is selected separately for each language.
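The F-measure here is the harmonic mean of the precision and recall of the proposed analyses. A minimal sketch of the computation (the function name and example numbers are ours, not from the evaluation scripts):

    def f_measure(precision: float, recall: float) -> float:
        """Harmonic mean of precision and recall; 0.0 if both are zero."""
        if precision + recall == 0.0:
            return 0.0
        return 2.0 * precision * recall / (precision + recall)

    # Example: precision 0.76 and recall 0.62 give F ~ 0.683.
    print(f_measure(0.76, 0.62))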
Download the report on the final results from here.
Finnish
Turkish
German
English
Competition 2
In Competition 2, the organizers applied the analyses provided by the participants in information retrieval experiments.
The words in the queries and source documents were replaced by the corresponding morpheme analyses provided by the participants, and the search was then based on morphemes instead of words.
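A minimal sketch of this replacement step, assuming the analyses are available as a mapping from each word to its list of morphemes (the data layout and names below are our own illustration):

    analyses = {
        "houses": ["house", "+PL"],
        "rebuilt": ["re", "build", "+PAST"],
    }

    def to_morpheme_terms(text, analyses):
        """Replace each word by its morphemes; keep unanalysed words as-is."""
        terms = []
        for word in text.lower().split():
            terms.extend(analyses.get(word, [word]))
        return terms

    print(to_morpheme_terms("Houses rebuilt", analyses))
    # ['house', '+PL', 're', 'build', '+PAST']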
The evaluation was performed using a state-of-the-art retrieval method (the latest version of the freely available LEMUR toolkit).
The evaluation criterion was Uninterpolated Average Precision.
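Uninterpolated Average Precision averages the precision measured at the rank of each relevant retrieved document, divided by the total number of relevant documents. A minimal sketch of the textbook definition (not the organizers' evaluation script):

    def average_precision(ranked, relevant):
        """Uninterpolated AP for one query."""
        hits, precision_sum = 0, 0.0
        for rank, doc in enumerate(ranked, start=1):
            if doc in relevant:
                hits += 1
                precision_sum += hits / rank
        return precision_sum / len(relevant) if relevant else 0.0

    # Relevant documents at ranks 1 and 3, two relevant in total:
    # AP = (1/1 + 2/3) / 2 ~ 0.83
    print(average_precision(["d1", "d5", "d2"], {"d1", "d2"}))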
There were several categories in Competition 2, and the winner with the highest Average Precision was selected separately for each language and each category:
- All morpheme analyses from the training data are used as index terms (``withoutnew'') vs. additionally using the morpheme analyses for new words that occurred in the IR data but not in the training data (``withnew'').
- Tfidf (BM25) term weighting was used for all index terms without any stoplist vs. Okapi (BM25) term weighting for all index terms excluding an automatic stoplist consisting of the 75,000 (Finnish) or 150,000 (German and English) most common terms. The stoplist was used with the Okapi weighting because Okapi did not perform well with indexes that contained many very common terms.
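A minimal sketch of how such a frequency-based stoplist could be built (the cut-off values come from the text above; everything else is our own illustration):

    from collections import Counter

    def build_stoplist(terms, n_most_common):
        """Collect the n most frequent index terms into a stoplist."""
        return {t for t, _ in Counter(terms).most_common(n_most_common)}

    # Cut-offs from the text: 75,000 terms (Finnish),
    # 150,000 terms (German and English).
    toy_terms = ["the", "the", "a", "house", "the", "a"]
    print(build_stoplist(toy_terms, 2))  # {'the', 'a'} (set order may vary)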
Download the report on the final results from here.
Finnish
German
English
Proceedings
The proceedings are available in the CLEF working notes.
The manuscripts and abstracts can be downloaded from here.
Reference methods
To study how the different morpheme analyses performed in the IR tasks, we attempted the same tasks with several reference methods. This also revealed whether unsupervised morpheme analysis (or even a supervised one) could really be useful in the IR tasks compared to simple word-based indexing.
- Morfessor Categories-MAP: The same Morfessor Categories-MAP as described in Competition 1 was used for the unsupervised morpheme analysis.
The stem vs. suffix tags were kept, but they did not receive any special treatment in the indexing, as we wanted to keep the IR evaluation as unsupervised as possible.
- Morfessor Baseline: All the words were simply split into smaller pieces without any morpheme analysis.
This means that the obtained subword units were used directly as index terms.
This was performed using the Morfessor Baseline algorithm as in Morpho Challenge 2005.
We expected that this would not be optimal for IR, but because unsupervised morpheme analysis is such a difficult task, this simple method would probably do quite well.
- dummy: No words were split and no morpheme analysis was provided.
This means that all words were used directly as index terms, without any stemming or tags.
We expected that although morpheme analysis should provide helpful information for IR, not all submissions would be able to beat this brute-force baseline.
However, if some morpheme analysis method consistently beat this baseline in all languages and tasks, the method would probably be useful in a language- and task-independent way.
- grammatical: The words were analysed using the gold standard of each language that was utilised as the ``ground truth'' in Competition 1.
Besides the stems and suffixes, the gold standard analyses typically contain all kinds of grammatical tags, which we decided to simply include as index terms as well.
For many words the gold standard analyses included several alternative interpretations, all of which were included in the indexing.
However, we also decided to try the method adopted in the morpheme segmentation for Morpho Challenge 2005, in which only the first interpretation of each word is applied.
This variant is called ``grammatical first'', whereas the default is called ``grammatical all'' (both are illustrated in the sketch after this list).
Because our gold standards are quite small, 60k words (English) to 600k words (Finnish), compared to the number of words that the unsupervised methods can analyse, we did not expect ``grammatical'' to perform particularly well, even though it should capture enough useful indexing features to beat ``dummy'', at least.
- Porter: No real morpheme analysis was performed, but the words were stemmed by the Porter stemmer, an option provided by the Lemur toolkit (see also the sketch after this list).
Because this is quite a standard procedure in IR, especially for English text material, we expected it to provide the best results, at least for English.
For the other languages the default Porter stemming was not likely to perform very well.
- Tepper: A hybrid method developed by Michael Tepper was used to improve the morpheme analysis reference obtained by our Morfessor Categories-MAP.
Based on the performance obtained in Competition 1, we expected that this could provide some interesting results here as well.
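To make the reference methods concrete, the following sketch derives the index terms that ``dummy'', ``grammatical first''/``grammatical all'' and Porter would produce for one word. The gold-standard entry format is our own assumption, and we substitute NLTK's PorterStemmer for the Porter option of the Lemur toolkit:

    from nltk.stem import PorterStemmer

    # Hypothetical gold-standard entry: alternative interpretations,
    # each a list of morphemes and grammatical tags.
    gold = {"walks": [["walk_V", "+3SG"], ["walk_N", "+PL"]]}

    def dummy_terms(word):
        """dummy: the word itself, no stemming or tags."""
        return [word]

    def grammatical_all_terms(word):
        """grammatical all: index every alternative interpretation."""
        return [m for interp in gold.get(word, [[word]]) for m in interp]

    def grammatical_first_terms(word):
        """grammatical first: index only the first interpretation."""
        return list(gold.get(word, [[word]])[0])

    def porter_terms(word):
        """Porter: the stemmed word (NLTK stands in for Lemur's option)."""
        return [PorterStemmer().stem(word)]

    for f in (dummy_terms, grammatical_all_terms,
              grammatical_first_terms, porter_terms):
        print(f.__name__, f("walks"))
    # dummy_terms ['walks']
    # grammatical_all_terms ['walk_V', '+3SG', 'walk_N', '+PL']
    # grammatical_first_terms ['walk_V', '+3SG']
    # porter_terms ['walk']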