Laboratory of Computer and Information Science / Neural Networks Research Centre CIS Lab Helsinki University of Technology

IR Reference methods

To study how the different morpheme analyses performed in the IR tasks, we ran the same tasks with several reference methods. This also revealed whether unsupervised morpheme analysis (or even a supervised one) could really be useful in the IR tasks compared to simple word-based indexing.

  1. Morfessor Categories-MAP: The same Morfessor Categories-MAP as described in Competition 1 was used for the unsupervised morpheme analysis. The stem vs. suffix tags were kept, but they did not receive any special treatment in the indexing, as we wanted to keep the IR evaluation as unsupervised as possible.
  2. Morfessor Baseline: All the words were simply split into smaller pieces without any morpheme analysis, so the obtained subword units were used directly as index terms. The splitting was performed using the Morfessor Baseline algorithm, as in Morpho Challenge 2005. We expected that this would not be optimal for IR, but because unsupervised morpheme analysis is such a difficult task, this simple method would probably do quite well.
  3. dummy: No words were split and no morpheme analysis was provided, except that hyphens were replaced by spaces so that the parts of hyphenated words were indexed as separate words (a change from last year). Words were thus used directly as index terms, without any stemming or tags. We expected that although morpheme analysis should provide helpful information for IR, probably not all submissions would be able to beat this brute-force baseline. However, if some morpheme analysis method consistently beat this baseline in all languages and tasks, it would mean that the method is probably useful in a language- and task-independent way.
  4. grammatical: The words were analysed using the gold standard analyses in each language, which were used as the "ground truth" in Competition 1. Besides the stems and suffixes, the gold standard analyses typically contain all kinds of grammatical tags, which we decided simply to include as index terms as well. For many words the gold standard analyses included several alternative interpretations, all of which were included in the indexing. However, we also tried the approach adopted in the morpheme segmentation for Morpho Challenge 2005, in which only the first interpretation of each word is applied. This variant is called "grammatical first" here, whereas the default is called "grammatical all". Words that were not in the gold standard segmentation were indexed as such. Because our gold standards are quite small, 60k (English) to 600k (Finnish), compared to the number of words that the unsupervised methods can analyse, we did not expect "grammatical" to perform particularly well, even though it would probably capture enough useful indexing features to beat at least the "dummy" method.
  5. snowball: No real morpheme analysis was performed, but the words were stemmed using the stemming algorithms provided by the Snowball libstemmer library. The Porter stemming algorithm was used for English, and the Finnish and German stemmers were used for the other languages. Hyphenated words were first split into parts, which were then stemmed separately. Stemming was expected to perform very well for English but not necessarily for the other languages, because good stems are harder to find in them.
  6. TWOL: A two-level morphological analyzer was used to find the normalized forms of the words, and these forms were then used as index terms. Some words may have several alternative normalized forms, so two cases were studied, as for the grammatical method: either all alternatives were used ("all") or only the first one ("first"). Compound words were split into parts. Words not recognized by the analyzer were indexed as such.
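
As an illustration only, the index-term generation for the "dummy" method and for the "all" vs. "first" variants of the lookup-based methods (grammatical, TWOL) could be sketched as follows. This is a minimal sketch and not the code used in the evaluation: the function names, the whitespace tokenisation, and the toy analysis table with its tag labels are assumptions made for the example.

```python
def dummy_terms(text):
    """'dummy' method: replace hyphens by spaces, then index words as such."""
    return text.replace("-", " ").split()


def analysed_terms(text, analyses, mode="all"):
    """Lookup-based indexing in the style of 'grammatical' or TWOL.

    `analyses` is a hypothetical table mapping a word to a list of
    alternative analyses, each of which is a list of index terms.
    Words missing from the table are indexed as such.
    """
    terms = []
    for word in text.split():  # assumption: plain whitespace tokenisation
        alternatives = analyses.get(word)
        if not alternatives:
            terms.append(word)              # unknown word: keep as is
        elif mode == "first":
            terms.extend(alternatives[0])   # only the first interpretation
        else:
            for alternative in alternatives:
                terms.extend(alternative)   # every interpretation
    return terms


# Toy table with made-up labels, not real gold standard data.
toy_analyses = {"boots": [["boot_N", "+PL"], ["boot_V", "+3SG"]]}

print(dummy_terms("state-of-the-art retrieval"))
print(analysed_terms("boots walk", toy_analyses, mode="all"))
print(analysed_terms("boots walk", toy_analyses, mode="first"))
```

In "all" mode every alternative contributes its terms to the index, while "first" keeps only the first listed interpretation; in both modes a word without an entry falls back to the plain word, as in the "dummy" method.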
