Unsupervised Morpheme Analysis -- Morpho Challenge 2009

Competition 1

In Competition 1, for each language, the morpheme analyses proposed by the participants' algorithm will be compared against a linguistic gold standard. Since the task involves unsupervised learning, the algorithm cannot be expected to come up with morpheme labels that exactly correspond to the ones designed by linguists. That is, no direct comparison will take place between the labels as such (the labels in the proposed analyses vs. the labels in the gold standard). What can be expected, however, is that two word forms that contain the same morpheme according to the participants' algorithm also have a morpheme in common according to the gold standard. For instance, in the English gold standard, the words "foot" and "feet" both contain the morpheme "foot_N". It is thus desirable that the participants' algorithm also discovers a morpheme that occurs in both of these word forms (be it called "FOOT", "morpheme784", "foot" or something else).

In practice, the evaluation will take place by sampling a large number of word pairs, such that both words in the pair have at least one morpheme in common. As the evaluation measure, we will use F-measure, which is the harmonic mean of Precision and Recall:

F-measure = 2/(1/Precision + 1/Recall).
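
As a quick check of the harmonic-mean definition, the following Python sketch computes the F-measure from precision and recall given as fractions. The function name f_measure is an illustrative choice and not part of the evaluation scripts; the input values are the all-morpheme precision and recall from the sample output shown further down this page.

```python
def f_measure(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall (both fractions in [0, 1])."""
    if precision <= 0.0 or recall <= 0.0:
        # Assumption: the score is taken to be zero when either part is zero.
        return 0.0
    return 2.0 / (1.0 / precision + 1.0 / recall)


if __name__ == "__main__":
    # Precision 69.00% and recall 25.59% from the sample output.
    print(f"{100 * f_measure(0.6900, 0.2559):.2f}%")  # prints 37.33%
```

The result agrees with the F-measure of 37.33% reported in the sample output.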

Precision is calculated as follows: A number of word forms will be randomly sampled from the result file provided by the participants; for each morpheme in these words, another word containing the same morpheme will be chosen at random from the result file (if such a word exists). We thus obtain a number of word pairs such that in each pair at least one morpheme is shared between the words in the pair. These pairs will be compared to the gold standard; a point is given for each word pair that really has a morpheme in common according to the gold standard. The maximum number of points for one sampled word is normalized to one. The total number of points is then divided by the total number of sampled words.

For instance, assume that the proposed analysis of the English word "abyss" is: "abys +s". Two word pairs are formed: Say that "abyss" happens to share the morpheme "abys" with the word "abysses"; we thus obtain the word pair "abyss - abysses". Also assume that "abyss" shares the morpheme "+s" with the word "mountains"; this produces the pair "abyss - mountains". Now, according to the gold standard the correct analyses of these words are: "abyss_N", "abyss_N +PL", "mountain_N +PL", respectively. The pair "abyss - abysses" is correct (common morpheme: "abyss_N"), but the pair "abyss - mountains" is incorrect (no morpheme in common). Precision for the word "abyss" is thus 1/2 = 50%.
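
The "abyss" example can be reproduced with a minimal Python sketch. It is a simplification that assumes one analysis per word and equal weight for every pair; the function name points_for_word and the dictionary layout are hypothetical and do not reflect how eval_morphemes_v2.pl is implemented.

```python
from typing import Dict, List, Set, Tuple


def points_for_word(pairs: List[Tuple[str, str]],
                    reference: Dict[str, Set[str]]) -> float:
    """Fraction of sampled pairs whose two words share a reference morpheme.

    Simplification: a single analysis per word, equal weight per pair.
    """
    if not pairs:
        return 0.0
    correct = sum(1 for a, b in pairs if reference[a] & reference[b])
    return correct / len(pairs)


# Gold standard analyses from the example above.
gold = {
    "abyss": {"abyss_N"},
    "abysses": {"abyss_N", "+PL"},
    "mountains": {"mountain_N", "+PL"},
}
# Pairs sampled from the proposed analysis "abys +s".
pairs = [("abyss", "abysses"), ("abyss", "mountains")]

print(points_for_word(pairs, gold))  # prints 0.5
```

Swapping the roles (pairs sampled from the gold standard, reference taken from the proposed analyses) gives the corresponding per-word score for recall.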

Recall is calculated analogously to precision: A number of word forms are randomly sampled from the gold standard file; for each morpheme in these words, another word containing the same morpheme will be chosen at random from the gold standard (if such a word exists). The word pairs are then compared to the analyses provided by the participants; a point is given for each sampled word pair that also has a morpheme in common in the analyses proposed by the participants' algorithm. The points per word are normalized to one and the total number of points is divided by the total number of words.

For words that have several alternative analyses, as well as for word pairs that have more than one morpheme in common, normalization of the points is carried out. In short, an equal weight is given for each alternative analysis, as well as each word pair in an analysis. E.g., if a word has three alternative analyses, the first analysis has four morphemes, and the first word pair in that analysis has two morphemes in common, each of the two common morphemes will amount to 1/3*1/4*1/2=1/24 of the one point available for that word.
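
The arithmetic of this normalization can be written out with exact fractions. The sketch below only restates the example above; the function and parameter names are illustrative assumptions.

```python
from fractions import Fraction


def morpheme_weight(n_analyses: int, n_morphemes: int, n_common: int) -> Fraction:
    """Share of a word's single point carried by one common morpheme.

    n_analyses:  alternative analyses of the sampled word
    n_morphemes: morphemes in the analysis considered (one word pair each)
    n_common:    morphemes shared by the two words of the pair considered
    """
    return Fraction(1, n_analyses) * Fraction(1, n_morphemes) * Fraction(1, n_common)


print(morpheme_weight(3, 4, 2))  # prints 1/24
```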

Small sets of the gold standards used are available on the datasets page for semi-supervised learning and parameter validation. Neither the labeled training set nor the development test set contains any of the word forms used in the final test set.

Evaluation of a sample (development test set)

You can evaluate your morphological analyses against the available gold standards (separately for each test language). The program to use for this is the Perl script: eval_morphemes_v2.pl. The evaluation program is invoked as follows:

eval_morphemes_v2.pl [-trace] wordpairsfile_goldstd wordpairsfile_result goldstdfile resultfile

Four files are given as arguments to eval_morphemes_v2.pl:

wordpairsfile_goldstd: this is the "random word pairs file" available for download on the datasets page. This file is needed in the calculation of an estimate of the recall of the proposed morpheme analyses.
wordpairsfile_result: this file has to be generated using another program (see below). It is needed in the calculation of a rough estimate of the precision of the proposed morpheme analyses.
goldstdfile: this is the sample of the gold standard available for download on the datasets page. This file contains the correct morpheme analyses.
resultfile: this is the result file that your algorithm produces, i.e., a list of words and their proposed morpheme analyses. Make sure that the file format is correct and does not include any additional whitespace.

The -trace argument is optional and produces output for every evaluated word separately. Regardless of the status of the trace argument, the evaluation program produces output of the following kind:

PART0. Precision: 69.00% (96/139); non-affixes: 81.55% (51/63); affixes: 58.73% (45/76)
PART0. Recall:    25.59% (142/556); non-affixes: 49.78% (105/211); affixes: 10.78% (37/345)
PART0. F-measure: 37.33%; non-affixes: 61.82%; affixes: 18.22%
#
TOTAL. Precision: 69.00%; non-affixes: 81.55%; affixes: 58.73%
TOTAL. Recall:    25.59%; non-affixes: 49.78%; affixes: 10.78%
TOTAL. F-measure: 37.33%; non-affixes: 61.82%; affixes: 18.22%

Note that results are displayed for partition 0 (PART0) and for the entire data (TOTAL). The total scores are here the same as the scores of PART0, since there is only one partition. It is, however, possible to split the data into several partitions and compute results for each partition separately. The overall scores are then calculated as the mean over the partitions. Splitting into partitions is a feature reserved for the final evaluation, when we will assess the statistical significance of the differences between the participants' algorithms.

The figures that count in the final evaluation are the first precision, recall, and F-measure values on the TOTAL lines. These values pertain to all morphemes, but there are also separate statistics for morphemes classified as non-affixes vs. affixes. What counts as an affix is a morpheme with a label starting with a plus sign, e.g., "+PL", "+PAST". This naming convention is applied in the gold standard, which means that you do not have to do anything in order to get the non-affixes/affixes statistics right as far as recall is concerned. However, if you want the same kind of information also for precision, your algorithm must have a means of discovering which morphemes are likely affixes and tag these morphemes with an initial plus sign. Note that it is fully up to you whether you do this or not; it will not affect your position in the competition in any way.
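
If you want to collect the headline figures automatically, the first percentage on each TOTAL line can be picked out with a regular expression. This is a minimal sketch that assumes the output looks exactly like the sample above; the function name headline_scores is hypothetical.

```python
import re
from typing import Dict

# Captures the measure name and the first percentage on a TOTAL line.
TOTAL_LINE = re.compile(r"^TOTAL\.\s+(Precision|Recall|F-measure):\s+([0-9.]+)%")


def headline_scores(output: str) -> Dict[str, float]:
    """Return the all-morpheme scores (in percent) from the TOTAL lines."""
    scores: Dict[str, float] = {}
    for line in output.splitlines():
        match = TOTAL_LINE.match(line)
        if match:
            scores[match.group(1)] = float(match.group(2))
    return scores


sample_output = """\
PART0. Precision: 69.00% (96/139); non-affixes: 81.55% (51/63); affixes: 58.73% (45/76)
PART0. Recall:    25.59% (142/556); non-affixes: 49.78% (105/211); affixes: 10.78% (37/345)
PART0. F-measure: 37.33%; non-affixes: 61.82%; affixes: 18.22%
#
TOTAL. Precision: 69.00%; non-affixes: 81.55%; affixes: 58.73%
TOTAL. Recall:    25.59%; non-affixes: 49.78%; affixes: 10.78%
TOTAL. F-measure: 37.33%; non-affixes: 61.82%; affixes: 18.22%
"""

print(headline_scores(sample_output))
# prints {'Precision': 69.0, 'Recall': 25.59, 'F-measure': 37.33}
```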

Sampling word pairs for the calculation of an estimate of the precision

In order to get an estimate of the precision of the algorithm, you need to provide the evaluation script eval_morphemes_v2.pl with a file containing word pairs sampled from your result file. Unfortunately, the estimate is likely to be fairly rough. The reason for this is that you do not have the entire gold standard at your disposal. Thus, if you sample pairs of words that are not included in the gold standard labels that you can access, it is impossible to know whether the proposed morphemes are correct or not. What you can do, however, is to make sure that each word that goes into a word pair actually does occur in the gold standard sample. There is still a problem that your algorithm might not propose that many common morphemes for the words within this limited set, and thus the estimate will be based on rather few observations.

Anyway, this is how to do it: First, make a list of relevant words, that is, words that are present in the gold standard sample available:

cut -f1 goldstdfile > relevantwordsfile

Then sample word pairs for 300 words selected at random from your results file:

sample_word_pairs_v2.pl -n 300 -refwords relevantwordsfile < resultfile > wordpairsfile_result

You can use more than 300 words if you wish (the maximum is the number of words in the gold standard file), or sample several word pair files with different random initializations (-rand switch), but remember that the estimate of precision that you will obtain is biased in any case, due to the limited number of gold standard labels.

The necessary Perl program is sample_word_pairs_v2.pl. The output file wordpairsfile_result is used as input to eval_morphemes_v2.pl (see above).
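
To illustrate the idea of restricting the sampling to words found in the gold standard sample, here is a rough Python sketch. It is not a port of sample_word_pairs_v2.pl: the function name, the in-memory data layout, the single analysis per word and the returned (word, morpheme, partner) triples are all assumptions made for illustration, and the real script's output format may differ.

```python
import random
from typing import Dict, List, Set, Tuple


def sample_pairs(analyses: Dict[str, List[str]],
                 relevant: Set[str],
                 n_words: int,
                 seed: int = 0) -> List[Tuple[str, str, str]]:
    """Return (word, morpheme, partner) triples using only relevant words."""
    rng = random.Random(seed)
    # Keep only words that also occur in the gold standard sample.
    usable = sorted(w for w in analyses if w in relevant)
    # Index: morpheme -> usable words whose analysis contains it.
    index: Dict[str, List[str]] = {}
    for word in usable:
        for morpheme in set(analyses[word]):
            index.setdefault(morpheme, []).append(word)
    triples: List[Tuple[str, str, str]] = []
    for word in rng.sample(usable, min(n_words, len(usable))):
        for morpheme in analyses[word]:
            partners = [w for w in index[morpheme] if w != word]
            if partners:  # skip morphemes that no other usable word contains
                triples.append((word, morpheme, rng.choice(partners)))
    return triples


# Toy data: "zebra" is not in the gold standard sample and is never used.
analyses = {
    "abyss": ["abys", "+s"],
    "abysses": ["abys", "+s", "+es"],
    "mountains": ["mountain", "+s"],
    "zebra": ["zebra"],
}
relevant = {"abyss", "abysses", "mountains"}

for triple in sample_pairs(analyses, relevant, n_words=3):
    print(triple)
```

With so few usable words, many morphemes have no partner at all, which is the small-sample problem described above.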

Competition 2

Competition 2 does not necessarily require any extra effort by the participants. The organizers will use the analyses provided by the participants in information retrieval experiments. Data from CLEF will be used.
However, participants who wish to submit morpheme analyses for words in their actual context (Competition 2b) should contact the organizers for more information on how to register with CLEF to obtain the full texts.

In Competition 2 (and 2b), the words in the queries and documents will be replaced by the corresponding morpheme analyses provided by the participants. We will perform the IR evaluation using the state-of-the-art Okapi (BM25) retrieval method (the latest version of the freely available LEMUR toolkit). The most common morphemes in each participant's submission will be left out of the index. The size of this stoplist will be proportional to the amount of text data in each language and will be the same for each participant's submission. The evaluation criterion will be Uninterpolated Average Precision. The segmentation with the highest Average Precision will win. The winner is selected separately for Competitions 2 and 2b in each language.
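
For readers unfamiliar with the measure, the following Python sketch computes uninterpolated average precision for a single query using its standard textbook definition: the precision at the rank of each relevant document retrieved, summed and divided by the total number of relevant documents. Details of the organizers' setup (for example, how scores are averaged over queries) are not specified here, and the document identifiers are made up.

```python
from typing import Sequence, Set


def average_precision(ranking: Sequence[str], relevant: Set[str]) -> float:
    """Uninterpolated average precision of one ranked result list."""
    if not relevant:
        return 0.0
    hits = 0
    total = 0.0
    for rank, doc_id in enumerate(ranking, start=1):
        if doc_id in relevant:
            hits += 1
            total += hits / rank  # precision at this rank
    # Relevant documents that were never retrieved contribute zero.
    return total / len(relevant)


# Relevant documents d1 and d2 are found at ranks 2 and 4; d9 is missed.
# (1/2 + 2/4) / 3, i.e. approximately 0.333
print(average_precision(["d3", "d1", "d7", "d2"], {"d1", "d2", "d9"}))
```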

Competition 3

In competition 3, the morpheme analyses proposed by the participants' algorithm will be evaluated in a statistical machine translation (SMT) framework. The translation models will be trained to translate from a morphologically complex source language to English. The words of the source language will be replaced by their morpheme analyses before training. The translations from this morpheme-to-word model will be combined with translations from a standard word-to-word translation model. For all models, we will use a state-of-the-art phrase-based SMT system. Evaluation of the translations will be performed by applying an automatic metric such as BLEU on a held-out test set.

Competition 3 does not require any extra effort by the participants. The organizers will use the analyses provided by the participants in machine translation experiments. Data from the Europarl corpus will be used. Participants who wish to submit morpheme analyses for words in their actual context should contact the organizers for information on how to get the full corpus.