This is a page of the previous Morpho Challenge 2008. The current challenge is Morpho Challenge 2009.
In Competition 1, for each language, the morpheme analyses proposed by the participants' algorithm will be compared against a linguistic gold standard. Samples of the gold standards used are available for download on the datasets page.
Since the task at hand involves unsupervised learning, it cannot be expected that the algorithm comes up with morpheme labels that exactly correspond to the ones designed by linguists. That is, no direct comparison will take place between labels as such (the labels in the proposed analyses vs. labels in the gold standard). What can be expected, however, is that two word forms that contain the same morpheme according to the participants' algorithm also have a morpheme in common according to the gold standard. For instance, in the English gold standard, the words "foot" and "feet" both contain the morpheme "foot_N". It is thus desirable that also the participants' algorithm discovers a morpheme that occurs in both these word forms (be it called "FOOT", "morpheme784", "foot" or something else).
In practice, the evaluation will take place by sampling a large number of
word pairs, such that both words in the pair have at least one
morpheme in common. As the evaluation measure, we will use
F-measure, which is the harmonic mean of Precision
F-measure = 1/(1/Precision + 1/Recall).
Precision is here calculated as follows: A number of word forms will be randomly sampled from the result file provided by the participants; for each morpheme in these words, another word containing the same morpheme will be chosen from the result file by random (if such a word exists). We thus obtain a number of word pairs such that in each pair at least one morpheme is shared between the words in the pair. These pairs will be compared to the gold standard; a point is given for each word pair that really has a morpheme in common according to the gold standard. The total number of points is then divided by the total number of word pairs.
For instance, assume that the proposed analysis of the English word "abyss" is: "abys +s". Two word pairs are formed: Say that "abyss" happens to share the morpheme "abys" with the word "abysses"; we thus obtain the word pair "abyss - abysses". Also assume that "abyss" shares the morpheme "+s" with the word "mountains"; this produces the pair "abyss - mountains". Now, according to the gold standard the correct analyses of these words are: "abyss_N", "abyss_N +PL", "mountain_N +PL", respectively. The pair "abyss - abysses" is correct (common morpheme: "abyss_N"), but the pair "abyss - mountain" is incorrect (no morpheme in common). Precision here is thus 1/2 = 50%.
Recall is calculated analogously to recall: A number of word forms are randomly sampled from the gold standard file; for each morpheme in these words, another word containing the same morpheme will be chosen from the gold standard by random (if such a word exists). The word pairs are then compared to the analyses provided by the participants; a point is given for each sampled word pair that has a morpheme in common also in the analyses proposed by the participants' algorithm. The total number of points is then divided by the total number of word pairs.
For words that have several alternative analyses, as well as for word pairs that have more than one morpheme in common, some normalization of the points is carried out in order not to give these words considerably more weight in the evaluation than "less complex" words. We will spare the participants from the gory details. (The passionately interested reader may have a look at the source code of the evaluation script.)
You can evaluate your morphological analyses against the available gold standards (separately for each test language). The program to use for this is the Perl script: eval_morphemes.pl. The evaluation program is invoked as follows:
eval_morphemes.pl [-trace] wordpairsfile_goldstd wordpairsfile_result goldstdfile resultfile
Four files are given as arguments to eval_morphemes.pl:
The -trace argument is optional and produces output for every evaluated word separately. Regardless of the status of the trace argument, the evaluation program produces output of the following kind:
PART0. Precision: 69.00% (96/139); non-affixes: 81.55% (51/63); affixes: 58.73% (45/76) PART0. Recall: 25.59% (142/556); non-affixes: 49.78% (105/211); affixes: 10.78% (37/345) PART0. F-measure: 37.33%; non-affixes: 61.82%; affixes: 18.22% # TOTAL. Precision: 69.00%; non-affixes: 81.55%; affixes: 58.73% TOTAL. Recall: 25.59%; non-affixes: 49.78%; affixes: 10.78% TOTAL. F-measure: 37.33%; non-affixes: 61.82%; affixes: 18.22%
Note that results are displayed for partition 0 (PART0) and for the entire data (TOTAL). The total scores are here the same as the scores of PART0, since there is only one partition. It is, however, possible to split the data into several partitions and compute results for each partition separately. The overall scores are then calculated as the mean over the partitions. Splitting into partitions is a feature reserved for the final evaluation, when we will assess the statistical significance of the differences between the participants' algorithms.
The figures that count in the final evaluation are the first precision, recall, and F-measure values on the TOTAL lines. These values pertain to all morphemes, but there are also separate statistics for morphemes classified as non-affixes vs. affixes. What counts as an affix is a morpheme with a label starting with a plus sign, e.g., "+PL", "+PAST". This naming convention is applied in the gold standard, which means that you do not have to do anything in order to get the non-affixes/affixes statistics right as far as recall is concerned. However, if you want the same kind of information also for precision, your algorithm must have a means of discovering which morphemes are likely affixes and tag these morphemes with an initial plus sign. Note that it is fully up to you whether you do this or not; it will not affect your position in the competition in any way.
Sampling word pairs for the calculation of an estimate of the precision
In order to get an estimate of the precision of the algorithm, you need to provide the evaluation script eval_morphemes.pl with a file containing word pairs sampled from your result file. Unfortunately, the estimate is likely to be fairly rough. The reason for this is that you do not have the entire gold standard at your disposal. Thus, if you sample pairs of words that are not included in the 500-word gold standard that you can access, it is impossible to know whether the proposed morphemes are correct or not. What you can do, however, is to make sure that each word that goes into a word pair actually does occur in the 500-word gold standard sample. The problem here is that your algorithm might not propose that many common morphemes for the words within this limited set, and thus the estimate will be based on rather few observations.
Anyway, this is how to do it: First, make a list of relevant words, that is, words that are present in the gold standard sample available:
cut -f1 goldstdfile > relevantwordsfile
Then sample word pairs for 100 words selected by random from your results file:
sample_word_pairs.pl -refwords relevantwordsfile < resultfile > wordpairsfile_result
Competition 2 does not necessarily require any extra effort by the
participants. The organizers will use the analyses provided by
the participants in information retrieval experiments. Data from CLEF will be used.
However, those participants who wish to submit morpheme analysis for words in their actual context (competition 2b), please contact the organizers for more information how to register to CLEF to obtain the full texts.
In the competition 2 (and 2b) the words in the queries and documents will be replaced by the corresponding morpheme analyses provided by the participants. We will perform the IR evaluation using the state-of-the-art Okapi (BM25) retrieval method (the latest version of the freely available LEMUR toolkit. The most common morphemes in each participant's submission will be left out from the index. The size of this stoplist will be proportional to the amount of the text data in each language and the stoplist size will be the same for each participant's submission. The evaluation criterion will be Uninterpolated Average Precision. The segmentation with the highest Average Precision will win. The winner is selected separately for competitions 2 and 2b in each language.
You are at: CIS → Unsupervised Morpheme Analysis -- Morpho Challenge 2008
Page maintained by webmaster at cis.hut.fi, last updated Tuesday, 03-Feb-2009 18:23:05 EET