Unsupervised Morpheme Analysis -- Morpho Challenge 2010

Rules

Submission of large data files

Send an email to the organizers at morphochallenge2010@mail.cis.hut.fi and tell them where they can download the data files. Small data files (no larger than a few MB) can be emailed directly. Please follow carefully the format of the result files described in datasets.

Acceptance

The organizers retain all rights to the Challenge data, which is given to the participants for use in this challenge only. The organizers may use the data submitted to the Challenge freely, without restrictions.

Eligibility

Anyone is allowed to participate. A participant may be either a single person or a group. A single person can participate in at most two groups. A participant is allowed to submit at most three different solutions, where each solution corresponds to a particular morpheme analysis method. Each of these methods may naturally be applied to each of the test languages. If a participant submits more than three solutions, the organizers decide which three will be accepted.

Test languages

Data sets are provided for four languages: English, Finnish, German and Turkish. Participants are encouraged to apply their algorithm to all of these test languages, but are free to leave some languages out, if they wish to do so.

(New languages may be added, if interested co-organizers, suitable data and evaluation analyses become available in time.)

Task

The task is to develop a system that conducts unsupervised morpheme analysis for every word form contained in a word list supplied by the organizers for each test language.

The participants will be pointed to corpora in which the words occur, so that the algorithms may utilize information about word context.

New in 2010: There is a new category for semi-supervised learning algorithms that use the available linguistic gold standard morpheme analyses. The abstracts submitted by the participants must clearly describe which steps of supervision or parameter optimization are involved in the algorithms.

Participants are also allowed to use additional text corpora from any source for training their algorithms.

Competitions

The segmentations will be evaluated in three complementary ways:

Competition 1: The proposed morpheme analyses will be compared to a linguistic "gold standard".
Competition 2: Information retrieval (IR) experiments will be performed, where the words in the documents and queries will be replaced by their proposed morpheme representations. The search will then be based on morphemes instead of words.
Competition 3: A machine translation (MT) model will be trained, where the words in the source language documents will be replaced by their proposed morpheme representations. The words in the source language evaluation data will then also be replaced by their proposed morpheme representations, and the translation will be based on morphemes instead of words.
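
As a rough illustration of what "replacing words by their proposed morpheme representations" means in Competitions 2 and 3, the following minimal Python sketch substitutes each word with its morphemes. The function name, the lowercasing step, and the example analyses (including the labels +PL and +PAST) are assumptions made for illustration only; they are not the organizers' actual preprocessing.

```python
def replace_with_morphemes(text, analyses):
    """Replace every word that has an analysis by its morphemes.

    `analyses` maps a lowercased word form to a list of morpheme labels.
    Words without an analysis are kept unchanged (an assumption).
    """
    output = []
    for word in text.split():
        output.extend(analyses.get(word.lower(), [word]))
    return " ".join(output)


# Hypothetical analyses, invented for this example.
example_analyses = {
    "cats": ["cat", "+PL"],
    "walked": ["walk", "+PAST"],
}

print(replace_with_morphemes("Cats walked home", example_analyses))
```

With such a substitution, a query containing "cat" and a document containing "cats" share the morpheme "cat", which is the effect the IR and MT evaluations are designed to measure.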

Competition 1 will include all four test languages. Winners will be selected separately for each language and category (unsupervised or semi-supervised). The performance measure is the F-measure of the suggested morpheme analyses. Should two solutions produce the same F-measure, the one with the higher precision will win.
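
The ranking rule can be sketched in Python as follows. This is only an illustration under stated assumptions: it takes the F-measure to be the balanced harmonic mean of precision and recall, and the submission names and scores are invented. How precision and recall themselves are computed against the gold standard is not covered here.

```python
def f_measure(precision, recall):
    """Balanced F-measure: harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


def rank_submissions(scores):
    """Order submissions by F-measure, breaking ties by higher precision.

    `scores` maps a submission name to a (precision, recall) pair.
    """
    def sort_key(name):
        precision, recall = scores[name]
        return (f_measure(precision, recall), precision)

    return sorted(scores, key=sort_key, reverse=True)


# Hypothetical scores: A and B have the same F-measure, A has higher precision.
example_scores = {
    "B": (0.5, 0.8),
    "C": (0.6, 0.6),
    "A": (0.8, 0.5),
}

print(rank_submissions(example_scores))
```

Here A and B tie on F-measure, so A is ranked first because of its higher precision, and C follows with a lower F-measure.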

Competition 2 will include three of the test languages. The organizers will perform the IR experiments based on the morpheme analyses submitted by the participants.

Competition 3 will include two of the test languages. Translation will be done from the test language to English. The organizers will train the translation models and perform the evaluation of the translations using an automatic metric such as BLEU.

Workshop and publication

All good results will be acknowledged with fame and glory. Presentations for the challenge workshop will be selected by the organizers based on the results and a 4-page extended abstract describing the algorithm and the data submission. However, all groups who have submitted results and a paper are welcome to participate in the workshop to listen to the talks and join the discussions.

Extended abstracts

For the extended abstract you can use the two-column ACL format. For templates and general formatting instructions, see ACL 2010 Instructions for Preparing Camera-Ready Papers. The length of the paper should be around 4 pages. Email your extended abstract to the organizers by July 16.

Arbitration

In case of disagreement, the organizers will decide the final interpretation of the rules.