Unsupervised Morpheme Analysis -- Morpho Challenge 2009

This is a page of the previous Morpho Challenge 2009. The current challenge is Morpho Challenge 2010.

Rules

Submission of large data files

Send an email to the organizers morphochallenge2008mail.cis.hut.fi and tell where they can download the data files. Small data files (but not larger than a few MBs) can be emailed directly. Please, follow carefully the format of the result files described in datasets.

Acceptance

The organizers retain all rights to the Challenge data, which is given to the participants for use in this challenge only. The organizers may use the data submitted to the Challenge freely, without restrictions.

Eligibility

Anyone is allowed to participate. A participant may be either a single person or a group. A single person can participate in at most two groups. A participant is allowed to submit at most three different solutions, where each solution corresponds to a particular morpheme analysis method. Each of these methods may naturally be applied to each of the test languages. If a participant submits more than three solutions, the organizers decide which of the three will be accepted.

Test languages

Data sets are provided for five languages: Arabic, English, Finnish, German and Turkish. Participants are encouraged to apply their algorithm to all of these test languages, but are free to leave some languages out, if they wish to do so.

(New languages may be added, if interested co-organizers, suitable data and evaluation analyses become available in time.)

Task

The task is to develop a system that conducts unsupervised morpheme analysis for every word form contained in a word list supplied by the organizers for each test language.

The participants will be pointed to corpora in which the words occur, so that the algorithms may utilize information about word context.

Solutions, in which a large number of parameters must be "tweaked" separately for each test language, are of little interest. This challenge aims at the unsupervised (or very minimally supervised) morpheme analysis of words. The abstracts submitted by the participants must contain clear descriptions of which steps of supervision or parameter optimization are involved in the algorithms.

Competitions

The segmentations will be evaluated in three complementary ways:

Competition 1: The proposed morpheme analyses will be compared to a linguistic "gold standard".
Competition 2: Information retrieval (IR) experiments will be performed, where the words in the documents and queries will be replaced by their proposed morpheme representations. The search will then be based on morphemes instead of words.
Competition 3: Machine Translation (MT) model is trained, where the words in the source language documents will be replaced by their proposed morpheme representations. The words in the source language evaluation data will then also be replaced by their proposed morpheme representations and the translation will be based on morphemes instead of words.

Competition 1 will include all five test languages. Winners will be selected separately for each language. As a performance measure, the F-measure of accuracy of suggested morpheme analyses is utilized. Should two solutions produce the same F-measure, the one with higher precision will win.

Competition 2 will include three of the test languages. The organizers will perform the IR experiments based on the morpheme analyses submitted by the participants.

Competition 3 will include two of the test languages. Translation will be done from the test language to English. The organizers will train the translation models and perform the evaluation of the translations using an automatic metric such as BLEU.

Workshop and publication

All good results will be acknowledged with fame and glory. Presentations for the challenge workshop will be selected by the organizers based on the results and a paper of at most 10 pages describing the algorithm and the data submission. However, all groups who have submitted results and a paper are welcome to participate in the workshop to listen to the talks and join the discussions.

Workshop papers

For your paper submission (due August 15), please use the single-column CLEF 2007 Notebook Proceedings format. Here are a sample PDF file and a template Latex document. Email your paper submission to the organizers.
Formatting instructions:

size:	A4
format:	pdf (if difficult, ps or MS Word (rtf) are acceptable) Do NOT lock the pdf file.
borders:	top, left right 2.5 cm ; bottom 3 cm
text size:	16 x 24 cm
length:	10 pages maximum
title:	Times 14 pt bold centered
author(s):	Times 10 pt centered
abstract:	Times 10 pt justified
ACM Categories and Subject Descriptors:	Times 10 pt left aligned
Free Keywords:	Times 10 pt left aligned
body text:	Times 10 pt justified
Section Headings:	Times 12 pt bold left aligned
Emphasis:	Times 10 pt italic

Arbitration

In the case of disagreement the organizers will decide the final interpretation of the rules.