


The organizers retain all rights to the Challenge data, which is given to the participants for use in this challenge only. The organizers may use the data submitted to the Challenge freely, without restrictions.


Anyone is allowed to participate. A participant may be either a single person or a group. A single person can participate in at most two groups. A participant is allowed to submit at most three different solutions, where each solution corresponds to a particular segmentation method. Each of these methods may naturally be applied to each of the test languages. If a participant submits more than three solutions, the organizers decide which three will be accepted.

Test languages

Data sets are provided for three languages: Finnish, English, and Turkish. Participants are encouraged to apply their algorithm to all of these test languages, but are free to leave some languages out, if they wish to do so.

(New languages may be added, if interested co-organizers, suitable data and evaluation segmentations become available in time.)


The task is the unsupervised segmentation of word forms into sub-word units (segments) given a data set that consists of a long list of words and their frequencies of occurrence in a corpus.

In the proposed segmentation, the number of unique segments (the type count) must lie within the range 1,000 to 300,000.
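The type-count constraint can be checked before submitting. The following is a minimal sketch, not an official validation tool: it assumes each segmented word is a string with single spaces at the proposed boundaries, and that the stated range is inclusive at both ends. The function names are hypothetical.

```python
def count_segment_types(segmented_lines):
    """Return the number of distinct segments (types) in a segmentation."""
    segment_types = set()
    for line in segmented_lines:
        # Each whitespace-separated token is one proposed segment.
        segment_types.update(line.split())
    return len(segment_types)


def within_type_limit(segmented_lines, low=1000, high=300_000):
    """Check the type count against the range given in the rules.

    Treating both bounds as inclusive is an assumption.
    """
    return low <= count_segment_types(segmented_lines) <= high


example = ["un believ able", "believ er s", "walk ed", "un walk able"]
print(count_segment_types(example))  # 7 distinct segments
print(within_type_limit(example))    # False: far fewer than 1000 types
```

A toy list like this one falls below the lower bound; a real submission covers the full word list of a test language.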

Solutions in which a large number of parameters must be "tweaked" separately for each test language are of little interest. This challenge aims at the unsupervised (or very minimally supervised) segmentation of words into morphemes. The abstracts submitted by the participants must clearly describe which steps of supervision or parameter optimization are involved in the algorithms.


The segmentations will be evaluated in two complementary ways:

Competition 1 will include all three test languages. Winners will be selected separately for each language. The performance measure is the F-measure of the accuracy of the discovered morpheme boundaries. Should two solutions produce the same F-measure, the one with higher precision will win.
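The boundary-based measure of Competition 1 can be sketched as follows. This is an illustration under stated assumptions, not the organizers' evaluation script: it assumes proposed and reference segmentations are paired word by word, that both spell the same word, and that counts are pooled over all words before precision and recall are computed. All names are hypothetical.

```python
def boundaries(segmented_word):
    """Return the character offsets at which a boundary is placed."""
    positions, offset = set(), 0
    # Every segment except the last one is followed by a boundary.
    for segment in segmented_word.split()[:-1]:
        offset += len(segment)
        positions.add(offset)
    return positions


def boundary_scores(proposed, gold):
    """Return (precision, recall, F-measure) over paired segmentations."""
    hits = n_proposed = n_gold = 0
    for proposed_word, gold_word in zip(proposed, gold):
        p_b, g_b = boundaries(proposed_word), boundaries(gold_word)
        hits += len(p_b & g_b)        # boundaries found in both
        n_proposed += len(p_b)
        n_gold += len(g_b)
    precision = hits / n_proposed if n_proposed else 0.0
    recall = hits / n_gold if n_gold else 0.0
    f_measure = 2 * precision * recall / (precision + recall) if hits else 0.0
    return precision, recall, f_measure


def ranking_key(scores):
    """Sort key: higher F-measure first, higher precision breaks ties."""
    precision, _recall, f_measure = scores
    return (f_measure, precision)


proposed = ["un believ able", "walk ed", "bo xes"]
gold = ["un believ able", "walk ed", "box es"]
print(boundary_scores(proposed, gold))  # (0.75, 0.75, 0.75)
```

Several systems could then be ranked with `max(all_scores, key=ranking_key)`, which applies the precision tie-break stated in the rules.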

Competition 2 will include at least Finnish (possibly the other languages as well). The organizers will estimate the language models and perform the required speech recognition experiments. As a performance measure, the phoneme error rate in speech recognition will be utilized.
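The rules do not state how the phoneme error rate is computed. A common definition, assumed here, is the edit distance (substitutions, deletions, and insertions) between the recognized and the reference phoneme sequences, divided by the length of the reference. The sketch below illustrates that definition only; the organizers' actual scoring may differ, and the function names are hypothetical.

```python
def edit_distance(reference, hypothesis):
    """Levenshtein distance between two sequences of phoneme symbols."""
    previous = list(range(len(hypothesis) + 1))
    for i, ref_symbol in enumerate(reference, start=1):
        current = [i]
        for j, hyp_symbol in enumerate(hypothesis, start=1):
            substitution = previous[j - 1] + (ref_symbol != hyp_symbol)
            deletion = previous[j] + 1       # reference symbol missing
            insertion = current[j - 1] + 1   # extra hypothesis symbol
            current.append(min(substitution, deletion, insertion))
        previous = current
    return previous[-1]


def phoneme_error_rate(reference, hypothesis):
    """Edit distance normalized by the reference length (assumed definition)."""
    if not reference:
        raise ValueError("reference must not be empty")
    return edit_distance(reference, hypothesis) / len(reference)


# Toy example: one phoneme of a five-phoneme reference is dropped.
print(phoneme_error_rate(list("kissa"), list("kisa")))  # 0.2
```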

Workshop and publication

All good results will be acknowledged with fame and glory. Presentations for the challenge workshop will be selected by the program committee based on the results and an extended abstract of at most 6 pages.

Camera-ready submission (final submission in March)

The final camera-ready submissions use a different format from the papers submitted for review. We apologize for the inconvenience of having to reformat your documents. For your final paper submission (due March 17th), please use the two-column ACL/COLING format (figures and tables may still span the whole width of a page). Detailed formatting instructions can be found here: PDF or PS. You need the following files: LaTeX style file, LaTeX bibliography file, and a template LaTeX document. Note: Disregard the instructions given in the ACL/COLING files about the length of the publications. The maximum length of your paper is 6 pages (including references and figures). Email your final paper to the organizers.

Submission of results (instructions for first submission in January)

Submissions consist of result files (segmentations) for the languages concerned and of an extended abstract. See the deadlines.

The format of a result file is described on the datasets page: it is a word list with spaces inserted at the locations of the proposed morpheme boundaries. Email your result files, preferably gzipped, to the organizers at morphochallenge2005@mail.cis.hut.fi. Alternatively, you can indicate a (not easily guessable) URL from which the organizers can retrieve your files.
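Producing a gzipped result file can be sketched as follows. The authoritative format is the one on the datasets page; this sketch only assumes one word per line with a single space at each proposed boundary. The file name, the function names, and the default encoding are assumptions chosen for illustration.

```python
import gzip
import os
import tempfile


def write_result_file(path, segmentations, encoding="utf-8"):
    """Write one segmented word per line into a gzip-compressed file.

    `segmentations` is a list of segment lists, one per word.
    The encoding is an assumption; match that of the provided data.
    """
    with gzip.open(path, "wt", encoding=encoding) as handle:
        for segments in segmentations:
            handle.write(" ".join(segments) + "\n")


def read_result_file(path, encoding="utf-8"):
    """Read the file back as a list of segment lists (for checking)."""
    with gzip.open(path, "rt", encoding=encoding) as handle:
        return [line.split() for line in handle]


# Hypothetical file name inside a temporary directory.
result_path = os.path.join(tempfile.mkdtemp(), "segmentation_english.gz")
write_result_file(result_path, [["un", "believ", "able"], ["walk", "ed"]])
print(read_result_file(result_path))
```

Reading the file back before emailing it is a cheap way to confirm that the boundaries survived compression and encoding unchanged.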

In addition to the result file(s), you need to provide us with an extended abstract. The abstract corresponds to a short conference paper: it describes the algorithm used and includes references to relevant previous work. It may also include a preliminary assessment of the results. It must clearly describe which steps of supervision or parameter optimization are involved in the algorithms. Email your abstract to the organizers.

The extended abstract must follow the formatting used in the NIPS conference proceedings, with the exception that the extended abstract may not exceed six pages in length, including figures and references, using a font no smaller than 10 point. The text is to be confined within an 8.25 inch by 5 inch rectangle. Detailed formatting instructions are available in PDF, Postscript, and RTF format. LaTeX support files are available: nips2005e.sty, nips2005.sty, nips2005.tex. Submissions violating these guidelines will not be considered. Web links to supplementary material (e.g., software, videos) may appear in the paper, but reviewers will not be required to view the supplementary material. (Whether the final camera-ready submissions will follow exactly the same format is not yet known.)


In case of disagreement, the organizers will decide the final interpretation of the rules.

