Unsupervised Morpheme Analysis -- Morpho Challenge 2010

Frequently Asked Questions

Question: I am developing a method for unsupervised/semisupervised learning of morphology. Would it be possible to get the full lists of reference analyses for evaluation purposes?

Answer: Unfortunately, we cannot share the full sets of gold standard analyses or segmentations. First, they are based on external resources that we do not have permission to distribute. Second, the sets are limited in size, and we would like to reserve non-public test data for future Morpho Challenges.

Note that reasonably sized development sets are available on the datasets page. They should be useful for developing your method. Moreover, they give you some idea of how close you are to the state of the art: the difference between the development set results and the test set results should usually be less than 5% absolute.

If no further Morpho Challenges are forthcoming, we may be able to run a few linguistic evaluations with the Morpho Challenge test sets on request. This will be limited to one or two submissions per language. We will ask for at least the following:

Details of the planned publication forum (e.g. journal or conference) for the new evaluation results.
Details of previous publications on the method to be evaluated, if any.
Agreement that we can use the submitted results in academic research, citing the publication(s) that you have specified when appropriate.

Please contact Sami Virpioja (firstname.lastname at aalto.fi) for requests and further questions.

Question: Could you clarify how to use the development sets to evaluate my results?

Answer: Here is an example of the commands needed for English. It assumes that your results are in the file proposed.labels.eng in the same directory.

    cut -f1 goldstd_develset.labels.eng > relevantwordsfile.eng
    ./sample_word_pairs_v2.pl -n 300 -refwords relevantwordsfile.eng < proposed.labels.eng > proposed.wordspairs.eng
    ./eval_morphemes_v2.pl goldstd_develset.wordpairs.eng proposed.wordspairs.eng goldstd_develset.labels.eng proposed.labels.eng

Make sure that your morpheme analysis file contains no extra whitespace. Extra whitespace is known to break the evaluation.
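The whitespace warning above can be checked mechanically before running the evaluation scripts. The following is a minimal sketch, not part of the official tools. It assumes that each line of the analysis file has the form word, one tab, then the analysis with morphemes separated by single spaces; this format is not spelled out in this FAQ, so adjust the checks if your file differs. The function name find_whitespace_problems is hypothetical.

```python
import re


def find_whitespace_problems(lines):
    """Return (line_number, message) pairs for suspicious whitespace.

    Assumed format (an assumption, not stated in this FAQ):
    word<TAB>analysis, with morphemes separated by single spaces.
    """
    problems = []
    for number, raw in enumerate(lines, start=1):
        line = raw.rstrip("\n")
        # Files edited on Windows may carry a stray carriage return.
        if line.endswith("\r"):
            problems.append((number, "carriage return at end of line"))
            line = line[:-1]
        if not line.strip():
            problems.append((number, "blank line"))
            continue
        if line != line.strip():
            problems.append((number, "leading or trailing whitespace"))
        if line.count("\t") != 1:
            problems.append((number, "expected exactly one tab"))
        if re.search(r" {2,}", line):
            problems.append((number, "repeated spaces"))
        if re.search(r"\t | \t", line):
            problems.append((number, "space next to the tab"))
    return problems
```

To use it, open proposed.labels.eng, pass the file object to find_whitespace_problems, and print the returned pairs. An empty list means none of these particular problems were found; it does not guarantee that the file is otherwise valid.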

Question: I'd like to make use of the word context this time around. From the website, I see that we are allowed to look at the words in context and which corpora they came from, but are the clean/tokenized versions of the full corpora available for download? I only seem to be able to find the word lists and semi-supervised training data, but not the full corpora on the datasets page.

Answer: This time we decided not to distribute any text corpora. If your algorithm needs the full sentences, you are allowed to search for them in any corpora you like, as long as you keep the task unsupervised (or at most minimally supervised). Remember to report which corpora you have used. The links to the corpora from which our word lists were extracted can be found on the datasets page. We can help you with the text processing scripts for those corpora. You are also free to use the processed text corpora available from the datasets pages of the previous Morpho Challenges.

Check also the questions for the previous Morpho Challenges below.

Question: Your website rules page does not define what counts as "unsupervised learning". I suppose this means that the program cannot be explicitly given a training file containing "example answers", nor can example answers be hard-coded into the program. Can you suggest a better definition?

Answer: That sounds like a good minimum requirement. Of course, one sees solutions where people make lots of "hard-coded" assumptions about word structure, e.g., stem-final vowels that can be dropped, so at some point one wonders where to draw the border between entirely unsupervised methods, minimally supervised methods, and so on. Thus, it is important that all such assumptions be explicitly mentioned when results are reported.

Question: Looking at the competition description it seems clear that you are looking for morpheme classification (e.g. distinguishing English plural nouns from third-person-singular verbs, both of which are regularly associated with adding "s" to a stem). I cannot see how such distinctions are possible without access to word classes. However, none of the corpora you provide include POS information. Are you expecting entrants to also write a word-classification algorithm alongside their morphology analyser/classifier, or are you allowing the use of supervised taggers?

Answer: The idea is that your algorithm works in an unsupervised fashion. Maybe you will find different distributions for different stems: if "s" (in English) goes together with "ing" and "ed", you have one kind of morpheme (verb ending); if it does not, or if it goes together with "'s", you have another kind of morpheme (noun ending). So, you are not allowed to use supervised taggers. However, do not let this put you off. We do not know how advanced the systems that people come up with will be. For instance, treating all instances of "s" alike may be acceptable if your system otherwise does a good job of finding word segments accurately.
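The distributional idea in this answer can be illustrated with a toy sketch. It is not the organizers' method or any participant's system. It collects, for each candidate stem, the set of suffixes the stem occurs with in a word list, and then guesses which kind of "s" the stem takes. The suffix list is given by hand here purely for illustration; a genuinely unsupervised system would have to induce candidate suffixes from the data. The function names and labels (suffix_signatures, label_s, "s_verb", "s_noun") are hypothetical.

```python
from collections import defaultdict


def suffix_signatures(words, suffixes):
    """Map each candidate stem to the set of suffixes it occurs with.

    A stem only counts if it also appears as a word on its own,
    which is a simplifying assumption of this sketch.
    """
    vocabulary = set(words)
    signatures = defaultdict(set)
    for word in vocabulary:
        for suffix in suffixes:
            stem = word[: -len(suffix)]
            if word.endswith(suffix) and stem in vocabulary:
                signatures[stem].add(suffix)
    return dict(signatures)


def label_s(signature):
    """Guess which kind of "s" a stem takes from its other suffixes."""
    if "s" not in signature:
        return None
    # "s" alongside "ing" or "ed" suggests a verb ending.
    if signature & {"ing", "ed"}:
        return "s_verb"
    # Otherwise (for example alongside "'s") treat it as a noun ending.
    return "s_noun"


words = ["walk", "walks", "walking", "walked", "cat", "cats", "cat's"]
signatures = suffix_signatures(words, ["s", "ing", "ed", "'s"])
labels = {stem: label_s(sig) for stem, sig in signatures.items()}
```

With this word list, "walk" is labeled "s_verb" and "cat" is labeled "s_noun". Stems that are both nouns and verbs, such as "walk" itself, show that such a simple rule is only a rough heuristic.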

Question: How will the submission be done? The website says by email, which is probably cumbersome given the size of the data files. Could we put them on a server here and send you a link by email?

Answer: Yes, a link would be much better than emailing all the data.