ALLOMORFESSOR: TOWARDS UNSUPERVISED MORPHEME ANALYSIS
Oskar Kohonen, Sami Virpioja and Mikaela Klami
Adaptive Informatics Research Centre, Helsinki University of Technology

Morphological analysis is crucial to many modern natural language processing applications, especially when dealing with morphologically rich languages. Consequently, there has been an increasing amount of research on the task of unsupervised segmentation of word forms into smaller useful units, i.e. morphs or morphemes. Ultimately, we would like to perform not morphological segmentation but the more difficult task of morpheme analysis, where the aim is not only to segment the corpus word forms into subparts, but also to identify the morphological labels that the surface forms correspond to. For this task, the phenomenon of allomorphy places limits on the quality of morpheme analysis achievable by segmentation alone. Our unsupervised method, Allomorfessor, tries to discover common baseforms for allomorphs from an unannotated corpus. The method does not model the corpus directly, but rather the lexicon of word forms in the corpus. At its core, the model is a probabilistic context-free grammar. The terminal symbols of the grammar are units resembling linguistic morphemes, specifically root stems and affixes. We call the non-terminal symbols virtual morphs; they are units that have substructure. Compared to Morfessor Baseline, a successful segmentation method, we add the notion of mutation to model allomorphic variation. Each virtual morph splits into two parts, a prefix morph and a suffix morph, with a possible mutation that modifies the prefix morph; the prefix morph is assumed to be the baseform of the virtual morph. The applied mutations can sequentially delete or substitute letters of the prefix morph, starting from its end. We use Maximum a Posteriori estimation and a local, greedy search procedure to obtain the model parameters.
The most computationally challenging task is to find a good set of candidate baseforms and the mutations that transform them into the analyzed surface morph. We restrict the baseforms to those that exist in the initial word list and test only the K nearest candidates. We evaluated the method by participating in the Morpho Challenge 2008 competition, where automatic analyses of corpora in English, German, Turkish and Finnish are compared against a linguistic gold standard. Our method achieved high precision but low recall for all four languages. In practice, low recall means that the method undersegments, i.e., the analysis is only partial and most of the linguistic morphemes are not found. Despite the current problems in the algorithm, we find the general approach to be promising and the problem worth further research.
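
To make the notion of mutation concrete, the following minimal sketch shows one way a mutation could be applied to a baseform before the suffix morph is attached. The representation is an assumption made for illustration only: a mutation is encoded here as a list of ("keep" | "del" | "sub", letter) steps consumed from the end of the baseform, and the names apply_mutation and surface_form are hypothetical. It does not reproduce the encoding or the probabilistic model used in Allomorfessor itself.

```python
from typing import List, Optional, Tuple

# Assumed encoding of one edit step (illustrative, not the paper's format):
# ("keep", None) leaves a letter as is, ("del", None) deletes it,
# ("sub", x) replaces it with the letter x.
Op = Tuple[str, Optional[str]]


def apply_mutation(baseform: str, mutation: List[Op]) -> str:
    """Apply edit steps to the baseform, one letter per step, from its end."""
    letters = list(baseform)
    pos = len(letters) - 1  # start at the last letter of the baseform
    for kind, arg in mutation:
        if pos < 0:
            raise ValueError("mutation is longer than the baseform")
        if kind == "del":
            del letters[pos]
        elif kind == "sub":
            if arg is None:
                raise ValueError("substitution needs a replacement letter")
            letters[pos] = arg
        elif kind != "keep":
            raise ValueError(f"unknown operation: {kind}")
        pos -= 1  # move one letter towards the start of the baseform
    return "".join(letters)


def surface_form(baseform: str, mutation: List[Op], suffix: str) -> str:
    """Build a word form from a mutated prefix morph and a suffix morph."""
    return apply_mutation(baseform, mutation) + suffix


if __name__ == "__main__":
    print(surface_form("pretty", [("sub", "i")], "er"))   # substitute y with i
    print(surface_form("make", [("del", None)], "ing"))   # delete final e
    print(surface_form("walk", [], "ed"))                 # empty mutation
```

With this encoding, an empty mutation reduces the analysis to plain segmentation, while the substitution and deletion steps let word forms such as "prettier" and "making" share the baseforms "pretty" and "make".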