ALLOMORFESSOR: TOWARDS UNSUPERVISED MORPHEME ANALYSIS

Oskar Kohonen, Sami Virpioja and Mikaela Klami

Adaptive Informatics Research Centre, Helsinki University of Technology

Morphological analysis is crucial to many modern natural language
processing applications, especially when dealing with morphologically
rich languages. Consequently, there has been an increasing amount of
research on the task of unsupervised segmentation of word forms into
smaller useful units, i.e. morphs or morphemes.  Ultimately, we would
like to perform not morphological segmentation, but the more difficult
task of morpheme analysis, where the aim is not only to segment the
corpus word forms into subparts, but also to identify the morphological
labels corresponding to the surface forms.  For this task, the phenomenon of
allomorphy places limits on the quality of morpheme analysis
achievable by segmentation alone.

Our unsupervised method, Allomorfessor, tries to discover common
baseforms for allomorphs from an unannotated corpus. The method does
not directly model the corpus, but the lexicon of word forms in the
corpus. At its core, the model is a probabilistic context-free
grammar. The terminal symbols of the grammar are units resembling
linguistic morphemes, specifically root stems and affixes. We call
the non-terminal symbols virtual morphs; they are units that have
substructure. Compared to a successful segmentation method, Morfessor
Baseline, we add the notion of mutation to model allomorphic
variation. Each virtual morph splits into two parts, a prefix morph
and a suffix morph, with an optional mutation that modifies the prefix
morph. The prefix morph is assumed to be the baseform of the virtual
morph. The applied mutations can sequentially delete or substitute
letters of the prefix morph, starting from its end.
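
As an illustration of the mutation idea, the following is a minimal
sketch, not the implementation used in Allomorfessor. The operation
encoding (tuples tagged "del" and "sub"), the leftward search for the
target letter, and the function names apply_mutation and realize are
assumptions made for this example only.

```python
def apply_mutation(baseform, ops):
    """Apply delete/substitute operations to a baseform, working from its end.

    Each operation is a tuple (hypothetical encoding):
      ("del", letter)       delete the next occurrence of `letter`
      ("sub", old, new)     replace the next occurrence of `old` with `new`
    "Next" means the first occurrence found when scanning leftwards from
    the current position, which starts at the last letter.
    """
    letters = list(baseform)
    pos = len(letters) - 1
    for op in ops:
        kind, target = op[0], op[1]
        # Scan leftwards until the target letter is found.
        while pos >= 0 and letters[pos] != target:
            pos -= 1
        if pos < 0:
            raise ValueError(f"letter {target!r} not found for operation {op!r}")
        if kind == "del":
            del letters[pos]
        elif kind == "sub":
            letters[pos] = op[2]
        else:
            raise ValueError(f"unknown operation kind {kind!r}")
        # Later operations may only touch letters further to the left.
        pos -= 1
    return "".join(letters)


def realize(prefix_morph, mutation, suffix_morph):
    """Surface form of a virtual morph: mutated prefix morph plus suffix morph."""
    return apply_mutation(prefix_morph, mutation) + suffix_morph


# Finnish "kenkä" (shoe) with k replaced by g, followed by the suffix "n".
print(realize("kenkä", [("sub", "k", "g")], "n"))
# English "believe" with the final e deleted, followed by "ing".
print(realize("believe", [("del", "e")], "ing"))
```

With an empty operation list the prefix morph is left unchanged, so
plain concatenation, as in segmentation without mutations, is the
special case of this sketch.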

We use Maximum a Posteriori estimation and a local, greedy search
procedure to obtain the model parameters. The most computationally
challenging task is to find a good set of candidate baseforms and the
mutations that transform them into the analyzed surface morph. We restrict
the baseforms to those that exist in the initial word list and test
only the K nearest candidates.
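
The candidate restriction can be sketched as follows. The text above
does not specify how nearness is measured, so this example assumes,
purely for illustration, that candidates are ranked by the length of
the prefix they share with the surface morph, with shorter words
preferred on ties. The function names are hypothetical.

```python
import heapq


def common_prefix_len(a, b):
    """Number of leading letters shared by two strings."""
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n


def k_nearest_baseforms(surface, word_list, k):
    """Return up to k candidate baseforms for `surface` from `word_list`.

    Assumption for this sketch: nearness is the shared-prefix length,
    and ties are broken in favour of shorter candidates.
    """
    candidates = [w for w in word_list if w != surface]
    return heapq.nlargest(
        k,
        candidates,
        key=lambda w: (common_prefix_len(surface, w), -len(w)),
    )


words = ["kenkä", "kenttä", "talo", "kengän"]
print(k_nearest_baseforms("kengän", words, 2))
```

Only words from the given list are ever returned, which mirrors the
restriction to baseforms that exist in the initial word list; the
choice of K trades search cost against the chance of missing the
correct baseform.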

We evaluated the method by participating in the Morpho Challenge 2008
competition, where automatic analyses of corpora in English, German,
Turkish and Finnish are compared against a linguistic gold standard.
Our method achieved high precision but low recall for all four
languages. In practice, low recall means that the method
undersegments, i.e., the analysis is only partial and most of the
linguistic morphemes are not found. Despite the current problems in
the algorithm, we find the general approach to be promising and the
problem worth further research.