Daniel Zeman: Using Unsupervised Paradigm Acquisition for Prefixes

We describe a simple method of unsupervised morpheme segmentation of 
words in an unknown language. All that is needed is a raw text corpus 
(or a list of words) in the given language. The algorithm identifies 
word parts occurring in many words and interprets them as morpheme 
candidates (prefixes, stems and suffixes). There are two main phases: 
/morpheme learning/ and /morpheme segmentation/ proper. In the first 
phase, we learn morpheme candidates and filter them until we get lists 
of known morphemes. In the second phase, we return to the original 
words and use the morpheme lists to segment the words into morphemes.
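
The candidate-collection idea of the first phase can be illustrated 
with a minimal sketch. It is a deliberate simplification, not the 
paradigm-based learner of Zeman (2007): it only counts word endings 
shared by several distinct words. The function name 
count_affix_candidates, the threshold min_words and the toy word list 
are assumptions made for this illustration.

```python
from collections import Counter


def count_affix_candidates(words, min_words=2):
    """Return word endings that occur in at least `min_words` distinct words.

    Hypothetical helper for illustration only; the real system applies
    further filtering before a candidate counts as a known morpheme.
    """
    suffix_counts = Counter()
    for word in set(words):
        # Try every split point; the stem (left part) must be non-empty,
        # so the whole word is never counted as its own suffix.
        for i in range(1, len(word)):
            suffix_counts[word[i:]] += 1
    return {s: n for s, n in suffix_counts.items() if n >= min_words}


# Toy word list (assumed data, not taken from the Morpho Challenge corpora).
words = ["walked", "talked", "walking", "talking", "walks", "talks"]
candidates = count_affix_candidates(words, min_words=2)
```

On this toy list the candidates include "ed", "ing" and "s", but also 
spurious endings such as "alked", which is why a filtering step is 
needed before the lists can be used for segmentation.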

In Zeman (2007), we were only able to cut a word into at most two 
parts: the stem and the suffix. The main innovation over Zeman (2007) 
is the ability to learn prefixes. We propose two algorithms for 
prefixes. The “reversed word” method is just the stem-suffix algorithm 
applied to a reversed word. The “rule-based” method is more 
conservative: required properties are specified, and all prefixes 
complying with the constraints are learned.
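
The “reversed word” idea can be sketched as follows. Each word is 
reversed, a suffix learner is run on the reversed words, and the 
learned suffixes are reversed back to give prefixes. The learner 
learn_suffixes below is a hypothetical frequency-based stand-in, not 
the stem-suffix algorithm of Zeman (2007); only the reversal wrapper 
reflects the method described above.

```python
from collections import Counter


def learn_suffixes(words, min_words=2):
    """Stand-in suffix learner (assumption): keep word endings shared by
    at least `min_words` distinct words."""
    counts = Counter()
    for word in set(words):
        for i in range(1, len(word)):
            counts[word[i:]] += 1
    return {s for s, n in counts.items() if n >= min_words}


def learn_prefixes_reversed(words, suffix_learner=learn_suffixes):
    """Learn prefixes by applying a suffix learner to reversed words."""
    reversed_words = [w[::-1] for w in words]
    # A suffix of the reversed word is a prefix of the original word,
    # read backwards; reversing it again restores the prefix.
    return {s[::-1] for s in suffix_learner(reversed_words)}


# Toy word list (assumed data for illustration).
words = ["undo", "untie", "redo", "retie"]
prefixes = learn_prefixes_reversed(words)
```

With this toy input the result contains "un" and "re", together with 
the single letters "u" and "r". Any suffix learner with the same 
interface can be passed in as suffix_learner.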

Two segmentation algorithms have been tested: a strict 
(precision-oriented) one and a less strict one. The paper reports on 
more experiments than were included in the main Morpho Challenge 
competition. The combination of the Zeman (2007) stem-suffix learning, 
rule-based prefix learning and the less strict segmentation is 
currently the most successful. The resulting F-score of morpheme 
labeling depends heavily on the language, ranging from 0.23 (Arabic) 
to 0.50 (English).
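
For reference, the sketch below computes an F-score, assuming the 
usual definition as the harmonic mean of precision and recall. How 
precision and recall of morpheme labeling are themselves obtained is 
not covered here, and the input values are hypothetical, not results 
from the paper.

```python
def f_score(precision, recall):
    """Harmonic mean of precision and recall (balanced F-score)."""
    if precision + recall == 0:
        # Avoid division by zero when both values are zero.
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Hypothetical values chosen only to show the arithmetic.
example = f_score(0.5, 0.25)
```

Because the harmonic mean is dominated by the smaller of the two 
values, a precision-oriented segmentation with low recall can score 
below a less strict one.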

The error analysis section shows how typos affect the results. The 
current algorithm cannot use word frequencies and has no means of 
identifying typos. Numerous examples from the data are shown, and 
further suggestions for future work are made.

References:

Daniel Zeman. 2007. /Unsupervised Acquiring of Morphological Paradigms 
from Tokenized Text./ In: Working Notes for the Cross Language 
Evaluation Forum (CLEF) 2007 Workshop, Budapest, Hungary. ISSN 
1818-8044. Revised version to appear in C. Peters et al. (eds.): CLEF 
2007, LNCS 5152, pp. 892–899, Springer-Verlag, Berlin / Heidelberg, 
Germany, 2008.