Morphological Induction Through Linguistic Productivity

Sarah A. Goodman
University of Maryland-College Park
sagoodm@umd.edu

The induction program we have crafted relies primarily on the linguistic notion of \texttt{productivity} to find affixes in unmarked text, without the aid of prior grammatical knowledge. The algorithm unfolds in two stages. It first finds seed affixes, including infixes and circumfixes, by assaying the character of all possible internal partitions of every word in a small corpus of no more than 3,000 tokens. It then selects a small subset of these seed affixes by examining the distribution patterns of the roots they attach to, as evidenced in a second, possibly larger, training file. Specifically, it hypothesizes that valid roots take a partially overlapping affix set, and develops this conjecture into agendas for both feature-set generation and binary clustering. It collects a feature set for each candidate by what we term affix-chaining: delineating (and storing) a path of affixes joined, subject to thresholding caveats, via the roots they share. After clustering the resultant sets, the program yields two affix groups, an ostensibly valid collection and a putatively spurious one. It refines the membership of the former by again examining the quality of shared-root distributions across affixes.

This second half of the program is, furthermore, iterative. The iteration is again grounded in productivity: we reason that, should a root take one affix, it most likely takes more. The code therefore seeds each subsequent training pass with affixes that associate with roots learned during the current pass. If, for example, it recognizes {\itshape view} on the first pass, and {\itshape viewership} occurs in the second training file, the program will evaluate \texttt{-ership}, along with its mate \texttt{-er}, via clustering and root connectivity on the second pass.

The results of this method are thus far mixed, varying with training-file size. Time constraints imposed by shortcomings in the algorithm's code have so far prevented us from training fully on a large file. For Morpho Challenge 2008, we not only trained on just 1-30% of the offered text, saddling the stemmer with a number of Out-Of-Vocabulary items, but also divided that text into smaller parts, thereby, as the results show, omitting valuable information about the true range of affix distributions.
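A minimal sketch of the idea behind the first stage follows. The helper name \texttt{find\_seed\_affixes}, the restriction to two-way splits (prefixes and suffixes only, omitting the infix and circumfix cases, which would require three- and four-way partitions), and the \texttt{min\_count} threshold are illustrative assumptions, not the program's actual code.

\begin{verbatim}
from collections import Counter

def find_seed_affixes(words, min_count=5):
    """Assay every internal partition of every word, counting how
    often each edge substring recurs as a candidate prefix/suffix."""
    prefixes, suffixes = Counter(), Counter()
    for w in words:
        for i in range(1, len(w)):
            prefixes[w[:i]] += 1   # left piece of the split
            suffixes[w[i:]] += 1   # right piece of the split
    # Keep only pieces frequent enough to look productive.
    seeds = {p for p, n in prefixes.items() if n >= min_count}
    seeds |= {s for s, n in suffixes.items() if n >= min_count}
    return seeds
\end{verbatim}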
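The affix-chaining step can be sketched in the same hedged spirit, under the assumption that each candidate affix has been mapped to the set of roots it was seen attached to; the name \texttt{affix\_chain} and the \texttt{min\_shared} overlap threshold stand in for the thresholding caveats mentioned above.

\begin{verbatim}
def affix_chain(start, root_sets, min_shared=3):
    """Collect the affixes reachable from `start` by hopping between
    affixes that share at least `min_shared` roots; the resulting set
    serves as the feature set for `start`."""
    chain, frontier = {start}, [start]
    while frontier:
        a = frontier.pop()
        for b, roots_b in root_sets.items():
            if b not in chain and len(root_sets[a] & roots_b) >= min_shared:
                chain.add(b)
                frontier.append(b)
    return chain
\end{verbatim}

Feature sets gathered this way are what the binary clusterer consumes when separating the ostensibly valid affixes from the putatively spurious ones.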
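Finally, the iterative seeding can be illustrated as below; \texttt{seed\_next\_pass} is a hypothetical name, and a real pass would carry segmentation bookkeeping that this sketch omits.

\begin{verbatim}
def seed_next_pass(learned_roots, corpus_words):
    """Propose affixes for the next pass from words in the second
    training file that extend roots learned during this pass."""
    new_affixes = set()
    for w in corpus_words:
        for r in learned_roots:
            if w != r and w.startswith(r):
                # e.g. "viewership" minus "view" -> "ership"
                new_affixes.add(w[len(r):])
    return new_affixes
\end{verbatim}

For instance, given the learned root {\itshape view} and the corpus word {\itshape viewership}, the sketch returns the string \texttt{ership} (the \texttt{-ership} of the example above), which the next pass then evaluates, along with its mate \texttt{-er}, via clustering and root connectivity.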