The data sets provided by the organizers consist of word lists. Each word in the list is preceded by its frequency in the corpora used. The participants' task is to return exactly the same list(s) of words, with spaces inserted at the locations of proposed morpheme boundaries. The list(s) returned need not contain the word frequency information.
For instance, a subset of the supplied English word list looks like this:
...
6755 sea
1 seabed
1 seabeds
2 seabird
34 seaboard
1 seaboards
...
A submission for this particular set of words may look like this:
...
sea
sea bed
sea bed s
sea bird
sea board
sea board s
...
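The input and output formats described above can be sketched in Python as follows. This is only an illustration: `parse_word_list`, `format_submission`, and `toy_segment` are hypothetical names, and `toy_segment` is a trivial placeholder, not a real segmentation method. The sketch assumes one "frequency word" pair per input line.

```python
def parse_word_list(lines):
    """Yield (frequency, word) pairs from lines of the form '<frequency> <word>'."""
    for line in lines:
        line = line.strip()
        if not line:
            continue
        freq, word = line.split(maxsplit=1)
        yield int(freq), word


def format_submission(words, segment):
    """Return one output line per word, with morphs separated by single spaces."""
    out = []
    for word in words:
        morphs = segment(word)
        # Removing the inserted spaces must give back the original word.
        if "".join(morphs) != word:
            raise ValueError(f"segmentation of {word!r} alters the word")
        out.append(" ".join(morphs))
    return out


def toy_segment(word):
    """Hypothetical placeholder: split off a final 's' from longer words."""
    if len(word) > 3 and word.endswith("s"):
        return [word[:-1], "s"]
    return [word]


sample = ["6755 sea", "1 seabed", "1 seabeds", "2 seabird"]
words = [w for _, w in parse_word_list(sample)]
submission = format_submission(words, toy_segment)
```

The check inside `format_submission` reflects the requirement that the returned list contain exactly the same words, with spaces as the only addition.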
The data, i.e., word lists, have been preprocessed slightly. Capitalization has been removed. Words have been split at the hyphens (-), which are obvious morpheme boundaries. Thus, words such as "hand-made" are not contained in the data, but rather occur as two separate words: "hand" and "made".
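As a rough illustration of the two preprocessing steps just described (lowercasing and splitting at hyphens), consider the sketch below. The function name `preprocess` is hypothetical, and the organizers' actual preprocessing scripts are not shown here, so details such as language-specific case rules may differ.

```python
def preprocess(tokens):
    """Lowercase each token and split it at hyphens, dropping empty parts."""
    out = []
    for token in tokens:
        for part in token.lower().split("-"):
            if part:
                out.append(part)
    return out


parts = preprocess(["Hand-made", "Sea"])
```

Here `parts` contains "hand" and "made" as two separate words, followed by "sea".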
The data are encoded in Unicode (UTF-8); see the special note on Turkish below. You must use this encoding in your submission! Note that in UTF-8 a character may be encoded using two or more bytes instead of one. If you prefer to use another encoding for your internal purposes, you can convert the data, e.g., with the Unix command iconv. For instance, to convert the data from UTF-8 to ISO Latin 1 (a frequently used one-byte encoding), type:
iconv -f UTF-8 -t ISO8859-1 inputfile > outputfile
Remember to perform the reverse conversion before you submit your results!
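If you work in Python rather than on the command line, the same round trip can be sketched with the standard codecs. The helper names `utf8_to_latin1` and `latin1_to_utf8` are illustrative only. Note that encoding to ISO Latin 1 raises an error for any character that has no Latin 1 equivalent.

```python
def utf8_to_latin1(data: bytes) -> bytes:
    """Re-encode UTF-8 bytes as ISO Latin 1 (fails on characters outside Latin 1)."""
    return data.decode("utf-8").encode("iso-8859-1")


def latin1_to_utf8(data: bytes) -> bytes:
    """Re-encode ISO Latin 1 bytes as UTF-8, e.g., before submitting results."""
    return data.decode("iso-8859-1").encode("utf-8")


original = "sää".encode("utf-8")       # each 'ä' takes two bytes in UTF-8
internal = utf8_to_latin1(original)    # one byte per character in Latin 1
restored = latin1_to_utf8(internal)    # identical to the original bytes
```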
Language | Text file | Gzipped text file | # word types | # word tokens |
---|---|---|---|---|
Finnish | Text | Text gzipped | 1,636,336 | 32,017,012 |
English | Text | Text gzipped | 167,377 | 24,447,034 |
Turkish | Text | Text gzipped | 582,923 | 16,619,455 |
The Finnish word list has been extracted from newspaper text and books stored at the Language Bank of CSC. Additionally, newswires from the Finnish National News Agency have been used.
The English word list is based on publications and novels from the Gutenberg project, a sample of the English Gigaword corpus, as well as the entire Brown corpus.
The Turkish word list is based on prose and publications collected from the web, newspaper text, and sports news. (The organizers are grateful to Ebru Arisoy for providing the Turkish data.)
Note that the special characters of the Turkish alphabet have been rendered as capital letters of the standard Latin alphabet, e.g., "açıkgörüşlülüğünü" is spelled "aCIkgOrUSlUlUGUnU". Use the same "capitalized" format in your submissions! (This encoding of Turkish text produces a character set that is exactly the same in UTF-8 and ISO Latin 1.)
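The letter mapping below is inferred solely from the example word above, which happens to contain six special letters (ç, ı, ö, ü, ş, ğ); it is an assumption that no other characters are affected. Under that assumption, a minimal Python sketch for converting between ordinary Turkish spelling and the "capitalized" format could look like this (the function names are hypothetical):

```python
# Mapping inferred from the single example in the text (an assumption).
TO_CAPITALIZED = str.maketrans("çıöüşğ", "CIOUSG")
FROM_CAPITALIZED = str.maketrans("CIOUSG", "çıöüşğ")


def encode_turkish(word: str) -> str:
    """Render the special Turkish letters as capital Latin letters."""
    return word.translate(TO_CAPITALIZED)


def decode_turkish(word: str) -> str:
    """Invert the mapping; this assumes the input was lowercased beforehand."""
    return word.translate(FROM_CAPITALIZED)


encoded = encode_turkish("açıkgörüşlülüğünü")
decoded = decode_turkish(encoded)
```

Decoding is unambiguous only because capitalization has been removed from the data, so every capital letter in a word stands for a special character.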
Evaluation of a sample
The desired segmentations ("gold standard") for a small sample of words in each language are provided for download and inspection by the participants (UTF-8 encoding):
Language | File | # word types |
---|---|---|
Finnish | Text | 660 |
English | Text | 532 |
Turkish | Text | 774 |
Each line contains the word and its segmentation. The word is separated from the segmentation by a TAB character. Word segments are separated from each other by a space character. For some words there are multiple correct segmentations. These alternative segmentations are separated by a comma (,). An English example:
pitchers	pitch er s, pitcher s
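A line in this format can be read with a few lines of Python. The sketch below follows the description above (TAB between word and segmentations, comma between alternatives, space between segments); `parse_gold_line` is a hypothetical helper name, not part of any evaluation script supplied by the organizers.

```python
def parse_gold_line(line):
    """Split a gold-standard line into the word and its alternative segmentations."""
    word, _, analyses = line.rstrip("\n").partition("\t")
    # Alternatives are comma-separated; segments within one are space-separated.
    alternatives = [alt.split() for alt in analyses.split(",")]
    return word, alternatives


word, alternatives = parse_gold_line("pitchers\tpitch er s, pitcher s\n")
```

For the English example, `alternatives` holds two segmentations: one with the segments "pitch", "er", "s" and one with "pitcher", "s".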
The Finnish gold standard is based on the two-level morphology analyzer FINTWOL from Lingsoft, Inc. The English gold standard is based on the CELEX English database and A Comprehensive Grammar of the English Language by Quirk et al. (1985). The Turkish linguistic segmentations have been obtained from a morphological parser developed at Boğaziçi University. The Turkish parser is based on Oflazer's finite-state machines, with a number of changes.
For instructions on how to evaluate your segmentation against the gold standard samples, see the Evaluation section.