
Datasets

Instructions

The data sets provided by the organizers consist of word lists. Each word in the list is preceded by its frequency in the corpora used. The participants' task is to return exactly the same list(s) of words, with spaces inserted at the locations of proposed morpheme boundaries. The list(s) returned need not contain the word frequency information.

For instance, a subset of the supplied English word list looks like this:

...
6755 sea
1 seabed
1 seabeds
2 seabird
34 seaboard
1 seaboards
...

A submission for this particular set of words may look like this:

...
sea
sea bed
sea bed s
sea bird
sea board
sea board s
...
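
The input and output formats above can be sketched in Python as follows. This is a minimal illustration, not part of the challenge tooling: the function names, the `segment` callback, and the toy lookup table are assumptions made for the example, and a real system would supply its own segmentation algorithm.

```python
from typing import Callable, Iterable, Iterator, List, Tuple


def read_word_list(lines: Iterable[str]) -> Iterator[Tuple[int, str]]:
    """Yield (frequency, word) pairs from lines such as '6755 sea'."""
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        freq, word = line.split(maxsplit=1)
        yield int(freq), word


def format_submission(
    lines: Iterable[str], segment: Callable[[str], List[str]]
) -> List[str]:
    """Return one output line per input word, morphs separated by spaces.

    `segment` is a hypothetical placeholder for the participant's own
    algorithm; it maps a word to its list of proposed morphs.
    The frequency is dropped, since submissions need not contain it.
    """
    return [" ".join(segment(word)) for _, word in read_word_list(lines)]


# Toy segmenter for illustration only: a hand-made lookup table.
TOY_TABLE = {"seabed": ["sea", "bed"], "seabeds": ["sea", "bed", "s"]}


def toy_segment(word: str) -> List[str]:
    """Return the table entry for `word`, or the unsegmented word."""
    return TOY_TABLE.get(word, [word])
```

Calling `format_submission(["6755 sea", "1 seabed", "1 seabeds"], toy_segment)` yields the lines `sea`, `sea bed`, and `sea bed s`, in the same order as the input.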

The data, i.e., word lists, have been preprocessed slightly. Capitalization has been removed. Words have been split at the hyphens (-), which are obvious morpheme boundaries. Thus, words such as "hand-made" are not contained in the data, but rather occur as two separate words: "hand" and "made".

The data are encoded in Unicode (UTF-8); see the special note on Turkish below. You must use this encoding in your submission! Note that in UTF-8 a character may be encoded using more than one byte. If you prefer to use another encoding for your internal purposes, you can convert the data, for example with the Unix command iconv. For instance, to convert the data from UTF-8 to ISO Latin 1 (a frequently used one-byte encoding), type:

iconv -f UTF-8 -t ISO8859-1 inputfile > outputfile

Remember to perform the reverse conversion before you submit your results!
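
For those who work in Python rather than on the command line, the same round trip might look like the sketch below. The function name `convert_encoding` and the sample word are assumptions chosen for illustration; only the standard `bytes.decode` and `str.encode` methods are used.

```python
def convert_encoding(data: bytes, source: str, target: str) -> bytes:
    """Decode `data` from `source` and re-encode it as `target`.

    Raises UnicodeEncodeError if a character cannot be represented
    in the target encoding.
    """
    return data.decode(source).encode(target)


# Illustrative sample word; each 'ä' occupies two bytes in UTF-8.
utf8_word = "määrä".encode("utf-8")

# UTF-8 -> ISO Latin 1 for internal processing ...
latin1_word = convert_encoding(utf8_word, "utf-8", "iso-8859-1")

# ... and back to UTF-8 before submitting.
restored = convert_encoding(latin1_word, "iso-8859-1", "utf-8")
```

The reverse conversion restores the original bytes exactly, which is the property to check before submitting.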

Download

Language	Files		# word types	# word tokens
Finnish	Text	Text gzipped	1,636,336	32,017,012
English	Text	Text gzipped	167,377	24,447,034
Turkish	Text	Text gzipped	582,923	16,619,455

The Finnish word list has been extracted from newspaper text and books stored at the Language Bank of CSC. Additionally, newswires from the Finnish National News Agency have been used.

The English word list is based on publications and novels from the Gutenberg project, a sample of the English Gigaword corpus, as well as the entire Brown corpus.

The Turkish word list is based on prose and publications collected from the web, newspaper text, and sports news. (The organizers are grateful to Ebru Arisoy for providing the Turkish data.)

Note that the special characters of the Turkish alphabet have been rendered as capital letters of the standard Latin alphabet, e.g., "açıkgörüşlülüğünü" is spelled "aCIkgOrUSlUlUGUnU". Use the same "capitalized" format in your submissions! (This encoding of Turkish text produces a character set that is exactly the same in UTF-8 and ISO Latin 1.)
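
If your own Turkish text is in standard orthography, a conversion to this format could be sketched as below. The mapping is read off the single example above, so treating these six letters as the complete set is an assumption, as are the names `TURKISH_TO_CAPS` and `to_challenge_format`. The input is assumed to be lowercased already, as the supplied data are.

```python
# Mapping inferred from the example "açıkgörüşlülüğünü" -> "aCIkgOrUSlUlUGUnU".
# Assumption: these six letters are the only special characters involved.
TURKISH_TO_CAPS = str.maketrans({
    "ç": "C",
    "ğ": "G",
    "ı": "I",
    "ö": "O",
    "ş": "S",
    "ü": "U",
})


def to_challenge_format(word: str) -> str:
    """Replace special Turkish letters in a lowercase word by capitals."""
    return word.translate(TURKISH_TO_CAPS)


converted = to_challenge_format("açıkgörüşlülüğünü")
```

Because the result contains only ASCII characters, its UTF-8 and ISO Latin 1 byte sequences are identical.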

Evaluation of a sample

The desired segmentations ("gold standard") for a small sample of words in each language are provided for download and inspection by the participants (UTF-8 encoding):

Language	File	# word types
Finnish	Text	660
English	Text	532
Turkish	Text	774

Each line contains the word and its segmentation. The word is separated from the segmentation by a TAB character. Word segments are separated from each other by a space character. For some words there are multiple correct segmentations. These alternative segmentations are separated by a comma (,). An English example:

pitchers        pitch er s, pitcher s
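
A line in this format could be parsed along the following lines. This is a sketch under the format description above (TAB between word and segmentations, commas between alternatives, spaces between segments); the function name `parse_gold_line` is an assumption and is not part of the official evaluation scripts.

```python
from typing import List, Tuple


def parse_gold_line(line: str) -> Tuple[str, List[List[str]]]:
    """Split a gold-standard line into the word and its segmentations.

    Each alternative segmentation is returned as a list of segments.
    """
    word, analyses = line.rstrip("\n").split("\t", 1)
    # Alternatives are comma-separated; segments are space-separated.
    alternatives = [alt.split() for alt in analyses.split(",")]
    return word, alternatives


word, alternatives = parse_gold_line("pitchers\tpitch er s, pitcher s")
```

For the English example, `alternatives` holds two segmentations: `["pitch", "er", "s"]` and `["pitcher", "s"]`.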

The Finnish gold standard is based on the two-level morphology analyzer FINTWOL from Lingsoft, Inc. The English gold standard is based on the CELEX English database and A Comprehensive Grammar of the English Language by Quirk et al. (1985). The Turkish linguistic segmentations have been obtained from a morphological parser developed at Boğaziçi University. The Turkish parser is based on Oflazer's finite-state machines, with a number of changes.

For instructions on how to evaluate your segmentation against the gold standard samples, see the Evaluation section.
