Frequently Asked Questions

Question: You provide data sets for three languages in the form of word lists. Does that imply that my algorithm should learn the morphology only from those word lists? The point is that my algorithm needs contextual information for each word, which it acquires from raw text in an entirely language-independent and unsupervised manner. Thus, in order to participate in your challenge, the algorithm would need access to an appropriate corpus of the corresponding language. I could arrange that separately and let the algorithm learn from corpora that I acquired myself, but would such an entry still be eligible for the challenge?

Answer: You are allowed to use any corpora you like, as long as you keep the task unsupervised (or at most minimally supervised). Remember to report which corpora you have used. Unfortunately, we cannot provide you with the corpora from which our word lists were extracted.

Question: The problem is that I am unlikely to gather a Finnish corpus as large as the one you probably have access to, not to mention its quality. Thus Finnish, the most interesting of the three languages provided, would probably be either left out by me or badly represented.

Answer: Finnish newspaper and book text (which we have used) is available at the Language Bank of CSC. You can apply for a user account there and run your experiments on their computers. You are not, however, allowed to copy their corpora off their server.

Question: The rules page on your website does not define what counts as "unsupervised learning". I suppose this means that the program cannot be explicitly given a training file containing "example answers", nor can example answers be hard-coded into the program. Can you suggest a better definition?

Answer: That sounds like a good minimum requirement. Of course, one sees solutions where people make lots of "hard-coded" assumptions about word structure, e.g., stem-final vowels that can be dropped, so at some point one wonders where to draw the line between entirely unsupervised methods, minimally supervised methods, and so on. Thus, it is important that all such assumptions be explicitly mentioned when results are reported.

