Heiki-Jaan Kaalep
University of Tartu
Estonian is an inflectional language, meaning that a word appears in text in different forms. This poses problems for computational tools that rely on string-based methods to establish word similarity or to count tokens for subsequent statistical analysis or text analytics. Consequently, a program needs to be plugged in that takes the text as input and outputs morphological tags and lemmas, ascribing to each word only those analyses that are suitable in the given context.
Vabamorf is a set of open-source morphological tools for Estonian. It includes a morphological analyzer, a synthesizer, and a disambiguator for singling out the most likely analysis in a given context. The tools were released as open source by Filosoft Ltd. in 2015. They have been the basis of the commercial Estonian speller for 20 years, and have been used for large-scale corpus tagging projects. The authors have made special efforts to make the tools capable of working on real-life texts and of coping with words that are not included in the program’s lexicon.
The paper concentrates on the quality issues of the output, drawing examples from etTenTen, a 270-million-word corpus available for queries at www.keeleveeb.ee.
