Kais Allkivi-Metsoja, Kaisa Norak, Karina Kert, Silvia Maine, Pille Eslon (Tallinn University)

Error Classification and Annotation of Learner Language for Developing Estonian Grammar Correction

Developing Grammatical Error Correction (GEC) systems requires examples of authentic language errors and their corrections. Although state-of-the-art GEC systems are mostly trained on large amounts of synthetic error data, smaller manually error-annotated datasets are essential for evaluating a corrector's performance. Such 'gold standard' corpora are also used to fine-tune correction models and to calculate the error probabilities needed for error generation.
We introduce a dataset of Estonian learner writings (levels A2–C1) that has been error-annotated in the M2 format, indicating the type, scope and correction of each error. We further discuss the error classification used and how it differs from the ERRANT categories developed for English.
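
To illustrate what an M2 annotation encodes, the following minimal Python sketch parses one M2 block and applies its edits. It assumes the common M2 layout (an `S` line with the tokenized sentence, followed by `A` lines of the form `start end|||type|||correction|||REQUIRED|||-NONE-|||annotator`). The English sentence, the ERRANT-style tag `R:VERB:SVA`, and the names `Edit`, `parse_m2_block`, `apply_edits` and `EXAMPLE` are illustrative assumptions and are not taken from the Estonian dataset, whose error tags differ.

```python
from dataclasses import dataclass


@dataclass
class Edit:
    start: int        # index of the first token in the error span
    end: int          # index one past the last token in the error span
    error_type: str   # error category label
    correction: str   # replacement text (empty string means deletion)
    annotator: int    # annotator id


# Hypothetical example block, invented for illustration only.
EXAMPLE = """S This are a sentence .
A 1 2|||R:VERB:SVA|||is|||REQUIRED|||-NONE-|||0
"""


def parse_m2_block(block):
    """Return (tokens, edits) for a single sentence block in M2 format."""
    tokens, edits = [], []
    for line in block.strip().splitlines():
        if line.startswith("S "):
            tokens = line[2:].split()
        elif line.startswith("A "):
            fields = line[2:].split("|||")
            start, end = (int(x) for x in fields[0].split())
            edits.append(Edit(start, end, fields[1], fields[2], int(fields[-1])))
    return tokens, edits


def apply_edits(tokens, edits):
    """Apply edits right to left so earlier token offsets stay valid."""
    corrected = list(tokens)
    for e in sorted(edits, key=lambda e: e.start, reverse=True):
        if e.start < 0:  # "noop" edits use -1 -1 and change nothing
            continue
        corrected[e.start:e.end] = e.correction.split()
    return corrected


tokens, edits = parse_m2_block(EXAMPLE)
print(" ".join(apply_edits(tokens, edits)))
```

The token span gives the scope of the error, the second field its type, and the third its correction. The sketch assumes a single annotator per sentence; blocks with several annotators would need the edits grouped by annotator id before applying them.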