Error Classification and Annotation of Learner Language for Developing Estonian Grammar Correction
Developing Grammatical Error Correction (GEC) systems requires examples of authentic language errors and their corrections. Although state-of-the-art GEC systems are mostly trained on large amounts of synthetic error data, smaller manually error-annotated datasets are essential for testing the corrector’s performance. Such ‘gold standard’ corpora are also used to fine-tune correction models and calculate error probabilities for error generation.
We introduce a dataset of Estonian learner writings (levels A2–C1) that has been error-annotated in the M2 format, indicating the type, scope and correction of each error. We further discuss the error classification used and how it differs from the ERRANT categories developed for English.
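To make the annotation format concrete, the following is a minimal sketch of how an M2 entry can be read and applied. It assumes the standard M2 layout: an `S` line holding the tokenized sentence, followed by `A` lines of the form `start end|||type|||correction|||REQUIRED|||-NONE-|||annotator`. The example sentence is invented for illustration and uses an English ERRANT-style tag, not a label from the Estonian classification; the names `Edit`, `parse_m2_block` and `apply_edits` are hypothetical helpers, not part of any released tool.

```python
from dataclasses import dataclass


@dataclass
class Edit:
    start: int        # index of the first token in the error span
    end: int          # index one past the last token in the error span
    error_type: str   # error category label
    correction: str   # replacement text (empty for a deletion)
    annotator: int    # annotator id


def parse_m2_block(block):
    """Parse one M2 entry (an S line plus its A lines) into tokens and edits."""
    tokens, edits = [], []
    for line in block.strip().splitlines():
        if line.startswith("S "):
            tokens = line[2:].split()
        elif line.startswith("A "):
            fields = line[2:].split("|||")
            start, end = (int(x) for x in fields[0].split())
            edits.append(Edit(start, end, fields[1], fields[2], int(fields[-1])))
    return tokens, edits


def apply_edits(tokens, edits):
    """Apply the edits to the source tokens and return the corrected tokens."""
    out = list(tokens)
    # Work from right to left so earlier token offsets stay valid.
    for e in sorted(edits, key=lambda e: e.start, reverse=True):
        if e.error_type == "noop":
            continue  # "noop" marks a sentence the annotator left unchanged
        replacement = [] if e.correction in ("", "-NONE-") else e.correction.split()
        out[e.start:e.end] = replacement
    return out


# Invented example entry (assumption: English, ERRANT-style tag).
EXAMPLE = """
S She go to school yesterday .
A 1 2|||R:VERB:TENSE|||went|||REQUIRED|||-NONE-|||0
"""

tokens, edits = parse_m2_block(EXAMPLE)
corrected = " ".join(apply_edits(tokens, edits))
print(corrected)  # She went to school yesterday .
```

The token span `1 2` is the scope of the error, `R:VERB:TENSE` is its type, and `went` is the correction. The sketch handles a single annotator only; entries with several annotators would need the edits grouped by annotator id before being applied.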