Tuuli Tuisk (University of Tartu)

Preserving Linguistic Heritage: Materials of Estonian Dialects and Kindred Languages at the University of Tartu

This paper introduces the Archives of Estonian Dialects and Kindred Languages (AEDKL) and the Corpus of Estonian Dialects (CED), two collections of materials in Estonian and related languages that are freely available online. The AEDKL consists of fieldwork recordings, written materials, photos and videos of Estonian dialects and other Finno-Ugric and Uralic languages. The CED is a collection of electronic data containing authentic dialect texts from all Estonian dialects.

The AEDKL contains four types of materials: 1) sound recordings (Finnic: Estonian, Livonian, Votic, Ingrian, Veps, Karelian, Finnish, Ingrian Finnish; Finno-Ugric: Inari Saami, Erzya, Moksha, Komi, Udmurt, Hungarian, Khanty; Samoyedic: Nenets, Kamas); 2) unpublished manuscripts, including student papers and theses defended at the Institute of Estonian and General Linguistics, fieldwork diaries, transcriptions and written notes on the Uralic languages; 3) photos from fieldwork expeditions and various linguistic events; 4) video recordings.

The archives hold about 2800 hours of fieldwork recordings of the Uralic languages. Most of the sound recordings are of Estonian dialects; the remainder are recordings of other Finno-Ugric languages. The earliest sound recordings date back to the 1950s and the earliest written materials to the 1920s. The archives contain a total of 393 000 pages of manuscripts, of which about 268 000 pages are available in digital form. The collection holds about 2900 photos from fieldwork expeditions and linguistic events. Photos are divided into two series based on media type: paper and digital. Around 1300 paper photos have been digitized, and digitization is still in progress. Around 1600 digital photos come from recent fieldwork and various linguistic events. The video recordings come from fieldwork conducted in recent years; old film rolls from the 1970s and 1980s have also been digitized. There are 51 hours of video recordings, but in the current version of the database video viewing is not integrated into the online archive system.

The CED offers a data set for Estonian dialectology. It consists of sound recordings, texts transcribed in Finno-Ugric phonetic transcription, dialect texts in simplified transcription, morphologically annotated texts, syntactically parsed texts, and a database containing information about consultants and recordings. The corpus makes it possible to apply the methods of corpus linguistics and corpus-based dialectology, thereby opening up new horizons in the study of certain aspects of Estonian dialects, such as dialect syntax.

The online databases of the AEDKL and the CED are freely accessible and open to all researchers at www.murre.ut.ee/arhiiv/ and http://www.murre.ut.ee/murdekorpus, respectively.