Peeter Tinits (Tallinn University / University of Tartu)

Estonian Language Community ca. 1900: Learning from Linked Metadata

The expansion of digital resources has provided new avenues for historical research in a number of ways. Based on digital text collections, corpus linguistics has become one of the core disciplines for language researchers over the last few decades. Enriched texts make it possible to extract facts, people, or geographic locations, and so to better understand what people were writing about.

An intriguing option that has come to be explored more recently is the study of collection metadata themselves (e.g. see Lahti et al. 2019). That is, by studying collections and registers themselves, it may be possible to say something substantial about historical events too. This naturally entails a more careful consideration of how data points end up in the archives and whether they can be considered representative of an era or may be biased in some way (e.g. by focussing on authors who were later canonized by critics, cf. Algee-Hewitt et al. 2016).

In this presentation, I explore historical bibliographic data, namely the Estonian National Bibliography, which aims to give a complete and comprehensive overview of publications in Estonian or related to Estonia. It is an aggregate of various bibliographies collected by book scholars over generations and has now been made available in a digital structured format. I explore it from two angles: 1) printed books in the context of community demographics; 2) the individuals involved in writing and publishing books, and their backgrounds.

Data and methods

The study relies on the Estonian National Bibliography (ENB), which is publicly available. The data were harmonized with some heuristics and custom dictionaries.
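As a rough illustration of what harmonization with heuristics and a custom dictionary can look like, the sketch below normalizes a place-of-publication string. The function name, the cleanup rules, and the dictionary entries are assumptions made for this example; they are not the rules or dictionaries used in the study.

```python
import re

# Hypothetical custom dictionary: lower-cased variant -> harmonized form.
# The entries are illustrative and not taken from the study.
PLACE_VARIANTS = {
    "dorpat": "Tartu",
    "jurjev": "Tartu",
    "reval": "Tallinn",
    "tallinnas": "Tallinn",
}

def harmonize_place(raw):
    """Return a harmonized place name, or the cleaned input if it is unknown."""
    # Heuristic cleanup: drop cataloguing brackets and question marks,
    # collapse whitespace, and strip surrounding punctuation.
    cleaned = re.sub(r"[\[\]?]", "", raw)
    cleaned = re.sub(r"\s+", " ", cleaned).strip(" .,:;")
    # Dictionary lookup is case-insensitive; unknown values pass through.
    return PLACE_VARIANTS.get(cleaned.lower(), cleaned)
```

Values that are not in the dictionary are returned in their cleaned form, so they remain visible for manual review instead of being silently dropped.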

Demographic data for the period were aggregated from various published primary and secondary sources. The individuals involved in the language community were retrieved from the publication data in ENB by taking all names associated with the publications (excluding the original authors of translated works). Finally, for enrichment, the links from ENB to VIAF were used to add biographic information on the authors from Wikidata and DNB collections, supplemented with a few more sources (ISIK, VEPER). The result is a bibliographic dataset combined with demographic data, and an enriched dataset of individuals actively involved with print publications.
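The selection of individuals described above can be sketched as follows. This is a minimal sketch under stated assumptions: the record layout, the role labels, and the placeholder names are hypothetical and do not reflect the actual ENB schema.

```python
from collections import defaultdict

# Hypothetical records: each publication carries (name, role) pairs and a
# flag marking translated works. Names and ids are placeholders.
records = [
    {"id": "b1", "translated": False,
     "contributors": [("Author A", "author")]},
    {"id": "b2", "translated": True,
     "contributors": [("Foreign Author", "author"),
                      ("Translator B", "translator")]},
    {"id": "b3", "translated": False,
     "contributors": [("Author A", "author"), ("Editor C", "editor")]},
]

def active_individuals(records):
    """Map each retained name to the set of publication ids it appears on."""
    people = defaultdict(set)
    for rec in records:
        for name, role in rec["contributors"]:
            # Skip the original author of a translated work: that person
            # did not take part in the local language community.
            if rec["translated"] and role == "author":
                continue
            people[name].add(rec["id"])
    return dict(people)

result = active_individuals(records)
```

Translators and editors of translated works are kept, since they were active in producing the local-language publication.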

Case studies

The demographic data show that, due to internal migration within the Estonian population, around 50% of the inhabitants of most cities were immigrants around the turn of the century, a situation that has been described as heavy dialect contact. However, only a minority of them were born in a different dialect area, so the practical influence of dialect contact on the spoken language can be expected to be marginal.

The publication record shows exponential growth in the number of printed works, in the number of printed works per capita, and in the number of authors per capita. This provides a foundation for a steady rise of the written language community, although it is affected to some extent by political events. The publication record also shows rather abrupt changes in the relative roles of the competing Estonian, German, and Russian languages as a result of administrative policies.
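The per-capita measures mentioned above can be computed along the following lines. The figures in the sketch are synthetic placeholders chosen for easy arithmetic, not values from ENB or from the demographic sources, and the function names are assumptions made for the example.

```python
def per_capita(count, population, per=1000):
    """Number of items per `per` inhabitants."""
    return count / population * per

def annual_growth_rate(first, last, years):
    """Constant yearly rate r such that first * (1 + r) ** years == last."""
    return (last / first) ** (1 / years) - 1

# Synthetic placeholder values, NOT historical data.
titles = {1880: 100, 1900: 400}
population = {1880: 800_000, 1900: 1_000_000}

# Titles per 1000 inhabitants in each year.
rates = {year: per_capita(titles[year], population[year]) for year in titles}

# Average yearly growth of the raw title count over the 20-year span.
growth = annual_growth_rate(titles[1880], titles[1900], 1900 - 1880)
```

Under exponential growth the yearly rate is roughly constant, so comparing this rate across sub-periods is one simple way to see where political events interrupt the trend.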

The birthplaces of the associated individuals show a dominance of Livland in the late 19th century, to the extent that, cumulatively, Livland comes to dominate the intellectual population over Estland. This trend can be understood in terms of administrative policies that resulted in Livonian communities gaining affluence, as well as good coverage of public schools, a decade or two earlier. However, given the north-south split of Estonian dialects, northern dialects are also dominant in large parts of Livland, so native speakers of northern dialects still dominate the written language community.


The study of collection metadata provides an intriguing avenue for humanities research. It also opens up new discussions, particularly on the potential representativeness of the collections and on the different ways in which data points could be harmonized or used as a basis for generalization. While these discussions may take a while to unfold, opening up collections as datasets, and making them structured and machine-readable, is a clear step towards exploring these possibilities. In the case studies presented here, the encyclopedic metadata was used to study the shape of an emerging language community more than 100 years ago. The same datasets, and others like them, could be used to study many different questions relevant to humanities scholars in different fields. The more datasets become interlinked with one another, the more their investigative as well as critical potential can be released.


Algee-Hewitt, M., Allison, S., Gemma, M., Heuser, R., Moretti, F. & Walser, H. (2016). Canon/archive: large-scale dynamics in the literary field. Literary Lab Pamphlet 11.

Lahti, L., Marjanen, J., Roivainen, H. & Tolonen, M. (2019). Bibliographic Data Science and the History of the Book (c. 1500–1800). Cataloging & Classification Quarterly, 57(1), pp. 5–23.