Maciej Janicki (University of Helsinki), Mari Sarv (Estonian Literary Museum)

Text Similarity in Oral Runosong Tradition: Towards a Large-Scale Quantitative Analysis

Oral poetry in general is characterized by a method of composition in which traditional elements and structures (formulae) are combined into coherent wholes during performance, according to the circumstances (see for example Foley 1995, Honko 2000). On the one hand, this leads to the recurrent use of word pairs, lines, motifs and plots in circulation. On the other hand, flexible variation enables the tradition to adapt gradually to changes in language, environment and mindset (Sarv 2008). Texts collected from oral tradition are typically not independent, individual creations. Instead, they string together motifs and formulae that circulate in the tradition, and thus individual collected texts bear many partial similarities to each other, forming a dense network of intertextual relationships.

In the project “Formulaic intertextuality, thematic networks and poetic variation across regional cultures of Finnic oral poetry”, we apply computational methods to study intertextuality and text similarity within Finnic oral folk poetry collections. In general, detecting similar content units in this material is complicated by the dialectal variation of the poetic idiom. In this presentation, we take a look at the Estonian Runosongs Database, which currently contains around 100,000 texts. Among those, automatic text similarity computation, based on a combination of verse similarity (Janicki et al. 2022) and alignment similarity calculations, found around 7.2 million pairs of texts that overlap in some parts.
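
To make the two levels of comparison concrete, the following is a minimal sketch and not the project's actual pipeline. It assumes that verse similarity is measured as the cosine similarity of character bigram counts, a common way of tolerating dialectal spelling variation, and it replaces the alignment step with a much simpler stand-in: the share of verses in one text that have a close counterpart in the other. The function names and the threshold of 0.75 are hypothetical.

```python
from collections import Counter
from math import sqrt


def bigrams(verse):
    """Count the character bigrams of a lower-cased verse."""
    s = verse.lower()
    return Counter(s[i:i + 2] for i in range(len(s) - 1))


def verse_similarity(a, b):
    """Cosine similarity of two verses' character bigram counts (assumed measure)."""
    va, vb = bigrams(a), bigrams(b)
    dot = sum(va[g] * vb[g] for g in va)
    norm = sqrt(sum(c * c for c in va.values())) * sqrt(sum(c * c for c in vb.values()))
    return dot / norm if norm else 0.0


def text_overlap(text_a, text_b, threshold=0.75):
    """Share of verses in text_a with a similar verse in text_b.

    A simplified stand-in for alignment similarity: it ignores verse order,
    which a real alignment would take into account.
    """
    if not text_a:
        return 0.0
    matched = sum(
        1 for va in text_a
        if any(verse_similarity(va, vb) >= threshold for vb in text_b)
    )
    return matched / len(text_a)


# Illustrative spellings only, not quoted from the database.
print(verse_similarity("kui mina hakkan laulemaie", "kui ma hakkan laulemaje"))
```

Because the overlap score is computed from the point of view of one text, it is asymmetric: a short text fully contained in a long one scores high in one direction and low in the other.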

The distribution of computed similarity scores reveals a continuum, with a fluid transition from exact duplicates, through “almost same texts” and texts overlapping in a major part, all the way to texts related only via a few shared lines. While a classification of the different degrees of similarity would be desirable, initial exploration of examples shows how difficult it is to define boundaries between the different stages.

Further, we study the distribution of the number of neighbors (similar poems) per poem. The analysis reveals that the distribution is highly skewed: a few poems have thousands of neighbors, while the median is 31. We also discovered a partial trend: poems that have a highly similar neighbor tend to have more neighbors overall. In other words, high similarities typically do not occur in isolation, but are accompanied by lower similarities.
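
The neighbor statistics described above can be sketched as follows, under stated assumptions: the similar pairs are available as hypothetical (poem, poem, score) triples, only poems with at least one neighbor are counted, and a score of 0.9 or more is arbitrarily taken to mean "highly similar". None of these choices come from the study itself.

```python
from collections import Counter
from statistics import median


def neighbor_stats(pairs):
    """Return each poem's neighbor count and its highest similarity score."""
    counts = Counter()
    best = {}
    for a, b, score in pairs:
        for poem in (a, b):
            counts[poem] += 1
            best[poem] = max(best.get(poem, 0.0), score)
    return counts, best


def median_neighbors_by_best(counts, best, cutoff=0.9):
    """Median neighbor count for poems with and without a highly similar neighbor."""
    high = [counts[p] for p in counts if best[p] >= cutoff]
    low = [counts[p] for p in counts if best[p] < cutoff]
    return (median(high) if high else None, median(low) if low else None)


# Toy input with invented identifiers and scores.
pairs = [("a", "b", 0.95), ("a", "c", 0.3), ("a", "d", 0.2),
         ("c", "d", 0.4), ("a", "e", 0.1)]
counts, best = neighbor_stats(pairs)
print(median(counts.values()), median_neighbors_by_best(counts, best))
```

Comparing the two medians is only one crude way of looking at the trend; plotting neighbor count against the highest similarity score would show how partial it is.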

Finally, we study the relationship between the automatically computed similarities and the manually constructed typology, in which the content of every text is classified as belonging to one or more types. It turns out that studying text similarity is largely complementary to the typology: textual similarity does not guarantee that the poems were indexed in the same way, and shared indexing is an even weaker indicator of computed textual similarity.
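
One way to quantify this two-way comparison is sketched below. It assumes a hypothetical mapping from poem identifiers to sets of type labels and a list of similar pairs, and reports two shares: how many similar pairs share a type, and how many same-type pairs are similar. This is an illustration of the idea, not the procedure used in the study.

```python
from itertools import combinations


def typology_agreement(similar_pairs, types):
    """Compare computed similarity with a manual typology.

    similar_pairs: iterable of (poem, poem) tuples found to be similar.
    types: dict mapping each poem to the set of types it is indexed under.
    Returns (share of similar pairs sharing a type,
             share of same-type pairs that are similar).
    """
    similar = {frozenset(p) for p in similar_pairs}
    same_type = {
        frozenset((a, b))
        for a, b in combinations(sorted(types), 2)
        if types[a] & types[b]
    }
    both = similar & same_type
    similar_share = len(both) / len(similar) if similar else 0.0
    same_type_share = len(both) / len(same_type) if same_type else 0.0
    return similar_share, same_type_share


# Toy input with invented poem and type identifiers.
types = {"p1": {"T1"}, "p2": {"T1"}, "p3": {"T2"}, "p4": {"T1", "T2"}}
print(typology_agreement([("p1", "p2"), ("p1", "p3")], types))
```

Enumerating all poem pairs is quadratic in the number of poems, so for a collection of this size one would more likely group poems by type first and only enumerate pairs within each group.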

In summary, computational similarity detection opens up possibilities for a new methodology of studying intertextuality in the oral tradition. It allows us to verify general claims against vast amounts of material, as well as to quickly find examples of the more specific relationships that one wants to study.


Foley, John Miles (1995). The Singer of Tales in Performance. Bloomington: Indiana University Press.

Honko, Lauri (2000). Text and Process as Practice: The Textualization of Oral Epics. In: Lauri Honko (ed.), Textualization of Oral Epics. Berlin; New York: De Gruyter Mouton, pp. 3–54.

Janicki, Maciej, Kati Kallio and Mari Sarv (2022). Exploring Finnic Written Oral Folk Poetry Through String Similarity. Digital Scholarship in the Humanities.

Sarv, Mari (2008). Loomiseks loodud: regivärsimõõt traditsiooniprotsessis. PhD thesis, Tartu: Tartu Ülikooli Kirjastus.