Versification and Authorship Recognition
Contemporary stylometry has developed extremely accurate and sophisticated methods of authorship recognition. The logic behind them is to identify the author by measuring the degree of stylistic similarity between the text in question and particular texts written by candidate authors. Various style markers are taken into account for this purpose: frequencies of words, parts of speech, character n-grams, collocations, and so on. Yet one important aspect of style (at least of one important form of literature) seems to be almost completely disregarded: versification.
The talk will present an ongoing project investigating whether characteristics such as frequencies of stress patterns, frequencies of rhyme types, etc. may be useful in the process of authorship recognition. Pilot experiments comparing various classification methods (the Delta family, SVM, Random Forest), evaluated on Czech, German, Spanish, and English poetry, will be presented.
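To illustrate the general logic of Delta-family attribution mentioned above, here is a minimal sketch of Burrows' Delta applied to versification-style features. All feature values and author names are invented for illustration; in the project itself the features would be frequencies of stress patterns, rhyme types, etc. extracted from real poems.

```python
from statistics import mean, stdev

# Hypothetical relative frequencies of three stress-pattern features
# in reference texts by two candidate authors (invented numbers).
profiles = {
    "author_A": [0.42, 0.31, 0.27],
    "author_B": [0.55, 0.20, 0.25],
}

# Feature vector of a disputed poem (also invented).
disputed = [0.44, 0.29, 0.27]

def delta_attribute(profiles, disputed):
    """Burrows' Delta: mean absolute difference of z-scored features.

    The candidate with the lowest Delta is the most stylistically
    similar to the disputed text.
    """
    n_features = len(disputed)
    scores = {}
    for author, vec in profiles.items():
        diffs = []
        for i in range(n_features):
            column = [p[i] for p in profiles.values()]
            mu, sigma = mean(column), stdev(column)
            z_author = (vec[i] - mu) / sigma
            z_disputed = (disputed[i] - mu) / sigma
            diffs.append(abs(z_author - z_disputed))
        scores[author] = mean(diffs)
    return min(scores, key=scores.get)

print(delta_attribute(profiles, disputed))  # -> author_A
```

With these toy numbers the disputed poem is attributed to author_A, whose stress-pattern profile it resembles most closely; SVM or Random Forest classifiers would consume the same feature vectors but learn a decision boundary instead of computing a distance.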
Petr Plecháč specializes in quantitative and corpus verse studies. He has participated in building the Corpus of Czech Verse (http://versologie.cz/v2/web_content/corpus.php?lang=en) and in the project POSTDATA maintained by the Laboratorio de Innovación en Humanidades Digitales, UNED Madrid (http://postdata.linhd.es), and currently leads a project on using versification characteristics for authorship attribution (http://versologie.cz/v2/web_content/projects.php?lang=en).
Detecting language change for the digital humanities: challenges and opportunities
For the last decade, automatic detection of word sense change has primarily focused on detecting the main changes in the meaning of a word. Most current methods rely on new, powerful embedding technologies, but do not differentiate between the different senses of a word, a distinction that is needed in many digital humanities applications. Ignoring individual senses radically reduces the complexity of the task, but often fails to answer questions such as: what changed, how, and when did the change occur?
In this talk, I will present methods for automatically detecting sense change from large amounts of diachronic data. I will focus on a study of a historical Swedish newspaper corpus, the Kubhist dataset of digitized Swedish newspapers from 1749–1925. I will present our work on detecting and correcting OCR errors, normalizing spelling variation, and creating representations for individual words using a popular neural embedding method, Word2Vec.
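The basic idea behind embedding-based change detection can be sketched as follows. The tiny three-dimensional vectors below stand in for Word2Vec embeddings trained separately on two time slices of a corpus such as Kubhist; they are invented for illustration, and in practice the vectors would have hundreds of dimensions and the two embedding spaces would first need to be aligned (e.g. with orthogonal Procrustes) before being compared.

```python
import math

# Invented toy embeddings for two time slices; "awful" is given a
# deliberately rotated vector to mimic semantic change.
slice_1800s = {"awful": [0.9, 0.1, 0.2], "house": [0.2, 0.8, 0.1]}
slice_1900s = {"awful": [0.1, 0.2, 0.9], "house": [0.25, 0.75, 0.1]}

def cosine(u, v):
    """Cosine similarity between two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def changed_words(old, new, threshold=0.5):
    """Flag words whose cross-slice cosine similarity drops below threshold."""
    return [w for w in old if w in new and cosine(old[w], new[w]) < threshold]

print(changed_words(slice_1800s, slice_1900s))  # -> ['awful']
```

A single similarity score per word is exactly the coarse-grained signal the talk problematizes: it says that something changed, but not which sense emerged, retreated, or shifted.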
Neural word embedding methods are the state of the art in sense change detection and many other areas of study, but they have mainly been studied on English corpora, where datasets are sufficiently large. I will discuss the limitations of such methods in this particular context: fairly small datasets with a high error rate, as is common for historical material in most languages. In addition, I will discuss the particularities of text mining methods for the digital humanities and what is needed to bridge the gap between computer science and the digital humanities.
Automatic analysis of large literary corpora
Automatic analysis of large literary corpora gave rise to a new direction of research within the digital humanities, for which Franco Moretti coined the term 'distant reading'. This lecture will sketch the use of an additional, potentially fruitful signal for the analysis of literary works: readers' behaviour in its various forms, including book sales figures, online reviews and ratings, and e-book reading logs.
Joris van Eijnatten – Topic to be announced
Andra Siibak – Topic to be announced