Mining Dialects From Spoken Language and Social Media
Topic models are most often used to find recurring topics in a collection of documents. For such text mining purposes, the language of the documents is standardized and lemmatized. If the documents are instead left unprocessed, the same method can be used to find structural differences, including dialectal ones. In this study, we test the method on both spoken dialect corpora and large collections of social media data to explore the dialectal differences they contain. We examine dialect corpora of Finnish, Norwegian, and Swiss German, as well as South Slavic tweets and German Jodel data.
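The idea of running a topic model on unprocessed text can be illustrated with a minimal sketch. It assumes latent Dirichlet allocation fitted by collapsed Gibbs sampling as the topic model; the abstract does not say which model or implementation the study uses. The names `tokenize_raw` and `fit_lda`, the hyperparameters, and the toy documents are hypothetical. The tokens `a1`/`a2`, `b1`/`b2`, and `c1`/`c2` are artificial placeholders standing in for dialectal variants of the same word, not real data.

```python
import random
from collections import Counter


def tokenize_raw(text):
    # Lowercase and split on whitespace only: no lemmatization or spelling
    # normalization, so variant forms remain distinct tokens.
    return text.lower().split()


def fit_lda(docs, n_topics=2, n_iter=200, alpha=0.1, beta=0.01, seed=0):
    """Fit LDA by collapsed Gibbs sampling (illustrative, not optimized).

    docs is a list of token lists. Returns the per-document topic
    proportions and the per-topic word counts.
    """
    rng = random.Random(seed)
    n_vocab = len({w for doc in docs for w in doc})
    doc_topic = [[0] * n_topics for _ in docs]         # counts per (doc, topic)
    topic_word = [Counter() for _ in range(n_topics)]  # counts per (topic, word)
    topic_total = [0] * n_topics                       # tokens per topic
    assignments = []

    # Assign a random initial topic to every token.
    for d, doc in enumerate(docs):
        z_doc = []
        for w in doc:
            z = rng.randrange(n_topics)
            z_doc.append(z)
            doc_topic[d][z] += 1
            topic_word[z][w] += 1
            topic_total[z] += 1
        assignments.append(z_doc)

    for _ in range(n_iter):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                # Remove the token's current assignment from the counts.
                z = assignments[d][i]
                doc_topic[d][z] -= 1
                topic_word[z][w] -= 1
                topic_total[z] -= 1
                # Resample its topic from the conditional distribution.
                weights = [
                    (doc_topic[d][k] + alpha)
                    * (topic_word[k][w] + beta)
                    / (topic_total[k] + beta * n_vocab)
                    for k in range(n_topics)
                ]
                z = rng.choices(range(n_topics), weights=weights)[0]
                assignments[d][i] = z
                doc_topic[d][z] += 1
                topic_word[z][w] += 1
                topic_total[z] += 1

    # Smoothed topic proportions for each document.
    theta = []
    for d, doc in enumerate(docs):
        denom = len(doc) + alpha * n_topics
        theta.append([(doc_topic[d][k] + alpha) / denom for k in range(n_topics)])
    return theta, topic_word


# Artificial toy corpus: "house", "car", "boat" are shared content words;
# the numbered tokens are placeholder variants used by two invented varieties.
docs_raw = [
    "a1 b1 house c1 a1 car",
    "b1 c1 boat a1 b1 house",
    "a2 b2 house c2 a2 car",
    "b2 c2 boat a2 b2 house",
]
docs = [tokenize_raw(text) for text in docs_raw]
theta, topic_word = fit_lda(docs)

for text, row in zip(docs_raw, theta):
    print([round(p, 2) for p in row], text)
```

Because the variant tokens are never merged into a common lemma, the model can group documents by which variants they use rather than by subject matter. Whether it does so on a given corpus depends on the data and the hyperparameters, so the printed proportions should be read as an illustration only.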