Aleksandra Miletić and Yves Scherrer (University of Helsinki)

Occitan in Wikipedia Discussions: Initial Findings

Occitan is a regional language spoken in southern France and in parts of Italy and Spain. Like many such languages, it has only recently started to enter the digital era. Basic digital tools and resources (text databases, electronic dictionaries, text-to-speech tools) have been created, and an Occitan Wikipedia is also being developed.

We present OcWikiDisc, a 500,000-word corpus extracted from Occitan Wikipedia’s discussion pages. It contains direct user-to-user interactions on various topics. We analyze the Occitan dialects and spelling norms found in a sample of the corpus, as a first attempt to model how Occitan is used in this medium.