Sree Ganesh Thottempudi (University of Köln), Ajinkya Prabhune, Rohan Raj, Kashish, Sarthak Manas Tripathy, Komal Kakade, Sudipt Panda (SRH-Heidelberg)

Optical Character Recognition with Neural Networks

Sanskrit texts preserve a very ancient body of knowledge spanning medicine, mathematics, Hindu mythology, and Indian civilization. Most of the manuscripts that carry this knowledge are written on palm leaves. It is therefore critical to make these manuscripts easy to access; Digital Humanities offers a way to share this knowledge with the world and to facilitate further research on this ancient literature. With this motivation, our first step is to create an OCR system for the Sanskrit language. In this paper, we propose a neural-network-based Optical Character Recognition (OCR) system that accurately digitizes ancient Sanskrit manuscripts (Devanagari script) that are not necessarily in good condition. We use an image segmentation algorithm that computes pixel intensities to identify letters in the image. The OCR treats typical compound characters (half-letter combinations) as separate classes in order to improve segmentation accuracy. For our ground truth, we use the text of the Ramayana written by Valmiki.
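The intensity-based segmentation step can be illustrated with a minimal sketch. This is not the system's actual algorithm, which the abstract does not specify; it assumes a simple vertical projection profile over a grayscale image given as rows of 0-255 values. The function names `column_ink_profile` and `segment_columns`, the `threshold` of 128, and the `min_ink` parameter are all hypothetical choices made for illustration.

```python
from typing import List, Tuple

def column_ink_profile(image: List[List[int]], threshold: int = 128) -> List[int]:
    """Count dark ("ink") pixels in each column of a grayscale image.

    `image` is a list of rows; each pixel is 0 (black) to 255 (white).
    The threshold is an assumed value, not one taken from the paper.
    """
    if not image:
        return []
    width = len(image[0])
    return [sum(1 for row in image if row[x] < threshold) for x in range(width)]

def segment_columns(profile: List[int], min_ink: int = 1) -> List[Tuple[int, int]]:
    """Return (start, end) column spans, end exclusive, that contain ink."""
    spans = []
    start = None
    for x, ink in enumerate(profile):
        if ink >= min_ink and start is None:
            start = x                      # a run of inked columns begins
        elif ink < min_ink and start is not None:
            spans.append((start, x))       # the run ends at a blank column
            start = None
    if start is not None:                  # run reaches the right edge
        spans.append((start, len(profile)))
    return spans

# Toy 3x7 image with two separate dark shapes (W = white, B = black).
W, B = 255, 0
page = [
    [W, B, B, W, W, B, W],
    [W, B, W, W, W, B, W],
    [W, B, B, W, W, B, W],
]
profile = column_ink_profile(page)
print(profile)                   # [0, 3, 2, 0, 0, 3, 0]
print(segment_columns(profile))  # [(1, 3), (5, 6)]
```

On real Devanagari text, the headline stroke that joins the letters of a word would presumably have to be located and removed before a column profile like this could separate individual characters; that step is omitted here.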