Large vocabulary recognition for online Turkish handwriting with sublexical units

We present a system for large vocabulary recognition of online Turkish handwriting using hidden Markov models. While the recognizer follows a traditional approach, we identify and address the main problems specific to Turkish handwriting recognition. First, since large amounts of Turkish handwriting samples are not available, the system is trained and optimized on the large UNIPEN dataset of English handwriting before being extended to Turkish using a small Turkish dataset. Second, delayed strokes, which are a significant source of variation in writing order because of the large number of diacritical marks in Turkish, are removed during preprocessing. Finally, to address the high out-of-vocabulary rates encountered when using a fixed-size lexicon in general purpose recognition, a lexicon is constructed from sublexical units (stems and endings) learned from a large Turkish corpus. A statistical bigram language model learned from the same corpus is also applied during decoding. When tested on a small Turkish handwritten word dataset, the system obtains a 91.7% word recognition rate using a medium-sized lexicon (1,950 words) corresponding to the vocabulary of the test set, and 63.8% using a large, general purpose lexicon (130,000 words). With the proposed stem+ending lexicon (12,500 words) and a bigram language model with lattice expansion, it obtains a 67.9% word recognition accuracy, surpassing the result obtained with the general purpose lexicon while using a much smaller lexicon.
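To make the out-of-vocabulary argument concrete, the following minimal sketch compares the coverage of a whole-word lexicon with that of a stem+ending lexicon. It is an illustration only: the stems, endings, and test words are hypothetical toy data rather than units learned from the corpus described above, and the function names `segmentations` and `oov_rate` are assumptions introduced for this example, not part of the presented system.

```python
def segmentations(word, stems, endings):
    """Return every (stem, ending) split of `word` allowed by the two unit sets.

    An empty ending is allowed, so a bare stem counts as a covered word.
    """
    splits = []
    for i in range(1, len(word) + 1):
        stem, ending = word[:i], word[i:]
        if stem in stems and (ending == "" or ending in endings):
            splits.append((stem, ending))
    return splits


def oov_rate(words, is_covered):
    """Fraction of `words` for which the predicate `is_covered` is False."""
    missing = [w for w in words if not is_covered(w)]
    return len(missing) / len(words)


# Hypothetical toy units (not taken from the paper's corpus).
stems = {"ev", "kitap", "göz"}
endings = {"ler", "lar", "de", "da", "lerde", "larda", "im", "üm"}

# Hypothetical whole-word lexicon listing only a few surface forms.
word_lexicon = {"ev", "evler", "kitap", "göz"}

test_words = ["evlerde", "kitaplar", "gözüm", "evde", "kitap", "okul"]

# Coverage with whole words: a test word must appear verbatim in the lexicon.
word_oov = oov_rate(test_words, lambda w: w in word_lexicon)

# Coverage with sublexical units: a test word must split into stem + ending.
unit_oov = oov_rate(test_words, lambda w: bool(segmentations(w, stems, endings)))

print(f"whole-word OOV rate:  {word_oov:.2f}")
print(f"stem+ending OOV rate: {unit_oov:.2f}")
```

In this toy setting, inflected forms that are absent from the whole-word lexicon are still covered once they can be split into a known stem and a known ending. A real system would additionally need to score the unit sequences, for example with the bigram language model mentioned above.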