Large vocabulary recognition for online Turkish handwriting with sublexical units

We present a system for large vocabulary recognition of online Turkish handwriting using hidden Markov models. While the recognizer follows a traditional approach, we have identified and developed solutions for the main problems specific to Turkish handwriting recognition. First, since large amounts of Turkish handwriting samples are not available, the system is trained and optimized using the large UNIPEN dataset of English handwriting before being extended to Turkish using a small Turkish dataset. Second, delayed strokes, which are a significant source of variation in writing order due to the large number of diacritical marks in Turkish, are removed during preprocessing. Finally, as a solution to the high out-of-vocabulary rates encountered when using a fixed-size lexicon in general-purpose recognition, a lexicon is constructed from sublexical units (stems and endings) learned from a large Turkish corpus. A statistical bigram language model learned from the same corpus is also applied during the decoding process. The system obtains a 91.7% word recognition rate when tested on a small Turkish handwritten word dataset using a medium-sized (1950 words) lexicon corresponding to the vocabulary of the test set, and 63.8% using a large, general-purpose lexicon (130,000 words). However, with the proposed stem+ending lexicon (12,500 words) and bigram language model with lattice expansion, a 67.9% word recognition accuracy is obtained, surpassing the results obtained with the general-purpose lexicon while using a much smaller one.
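The idea behind the stem+ending lexicon can be illustrated with a minimal Python sketch. It is not the authors' implementation: the toy corpus, the stem list, the longest-prefix splitting rule, the `+` prefix marking endings, and the add-one smoothing are all assumptions made for illustration, and the bigram is simplified to P(ending | stem) within a single word. The sketch shows how a word that is absent from a word-level lexicon can still be covered by combining known units.

```python
import math
from collections import Counter

# Toy data: illustrative assumptions, not taken from the paper.
CORPUS = ["evler", "evde", "okullar", "arabalar", "arabada"]
STEMS = {"ev", "okul", "araba"}


def split_word(word, stems):
    """Split a word into (stem, ending) using the longest known stem prefix."""
    for cut in range(len(word), 0, -1):
        if word[:cut] in stems:
            return word[:cut], "+" + word[cut:]
    return None  # no known stem: the word cannot be decomposed


def build_unit_lexicon(corpus, stems):
    """Collect the stem and ending units observed in the corpus."""
    units = set()
    for word in corpus:
        parts = split_word(word, stems)
        if parts:
            units.update(parts)
    return units


def train_bigram(corpus, stems):
    """Count (stem, ending) bigrams and stem unigrams."""
    bigrams, unigrams = Counter(), Counter()
    for word in corpus:
        parts = split_word(word, stems)
        if parts:
            unigrams[parts[0]] += 1
            bigrams[parts] += 1
    return bigrams, unigrams


def log_prob(stem, ending, bigrams, unigrams, n_endings):
    """Add-one smoothed log P(ending | stem)."""
    return math.log((bigrams[(stem, ending)] + 1) / (unigrams[stem] + n_endings))


def covered(word, units, stems):
    """True if the word decomposes into units that are all in the lexicon."""
    parts = split_word(word, stems)
    return parts is not None and all(p in units for p in parts)


word_lexicon = set(CORPUS)
unit_lexicon = build_unit_lexicon(CORPUS, STEMS)
bigrams, unigrams = train_bigram(CORPUS, STEMS)
n_endings = sum(1 for u in unit_lexicon if u.startswith("+"))

# "okulda" never occurs in the corpus, so a word-level lexicon misses it,
# but the units "okul" and "+da" were both observed in other words.
print("okulda" in word_lexicon, covered("okulda", unit_lexicon, STEMS))
```

In this sketch, the bigram scores let a decoder prefer stem and ending pairs seen in training (such as `okul` followed by `+lar`) over unseen but still permitted pairs (such as `okul` followed by `+da`), which is the role a language model plays when a unit lexicon would otherwise allow arbitrary combinations.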

___

Turkish Journal of Electrical Engineering and Computer Sciences
  • ISSN: 1300-0632
  • Publication frequency: 6 issues per year
  • Publisher: TÜBİTAK