Efficient feature integration with Wikipedia-based semantic feature extraction for Turkish text summarization

This study presents a novel hybrid Turkish text summarization system that combines structural and semantic features. The system uses 5 structural features, 1 of which is newly proposed, and 3 semantic features whose values are extracted from Turkish Wikipedia links. The features are combined using weights calculated by 2 novel approaches. The first approach uses the analytic hierarchy process, which relies on a series of expert judgments based on pairwise comparisons of the features. The second approach uses the artificial bee colony algorithm to determine the feature weights automatically. To confirm the significance of the proposed hybrid system, its performance is evaluated on a new Turkish corpus that contains 110 documents and 3 human-generated extractive summary corpora. The experimental results show that combining all of the features yields better performance than using each feature individually.
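
As a rough illustration of the first weighting approach, the sketch below derives feature weights from a pairwise comparison matrix and uses them to score sentences as a weighted sum of feature values. It is a minimal sketch under stated assumptions, not the system's implementation: the row geometric-mean approximation of the priority vector, the three-feature comparison matrix, the feature values, and the names `ahp_weights` and `sentence_score` are all hypothetical choices made for this example.

```python
import math


def ahp_weights(pairwise):
    """Approximate priority weights from a reciprocal pairwise comparison
    matrix using the row geometric-mean method (an assumed, common
    approximation; the principal eigenvector could be used instead)."""
    n = len(pairwise)
    geo_means = [math.prod(row) ** (1.0 / n) for row in pairwise]
    total = sum(geo_means)
    return [g / total for g in geo_means]


def sentence_score(feature_values, weights):
    """Combine normalized feature values into one score by a weighted sum."""
    return sum(w * f for w, f in zip(weights, feature_values))


# Hypothetical expert judgments for three features:
# entry [i][j] states how much more important feature i is than feature j.
pairwise = [
    [1.0, 3.0, 5.0],
    [1 / 3, 1.0, 2.0],
    [1 / 5, 1 / 2, 1.0],
]
weights = ahp_weights(pairwise)

# Hypothetical, already-normalized feature values per sentence.
sentences = {
    "s1": [0.9, 0.2, 0.1],
    "s2": [0.3, 0.8, 0.7],
    "s3": [0.5, 0.5, 0.5],
}

# Rank sentences from highest to lowest combined score.
ranked = sorted(
    sentences,
    key=lambda s: sentence_score(sentences[s], weights),
    reverse=True,
)
```

An extractive summary would then take the top-ranked sentences up to a length limit. In the second approach, the same weight vector would be found by an optimizer rather than by expert judgments.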
