Labelling Turkish News Texts with Conditional Random Fields

The number of documents on the Web grows every day, and work in the semantic web field is expected to shape the Web's future so that it can be used to its full potential. Given this growth, finding the phrases that best represent a text is a sound way to reach the text one is looking for. Reaching the phrases that best express a text without reading all of it matters greatly for both the user and the browser. The aim of this study is to find, in news texts, the phrases that indicate the subject, predicate, location, and time of the news text, and to label the text with them. The subject of a news text is the most dominant person, thing, or entity in the text. The predicate of the text expresses the event that takes place in it. The location and time of the text are the place and time at which the event in the text occurs. To this end, the study aims to extract the most dominant subject, predicate, location, and time information selected from the sentences in the text. Turkish news texts were chosen as the scope. Of the manually labelled texts, one part was used as training data and the other part as test data during automatic labelling.

Keywords:

Natural Language Processing, Information Extraction, Conditional Random Fields, Named Entity Recognition

Labelling Turkish News Stories with Conditional Random Fields

The drastic increase in the number of documents on the Web requires semantic web applications in order to bring the Web to its full potential. Extracting the important phrases in a document makes it easier to find the expected information. This paper introduces a new approach that labels the main subject, main predicate, main location, and main date of an electronic document. The main subject label tells whom or what the document is about. The main predicate label tells what the subject is or does. The main location label tells where the activities took place, and the main date label tells when they took place. This methodology provides not only a high-level description of the content but also the attribute of each extracted phrase in the document. Turkish news stories were selected as the experimental set. Human annotators labelled them manually to produce the training and test sets. Different models were then implemented for each label to extract the labels automatically, and their output was compared with the manually labelled results to evaluate the approach.

Keywords:

Natural Language Processing, Information Extraction, Conditional Random Fields, Named Entity Recognition
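The workflow described in the abstract (manually labelled documents divided into training and test sets, then automatic labels compared with the manual ones) can be sketched roughly as follows. This is only an illustration under stated assumptions: the tag names, the `O` tag for unlabelled tokens, the 80/20 split, and the use of token-level precision, recall, and F1 are all hypothetical choices, since the abstract does not specify the tag scheme, the split proportion, or the evaluation metric. The CRF models themselves are not reproduced here.

```python
import random
from collections import Counter

# Hypothetical tag set: the abstract names the four roles but not its tag scheme.
LABELS = ("SUBJECT", "PREDICATE", "LOCATION", "DATE")
OUTSIDE = "O"  # assumed tag for tokens that carry none of the four labels


def split_documents(docs, train_ratio=0.8, seed=0):
    """Shuffle manually labelled documents and split them into train/test.

    The 80/20 ratio is an assumption; the abstract gives no proportion.
    """
    shuffled = list(docs)
    random.Random(seed).shuffle(shuffled)
    cut = int(len(shuffled) * train_ratio)
    return shuffled[:cut], shuffled[cut:]


def per_label_scores(gold, predicted):
    """Compare token-level predicted tags with manual tags for each label."""
    if len(gold) != len(predicted):
        raise ValueError("gold and predicted must have the same length")
    tp, fp, fn = Counter(), Counter(), Counter()
    for g, p in zip(gold, predicted):
        if p == g and g != OUTSIDE:
            tp[g] += 1  # correct non-outside tag
        else:
            if p != OUTSIDE:
                fp[p] += 1  # predicted a label the annotator did not assign
            if g != OUTSIDE:
                fn[g] += 1  # missed a label the annotator assigned
    scores = {}
    for label in LABELS:
        found = tp[label] + fp[label]
        expected = tp[label] + fn[label]
        precision = tp[label] / found if found else 0.0
        recall = tp[label] / expected if expected else 0.0
        total = precision + recall
        f1 = 2 * precision * recall / total if total else 0.0
        scores[label] = {"precision": precision, "recall": recall, "f1": f1}
    return scores


# Invented four-token example, one manual tag per token.
gold_tags = ["SUBJECT", "DATE", "LOCATION", "PREDICATE"]
predicted_tags = ["SUBJECT", "O", "LOCATION", "PREDICATE"]
print(per_label_scores(gold_tags, predicted_tags))
```

In this invented example the predicted sequence misses the date token, so only the `DATE` scores drop while the other three labels are scored as fully correct. Scoring each label separately mirrors the abstract's statement that a different model is built for each label.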

