Basılı Türkçe’nin Önemli Bazı İstatistiksel Özellikleri

Bu çalışmanın amacı, basılı Türkçe’nin bazı istatistiksel değerlerinin belirlenmesidir. Derlenen istatistikler tekli, ikili, …, beşli harf gruplarının sıklık dağılımları, ilk/son harf çözümlemeleri, harf başına belirsizlik (entropi) ve fazlalık, rastgelelik endeksi, sözcük uzunluk dağılımı ve sesli/sessiz harf oranını içermektedir. Hürriyet gazetesinin internet arşivinden bir Türkçe külliyat (corpus) oluşturularak anılan değerler elde edilmiştir. Bununla yetinilmeyip, Türkçe’ye ilişkin öteki çalışmalar da kullanılarak, tüm bu çalışmaların ağırlıklı bileşkesi olan, bugüne kadar elde edilen en geniş Türkçe külliyat tabanı ve metin çeşitliliğine sahip, en kapsamlı sonuçlar elde edilmiştir. Farklı çalışmalarda elde edilen sonuçların birbiriyle uyumluluk derecesini belirlemek amacıyla bir benzerlik ölçütü geliştirilmiş ve mevcut çalışmaların sonuçlarına uygulanmıştır.

Anahtar Kelimeler:

Türkçe’nin İstatistiksel Özellikleri, N-Gram Sıklık Dağılımları, Belirsizlik, İlk/Son Harf Çözümlemesi, Sözcük Uzunlukları, Sıralı Liste Benzerlik Ölçütü

Some Important Statistical Properties of Printed Turkish

The goal of this study is to determine some statistical properties of printed Turkish. The compiled statistics include n-gram frequency distributions (monograms, digrams, ..., pentagrams), first/last letter analyses, per-letter entropy and redundancy, the index of coincidence, the word length distribution, and the vowel/consonant ratio. These values were obtained from a corpus compiled from the Internet archive of the daily newspaper Hürriyet. Furthermore, these results were combined with those of existing studies on Turkish into a weighted aggregate, yielding the most comprehensive results to date, based on the largest Turkish corpus base and the widest variety of texts. To determine the degree of agreement among the results of the different studies, a similarity measure was developed and applied to the results of the existing studies.

Keywords:

Statistical Properties of Turkish, N-Gram Frequency Distributions, Entropy, First/Last Letter Analysis, Word Lengths, Similarity Assessment of Sorted Lists
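
As an illustration of the kinds of statistics listed in the abstract, the following is a minimal Python sketch that computes n-gram counts, per-letter entropy, redundancy, and the index of coincidence for a text. It is not the authors' implementation. The function names, the decision to drop spaces and punctuation (so that n-grams run across word boundaries), and the textbook definitions used here (Shannon entropy in bits, redundancy relative to a uniform 29-letter alphabet, the standard index of coincidence) are all assumptions made for the example.

```python
import math
from collections import Counter

# The 29 letters of the modern Turkish alphabet (lowercase).
TURKISH_ALPHABET = "abcçdefgğhıijklmnoöprsştuüvyz"


def normalize(text):
    """Lowercase using Turkish dotted/dotless i rules and keep only Turkish letters.

    Assumption: spaces, digits and punctuation are discarded.
    """
    text = text.replace("I", "ı").replace("İ", "i").lower()
    return "".join(ch for ch in text if ch in TURKISH_ALPHABET)


def ngram_counts(letters, n):
    """Count all overlapping n-grams in a string of letters."""
    return Counter(letters[i:i + n] for i in range(len(letters) - n + 1))


def entropy_bits(counts):
    """Shannon entropy (in bits) of the empirical distribution given by counts."""
    total = sum(counts.values())
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


def redundancy(h, alphabet_size=len(TURKISH_ALPHABET)):
    """Redundancy relative to a uniform distribution over the alphabet."""
    return 1.0 - h / math.log2(alphabet_size)


def index_of_coincidence(counts):
    """Probability that two letters drawn without replacement are identical."""
    total = sum(counts.values())
    return sum(c * (c - 1) for c in counts.values()) / (total * (total - 1))


if __name__ == "__main__":
    sample = normalize("Basılı Türkçe'nin önemli bazı istatistiksel özellikleri")
    monograms = ngram_counts(sample, 1)
    h1 = entropy_bits(monograms)
    print("most common letters:", monograms.most_common(3))
    print("entropy (bits/letter):", round(h1, 3))
    print("redundancy:", round(redundancy(h1), 3))
    print("index of coincidence:", round(index_of_coincidence(monograms), 4))
```

A single sentence is far too short to give meaningful estimates; the sample is there only to show how the functions fit together. The explicit handling of `I` and `İ` before `lower()` is needed because Python's default case mapping does not follow Turkish rules.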
