Statistical digram and trigram analysis of Turkish in terms of coverage and entropy for possible language and speech based applications


Uslu I. B., YILMAZ A. E., İLK H. G.

18th European Signal Processing Conference, EUSIPCO 2010, Aalborg, Denmark, 23-27 August 2010, pp. 776-780

  • Publication Type: Conference Paper / Full-Text Paper
  • City of Publication: Aalborg
  • Country of Publication: Denmark
  • Pages: pp. 776-780
  • Affiliated with Ankara University: Yes

Abstract

In this study, two frameworks, made up of digrams and trigrams, are built for complete coverage of the Turkish language. In addition, character, digram, and trigram entropy values for Turkish, English, and Spanish are compared. Examining meaningful Turkish texts, we find that 3 major digram clusters constitute slightly more than 60% of Turkish texts. Similarly, 3 major trigram clusters cover almost 40% of Turkish texts. The statistics show that 391 (of 841 theoretical) digrams and 3,396 (of 24,389 theoretical) trigrams are sufficient for 99% coverage of Turkish. The results of this study could serve as a general roadmap for rapid coverage for researchers who would like to work on Turkish language and speech based applications. As an application, the results could lead to a general framework for setting the prioritization rules for duration modeling in concatenative text-to-speech synthesis systems. © EURASIP, 2010.
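
The coverage and entropy figures quoted in the abstract can be illustrated with a minimal Python sketch. It is not the authors' procedure: the abstract does not say how n-grams were extracted, so the sketch assumes overlapping character n-grams counted within whitespace-separated words, and it assumes the input has already been case-normalized (Python's default `str.lower()` does not handle the Turkish dotted and dotless i correctly). The function names and the toy input are hypothetical. The theoretical totals of 841 and 24,389 are consistent with the 29-letter Turkish alphabet (29² and 29³).

```python
import math
from collections import Counter

# The Turkish alphabet has 29 letters, which matches the theoretical
# totals quoted in the abstract: 29**2 = 841 and 29**3 = 24,389.
ALPHABET_SIZE = 29


def ngram_counts(text, n):
    """Count overlapping character n-grams inside each word.

    Assumption: n-grams do not span word boundaries. The abstract does
    not specify this, so treat it as one possible convention.
    """
    counts = Counter()
    for word in text.split():
        for i in range(len(word) - n + 1):
            counts[word[i:i + n]] += 1
    return counts


def ngrams_for_coverage(counts, target):
    """Return how many of the most frequent n-grams are needed so that
    their occurrences make up at least `target` (0..1) of the total."""
    total = sum(counts.values())
    running = 0
    for k, (_, c) in enumerate(counts.most_common(), start=1):
        running += c
        if running >= target * total:
            return k
    return len(counts)


def entropy_bits(counts):
    """Shannon entropy (in bits) of the empirical n-gram distribution."""
    total = sum(counts.values())
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


if __name__ == "__main__":
    sample = "abab abab"  # toy input, not real corpus data
    digrams = ngram_counts(sample, 2)
    print(ngrams_for_coverage(digrams, 0.99), entropy_bits(digrams))
```

On a real corpus, calling `ngrams_for_coverage(counts, 0.99)` for `n = 2` and `n = 3` would give counts comparable in kind to the 391 digrams and 3,396 trigrams reported above, although the exact values depend on the corpus and on the extraction convention.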