Assessment of Mutation Susceptibility in DNA Sequences with Word Vectors

With the advent of natural language processing (NLP) techniques empowered by deep learning, increasingly detailed relationships between words have been uncovered. Word2Vec is particularly robust at discovering contextual and semantic relationships. The genome, being a long text, lends itself to similar analyses aimed at uncovering as-yet-unknown relationships between DNA k-mers. Dna2vec applies the Word2Vec approach to a whole genome so that DNA k-mers are represented as vectors, and cosine-similarity queries on these vectors reveal unexpected relationships between k-mers. In this study, we examined sequence-based prediction of mutation susceptibility. First, we generated word vectors for the human and mouse genomes via dna2vec. Separately, we retrieved the coordinates of common and all mutations from dbSNP. For each coordinate, we extracted the 8-nucleotide k-mers intersecting the mutation and aggregated the results so that the number of mutations per 8-mer was tabulated. These counts were then combined with the dna2vec cosine-similarity data. Our results show that, for a given k-mer, the k-mer with the highest cosine similarity coincides with the k-mer carrying the highest mutation count; in other words, the nearest neighbor by cosine similarity was also the neighbor with the highest mutation count. In our experiments, this overlap between dna2vec similarity and mutation counts is 80% for human and 70% for mouse. In conclusion, dna2vec and other word-embedding approaches can be used to reveal mutation or variation characteristics of genomes without sequencing or experimental data, using the genome sequence alone. This may pave the way toward understanding the underlying mechanisms or dynamics of mutations in genomes.
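The comparison step described above can be sketched in a few lines: for each k-mer, find its nearest neighbor by cosine similarity among the embeddings, find the k-mer with the highest mutation count, and check whether the two agree. This is a minimal, hypothetical illustration; the toy vectors and counts below are stand-ins, not outputs of dna2vec or dbSNP.

```python
import numpy as np

# Hypothetical stand-ins: in the study, `vectors` would hold dna2vec
# 8-mer embeddings and `mutation_counts` the per-8-mer totals from dbSNP.
rng = np.random.default_rng(42)
kmers = ["ACGTACGT", "ACGTACGA", "TTTTAAAA", "GGGGCCCC"]
vectors = {k: rng.normal(size=100) for k in kmers}
mutation_counts = {"ACGTACGA": 12, "TTTTAAAA": 3, "GGGGCCCC": 7}

def cosine(a, b):
    """Cosine similarity between two vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def nearest_neighbor(query, vectors):
    """k-mer (other than the query) whose vector is most cosine-similar."""
    return max((k for k in vectors if k != query),
               key=lambda k: cosine(vectors[query], vectors[k]))

query = "ACGTACGT"
neighbor = nearest_neighbor(query, vectors)
most_mutated = max(mutation_counts, key=mutation_counts.get)
# The reported overlap statistic counts how often these two agree
# across all queried k-mers.
print(neighbor, most_mutated, neighbor == most_mutated)
```

Repeating this check over every 8-mer and dividing the number of agreements by the number of queries yields an overlap fraction of the kind the abstract reports (80% for human, 70% for mouse).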

References

Abdul-Mageed, M., & Ungar, L. (2017, July). EmoNet: Fine-grained emotion detection with gated recurrent neural networks. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers) (pp. 718-728).

Akın, A. A., & Akın, M. D. (2007). Zemberek, an open source NLP framework for Turkic languages. Structure, 10, 1-5.

Arora, S., Li, Y., Liang, Y., Ma, T., & Risteski, A. (2018). Linear algebraic structure of word senses, with applications to polysemy. Transactions of the Association for Computational Linguistics, 6, 483-495.

Bolukbasi, T., Chang, K. W., Zou, J. Y., Saligrama, V., & Kalai, A. T. (2016). Man is to computer programmer as woman is to homemaker? Debiasing word embeddings. In Advances in neural information processing systems (pp. 4349-4357).

Chen, M. (2017). Efficient vector representation for documents through corruption. arXiv preprint arXiv:1707.02377.

Chen, X., & Zitnick, C. L. (2015). Mind's eye: A recurrent visual representation for image caption generation. In Proceedings of the IEEE conference on computer vision and pattern recognition (pp. 2422-2431).

De Boom, C., Van Canneyt, S., Bohez, S., Demeester, T., & Dhoedt, B. (2015, November). Learning semantic similarity for very short texts. In 2015 IEEE International Conference on Data Mining Workshop (ICDMW) (pp. 1229-1234). IEEE.

Dos Santos, C., & Gatti, M. (2014, August). Deep convolutional neural networks for sentiment analysis of short texts. In Proceedings of COLING 2014, the 25th International Conference on Computational Linguistics: Technical Papers (pp. 69-78).

Eisner, B., Rocktäschel, T., Augenstein, I., Bošnjak, M., & Riedel, S. (2016). emoji2vec: Learning emoji representations from their description. arXiv preprint arXiv:1609.08359.

Faruqui, M., Tsvetkov, Y., Rastogi, P., & Dyer, C. (2016). Problems with evaluation of word embeddings using word similarity tasks. arXiv preprint arXiv:1605.02276.

Gladkova, A., Drozd, A., & Matsuoka, S. (2016, June). Analogy-based detection of morphological and semantic relations with word embeddings: what works and what doesn’t. In Proceedings of the NAACL Student Research Workshop (pp. 8-15).

Jauhar, S. K., Dyer, C., & Hovy, E. (2015). Ontologically grounded multi-sense representation learning for semantic vector space models. In Proceedings of the 2015 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (pp. 683-693).

Kiros, R., Salakhutdinov, R., & Zemel, R. S. (2014). Unifying visual-semantic embeddings with multimodal neural language models. arXiv preprint arXiv:1411.2539.

Kusner, M., Sun, Y., Kolkin, N., & Weinberger, K. (2015, June). From word embeddings to document distances. In International conference on machine learning (pp. 957-966).

Le, Q., & Mikolov, T. (2014, January). Distributed representations of sentences and documents. In International conference on machine learning (pp. 1188-1196).

Levy, O., & Goldberg, Y. (2014, June). Linguistic regularities in sparse and explicit word representations. In Proceedings of the eighteenth conference on computational natural language learning (pp. 171-180).

Mikolov, T., Chen, K., Corrado, G., & Dean, J. (2013). Efficient estimation of word representations in vector space. arXiv preprint arXiv:1301.3781.

Mikolov, T., Yih, W. T., & Zweig, G. (2013, June). Linguistic regularities in continuous space word representations. In Proceedings of the 2013 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (pp. 746-751).

Needleman, S. B., & Wunsch, C. D. (1970). A general method applicable to the search for similarities in the amino acid sequence of two proteins. Journal of Molecular Biology, 48(3), 443–453. doi:10.1016/0022-2836(70)90057-4

Ng, P. (2017). dna2vec: Consistent vector representations of variable-length k-mers. arXiv preprint arXiv:1701.06279.

Pedersoli, M., Lucas, T., Schmid, C., & Verbeek, J. (2017). Areas of attention for image captioning. In Proceedings of the IEEE International Conference on Computer Vision (pp. 1242-1250).

Perozzi, B., Al-Rfou, R., & Skiena, S. (2014, August). Deepwalk: Online learning of social representations. In Proceedings of the 20th ACM SIGKDD international conference on Knowledge discovery and data mining (pp. 701-710). ACM.

Preoţiuc-Pietro, D., Lampos, V., & Aletras, N. (2015, July). An analysis of the user occupational class through Twitter content. In Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing (Volume 1: Long Papers) (pp. 1754-1764).

Preoţiuc-Pietro, D., Liu, Y., Hopkins, D., & Ungar, L. (2017, July). Beyond binary labels: political ideology prediction of Twitter users. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers) (pp. 729-740).

Schwartz, R., Reichart, R., & Rappoport, A. (2015, July). Symmetric pattern based word embeddings for improved word similarity prediction. In Proceedings of the nineteenth conference on computational natural language learning (pp. 258-267).

Sienčnik, S. K. (2015, May). Adapting word2vec to named entity recognition. In Proceedings of the 20th Nordic Conference of Computational Linguistics (NODALIDA 2015), May 11-13, 2015, Vilnius, Lithuania (No. 109, pp. 239-243). Linköping University Electronic Press.

Uricchio, T., Ballan, L., Seidenari, L., & Del Bimbo, A. (2017). Automatic image annotation via label transfer in the semantic space. Pattern Recognition, 71, 144-157.

Wang, Y., Liu, S., Afzal, N., Rastegar-Mojarad, M., Wang, L., Shen, F., ... & Liu, H. (2018). A comparison of word embeddings for the biomedical natural language processing. Journal of Biomedical Informatics, 87, 12-20.

Yang, X., Macdonald, C., & Ounis, I. (2018). Using word embeddings in Twitter election classification. Information Retrieval Journal, 21(2-3), 183-207.

Yao, Y., Li, X., Liu, X., Liu, P., Liang, Z., Zhang, J., & Mai, K. (2017). Sensing spatial distribution of urban land use by integrating points-of-interest and Google Word2Vec model. International Journal of Geographical Information Science, 31(4), 825-848.

Zhao, Z., Yang, Z., Luo, L., Lin, H., & Wang, J. (2016). Drug drug interaction extraction from biomedical literature using syntax convolutional neural network. Bioinformatics, 32(22), 3444-3453.