An alignment-free method for bulk comparison of protein sequences from different species

The number of available protein sequences has increased rapidly with the development of new sequencing techniques. This growth has created an urgent need for computational methods that use these data to solve biological problems. One such problem is the comparison of protein sequences from different species to reveal their evolutionary relationships. Several alignment-free methods have recently been proposed for this purpose, and in this study we propose another. Unlike existing methods, the proposed method supports not only pairwise comparison of two protein sequences but also bulk comparison of multiple protein sequences simultaneously. Computational experiments on gold-standard datasets show that bulk comparison of multiple sequences is much faster than its pairwise counterpart, and that the proposed method achieves performance competitive with the state-of-the-art alignment-based method ClustalW.
