Feature Importance in the Quality of Protein Templates

Muhamed Adilović, Altijana Hromić-Jahjefendić

Abstract


Proteins are a focus of research because of their importance as biological catalysts in various cellular processes and diseases. Since the experimental study of proteins is time-consuming and expensive, in silico prediction and analysis of proteins is common. Template-based prediction is the most reliable approach, so the aim of this study is to analyze how important the primary features of proteins are for their quality score. Statistical analysis shows that protein models with a resolution value below 3 Å or an R value below 0.25 have higher quality scores than their counterparts when each feature is compared individually. Random forest analysis, a machine learning method, also ranks resolution as the most important feature, while the other features have lower but moderate importance scores. The exception is the presence of a ligand in protein models, which has no effect on the global protein quality scores in either the statistical or the machine learning analysis.
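The kind of group comparison described above can be sketched as follows. This is a minimal illustration, not the study's procedure: the abstract does not name the statistical test used, so a one-sided permutation test on the difference of means is swapped in here, and the records are synthetic placeholders generated in the script rather than data from the study. The function name `permutation_test` and the constant `THRESHOLD` are hypothetical.

```python
import random


def mean(values):
    """Arithmetic mean of a non-empty sequence."""
    return sum(values) / len(values)


def permutation_test(group_a, group_b, n_permutations=2000, seed=0):
    """One-sided permutation test: is mean(group_a) greater than mean(group_b)?

    Returns the observed difference of means and an estimated p-value.
    """
    rng = random.Random(seed)
    observed = mean(group_a) - mean(group_b)
    pooled = list(group_a) + list(group_b)
    n_a = len(group_a)
    at_least_as_extreme = 0
    for _ in range(n_permutations):
        # Reassign scores to the two groups at random.
        rng.shuffle(pooled)
        diff = mean(pooled[:n_a]) - mean(pooled[n_a:])
        if diff >= observed:
            at_least_as_extreme += 1
    # Add-one correction keeps the estimate strictly above zero.
    p_value = (at_least_as_extreme + 1) / (n_permutations + 1)
    return observed, p_value


# Synthetic placeholder records: (resolution in angstroms, quality score).
# These values are made up for illustration and are NOT from the study.
data_rng = random.Random(42)
records = []
for _ in range(200):
    resolution = data_rng.uniform(1.0, 4.5)
    score = 0.9 - 0.1 * resolution + data_rng.gauss(0, 0.05)
    records.append((resolution, score))

THRESHOLD = 3.0  # resolution cut-off mentioned in the abstract
below = [score for resolution, score in records if resolution < THRESHOLD]
at_or_above = [score for resolution, score in records if resolution >= THRESHOLD]

observed_diff, p_value = permutation_test(below, at_or_above)
print(f"n below: {len(below)}, n at or above: {len(at_or_above)}")
print(f"difference of mean scores: {observed_diff:.3f}")
print(f"one-sided permutation p-value: {p_value:.4f}")
```

The same split-and-compare pattern would apply to the R value cut-off of 0.25 or to ligand presence, each feature taken on its own. A permutation test is used only because it needs no distributional assumptions and no third-party packages.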

Keywords


Protein Template Quality Assessment, Feature Importance, Machine Learning

DOI: http://dx.doi.org/10.21533/pen.v9i2.1830


Copyright (c) 2021 Muhamed Adilović, Altijana Hromić-Jahjefendić

Creative Commons License
This work is licensed under a Creative Commons Attribution 4.0 International License.

ISSN: 2303-4521

Digital Object Identifier DOI: 10.21533/pen
