Correlation of model quality between predicted proteins and their templates

Protein structure prediction benefits many areas of science and industry. Template-based modeling, which relies on solved structures from the Protein Data Bank, is the most reliable and most popular method. Template selection is a challenging part of it that requires special attention, because a well-chosen template can lead to a more accurate prediction. This study examines, on a larger scale, the relationships between predicted proteins taken from the Swiss-model repository and their templates. Features of predicted proteins are taken into account, including protein length, sequence identity, and sequence coverage. Quality assessment scores are compared and analyzed between the predicted proteins and their templates. Overall, quality assessment scores of predicted proteins show a moderate positive correlation with sequence identity to the template. Moreover, based on our data, template quality is noticeably correlated with the quality of the predicted protein structures: templates with higher quality scores will, on average, also yield predicted proteins with higher quality scores.


Introduction
Proteins are one of the main components of living organisms and the main workers in cells [1]. Understanding their mechanisms of action helps in understanding many biological pathways and in treating various diseases [1], [2]. Since protein function is closely related to its structure, structural biology is a key component of the overall study of proteins [3]. Protein structure can be obtained in two main ways: experimental determination and computational prediction. Experimentally, the structure is determined using different techniques, the most popular being X-ray crystallography, nuclear magnetic resonance (NMR), and cryogenic electron microscopy (cryo-EM) [4]. Structures of this type are generally made publicly available through the Protein Data Bank (PDB), an online repository of solved protein structures [5]. Protein structure prediction also has different methods, the main ones being template-based prediction and ab initio prediction. The template-based approach uses protein structures deposited in the PDB, finds the most similar ones, mostly by comparing amino acid sequences (although different tools use different methods), and builds the novel protein from the available template [6]. Ab initio prediction, on the other hand, tries to predict the structure of a novel protein independently (although it can also use parts of existing templates) [7]. Template-based prediction is inherently more accurate; however, the accuracy of ab initio prediction is improving, and there are situations where it is more useful due to specific constraints, e.g. [8], [9].
In general, in silico study of protein structures is very popular, since it can find potential targets for later research while using a fraction of the resources [10]. This is why there has been a lot of development in computational biology and bioinformatics, especially in structural biology, e.g. using machine learning (ML) to study various aspects of proteins, including their structure [11]-[13], structural quality [14]-[17], or classification [18], [19]. Validation of a protein's 3D structure is another important aspect, and different quality assessment (QA) tools have been developed for this purpose. The issue is that, when protein structure is predicted with template-based methods, structural properties of the templates, including their quality, might be transferred onto the predicted structures. This matters because selecting the best template for a prediction is a challenging task, and novel methods of template selection have improved the quality of predicted proteins [20], which indicates that there is potential for further optimization of this process. The aim of this research is to assess whether there is a "transfer of characteristics", namely of the quality level, from the templates to the predicted protein structures, and to what degree. To do this, the study also assesses the correlation of predicted proteins' QA levels with their structural features and, if there is a connection, groups them into corresponding subgroups for proper study and analysis.

Materials and methods
This study uses two databases: a database of template structures collected from the PDB, and a database of predicted proteins from Swiss-model (S-M) [21]. The two databases have been cross-referenced and filtered so that the analyses cover only those predicted proteins whose templates are in the first database and, vice versa, only those templates that have predicted proteins in the second database.
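The two-way filtering described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the record layout, the field names (`template_id`, `model_id`), and the identifiers are hypothetical and do not reflect the actual schema of either database.

```python
# Hypothetical records: identifiers and field names are illustrative only.
templates = {
    "TPL1": {"method": "X-ray"},
    "TPL2": {"method": "X-ray"},
    "TPL3": {"method": "X-ray"},  # never used as a template below
}
models = [
    {"model_id": "m1", "template_id": "TPL1"},
    {"model_id": "m2", "template_id": "TPL1"},
    {"model_id": "m3", "template_id": "TPL9"},  # template not in first database
    {"model_id": "m4", "template_id": "TPL2"},
]


def cross_reference(templates, models):
    """Keep models whose template is known, and templates that are used."""
    kept_models = [m for m in models if m["template_id"] in templates]
    used_ids = {m["template_id"] for m in kept_models}
    kept_templates = {tid: t for tid, t in templates.items() if tid in used_ids}
    return kept_templates, kept_models


kept_templates, kept_models = cross_reference(templates, models)
```

After filtering, every remaining model points to a remaining template, and every remaining template has at least one model.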

Sample collection and retrieval of the information
The first database contains template proteins from the PDB solved with X-ray crystallography. Of the 35,710 proteins in it, 6656 are used at least once as a template in the second database. Additional details on the first database, as well as its QA results and descriptive statistics, are given in the previous study done only on the protein templates [22]. The second database, containing the predicted protein structures, has been taken from the S-M repository available online [21]. The repository contained more than 400,000 proteins at the time of collection. These have been cross-referenced with the first database, so the final database of predicted protein structures contains 48,523 proteins, all of which have been predicted from templates in the first database. The online repository also provides two important parameters, which have been kept for further analysis: the percentage of the sequence used from the template, and the percentage of similarity to the template sequence. Additionally, proteins in the online repository are grouped according to the organism from which the sequences have been taken, and this grouping has been carried over to the database. The final list of protein features used in the second database is given in Table 1. Moreover, all features from the first database (template database) can also be cross-referenced to the second database (database of predicted proteins).

Table 1. Protein features used in the second database
Feature | Description
Protein length | The total number of residues in the protein model
Sequence identity | Percentage of similarity between the template and the protein
Sequence coverage | Percentage of residues from the template used during modeling

Quality assessment
The methods used for quality assessment are listed in Table 2. These are the same QA tools that were used to assess the database of template proteins from the PDB. The main difference is that predicted proteins do not have an R value, since it is a measurement obtained experimentally during the determination of the protein structure.

Correlation between the first and the second database
The features from the second database (organism of origin, protein length, sequence identity, and sequence coverage) have been analyzed against the QA scores using Pearson correlation, and the predicted proteins were divided and compared on the basis of the results. The comparison takes into account the QA scores of both the template proteins and the predicted proteins.
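As an illustration of the correlation step, the Pearson coefficient between one feature and one QA score can be computed as in the sketch below. The function is a plain textbook implementation, and the example values are made up for demonstration; they are not data from the study.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient between two equally long sequences."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sums of cross-products and squared deviations from the means.
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    if sxx == 0 or syy == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return sxy / math.sqrt(sxx * syy)


# Made-up example: sequence identity (%) against a hypothetical QA score.
sequence_identity = [20, 35, 50, 65, 80]
qa_score = [0.40, 0.55, 0.50, 0.70, 0.75]
r = pearson(sequence_identity, qa_score)
```

In the study the same calculation would be repeated for each feature and each QA tool, giving one coefficient per pair.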
The two databases were then compared as follows. Predicted proteins were first divided by sequence identity to the template into 20 distinct groups: 5 points of sequence identity is taken as the cut-off value and, since identity ranges from 0 to 100, the end result is 20 groups. Within each group, proteins were filtered by the number of predicted proteins sharing the same template: if at least 30 proteins share a template, the mean of their quality scores is taken into account; otherwise the sample is deemed too small and is excluded. Cross-comparison was then performed with a t-test, comparing the means of the protein samples against one another and relating the difference in the mean quality scores of predicted proteins to the difference in the quality scores of their templates, taking into consideration statistically significant differences.
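The grouping, filtering, and comparison steps above might look roughly like the following sketch. Several details are assumptions, since the text does not specify them: bins are taken as left-closed (an identity of exactly 100 falls in the last bin), the test is written as a Welch-style comparison of two means, and the p-value uses a normal approximation to the t distribution, which is reasonable only because every retained group has at least 30 members. The field names and the synthetic data are hypothetical.

```python
import math
from collections import defaultdict
from statistics import NormalDist, mean, variance

BIN_WIDTH = 5         # points of sequence identity per group (from the text)
MIN_GROUP_SIZE = 30   # minimum proteins per template in a group (from the text)


def identity_bin(identity):
    """Map a sequence identity in [0, 100] to a bin index 0..19 (assumed edges)."""
    return min(int(identity // BIN_WIDTH), 19)


def group_scores(models):
    """Group QA scores by (identity bin, template) and drop small samples."""
    groups = defaultdict(list)
    for m in models:
        groups[(identity_bin(m["identity"]), m["template_id"])].append(m["qa"])
    return {k: v for k, v in groups.items() if len(v) >= MIN_GROUP_SIZE}


def compare_means(a, b):
    """Return (difference of means, approximate two-sided p-value)."""
    diff = mean(a) - mean(b)
    se = math.sqrt(variance(a) / len(a) + variance(b) / len(b))
    if se == 0:
        return diff, (1.0 if diff == 0 else 0.0)
    z = diff / se
    # Normal approximation to the t distribution (assumption, large samples).
    return diff, 2 * (1 - NormalDist().cdf(abs(z)))


# Synthetic data: two templates with 30 models each, one with only 5.
models = (
    [{"template_id": "TPL_A", "identity": 42, "qa": 0.60 + 0.001 * i} for i in range(30)]
    + [{"template_id": "TPL_B", "identity": 44, "qa": 0.50 + 0.001 * i} for i in range(30)]
    + [{"template_id": "TPL_C", "identity": 43, "qa": 0.55} for _ in range(5)]
)
groups = group_scores(models)
diff, p_value = compare_means(groups[(8, "TPL_A")], groups[(8, "TPL_B")])
```

Here `TPL_C` is dropped because it has fewer than 30 models in the bin, and the remaining two samples are compared within the same identity bin, which is how the comparison controls for sequence identity.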

Descriptive statistics
The total number of proteins in the 2nd database is 48,523. The total number of templates from the 1st database used in the prediction of the proteins from the 2nd database is 6656. This means that most of the PDB proteins are not used as a template, others are used more than once, and some are very commonly used as templates. For example, 4UXV, "Cytoplasmic domain of bacterial cell division protein EzrA", is the most common template: 1387 proteins are predicted from it. 5XG2, "Crystal structure of a coiled-coil segment (residues 345-468 and 694-814) of Pyrococcus yayanosii Smc", is the second most common template: 1007 proteins are predicted from it. Out of 6656 templates, 4392 (~66%) have been used for the prediction of only up to 3 proteins (inclusive). Table 3 includes the summary of descriptive statistics for protein features.

Figure 5a. Quality assessment scores based on the sequence identity between the predicted protein and the template
Figure 5b. Quality assessment scores based on the sequence identity between the predicted protein and the template

After organizing the predicted proteins into 20 groups based on the sequence identity with the template (since that feature has been shown to have a consistent correlation with the quality scores of predicted proteins), they were filtered so that there are at least 30 proteins with the same template per subgroup. A total of 498 cross-comparisons were then made. The end result is that the difference in quality scores between the predicted proteins is consistent with the difference in quality scores between the templates 82% of the time. This means that, when two templates are compared and one has higher QA scores than the other, the proteins predicted from that template can also be expected to have higher QA scores than the proteins predicted from the template with the lower QA scores.
This relation is illustrated in Figure 6, which does not contain actual data but is simply a representation of the relationship. T1 (blue) represents the QA scores of template 1, with the QA scores of its predicted proteins shown below it, while template 2 (T2) with its predicted proteins is shown in yellow. Dashed lines represent the average QA scores of predicted proteins. Note that the templates have been positioned higher on the Y-axis for ease of representation. This positioning does not in itself indicate that the templates have higher QA scores than the predicted proteins; however, the analysis does show that template proteins have higher QA scores, on average, than the predicted proteins (results shown in the discussion part).
Figure 6. The relation of the QA scores between templates and predicted proteins
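The consistency percentage reported above can be read as the share of cross-comparisons in which the two differences point in the same direction. The sketch below shows that calculation on made-up numbers. It assumes each comparison has already been reduced to a pair (difference in template QA score, difference in mean QA score of the predicted proteins) and has already passed the significance filter; the function name and the input layout are hypothetical.

```python
def concordance_rate(comparisons):
    """Fraction of comparisons where both differences have the same sign.

    Each item is (template_diff, predicted_diff). Pairs with a zero
    difference carry no direction and are skipped (an assumption).
    """
    usable = [(dt, dp) for dt, dp in comparisons if dt != 0 and dp != 0]
    if not usable:
        raise ValueError("no comparisons with a non-zero difference")
    agree = sum(1 for dt, dp in usable if (dt > 0) == (dp > 0))
    return agree / len(usable)


# Made-up example: four of the five comparisons agree in direction.
comparisons = [
    (0.10, 0.04),
    (-0.05, -0.02),
    (0.08, -0.01),   # template better, predicted proteins worse: disagreement
    (0.03, 0.06),
    (-0.02, -0.07),
]
rate = concordance_rate(comparisons)
```

With these illustrative values the rate is 0.8; the 82% in the text is the analogous figure over the 498 real comparisons.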

Discussion
When the relationship between the templates and the predicted proteins is analyzed, it is interesting to see the large disparity in how often proteins are used as templates: out of 35,710 experimentally determined proteins in the sample (first database), only ~19% (6656) have been used as templates for the prediction of the 48,523 proteins in the second database. Two of them have each been used as a template for the prediction of more than 1000 other proteins, while two-thirds of them (~66%) have been used as a template only 1, 2, or 3 times. The sequence identity of predicted proteins to their templates is low on average (35.11%), which is also visible in Figure 3, but this is still reasonable, since it has been shown that proteins with sequence identity as low as 20% can be homologous in terms of structure and function [34]. Sequence coverage, on the other hand, is larger, with a mean value of 73.27%, indicating that S-M tends to include a bigger portion of the protein template during the prediction. This is also visible in Figure 2, which shows that the largest proportion of predicted proteins in the database have between 95% and 100% sequence coverage. This is expected, because some amino acids have similar physicochemical properties and, even though they might differ between the template and the prediction, the end effect on the structure might be similar [1]. Most of the proteins fall into the shorter group, as visible in Figure 1, with an average length of 287.60 residues, but this is expected since the first database, containing the templates, includes only monomeric proteins. Regarding the QA, it is interesting to note that the quality assessment software can be divided into two categories with respect to the quality scores: tools that give similar quality scores to templates and predicted proteins, and tools that give predicted proteins significantly lower scores, on average.
The results for the templates are shown in a separate study [22]; however, they are briefly mentioned here where necessary. Negative energy and Ramachandran assessment give very similar average scores to templates and predicted proteins (Energy standardized: -2.08 vs -2.10; Ramachandran: 99.53 vs 98.15). This shows that modeling of the proteins is generally performed in such a way that the protein model is optimized in terms of packing and phi/psi angles. It is interesting to note that Ramachandran scores are very good most of the time, which would make Ramachandran assessment, on its own, the least reliable method for determining structural quality. Other tools show large differences between the mean QA scores of templates and predicted proteins.
Regarding the features of predicted proteins, sequence length and coverage of template sequences do not show a consistent correlation with the quality scores, and their average correlation is very low and low, respectively. However, certain QA tools might be "susceptible" to the length of the protein, with Verify3D and VoroMQA showing a moderate correlation between the QA scores and protein lengths, which is consistent with the previous study done only on the protein templates [22]. The characteristic of predicted proteins that showed a consistent correlation with the quality scores is sequence identity to the template: predicted proteins with higher sequence identity to the template show a consistent increase in quality compared to those with lower sequence identity. This is the reason why the predicted proteins have been divided according to sequence identity in the last part of the analysis. The average Pearson correlation coefficient is moderately positive (0.41), and these results are partially expected.
The higher the sequence identity, the closer the two proteins (template and predicted protein) are structurally, which would, as a consequence, make their QA scores more similar as well. This is consistent with other studies, which have shown that it is possible to predict protein model QA scores from the sequence alignment, a step necessary for the prediction [35]. Finally, this study shows that the quality scores of predicted proteins are consistent with the quality scores of their templates in more than 80% of the cases. When predicted proteins based on two templates are compared, the average quality scores of predicted proteins are higher if the corresponding template has a higher quality score, and vice versa (with the adjustment for sequence identity). Although correlation does not necessarily mean causation, the relatively high figure (82%) indicates that there is "a transfer of property" between the templates and predicted proteins when it comes to the QA scores, even after adjusting for possible sequence identity bias by making the comparison only between proteins of similar sequence identity.

Conclusion
When the relationships between predicted proteins (from the S-M repository) and their templates (from the PDB) are analyzed on a larger scale, certain trends become visible. Sequence identity can play an important role in the QA scores of predicted proteins, with most of the QA results showing a moderate positive correlation with it. Sequence coverage and protein length do not show the same level of correlation, although it is moderate in some instances, indicating that certain QA tools can be biased towards protein length, with longer proteins having better QA scores. When the QA scores of predicted proteins are compared with those of their templates, a significant link can be noticed: predicted proteins have higher QA scores, on average, if the template they are predicted from also has a higher QA score, compared to predicted proteins whose templates have lower QA scores. This is an important aspect that should be taken into consideration during the protein prediction process and template selection. Further analysis of QA scores on a local level might give additional insights into the trends of QA tools when scoring protein 3D structures.