Prediction of Antigenic Epitope Patches on Protein Surface Using Antigen Structure Information and Support Vector Machine

Identification of antigen-antibody interacting sites is an important task for vaccine design, and hence reliable computer-based prediction methods are highly desirable. The performance of existing methods for predicting conformational B-cell epitope residues is still unsatisfactory and remains far from ideal. We present a new approach for vaccine development that predicts the antigenic surface patches holding the majority of epitope residues on the surface of the antigen protein structure. The proposed method is a support vector machine based model that predicts the epitope patches in antigen structures by combining accessible surface area and B-factor structural features. Predictions are made for the known structures of a benchmark dataset after removing antigen sequence redundancy, so that no two antigen sequences share more than 40% sequence identity. The predictions are successful for 70% of the antigen structure chains of the benchmark dataset. We compared the prediction performance of our model with the protein-protein interaction prediction server “Sharp2” using the same antigen structures of the benchmark dataset and observed that our model outperforms Sharp2 by more than 40 percentage points in accuracy. This paper demonstrates that identifying the antigenic determinant sites on the protein surface using antigen structural information outperforms a traditional protein-protein interaction algorithm at predicting the interacting sites on the antigen protein surface. It offers scientists a new approach: using the predicted antigenic epitope surface patch from the target antigen structure in vaccine development rather than the predicted epitope residues. A web server, “PatchTope”, has been developed for predicting antigenic epitope surface patches on an antigen protein structure surface and is available at http://www.fci.cu.edu.eg:8080/PatchTope/.


Introduction
Vaccine design is the process of creating a drug (vaccine) to stimulate adaptive immunity to a disease. A vaccine can consist of live attenuated (weakened) forms of pathogens (bacteria or viruses), killed or inactivated forms of these pathogens, or refined material such as proteins. Evolution of the weakened pathogens is one of the potential safety problems raised by such vaccines [1]. To overcome such safety problems, the subunit vaccine was introduced. A subunit vaccine is produced from a specific portion of the protein antigen or virus, separated from the pathogenic organism, called an epitope. B-cell epitopes are segments of antigen molecules recognized by antibodies or B-cells. They are classified into two groups: continuous and discontinuous. A continuous (linear) epitope is a short contiguous fragment of the amino acid sequence of a protein [2], while a discontinuous (conformational) epitope is composed of amino acid residues of a protein antigen that are far apart in the primary sequence of the antigen but are brought into close proximity within the folded protein structure [3]. The large majority of B-cell epitopes, although they are composed of short linear peptides, are conformational.
Identification of B-cell epitopes is considered the main challenge in epitope-driven vaccine design [4]. Manual identification of B-cell epitopes through experiments and testing is very expensive and has many limitations, including the time scale, experiments that cannot be carried out, and ethical concerns. Computer-based systems can therefore play an important role in this task by providing scientists with computational methods for predicting B-cell epitopes.
Computational methods for identifying conformational B-cell epitopes require a complete analysis in the context of the native antigen structure, whereas methods for linear epitopes require only the antigen sequence [5]. Several computational methods have been developed for predicting B-cell epitopes of both types, linear and conformational.
Conformational B-cell epitope prediction methods follow two major approaches: sequence based and structure based. Sequence-based methods try to predict conformational B-cell epitopes from the antigen primary sequence, while structure-based methods require the antigen 3D structure to be available. The sequence-based approach has the advantage that only the antigen sequence is needed for prediction, not its 3D structure. CBTOPE [22] is a prediction method that relies on the sequence-based approach to predict conformational B-cell epitopes from the antigen primary sequence. On the other hand, a few methods predict conformational B-cell epitopes from the antigen structure: CEP [23], DiscoTope [3], PEPITO [24], Ellipro [25], EPCES [26], EPSVR [27], EPMeta [27], and the method of Liu R et al. [28]. Unfortunately, the prediction performance of these methods is still unsatisfactory and remains far from ideal.
In this paper, we present a different view of identifying the antigenic epitope sites in an antigen structure chain: predicting the overlapping antigen surface patches that hold the majority of epitope residues in the antigen structure, which scientists can then use in vaccine development. From a given antigen structure, the overlapping surface patches are generated, and the surface patch that holds the maximum number of epitope residues is considered the epitope patch, which scientists can use for vaccination. The method is a support vector machine model trained on epitope and non-epitope surface patches generated from the antigen structure chains of Pernille et al.'s dataset [3]. The method always chooses the three top-scoring patches and treats them as the predicted patches. The prediction is considered correct if any of these patches covers at least 70% of the interacting residues. To evaluate the performance of our model, predictions are made for known structures of an independent test set of antigen chains generated by Ponomarenko et al. [29]. Additionally, we evaluated our model in terms of the area under the receiver operating characteristic curve (AUC) by conducting fivefold cross-validation on the representative training set collected by Pernille et al. [3]. We compared our model with Sharp2 [30], a server for predicting protein-protein interaction sites on the surface of a protein structure. Comparing the prediction accuracy of Sharp2 on the benchmark dataset with that of our model, we observed that identifying the antigenic determinant sites on protein surfaces using antigen structural information outperforms the protein-protein interaction method "Sharp2" at predicting the antibody interacting sites on antigen protein surfaces.

2.1.1. Training Dataset
We obtained 75 antigen-antibody (Ag-Ab) complexes prepared by Pernille et al. [3] from the DiscoTope supplementary materials. These complexes were solved by X-ray crystallography and selected for a resolution below 3 Å. The corresponding antigen PDB files were obtained from the Protein Data Bank [31]. Pernille et al. [3] divided the 75 antigens into 25 heterogeneous groups, which were in turn divided into five data sets for cross-validation and testing. In this dataset, an antigen residue is defined as an epitope residue if the distance between any of its atoms and any antibody atom is less than 4 Å. The dataset contains 1202 antibody-interacting and 13242 non-antibody-interacting residues.
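The 4 Å labelling rule above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name `is_epitope_residue` and the toy coordinates are hypothetical, and atoms are represented as plain (x, y, z) tuples rather than parsed from a PDB file.

```python
import math

def is_epitope_residue(residue_atoms, antibody_atoms, cutoff=4.0):
    """Return True if any residue atom is closer than `cutoff` (in Å)
    to any antibody atom. Atoms are (x, y, z) tuples."""
    return any(
        math.dist(a, b) < cutoff
        for a in residue_atoms
        for b in antibody_atoms
    )

# Hypothetical toy coordinates in Å, not taken from any real structure.
residue = [(0.0, 0.0, 0.0), (1.5, 0.0, 0.0)]
antibody = [(5.0, 0.0, 0.0), (10.0, 0.0, 0.0)]
print(is_epitope_residue(residue, antibody))  # closest pair is 3.5 Å apart
```

The comparison is strict ("less than 4 Å"), so a pair of atoms exactly 4 Å apart does not make the residue an epitope residue.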

2.1.2. Independent Test Set
We evaluated our model on a benchmark dataset generated by Ponomarenko et al. [29]. This dataset contains 161 protein chains obtained from 144 antigen-antibody (Ag-Ab) complex structures. An antigen residue is considered an epitope residue if the distance between any of its atoms and any antibody atom is less than 4 Å. We removed sequence redundancy from the 161 antigen chain sequences using CD-HIT [32] at a 40% cutoff, obtaining 50 antigen chains among which no two sequences share more than 40% sequence identity. To ensure low sequence identity between the training and testing datasets, we removed the chains that already exist in Pernille et al.'s dataset (our training dataset), and the remaining representative proteins were selected as the testing dataset.

Surface Patch Generation Algorithm
Identifying the protein surface is not an easy task even when the antigen 3D structure is known. The relative solvent accessibility of a residue in a protein structure measures how much of the amino acid residue is exposed to the solvent surrounding the protein [33]. NACCESS [34] is a computer program that computes the atomic accessible surface area for a given set of 3-dimensional coordinates (PDB file). For each antigen, the relative accessible surface area is computed, and residues with a relative surface area ≥ 5% are considered surface accessible residues [35]. Each surface accessible residue is used to define a surface patch. A surface patch is composed of a central surface accessible residue followed by its N nearest surface accessible neighbour residues [35]; the patch size is N + 1. The nearest neighbours of the patch central residue are determined by the Euclidean distance [36] between the central residue and all other surface residues. Using this procedure, overlapping patches of surface accessible residues are generated from each antigen structure. Figure 1 shows a flow chart of the surface patch generation algorithm.
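The patch generation procedure can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function name `generate_patches` is hypothetical, the relative accessibility values are assumed to be already computed (for example by NACCESS), and each residue is assumed to be represented by a single coordinate, a choice the text does not specify.

```python
import math

def generate_patches(rel_asa, coords, n_neighbours, threshold=5.0):
    """Build one patch per surface residue.

    rel_asa: dict residue_id -> relative accessible surface area (%)
    coords:  dict residue_id -> (x, y, z) representative coordinate (assumption)
    Returns dict centre_id -> [centre, nearest neighbours...] of size N + 1.
    """
    # Keep only residues at or above the 5% relative accessibility threshold.
    surface = [r for r in rel_asa if rel_asa[r] >= threshold]
    patches = {}
    for centre in surface:
        others = [r for r in surface if r != centre]
        # Rank the remaining surface residues by Euclidean distance to the centre.
        others.sort(key=lambda r: math.dist(coords[centre], coords[r]))
        patches[centre] = [centre] + others[:n_neighbours]
    return patches

# Hypothetical toy chain with five residues placed on a line.
rel_asa = {"A1": 30.0, "A2": 2.0, "A3": 12.0, "A4": 50.0, "A5": 8.0}
coords = {"A1": (0, 0, 0), "A2": (1, 0, 0), "A3": (2, 0, 0),
          "A4": (5, 0, 0), "A5": (9, 0, 0)}
patches = generate_patches(rel_asa, coords, n_neighbours=2)
print(patches["A1"])  # ['A1', 'A3', 'A4']
```

Because every surface residue seeds its own patch, neighbouring patches share residues, which is why the resulting patches overlap.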

Data Preparation
For each antigen structure in Pernille et al.'s dataset [3], the overlapping surface patches are generated. The epitope patch is the antigen surface patch that holds the maximum number of surface accessible epitope residues, while a non-epitope patch is the antigen surface patch that holds the minimum number of surface accessible epitope residues in the antigen structure. Following this rule alone, the training dataset would be composed of 75 epitope and 75 non-epitope surface patches. To enlarge the training dataset, we increased the number of non-epitope surface patches for each antigen structure by selecting the 6 surface patches holding the fewest epitope residues, while still generating only one epitope surface patch. Each epitope surface patch is labelled 1 and each non-epitope surface patch is labelled 0. The resulting training dataset contains a total of 75 epitope surface patches and 450 non-epitope surface patches.
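A small sketch of this labelling step for one antigen is given below. The function name `build_training_patches` is hypothetical, and the tie-breaking (first patch encountered wins) is an assumption, since the text does not say how ties in epitope counts are resolved.

```python
def build_training_patches(patches, epitope_residues, n_negative=6):
    """Pick one epitope patch (label 1) and up to `n_negative`
    non-epitope patches (label 0) for a single antigen.

    patches: dict centre_id -> list of residue ids in the patch
    epitope_residues: set of residue ids annotated as epitope residues
    """
    counts = {
        centre: sum(1 for r in members if r in epitope_residues)
        for centre, members in patches.items()
    }
    # Patch with the most epitope residues (ties: first encountered, an assumption).
    positive = max(counts, key=counts.get)
    # Patches with the fewest epitope residues, excluding the positive patch.
    ascending = sorted(counts, key=counts.get)
    negatives = [c for c in ascending if c != positive][:n_negative]
    return [(positive, 1)] + [(c, 0) for c in negatives]

# Hypothetical toy antigen with four patches of three residues each.
patches = {"p1": ["a", "b", "c"], "p2": ["b", "c", "d"],
           "p3": ["d", "e", "f"], "p4": ["e", "f", "g"]}
print(build_training_patches(patches, {"a", "b", "c"}, n_negative=2))
# [('p1', 1), ('p3', 0), ('p4', 0)]
```

With the default of 6 negatives per antigen and 75 antigens, this scheme yields the 75 positive and 450 negative patches quoted above.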

Normalized Relative Solvent Accessibility
For each surface residue in the antigen structure, the relative solvent accessibility is measured using the program NACCESS [34] and normalized using the following equation:

$$\mathrm{RSA}^{\mathrm{norm}}_r = \frac{\mathrm{RSA}_r - \min(\mathrm{RSA})}{\max(\mathrm{RSA}) - \min(\mathrm{RSA})}$$

where $\mathrm{RSA}_r$ is the relative solvent accessibility of residue $r$, and $\max(\mathrm{RSA})$ and $\min(\mathrm{RSA})$ are the maximum and minimum relative solvent accessibility values of all residues in the antigen chain, respectively.
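As a worked illustration, the normalization can be sketched as min-max scaling, which is the form suggested by the max(RSA) and min(RSA) terms defined above. The function name `normalize_rsa` and the handling of a chain with constant accessibility are assumptions made for this sketch.

```python
def normalize_rsa(rsa_values):
    """Min-max scale relative solvent accessibility values to [0, 1]
    over all residues of one antigen chain."""
    lo, hi = min(rsa_values), max(rsa_values)
    if hi == lo:
        # Degenerate chain with constant RSA: return zeros (an assumption).
        return [0.0 for _ in rsa_values]
    return [(v - lo) / (hi - lo) for v in rsa_values]

print(normalize_rsa([5.0, 55.0, 105.0]))  # [0.0, 0.5, 1.0]
```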

Normalized B-Factor
The B-factor, also called the "temperature factor", reflects the flexibility of residues in a protein structure determined by protein crystallography [37]. For each surface residue in the antigen structure, the B-factor value is extracted from the antigen 3-dimensional coordinate file (PDB file) and normalized using the following equation:

$$\mathrm{BFactor}^{\mathrm{norm}}_r = \frac{\mathrm{BFactor}_r - \langle \mathrm{BFactor} \rangle}{\sigma(\mathrm{BFactor})}$$

where $\mathrm{BFactor}_r$ is the B-factor of residue $r$, and $\langle \mathrm{BFactor} \rangle$ and $\sigma(\mathrm{BFactor})$ are the mean and the standard deviation of the B-factor values of all residues in the antigen chain, respectively.
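The mean and standard deviation defined above suggest a standard z-score, which can be sketched as below. The function name `normalize_bfactor` is hypothetical, and the use of the population standard deviation (rather than the sample one) is an assumption, since the text does not say which is meant.

```python
import statistics

def normalize_bfactor(bfactors):
    """Z-score the B-factor values of all residues in one antigen chain."""
    mean = statistics.fmean(bfactors)
    # Population standard deviation; the sample version is an equally
    # plausible reading of the text (assumption).
    sd = statistics.pstdev(bfactors)
    if sd == 0:
        return [0.0 for _ in bfactors]
    return [(b - mean) / sd for b in bfactors]

z = normalize_bfactor([10.0, 20.0, 30.0])
print([round(v, 3) for v in z])  # [-1.225, 0.0, 1.225]
```

After this transformation the values of a chain have zero mean and unit variance, so B-factors from structures refined at different overall temperature scales become comparable.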

Support Vector Machine Model
The support vector machine (SVM) [38] is a classification algorithm that aims to find a deterministic mapping function between the input features and the class labels. Given a set of labelled training patterns $(x_i, y_i)$, where $x_i \in \mathbb{R}^p$ and $y_i \in \{+1, -1\}$, training an SVM classifier involves finding a maximum-margin hyperplane that divides the positive and negative training samples. The hyperplane can be written as $f(x) = w \cdot x + b$, where "$\cdot$" denotes the dot product, $w$ is the normal vector and $b$ is a bias term. When the training data are not linearly separable, a kernel function is used to map the data into a higher-dimensional space in which they are assumed to be linearly separable. Given any two observations $(x_i, x_j)$ in the input space, the kernel function can be written as a dot product of two feature vectors in the high-dimensional feature space, $K(x_i, x_j) = \Phi(x_i)^T \Phi(x_j)$. In this paper, we used the Gaussian radial basis function (RBF) as the kernel function for our support vector machine classifier:

$$K(x_i, x_j) = \exp\left(-\frac{\lVert x_i - x_j \rVert^2}{2\sigma^2}\right)$$

where $\sigma$ is a parameter. Support vector machine models have been used in a number of biological applications [39]. We developed our SVM models using Weka [40], a machine learning workbench.
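To make the kernel concrete, the sketch below evaluates a Gaussian RBF kernel in the common width-parameter form exp(-||xi - xj||^2 / (2 sigma^2)). This parameterization is an assumption for illustration; toolkits such as Weka expose an equivalent gamma parameter instead, and the function name `rbf_kernel` is hypothetical.

```python
import math

def rbf_kernel(xi, xj, sigma=1.0):
    """Gaussian RBF kernel between two equal-length feature vectors."""
    squared_distance = sum((a - b) ** 2 for a, b in zip(xi, xj))
    return math.exp(-squared_distance / (2.0 * sigma ** 2))

print(rbf_kernel([1.0, 2.0], [1.0, 2.0]))  # identical vectors give 1.0
print(rbf_kernel([0.0, 0.0], [1.0, 1.0]))  # exp(-1), about 0.368
```

The kernel value decays from 1 towards 0 as two patches' feature vectors move apart, and a smaller sigma makes that decay sharper.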

Epitope Surface Patches Prediction Algorithm
Given the antigen structure, all of its surface residues are identified, and hence all of the corresponding surface patches are obtained. The normalized relative solvent accessibility and B-factor features are calculated for each residue in the surface patch, so each surface patch is represented by a vector of dimension N×2, where N is the patch size. A prediction score is assigned to each surface patch based on the support vector machine score for its feature vector. The top three non-overlapping surface patches with the highest prediction scores are then selected such that no two surface patches share more than 50% of their residues (Figure 2). Figure 3 shows the flow chart of the prediction algorithm.
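The selection of the top three patches can be sketched as a greedy pass over the patches in descending score order. The greedy strategy and the function name `top_non_overlapping` are assumptions for illustration; the text states only the 50% overlap constraint, not how it is enforced.

```python
def top_non_overlapping(scored_patches, k=3, max_overlap=0.5):
    """Greedily keep the k best-scoring patches such that no kept pair
    shares more than `max_overlap` of its residues.

    scored_patches: list of (score, [residue ids]) with equal patch sizes.
    """
    selected = []
    ranked = sorted(scored_patches, key=lambda p: p[0], reverse=True)
    for score, residues in ranked:
        current = set(residues)
        compatible = all(
            len(current & set(kept)) / len(current) <= max_overlap
            for _, kept in selected
        )
        if compatible:
            selected.append((score, residues))
        if len(selected) == k:
            break
    return selected

# Hypothetical scores and residue ids for five patches of size 4.
scored = [
    (0.9, ["a", "b", "c", "d"]),
    (0.8, ["a", "b", "c", "e"]),  # shares 75% with the best patch: skipped
    (0.7, ["c", "d", "e", "f"]),  # shares exactly 50%: allowed
    (0.6, ["g", "h", "i", "j"]),
    (0.5, ["k", "l", "m", "n"]),
]
print([s for s, _ in top_non_overlapping(scored)])  # [0.9, 0.7, 0.6]
```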

Accuracy Measures Using the Independent Test Set
For each antigen structure chain in the independent test set, all the surface patches are obtained. For each surface patch, the number of epitope residues is calculated. The surface patch that holds the maximum number of epitope residues among all surface patches generated from the antigen structure is defined as the real epitope surface patch. Figure 4 shows the process of determining the real surface patch.
The top three predicted surface patches are generated using the prediction algorithm. For each predicted surface patch, the relative overlap with the real epitope surface patch is calculated using the following equation:

$$\text{Relative overlap} = \frac{Ne_P}{Ne_R} \times 100\%$$

where $Ne_R$ is the number of epitope residues in the real surface patch, and $Ne_P$ is the number of epitope residues in the predicted surface patch. If the relative overlap for any of the top three predicted surface patches of the antigen structure exceeds 70%, the prediction is defined to be correct. The prediction accuracy is defined as the ratio of the number of antigen structures with a correctly predicted surface patch to the number of all antigen structures in the independent test set.
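The accuracy measure can be sketched as follows, assuming the relative overlap is the ratio of epitope residues in the predicted patch to those in the real patch, expressed as a percentage. The function names and the toy counts are hypothetical.

```python
def relative_overlap(n_epitope_predicted, n_epitope_real):
    """Relative overlap (%) between a predicted patch and the real patch."""
    return 100.0 * n_epitope_predicted / n_epitope_real

def is_correct(predicted_counts, n_epitope_real, threshold=70.0):
    """A chain is correctly predicted if any top patch exceeds the threshold."""
    return any(
        relative_overlap(c, n_epitope_real) > threshold
        for c in predicted_counts
    )

def prediction_accuracy(chains):
    """chains: list of (epitope counts of the top three patches, count in real patch)."""
    correct = sum(1 for counts, n_real in chains if is_correct(counts, n_real))
    return correct / len(chains)

# Hypothetical counts for three antigen chains.
chains = [([9, 4, 2], 12), ([6, 3, 0], 12), ([8, 10, 1], 10)]
print(relative_overlap(9, 12))      # 75.0
print(prediction_accuracy(chains))  # 2 of 3 chains are correct
```

The threshold test is strict, matching "exceeds 70%": a patch with a relative overlap of exactly 70% would not count as correct in this sketch.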

Analysis of Antibody Interacting Sites
To understand whether B-cell epitope residues are located on the surface of the protein structure, we analyzed Pernille et al.'s dataset to find the amino acid preferences of epitope and non-epitope residues. As shown in Figure 5, most of the amino acids preferred in epitope residues, such as Asparagine, Glycine, Arginine, Lysine, Aspartic acid, and Threonine, are polar (hydrophilic in nature), while most of the amino acids preferred in non-epitope residues, such as Cysteine, Phenylalanine, Methionine, Alanine, and Tryptophan, are hydrophobic (non-polar). Hydrophobic amino acid residues are known to be largely inaccessible to the solvent, while polar and charged amino acid residues are exposed on the surface of the molecule and are in contact with the solvent [6]. The same dataset was also analyzed with respect to antigen surface residue identification, following the rule that an antigen residue with relative solvent accessibility ≥ 5% is considered a surface accessible residue. The total numbers of epitope residues on the protein surface and in the protein body are 1164 and 38, respectively, so about 96.8% of the 1202 epitope residues are surface accessible. These findings confirm that most antigenic epitope residues are located on the surface of the antigen protein structure.

Prediction Results of Fivefold Cross Validation
A support vector machine model based on the Gaussian radial basis function (RBF) kernel was developed using the combination of the two antigen structural features (relative solvent accessibility and B-factor). The features are represented by a vector of dimension N×2, where N is the patch size and equals 20 residues. The surface patches were generated for each antigen structure in the 75 antigen chains of Pernille et al.'s dataset. Fivefold cross-validation was conducted on the five antigen groups of Pernille et al.'s dataset [41]. For each run, one group was left out for testing, while the remaining four groups were used for training. The average area under the receiver operating characteristic curve (AUC) over the 5 antigen groups reached a maximum of 0.894.
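The evaluation protocol can be sketched in two small pieces: a leave-one-group-out splitter and an AUC computed as the fraction of positive/negative pairs ranked correctly (ties count as half). Both function names are hypothetical, and the classifier itself is omitted; this is only an illustration of the bookkeeping, not the Weka workflow used in the paper.

```python
def leave_one_group_out(groups):
    """Yield (train, test) splits where each group is held out once."""
    for i, test in enumerate(groups):
        train = [item for j, g in enumerate(groups) if j != i for item in g]
        yield train, list(test)

def roc_auc(labels, scores):
    """AUC as the probability that a random positive outscores a random negative."""
    pos = [s for l, s in zip(labels, scores) if l == 1]
    neg = [s for l, s in zip(labels, scores) if l == 0]
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0
        for p in pos
        for n in neg
    )
    return wins / (len(pos) * len(neg))

# Hypothetical labels (1 = epitope patch) and classifier scores.
print(roc_auc([1, 1, 0, 0], [0.9, 0.4, 0.5, 0.1]))  # 0.75

# Hypothetical antigen groups; each fold trains on all the other groups.
for train, test in leave_one_group_out([["a"], ["b", "c"], ["d"]]):
    print(train, test)
```

Averaging `roc_auc` over the five held-out groups would give a figure comparable in kind to the cross-validated AUC reported above.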

Prediction of Antigenic Patches in Benchmark Dataset
The benchmark dataset is used to independently evaluate our model for predicting the antigenic surface patches of protein antigens. The 75 antigen chains of Pernille et al.'s dataset are used to train our SVM model, and we predict the antigenic surface patches of the 30 antigen chains obtained from Ponomarenko et al.'s dataset after removing sequence redundancy. After applying our prediction algorithm, we observed that 70% of the antigen chains were correctly predicted (Table 1). We obtained an area under the receiver operating characteristic curve (AUC) of 0.809.
Table 1 footnotes: a Total number of surface patches of each antigen chain. b Number of residues in the surface patch. c Number of epitope residues in the real epitope patch and in the top three non-overlapping surface patches generated by our prediction algorithm. d Relative overlap between the top three surface patches and the real surface patch. e The prediction accuracy: if the relative overlap for any of the top three predicted surface patches of the antigen structure exceeds 70%, the prediction is defined to be correct.

Comparisons with a Protein-Protein Interaction Server
Sharp2 [30] is a web server for predicting protein-protein interaction sites on the surface of the 3D structure of a protein. The interacting partner may be an identical protein, a different protein that is larger, a different protein that is smaller, or an antibody. The user-friendly web server lets scientists choose the type of interacting partner, and the algorithm parameters for predicting protein-protein interaction sites are changed accordingly [35]. For each antigen chain in the benchmark dataset, the predicted interacting surface patches were downloaded from the Sharp2 web server with the parameters: protein type = Type D; interacting partner is an antibody; patch size = 20. The surface patches predicted for the antigen chain are ranked by their combined scores. The top three non-overlapping surface patches with the highest combined scores are selected such that no two surface patches share more than 50% of their residues. For each antigen chain in the benchmark dataset, the relative overlap of each of the top three non-overlapping surface patches with the real epitope surface patch is calculated. We observed that only 28% of the antigen chains were correctly predicted (Table 2). Comparing this prediction accuracy on Ponomarenko et al.'s dataset with that of our model (70%) reveals that identifying the antigenic determinant sites on protein surfaces using antigen structural information outperforms the traditional protein-protein interaction algorithm at predicting the interacting sites on protein surfaces.

Visualization of Predicted Patches for an Example
To illustrate the effectiveness of our method, we chose a chain of a complex (PDB ID: 1ZTX, Chain ID: E) from the benchmark dataset as an example to visualize the predicted surface patch that holds the maximum number of epitope residues in the chain. We compared the residues of the predicted epitope surface patch with the actual epitope residues given in the benchmark epitope dataset. Figure 6 shows that the predicted surface patch identified by our classifier holds most of the epitope residues in the protein complex structure (11 out of 16 epitope residues). The predicted surface patch can then be used in vaccine development.

PatchTope Implementation
PatchTope is a user-friendly web-based bioinformatics tool for predicting the antigenic epitope surface patches that hold the most epitope residues in a given antigen protein structure. The server is developed using Java Servlets and HTML. The user may submit the antigen structure by entering its PDB ID or by uploading a structure file in PDB format. The user may also enter the chain ID of the protein chain of interest and then click the submit button. For the given antigen structure, the surface accessible residues are extracted, and each one defines a surface patch. For each surface patch, the feature vector is generated by computing the relative solvent accessibility of each residue using the NACCESS [34] program and extracting the B-factor feature from the PDB file. The generated surface patches are then passed one by one as input to the trained support vector machine model. The top three non-overlapping surface patches with the highest prediction scores are selected such that no two surface patches share more than 50% of their residues. A 3D view of the antigen structure is generated using Jmol [42], and the residues of each predicted surface patch are coloured yellow according to the user's selection. PatchTope requires Netscape v6.0 or Internet Explorer v6.0 or higher, with JavaScript enabled. The web server is freely available at http://www.fci.cu.edu.eg:8080/PatchTope/.

Conclusions
In this paper we propose a new computational method for predicting antigenic surface patches that interact with B-cells. Computing the relative overlap of the predicted patches with the real epitope patch in known structures of an independent test set showed that structural information of the antigen chains can be used to predict the protein interacting sites on the surface of the protein structure. Compared with a popular method for predicting protein-protein interactions using patch analysis, our approach showed better prediction accuracy.