EpiGASVM – a New Technique for MHC Class-II Epitope Prediction

Identification of major histocompatibility complex (MHC) binding peptides is an important step in the selection of T-cell epitope candidates suitable for use in new vaccines. The binding groove of the MHC Class-II molecule is open at both ends, which allows for high variability in the length of the peptides that bind to this molecule and consequently complicates the prediction of the binding core motif. An accurate and efficient computational approach for the prediction of such peptides can greatly reduce the time and cost required for the design of new vaccines for infectious diseases and cancers. We have developed EpiGASVM, a new approach for the in silico prediction of MHC Class-II epitopes, by combining two artificial intelligence techniques: evolutionary algorithms and support vector machines. We have applied nine variations of EpiGASVM to a similarity-reduced benchmark dataset and calculated the prediction accuracy and the area under the receiver operating characteristic curve as measures of performance. The results indicate that EpiGASVM is a promising new technique that could provide researchers with a new tool for the in silico selection of candidate peptides for rational vaccine design.


Introduction
Major Histocompatibility Complex (MHC) molecules are cell membrane proteins that play a very important role in the immune system through the process of antigen presentation. The outer extracellular domains of these molecules form a cleft in which a peptide fragment is bound. These peptides are derived from proteins degraded inside the cell, including both self and foreign protein antigens. MHC molecules bound to peptides are carried to the cell surface, where they present their cargo to T cells. This antigen presentation process is essential for the recognition of the antigen by T cell receptors. MHC molecules are of two classes: Class I and Class II. Class I is responsible for the presentation of peptides of intracellular origin, e.g. self-antigens and viral peptides. Class I molecules are present on all nucleated cells and present peptides to cytotoxic T cells. Class II is responsible for the presentation of peptides of extracellular origin, e.g. endocytosed and digested bacterial antigens. Class II molecules are present on specialized immune system cells called Antigen-Presenting Cells (APCs, e.g. macrophages and dendritic cells), and they present peptides to helper T cells. Naïve helper T cells, when exposed to APCs with MHC Class II molecules loaded with an antigen, start to proliferate and differentiate into effector, memory and regulatory T helper cells, each with its specific role in adaptive cellular and humoral immunity. The ability to determine which peptides bind to MHC-II molecules is therefore important for the development of epitope-based vaccines and immunotherapeutics for infectious diseases, cancer and autoimmune diseases that are better tolerated and have fewer side effects than conventional vaccines.
MHC Class II molecules are characterized by having a peptide-binding groove that is open at both ends which allows peptides of great variability (typically 11 to 30 amino acids) in length to bind to these molecules [1]. This variability in length complicates computational approaches for the prediction of the core nonamer essential for binding. However, several studies indicate that a core of nine amino acids is the most essential part in the binding [2,3].
Several computational methods for the in silico prediction of MHC-II binding peptides have been proposed including: evolutionary algorithms and artificial neural networks [4], particle swarm optimization [5], hidden Markov models [6], Gibbs sampling [7], support vector machines [8] and ant colony search [9]. These computational methods can be used to reduce the number of candidate peptides that will be used in a wet-lab for further testing thus reducing time and cost of the development of new epitope-based vaccines.
In this study we propose a new approach, EpiGASVM, based on evolutionary algorithms and support vector machines. We have used evolutionary algorithms as a tool to search for the core binding nonamers. We then calculated the similarity scores between each pair of nonamers using the BLOSUM62 substitution matrix and used the resulting similarity matrix as input to a Support Vector Machine with a Radial Basis Function (RBF) kernel to produce the final classification results. We have tried this approach on a similarity-reduced dataset of known binders and non-binders. The dataset was originally extracted from IEDB and is composed of 12 allele datasets. From each allele dataset we selected 60 instances (30 binders and 30 non-binders). We have split the data into training and test subsets of 40 and 20 balanced instances respectively. We calculated the prediction accuracy and the area under the receiver operating characteristic (ROC) curve to measure the performance of each variation of EpiGASVM.

Datasets and Data Preparation
In a study by El-Manzalawy et al. [10] it was demonstrated that the predictive performance of algorithms applied to the MHC Class-II prediction problem is affected by peptide similarity between the training and test data. We have utilized similarity-reduced datasets available from the Repository of Epitope Datasets (RED) [11] to provide more accurate performance results for EpiGASVM.
The dataset was originally extracted from IEDB (identified as IEDB-SRDS2 in [10]) and we refer to it as the IEDB dataset from this point forward. In [10] the authors detail the methodology used for similarity reduction in 3 sequential steps: filtering to select unique peptides, followed by filtering to ensure that no pair of peptides shares a 9-mer subsequence, and finally filtering to ensure that the sequence identity between any pair of peptides is less than 80%.
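The 9-mer overlap check in the second filtering step can be sketched as follows; `shares_9mer` and the example peptides are hypothetical illustrations, not code or data from [10]:

```python
def shares_9mer(p1, p2):
    """True if the two peptides share any 9-residue subsequence."""
    kmers = {p1[i : i + 9] for i in range(len(p1) - 8)}
    return any(p2[i : i + 9] in kmers for i in range(len(p2) - 8))

# Hypothetical peptides: this pair shares the 9-mer 'CDEFGHIKL'.
overlap = shares_9mer("ACDEFGHIKLM", "WCDEFGHIKLW")
```

A dataset-level filter would keep a peptide only if `shares_9mer` is false against every peptide already accepted.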
We prepared the datasets by randomly selecting 30 unique binders and 30 unique non-binders from the RED datasets to produce a 60 instance balanced dataset for each allele (12 alleles for the IEDB data). Each of these datasets was used with the 9 variations of EpiGASVM to produce the results reported in this work.

Idea
As a result of the great variability in the length of MHC Class-II binding peptides, the fact that a 9-mer core is critical for binding, and the fact that we typically test several peptides at a time, an exhaustive search would be impractical. We have utilized several variations of evolutionary algorithms to search for the core nonamer by locating the most similar nonamers across all of the sequences under test. Similarity between nonamers is calculated using pairwise scores between each participating nonamer. The higher the score, the more similar the nonamers and, consequently, the more probable that the selected nonamers are the core nonamers that bind in the MHC-II groove. The similarity score is calculated based on the values of the BLOSUM62 substitution matrix. Each individual that is progressively evolved in the EA is a representation of a list of nonamers, and the fitness value of this individual is the sum of the pairwise similarity scores of all participating sequences. As the generations of the EA progress, the individuals of the population are developed and the selected nonamers change, with the hope of finding a fitter individual, which represents a more similar set of nonamers.
We run our variations of the EA against known binders to produce a list of most similar nonamers that represent our positive instances. We then randomly select one nonamer from each non-binder to complement the dataset with negative instances. The list of positive and negative nonamers (binders and non-binders) is vectorized by calculating the pairwise similarity score for each nonamer against the rest of the nonamers in the list. The output from this step is a list of vectorized nonamers, which are split into training data and test data for the support vector machine. The SVM is trained using the training data to produce a prediction model, and the prediction model is then tested using the test data to calculate the final prediction accuracy. The process is illustrated in Figure 1.

Evolutionary Algorithms
Evolutionary algorithms (EAs) are a group of metaheuristic optimization algorithms designed to optimize an objective function by maintaining a set of solutions (individuals) that are iteratively developed and improved over time using biologically inspired operators, e.g. selection, recombination and mutation. An iteration of an EA is usually called a generation, and each generation, with the exception of the first one, which is randomly generated, is developed by applying the genetic operators to members of the previous generation, in the hope that the new population will present better solutions to the objective function.
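The generational cycle described above can be sketched in Python. This is a minimal illustrative EA over integer genomes with a toy objective; all names and parameter values are hypothetical, and it is not the JCell implementation used in this work:

```python
import random

def evolve(fitness, genome_len, pop_size=20, generations=50,
           mutation_rate=0.1, seed=0):
    """Minimal generational EA: selection, recombination, mutation."""
    rng = random.Random(seed)
    # First generation is random; each gene is an integer in [0, 9].
    pop = [[rng.randint(0, 9) for _ in range(genome_len)]
           for _ in range(pop_size)]
    for _ in range(generations):
        ranked = sorted(pop, key=fitness, reverse=True)
        elite = ranked[: pop_size // 2]            # selection
        children = []
        while len(children) < pop_size - len(elite):
            a, b = rng.sample(elite, 2)
            cut = rng.randrange(1, genome_len)     # one-point recombination
            child = a[:cut] + b[cut:]
            if rng.random() < mutation_rate:       # mutation
                child[rng.randrange(genome_len)] = rng.randint(0, 9)
            children.append(child)
        pop = elite + children                     # next generation
    return max(pop, key=fitness)

# Toy objective: maximize the sum of the genes (the optimum is all 9s).
best = evolve(sum, genome_len=5)
```

Because the elite half is carried over unchanged, the best fitness is non-decreasing across generations.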

Population Structure
Evolutionary algorithms can be classified based on the structure of the population into structured EAs and non-structured (panmictic) EAs. In panmictic EAs there exists a single population that includes all the individuals (solutions) under evaluation, and consequently genetic operators are applied to all members of the population as a whole, e.g. recombination can occur between any two members of the population. On the other hand, structured evolutionary algorithms present multiple sub-populations where genetic operators are applied within the sub-population. The structuring of the population provides some benefits, like better sampling of the search space and the ability to balance the exploration/exploitation power of the algorithm.
Since the fitness landscape for our problem is unknown, we experimented with structured EAs as well as panmictic EAs. Experimentation with structured EAs gave us the ability to control the trade-off between exploration and exploitation [12]. This allowed us to tailor the algorithm to the nature of the problem by giving it more explorative power to escape local maxima/minima or more exploitative power to converge quickly to a global maximum/minimum.
We have developed nine variations of EpiGASVM that differ in the implemented EA (two panmictic algorithms: steady-state GA and generational GA, and seven structured algorithms: distributed GA, cellular GA, adaptive cellular GA, hierarchical cellular GA, and three memetic algorithms). We have utilized Simulated Annealing (SA) as the local search strategy in the 3 memetic algorithms, which differ in how the locality of the neighbourhood is defined, as follows:
Neighbourhood: the candidate solutions are randomly selected from the same cell as the individual under consideration.
Population: the candidate solutions are randomly selected from the whole population.
Random: the candidate solutions are randomly generated by assigning arbitrary values to all genes of the individual.
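The three neighbour-selection strategies, together with a standard SA acceptance rule, can be sketched as follows. This is a minimal illustration with hypothetical function names; the actual implementation lives in the JCell framework:

```python
import math
import random

rng = random.Random(1)

def neighbor_from_cell(cell):
    """Neighbourhood variant: candidate drawn from the individual's own cell."""
    return rng.choice(cell)

def neighbor_from_population(population):
    """Population variant: candidate drawn from anywhere in the population."""
    return rng.choice(population)

def neighbor_random(genome_len, gene_max):
    """Random variant: candidate built by assigning arbitrary values to all genes."""
    return [rng.randint(0, gene_max) for _ in range(genome_len)]

def sa_accept(current_fitness, candidate_fitness, temperature):
    """Standard SA rule for maximization: always accept improvements,
    accept worse candidates with probability exp(delta / T)."""
    if candidate_fitness >= current_fitness:
        return True
    delta = candidate_fitness - current_fitness
    return rng.random() < math.exp(delta / temperature)
```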
More detailed information about the various EAs discussed in this section can be found in [12]. We have utilized the JCell framework [13] for the implementation of the various evolutionary algorithms discussed in this paper. The information about the nine evolutionary algorithms used in this paper and their configuration is summarized in Table 1.

Genetic Representation
Each individual in the population is a solution to our problem; basically, it is a representation of a multiple alignment of the participating sequences. A chromosome (individual) is formed of several genes, and each gene is an integer representing an offset into the sequence in the range [0, n-9], where n is the length of the sequence. This offset is used to calculate the start point of the core nonamer in the respective sequence. Figure 2 illustrates the genetic representation.
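A minimal sketch of how a chromosome is decoded into its list of core nonamers; the peptide strings and the `decode` helper are hypothetical:

```python
def decode(chromosome, sequences, core_len=9):
    """Map each gene (an offset in [0, len(seq) - core_len]) to its nonamer."""
    cores = []
    for offset, seq in zip(chromosome, sequences):
        assert 0 <= offset <= len(seq) - core_len, "offset out of range"
        cores.append(seq[offset : offset + core_len])
    return cores

# Two hypothetical peptides of different lengths.
seqs = ["ACDEFGHIKLMNP", "QRSTVWYACDEFG"]
cores = decode([2, 0], seqs)   # → ['DEFGHIKLM', 'QRSTVWYAC']
```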

Objective Function
We utilized a sum-of-pairs objective function (1) to be maximized by the evolutionary algorithm. The scores are calculated using the BLOSUM62 substitution matrix. For each individual in the population, the values of the genes are used as offsets into the sequences to obtain a list of nonamers. For each possible pair of nonamers a score is calculated, and the summation of all scores is the overall fitness of the individual.
The function sub() calculates the pairwise substitution scores between each residue in the nonamers Si and Sj:
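The equations themselves were lost in formatting; from the description above, the sum-of-pairs objective over the m selected nonamers and the pairwise score sub() can be reconstructed (our reconstruction, not the original typesetting) as:

```latex
F = \sum_{i=1}^{m-1} \sum_{j=i+1}^{m} \mathrm{sub}(S_i, S_j) \tag{1}

\mathrm{sub}(S_i, S_j) = \sum_{k=1}^{9} \mathrm{BLOSUM62}\bigl(S_i[k],\, S_j[k]\bigr)
```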

Termination Criteria
We have set the termination criterion to 400 generations or 1,000,000 evaluations of the objective function, whichever is reached first, for all of our evolutionary algorithms, with the exception of the memetic algorithms, which are given 2,000,000 evaluations to account for the additional evaluations performed during the local search step.

Vectorization and Normalization of Nonamer Instances
The output from the EA techniques mentioned above is a list of nonamers that are known binders. This list is combined with a list of randomly selected nonamers from known non-binders. The vectorization is done by calculating the pairwise score between each pair of nonamers in the combined list using the BLOSUM62 substitution matrix, a technique similar to that of Liao et al. [14]. This is followed by normalization of each vector into the range [-1, 1].
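A minimal sketch of the vectorization and normalization steps. The substitution table below is restricted to two residues (though the scores shown are the actual BLOSUM62 values for A and C), and all helper names are ours:

```python
# Stand-in substitution table; scores are the BLOSUM62 values for A and C.
BLOSUM = {("A", "A"): 4, ("A", "C"): 0, ("C", "A"): 0, ("C", "C"): 9}

def pair_score(n1, n2):
    """Sum of substitution scores over the 9 aligned residue pairs."""
    return sum(BLOSUM[(a, b)] for a, b in zip(n1, n2))

def vectorize(nonamers):
    """Each nonamer becomes a vector of its scores against every nonamer."""
    return [[pair_score(n, other) for other in nonamers] for n in nonamers]

def normalize(vec):
    """Min-max scale a vector into the range [-1, 1]."""
    lo, hi = min(vec), max(vec)
    if hi == lo:
        return [0.0] * len(vec)
    return [2 * (v - lo) / (hi - lo) - 1 for v in vec]

nonamers = ["AAAAAAAAA", "CCCCCCCCC", "ACACACACA"]
vectors = [normalize(v) for v in vectorize(nonamers)]
```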

Support Vector Machines
The support vector machine (SVM) is a well-established supervised learning technique that has been used extensively in many fields, including bioinformatics. The SVM provides binary classification for linearly non-separable data by mapping the data to a higher-dimensional space where it is likely to be separable, and then finding the maximum-margin hyperplane in this high-dimensional space that can be used to differentiate positive from negative samples. SVM classification is done in two steps: a training step, where the SVM is trained on a sample of the data with known class labels to produce a classification model, and a test step, where the constructed model is applied to the rest of the data to produce the classification result.
We have utilized an SVM with the Radial Basis Function (RBF) Gaussian kernel. The data for each allele is composed of 60 instances, 40 of which were used as training samples and 20 for testing. The RBF kernel hyperparameters (C, γ) were determined using 5-fold cross-validation and grid search on all the data (60 instances) [15]. We have utilized the LibSVM library for our SVM work [16].
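The SVM step can be sketched with scikit-learn, whose SVC class wraps LibSVM. The synthetic data below stands in for the vectorized nonamer instances, and the parameter grid is illustrative, not the one used in this work:

```python
import numpy as np
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.svm import SVC

# Synthetic stand-in for one allele: 60 vectorized instances,
# 30 binders (label 1) and 30 non-binders (label 0).
rng = np.random.default_rng(0)
X = rng.normal(size=(60, 60))
y = np.array([1] * 30 + [0] * 30)
X[y == 1] += 1.0                         # make the classes separable

# 40 training / 20 test instances, balanced via stratification.
X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, train_size=40, stratify=y, random_state=0)

# Grid search over the RBF hyperparameters (C, gamma) with 5-fold CV.
grid = GridSearchCV(
    SVC(kernel="rbf"),
    {"C": [0.1, 1, 10, 100], "gamma": [1e-3, 1e-2, 1e-1]},
    cv=5)
grid.fit(X_tr, y_tr)
accuracy = grid.score(X_te, y_te)
```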

Algorithm Performance
We have plotted the fitness value of the best individual against the number of generations performed by each EA for all sets of data to demonstrate the convergence speed of each algorithm. In terms of speed, the steady-state genetic algorithm achieved the apparent optimum fitness on 4 of the 12 datasets faster than the other 8 algorithms we used. It is followed in ranking by the adaptive cellular genetic algorithm (achieving the optimum on 3 out of 12 sets faster than the other algorithms). Table 2 shows the fastest-converging algorithm variant for each allele.
On the other end of the scale, the memetic algorithm with the local search restricted to the neighbourhood (maNeigh) shows the worst performance, failing to achieve the apparent optimum on 11 out of 12 datasets within the bounds of the stopping criteria. It is followed in ranking by the generational genetic algorithm (failing on 10 out of 12 datasets). Table 3 shows the algorithms failing to reach the apparent optimum on each allele.

Prediction Accuracy and Area under ROC Curve
We have split each dataset into 40 instances used for training and 20 instances used for testing, and we have calculated the prediction accuracy and the area under the receiver operating characteristic curve. On the IEDB dataset we achieved an average accuracy of 90.46% and an average area under the curve of 0.9654. Table 4 lists the accuracy values and Table 5 lists the AUC values achieved.
In terms of prediction accuracy, the steady-state genetic algorithm shows the best performance on the IEDB dataset, with an average prediction accuracy of 93.75%, but in terms of AUC the cellular genetic algorithm shows the best performance, with an average AUC of 0.9808 on the same dataset.
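The two performance measures can be computed directly; this is a generic sketch (the AUC here uses the rank-statistic formulation), not the evaluation code used in the study:

```python
def accuracy(y_true, y_pred):
    """Fraction of test instances whose predicted label matches the truth."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

def auc(y_true, scores):
    """AUC as the probability that a random positive outranks a random
    negative (the Mann-Whitney formulation), with ties counted as 0.5."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))
```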

Performance Comparison to State-of-the-Art Techniques
We compared our work to the techniques included in a recent study by Wang et al. [17], in which the authors compared the performance of 7 approaches for the prediction of MHC Class-II epitopes on similarity-reduced data. In their paper, Wang et al. compared the performance of the Average Relative Binding matrix (ARB) [18], PROPRED [19], SMM-Align [20], a combinatorial library, NN-Align [21], a consensus of all 5 previous methods, and a consensus of 3 methods (NN-Align, SMM-Align, and Combinatorial Library or PROPRED). Table 6 shows a comparison of these methods against our 9 approaches using the AUC. Source code, the data used, and additional performance data are available by emailing the author at mostafa.omara@gmail.com.

Conclusions
We have developed a new technique, EpiGASVM, for the MHC-II epitope prediction problem. The prediction accuracy and AUC achieved on 12 similarity-reduced datasets show that EpiGASVM is reliable and accurate. To the best of our knowledge, the accuracy and AUC achieved with EpiGASVM are the best in the field for the MHC-II prediction problem, and it is our hope that this new technique will be an addition to the arsenal of tools available to researchers for the rational design of new epitope-based vaccines and immunotherapeutics.