Construction of a Path Analysis Model from DNA Microarray Data

A fundamental problem in human health is predicting the effect of disease-causing genes, an important step toward diagnosis and treatment. Predicting gene function also remains a challenge for biologists in the post-genomic era. DNA microarrays simultaneously monitor the expression levels of thousands of genes, and the resulting expression data provide unique opportunities to analyze the functional and regulatory relationships among genes. This paper proposes a new approach to estimating the relationships among genes and the effect of each gene on a disease. The approach consists of four main steps. First, a subset of highly informative genes is extracted. Second, a gene network is constructed; we propose to predict a gene's functions from its context graph, defined as the gene interaction network composed of the genes that interact directly or indirectly with a focal gene. Third, a path analysis model is used to estimate the effects of the genes on the disease and on each other. Fourth, logistic regression is applied to the relation between the genes and the target (disease) in order to classify new samples. The approach was evaluated on a lung cancer microarray dataset. The proposed path diagram fits the expression data of the subset of top-ranked genes, with a Goodness of Fit Index (GFI) that reached 0.832. The value of this approach is that it not only tackles the measurement problem by path analysis but also provides a visualization of the relationships among genes. The approach is also useful for feature reduction because it evaluates all genes simultaneously against the lung cell state.

* Prof. Dr., Sadat Academy for Management Science (SAMS), Department of Computer Science, badr_senousy_arcoit@yahoo.com
† Dr., Modern University for Technology and Information (M.T.I), Department of Computer Science, hmeldeeb14@yahoo.com
‡ Military Technical College (M.T.C), Department of Computer Science, khaledBadran@hotmail.com
§ Military Technical College (M.T.C), Department of Computer Science, ibrahim.alkhlil@gmail.com
Proceedings of the 8th ICEENG Conference, 29-31 May, 2012, EE184


Introduction
DNA microarray technology allows the expression levels of thousands of genes to be observed simultaneously during significant biological processes and across collections of related samples [1]. Microarray data have been applied to drug and therapeutics development, disease diagnosis, and the understanding of basic cell biology. Because microarray datasets capture the molecular signatures of diverse cells, their analysis has become an important application of data mining, artificial intelligence, and machine learning techniques for extracting bioinformatics knowledge [2].
Only a small number of genes are strongly related to a given disease. These genes, whose expression patterns are strongly correlated with the disease, are called informative genes. In our previous study [3], the subset of top-ranked genes that gave high classification accuracy was extracted by two attribute selection techniques: Support Vector Machine-Recursive Feature Elimination (SVM-RFE) [4,5] and Information Gain Attribute Selection (IGAS) [6]. The 10 top-ranked genes gave high classification accuracy with the most popular classifiers, which belong to six main categories: (1) Bayes classifiers (Bayesian Network and Naive Bayes), (2) a lazy classifier (K-nearest-neighbor), (3) rule-based classifiers (PART and Decision Table), (4) function-based classifiers (SVM and Artificial Neural Networks, ANN), (5) meta classifiers (AdaBoost (C4.5) and Bagging (C4.5)), and (6) a single decision tree (C4.5).
Gene networks are a general class of networks in which any set of genes may be connected. In theory, genes that interact directly or indirectly may have the same or similar functions in the biological processes in which they are involved, and together they contribute to the related cancers. The complicated relations between genes can be represented clearly using network theory [7,8]. Pearson's correlation or Euclidean distance is commonly used to measure the similarity between genes when constructing a gene network [9]. However, neither criterion can completely capture the relationship among all candidate gene expression profiles and lung cancer simultaneously [10,11]. Therefore, a path analysis model is used here to identify interactions between genes and to estimate the relationship between the genes and lung cancer.
The rest of this paper is organized as follows. Section 2 describes the path analysis problem and related material. Section 3 describes the lung cancer microarray dataset. Section 4 presents the proposed model for constructing the gene network. Section 5 describes the construction of the gene network in practice. Section 6 presents the analysis and discussion. Section 7 compares the single decision tree model with the proposed path analysis model. Section 8 concludes the paper.

Path Analysis Problem
Let $X_1, X_2, \ldots, X_m$ be exogenous variables that are correlated and have both direct and indirect effects (through the endogenous variables $X_{m+1}, X_{m+2}, \ldots, X_n$) on the dependent variable Y.
Let M be the correlation coefficient matrix among these variables.
The task is to construct a graph model G (the path diagram) that reflects the underlying relationships between these variables. Path analysis then calculates the path coefficients between the exogenous and endogenous variables in order to predict the effect of the variables on the dependent variable Y.
Fig. 1 illustrates different models of the effects of traits on an output y. Fig. 1(A) shows a multiple regression model in which each trait operates simultaneously on fitness y. In the model of Fig. 1(A), $P_{YX_1}, P_{X_3X_1}, P_{X_3X_2}, \ldots$ are path coefficients, $r_{X_1X_2}$ is the correlation between the exogenous variables $X_1$ and $X_2$, and $e_{X_3}$ and $e_Y$ are the uncorrelated error terms of the endogenous variables $X_3$ and Y, respectively.
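Here M is the correlation matrix among the variables, $M = [r_{ij}]$, where $r_{ij}$ is the correlation between variables i and j, with $r_{ii} = 1$ and $r_{ij} = r_{ji}$. From a given or calculated correlation matrix, the path coefficient of each path in the graph can be computed.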
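As a minimal sketch of how path coefficients follow from a correlation matrix, consider the simplest layout, in which every predictor has a direct path to Y (the multiple regression case). Under that assumption the standardized path coefficients p solve R_xx p = r_xy, where R_xx holds the correlations among the predictors and r_xy their correlations with Y. The function names and the correlation values below are illustrative assumptions and are not taken from the lung cancer dataset.

```python
def solve_linear(a, b):
    """Solve a @ x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    # Build an augmented matrix so the inputs are not modified.
    aug = [list(map(float, a[i])) + [float(b[i])] for i in range(n)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for row in range(col + 1, n):
            factor = aug[row][col] / aug[col][col]
            for k in range(col, n + 1):
                aug[row][k] -= factor * aug[col][k]
    x = [0.0] * n
    for row in range(n - 1, -1, -1):
        tail = sum(aug[row][k] * x[k] for k in range(row + 1, n))
        x[row] = (aug[row][n] - tail) / aug[row][row]
    return x


def path_coefficients(corr, predictors, target):
    """Standardized direct-path coefficients from predictors to target."""
    r_xx = [[corr[i][j] for j in predictors] for i in predictors]
    r_xy = [corr[i][target] for i in predictors]
    return solve_linear(r_xx, r_xy)


# Hypothetical correlation matrix for X1, X2, Y (illustrative values only).
M = [
    [1.0, 0.5, 0.6],
    [0.5, 1.0, 0.4],
    [0.6, 0.4, 1.0],
]
p_yx1, p_yx2 = path_coefficients(M, predictors=[0, 1], target=2)

# Each correlation with Y decomposes into a direct effect plus an
# indirect effect through the correlated predictor.
r_x1y = p_yx1 + M[0][1] * p_yx2
r_x2y = p_yx2 + M[0][1] * p_yx1
```

The last two lines recover the original correlations with Y, which is the decomposition into direct and indirect effects that a path diagram makes visible. A model with endogenous intermediate variables would repeat the same solve once per endogenous variable.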

Relations among variables (Genes) and Output
The results of a path analysis model are path coefficients, which are continuous values. When the target variable is binary, such as the lung cell state, the output of the path analysis must be converted into the probability that the target belongs to one category. Logistic regression is therefore used to convert the continuous output into a categorical one.

Logistic regression (general Model)
Logistic regression is a regression model for Bernoulli-distributed dependent variables. It is a linear model that uses the logit as its link function, and it has been used extensively in the medical and social sciences [12,13]. If $x_1, \ldots, x_k$ are a collection of independent variables and y is a binomial outcome variable with probability of success p, then the multiple logistic regression model is given by:

$$\ln\left(\frac{p}{1-p}\right) = \alpha + \beta_1 x_1 + \cdots + \beta_k x_k$$

where $\alpha$ is the intercept and $\beta_1, \ldots, \beta_k$ are the regression coefficients.
The logit model takes the form:

$$\operatorname{logit}(p_i) = \ln\left(\frac{p_i}{1-p_i}\right) = \alpha + \beta_1 x_{1,i} + \cdots + \beta_k x_{k,i}$$

where $i = 1, 2, \ldots, n$; n is the number of samples; $p_i = \Pr(y_i = 1)$; and $y_i$ is the target variable. Fig. 2 shows the logistic function, with y on the horizontal axis and f(y) on the vertical axis.
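To make the relationship between the logit and the logistic function concrete, the following sketch converts a linear predictor into a probability and back. It assumes the standard logistic function 1 / (1 + exp(-z)); the coefficient values and the sample are hypothetical and are not fitted to any data.

```python
import math


def logistic(z):
    """Map a real-valued linear predictor z to a probability in (0, 1)."""
    # Two branches keep exp() from overflowing for large |z|.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)


def logit(p):
    """Inverse of the logistic function: the log-odds of probability p."""
    return math.log(p / (1.0 - p))


def predict_probability(x, alpha, betas):
    """Pr(y = 1 | x) under logit(p) = alpha + sum(beta_j * x_j)."""
    z = alpha + sum(b * xj for b, xj in zip(betas, x))
    return logistic(z)


# Hypothetical intercept, coefficients, and sample (not fitted values).
p = predict_probability(x=[1.2, -0.4], alpha=0.1, betas=[0.8, -1.5])
```

Applying `logit` to the returned probability gives back the linear predictor, which is why the model is linear on the log-odds scale even though the probability itself is bounded.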

Pearson Correlation Coefficient for Measure Genes Similarity
Pearson's correlation coefficient measures the strength and direction of the linear relationship between two variables and takes values from -1 to +1. If the variables are perfectly linearly related by an increasing relationship, the coefficient takes its maximum value, +1. If they are perfectly linearly related by a decreasing relationship, it takes the value -1. A value of 0 indicates that the variables are not linearly related. In general, a correlation coefficient greater than 0.8 indicates a strong correlation between the variables.
Let X and Y be interval or ratio variables that are normally distributed and whose joint distribution is bivariate normal. Pearson's correlation coefficient is then:

$$r = \frac{n\sum XY - \sum X \sum Y}{\sqrt{\left[n\sum X^2 - \left(\sum X\right)^2\right]\left[n\sum Y^2 - \left(\sum Y\right)^2\right]}}$$

where $\sum X$ is the sum of all the X scores, $\sum Y$ is the sum of all the Y scores, $\sum X^2$ is the sum of the squared X scores, $\sum Y^2$ is the sum of the squared Y scores, $\sum XY$ is the sum of the products of each X score with its associated Y score (also called the cross product), and n is the number of pairs of data [14].
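The computational form of Pearson's coefficient, built from the five sums described above, can be sketched in a few lines. The function name and the sample values are illustrative assumptions; the study itself used SPSS for this calculation.

```python
import math


def pearson_r(xs, ys):
    """Pearson's correlation from the raw sums of X, Y, X^2, Y^2 and XY."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two equally long sequences of at least 2 pairs")
    sum_x = sum(xs)
    sum_y = sum(ys)
    sum_x2 = sum(x * x for x in xs)
    sum_y2 = sum(y * y for y in ys)
    sum_xy = sum(x * y for x, y in zip(xs, ys))  # the cross product
    numerator = n * sum_xy - sum_x * sum_y
    denominator = math.sqrt(
        (n * sum_x2 - sum_x ** 2) * (n * sum_y2 - sum_y ** 2)
    )
    return numerator / denominator


# Illustrative values only: five pairs of scores.
r = pearson_r([1, 2, 3, 4, 5], [2, 1, 4, 3, 5])
```

A sequence with zero variance makes the denominator zero, so constant expression profiles would need to be filtered out before calling this function.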

Lung Cancer Microarray Dataset
The dataset used in this study consists of primary lung cancer specimens and was obtained from the European Bioinformatics Institute (EBI), where it is available online at http://www.ebi.ac.uk/arrayexpress/experiments/E-GEOD-31908. The dataset, published on 2011-09-05, contains 40 normal samples and 60 primary lung cancer samples. Each sample contains 22646 genes.

Proposed Model to Constructing Genes Network
As mentioned above, a microarray dataset contains thousands of genes that are unrelated to the disease, so the full set of genes cannot be handled in a single network. In addition, redundant genes provide no additional information. For simplicity, we therefore work with the microarray signature (the most important genes).
The proposed gene network is constructed as follows:
1. Extract the microarray signature (the subset of the 10 top-ranked genes), ranked by the Information Gain Ratio (IGR) criterion.
2. Assign the gene with the highest IGR value as the focal gene.
3. Calculate Pearson's correlation coefficient (PCC) between the focal gene and each of the other genes. Only genes whose absolute PCC value exceeds the threshold (0.7) are linked to the focal gene.
4. Assign the gene with the next-highest IGR value as the focal gene and return to step 3.
Finally, a directed network is created from these gene sets. In the network, the vertex set $V = \{g_1, g_2, \ldots, g_n\}$ represents the genes, and $E = \{e_{ij} : |P_{ij}| > T\}$ is the edge set that represents the relationships between the genes, where T is the given PCC threshold and $P_{ij}$ is the PCC value of $g_i$ and $g_j$. Fig. 3 shows the gene network.
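The construction procedure can be sketched as a loop over focal genes in rank order. This is a minimal sketch under stated assumptions: the genes are already ranked by IGR, each edge is directed from the focal gene to a lower-ranked gene so that no pair is linked twice (the text does not specify the direction), and the gene names and expression values are invented for illustration.

```python
import math


def pearson_r(xs, ys):
    """Pearson's correlation in mean-centered form."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


def build_gene_network(expression, ranked_genes, threshold=0.7):
    """Return directed edges (focal, other, pcc) with |pcc| > threshold.

    expression:   dict gene -> list of expression values, one per sample.
    ranked_genes: gene names ordered from highest to lowest IGR.
    """
    edges = []
    for idx, focal in enumerate(ranked_genes):
        # Assumption: link the focal gene only to lower-ranked genes.
        for other in ranked_genes[idx + 1:]:
            pcc = pearson_r(expression[focal], expression[other])
            if abs(pcc) > threshold:
                edges.append((focal, other, pcc))
    return edges


# Hypothetical expression profiles for three genes over five samples.
expression = {
    "g1": [1.0, 2.0, 3.0, 4.0, 5.0],
    "g2": [2.0, 4.0, 6.0, 8.0, 10.5],
    "g3": [5.0, 1.0, 4.0, 2.0, 3.0],
}
edges = build_gene_network(expression, ranked_genes=["g1", "g2", "g3"])
```

With these toy profiles only g1 and g2 are linked, because g3 is weakly correlated with both. Lowering the threshold adds edges and raising it removes them, so the choice of 0.7 directly controls how dense the resulting path diagram is.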

Practical Constructing Genes Network
The subset of the 10 top-ranked genes according to the information gain ratio criterion [3] was extracted with the ELBRAV algorithm [15]. This subset of genes gave the highest classification accuracy on the majority of cancer microarray datasets.
The SPSS 16.0 software package [16] was used to calculate Pearson's correlation, which measures the similarity among the genes and between the genes and the lung cell state. Genes with a strong correlation coefficient have the same function and are linked to the same biological pathway [17]. Genes whose absolute Pearson's correlation value is greater than 0.7 are therefore linked together. The 10 top-ranked genes extracted from the dataset are labeled as they appear in the microarray. Fig. 3 shows the path diagram (gene network) constructed from the Pearson's correlation values.
In the path diagram of Fig. 3, the rectangles represent genes (variables) and the directed arrows represent the links between them (an interaction between two genes or the effect of a gene on the lung cell state).

Model Hypothesis
The gene network generated from the subset of top-ranked genes is taken as the null path analysis model (cf. Fig. 3). The main question is whether the gene network fits the subset of the lung cancer microarray dataset.
- H0: The gene network is the null path model of the hypothesized causal pathways among the measured gene expression profiles and between these profiles and lung cancer.
- R: Does the gene network fit the subset of the lung cancer microarray dataset?

Hypothesis Testing
The path coefficients (regression weights) of the model in Fig. 3 were estimated with the maximum likelihood estimation function (MLF) [18], which is available in the AMOS (v16) software package [19]. The resulting path diagram shows the standardized path coefficients among the genes and the lung cell state, calculated with the MLF.

Model Properties and Evaluation
The properties of the model are as follows. The model contains several path coefficients that are not significantly different from zero at the 0.05 significance level; these paths are listed in Table 1.
Because the path diagram contains several paths that are not significantly different from zero at the 0.05 level, it was simplified by eliminating these paths and the variables (genes) that do not significantly affect the lung cell state. The result is the modified (simplified) path diagram.

In the tables, SE is the standard error of the regression weight, CR is the critical ratio for the regression weight, and P is the probability value.
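The relationship between these three quantities, and the way they drive the elimination of paths, can be sketched as follows. The sketch assumes the usual definitions: the critical ratio is the regression weight divided by its standard error, and P is the two-sided probability of a ratio at least that large under a standard normal approximation. The path names, estimates, and standard errors are hypothetical and are not the values reported in the tables.

```python
import math


def critical_ratio(estimate, std_error):
    """CR = regression weight divided by its standard error."""
    return estimate / std_error


def two_sided_p_value(cr):
    """Two-sided p-value of CR under a standard normal approximation."""
    return math.erfc(abs(cr) / math.sqrt(2.0))


def prune_paths(paths, alpha=0.05):
    """Keep only the paths whose weight differs significantly from zero."""
    kept = {}
    for name, (estimate, std_error) in paths.items():
        p = two_sided_p_value(critical_ratio(estimate, std_error))
        if p < alpha:
            kept[name] = (estimate, std_error)
    return kept


# Hypothetical paths: name -> (estimate, standard error).
paths = {
    "gene_a -> state": (0.42, 0.10),  # CR = 4.2
    "gene_b -> state": (0.05, 0.08),  # CR = 0.625
}
kept = prune_paths(paths)
```

Only the first path survives at the 0.05 level. After pruning, the remaining coefficients have to be re-estimated on the simplified model, because removing a path changes the estimates of the paths that stay.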
The path coefficients of the simplified path diagram were then re-estimated. In the simplified model, all variables (genes) significantly affect each other as well as the lung cell state. The goodness of fit of the simplified model is GFI = 0.832, better than that of the original model. The path coefficients (regression weights) of the simplified path diagram are listed in Table 2, and Table 3 presents the total effects in the simplified model. In summary, the gene network generated from the subset of top-ranked genes fits the data with GFI = 0.821. After the network was simplified by eliminating the insignificant path coefficients, the resulting model fits the subset of the dataset with GFI = 0.832, better than the original model.
From the two models, the interactions between the candidate genes were estimated; the standardized path coefficients quantify these interactions.

Single Decision Tree Model vs. Proposed Path Analysis Model
In addition to measuring the effect of the variables (genes) on the target variable (lung cell state), the path analysis model can be used as a classification model, as the following comparison shows.
The output of a traditional classification model such as a single decision tree (C4.5) is a set of rules that partition the dataset according to the most informative attributes, and test samples are classified according to these rules. In the practical example shown in the decision tree figure, the microarray dataset is the input of the C4.5 classifier and the output is a set of rules. In the testing stage, the test samples are the input of the rules and the output is the category of each new sample.
The Pearson correlations between the genes that appear in the decision tree model and the lung cell state were calculated, as shown in Table 1. The most informative gene (225721_at) is also the most highly correlated with lung cancer, but the other genes in the model (225035_x_at and 200077_s_at) are uncorrelated with lung cancer. These genes are also uncorrelated with the most informative gene (225721_at), which means that they do not interact with it. We conclude that C4.5 built its model from uncorrelated, non-interacting genes. In the path analysis model, by contrast, the gene network is first constructed from correlated, interacting genes (cf. Fig. 3). The path analysis model allows the path coefficients of all paths in the model to be estimated and these effects to be visualized.
The value of the path analysis model is that, in addition to estimating the regression weights of all paths simultaneously, it works as a classifier by converting the output (target) value into the range [0, 1], which represents the probability that the target belongs to one of the categories. The lung cell state is calculated by equation (5) for the original path analysis model and by equation (6) for the simplified path analysis model.
When logistic regression was applied to obtain the probability that new samples are tumor or normal, path analysis modeling gave 95% accuracy with both the original model and the simplified model, versus 90% accuracy for the single decision tree model. Notably, the misclassified samples were the same in the two path analysis models. In this case, path analysis modeling can therefore be used for attribute reduction.
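The classification step can be sketched end to end: standardize each gene, combine the standardized values with the effects of the genes on the target, pass the result through the logistic function, and compare the thresholded probabilities with the known states. Everything specific here is an assumption made for illustration: the gene names, the expression values, the weights (which stand in for estimated effects), the zero intercept, and the 0.5 cutoff. The sketch does not reproduce equations (5) and (6) or the reported accuracies.

```python
import math
import statistics


def standardize(values):
    """Rescale one gene's expression values to zero mean and unit variance."""
    mean = statistics.fmean(values)
    sd = statistics.pstdev(values)  # population standard deviation
    return [(v - mean) / sd for v in values]


def classify(expression, weights, intercept=0.0, cutoff=0.5):
    """Return (probabilities, labels) for every sample.

    expression: dict gene -> list of raw values, one per sample.
    weights:    dict gene -> effect of that gene on the target (hypothetical).
    """
    z = {gene: standardize(vals) for gene, vals in expression.items()}
    n_samples = len(next(iter(z.values())))
    probs = []
    for i in range(n_samples):
        score = intercept + sum(w * z[g][i] for g, w in weights.items())
        probs.append(1.0 / (1.0 + math.exp(-score)))
    labels = [1 if p >= cutoff else 0 for p in probs]
    return probs, labels


def accuracy(predicted, actual):
    """Fraction of samples whose predicted label matches the actual one."""
    hits = sum(p == a for p, a in zip(predicted, actual))
    return hits / len(actual)


# Toy data: 6 samples, 2 hypothetical genes; 1 = tumor, 0 = normal.
expression = {
    "gene_a": [1.0, 2.0, 3.0, 7.0, 8.0, 9.0],
    "gene_b": [5.0, 4.0, 6.0, 5.0, 4.0, 6.0],
}
weights = {"gene_a": 2.0, "gene_b": 0.1}  # illustrative, not estimated
true_state = [0, 0, 0, 1, 1, 1]

probs, predicted = classify(expression, weights)
acc = accuracy(predicted, true_state)
```

Because a gene with a small weight barely moves the score, dropping it leaves the predicted labels unchanged in this toy case, which mirrors the observation that the original and simplified path models misclassified the same samples.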

Conclusions
We presented a new approach to gene inference based on a gene-gene interaction network constructed from gene expression data. A real lung cancer microarray dataset was used to illustrate the approach.
In this approach, information gain ratio attribute evaluation is used to extract the signature of a cancer microarray dataset, and Pearson's correlation is used to measure the similarity between genes. The gene network is constructed according to the strength of the correlation between a focal gene and the other candidate genes: highly correlated genes are linked together and, in theory, perform the same function. The effect of each gene on lung cancer (its path coefficient) is then estimated with a path analysis model.
The value of this approach is that it not only tackles the measurement problem by path analysis but also provides a visualization of the relations among genes. In addition to being easy to use, the approach effectively addresses the gene reduction problem. For instance, genes with approximately equal information gain ratio or Pearson's correlation values do not have the same path coefficient on lung cancer in the path analysis model. The reason is that gene reduction and correlation methods consider each gene and the disease label individually, whereas path analysis modeling analyzes all genes simultaneously.
Path analysis modeling can be used as a classifier, but it needs additional steps such as data standardization and a logistic regression function. It then gives the probability that the target (class label) belongs to one class, which is an additional advantage over other classification models. In this study, path analysis gave 95% classification accuracy, versus 90% for C4.5.
The approach provides a path diagram that explains the direct and indirect effects of each gene on lung cancer and shows which genes have significant and insignificant effects on lung cancer. The maximum likelihood estimation function is used to calculate the path coefficients. Hypothesis testing was used to measure the goodness of fit of the proposed path diagram, which fits the subset of the microarray dataset with a Goodness of Fit Index (GFI) of 0.832.