BiRange:An Efficient Framework for Biclustering of Gene Expression Data Using Range Bipartite Graph

Biclustering is a vital data mining tool which is commonly employed on microarray data sets for analysis task in bioinformatics research and medical applications. There has been extensive research on biclustering of gene expression data arising from microarray experiment. This technique is an important analysis tool in gene expression measurement, when some genes have multiple functions and experimental conditions are diverse. In this paper, we introduce a new framework for biclustering of gene expression data. The basis of this framework is the construction of a range bipartite graph for the representation of 2-dimensional gene expression data. We have constructed this range bipartite graph by partitioning the set of experimental conditions into two disjoint sets. The key benefit of this representation is that, it leads to a compact representation of all similar value ranges between experimental conditions. Based on this problem formulation, an efficient algorithm is proposed that searches for constrained maximal cliques in this range bipartite graph, in order to extract a set of biclusters. Our technique is scalable to practical gene expression data and can produce different types of biclusters amid noise. The experimental evaluation of this technique also reveals its accuracy and effectiveness with respect to noise handling and execution time in comparison to other similar techniques.


Introduction
Clustering is co mmonly used to reveal biological meaningful pattern in data arising fro m microarray experiment, which is called gene expression data. Further, it is the most common tool for interaction identification, as similar objects form a cluster. Clustering is unsupervised classification [1], also known as cluster analysis, which discovers grouping(s) of a set of patterns, objects or po ints. It is prevalent in any discipline that involves interaction identification. Unfo rtunately, clustering is difficult for most data sets due to its diverse shapes, densities, sizes and background noise. Gene exp ression data are usually arranged in a 2-d imensional mat rix, where rows represent genes and colu mns represent samp les or experimental cond it ions. Each element of this matrix represents the expression level of a gene under a specific condition, and is represented by a real number. In this case, genes that are similar may share a common b iological pathway and the groupings of predictivegenes can be of interest to biologist. Conventional clus ter ing techniques are based on similarity between genes across all samples or experimental conditions. However, genes may be co regulated under some specific experimental conditions and shows weak similarity beyond these conditions. Therefore, a group of genes forms cluster under a subset of conditions. This technique of t wo-way clustering referred to as biclustering, in wh ich both genes and conditions are clustered simultaneously.

Related Work
Several biclustering algorith ms have been proposed in different application scenarios but we concentrate on graph theoretic approaches. In order to extract b iclusters, these algorith ms usually employ heuristic or probabilistic model. An illustrative discussion on many of these algorith ms can be found in [4,5].
Cheng and Church [2] identify b iclusters with the help of mean squared residue score, wh ich is a measure of the coherence of rows and columns in the bicluster. Here the user has to input a value of mean residue score δ and the number of biclusters to be extracted. This method involve several iterations and each iteration produce only a single bicluster while prev iously identified biclusters are masked with random values. Ho wever they did not address the issue of noisy data, where as in this paper we concentrate on noisy data.
Tanay et al. [6] introduced SAMBA, in which the data are modelled as a bipartite graph with genes corresponding to vertices in one b ipartition and samples corresponding to vertices in other bipartition, where edges representing significant changes in expression. Edges and non-edges are weighted by likelihood scores derived from a probabilistic model for the b ipartite graph. A bicluster is defined as a heavy subgraph, where the weight of the subgraph is the sum of the weights of the corresponding edges and non-edges. It repeatedly finds the maximal h ighly-connected subgraph in the bipartite graph and perfo rms local imp rovement by adding or deleting a single vertex until no further imp rovement is possible. In order to avoid exponential runtime, they assumed that row vertices have d-bounded degree. However, our technique can handle graphs of arbitrary degrees.
Ahsan and Amir [3] identify biclusters by recursively removing noise with the help of crossing min imization technique. This method is based on binary representation of the bipartite graph corresponding to input data matrix. It is difficult to produce coherent biclusters, as this method use a static discretization of the input data mat rix.
Waseem and Asfaq [7] proposed cHawk, to identify b iclusters with the help of crossing min imization paradig m. This method employs the barycenter heuristic to arrange vertices in both layers of a bipartite graph. The similarity test is done based on bregman divergence. This approach is similar to our approach as we also employ bipartite graph for representation of gene expression data. The time co mp lexity of this technique is ( ) , where and are the number of rows and co lu mns of the input data matrix and d is the average degree of overlap among biclusters, which is slower than our approach.
Wang and Liu [8] proposed RMSBE, wh ich can identify optimal square biclusters with the maximu m similarity score. This method performs mu ltip le scans of the data matrix in order to co mpute similarity score, reference gene identification and bicluster identification. The t ime co mp lexity of this technique is ( ( + ) 2 ), where is number of ro ws and is number of co lu mns. Due to this cubic nature of complexity, it is not feasible for very high dimensional data.
Prelic et al. [9] proposed BiMax, which can identify constant biclusters. This method discretize the input expression matrix into a binary matrix based on a threshold value. Therefore it is difficult to identify coherent biclusters.
Berg mann et al. [10] proposed the iterative signature algorithm (ISA) that uses gene signatures and condition signatures in order to extract b iclusters with both up and down-regulated expression values. They identify several transcription modules (biclusters) by executing the algorithm on reference gene sets. The reference gene sets needs to be carefully selected for ext raction of good quality biclusters.
Zhao and Zaki [11] proposed Tricluster, for min ing coherent clusters in 3-dimensional gene expression data sets. They construct a range mult igraph and then searches for constrained maximal cliques in this multigraph, in o rder to extract a set of b iclusters. Ho wever, they do not address the issue of noisy data, where as our approach can effectively handle noisy data.

Let
= { 0 , 1 , ⋯ , −1 } be a set of genes and = { 0 , 1 , ⋯ , −1 } be a set of experimental conditions. Microarray data-set is a real valued × expression matrix = × = � �where ∈ [ 0, − 1 ] , ∈ [ 0, − 1 ] and each entry corresponds to the logarithm of the relative abundance of mRNA of a gene under a specific experimental condition . A bicluster corresponds to a sub matrix that exhibits some coherent tendency. Let be a sub matrix of dataset i.e Bicluster = × = � � where ⊆ and ⊆ , provided certain conditions of homogeneity are satisfied. We define the volume or size of a bicluster as the number of elements , such that ∈ and ∈ . Let be the set of all biclusters that satisfy the given homogeneity conditions, then ∈ is called a maximal b icluster iff there doesn't exist another bic luster ′ ∈ such that ⊂ ′ .
will be a valid bicluster iff it is a maximal bicluster satisfying the following conditions: be the geometric mean between two specified colu mn values for a given row and = ∑ be the weight of the row for this specified two column values. Let us consider = × � − � and = × � − � be the weighted difference of two colu mn values for a given row and respectively. We need that � , � − min� , � ≤ ; where is the mult iple of maximu m weight in the corresponding gene-set i.e × max ( ).
We also need that | | ≥ and | | ≥ , where and denote minimu m cardinality thresholds for each dimension.
We consider an edge as valid, when the weighted difference range for a condition pair satisfies the threshold value so that it generates a gene-set. In order to produce large enough clusters, the minimu m size constraints i.e , and are imposed.
is a scaling b icluster if = and = ; and − ≤ , where is a constant multiplicative factor.
is a shifting bicluster iff = + and = + ; and . Similarly is a constant column bicluster if = + or = × , where is a typical value within the b icluster; and are adjustment for ro w and Gene Expression Data Using Range Bipartite Graph column respectively. is overlap bicluster if is the sum or product of the contribution of different biclusters to which they belong.
Bipartite Graph: A graph ( , ) is called Bipartite if its vertex set can be decomposed into two disjoint subsets 0 and 1 (i.e. = 0 ∪ 1 ) such that every edge in joins a vertex in 0 with a vertex in 1 (i.e. 0 ∩ 1 = ∅).

Preprocessing
Gene expression data is usually noisy and may contain missing values. Illustrative discussion on prediction of missing value can be found in [12,13]. Therefore it is essential to condition the data before applying the clustering algorithm. In order to handle missing values, we have adopted the approach used in [14] i.e. replacing all missing values by zero. Before normalizing the dataset, data beyond a threshold value (three standard deviation), has been temporarily removed to reduce the effect of outliers in the data. Then, the gene expression data is transformed using z-score standardizat ion, where the transformed variables have a mean of 0 and variance of 1. Finally, the temporarily removed outliers that are below the mean value are replaced by the minimu m value, where as the outliers above the mean values are replaced by the maximu m value of the final normalized data. In order to handle outlier efficiently, we have partitioned the normalized condition data into unequal length intervals based on mean value. The motivation of considering unequal length intervals is due to the ineffectiveness of equal length intervals for extreme outlier values. Also the decision of number of intervals may lead to inappropriate interval boundaries, as it does not depend on the properties of data [15]. In this approach, we partition the condition column data into two halves with the mean value. Then, recursively each half is part itioned again into two halves with its own mean. Th is process proceeds until each condition colu mn has been partitioned into required nu mber of intervals. The number o f intervals for each condition depends on the data size . Further, to have balanced partition, it is assumed that = 2 , where is a positive integer and 2 × 35 ≤ , where 35 is the minimu m sample size for large sample procedures [16]. The values within the intervals are then smoothed by interval means. We found that this way of partitioning is very effective and can deal with outliers efficiently.

Constructi ng Weighted Range Bi parti te Graph
For a given dataset , the minimu m size threshold, and , and the maximu m weighted difference threshold , let and be any two condition columns of and let = × | − | be the weighted difference of the expression values of gene in colu mns and such that < , where ∈ [0, − 1]. In o rder to incorporate the idea of mutual importance between two colu mns, we have computed the weight of all rows for specified co lu mn pairs. A difference range is defined as an interval of d ifference values [ , ℎ ] , with < ℎ . Let ([ , ℎ ]) = { : ∈ [ , ℎ ]} be the set of genes, whose difference w.r.t. co lu mns and lie in the given weighted difference range. A difference range is called valid iff max ( ℎ , ) − min ( ℎ , ) ≤ , where is the mu ltip le o f maximu m row weight in the corresponding gene set. Normally, fo r microarray experiment data, genes and conditions are represented by 1 and 2 vertex sets respectively, and the edge weight represents the response of ℎ gene to ℎ condition. Ho wever, in order to have a very compact representation, in this paper, we construct the weighted undirected bipartite graph by partitioning the condition set into two disjoint sets called upper layer ( 1 ) and lower layer ( 2 ). The conditions that do not have any data values are not considered in the formation of disjo int sets. Here, each edge in the range bipartite graph has associated with it the rank and gene-set corresponding to the weighted difference range on that edge. Different bipartite graphs emerged for different threshold value, which is the mult iple of maximu m weight value in the corresponding gene-set. Consequently, we will have different types of biclusters. We have given priority to all valid edges that have large number of genes in the gene-set, by assigning rank ( ) to these edges. Ranks have been assigned on the basis of total nu mber of genes in the gene-set i.e. | ([ , ℎ ]) | . The gene-sets having highest cardinality have been assigned rank 1, the second highest rank 2, the third highest rank 3 and so on. The inclusion and deletion of edges depends upon the value of and order in wh ich we process these ranks and conditions. Let ′ and ′ be the set all valid edges and gene-sets respectively, in ascending order of ran k values.   (d) Weighted Difference between c0 and c1 Here we may have overlapping of different ranges. The algorith m for partit ioning condition set into 1 and 2 , for construction of bipartite graph is given in Algorithm I. Fro m Table 1, a maximal weighted range bipartite graph is constructed ( Figure 2). Let ′ be the set of columns with missing value in each ro w. Let ′′ be the set of conditions such that ′′ = − ′. We have taken weight of the edge as rank for the corresponding gene-set. This algorith m g ives priority to the edges having large rank in order to compensate the deletion of few valid edges, while part itioning the condition set into two disjoint sets. If there is a tie among rank values, then we randomly select an edge and start constructing the bipartite graph. As we deal with noisy data, additive and multip licat ive methods of finding clusters may not always lead to good results. Therefore, instead of comparing two colu mn values independently [11], we have computed weight of each row for any two specified co lu mn values. We build bipartite graph model of data, after properly conditioning the input data.

Experi mental Setup
We have implemented our proposed algorithm in C++ under windows environment on a computer with configuration of Core 2 Duo 2.2 GHz of CPU and 3 GB RAM. The accuracy and performance of this algorithm is evaluated using synthetically generated dataset and real dataset. For synthetic data generation, a technique parallel to a methodology proposed in [9] is adopted whereas for real dataset, the model organis m Yeast Saccharomyces Cerevisiae dataset is considered, since the yeast GO annotations are more extensive compared to other organisms. This dataset is provided by Gasch et al. [17], which contains 2,993 genes and 173 different stress conditions.

Bicluster Extracti on
The weighted range bipartite graph is constructed by partitioning the set of experimental conditions ′′ into two disjoint sets and . Each edge of this graph is associated with a rank value and the corresponding gene-set. The algorith m for ext raction of b iclusters fro m this graph by emp loying depth first search technique is given in Algorithm II. This algorith m requires the value of , and the weighted bipartite graph ′ as its input parameter. The output of this algorithm is a set of biclusters and the number of such biclusters depends upon the dataset.

Eval uation
For evaluation purpose, we need to co mpare our algorithm with other biclustering algorith ms. As different biclustering algorith ms deals with different problem fo rmulat ions and clustering criteria, it is difficult to have a co mparative study among such algorith ms. Therefo re, these algorithms work efficiently in certain situation and perform poorly in others. In view of this problem, we need to provide a common setting for such algorithms so that we can perform a fair co mparative study. Our main focus lies on validating our proposed algorith m for extraction o f constant, coherent and overlapped biclusters from noisy gene expression data with high accuracy in comparison to other similar algorith ms. In view co mparison, similar algorith ms like CC [2], ISA [10], cHawk [7], SAMBA [6], RMSBE [8] and BiMax [9] are considered. We have used the Bicluster Analysis Toolbox (BicAT) developed by Prelic et al. [18] for implementation of BiMax, CC and ISA. Further, for imp lementation of SAMBA, EXPANDER developed by Maron-Katz et al. [19] is used and RMSBE imp lementation was downloaded fro m [20].

Co mp lexity Analysis
Since we require to evaluate all pair of conditions, compute the weight of each gene, and find the weighted difference range to get the corresponding gene-set, the range bipartite graph construction step would take time ( 2 ). Here, we have considered the experimental conditions as vertices for the bipartite graph. These vertices are partitioned into two disjoint sets on the basis of rank of valid edges. This way of constructing bipartite graph leads to deletion of some valid edges and as a result, fewer edges need to be processed in co mparison to other graph approach for solving the same problem. Therefore, the running time can be significantly reduced for this representation of gene expression data using a bipartite graph. The bicluster extraction step depends on the value of input parameters and datasets. This step is most expensive as there can be exponential nu mber of clusters. Since the experimental conditions are only considered as vertices, and so me uninteresting edges may be pruned, the depth of the search in weighted range bipartite graph to extract biclusters is likely to be small in co mparison to mu ltigraph representation [11].

Synthetic Dataset
In order to evaluate implanted constant, coherent and overlap biclusters in synthetic data, we have used the technique proposed by Zimmermann et al. [9]. Fo r constant bicluster generation, we adopt the following steps: a. Generate a 100 × 100 matrix A with all elements 0 b. Generate ten biclusters (modules) of size 10 × 10 with all elements 1 c. Replace elements of biclusters with random noise val-ues fro m uniform d istribution (-σ,σ)

d. Implant the ten biclusters into A without overlap
For all experimentation, we set the noise level range fro m 0.0 to 0.25. In case of overlapping biclusters, we used 10 degrees of overlap ( = 0, 1,2,3,4,5,6,7,8,9) , where the size o f matrix and bicluster vary fro m 100 × 100 to 110 × 110 and fro m 10 × 10 to 20 × 20, respectively. The steps for evaluation of coherent biclusters are same as that of constant bicluster, but rows and columns in a bicluster have a 0.02 increasing trend. The parameter setting for different algorith ms is shown in Tab le 2. In order to validate the accuracies of different algorithms, we apply the gene match score proposed by Zimmermann et al. [9]. Let 1 and 2 be two sets of biclusters. The match score of 1 with respect to 2 is given by: where and are set of genes and a set of conditions in a bicluster respectively. Let represent the set of implanted biclusters and be the set of output biclusters of an algorith m. The score ( , ) represents the degree of similarity between extracted biclusters and the imp lanted biclusters, where as the score ( , ) represents how well each of the true biclusters extracted by the bicluster algorith m. Figure 3 illustrates the experimental results with respect to the accuracy evaluation of constant biclusters. As per the experimental results, in case of high noise level for extraction of constant biclusters; BiRange along with cHawk, ISA and RMSBE shows high accuracies; BiMax and SAM BA perform moderately, and CC perform poorly. Figure 4 illustrates the experimental results with respect to the accuracy evaluation of coherent biclusters. For coherent biclusters, BiRange has a co mparable accuracy with RMSBE and cHawk. Figure 5 illustrates the experimental results with respect to the accuracy evaluation of overlapped biclusters. In case of overlapped biclusters, BiRange is marg inally affected by the overlap degree of the implanted biclusters.

Real Dataset
For real gene expression dataset, provided by Gasch et al. [17], the performance of the proposed technique BiRange is evaluated with respect to other similar algorith ms based on the methodology used by Zimmermann et al. [9]. In order to evaluate extracted biclusters for their enrich ment level based on Gene Ontology (GO) annotations [21], a web tool called FuncAssociate [22] is also used. The adjusted significance scores (α) were computed using FuncAssociate and is shown in Figure 6. The experimental results for BiRange is compared with other algorith ms like BiMax, RMSBE, cHawk, SAM BA, ISA and CC based on this significance score. In this section the performance of the proposed BiRange algorith m is analy zed. We have synthetically generated datasets with sizes ranging from 2000 × 100 to 100000 × 500 and imp lant constant biclusters in this matrix. Figure 7 illustrates the performance of BiRange, cHawk and RMSBE with respect to execution t ime for different size o f dataset. As per our complexity analysis, the execution time of Bi-Range increases approximately linearly with the number of genes in a cluster, wh ile execution time for RMSBE increases at a much higher rate. Th is confirms the practical applicability of our proposed algorith m.

Conclusions
We have represented the gene expression data using a weighted range bipartite graph, by computing the weight of each gene for a specified condition pair. For the construction of bipartite graph, the set of experimental conditions are partitioned into two disjoint sets based on the rank of valid edges. The ran k value is co mputed fro m the cardinality of the corresponding gene-set, which is associated with the valid edge. The motivation of considering conditions as vertices of the bipartite graph is due to large number of genes in real gene expression data. The proposed algorithm has been evaluated for synthetic as well as real microarray data. The experimental results reveal the effectiveness of our approach over other biclustering approaches with respect to time.