Autotuned Multilevel Clustering of Gene Expression Data

DNA microarray technology has revolutionized biological and medical research by enabling biologists to measure expression levels of thousands of genes in a single experiment. Different computational techniques have been proposed to extract important biological information from the massive amount of gene expression data generated by DNA microarray technology. This paper presents a top down hierarchical clustering algorithm that produces a tree of genes called GERC tree (GERC stands for Gene Expression Recursive Clustering) along with the generated clusters. GERC tree is an ample resource of biological information about the genes in an expression dataset. Unlike dendrogram, a GERC tree is not a binary tree. Genes in a leaf node of GERC tree represent a cluster. The clustering method was used with real-life datasets and the proposed method has been found satisfactory in terms of homogeneity, p value and z-score.


Introduction
DNA microarray technology enables the biologists to monitor expression levels of thousands of genes in a single microarray experiment. There is a high demand of computational techniques to operate on the massive amount of expression data generated by DNA microarray technology to extract important biological informat ion. Due to the large number of genes and complex gene regulat ion networks, clustering is a useful exp loratory technique for analyzing such data. It groups data of interest into a nu mber of relatively homogeneous groups or clusters where the intra-group object similarity is min imized and the inter-group object dissimilarity is maximized. Problems of automatically classifying data arise in many areas, and hierarchical clustering can be a very good approach in certain areas such as gene expression data analysis because it can present a hierarchical organizat ion of the clusters.
Ext ractin g imp o rtan t b io lo g ical kn o wled g e fro m biological data is a difficult task. One very useful approach for prov iding ins ight into the gene exp ression data is to organize the genes in a hierarchy of classes, where genes in a class are more similar co mpared to its ancestor classes in the hierarchy. In this paper, we present a polythetic div isive hierarchical clustering algorithm that operates in two distinct steps. The first step generates a number of init ial clusters and in the second step, these initial clusters are further processed to fo rm a set o f fin er clusters . The algorith m advances clustered dataset. The method is much favored by many biologists and has become one of the most widely-used tools in gene expression data analysis. However, it suffers from lack of robustness, i.e., a small perturbation of the dataset may great ly change the structure of the hierarchical dendrogram. DHC is a popular density based clustering algorith m. DHC is developed based on 'density' and 'attraction' of data objects. In the first level, an attraction t ree is constructed to represent the relationship between the data objects in the dense area which is later summarized to a density tree. Another approach splits the genes through a divisive approach, called the Determin istic-Annealing Algorith m (DAA) [6]. Hierarchical clustering not only groups together genes with similar exp ression patterns but also provides a natural way to graphically represent the dataset allowing a thorough inspection. However, like UPGMA, a s mall change in the dataset may greatly affect the hierarchical dendrogram structure. Another drawback is its high computational co mplexity and vagueness of termination criteria. Biologists are not only interested in the clusters of genes, but also in the relationships (i.e. closeness) among the clusters and their sub-clusters, and the relat ionship among the genes within a cluster. A clustering algorith m, wh ich also provides some graphical representation of the cluster structure, is much favored by the biologists. To address this issue, this paper presents a hierarchical clustering algorith m GERC that generates a tree along with the set of clusters.

The GERC Algorithm
GERC is a polythetic divisive hierarchical clustering algorith m that operates in two distinct steps. This is an extended version of the article [7] where the method was introduced. In the first step of the algorithm, an in itial cluster is formed and this init ial cluster is further processed in the second step to form finer clusters. The algorith m accepts four input parameters i.e., reference gene, step down ratio, preferred node volume and MRD threshold. However, the last three parameters can be statistically co mputed from the expression data. The technique can operate on any high dimensional numeric domain.

Data Pre-processing
Often gene data available on the web are found to contain missing values. The quality of clusters largely depends on the handling of these missing values. Apart from missing value handling, pre-processing also involves normalization and discretization.

Handling missing val ues
We used the Local Least Squares Imputation method [8] to compute missing values in the datasets. There are two steps in the local least squares imputation method. The first step is to select k genes by Pearson correlation coefficient. The second step is regression using the selected k genes to estimate the missing values.

Normalizati on
The datasets are normalized using a co mmon statistical method that converts each gene to a normal distribution with mean 0 and variance 1. This statistical method of normalizat ion is often termed as Z score normalization [9] or Mean 0 Standard Deviat ion 1 normalization.

Discretization
The normalized matrix is discretized to a mat rix by comparing a value in a colu mn with the value in the next column of the same row. The normalized matrix consists of three discrete values 1 (if the next value is larger) , -1 (if the next value is s maller) and 0 (if the next value is equal). The normalized matrix G can be converted to the discretized matrix G d in the fo llo wing manner.

Proximity Measure
In this paper, we introduce a simp lified form of mean squared residue measure, i.e., Mean Residue Distance(MRD) to find mutual distance of two genes that aids in extracting the coherent patterns in the exp ression matrix. Like mean squared residue measure, M RD is a measure that works satisfactorily to detect coherence of constant valued genes, constant row genes, constant column genes and additive genes. The significance of these correlations in clustering of gene expression data is reported in [10]. Un like MSR, M RD can operate in mutual mode i.e., it can compute correlation between a pair of genes. Next we d iscuss the mean squared residue measure and then introduce MRD measure.

Mean S quared Residue
Mean squared residue is a measure that was used to find coherent objects in a data matrix by [11]. They tried to find out a subset of genes along with a subset of conditions which has mean squared residue less than a threshold δ. They termed such subspace clusters as δ biclusters.The measure is still considered a strong one to detect coherent objects if it is used carefully. Various subspace clustering algorithms use mean squared residue directly or with a bit of modificat ion. Mean squared residue of an element a ij in gene expression matrix is given by,  Fig. 1 presents visual interpretation of MSR for two gene expressions g1 and g2. Fro m the visual interpretation, it can be clearly stated that the measure can be effectively used to find mutual d istance between two genes under a defined set of conditions. The proposed MRD measure is based on this fact. However, considering the cost effectiveness without deteriorating the cluster quality we simp lify this measure by replacing the squaring operation with a modulus or absolute operation. Apart from this, the proposed measure is used in extracting finer subspace clusters from the init ial subspace cluster in step 2 of the proposed algorith m.

Mean Residue Distance
The mean residue distance of an element a i of gene g 1 =( a 1 , a 2 ,..., a n ) with respect to another element b i of gene Where a mean is the mean o f all the elements of g 1 and b mean is the mean of all the elements of g 2 . MRD of the gene pair g 1 and g 2 with respect to a subspace of conditions λ can be computes as, Following definit ions and theorems provide the theoretical basis and soundness of the proposed measure based on [12].
Definition 1: Coherent genes: Two genes are called coherent if similarity between the two genes is more than a given threshold in terms of a part icular pro ximity measure.

GERC Tree
Our algorith m results a tree called GERC tree. The leaves of this tree present the generated clusters of the algorith m. This tree can be used to derive additional biological informat ion fro m a gene exp ression dataset. The overall structure of the tree generated for mo re than one reference gene is presented in Fig. 2.

Dendrogram versus GERC tree
A dendrogram is binary tree that presents the hierarchical structure of the clusters generated by a hierarchical algorithm. In divisive hierarchical algorith m, dendrogram is obtained by recursively splitting a node containing a set of objects into two child nodes based on the similarity among the object pairs until all nodes have a single object. Conversely, in agglomerat ive approach, the nodes containing a single object are recursively merged until all the objects are in the root node. But in our algorith m, GERC tree is obtained by recursively splitting a single node containing the set of all objects into mult iple nodes(with possibly common objects) until number o f objects in all the processing nodes is less than or equal to a user given threshold, i.e. preferred node volume. Unlike Dendrogram, GERC tree is not a binary t ree and the structure of the tree is flexib le depending on the values of the set of input parameter. The structural difference between dendrogram and GERC tree is presented in Fig. 3.   Table 2 presents a general comparison of dendrogram and GERC tree. GERC t ree conveys a mo re exhaustive set of informat ion to the users. User can look for the nodes in the tree that is flourished in terms of their functional enrich ment. Moreover, we can trace the path of a particular gene from the root node to the leaf nodes which may be useful to find its co-me mbe rs in a particular functional category. This tree can also be useful to derive relat ionship in terms of their regulatory informat ion such as co-regulation.

Proposed Algorithm: GERC
Following definit ions, symbols and notations are used to describe the GERC algorith m. Step down ratio Neighbour β(gi) Neighbours of i th gene with respect to MRD threshold β Exp(gi ) Expression pattern of i th gene presented by i th row of the discretized matrix father(η(j)) father of j th node degree β(gi) Degree of i th gene with respect to MRD threshold β η(j).threshold MRD threshold of j th tree node η(1) The root node of the GERC tree containing all the genes Definition 7: Neighbour: Two genes g i and g j are said to be neighbours of each other with respect to a threshold β if dist(g i ,g j )≤β .
Definition 8: Degree: The degree of a gene g j with respect to threshold β is defined as the number of genes which are within the β neighbourhood of g j .
Definition 9: Initial Cluster: An init ial cluster is defined as a subset of genes S such that for any two genes g i , g j in S, Exp (gi) =Exp(gi) in terms of at least (Nc/2)+1 numbers of conditions. Definition 10: Finer Cluster: A finer cluster is defined as a set of genes S such that for any two genes gi, gj in S ⅰ Exp(gi) = Exp(gi) in terms of (Nc/2)+1 nu mbers of conditions. ⅱ dist(gi,gi)<= δ l xα , where l is greater than or equal to the level of the node to which the finer cluster is associated.
The GERC algorith m consists of two steps. In the first step of GERC, all genes which have expression pattern similar to the reference gene in terms of (Nc/2)+1 nu mber of conditions are put in an init ial cluster. In the second step of the algorith m, the init ial clusters are processed to produce finer clusters. While processing an initial cluster, all genes are put in the root node. Then iterative clustering is performed using a tuned MRD threshold on genes of each node to produce its child nodes. The process is repeated till all nodes in the tree contain a number of genes greater than user specified preferred node volume. The tuning of MRD threshold is governed by the user defined parameter step down ratio. The leaves of the tree represent the generated clusters. The algorith m can be used to produce a wide range of clusters by considering mo re than one reference gene. The detail of the algorith m is presented next.    The algorith m tries to find the initial cluster to which this gene potentially belongs to. While finding this init ial cluster it tries to locate all genes which have similar expression pattern with the reference gene in terms of at least (Nc/ 2)+1 number o f conditions. If we want to explo re the entire dataset we can use each of the available genes or a set of stochastically selected genes as reference genes. Once we discover the init ial clusters we move to step 2 of the algorith m. In the second step, we create a single node first with all genes of the initial cluster in it and then iteratively cluster each node(having genes more than preferred volu me umber of genes) of the tree until all the processing nodes in the tree have less than or equal to preferred volu me number of genes. The input parameter step down ratio is actually used to reduce the value of MRD threshold as towards leaf nodes of the tree the similarity among genes increases and require a s maller value of M RD threshold. e.g. if the M RD threshold provided by user is 1 and step down ratio is .5, then the first node will use 1 as its MRD threshold while clustering. On successful division of the node to its children, the successive level nodes will use thresholds .5(1*.5),.25(.5*.5) …. and so on. Finally the sub-trees (one sub-tree in case of single reference gene) generated from the initial clusters are combined to form the GERC tree. The roots of these subtrees are made children of a single node that contains the set of entire genes and this node becomes the root of generated GERC tree.

Complexi ty Analysis
Since GERC involves two distinct steps, hence the total complexity is the sum of the comp lexities of these two steps. Let a dataset contains n number of genes. In the first step of the algorithm, the total number o f co mparisons done to put all n number of genes in the init ial cluster is n-1. In the second step, if the nu mber of genes in the init ial cluster is p, the computation of distance matrix involves px(p-1)/2 operations. If the average number of genes in the non leaf nodes is m and the total number of non leaf nodes is l, creating the child nodes and hence the entire tree requires l*m operations. So complexity of the second step is O (p x(p-1)/ 2+ l*m).

Experimental Results
We implemented the GERC algorith m in MATLAB and tested it on four publicly availab le benchmark microarray datasets mentioned in Tab le 4. The test platform was a desktop PC with Intel Core 2 Duo 2.00 GHz processor and 512 M B memory running Windows XP operating system. Fig. 5, 6 and 7 present a part of the tree generated from an initial cluster for Dataset 2, Dataset 3 and Dataset 4 respectively.

Cluster Quality
The GERC algorith m was co mpared with various clustering algorith ms and the results were validated using average homogeneity score [18], p value [19] and z score [20].
Cluster Ho mogeneity: Ho mogeneity measures the quality of clusters on the basis of the defin ition of a cluster i.e., Objects within a cluster are similar while objects in different clusters are dissimilar. It is calcu lated as follows.
(a) Co mpute average value of similarity between each gene g i and the centroid of the cluster to which it has been assigned.
(b) Calculate average ho mogeneity for the clustering C weighted according to the size of the clusters. Dataset 3 Yeast Cell cycle [15] Sample input files in expander [17] 698 72 Dataset 4 Yeast Sporulation [16] http://cmgm.standford.edu/pbrown/sporulation 474 7 The homogeneity values for the GERC algorith m and some other algorithms are reported in Table 5. It can be observed that the homogeneity value for GERC is highest fro m which we can conclude that the coherence of the clusters produced by GERC are better than those produced by competing algorith ms. p val ue:The b iological relevance of a cluster can be verified based on the gene ontology (GO) annotation databasehttp://db.yeastgenome.org/cgi-bin/ GO/-goTermFinder. It is used to test the functional enrich ment of a group of genes in terms of three structured controlled ontologies, viz., associated biological processes, molecular functions and biological co mponents. The functional enrich ment of each GO category in each of the clusters obtained is calculated by its p-value.
The p-value is computed using a cumulative hyper geometric d istribution. It measures the probability of finding the number of genes involved in a given GO term (i.e., function, process and component) within a cluster. The genes in a cluster are evaluated for the statistical significance by computing the p-value for each GO category. This signifies how well the genes in the cluster match with different GO categories. The p-value represents the probability of observing the number of genes fro m a specific GO functional category within each cluster. A lo w p-value indicates the genes belonging to the enriched functional categories are biologically significant in the corresponding clusters.
To compute p-value, we used the software FuncAssociate [19]. respectively. Fro m the Tables, we can conclude that GERC shows a good enrichment of functional categories and therefore projects a good biological significance.

Discussion
We have presented here a clustering algorithm that generates a tree where the children of a node represent the clusters that are formed fro m the genes in that node. The leaf nodes of the tree will represent the desired clusters. We keep on reducing the threshold used in the clustering process as we move on to deeper levels. The structure of the generated tree is d riven by the input parameters. The input parameters T and δ controls the height/depth of the tree. If we set the value of T large the genes in the root node will split to the leaf nodes in a few nu mbers of levels. ascospore wall assembly spore wall assembly fungal-type cell wall assembly cell wall assembly cellular component assembly involved in morphogenesis cell differentiation sporulation resulting in formation of a cellular spore sporulation cellular developmental process developmental process involved in reproduction developmental process If we set the value of δ small, it will lead to fewer levels as the difference between the thresholds used in the subsequent levels will beco me larger. α (M RD threshold) should be carefully chosen. A small value of α may leave out some upper levels of the hierarchical tree. The effect o f these parameters on the tree structure is presented in Table 9. To decide the value of δ, we drew two graphs (Fig. 4), one for number of nodes and another one for depth of the sub tree generated from an init ial cluster against different values of δ. We used a trade off between these two graphs and decided to use 0.7 as the value of δ for yeast cell cycle dataset.

Conclusions and Future Work
In this paper, we present a top down hierarchical clustering algorith m that produces a tree of genes in the neighbourhood of a reference gene called GERC tree along with the generated clusters. The algorithm can be used to generate a wide range of clusters by considering mu ltiple reference genes. Clustering performed in the nodes at different levels of GERC tree adaptively chooses the values of threshold parameters. The co mplexity of our approach can be improved by using an appropriate heuristic method for estimating an effective set of parameter values which will guarantee for quality cluster results. The value of α and R can also be calculated statistically fro m the set of input genes. Work is underway for integrating prior bio logical informat ion to the clustering process to improve the results.