Transcriptional network structure assessment via the Data Processing Inequality

Whole genome transcriptional regulation involves an enormous number of physicochemical processes responsible for phenotypic variability and organismal function. The actual mechanisms of regulation are only partially understood. In this sense, an extremely important problem is the probabilistic inference of gene regulatory networks. A plethora of different methods and algorithms exists. Many of these algorithms are inspired by statistical mechanics and rely on information theoretical grounds. However, an important shortcoming of most of these methods, when it comes to deconvolving the actual, functional structure of gene regulatory networks, lies in the presence of indirect interactions. We present a proposal to discover and assess such indirect interactions within the framework of information theory by means of the data processing inequality. We also present some actual examples of the applicability of the method in several instances in the field of functional genomics.


Introduction
The data processing inequality (DPI) is thus useful for efficiently quantifying the dependencies among a large number of genes, because it eliminates those statistical dependencies that might be of an indirect nature, such as those between two genes that are separated by intermediate steps in a transcriptional cascade. We will outline an algorithmic implementation of the DPI within the framework of gene regulatory network (GRN) inference and structure assessment.
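The idea of eliminating indirect dependencies can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the implementation presented in this talk: it assumes a symmetric matrix of pairwise mutual information estimates is already available, and it applies the pruning rule used by ARACNE-style methods, where the weakest edge of every fully connected gene triplet is treated as indirect. The function name `dpi_prune` and the `tolerance` parameter are hypothetical.

```python
from itertools import combinations

import numpy as np


def dpi_prune(mi, tolerance=0.0):
    """Return a boolean adjacency matrix after DPI-based pruning.

    mi        : symmetric (n x n) array of pairwise mutual information values.
    tolerance : hypothetical slack in [0, 1); the weakest edge of a triplet is
                removed only if it is below (1 - tolerance) times the
                second-weakest edge.
    """
    mi = np.asarray(mi, dtype=float)
    n = mi.shape[0]

    # Start from every pair with positive mutual information.
    keep = mi > 0
    np.fill_diagonal(keep, False)

    # Mark removals separately so every triplet is judged on the initial graph.
    remove = np.zeros_like(keep)
    for i, j, k in combinations(range(n), 3):
        edges = [(i, j), (i, k), (j, k)]
        if not all(keep[a, b] for a, b in edges):
            continue  # only fully connected triplets are examined
        weakest, second = sorted(edges, key=lambda e: mi[e])[:2]
        if mi[weakest] < mi[second] * (1.0 - tolerance):
            remove[weakest] = remove[weakest[::-1]] = True

    return keep & ~remove


# Hypothetical cascade gene0 -> gene1 -> gene2 with a weak indirect 0-2 signal.
example_mi = np.array([
    [0.0, 0.8, 0.3],
    [0.8, 0.0, 0.7],
    [0.3, 0.7, 0.0],
])
adjacency = dpi_prune(example_mi)
print(adjacency.astype(int))
```

In this toy cascade the edge between gene 0 and gene 2 is the weakest of the triplet, so it is the one marked as indirect, while the two direct edges are kept.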

Outline
• Introduction
• Motivation
• The gene network inference problem
• The joint probability distribution approach (Guilt by association)
• Information theoretical measures and the data processing inequality (DPI)
• Applications
• Conclusions and perspectives

Most common pathologies are not caused by the mutation of a single gene; rather, they are complex diseases that arise from the dynamic interaction of many genes and environmental factors. To construct dynamic maps of gene interactions (i.e. GRNs) we need to understand the interplay between thousands of genes.
One important problem in contemporary computational biology is thus that of reconstructing the best possible set of regulatory interactions between genes (a so-called gene regulatory network, or GRN) from partial knowledge, as given for example by gene expression analysis experiments.

Several issues arise in the analysis of experimental data related to gene function:
• The nature of measurement processes generates highly noisy signals.
• There are far more variables involved (number of genes and interactions among them) than experimental samples.
• Another source of complexity is the highly nonlinear character of the underlying biochemical dynamics.

The gene network inference problem
Information theory (IT) has proved to be a powerful theoretical foundation for developing algorithms and computational techniques to deal with network inference problems applied to real data. There are, however, goals and challenges involved in the application of IT to genomic analysis.
The applied algorithms should return intelligible (i.e. understandable) models, rely on little a priori knowledge, deal with thousands of variables, detect non-linear dependencies, and do all of this starting from tens (or at most a few hundred) highly noisy samples.

There are several ways to accomplish this task. In our opinion, the best benchmarking options for the GRN inference scenario are sequential search algorithms (as opposed to stochastic search) and performance measures based on IT, since these make feature selection fast and efficient and also provide an easy means of communicating the results to non-specialists (e.g. molecular biologists, geneticists and physicians).

The deconvolution of a GRN can be based on optimizing the joint probability distribution (JPD) of gene-gene interactions, as given by gene expression experimental data. This can be implemented as follows:
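A commonly used form that is consistent with the symbols defined in this section (N genes, interaction terms Φ, partition function Z, Hamiltonian H) is sketched below. It is an assumption rather than a quotation: the symbol g_i for the expression level of gene i, and the expansion into single-gene, pairwise, and higher-order terms, are introduced here for illustration, and the exact expression intended may differ.

```latex
P(\{g_i\}) = \frac{1}{Z}
  \exp\left[
    - \sum_{i}^{N} \Phi_i(g_i)
    - \sum_{i,j}^{N} \Phi_{ij}(g_i, g_j)
    - \sum_{i,j,k}^{N} \Phi_{ijk}(g_i, g_j, g_k)
    - \cdots
  \right]
  \equiv \frac{1}{Z}\, e^{-H(\{g_i\})}
```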

The joint probability distribution approach
Here N is the number of genes, the Φ_i are interactions (i.e. correlation measures) and Z is a normalization factor (called a partition function). The functional H is termed a Hamiltonian (in analogy with statistical physics).

Estimating mutual information (MI) between gene expression profiles under the high-throughput experimental setups typical of today's research in the field is a considerable computational and theoretical challenge. One possible approximation is the use of estimators. Under a Gaussian kernel approximation, the JPD of a 2-way measurement is given as:

• Mutual information allows one to distinguish different kinds of 2-way interactions (one particular case of interest is that of triplets):

Hernández

The Data Processing Inequality (DPI) for Markov chains
Definition: Three random variables X, Y and Z are said to form a Markov chain (in that order), denoted X → Y → Z, if the conditional distribution of Z depends only on Y and is independent of X. That is, if we know Y, knowing X does not tell us any more about Z than if we know only Y.
If X, Y and Z form a Markov chain, then the JPD can be written:

P(X,Y,Z) = P(X) P(Y|X) P(Z|Y)
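The factorization above can be checked numerically on a small discrete example. The sketch below builds P(X,Y,Z) for three binary variables from hypothetical conditional probability tables (the numbers are arbitrary and chosen only for illustration), then compares the pairwise mutual information values. For any Markov chain built this way, the data processing inequality implies that the information shared by the two ends of the chain cannot exceed that shared by either adjacent pair.

```python
import numpy as np


def mutual_information(joint):
    """Mutual information (in nats) of a 2-D joint probability table."""
    joint = np.asarray(joint, dtype=float)
    p_rows = joint.sum(axis=1, keepdims=True)  # marginal of the row variable
    p_cols = joint.sum(axis=0, keepdims=True)  # marginal of the column variable
    product = p_rows @ p_cols                  # independent model P(a) P(b)
    mask = joint > 0                           # skip zero cells (0 log 0 = 0)
    return float(np.sum(joint[mask] * np.log(joint[mask] / product[mask])))


# Hypothetical tables for binary X, Y, Z (values are illustrative only).
p_x = np.array([0.5, 0.5])
p_y_given_x = np.array([[0.9, 0.1],   # rows index x, columns index y
                        [0.2, 0.8]])
p_z_given_y = np.array([[0.7, 0.3],   # rows index y, columns index z
                        [0.4, 0.6]])

# P(x, y, z) = P(x) P(y|x) P(z|y), stored with axes ordered (x, y, z).
joint_xyz = (p_x[:, None, None]
             * p_y_given_x[:, :, None]
             * p_z_given_y[None, :, :])

i_xy = mutual_information(joint_xyz.sum(axis=2))  # marginalize out z
i_yz = mutual_information(joint_xyz.sum(axis=0))  # marginalize out x
i_xz = mutual_information(joint_xyz.sum(axis=1))  # marginalize out y

print(f"I(X;Y) = {i_xy:.4f}, I(Y;Z) = {i_yz:.4f}, I(X;Z) = {i_xz:.4f}")
```

In a gene network setting, X, Y and Z would play the role of three genes in a transcriptional cascade, with the X and Z pair corresponding to the indirect interaction.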
The Data Processing Inequality
Theorem: If X, Y and Z form a Markov chain X → Y → Z then