User:Pranjal.Patra/sandbox

=Chapter 1: Introduction=

=Chapter 2: Background Knowledge=
In this chapter, the essential biology required to understand the work is briefly described.

Flow of information in biological systems
This section gives a deliberately simplified description of a very complex biological process: how genetic information flows from an organism's genome to its expressed phenotype.

Deoxyribonucleic acid (DNA)
Deoxyribonucleic acid (DNA) carries the genetic information in almost all known living organisms. It contains the instructions that control a cell's activities. The human genome is about three billion base pairs long. Nucleotides are the building blocks of DNA, and each carries one of four bases: adenine, guanine, cytosine or thymine. A copy of the genome is present in almost every cell of the body. DNA molecules have a double helix structure, and this extremely long molecule is condensed into a DNA-protein complex in the nucleus of the cell. DNA can be replicated, or it can be transcribed into ribonucleic acid (RNA), which may code for proteins. RNA is also a nucleic acid, but unlike DNA it is usually single-stranded and contains the base uracil in place of thymine.

This is an oversimplified description of a very complex biological process.

Gene
A gene is a unit of heredity in a living organism. A gene usually influences a certain characteristic of the organism. It is normally a stretch of DNA that codes for a type of protein or for an RNA molecule that has a function in the organism. All living things depend on genes, as they specify all proteins and functional RNA molecules.

These genes only have an effect on the cell when they are expressed.

Gene expression
Gene expression is the process by which information from a gene is used in the synthesis of functional gene products such as proteins. In this cellular mechanism the genetic code present in the DNA is decoded, and the decoded information gives rise to the organism's phenotype. The stretch of DNA that will be transcribed and then translated is called the transcription unit. Each transcription unit is preceded by a special sequence located just before it on the strand, a position referred to as upstream. This sequence, called the promoter region, defines where transcription of the unit begins. Gene expression consists of several steps, the two most important of which are transcription and translation.

Transcription
Transcription is the process by which the information contained in a section of DNA (a gene) is transferred to a newly assembled piece of messenger RNA (mRNA). It is facilitated by RNA polymerase and transcription factors. In eukaryotic cells the primary transcript (pre-mRNA) must be processed further in order to ensure translation.

Translation
In translation, messenger RNA (mRNA) produced by transcription is decoded by the ribosome to produce a specific amino acid chain, or polypeptide, that will later fold into an active protein molecule.

After a gene undergoes these steps, an active protein molecule is produced. However, not every gene is expressed in all cells at all times. By controlling which genes are active, a cell can take on special characteristics. Muscle cells and neurons have the same DNA but perform different functions because they express different sets of genes. Transcription factors are one of the mechanisms that help cells regulate which genes to express. In the nucleus, the DNA is condensed in a DNA-protein complex called chromatin. In places where genes are being expressed, there are often zones of naked DNA. Transcription factors bind to these naked DNA sequences and regulate gene expression.

Transcription Factors
As mentioned above, transcription factors play their part in the transcription stage of gene expression. Transcription factors are protein molecules that bind to a specific DNA sequence. Once bound to the matching DNA sequence, the transcription factor can promote or block the transcription of the gene that follows its binding site. Hence transcription factors regulate the level of gene expression by controlling transcription: they promote (as an activator) or block (as a repressor) the recruitment of RNA polymerase (the enzyme that transcribes genetic information from DNA to RNA) to specific genes. Some transcription factors perform this function alone, while others act together with other proteins in a complex.

Transcription factors are proteins that control which genes are turned on or off in the genome. They do so by binding to DNA and other proteins. Once bound to DNA, these proteins can promote or block the enzyme that controls the reading, or “transcription,” of genes, making genes more or less active.

Transcription factors are essential for the regulation of genes. For example, different genes are turned on in liver cells than in skin cells. Different genes are turned on in cancer cells than in healthy cells. Through the action of transcription factors, the various cells of the body, which all have the same genome, can function differently.

These proteins are so important to life that they are found in all living organisms. Roughly 8% of genes in the human genome encode transcription factors. They play important roles in development, the sending of signals within the cell, and the events in a cell that lead to division and duplication, known as the cell cycle. Several human diseases are linked to mutations in transcription factors, such as diabetes, autoimmune diseases, and cancer.

Biological Pathways
A biological pathway is a series of actions among molecules in a cell that leads to a certain product or a change in a cell. Such a pathway can trigger the assembly of new molecules, such as a fat or protein. Pathways can also turn genes on and off, or spur a cell to move. Some of the most common biological pathways are involved in metabolism, the regulation of genes and the transmission of signals.

ChIP-X Experiments
Several in vivo experiments such as ChIP-chip (Iyer, Horak et al. 2001), ChIP-seq (Johnson, Mortazavi et al. 2007), ChIP-PET (Wei, Wu et al. 2006) and DamID (Vogel, Peric-Hupkes et al. 2007) provide details about possible binding sites for transcription factors at a genome-wide level. These four methods together are referred to as ChIP-X. The sites discovered using the ChIP-X methods are near gene coding regions and are found when the chromatin structure of a specific cellular state allows a particular transcription factor to bind. This means that, unlike possible binding sites found using in vitro approaches, these sites are much more likely to have actual biological significance. Results from such experiments report the binding of specific transcription factors to DNA in the proximity of target gene loci, commonly listing hundreds to thousands of potential regulatory interactions.

Gene Expression Profiling
Gene expression profiling is the measurement of the activity (the expression) of thousands of genes at once, to create a global picture of cellular function. These profiles can, for example, distinguish between cells that are actively dividing, or show how the cells react to a particular treatment. Many experiments of this sort measure an entire genome simultaneously, that is, every gene present in a particular cell.

A DNA microarray (also commonly known as DNA chip or biochip) is a collection of microscopic DNA spots attached to a solid surface. Scientists use DNA microarrays to measure the expression levels of large numbers of genes simultaneously or to genotype multiple regions of a genome. Each DNA spot contains segments of a specific DNA sequence, known as probes (or reporters or oligos). These can be a short section of a gene or other DNA element that is used to hybridize a cDNA or cRNA (also called anti-sense RNA) sample (called target) under high-stringency conditions. Probe-target hybridization is usually detected and quantified by detection of fluorophore-, silver-, or chemiluminescence-labelled targets to determine the relative abundance of the targets in the sample.

RNA-seq refers to the use of high-throughput sequencing technologies to sequence cDNA in order to get information about a sample's RNA content. The technique has been rapidly adopted in studies of diseases like cancer.

=Chapter 3: Related Work=
A lot of research has been done on understanding the biological processes that are active in a given cellular state, and this thesis belongs to the same area. We perform statistical tests on the data obtained from high-throughput gene expression profiling experiments to gain insight into which sets of genes are more active than others. Several research groups have made important contributions to this field, and this project builds upon the work that has been done so far. In this chapter, we discuss the techniques and databases that are related to this project.

Gene annotation databases
Genes are annotated based on their function, position and other characteristics. These annotations are stored in publicly available databases. For example, a functional annotation for a gene can be its association with a particular function in a metabolic pathway. Any information about the function of a gene is a functional annotation of that gene. Using these publicly available annotations, it is possible to create gene sets by grouping together all the genes that share a common annotation. Usually every biological pathway, such as a metabolic or signalling pathway, is associated with certain genes.

Enrichment Analysis (Pathway Analysis)
High-throughput sequencing and gene/protein profiling techniques such as DNA microarrays and RNA-Seq have become very popular because they allow researchers to simultaneously measure genome-wide changes in gene expression under a given biological condition. Experiments using these techniques usually generate a list of differentially expressed genes or proteins. The genes can be ordered in a ranked list according to their differential expression between classes. The genes that are over- or under-expressed can be linked to the observed phenotype, which helps us gain insight into gene function. Enrichment analysis reduces the complexity of this task by working with groups of related genes rather than individual genes, and it has high explanatory power. Broadly, there are three different approaches for performing enrichment analysis: the Over-Representation Analysis (ORA) approach, the Functional Class Scoring (FCS) approach and the Pathway Topology (PT)-based approach (Khatri, Sirota, & Butte, 2012).

Over-Representation Analysis (ORA) Approach
The growing popularity of high-throughput sequencing, together with the development of public gene set repositories such as Gene Ontology (GO) and the Kyoto Encyclopedia of Genes and Genomes (KEGG), fueled an immediate need for functional analysis of microarray gene expression data. The Over-Representation Analysis (ORA) approach was devised to meet this need. From the expression levels observed in the experiment, a list of significant genes that were over-expressed or under-expressed is created. To create this list, an arbitrary cut-off is set on a measure such as the p-value or the fold change. For example, a researcher may decide that all genes with a fold change greater than 2 qualify as significant. Next, for each pathway (or gene set), the number of its genes that are present in this list is counted. Then a random set of background genes is generated, and the number of these genes present in the list is counted. This exercise is repeated several times. Finally, statistical tests based on the hypergeometric, chi-square or binomial distribution are used to assess whether the gene set or pathway is over- or under-represented in the list. This is a very simple technique that sheds some light on the gene sets that are under- or over-expressed, but ORA also has a few shortcomings.
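As a rough illustration of the hypergeometric variant of this test, the sketch below computes the probability of seeing at least the observed number of pathway genes in the list of significant genes. The function name `ora_p_value` and the toy numbers are assumptions made for the example; they are not taken from any particular tool.

```python
from math import comb


def ora_p_value(background_size, pathway_size, significant_size, overlap):
    """One-sided hypergeometric p-value P(X >= overlap).

    X is the number of pathway genes found when `significant_size` genes
    are drawn without replacement from `background_size` genes, of which
    `pathway_size` belong to the pathway.
    """
    total = comb(background_size, significant_size)
    upper = min(pathway_size, significant_size)
    tail = sum(
        comb(pathway_size, k)
        * comb(background_size - pathway_size, significant_size - k)
        for k in range(overlap, upper + 1)
    )
    return tail / total


# Toy numbers (hypothetical): 20000 measured genes, a pathway of 100 genes,
# 500 significant genes, 10 of which belong to the pathway.
p = ora_p_value(background_size=20000, pathway_size=100,
                significant_size=500, overlap=10)
print(f"over-representation p-value: {p:.3g}")
```

With these toy numbers only 2.5 pathway genes would be expected in the list by chance, so an overlap of 10 yields a small p-value.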

Limitations of Over-Representation Analysis (ORA) Approach
Even though the ORA approach is the most popular approach, it has several shortcomings. First, the statistical tests (e.g., those based on the hypergeometric, binomial or chi-square distribution) ignore the values measured for the genes in the expression experiment. The list of significant genes is generated based on an arbitrary threshold, and because the measurements are then ignored, all the genes that make the list are treated equally despite their varying levels of expression. Second, the individual genes that do not meet the threshold are discarded. Genes whose expression levels fall just short of the threshold also carry some significance but are ignored entirely. This is the disadvantage of a hard cut-off threshold. Third, by treating all genes as equal, ORA significantly reduces its ability to analyze complex biological interactions that involve several gene products. With enrichment analysis, some researchers also try to find out how the interactions between various gene products manifest as the levels of expression change. Since ORA techniques consider all genes as equal and independent, they fail to provide any insight in this regard. Fourth, ORA approaches work under the assumption that all pathways are independent of each other, which is not true. For example, in the signaling pathways in KEGG, the presence of growth factors activates the MAPK signaling pathway, which in turn activates the cell cycle pathway. ORA methods do not account for such inter-pathway interactions and dependencies.

Functional Class Scoring (FCS) Approach
In most biological systems, significant effects on pathways can be caused by large changes in individual genes, but they can also be caused by weaker coordinated changes in the expression levels of several functionally related genes. By grouping such related genes into a gene set, so that each gene set represents a biological pathway, we can detect such effects. Almost all FCS-based methods consist of three main steps:

Step 1: Calculate Gene Level Statistic
First, a gene level statistic is computed from the molecular measurement data obtained from high-throughput expression analysis experiments such as DNA microarrays or RNA-Seq. This is done by calculating the differential expression of each gene. Several statistics can be used here to represent the expression levels, such as the correlation of molecular measurements with phenotype (Pavlidis, Qin, Arango, Mann, & Sibille, 2004), the Q-statistic (Goeman, Van De Geer, De Kort, & Van Houwelingen, 2004), the signal-to-noise ratio (Subramanian et al., 2005), the t-test (Tian et al., 2005) or the Z-score (Kim & Volsky, 2005).
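To make this step concrete, the sketch below computes one of the listed options, the signal-to-noise ratio, for each gene. It assumes a two-class design whose samples are labelled "case" and "control"; the function names, labels and toy values are illustrative and not part of any published implementation.

```python
from statistics import mean, stdev


def signal_to_noise(case_values, control_values):
    """(mean_case - mean_control) / (sd_case + sd_control)."""
    spread = stdev(case_values) + stdev(control_values)
    if spread == 0:
        raise ValueError("undefined when both groups have zero spread")
    return (mean(case_values) - mean(control_values)) / spread


def gene_level_statistics(expression, labels):
    """Map each gene to its signal-to-noise ratio.

    `expression` maps a gene name to one value per sample and `labels`
    gives the class ("case" or "control") of each sample, in the same order.
    """
    stats = {}
    for gene, values in expression.items():
        case = [v for v, label in zip(values, labels) if label == "case"]
        control = [v for v, label in zip(values, labels) if label == "control"]
        stats[gene] = signal_to_noise(case, control)
    return stats


# Toy data (hypothetical): two genes measured in three case and three
# control samples.
expression = {
    "geneA": [8.0, 9.0, 10.0, 2.0, 3.0, 4.0],
    "geneB": [5.0, 6.0, 7.0, 5.5, 6.5, 7.5],
}
labels = ["case"] * 3 + ["control"] * 3
stats = gene_level_statistics(expression, labels)
print(stats)
```

Any of the other statistics mentioned above could be substituted for `signal_to_noise` without changing the surrounding structure.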

Step 2: Calculate pathway level statistics
Next, the gene level statistics for all genes in a given gene set are aggregated into a single pathway level statistic. There are several statistical methods to do this; some of the more common ones are the Kolmogorov-Smirnov statistic, the Wilcoxon rank sum, and the sum, mean or median of the gene level statistics. Whatever method is chosen, its power can depend on factors such as the proportion of genes in the pathway that are differentially expressed, the size of the pathway (i.e. the number of genes it contains) and the amount of correlation between the genes in the pathway. Even though multivariate statistics might be expected to give better results because they also account for interdependencies among genes, it has been observed that univariate statistics show more power at stringent cut-offs (p≤0.001) and equal power at less stringent cut-offs (p≤0.05).
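Two of the aggregation options can be sketched as follows: the mean of the gene level statistics, and an unweighted Kolmogorov-Smirnov-style running sum over a ranked gene list. This is a minimal illustration under simplifying assumptions (no weighting by the gene level statistic, hypothetical function names), not a reimplementation of any published method.

```python
from statistics import mean


def pathway_mean(gene_stats, gene_set):
    """Mean of the gene level statistics of the genes in the set."""
    return mean(gene_stats[g] for g in gene_set if g in gene_stats)


def ks_running_sum(ranked_genes, gene_set):
    """Unweighted Kolmogorov-Smirnov-style pathway statistic.

    Walk down the ranked list, stepping up for genes in the set and down
    for genes outside it, and return the largest deviation from zero.
    """
    hits = [g in gene_set for g in ranked_genes]
    n_hit = sum(hits)
    n_miss = len(hits) - n_hit
    if n_hit == 0 or n_miss == 0:
        raise ValueError("need genes both inside and outside the set")
    running = 0.0
    best = 0.0
    for is_hit in hits:
        running += 1.0 / n_hit if is_hit else -1.0 / n_miss
        if abs(running) > abs(best):
            best = running
    return best


# Toy gene level statistics (hypothetical values).
gene_stats = {"a": 3.0, "b": 2.0, "c": -0.5, "d": -1.5}
ranked = sorted(gene_stats, key=gene_stats.get, reverse=True)
print(pathway_mean(gene_stats, {"a", "b"}))
print(ks_running_sum(ranked, {"a", "b"}))
```

A set whose genes cluster at the top of the ranking gives a positive running-sum statistic, and a set clustered at the bottom gives a negative one.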

Step 3: Assessing statistical significance of the pathway level statistic
Here the statistical significance of the pathway level statistic is assessed against a null hypothesis. There are two main ways to do the testing:
 * Self-contained null hypothesis testing: In this method, class labels (i.e. phenotypes) for each sample are permuted, and the set of genes in a given pathway is compared with itself, ignoring the genes that are not in the pathway.
 * Competitive null hypothesis testing: In this method, gene labels are permuted for each pathway, and the set of genes in the pathway is compared with a set of genes that are not in the pathway. The size of the gene sets remains the same.
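The two permutation schemes can be sketched as follows. In this example, permuting the class labels of the samples tests the pathway against itself (the self-contained null of Khatri, Sirota, & Butte, 2012), while drawing random gene sets of the same size tests the pathway against the remaining genes (the competitive null). The pathway score used here, the mean absolute difference in group means, as well as the function names and label values, are assumptions chosen to keep the example short.

```python
import random
from statistics import mean


def mean_difference(values, labels):
    """Difference between the mean of case and control samples."""
    case = [v for v, label in zip(values, labels) if label == "case"]
    control = [v for v, label in zip(values, labels) if label == "control"]
    return mean(case) - mean(control)


def pathway_score(expression, labels, genes):
    """Mean absolute group difference over the given genes."""
    return mean(abs(mean_difference(expression[g], labels)) for g in genes)


def sample_label_permutation_p(expression, labels, pathway, n_perm=1000, seed=0):
    """Permute class labels; only the pathway's own genes are used."""
    rng = random.Random(seed)
    observed = pathway_score(expression, labels, pathway)
    shuffled = list(labels)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(shuffled)
        if pathway_score(expression, shuffled, pathway) >= observed:
            hits += 1
    return (hits + 1) / (n_perm + 1)


def gene_label_permutation_p(expression, labels, pathway, n_perm=1000, seed=0):
    """Compare the pathway with random gene sets of the same size."""
    rng = random.Random(seed)
    all_genes = sorted(expression)
    per_gene = {g: abs(mean_difference(expression[g], labels)) for g in all_genes}
    observed = mean(per_gene[g] for g in pathway)
    hits = 0
    for _ in range(n_perm):
        random_set = rng.sample(all_genes, len(pathway))
        if mean(per_gene[g] for g in random_set) >= observed:
            hits += 1
    return (hits + 1) / (n_perm + 1)
```

Both functions add one to the numerator and denominator so that the estimated p-value is never exactly zero.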

Advantage of using FCS
The limitations of the ORA approach described above have been addressed in the FCS approach. This helps FCS provide better results and deeper insight into the underlying biology of any given condition. For example:
 * FCS does not require an arbitrary cut-off threshold for dividing the genes into significant and non-significant groups. It uses all the available molecular measurements for its analysis.
 * FCS uses the molecular measurement information to detect coordinated changes in the expression of genes in a pathway. By detecting such coordinated changes, FCS can give us information about the dependence between genes.

Limitations to FCS
In general FCS performs better than the ORA approach, but this method still has some limitations. FCS analyses each pathway independently, so a problem arises when a single gene is part of multiple pathways. In such a case, a gene might be over-expressed because it plays an important role in one particular pathway, yet its expression level will also be counted when evaluating the pathway level statistic of every other pathway that the gene belongs to. Another limitation arises when the statistical method used to implement FCS is rank based. In such a case, the value measured in the experiment is not considered in the analysis; only the assigned rank is.

Pathway Topology (PT)-Based Approach
The pathway topology-based approach is the newest technique available for performing enrichment analysis. It is similar to the functional class scoring approach; the difference lies in the way pathway topology-based approaches compute the gene level statistics. Several publicly available pathway knowledge bases hold information about which gene products interact with each other, how they interact and where they interact in a given pathway. ORA and FCS do not utilize this knowledge. An example of a PT-based approach is ScorePAGE, proposed by Rahnenfuhrer et al. (Rahnenfuhrer, Domingues et al. 2004). ScorePAGE computes the similarity between each pair of genes in a pathway, measured as the correlation or covariance between the two genes. The similarity score is comparable to the gene level score in FCS-based approaches. These scores are then averaged to arrive at the pathway level statistic. However, ScorePAGE first divides each similarity score by the number of reactions needed to connect the two genes in the pathway. This strategy assigns varying weights to the pairwise similarity scores.
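The weighting idea can be illustrated with the simplified sketch below, which divides the Pearson correlation of each gene pair by the number of reactions separating the two genes and then averages the results. It is only an illustration of the idea described above, not the published ScorePAGE algorithm; the function names and the `reaction_distance` lookup table are assumptions.

```python
from itertools import combinations
from math import sqrt


def pearson(x, y):
    """Pearson correlation of two equally long value lists."""
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)


def topology_weighted_score(expression, reaction_distance):
    """Average pairwise correlation, down-weighted by pathway distance.

    `reaction_distance` maps frozenset({gene1, gene2}) to the number of
    reactions needed to connect the two genes (assumed to be >= 1).
    """
    scores = []
    for g1, g2 in combinations(sorted(expression), 2):
        distance = reaction_distance[frozenset((g1, g2))]
        scores.append(pearson(expression[g1], expression[g2]) / distance)
    return sum(scores) / len(scores)


# Toy pathway (hypothetical): genes a and b are adjacent, a and c are
# two reactions apart.
expression = {"a": [1, 2, 3], "b": [2, 4, 6], "c": [3, 2, 1]}
reaction_distance = {
    frozenset(("a", "b")): 1,
    frozenset(("a", "c")): 2,
    frozenset(("b", "c")): 1,
}
score = topology_weighted_score(expression, reaction_distance)
print(score)
```

Gene pairs that are far apart in the pathway therefore contribute less to the pathway level statistic than directly connected pairs.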

Limitations
Some of the common limitations with this strategy are:
 * Pathway topology depends upon the cell type and condition being studied. Hence this information is not readily available and is usually fragmented in various knowledge bases. As the annotation becomes more comprehensive and complete, these approaches are expected to perform better.
 * No existing PT-based approach can collectively model and analyze high-throughput data as a single dynamic system.

ChEA: ChIP-X Enrichment Analysis
ChEA is a software tool that uses data from ChIP-X experiments to link transcription factors to gene expression changes by computing the over-representation of transcription factor targets in an input list of genes. ChEA essentially counts the number of targets in the list and compares it with the number of targets identified in the database, i.e. it applies an ORA approach to transcription factor targets in an input list of genes (Lachmann, Xu et al. 2010).

A database was manually curated from the literature reporting ChIP-X experiment results. In this database each record contains a list of genes potentially regulated by a specific transcription factor under a specific condition. Then this database was used as the prior knowledge base to analyze mRNA expression data where enrichment analysis was performed. The current database has the following statistics:
 * Transcription Factors: 206
 * Publications: 233
 * Genes: 47153
 * Total Entries: 474676

ChEA is commonly used after a genome-wide gene expression profiling study is performed. The steps that follow are:

First a list of genes that significantly changed their expression levels is prepared and given as an input to the ChEA software. One can assume that these genes play a special role in the particular condition the cell was in.

Next, the software computes over-representation for targets of transcription factors from the ChIP-X database. To compute statistical enrichment, ChEA implements the Fisher exact test with the Bonferroni correction, where the inputs to the test are the number of genes in the input list, the number of genes identified in the ChIP-X experiment, the number of genes shared between the two lists and the overall number of targets in the ChIP-X database.
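A one-sided Fisher exact test on these four counts, followed by a Bonferroni correction over several transcription factors, can be sketched as below. This is not ChEA's own code; the function names, the transcription factor labels `TF_A` and `TF_B` and all counts are hypothetical.

```python
from math import comb


def fisher_exact_greater(overlap, input_size, target_size, background_size):
    """One-sided Fisher exact test for over-representation.

    Probability of sharing at least `overlap` genes between an input list
    of `input_size` genes and a target list of `target_size` genes, when
    both are drawn from `background_size` genes.
    """
    total = comb(background_size, input_size)
    upper = min(input_size, target_size)
    tail = sum(
        comb(target_size, k) * comb(background_size - target_size, input_size - k)
        for k in range(overlap, upper + 1)
    )
    return tail / total


def bonferroni(p_values):
    """Multiply each p-value by the number of tests, capped at 1."""
    n_tests = len(p_values)
    return {name: min(1.0, p * n_tests) for name, p in p_values.items()}


# Hypothetical counts for two transcription factors.
raw = {
    "TF_A": fisher_exact_greater(overlap=12, input_size=200,
                                 target_size=400, background_size=20000),
    "TF_B": fisher_exact_greater(overlap=5, input_size=200,
                                 target_size=500, background_size=20000),
}
adjusted = bonferroni(raw)
for name in sorted(adjusted, key=adjusted.get):
    print(name, adjusted[name])
```

Sorting the transcription factors by adjusted p-value gives a ranked list of the kind described in the final step.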

Finally, ChEA reports a ranked list of ChIP-X experiments that show statistically significant overlap with the input list. Identified genes from the input list, potentially regulated by a specific transcription factor, are also connected and visualized as a network using known protein–protein interactions.

Inferring condition-specific transcription factor function from DNA binding and gene expression data (McCord, Berger et al. 2010)
In this study, the authors developed a novel algorithm called “CRACR” (Combination Rank-order Analysis of Condition-specific Regulation; pronounced “cracker”), which derives information about condition-specific gene regulation and transcription factor (TF) activity by combining comprehensive, condition-independent protein binding microarray (PBM) data for a given TF with gene expression microarray data obtained under a variety of biological conditions. Specifically, CRACR searches for conditions in which the genes downstream of those intergenic regions (IGRs) that exhibit significant TF binding in PBMs are enriched among differentially expressed genes. In contrast to earlier studies, CRACR integrates PBM-derived experimental TF binding site data with gene expression data without imposing arbitrary cut-offs that define which IGRs are “bound” or which genes are “differentially expressed”. In addition, CRACR uses rank order statistics, which facilitates the comparison of gene expression data from different microarray platforms.

Prediction of the condition specific functions of yeast Saccharomyces cerevisiae's transcription factors

 * First, the team collected 1327 publicly available gene expression microarray data sets for Saccharomyces cerevisiae. Each of these refers to a specific cellular condition.
 * Next, for each of these conditions, the genes were ordered by their expression fold change. At the top were the genes that were highly induced and at the bottom the ones that were repressed.
 * In parallel with the previous step, ranks were assigned to the genes according to the PBM P-values of transcription factors binding to their upstream IGRs.
 * Then, using a rank-based statistical test, the PBM-defined ranks of similarly expressed genes within a sliding foreground window were compared with the ranks of a length-matched background set of genes outside this window.
 * This statistical test yields a value referred to as the ‘enrichment score’, which represents the degree to which PBM-derived target genes of a given TF are significantly enriched within each window of similarly expressed genes.
 * The statistical significance of the maximum enrichment in a condition is derived by permutation testing.
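The sliding-window comparison in the steps above can be sketched roughly as follows. The score used here, the fraction of foreground and background gene pairs in which the foreground gene has the better PBM rank, stands in for the rank-based test of the original study, whose exact statistic is not given here. The random choice of the background set and all function names are likewise assumptions, and the permutation test for the maximum score is omitted.

```python
import random


def rank_auc(foreground, background):
    """Fraction of pairs in which the foreground rank is better (smaller).

    Ties count as half a win, so 0.5 means no enrichment.
    """
    wins = 0.0
    for f in foreground:
        for b in background:
            if f < b:
                wins += 1.0
            elif f == b:
                wins += 0.5
    return wins / (len(foreground) * len(background))


def sliding_window_enrichment(pbm_ranks, window, seed=0):
    """Score every window of similarly expressed genes.

    `pbm_ranks` holds the PBM rank of each gene, with the genes listed in
    expression order (most induced first). For each window, a background
    set of equal length is sampled from the genes outside the window.
    """
    if len(pbm_ranks) < 2 * window:
        raise ValueError("need at least 2 * window genes")
    rng = random.Random(seed)
    scores = []
    for start in range(len(pbm_ranks) - window + 1):
        inside = pbm_ranks[start:start + window]
        outside = pbm_ranks[:start] + pbm_ranks[start + window:]
        background = rng.sample(outside, window)
        scores.append(rank_auc(inside, background))
    return scores


# Toy example (hypothetical): the most induced genes also have the best
# PBM ranks, so the first window scores highest.
scores = sliding_window_enrichment([1, 2, 3, 4, 5, 6, 7, 8], window=2)
print(max(scores), scores.index(max(scores)))
```

The maximum score over all windows corresponds to the maximum enrichment whose significance is then assessed by permutation testing.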

Results
Using the method described above, CRACR can list expression conditions in which predicted transcription factor target genes show statistical significance in expression levels. From such a list of cellular conditions, one can hypothesize about the functions of the transcription factor.

=Chapter 4: Description of the approach=
How we got the data: how we got the ENCODE data, how we processed it, and how we obtained the p-value for the relation between TFs and gene sets. Expression analysis. What approach is used: pseudocode for the ENCODE part and for the microarray analysis, and then how we integrated the two. Why we selected the average of the p-values. Give some details about the implementation (e.g. that C++ was used).

Also include what is new in the approach.

Describe the data: for example, say what ENCODE is and what ENCODE did to generate the data, then describe what we did with bedtools, such as taking the 3000 bp before each gene.

Where we get the gene sets.

BINDING DATABASE CREATION
The goal of this part of the project is to understand which transcription factor controls which gene set, and to what degree. To achieve this goal we used the data generated by the ENCODE project. The human genome contains about 20,000 protein-coding genes, which account for only about 1.5% of the entire genome. The primary goal of the ENCODE project is to catalog all the functional elements in the human genome and to understand what part the remaining genome plays. (cite wikipedia encode article)

One of the outputs of the data generated in the project was a set of transcription factor binding sites (TFBSs).

Explain ENCODE project

Explain the ChEA database

From these two projects we collect data about the TFBSs across the entire human genome for all known TFs.

From Ensembl, we get the details of the gene locations.

Once we knew the locations of the TFBSs and the genes, we were able to determine which TF possibly affects which gene.

From ChEA we also get the intensity levels, and using these intensity levels we run FCS to arrive at p-values for each TF.

COMBINING TWO P-VALUES
=Chapter 5: Experiment and the results=
How we test it? How we validate it?

=Chapter 6: Conclusion=
Summary of what was done (short, perhaps 3 pages).

=Bibliography=
Goeman, J. J., Van De Geer, S. A., De Kort, F., & Van Houwelingen, H. C. (2004). A global test for groups of genes: testing association with a clinical outcome. Bioinformatics, 20(1), 93-99.

Khatri, P., Sirota, M., & Butte, A. J. (2012). Ten years of pathway analysis: current approaches and outstanding challenges. PLoS computational biology, 8(2), e1002375.

Kim, S.-Y., & Volsky, D. J. (2005). PAGE: parametric analysis of gene set enrichment. BMC bioinformatics, 6(1), 144.

Pavlidis, P., Qin, J., Arango, V., Mann, J. J., & Sibille, E. (2004). Using the gene ontology for microarray data mining: a comparison of methods and application to age effects in human prefrontal cortex. Neurochemical research, 29(6), 1213-1222.

Subramanian, A., Tamayo, P., Mootha, V. K., Mukherjee, S., Ebert, B. L., Gillette, M. A., ... Lander, E. S. (2005). Gene set enrichment analysis: a knowledge-based approach for interpreting genome-wide expression profiles. Proceedings of the National Academy of Sciences of the United States of America, 102(43), 15545-15550.

Tian, L., Greenberg, S. A., Kong, S. W., Altschuler, J., Kohane, I. S., & Park, P. J. (2005). Discovering statistically significant pathways in expression profiling studies. Proceedings of the National Academy of Sciences of the United States of America, 102(38), 13544-13549.