User:Kinkreet/MGE

DNA packaging
The genome of any organism is extremely long and must be packaged to fit inside the nucleus or cell, and to avoid tangles and breaks. DNA is packaged by forming complexes with proteins; the resulting complexes are highly condensed and viscous, with DNA present at about 10-100 mg/ml (from prokaryotes to eukaryotes), far more concentrated than plasmid DNA, which is only present at 1-10 mg/ml.

Prokaryotes
Prokaryotes have ~20 kDa histone-like proteins which package DNA. Integration host factor (IHF; from E. coli), HU (B. stearothermophilus) and TF1 (B. subtilis) exist as dimers. Each HU monomer consists of an alpha-helical body responsible for dimerization, which is connected to β-sheet arms via a flexible linker region. The β-sheet arms can then fold into the minor groove of DNA. Similar structures are found in IHF, which also contains a flexible arm. HU binds to DNA non-specifically, whereas IHF and TF1 bind to defined consensus sequences. The concentration of these proteins is relatively low; they bind only about once every kilobase, so the packaging is not as dense as with nucleosomes. Upon binding, these proteins bend the DNA.

Archaea
Crenarchaeota are archaea that lack histones, whereas euryarchaeota have histones. In euryarchaeota, there are 1-5 putative histone genes. Archaeal histone complexes are generally smaller than eukaryotic ones (as they lack H2A/H2B); consequently, archaeal nucleosomes bind less DNA (67 bp instead of 147 bp for eukaryotes). Archaeal histones lack the N- and C-terminal tail domains found in eukaryotes, but still contain a similar histone fold.

Eukaryotes
DNA in eukaryotes is packaged around histones to form nucleosomes. DNA wraps periodically around histone complexes to form 11 nm fibres, which are further arranged into a solenoid-like 30 nm fibre.

Nucleosomes and 11nm fibers
The nucleosome core is a hetero-octamer of H2A, H2B, H3 and H4. Two H3s and two H4s come together to form the H3-H4 tetramer; in parallel, one H2A and one H2B come together to form the H2A-H2B dimer. Finally, one H3-H4 tetramer complexes with two H2A-H2B dimers to form the complete histone octamer. The C- and N-terminal tails of the histone subunits are flexible and stick out of the core, but after DNA binding they wrap around the DNA to hold it in place.

Digesting purified nuclei with micrococcal nuclease yields distinct DNA fragments, which can be visualized by electrophoresis. Initial digestion results in a 166 bp complex called the 'chromatosome particle', which contains the nucleosome core, histone H1 and 20 bp of linker DNA; further digestion of this particle leads to a fragment of 146 bp, which is the DNA bound to the nucleosome core.

The nucleosome interacts with the DNA through five distinct types of interaction: hydrogen bonds and ionic interactions; non-polar contacts; intercalation of arginines to contact phosphates across the groove (ideally at A-T base pairs); amide/phosphate bond interactions; and histone helix dipoles that spatially fix single phosphates.

When the DNA interacts with histones, it is distorted. The helix may twist tighter or looser, depending on the position of the DNA on the histone, and can vary between 9.4 and 10.9 base pairs per turn. Because the DNA has a width of ~2.5 nm, when it is wrapped around the histone core, the face of the helix in contact with the histones has a shorter distance to cover than the face on the outside; consequently, the grooves on the DNA are opened up.

The nucleosome-DNA complex is dynamic. The 'linker DNA' between the nucleosomes is free to move (this is known as DNA breathing); the parallel strands of DNA that are interacting with the histone may also move apart from each other transiently (known as gaping); and one strand of the DNA may also dissociate completely and become available for transcription or replication.

Despite being dynamic, the interaction is stable. H3 and H4 recover very slowly after photobleaching compared with H2B, which itself still took 2 minutes to recover 3% and 130 minutes to recover 40%. This suggests that the inner core of the nucleosome (H3/H4) is very stable, whereas the H2B on the surface can be exchanged more rapidly. H1 recovers almost completely only 4 minutes after photobleaching, showing that H1 does not associate with DNA as tightly as the core histones.

However, there are many variants of the nucleosome particle, in terms of composition and conformation. The composition and stoichiometry are sometimes temporarily changed during certain biological activities, such as transcript elongation, where RNAPII requires access to the DNA. For example, tetrasomes lack H2A and H2B, which makes them similar to the archaeal nucleosome. Other notable nucleosome particles include hexasomes, which have one of the H2A/H2B dimers missing; hemisomes, which contain one each of H2A, H2B, H3 and H4; and split nucleosomes, which are two hemisomes that are still associated. All these structures are structurally feasible.

There is a small degree of sequence-specificity in the positioning of nucleosomes. Because the DNA must bend around the histone core, more flexible parts of the DNA are more easily packaged. A/T-rich regions are more flexible (probably because there are only two hydrogen bonds between A and T, compared to three between C and G), and so many nucleosome binding sites contain periodically distributed AA and TT dinucleotides. This is only a preference and not a set rule. These nucleosome positioning sequences can be predicted by bioinformatic software that looks for these periodic AA or TT dinucleotides.
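The search for periodic AA/TT dinucleotides can be illustrated with a minimal sketch. The function names, the 10 bp period (roughly one helical turn) and the 1 bp tolerance are assumptions made for illustration; this is not how any particular prediction tool works.

```python
def dinucleotide_positions(seq, dinucs=("AA", "TT")):
    """Return the start indices at which AA or TT occurs in seq."""
    seq = seq.upper()
    return [i for i in range(len(seq) - 1) if seq[i:i + 2] in dinucs]


def periodicity_score(seq, period=10, tolerance=1):
    """Fraction of AA/TT pairs whose spacing is close to a multiple of `period`.

    `period` and `tolerance` are illustrative assumptions, not values
    taken from the text.
    """
    pos = dinucleotide_positions(seq)
    pairs = hits = 0
    for a in range(len(pos)):
        for b in range(a + 1, len(pos)):
            gap = pos[b] - pos[a]
            if gap < period - tolerance:
                continue  # skip neighbours that sit within the same helical turn
            pairs += 1
            offset = gap % period
            if min(offset, period - offset) <= tolerance:
                hits += 1
    return hits / pairs if pairs else 0.0
```

A score near 1 means most AA/TT dinucleotides fall in phase with the chosen period; since positioning is only a preference, such a score would be at best a weak hint.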

Sites for transcription factor binding tend to be away from these nucleosome positioning sequences. Note that the TATA box is also an A/T-rich region, which is also flexible (as it is kinked drastically by TBP), but the TATA box has a longer sequence (TATAWAWR). The TATA box is capable of one large kink, whereas nucleosome positioning sequences are capable of many small bends; thus only DNA with many nucleosome positioning sequences may wrap around the histone core.

The nucleosome positions of a genome can be determined directly. As before, the DNA is digested using micrococcal nuclease, which digests only the linker DNA and not DNA associated with nucleosomes. The fragments are then sequenced using Solexa, a high-throughput DNA sequencing technology which allows millions of DNA fragments to be sequenced in parallel. The DNA sequences associated with nucleosomes are protected, so there will be more fragments containing their sequences; in contrast, sequences that are part of the linker DNA are digested, so fewer fragments contain their sequence. After all the fragments have been sequenced, they are put together, with many overlaps. The sequences which overlap the most are the most protected, and thus highly likely to be nucleosome sites. Results show that in expressed genes, nucleosome positions are tight and highly periodic upstream of the transcriptional start site (TSS), but variable at the TSS, possibly because there can be multiple TSSs. In non-expressed genes, this periodicity is lost, and there is a single nucleosome at the TSS, which shuts down transcription.
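The "most overlaps means most protected" reasoning can be sketched as a per-base coverage count over aligned fragments. The functions below are hypothetical helpers, and the fragment coordinates (0-based start, exclusive end) and the threshold are assumptions; real pipelines would work from alignment files.

```python
def coverage(fragments, genome_length):
    """Per-base count of sequenced fragments covering each position.

    fragments: iterable of (start, end) pairs, 0-based, end exclusive.
    """
    diff = [0] * (genome_length + 1)
    for start, end in fragments:
        diff[start] += 1   # fragment begins contributing here
        diff[end] -= 1     # and stops contributing here
    depth, running = [], 0
    for delta in diff[:genome_length]:
        running += delta
        depth.append(running)
    return depth


def protected_regions(depth, threshold):
    """Return (start, end) intervals whose depth is at least `threshold`."""
    regions, start = [], None
    for i, d in enumerate(depth):
        if d >= threshold and start is None:
            start = i
        elif d < threshold and start is not None:
            regions.append((start, i))
            start = None
    if start is not None:
        regions.append((start, len(depth)))
    return regions
```

Regions returned by `protected_regions` are candidate nucleosome sites under this simplified view; the choice of threshold is left to the user.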

30nm solenoid
The 11 nm fiber can be packaged further into 30 nm fibers, which are transcriptionally inactive. Histone H1 mediates this packaging by binding to the entry and exit sites of DNA on nucleosomes, bringing the strands together and stabilizing the interaction between nucleosomes; this way, all the DNA is protected, meaning no transcription can occur. Histone H1 packages the 11 nm fiber into a solenoid structure, whose in vivo relevance is still debated. This solenoid structure is stabilized because the positively-charged N-termini of the histones can bind to DNA on neighbouring nucleosomes. In H1-low DNA, the core histones (especially H3 and H4) are highly acetylated; this is not observed in the highly packaged 30 nm fiber. Thus, high levels of histone H1 are associated with a lack of gene transcription, and vice versa. Studies of mouse mammary tumor virus (MMTV) promoters show that those that are active (either at the basal level or induced with hormones) have lost their H1, have become associated with transcription factors such as Oct1 and TBP and with transcription machinery such as RNA Pol II, and have acetylated H3 tail domains.

Just as there are alternate nucleosome structures, alternate solenoid structures may exist. It is not entirely certain whether these solenoid structures exist at all in vivo; some have suggested that they form globules instead.

Mammalian sperm chromatin is the most condensed form of eukaryotic DNA, and approaches the physical limits of packaging. It is around 6 times as dense as mitotic chromosomes, so dense that it is essentially crystalline. In this state, the DNA is so densely packaged that no transcription, DNA replication or repair is possible. In sperm chromatin, proteins called protamines have gradually replaced histones to allow for this tighter packing. Protamines contain a large proportion of basic amino acids, whose positive charges complement the negatively-charged sugar-phosphate backbone; there are also cysteines, which aid compaction by inter- and intra-protamine crosslinking.

The histone-to-protamine transition is regulated by the ubiquitin system as well as by (de)phosphorylation events. Hyperacetylation of histone H4, and maybe of other core histones, is also involved. Between the removal of the histones and the deposition of the protamines, transition nuclear proteins (TP1 and TP2, which constitute 90% of the protein component present at this stage) are present to stabilize the DNA. The spermatid starts off as a round cell, which elongates; the histones are then replaced with TP1 and TP2, which allows for the condensation of the chromatin; as the spermatid matures further, TP1 and TP2 are replaced with protamine 1 and pre-protamine 2, which subsequently becomes protamine 2. Because the exchange of histones for protamines requires extensive chromatin remodelling, essentially no transcription can take place during this transition. In protamine-associated sperm chromatin, the DNA is no longer supercoiled. In human sperm, 10-15% of the core histones remain in the chromatin structure, although there is no H1.

Protamines range from 32 to 65 amino acids in length, but all have at least 50% (up to 75%) basic amino acids. The N-terminal region of vertebrate protamine P1 shows a conserved sequence of AR(Y/G)RXXRSRSRS. Protamines exhibit a random coil conformation in solution, but when complexed with DNA, the positively-charged arginines interact with the negatively-charged sugar-phosphate backbone and the protamines adopt a certain degree of secondary structure. One protamine is bound for each turn of dsDNA. Packing of DNA-protamine complexes into toroids (doughnut shapes) condenses the DNA further; about 60 kilobases of DNA are packaged into each toroid. Uncoiled DNA links different toroids together. The toroid structure of sperm DNA packaging has been shown in laser trapping assays.

Within 5-10 minutes after fertilization, the protamines in the packaged sperm DNA are completely replaced by nucleosomes, with the histones provided by the egg; in Xenopus oocytes, there are enough histones to package 10000 nuclei's worth of DNA. Protamines are removed by proteolytic degradation, which begins before decondensation takes place and reaches its maximum extent before decondensation is completed. This process is mediated by a set of chaperones, including nucleoplasmin (Np, which uses ATP to transfer DNA onto bound core histones, primarily H2A-H2B) and N1 (which chaperones H3-H4).

Nucleoplasmin forms a pentamer, which dimerizes into a decamer. Each side of the decamer can bind to the H2A/H2B dimer components of the core histones. The distal face of nucleoplasmin is the most probable site for binding protamines, with phosphorylation playing a part in neutralizing positive charges on nucleoplasmin to allow better binding to the protamines (which are positively charged). The protamines remain bound to nucleoplasmin after the histones are released.

Nuclear matrix (tertiary level)
The composition of the matrix can be elucidated by purifying nuclei (breaking the plasma membrane and releasing the cytoplasmic components), and then treating the nuclei with DNase I followed by extraction of histones. The resulting structure lacks chromatin and DNA, but retains the matrix components in a morphology close to that of the native matrix. Results from these experiments show that three major proteins - lamins A, B and C - dominate in the nuclear matrix, followed by topoisomerase II, matrins D, E, F, G and 4, and SC1 and SC2. RNA is also a major constituent, and treatment with RNase breaks down the matrix. The exact composition of the nuclear matrix depends on the type of cell.

Loops
The chromatin (either as the 11 nm or 30 nm fiber) attaches to the nuclear matrix periodically to form chromatin loops. In heterochromatin (tightly packaged), the loops are smaller (~558-762 bp, Q loops) than in euchromatin (~90 kbp, R loops). A loop of 50-100 kb holds several average-sized genes. Differences in loop size reflect the state of the chromatin, and contribute to the banding patterns seen in different karyotypes of chromosomes, where heterochromatin is more darkly stained.

DNA within the same loop shares a similar microenvironment; for example, all of it can be acted on by enhancers within the loop. Different loops may also serve as different replication units during replication.

The site on the DNA which attaches to the nuclear matrix is termed the scaffold/matrix attachment region (S/MAR, a.k.a. SAR or MAR). The distance between these S/MAR sites determines the loop size; thus the primary sequence of DNA may also encode secondary and tertiary structures. S/MAR sites can be identified by purifying nuclei, removing the histones using LIS (which binds to histones) and digitonin, and digesting the remaining DNA using DNases. These DNases are able to cut DNA within a loop, but unable to cleave DNA where it is attached to the nuclear matrix. The cleaved fragments and matrix-associated fragments can be separated via differential centrifugation, and the DNA sequenced to identify the S/MAR sites. S/MAR sites are probably also how insulators isolate sections of DNA from one another. S/MAR sites found near the 5' end of a gene are associated with transcriptionally active genes, whereas S/MAR sites within a gene are associated with silenced genes.
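Since loop size is set by the distance between neighbouring attachment sites, it can be derived from a list of S/MAR coordinates. The following is a minimal sketch; the function name and the example coordinates are made up for illustration.

```python
def loop_sizes(smar_positions):
    """Loop sizes (bp) as distances between consecutive S/MAR sites.

    smar_positions: genomic coordinates (bp) of attachment sites on one
    chromosome, in any order.
    """
    ordered = sorted(smar_positions)
    return [b - a for a, b in zip(ordered, ordered[1:])]


# Hypothetical coordinates, not real data.
example_sites = [250_000, 10_000, 100_000, 160_000]
example_loops = loop_sizes(example_sites)
```

With the hypothetical sites above, the loops between neighbouring attachment points are 90 kb, 60 kb and 90 kb long.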

Multi-AT-hook protein (MATH) contains many AT-hook motifs which bind to the S/MAR sites and aid in their association with the nuclear matrix. Each AT-hook domain contributes a small force in holding the DNA to the matrix, but multiple hooks over a short distance make the association a strong one.

The identification of S/MARs by bioinformatics has been difficult, as they do not seem to have a consensus sequence. They are generally characterized as being A/T-rich (with exceptions), which makes sense as there will be a kink at the position of association, and being A/T-rich allows the DNA to be flexible. Experimental approaches fall into two categories: the 'Nuclear Halo' assay and DNA-binding studies on biochemically purified nuclear matrix.

Interdigitating Layers
Recent evidence from cryo-electron microscopy also appears to support the existence of interdigitating layers of irregularly organized nucleosomes, particularly in metaphase chromosomes.

Territories
RNAP is not distributed uniformly throughout the nucleus, but is arranged in clusters of ~80 nm in diameter. Many template DNAs encoding different genes enter such a cluster of ~30 active polymerases (marked by hyperphosphorylation) to be transcribed. This allows sequences that are megabases apart to interact with each other in these clusters, which are also known as transcription factories. The promoter region of a gene comes to the RNAPII complex, and the DNA is spooled through the RNAPII complex as the transcript is elongated; this process is powered by the energy released by the hydrolysis of nucleoside triphosphates during RNA synthesis. Transcription factories are thought to remain attached to the nuclear matrix and to each other, even when no active transcription occurs.

These transcription factories contain only a small proportion of transcription proteins, with the majority being RNA-processing proteins, which makes sense as nascent RNA originates from these factories.

FISH allows us to visualize different chromosomes. It shows that different chromosomes occupy different territories of the nucleus. Chromosomes with a low gene density tend to be located more towards the nuclear edge, while chromosomes with a high gene density are located more internally.

On a side note, GC-rich, gene-dense regions (located more internally and more highly expressed) tend to replicate early in S phase, whereas AT-rich regions that are usually gene-poor replicate late, although there are exceptions.

Transcription factories are also the site of translocations of DNA from one chromosome to another, and this can cause cancer. BCR and ABL are fused in chronic myeloid leukaemia; PML and RARA in acute promyelocytic leukaemia; MYC and IGH, IGK or IGL in B cell leukaemia/lymphoma; and IGH and CCND1, BCL2 or BCL6 also in B cell leukaemia/lymphoma.

Post-translational modifications
Chemical modification of DNA and post-translational modification of histones are the subject of a branch of epigenetics; these modifications are stable and heritable.

The N-termini of histones can be post-translationally modified. Acetylation of histones leads to a more open form of chromatin; conversely, deacetylation and methylation lead to a more compact form of chromatin.

Acetylation of histones is mediated by histone acetyltransferases (HATs), and deacetylation by histone deacetylases (HDACs).

In polyglutamine repeat expansion neurodegenerative diseases, where the action of HATs is inhibited by the repeats, knocking down the levels of HDACs or increasing the levels of HAT activators will compensate for the lack of acetylation. Inhibitors of HDACs include trichostatin, SAHA and butyrate.

RNA polymerase
RNA polymerization begins at the 3' end of the template strand. The growing RNA chain acts as the primer strand (RNA polymerase does not need a separate primer): its 5' end pairs with the 3' end of the template strand, and incoming nucleotides are added onto its 3' end by forming a new phosphodiester bond with the C3' OH group. Thus, RNA polymerase works in a 5' → 3' direction in relation to the primer strand, and the end which exits the RNA polymerase first is the 5' end of the transcript.
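The directionality described above can be made concrete with a small sketch. The function name and the convention of passing the template written 3' → 5' are assumptions chosen so that the loop order mirrors the order of nucleotide addition.

```python
# DNA template base -> RNA base that pairs with it
COMPLEMENT = {"A": "U", "T": "A", "G": "C", "C": "G"}


def transcribe(template_3to5):
    """Build the transcript from a template strand written 3' -> 5'.

    The returned RNA is written 5' -> 3': the first base appended is the
    5' end, and each later base is added onto the growing 3' end.
    """
    rna = []
    for base in template_3to5.upper():
        rna.append(COMPLEMENT[base])  # new nucleotide joins the 3' end
    return "".join(rna)
```

For example, `transcribe("TACGGT")` gives `"AUGCCA"`, whose first base (A) is the 5' end that leaves the polymerase first.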

Bacterial RNA polymerase
The bacterial RNAP core enzyme has the subunit composition α2ββ'ω, where α2 and ω act by holding the whole enzyme together. The active site, which contains an essential Mg2+ metal ion, is at the bottom of a deep cleft made by the ββ' subunits.

Archaeal RNA polymerase
Archaeal RNAP is recruited to promoters through the actions of three basal transcription factors - TATA-binding protein, TFB (archaeal homolog of TFIIB), and TFE (archaeal homolog of TFIIE).

Archaeal RNA polymerases resemble eukaryotic RNAPII.

Eukaryotic RNA polymerase
The different polymerases are not named in order of their discovery. Rather, they were named after the order in which their fractions eluted in a chromatography technique.

Structure
All eukaryotic RNAPII (also called RNA polymerase b, Rpb, or Pol II) enzymes characterized so far have 12 subunits, named RPB1-RPB12. The two largest subunits - RPB1 and RPB2 - are homologous to the bacterial RNAP β' and β subunits respectively, and form a cleft with the catalytic center located at the bottom, again with two magnesium ions located there. RPB1 has a long, flexible, 650Å (and 280Å linker) C-terminal tail (CTD), exclusive to RNAPII, which consists of 52 (26 in yeast) heptapeptide repeats of YSPTSPS. The phosphorylation state of the CTD (mediated by TFIIH and occurring principally on Ser2 and Ser5, but also on tyrosine (Y) and threonine (T)) correlates with the state of the RNAP: the hypophosphorylated CTD correlates with the initiating RNAPII, whereas the hyperphosphorylated state facilitates promoter clearance and correlates with the elongating RNAPII. CTD phosphorylation also regulates the co-transcriptional events of 5' capping, assembly of spliceosomes and binding of the cleavage/polyadenylation complex, probably by allowing other factors to land on the CTD so they are ready to process the RNA as soon as it exits.
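To get a feel for the size of the CTD and where the main phosphorylation sites sit, here is a minimal sketch that builds an idealized CTD from perfect YSPTSPS repeats. This is a simplification (the sketch assumes every repeat matches the consensus exactly), and the function names are made up for illustration.

```python
HEPTAD = "YSPTSPS"  # consensus repeat: Y1 S2 P3 T4 S5 P6 S7


def build_ctd(n_repeats):
    """Idealized CTD made of n_repeats perfect consensus heptads."""
    return HEPTAD * n_repeats


def serine_sites(ctd):
    """0-based indices of Ser2 and Ser5 in every perfect heptad of ctd."""
    sites = {"Ser2": [], "Ser5": []}
    for start in range(0, len(ctd) - len(HEPTAD) + 1, len(HEPTAD)):
        if ctd[start:start + len(HEPTAD)] == HEPTAD:
            sites["Ser2"].append(start + 1)
            sites["Ser5"].append(start + 4)
    return sites


human_ctd = build_ctd(52)  # 52 repeats, as stated for the non-yeast CTD
yeast_ctd = build_ctd(26)
```

Under this idealization, 52 repeats give 364 residues and 26 repeats give 182, each repeat contributing one Ser2 and one Ser5 site.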

RPB3 and RPB11 serve a similar role to the bacterial RNAP α2 dimer and, along with RPB10 and RPB12, hold the whole assembly in place. Five subunits (RPB5, RPB6, RPB8, RPB10 and RPB12) are also shared with the other two eukaryotic RNAPs. RPB6 is similar to the bacterial ω subunit.

Duplex DNA enters the main cleft of RNAPII, where it is gripped by the 'jaws' and unwound by the zipper before it reaches the active site. This separates the template strand from the non-template strand: a fork loop directs the template strand towards the catalytic site and the non-template strand elsewhere; eventually, another fork loop directs the two strands back together. The DNA is then bent at nearly 90° to the incoming duplex DNA, which exposes the DNA and allows for the addition of new nucleoside triphosphates. A funnel leading to a negatively-charged pore in the RNAPII structure allows free nucleotides to diffuse into the active site and elongate the transcript. Incoming nucleotides are known to enter through this funnel/pore structure because blocking it with a 21-residue antibacterial compound called Microcin J25 (which has an unusual lariat-protoknot structure) inhibited transcription in bacterial RNAPs. The RNA/DNA hybrid is bound at the cleft between RPB1 and RPB2, and is bent only at the active centre. At least the first 9 residues of the RNA transcript are seen to hybridize with the DNA; the RNA then leans against a structure called the rudder, which feeds it into another structure called the lid, which forces its way between the template DNA and the transcript, thereby separating the DNA/RNA hybrid. After the transcript has been separated, it exits the RNAPII.

Crystal structures show that the enzyme can exist in two states - open or partially closed. The state depends on a 50 kDa region called the clamp. The clamp is a mobile region of RNAPII which swings over the active site and traps the DNA template once it has entered. The clamp is normally open when no transcript is present. A set of protein loops at the base of the clamp appears to act as pivots for DNA movement.

The active site contains aspartic acid (D) and glutamic acid (E) residues which hold two magnesium ions in position. Each metal has a coordination number of 6 in the state where the incoming nucleotide is bound. Metal B is coordinated by conserved residues D475 and D654, the main-chain oxygen of A476 and the three phosphate groups of the incoming nucleotide. Metal A is also ligated by D475 and D654, but also by the 3' OH of the primer strand, the α-phosphate of the incoming nucleotide and two free water molecules. Metal A activates the 3' OH of the primer strand, which then nucleophilically attacks the α-phosphate group of the incoming nucleotide and incorporates it into the transcript. Metal B then stabilizes the negatively-charged β- and γ-phosphates.

After the addition of the last nucleotide, the bridge helix is responsible for the translocation of the RNA/DNA hybrid so that the next free nucleoside triphosphate may be incorporated. The bridge helix is conserved across kingdoms and is located between RPB1 and RPB2. It is unusual for a helix this long to exist naturally. The bridge helix contains threonine and alanine residues, which make hydrophobic contacts with the bases of the template DNA; A822 of the bridge helix is conserved in all species. From modelling the bridge helix and the DNA template using X-ray diffraction data, it was originally thought that A822 is conserved because its small side chain allows DNA to slide past. However, A822R and A822K mutations actually increased activity, although by only 20-40%. This suggests that the RNAP structure is more flexible than first imagined, in ways that cannot be visualized using X-ray diffraction results.

After nucleotide addition, a kink region near the C-terminus of the bridge helix (defined as residues where proline substitution, which increases helical instability, increased catalytic activity) and its neighbouring trigger loop undergo a cooperative conformational change which pushes the DNA/RNA hybrid along. After translocation, the bridge helix may revert to its native conformation. Using X-ray diffraction, it is difficult to obtain structures of short-lived or energetically unstable states, and so the theory must also be based on high-throughput site-directed mutagenesis and molecular dynamics studies to investigate protein interactions and peptide flexibility. Molecular dynamics modelling identified a conserved residue, L824, as responsible for pushing the pyrophosphate (waste product) away from R766, which stabilizes the pyrophosphate along with the Mg2+ ion.

The incoming nucleotide is incorporated into the transcript in two steps - initial binding and selective addition. The incoming nucleotide binds to an entry (E) site beneath the active center in an inverted orientation; the nucleotide is then rotated to face the template strand and is incorporated if it base pairs with the template strand. If an incorrect nucleotide is added, the RNA transcript is able to backtrack through the pore so that it can be cleaved by TFIIS, ensuring that no mistranscribed transcript survives to be translated into malfunctioning proteins.

Other structures, such as the wall and zipper, also exist.

Electrostatic potential aids in the stabilization of DNA/RNA by RNAPII. The internal surfaces are highly positively-charged, and aid in binding the negatively-charged nucleic acids; the external surface is highly negatively-charged, repelling other DNA which is not near the catalytic site.

Core Promoter
A core promoter is the minimum DNA region required to drive a basal level of activator-dependent RNA polymerization by RNAPII in vitro. It typically extends 35 bp either side of the transcriptional start site, and each core promoter may contain multiple elements. There may also be additional gene-specific elements and enhancer regions further away (is this only a feature of eukaryotic promoters?).

In bacteria, the core promoter region is consistently at the -10 and -35 positions, and thus the promoter region can be identified easily. Core promoters in eukaryotes are highly diverse in structure, and the core promoter elements may not all be present at once in any particular core promoter (so some core promoters may have elements X and Y, whereas others may have X and Z).

The TATA box is the most prominent and well-studied core promoter element, but TATA-like elements are found in only ~24% of all promoters, and of these only ~10% have the consensus sequence of TATAWAWR. It is missing in ∼76% of human core promoters. Instead, these TATA-less promoters are G/C-rich and contain many Sp1-binding sites. In these TATA-less promoters, two motifs - M3 (SCGGAAGY) and M22 (TGCGCANK) - have been shown to be prominent and may be core promoter elements. It is thought that TATA-less genes are mainly 'housekeeping' genes, whereas TATA-containing genes are more highly regulated.

TATA box
The TATA box is a cis-regulatory core promoter element found in archaea and eukaryotes. It lies approximately 25 to 30 bp upstream of the transcriptional start site (TSS; +1) and thus is not transcribed. The TATA box consists of the consensus sequence TATAWAWR flanked by G/C-rich regions, which allow a clear distinction of the TATA box. Eukaryotic viruses, such as the adenovirus major late promoter, have some of the most active TATA box sequences.
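A consensus such as TATAWAWR uses IUPAC degenerate codes (W = A or T, R = A or G) and can be searched for with a regular expression. The sketch below is illustrative only; the function names and the convention of passing the TSS as a 0-based index are assumptions.

```python
import re

# Standard IUPAC nucleotide codes mapped to regex character classes
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "[AG]", "Y": "[CT]", "S": "[CG]", "W": "[AT]",
    "K": "[GT]", "M": "[AC]", "B": "[CGT]", "D": "[AGT]",
    "H": "[ACT]", "V": "[ACG]", "N": "[ACGT]",
}


def iupac_to_regex(motif):
    """Translate an IUPAC motif such as TATAWAWR into a regex string."""
    return "".join(IUPAC[ch] for ch in motif.upper())


def find_motif(seq, tss_index, motif="TATAWAWR"):
    """Positions of motif starts relative to the TSS (+1, no position 0).

    tss_index: 0-based index of the +1 base in seq (an assumed convention).
    """
    pattern = re.compile("(?=" + iupac_to_regex(motif) + ")")  # allow overlaps
    hits = []
    for match in pattern.finditer(seq.upper()):
        offset = match.start() - tss_index
        hits.append(offset if offset < 0 else offset + 1)
    return hits
```

A hit near -30 would be consistent with the 25 to 30 bp spacing described above, while off-consensus TATA-like elements would need a looser motif.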

The bacterial promoter contains one -10 and one -35 element; the -10 element is also AT-rich, but it is not functionally equivalent to the archaeal/eukaryotic TATA box. The bacterial -10 region lies inside the RNAP, and is there to facilitate DNA-strand separation, whereas the TATA box is much further away and does not contribute to the opening of DNA.

The TATA box has been shown to determine the initiation site (TSS). After deletion of DNA sequences between the TATA box and the original transcription start site, the TSS remained at around 25 to 30 bp downstream of the TATA box, irrespective of the sequence at the TSS; however, different sequences do produce TSSs at slightly different positions, and multiple TSSs may exist.

Other elements
Since the TATA box is missing in over three-quarters of RNAPII-transcribed genes, other elements exist to 'compensate' for this. The initiator element (Inr), DNA replication-related element factor (Dref), downstream promoter element (DPE; with consensus sequence 5' A/G-G-A/T-C/T-G/A/C 3'), MTE, DCE and XCPE1 are but a few. Inr is a very defined motif around the transcriptional start site, whereas Dref is upstream of the start site and its distribution is much broader. The distribution of DPE is also broad, but it is interesting because it is part of the transcript. The more defined elements (those with a higher positional bias) tend to be employed more as core promoter elements. These elements are recognized by TBP-associated factors (TAFs; part of TFIID, described later) - TAF2 recognizes Inr and TAF6 recognizes DPE.

The elements already described are general motifs, and there are gene-specific regulatory elements also.

In any particular gene, not all elements of the core promoter have to be present; a promoter often contains an incomplete mix of different elements.

TFIID and TBP
TFIID is a complex consisting of 10-12 protein subunits, whose exact composition is species-dependent. The TATA box is specifically recognized by the TATA-binding protein (TBP) subunit of TFIID. TBP has a horseshoe/saddle shape with near-rotational symmetry, with β-sheets lining the inner edge of the horseshoe and α-helices on the outer edge; TBP is able to bind to the TATA box alone, without any other factors. The remaining subunits of the complex are termed TBP-associated factors (TAFs), and the whole complex adopts a horseshoe shape, although the exact arrangement of TAFs around TBP still involves a lot of guesswork.

It is the β-sheets which make many contacts with the DNA and bind to it. TBP also induces a kink in the DNA, but this is energetically favourable due to two factors. First, highly-conserved phenylalanine side chains emerging from TBP interact hydrophobically with the minor groove of DNA, whose bases are now exposed due to the kink, to stabilize the kinked conformation. Second, the T/A-rich region is more flexible than G/C-rich regions, and so the TATA box is more permissive to kinks. In a study of AdMLP-related TATA boxes, some TATA boxes exhibited an intrinsic kink of as much as 19°, but this increased to 43-76° upon TBP binding. Furthermore, guanines have amine groups which protrude from the minor groove and impede contact between the saddle surface and the minor groove surface; these are absent in the G/C-free TATA box.

The specificity of binding of TBP to the TATA box comes at two levels - first, the primary sequence of the TATA box can bind specifically to TBP, and secondly, the TATA box induces a higher-order structure of DNA by bending it slightly, which attracts TBP to bind to the TATA box. These two factors allow TBP to bind specifically to TATA boxes, while allowing for heterogeneity of the TATA box sequences.

After TFIID (containing TBP) has bound to the TATA box, TFIIA and TFIIB bind to opposite ends of TFIID, which stabilizes TFIID binding. TFIIA's role is likely to be just stabilizing TFIID. TFIIB contains a helix which extends into the groove of the DNA to further stabilize the interaction, but there are also TFIIB Recognition Elements (BREs) upstream (BREu; with consensus sequence 5' G/C-G/C-G/A-C-G-C-C 3') or downstream (BREd; with consensus sequence 5' G/A-T-T/G/A-T/G-G/T-T/G-T/G 3') of the TATA box, which TFIIB recognizes and binds, adding sequence-specificity to the DNA binding. The BREu element interacts with a helix-turn-helix motif; in the BREd-interacting motif, two residues mediate this interaction - G153 and R154. G153 mediates base-specific contact and R154 exhibits a water-mediated interaction with the minor groove. The whole interaction is stabilized by the flanking residues Lys152, Arg154, Ala155 and Asn156.
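The slash-and-hyphen consensus notation used for BREu and BREd can be turned into a searchable pattern mechanically. The following is a rough sketch; the function names are hypothetical and the 5'/3' labels are assumed to have been stripped from the consensus string beforehand.

```python
import re

BREU = "G/C-G/C-G/A-C-G-C-C"        # consensus as written in the text
BRED = "G/A-T-T/G/A-T/G-G/T-T/G-T/G"


def consensus_to_regex(consensus):
    """Convert 'G/C-G/C-G/A-C' style notation into a regex string.

    Hyphens separate positions; slashes separate alternatives at a position.
    """
    parts = []
    for position in consensus.split("-"):
        bases = position.split("/")
        parts.append(bases[0] if len(bases) == 1 else "[" + "".join(bases) + "]")
    return "".join(parts)


def find_element(seq, consensus):
    """0-based start indices of all (possibly overlapping) matches in seq."""
    pattern = re.compile("(?=" + consensus_to_regex(consensus) + ")")
    return [m.start() for m in pattern.finditer(seq.upper())]
```

A sequence match alone does not show that TFIIB binds there; it only marks candidate sites to compare against the position of the TATA box.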

(Negative) Regulation
TBP autoregulates itself by homodimerization, which restricts its ability to bind promoter DNA. Thus, when there is too much TBP in the nucleus, it neutralizes itself to prevent overexpression. Point mutations at the dimer interface of TBP generated in yeast abolished dimerization, which resulted in increased gene expression. These mutants were also observed to be quickly degraded by the cell, suggesting that even wild-type TBP can only be sustained for a short while before being degraded, ensuring that no uncontrolled expression of genes takes place.

Negative cofactor 2 (NC2, a.k.a. Dr1-DRAP1; the names differ because it was discovered by different groups at the same time) is a conserved eukaryotic two-subunit negative cofactor which interacts directly with the TBP-DNA complex and inhibits the binding of TFIIB, and thus prevents the assembly of a productive transcription initiation complex. The crystal structures of NC2-bound TBP-DNA and TFIIB-bound TBP-DNA show that NC2 and TFIIB would overlap each other, and thus cannot both bind to TBP at the same time.

As previously discussed, only ~10% of TATA boxes have the consensus sequence TATAWAWR; off-consensus TATA elements do not bind TBP as strongly, which in turn affects TFIIA and TFIIB binding and initiation complex formation.
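
The distinction between consensus and off-consensus TATA elements can be illustrated with a small pattern search. This is a minimal sketch, assuming the standard IUPAC meanings of the degenerate letters (W = A or T, R = A or G); the function names and the example sequences are hypothetical and not taken from any particular tool.

```python
import re

# IUPAC codes needed for TATAWAWR (W = A or T, R = A or G).
IUPAC = {"A": "A", "C": "C", "G": "G", "T": "T", "W": "[AT]", "R": "[AG]"}

def consensus_to_pattern(consensus):
    """Translate an IUPAC consensus string into a regular expression."""
    return "".join(IUPAC[base] for base in consensus)

def find_sites(sequence, consensus="TATAWAWR"):
    """Return 0-based start positions of every match, including overlapping ones."""
    # Wrapping the pattern in a lookahead lets overlapping matches be reported.
    lookahead = re.compile("(?=" + consensus_to_pattern(consensus) + ")")
    return [m.start() for m in lookahead.finditer(sequence.upper())]

promoter = "GGGCTATAAAAGGGGTGGGCGCG"  # hypothetical example sequence
print(find_sites(promoter))    # [4]
print(find_sites("TATAAGAA"))  # [] : off-consensus, G where the consensus requires A
```

An exact pattern match only says whether a site conforms to the consensus; it says nothing about binding strength, which is why weight-matrix scoring is normally preferred for graded predictions.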

Furthermore, most TATA-like sequences are made inaccessible by being packaged into chromatin structures.

Pathology
The horseshoe/saddle structure of TBP is highly conserved in all species; however, an additional N-terminal extension may also be present in eukaryotes, including plants, and most prominently in higher eukaryotes. In these higher eukaryotes, poly-glutamine (polyQ) stretches, encoded by CAA or CAG codons, are also observed. The polyQ region is most prominent in humans, and higher eukaryotes (such as mouse and Drosophila) tend to contain larger polyQ stretches; however, polyQ expansion has major health implications. The normal polyQ stretch ranges from 25 to 42 units; expansion to >42-44 glutamines typically results in spinocerebellar ataxia type 17 (SCA17), a neurodegenerative disorder that resembles Huntington disease. It has been shown that the polyQ region interacts with TFIIB, and that overexpansion leads to abnormal interaction and reduced TBP-DNA binding, leading to neurodegeneration. The brain, predominantly the cerebellum, undergoes atrophy which can be detected using MRI scans; inclusion bodies containing ubiquitinated entities as well as aggregated TBP were detected post mortem in the nuclei of diseased brains. This results in clumsiness and loss of co-ordination, balance, gait, reflexes, eye co-ordination, and hand and tongue co-ordination. In mouse models, mutants with a high number of polyQ repeats (71 or 105) are poorly groomed and have bits of their ears missing; the lifespan of these mice shows a negative relationship with the number of repeats, in both male and female mice. In humans, cerebellar ataxia can result from genetic inheritance or spontaneous mutation; it is an autosomal dominant disease with a variable age of onset (3-75). The effect does appear to be gradual, as there is a rough inverse correlation between the number of CAG repeats and the age of onset.
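
Counting the repeat length from a coding sequence is straightforward. The following is a minimal sketch that reads the sequence in frame and reports the longest uninterrupted run of CAG/CAA codons; the function names are hypothetical, and the default threshold of 42 is simply the upper end of the normal range quoted above, not a diagnostic rule.

```python
def longest_polyq(cds):
    """Longest uninterrupted run of glutamine codons (CAG or CAA), in codons.

    `cds` is assumed to be a coding sequence that starts in frame.
    """
    cds = cds.upper()
    best = run = 0
    for i in range(0, len(cds) - len(cds) % 3, 3):
        if cds[i:i + 3] in ("CAG", "CAA"):
            run += 1
            best = max(best, run)
        else:
            run = 0  # any other codon interrupts the stretch
    return best

def is_expanded(repeat_count, threshold=42):
    """True if the repeat count exceeds the assumed upper limit of the normal range."""
    return repeat_count > threshold

toy_cds = "ATG" + "CAG" * 3 + "CAA" + "CAG" * 2 + "GGC"  # hypothetical sequence
print(longest_polyq(toy_cds))               # 6
print(is_expanded(longest_polyq(toy_cds)))  # False
```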

Alzheimer's disease (AD) is a late-onset neurodegenerative disease. AD is characterized by an accumulation of misfolded proteins into neurofibrillary tangles (predominantly composed of the microtubule-associated protein tau) and amyloid plaques (predominantly β-amyloid protein fragments). TBP with a high number of glutamine repeats often has its C-terminal DNA-binding domain missing and aggregates in the nucleus; it may become entangled in neurofibrillary tangles and contribute to AD. The cause of the missing C-terminus is often a premature stop codon; for example, the TBPex3 splice variant contains exon 3, which contains the stop codon TGA. The TBPex3 splice variant is highly conserved in all higher eukaryotes, and only when it is abnormally expressed does it cause disease. In AD patients, the TBPex3 splice variant is expressed at abnormally high levels in the middle temporal gyrus, which is involved in the recognition of known faces and in accessing word meaning while reading.

TFIIF
After TFIID has bound to the TATA box and TFIIA and TFIIB have stabilized the complex, TFIIF enters and mediates the binding of RNAPII.

TFIIE and TFIIH
After RNAPII has bound, TFIIE and TFIIH associate with the RNAP.

Regulating RNA polymerases
α-Amanitin is a planar, cyclic peptide of eight amino acids found in several species of the Amanita genus of mushrooms, including the death cap. α-Amanitin is toxic because it binds to RNAP and inhibits its function, preventing gene expression, which can lead to death. α-Amanitin has no effect on RNAPI, possibly because RNAPI is in a protected environment inside the nucleolus.

Super Core Promoters
Super core promoters are promoters designed to be very efficient in driving expression. Super core promoters have more promoter elements, and most elements conform to the consensus sequence. The first designed super core promoter, SCP1, contained a TATA box, an initiator (Inr), a motif ten element (MTE) and a downstream promoter element (DPE); SCP1 showed a higher frequency of initiation and a higher level of expression, both in vivo and in vitro, than the cytomegalovirus (CMV) IE1 and adenovirus major late (AdML) core promoters. The elements are taken from different sources (Drosophila, cytomegalovirus and adenovirus) to select the best consensus sequence; this means that the different subunits of TFIID bind to the super core promoter strongly.

Super core promoters are useful in recombinant protein production, ensuring high levels of protein production.

RNA interference (RNAi)
RNA interference is the silencing of genes using double-stranded RNA. The phenomenon was first observed in petunias, whose growers wanted flowers with a more intense colour. They approached scientists, who proposed introducing extra copies of the gene for the pigment-producing enzyme chalcone synthase (an anthocyanin pigment gene) into the cells of the flowers, so that more transcripts would be available for translation, giving a more intense colour. However, after the introduction, not only was the transgene silenced, but the endogenous gene also became silent. This effect was named cosuppression because both the transgene and the endogenous gene were silenced. It was also discovered that dsRNA was the most effective trigger of this gene silencing. Similar experiments in C. elegans showed that gene silencing is much more effective with dsRNA than with either sense or anti-sense RNA alone, because dsRNA led to the degradation of the endogenous gene transcript. This was observed again in Trypanosoma brucei, where dsRNA of the α-tubulin mRNA 5′ untranslated region (5′ UTR) led to a FAT phenotype. Furthermore, only the mature tubulin transcript was degraded and the pre-mRNA was not; thus, the degradation of the endogenous transcript might be sequence-specific to the injected RNA. Concurrent with the silencing of genes, short (~25 nt) ssRNA sequences also appear. These short RNA sequences are known as small interfering RNAs (siRNAs), which are now known to be 21-25 nt in length; these siRNAs act as guides that bind to the endogenous transcript and induce its degradation.

When dsRNA enters the cell, it is cut into smaller fragments of dsRNA (siRNA; 21-25 bp, with 2 nt 3' overhangs) by the ATP-dependent Dicer complex. The siRNA duplex binds to the RISC degradation complex, where RISC hydrolyses ATP to melt the RNA duplex and mediates the degradation of the sense-strand siRNA. The anti-sense strand then binds sequence-specifically to the endogenous RNA, and a member of the Argonaute family known as slicer (part of RISC) cleaves the target RNA in the resulting duplex.

The Dicer family consists of dsRNA-specific endonucleases which cleave dsRNA into 20-30 bp segments. Mammals have one Dicer, which produces both siRNA and miRNA, whereas Drosophila has Dicer-1 for miRNAs and Dicer-2 for siRNAs. Dicer cleaves at precise locations, to ensure the cleaved RNAs are of a set length. The dsRNA-binding region (dsRBR) lies at the extreme C-terminus of the Dicer protein, and there is a PAZ domain which binds to the end of the dsRNA and determines the length of the siRNA produced. There are two RNase III domains which dimerize so that both strands of the RNA are cleaved. Because the two RNase III domains do not cleave at the same position, the resulting siRNA will always contain a 2 nt 3' overhang. The DEXD/ATPase domain hydrolyses ATP to provide the energy required for this cleavage.
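
The geometry of the staggered cut can be made concrete with a toy model. This is an idealised sketch only: it assumes a perfectly base-paired dsRNA given as its sense strand, a fixed product length of 21 nt and an overhang of 2 nt, and it ignores how the real enzyme measures from the end bound by PAZ. The function names are hypothetical.

```python
COMPLEMENT = str.maketrans("ACGU", "UGCA")

def reverse_complement(rna):
    """Reverse complement of an RNA string written 5'->3'."""
    return rna.translate(COMPLEMENT)[::-1]

def dice(sense, length=21, overhang=2):
    """Cut a long dsRNA into siRNA duplexes with 3' overhangs.

    `sense` is the sense strand (5'->3'); the antisense strand is assumed
    to be its perfect complement. Each duplex is returned as a pair
    (sense_strand, antisense_strand), both written 5'->3'.
    """
    duplexes = []
    for start in range(overhang, len(sense) - length + 1, length):
        sense_strand = sense[start:start + length]
        # The cut on the antisense strand is offset by `overhang` nt,
        # which leaves `overhang` unpaired nt at each 3' end.
        antisense_strand = reverse_complement(
            sense[start - overhang:start + length - overhang])
        duplexes.append((sense_strand, antisense_strand))
    return duplexes

long_dsrna = ("ACGUACGUAC" * 5)[:44]  # hypothetical 44-nt sense strand
for sense_strand, antisense_strand in dice(long_dsrna):
    print(sense_strand, antisense_strand)
```

Each strand is 21 nt long, but only 19 positions are base-paired; the remaining two nucleotides at each 3' end form the overhangs.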

Dicer can also cleave miRNA precursors; the resulting miRNA binds to the target mRNA but, rather than inducing its cleavage, blocks initiation of translation. In contrast to siRNAs, miRNAs do not need to be 100% complementary to their targets. miRNAs have an important role in the development of higher eukaryotes.

There may be many different RISC complexes, but Argonaute is present in every one and defines the RISC. Fission yeast (S. pombe) contains only one AGO protein, C. elegans has 27, Drosophila has 5 and humans have 8; all Argonaute proteins have PAZ domains (like Dicer) and PIWI domains, and the dsRNA is bound between these domains in the 3D structure. The PIWI domain is an RNase H-like domain (a family with nuclease or polynucleotidyl transferase activity) and is used to cleave the target RNA. For the mRNA to bind, the siRNA/mRNA duplex must be in the A-form conformation (rather than the B form typical of DNA). PAZ holds the siRNA guide strand and guides the target mRNA into the AGO complex and past the PIWI cleavage site; the inner surface of the complex is lined with positive charges to neutralize the negative charges of the target mRNA. Cleavage occurs about 11 nucleotides in from the 3' end of the guide siRNA strand. Once the cleaved target is released, the guide-loaded RISC can hybridize with further endogenous mRNA to promote further slicing. RNA interference can be amplified by RNA-dependent RNA polymerase (RdRP). In C. elegans, RdRP binds to the siRNA/mRNA complex that is about to be cleaved by Argonaute and uses the transcript to synthesize, de novo, more siRNA fragments. These ssRNAs can then be loaded onto Argonaute to be amplified again. Alternatively, in plants, RdRP binds to the cleaved target mRNA fragments and elongates along the template to generate dsRNA; these fragments tend to be longer than 21-25 bp, and so are cleaved by Dicer into smaller, secondary siRNAs. These secondary siRNAs are derived from the target mRNA, because a single-nucleotide mutation in the primary siRNA produced secondary siRNAs lacking this mutation. In fact, secondary siRNAs constitute the majority of the siRNA population, confirming this amplification of siRNA silencing.
The secondary siRNAs produced by RdRP are likely to lie on either side of the original (primary) siRNA, and so silencing can be transitive, spreading to neighbouring regions of the transcript and silencing neighbouring genes on the same transcript. If there are transcripts which overlap, silencing can spread to those genes as well. In fact, RNAi can be induced in C. elegans simply by soaking the worms in a solution containing dsRNA, or by feeding the worms a dsRNA-expressing E. coli strain (where the dsRNA is expressed as a stem loop in which the same sequence is reversed to face itself, or is expressed using opposing promoters). The siRNA introduced through the skin and through the lumen of the intestine is able to migrate to the rest of the body. This adds to the natural protection siRNA provides against viral and parasitic genomic transposition, because the siRNA generated in response to infection can spread through the whole organism, giving cells immunity against an infection that has not reached them yet.

Small hairpin RNA (shRNA) and microRNA (miRNA) can also be cleaved by dicer to produce siRNA. These can be produced naturally by convergent transcription, duplication and inversion of gene, read-through transcription of transposons in inverted orientation, directional transcription and trans-interaction - basically anything that causes RNAs to hybridize with each other.

RNAi is a natural mechanism to prevent parasitic, viral and exogenous pathogenic nucleic acids from entering into the cell and induce ectopic expression. It also suppresses transposable elements and repetitive sequences, and is required for the gene imprinting process.

Transposable elements are sequences, moving as RNA or DNA, which are transferred from one site of a genome to another. Retrotransposons move via RNA, which requires a reverse transcriptase to reverse transcribe it into DNA that can then be inserted into the genome. DNA transposons contain terminal inverted repeats (TIRs) flanking the sequence to be transposed; these TIRs can bind to each other and excise the transposon, which can move to another site to be incorporated. Helitrons are another class of transposable element, which duplicate using a rolling-circle mechanism. It has been found that genes next to transposable elements (TEs) are often expressed in a variegated pattern. This is because the cell forms heterochromatin around the transposable elements to prevent mutations, but this heterochromatin can spread to neighbouring genes and silence them. Because many DNA transposons contain inverted repeats (TIRs), their transcripts often form stem loops containing dsRNA. These shRNAs can be bound by RISC to be degraded, thus silencing the transposable elements. RNAi can also silence transposable elements by direct chromatin modification; siRNAs can be recognized by complexes such as RITS, which hybridizes to the target locus, cleaves the RNA and also recruits CLRC, which contains histone methyltransferases that methylate chromatin at histone 3 lysine 9 (H3K9), a classic mark of heterochromatin. H3K9me is then bound by Swi6 or heterochromatin protein 1 (HP1), which maintains the DNA in the silenced state. Cells without RNAi show reduced localization of HP1 on the DNA as well as less H3K9 methylation, showing that RNAi is involved in heterochromatin generation. Alternatively, the small RNAs can recruit DNA methyltransferases which, instead of methylating histones, methylate the CpG dinucleotides of genes to silence them.
This might be a natural negative feedback mechanism where the transcription of a gene leads to the weak repression of the same gene, to prevent overexpression.

Piwi-interacting RNA (piRNA) is the largest class of small non-coding RNA molecules expressed in animal cells. piRNAs form RNA-protein complexes through interactions with Piwi proteins. Results from a two-hybrid screen have shown that Piwi interacts strongly and specifically with HP1a (and not HP1b or HP1c), and vice versa (HP1a interacts only with Piwi and not with other RNAi components).

RNA interference can be inherited from the parents, and RNAi is present in both the maternal and paternal germline.

RNAi has become a fundamental tool in genetic studies and is used to see the effect that the knockdown of a gene has on the organism. Genes across the whole genome can be knocked down using RNAi to annotate the genome with the functions of the different genes, for example when searching for genes affecting Huntington's disease, obesity, and susceptibility to viral infection.

Care must be taken that the introduced dsRNA is small (<30bp), as longer dsRNA (>500bp) activates dsRNA specific protein kinase (PKR) that induces an interferon response because the cell will think it is under attack by an RNA virus, and will inhibit global protein synthesis, giving non-specific effects. Short dsRNA do not trigger the interferon response.

siRNA is able to shuttle between the nucleus and the cytoplasm. Normally, it is exported from non-nucleolar regions of the nucleus into the cytoplasm, a process mediated by exportin-5. In the cytoplasm, it primarily silences genes through mRNA degradation, whereas in the nucleus it primarily silences genes by recruiting methylases to methylate them.

mRNA spreading from one cell to another
Specific mRNAs can be transported from one plant cell to its neighbour. While studying the transport of sucrose to the rest of the plant body, Frommer and co-workers found that sucrose is loaded into sieve element cells. They also identified the sucrose transporter protein (SUT1) on the plasma membrane of the sieve element cell, which is surprising because these cells are enucleate (lacking a nucleus) and so cannot make mRNA. In fact, in situ mRNA hybridization experiments showed that the mRNA for SUT1 is made in the sieve element companion cells and then transported to the sieve element cell to be translated. Separate experiments found that mRNA from one part of a plant can be transported to every part of the plant.

Test frequency of transcript initiation
Extract DNA, add the components of the pre-initiation complex and give it time to assemble. rNTPs are added to allow the RNAPII complex to initiate. Further initiation events are prevented by adding Sarcosyl 10 seconds after the rNTPs are added. The experiment is then stopped after a certain time to assess the amount of transcript.

Elucidating interactions of different subunits of RNAP
If we can assemble a functional RNAP from purified subunits, then we can better understand how the different subunits interact. Each subunit is individually produced as a recombinant protein in E. coli using an expression plasmid; for each subunit, one wild-type and one mutant version is produced. After the recombinant proteins are produced, they are treated with 6 M urea to denature them; the proteins are then mixed together and the urea is dialysed out to allow the proteins to renature, refold and associate with each other. This process can now be automated to identify how single amino acid mutations affect function.

High-throughput mutagenesis
Alanine scanning is a technique used to determine the contribution of a specific residue to the stability or function of a given protein; it works by sequentially substituting each residue of the protein with alanine and observing the effect of the substitution on the function of the protein. Proline scanning is another technique, which is good for structural studies because proline distorts secondary structure (especially α-helices), as rotation between the alpha-carbon and the amide group is not possible. However, with higher computing power and high-throughput techniques, it is now possible to systematically substitute every amino acid of the protein with each of the 19 other standard amino acids.
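
The two substitution schemes can be enumerated in a few lines. This is a minimal sketch using one-letter amino acid codes and the conventional mutation label (wild-type residue, 1-based position, new residue); the function names and the toy peptide are hypothetical.

```python
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard amino acids

def alanine_scan(protein):
    """Yield (label, mutant) with each non-alanine position replaced by alanine."""
    for i, wild_type in enumerate(protein):
        if wild_type != "A":
            yield f"{wild_type}{i + 1}A", protein[:i] + "A" + protein[i + 1:]

def saturation_scan(protein):
    """Yield (label, mutant) for every position and each of the 19 other residues."""
    for i, wild_type in enumerate(protein):
        for residue in AMINO_ACIDS:
            if residue != wild_type:
                yield f"{wild_type}{i + 1}{residue}", protein[:i] + residue + protein[i + 1:]

peptide = "MKAL"  # hypothetical toy sequence
print(list(alanine_scan(peptide)))
# [('M1A', 'AKAL'), ('K2A', 'MAAL'), ('L4A', 'MKAA')]
print(len(list(saturation_scan(peptide))))  # 76, i.e. 4 positions x 19 substitutions
```

The number of variants grows as 19 times the protein length, which is why saturation mutagenesis only became practical with high-throughput methods.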

With so much information, heatmaps are a good way of representing this large data set.

Computer Modelling
X-ray structures of proteins in a particular state are not always obtainable, because the conformation may be unstable or short-lived. Computer modelling allows macromolecules to be simulated by calculating the motion of all atoms in the macromolecule and their interactions. The method is limited by computing resources and the sophistication of the software, but it allows molecular motions which may last only a short time to be visualized. Modelling also allows the macromolecule to be visualized under different conditions (in aqueous or organic solvent, etc.). Computer modelling can be used not only to model the wild-type molecule, but also to visualize the effects that mutations (which may have been identified in mutagenesis studies) have on the structure or dynamics of the macromolecule. Popular software includes YASARA (Yet Another Scientific Artificial Reality Application).

In vitro
First, the nuclear components of the cell are extracted. DNA is then purified using differential centrifugation followed by chromatography (ion-exchange chromatography with a negative column, as that mimics DNA; different resins (P-11, DE-52, Heparin) and salt concentrations provide the different conditions for different steps of purification). Different techniques can then be used to elucidate the binding sites of the transcription factors identified. The gene can be cloned and transcribed in vitro using free nucleotides and highly purified transcription factors, and the reaction checked for the presence of the RNA transcript; alternatively, the promoter can be linked to a reporter and inserted into organisms, which are then tested for the presence of the gene product.

There are two methods to quantify the RNA originating from a particular DNA fragment: the primer extension assay and the 'run-off' assay. In the primer extension assay, a radiolabelled DNA oligonucleotide primer hybridizes to the RNA and is extended by reverse transcription. The products are then run on an electrophoretic gel, and the radiolabelled single-stranded cDNA is detected by autoradiography. The intensity of the band correlates with the amount of RNA and thus with the efficiency with which the transcription factor uses that promoter to initiate transcription. In the 'run-off' assay, the DNA is mutated to remove all C residues from the template strand; different transcription factors are then introduced to induce transcription, and the transcript they produce will contain no G residues. RNase T1 is an RNase which specifically hydrolyzes RNA at G residues, and thus all RNA fragments except the G-less cassette transcript will be degraded. The remaining RNA can then be analyzed and quantified on a gel.
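
The logic of the G-less cassette can be shown with a toy digestion. This sketch assumes the usual specificity of RNase T1, cutting single-stranded RNA on the 3' side of each G; the function name and the example transcripts are hypothetical.

```python
def rnase_t1_digest(rna):
    """Split an RNA string after every G, mimicking complete RNase T1 digestion."""
    fragments, current = [], ""
    for base in rna.upper():
        current += base
        if base == "G":  # cleavage on the 3' side of G
            fragments.append(current)
            current = ""
    if current:
        fragments.append(current)
    return fragments

background = "AUCGUUGCA"      # ordinary transcript containing G residues
g_less_cassette = "AUCCUUACA"  # transcript from a template lacking C

print(rnase_t1_digest(background))       # ['AUCG', 'UUG', 'CA']
print(rnase_t1_digest(g_less_cassette))  # ['AUCCUUACA'] : survives intact
```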

The advantage of an in vitro system is that everything can be stripped down to the basics; however, with increasing purity, it moves further away from the more realistic in vivo condition. Furthermore, even if transcription factors are found to bind to the DNA fragment of interest, or the reporter gene product is detected, it does not necessarily mean that the fragment contains the promoter of interest, owing to non-specific interactions.

Electrophoretic mobility shift assay (EMSA)
Electrophoretic mobility shift assay (EMSA) works on the basis that a free DNA fragment will migrate faster in a gel than the same DNA fragment bound to a protein. Thus, to test which factors bind to a particular DNA sequence, one can incubate the DNA sequence with different factors and run each sample in a different lane of the gel. Factors which bind to the DNA slow its migration, which is seen as a gel shift.

The disadvantages of this technique are that neither the binding site nor the stoichiometry of binding can be determined directly.

DNase I footprinting
The DNA of interest is cloned and labelled at one end with a radioactive molecule. The clones are then incubated with the proteins and digested with DNase I. DNase I cuts DNA molecules at random locations. The resulting mixture will be made up of radioactive and non-radioactive fragments of various lengths. The mixture is then separated by electrophoresis and visualized as an autoradiogram.

Where the protein is bound, no cleavage can occur; cleavage can only occur on either side of the binding site. Because the DNA clones are labelled at one end only, only the fragment associated with this end appears; the non-radiolabelled fragments (resulting from the cleavage) do not show on the autoradiogram. Thus, the autoradiogram will show fragment lengths ranging from the shortest all the way to the full length of the clone (where no cleavage has occurred), but no fragments ending where the protein was bound, thus creating a footprint.
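
The origin of the gap in the ladder can be reproduced with a simple enumeration. This is an idealised sketch, assuming each molecule is cut at most once and that DNase I can cut at every unprotected position with equal probability; the function names and the coordinates are hypothetical.

```python
def labelled_fragment_lengths(dna_length, protected=None):
    """Lengths of end-labelled fragments that would appear on the autoradiogram.

    A cut after base k (1-based, counted from the labelled end) releases a
    labelled fragment of length k. `protected` is an optional (first, last)
    pair of positions shielded by the bound protein.
    """
    lengths = []
    for k in range(1, dna_length):
        if protected and protected[0] <= k <= protected[1]:
            continue  # DNase I cannot reach this position
        lengths.append(k)
    lengths.append(dna_length)  # uncut, full-length molecules
    return lengths

def footprint(dna_length, protected):
    """Fragment lengths present in the free-DNA lane but missing with protein bound."""
    free = set(labelled_fragment_lengths(dna_length))
    bound = set(labelled_fragment_lengths(dna_length, protected))
    return sorted(free - bound)

print(footprint(30, (12, 18)))  # [12, 13, 14, 15, 16, 17, 18]
```

Comparing the protein lane against a free-DNA lane, as done here, is what turns a gap in the ladder into a position on the sequence.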

The transcription factor binding sites can be elucidated using DNase I footprinting, however, because the TF itself may impinge on the action of the nuclease, the size represented in the footprint is likely to be larger than the actual binding site, and the location is approximate at best. Furthermore, it does not show how many proteins or how these proteins bind if they bind to the same element.

DNA affinity chromatography
DNA can be used to select out the factors which bind to it specifically. Cells are lysed and the lysate passed through a general heparin agarose column. Heparin is very negatively charged and selects only the proteins which are able to bind negatively charged DNA; this step may be omitted or modified, as heparin also repels proteins with negative charges on their surfaces. The remaining proteins are then passed through a DNA affinity column, in which DNA sequences are attached to the beads. Proteins which can bind to this sequence specifically are retained and then eluted sequentially with increasing salt concentrations.

In vivo
Promoters/enhancers are fused with reporter genes (such as luciferase, β-galactosidase, chloramphenicol acetyltransferase (CAT) or GFP) and transfected into cells, which are then incubated for 1 to 3 days. The levels of the gene product and/or reporter mRNA are measured to quantify the efficiency of the promoter/enhancer elements.

If the host does not have the transcription factor of interest, a second expression plasmid can be introduced, which uses a promoter recognized by the host to express the transcription factor lacking in the host, which in turn drives expression from the reporter plasmid. Examples include human Sp1, which is lacking in Drosophila.

In vivo transfection assays are closest to physiological conditions, but cannot be done on a large scale. Transfection itself may induce a secondary cellular (stress) response which can alter the cell's transcription machinery (the cell might shift to transcribing other genes, or stop transcription in response to the stress).

in vitro
Random nucleotide sequences can be mixed with known transcription factors; those that bind to the factors are washed, pooled and sequenced with short-read technology. The resulting sequences can then be analysed (via alignment) to determine the consensus sequence, if any, of the target site for the transcription factor.
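
Once the bound sequences have been aligned, deriving a consensus is a column-by-column count. This is a minimal sketch, assuming the sites are already aligned and of equal length; the 60% threshold, the function name and the example sites are arbitrary choices made for illustration.

```python
from collections import Counter

def consensus(aligned_sites, threshold=0.6):
    """Majority base per column; 'N' where no base reaches the threshold."""
    result = []
    for column in zip(*aligned_sites):
        base, count = Counter(column).most_common(1)[0]
        result.append(base if count / len(column) >= threshold else "N")
    return "".join(result)

# Hypothetical bound sequences recovered from the pool, already aligned.
sites = ["TGACTCA", "TGAGTCA", "TGACTCA", "TTACTCA", "TGACTAA"]
print(consensus(sites))  # TGACTCA
```

A plain consensus discards the information about how variable each position is; a position weight matrix built from the same columns retains it.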

Enhancer deletion analysis deletes different parts of the enhancer region in turn, and then inserts the resulting sequences into a plasmid containing a constitutive promoter and a reporter gene. Cells are then transfected with this plasmid and incubated for 1 to 3 days. The cells are then harvested and the levels of the reporter transcript and reporter protein measured. Regions whose deletion does not affect the level of reporter transcript are unlikely to be enhancer modules, and vice versa.

Enhancer linker scanning inserts a small piece of DNA at various locations approximate to where the enhancer modules are. If this linker is inserted into an enhancer module, it is likely to lower the affinity of binding for the enhancer proteins, and thus reduce rate of transcription. If inserted into a non-enhancer sequence, the activity will not be affected.

Chromatin immunoprecipitation (ChIP)
Chromatin immunoprecipitation (ChIP) can be used to determine interactions between proteins and DNA. First, the protein is cross-linked to the DNA; the cells are then lysed and the DNA cut at random locations. The DNA-protein fragments are then passed through an affinity column selective for a particular protein (e.g. histones), while other sequences flow through. The bound sequences are eluted and analysed quantitatively using q-PCR or microarrays (ChIP-chip), or are sequenced (ChIPseq).

In q-PCR, one must have a DNA sequence in mind already, and must design primers for the PCR reaction, but it is quantitative. ChIPseq is also in theory quantitative, but requires many sequences. ChIP-chip is the least quantitative.

ChIP is restricted by the availability of a strong antibody. If we want to find nucleosome-associated DNA sequences, we use an antibody against histones; if we want to find S/MARs, we use antibodies against lamin. We can also find sequences associated with post-translationally modified histones by using antibodies against epigenetic marks.

Micrococcal nuclease digestion (MNase) and chemical cleavage
DNA which lies between nucleosomes is more exposed to nucleases than DNA attached to them. Thus, micrococcal nuclease (MNase) digestion or chemicals can be used to cut the linker DNA, leaving the DNA that is attached to histones. The two groups of DNA can be separated by differential centrifugation and analysed, again, by q-PCR, microarrays or sequencing.

Sequences which are most common in the nucleosome-attached group represent sequences associated with nucleosomes, and those in the other fraction represent linker DNA. Nucleosomes have preferred binding sites; their positions are stable irrespective of the state of the organism and do not shift around. Remodelling, when it does occur, occurs locally. However, sequence must not be relied upon too heavily, as studies have indicated that nucleosome phasing sequences only fine-tune the position of nucleosomes, and the major contributing factor lies beyond the DNA sequence. The locations of nucleosomes can now be mapped at single-base-pair resolution.

DamID
Make a fusion protein between dam methylase and DNA binding protein of interest. Which genomic regions become dam methylated?

Chromatin capture techniques
Determine which DNA sequences are interacting with each other in the nucleus. Cross-link proteins to DNA. Shear the DNA. Ligate a known sequence to the interacting sequences.

in silico
In silico techniques take a significantly shorter period of time, especially with higher computing power, high-throughput analysis and the availability of whole genomes.

In genome/gene databases, the exons and introns, direction of transcription and other basic information are already annotated, either manually (curated, based on experiments characterizing cDNAs, expressed sequence tags (ESTs), RNASeq and cross-genome comparisons) or automatically. The sequence can also be further analysed using bioinformatics software to identify target sites for gene-specific transcription factors, promoter sequences and other regulatory sequences such as scaffold/nuclear matrix attachment regions (S/MARs) and nucleosome phasing sequences.

Target sites for transcription factors
The binding sites for transcription factors can be elucidated because some transcription factors bind only to specific sequences. However, while many sequences match the binding sequence, they can be packaged in a way which makes them inaccessible to the transcription factor; thus, there are likely to be many false positives. Furthermore, although there may be conserved residues, variation in the other residues will still allow binding, although the variation is likely to affect the binding affinity.

There may also be 'decoy' binding sites, where the DNA sequence matches that of a transcription factor binding site, and transcription factors can bind, but once bound the TF is not able to carry out its function. Decoy binding sites act to compete for the binding of TF, keeping them away from functional enhancers/promoter regions. This ensures that surplus transcription factors are not degraded but stored, to ensure as little fluctuation in transcription factor levels as possible.

Possible binding sites may be obtained from sequence alignment programmes. However, binding sites are usually short, leading to a high level of redundancy: many PSSMs for typical eukaryotic gene-specific transcription factors show a positive result approximately every 4 kilobases, most of which will be false positives.
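
The order of magnitude is easy to check: on one strand of random sequence with equal base frequencies, one exact 6-bp site is expected by chance every 4^6 = 4096 bp, i.e. roughly every 4 kb. The sketch below shows that arithmetic together with a bare-bones log-odds PSSM scan. It assumes a uniform background of 0.25 per base and a pseudocount of 1; the function names, the example sites and the score cut-off are hypothetical.

```python
import math

def expected_spacing(site_length, alphabet_size=4):
    """Expected distance (bp) between chance matches to one exact site."""
    return alphabet_size ** site_length

def build_pssm(sites, pseudocount=1.0, background=0.25):
    """Log2-odds scoring matrix from aligned, equal-length binding sites."""
    pssm = []
    for column in zip(*sites):
        total = len(column) + 4 * pseudocount
        pssm.append({base: math.log2((column.count(base) + pseudocount) / total / background)
                     for base in "ACGT"})
    return pssm

def scan(sequence, pssm, min_score):
    """Return (position, score) for each window scoring at least min_score."""
    width = len(pssm)
    hits = []
    for i in range(len(sequence) - width + 1):
        window = sequence[i:i + width]
        score = sum(column[base] for column, base in zip(pssm, window))
        if score >= min_score:
            hits.append((i, round(score, 2)))
    return hits

print(expected_spacing(6))  # 4096

sites = ["TGACTCA", "TGAGTCA", "TGACTCA", "TTACTCA", "TGACTAA"]  # hypothetical
pssm = build_pssm(sites)
print(scan("GGGTGACTCAGGG", pssm, min_score=5.0))  # one hit, at position 3
```

Lowering `min_score` increases sensitivity but also the number of chance hits, which is the trade-off behind the false positives described above.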

Furthermore, a consensus sequence is often not enough to recruit a transcription factor. In certain cases, the transcription factor does not bind directly to the DNA sequence but to other DNA-binding proteins, or must be associated with other proteins in order to bind.

Sequence alignment
There are various types of sequence alignment: global and local, and pairwise and multiple.

Global
Global alignment is where the whole length of the query sequence is compared with the whole length of the sequences in the database. It is the simplest alignment and is useful for looking at proteins which we know are homologous, such as the same protein from different species; this allows us to find differences between the species. In drug design, this may allow us to identify a tissue-specific target for the drug.

Only sequences which are adequately similar should be placed into the alignment, as dissimilar sequences may distort the alignment of the whole set in a multiple sequence alignment.
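A global alignment can be scored with the classic dynamic-programming approach (usually called Needleman-Wunsch; the name is not used in these notes). The sketch below returns only the best score, and its match, mismatch and gap values are arbitrary illustrative choices, not those of any particular program.

```python
def global_alignment_score(a: str, b: str, match: int = 1,
                           mismatch: int = -1, gap: int = -2) -> int:
    """Best end-to-end alignment score of a and b with a linear gap cost."""
    rows, cols = len(a) + 1, len(b) + 1
    score = [[0] * cols for _ in range(rows)]
    # Aligning a prefix against nothing costs one gap per residue.
    for i in range(1, rows):
        score[i][0] = i * gap
    for j in range(1, cols):
        score[0][j] = j * gap
    for i in range(1, rows):
        for j in range(1, cols):
            diagonal = score[i - 1][j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            score[i][j] = max(diagonal,                 # align a[i-1] with b[j-1]
                              score[i - 1][j] + gap,    # gap in b
                              score[i][j - 1] + gap)    # gap in a
    return score[-1][-1]


print(global_alignment_score("GATTACA", "GATACA"))
```

Because both sequences must be aligned from end to end, every unmatched residue has to be paid for with a gap or a mismatch.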

Local
Local alignment aligns only those sub-sequences of the database entries which match the query sequence. This means only part of the sequences needs to be homologous, allowing sequences containing substantial unrelated (unaligned) subsequences to still show a hit. This is useful for identifying proteins with similar domains, or a conserved motif/sequence of nucleotides.

cDNA of known proteins can be BLASTed against the genome to find matches; this is the basis of gene annotation. The cDNA will align with the exons of the genome, with gaps where the introns are.
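Local alignment can be sketched with the Smith-Waterman recurrence, which differs from global alignment in that a cell's score is never allowed to fall below zero, so an alignment may start and stop anywhere. The scoring values below are arbitrary illustrative choices, and only the best score is returned.

```python
def local_alignment_score(a: str, b: str, match: int = 2,
                          mismatch: int = -1, gap: int = -2) -> int:
    """Best local alignment score between any sub-sequences of a and b."""
    best = 0
    previous = [0] * (len(b) + 1)
    for i in range(1, len(a) + 1):
        current = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            diagonal = previous[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            # The 0 lets the alignment restart instead of carrying a negative score.
            current[j] = max(0, diagonal, previous[j] + gap, current[j - 1] + gap)
            best = max(best, current[j])
        previous = current
    return best


# The shared GATTACA is found even though the flanking regions are unrelated.
print(local_alignment_score("AAAAGATTACACCCC", "TTTTGATTACAGGGG"))
```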

Pairwise Sequence Alignment
Pairwise sequence alignment takes a query sequence and compares it individually with another sequence, or with each sequence in a whole database, to identify sequences which are highly similar.

Multiple Sequence Alignment
Multiple Sequence Alignment (MSA) takes all the sequences in the query and tries to align them all together. It introduces gaps where appropriate to help the overall alignment. However, gaps incur penalties, and so it might be better to take a sequence out of the query if it does not align well with the rest.

BLAST
BLAST (Basic Local Alignment Search Tool) is a pairwise, local alignment program which works with nucleotide and peptide sequences and also calculates the significance of the hits. It is a faster, heuristic approximation of the Smith-Waterman algorithm. It compares a query sequence against individual entries in a database to look for similarities, and returns any sequences which the program deems significant. BLAST can also be downloaded to run queries against your own database; this is useful when the genome of the species you are studying is not on NCBI.

There are many types of BLAST searches, each comparing sequences of one form to another.

nucleotide blast and protein blast are the fastest BLAST searches. tblastn is slower because each nucleotide sequence in the database needs to be translated into a protein sequence in all 6 reading frames (3 frames per strand, times 2 because of the two strands of DNA). tblastx is even slower, as it translates both the query and the database in all 6 frames and so compares 6x6 reading frame combinations, although tblastx is the most sensitive.
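The six reading frames can be illustrated with a short sketch. It assumes the standard genetic code and an input containing only A, C, G and T; the function names and frame labels (+1 to +3, -1 to -3) are chosen here for illustration.

```python
from itertools import product

# Standard genetic code, with codons enumerated in TCAG order.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(codon): aa
               for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)}


def reverse_complement(dna: str) -> str:
    return dna.translate(str.maketrans("ACGT", "TGCA"))[::-1]


def translate(dna: str) -> str:
    """Translate complete codons only; '*' marks a stop codon."""
    return "".join(CODON_TABLE[dna[i:i + 3]] for i in range(0, len(dna) - 2, 3))


def six_frame_translation(dna: str) -> dict:
    dna = dna.upper()
    frames = {}
    for strand, sequence in (("+", dna), ("-", reverse_complement(dna))):
        for offset in range(3):          # 3 frames per strand
            frames[f"{strand}{offset + 1}"] = translate(sequence[offset:])
    return frames


print(six_frame_translation("ATGGCCTAA"))
```

Every nucleotide sequence therefore yields six candidate protein sequences, which is why translated searches are slower.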

The results page shows only the regions of each sequence which align with the query; other regions are not shown. When interpreting the results, one must therefore consider whether the hit is homologous at the whole-protein level, or just at that region/domain.

Scoring/Substitution Matrices
Substitution matrices are built by taking proteins that we know are homologous and comparing their sequences to identify differences, looking specifically for residues which are important and cannot be substituted, and residues which are important but can be substituted. The latter give us an idea of how similar two amino acids are.

From many comparisons, we can build up a scoring table (or matrix) which details the likelihood of one residue being substituted for another. This probability is combined with the abundance of the amino acids and represented numerically as a score: a positive score means the substitution can occur without changing the structure, and thus function, of the protein, and a negative score means this is unlikely. For example, in the BLOSUM62 matrix, substitution of glutamate by glutamine has a score of 2, meaning it is not unlikely; in contrast, substitution of tryptophan by aspartate has a score of -4, and is thus highly unlikely. Amino acids substituting for themselves always have a positive score, with rare amino acids such as tryptophan and cysteine given higher scores.
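The way a substitution probability is combined with amino acid abundance can be written as a log-odds score: how often a pair is observed, divided by how often it would be expected from the abundances alone. The sketch below uses made-up frequencies and treats the pair as ordered; the scale of 2 (half-bit units) follows the usual description of BLOSUM62 and should be read as an assumption here.

```python
import math


def log_odds_score(pair_frequency: float, frequency_a: float,
                   frequency_b: float, scale: float = 2.0) -> int:
    """Rounded log-odds score in units of 1/scale bits.

    pair_frequency: how often a is seen aligned with b in trusted alignments.
    frequency_a, frequency_b: background abundance of each amino acid.
    """
    expected_by_chance = frequency_a * frequency_b
    return round(scale * math.log2(pair_frequency / expected_by_chance))


# Hypothetical numbers: a pair seen twice as often as chance predicts...
print(log_odds_score(0.005, 0.05, 0.05))
# ...and a pair seen four times less often than chance predicts.
print(log_odds_score(0.000625, 0.05, 0.05))
```

A pair seen more often than its abundance predicts scores positive, and a pair seen less often scores negative, which matches the sign convention described above.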

Examples
PAM (Point Accepted Mutation) is one of the earliest sets of amino acid substitution matrices, developed by Margaret Dayhoff in 1978 using multiple sequence alignments, based on 1572 observed mutations in 71 families of closely related proteins. The number associated with the matrix corresponds to the expected number of point mutations (per 100 residues) that have occurred between the query and the database sequence. Thus a higher number implies more tolerance of sequences which are separated by a larger evolutionary distance.

BLOSUM (BLOck SUbstitution Matrix) is used for more divergent sequences, as it is generated from the multiple alignment of evolutionarily divergent proteins. It looks at blocks where there are good alignments to see which regions are conserved. Being conserved implies that a region must have an important function, and that the amino acids within this conserved block must have significance. Thus, the BLOSUM matrix bases its scoring on these conserved blocks of sequence, rather than on the whole sequence.

BLOSUM62 is the default matrix used by BLAST for protein searches.

Continued
During an alignment, the total score of the alignment is calculated by adding together the individual scores given by the substitution matrix for each aligned pair of residues. The BLAST program effectively tries every alignment and returns the alignments with the highest scores. Of course, this is computationally demanding and time-consuming, and so the program takes a few shortcuts to speed up the process; in some cases, this might mean some homologous sequences are missed.

Apart from sequence identity/similarity, the BLAST program also takes gaps into account. A gap means there is either an insertion or a deletion in one of the sequences (possibly both). Gaps are heavily penalized in the score of the alignment, and the penalty increases with the size of the gap.
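In practice the gap penalty is usually affine: a fixed cost for opening a gap plus a smaller cost for each residue it spans. A minimal sketch follows; the default values of 11 and 1 are the ones usually quoted for protein BLAST with BLOSUM62 and should be treated as an assumption here, and the function name is made up.

```python
def gap_cost(length: int, gap_open: int = 11, gap_extend: int = 1) -> int:
    """Penalty subtracted from the alignment score for one gap of this length."""
    if length <= 0:
        return 0
    return gap_open + gap_extend * length


for length in (1, 2, 5, 10):
    print(length, gap_cost(length))
```

With these values one long gap costs far less than several short ones, which reflects the idea that a single insertion or deletion event can add or remove several residues at once.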

In the results page, identity means the percentage of the residues which were identical; positive means the percentage of residues which were identical plus conservative substitutions.
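The two percentages can be reproduced for a toy alignment as follows. This is a sketch: the score table holds only the two BLOSUM62 values quoted earlier in these notes, any pair missing from the table is assumed to score 0, and both fractions are taken over the full alignment length, gaps included.

```python
# Only the two BLOSUM62 entries mentioned in the notes; a real search uses the full matrix.
SCORES = {("E", "Q"): 2, ("W", "D"): -4}


def identity_and_positives(row_a: str, row_b: str, scores: dict) -> tuple:
    """Return (identity, positives) as fractions of the alignment length."""
    if len(row_a) != len(row_b):
        raise ValueError("aligned rows must have the same length")
    identical = positive = 0
    for x, y in zip(row_a, row_b):
        if "-" in (x, y):
            continue                      # a gap is neither identical nor positive
        if x == y:
            identical += 1
            positive += 1                 # identical residues also count as positives
        elif scores.get((x, y), scores.get((y, x), 0)) > 0:
            positive += 1                 # conservative substitution
    return identical / len(row_a), positive / len(row_a)


identity, positives = identity_and_positives("MEEWK-L", "MQEDKAL", SCORES)
print(f"identity {identity:.0%}, positives {positives:.0%}")
```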

The E-value (expect value) is the most important figure when interpreting results; it describes the number of hits scoring at least this well that would be expected to appear by chance, given the score of the alignment, the length of the query and the size of the database (the expect value will tend to be better with longer queries and smaller databases; however, the database should be as large as possible to ensure comprehensiveness). It can be viewed as the background noise, and it decreases exponentially as the score increases.

The closer the E-value is to 0, the more significant the hit. The E-value threshold can be changed to allow the results page to display sequences of lower similarity.
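The relationship between score, search space and E-value is usually written as E = K·m·n·e^(-λS), where S is the raw score, m the query length and n the database length. The sketch below implements that formula; the default values of λ and K are illustrative assumptions (they depend on the matrix and gap costs), and the function names are made up.

```python
import math


def e_value(raw_score: float, query_length: float, database_length: float,
            lam: float = 0.267, k: float = 0.041) -> float:
    """Expected number of chance hits scoring at least raw_score."""
    return k * query_length * database_length * math.exp(-lam * raw_score)


def bit_score(raw_score: float, lam: float = 0.267, k: float = 0.041) -> float:
    """Normalised score, so that E = m * n * 2 ** (-bit score)."""
    return (lam * raw_score - math.log(k)) / math.log(2)


for score in (40, 60, 80):
    print(score, round(bit_score(score), 1), e_value(score, 300, 1e8))
```

Doubling the database doubles the E-value for the same score, whereas each extra point of score shrinks it by a constant factor.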

Limitations
There are different evolutionary pressures on different parts of a protein, and on different protein families. For example, an essential domain is less tolerant of mutations than accessory domains; mutations in the active site are less tolerated than mutations in the transmembrane domain. BLOSUM is generic and assumes equal evolutionary pressures, and thus an equal rate of mutation. A more accurate scoring mechanism would be one which focuses on locations within the protein which are important, such as the active site. The PSSM (position-specific scoring matrix) takes this into account.

Patterns
BLAST is suitable for aligning sequences, but the sequences alone do not tell us anything about the function of the protein. Structurally similar proteins often share little sequence similarity. As structure determines the function of a protein, finding similarities in structure might give a more accurate homology search.

Each domain or structure usually has a defined pattern in its sequence; for example, α-helical bundles usually have a hydrophobic residue roughly every 3 to 4 residues, so that the helix has a hydrophobic face which associates with other helices.

Therefore, instead of searching for sequence homology based on residue substitutions, a search can be made to find whether a pattern exists within a sequence. The pattern can be pre-defined, or it can be obtained from an MSA. From an MSA, we can identify regions within the sequences which are conserved; if we assume that these regions are of importance to the protein, then we can also assume that they form an important domain or structure within the protein. These regions can be isolated and examined for patterns, looking for which residues are absolutely conserved, and which are general (e.g. any hydrophobic or aromatic residue will do). From this a pattern is built.

A pattern is written using the following notation:
[] means any of the residues within the square brackets is equally likely
X means any residue is possible
(n) after a position is a multiplier, repeating that position n times
{} means any residue but those within the braces
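Building a pattern from the conserved columns of an MSA can be sketched as follows. The output is written in PROSITE style (lower-case x, hyphens between positions); the function name, the toy alignment and the cut-off of three alternatives per position are illustrative assumptions.

```python
def pattern_from_alignment(sequences, max_alternatives: int = 3) -> str:
    """Derive a simple pattern from equal-length, already aligned sequences."""
    elements = []
    for column in zip(*sequences):
        residues = sorted(set(column))
        if len(residues) == 1 and residues[0] != "-":
            elements.append(residues[0])                    # absolutely conserved
        elif "-" in residues or len(residues) > max_alternatives:
            elements.append("x")                            # too variable: any residue
        else:
            elements.append("[" + "".join(residues) + "]")  # a few alternatives allowed
    return "-".join(elements)


alignment = ["CAHLC", "CGHIC", "CSHVC"]   # toy alignment, not real proteins
print(pattern_from_alignment(alignment))
```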

Patterns are suitable for small motifs with relatively high sequence similarity. Both patterns and profiles can be searched online at PROSITE, a database of protein domains, families and structural motifs.
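A pattern in this notation maps almost directly onto a regular expression, which gives a quick way to search a sequence for it. The converter below is a minimal sketch that handles only single residues, x, [..], {..} and numeric repeats; the zinc-finger-style pattern and the test sequence are illustrative, not taken from a checked database entry.

```python
import re

# One pattern element: a residue, x, [ABC] or {ABC}, optionally followed by (n) or (n,m).
ELEMENT = re.compile(r"(\[[A-Z]+\]|\{[A-Z]+\}|[A-Zx])(?:\((\d+(?:,\d+)?)\))?")


def prosite_to_regex(pattern: str) -> str:
    parts = []
    for element in pattern.strip(".").split("-"):
        match = ELEMENT.fullmatch(element)
        if match is None:
            raise ValueError(f"unsupported pattern element: {element!r}")
        residues, repeat = match.groups()
        if residues in ("x", "X"):
            residues = "."                              # any residue
        elif residues.startswith("{"):
            residues = "[^" + residues[1:-1] + "]"      # anything except these
        parts.append(residues + ("{" + repeat + "}" if repeat else ""))
    return "".join(parts)


regex = prosite_to_regex("C-x(2,4)-C-x(3)-[LIVMFYWC]-x(8)-H-x(3,5)-H")
print(regex)
print(bool(re.search(regex, "MKCAACAAALAAAAAAAAHAAAHK")))
```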

Position Specific Scoring Matrices (PSSM) and PSI-BLAST
Amino acid substitution is not uniform, because of the different evolutionary pressures mentioned before. Different evolutionary pressures on different segments and residues mean that a general substitution matrix will not be accurate.

A PSSM is a scoring matrix which provides scores specific to the position of the residue. It is built by taking sequences which are known to be homologous and producing a multiple sequence alignment from them. Each score reflects the relative frequency at which a particular residue appears at that position of the multiple sequence alignment; the higher the frequency (and thus probability), the higher the score. Also, the more conserved a residue, the lower the score if it is substituted. Because the scores are position-specific, the same side-chain substitution can have different scores at different positions.
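A position-specific matrix can be sketched in a few lines. The example uses DNA so that it stays small; the aligned sites are made up, the background is assumed to be uniform, and a pseudocount of 1 is added so that residues never seen at a position still get a finite score.

```python
import math
from collections import Counter


def build_pssm(aligned_sites, alphabet: str = "ACGT", pseudocount: float = 1.0):
    """One dict of log2-odds scores per alignment column."""
    background = 1 / len(alphabet)            # assumption: uniform background
    pssm = []
    for column in zip(*aligned_sites):
        counts = Counter(column)
        total = len(column) + pseudocount * len(alphabet)
        pssm.append({residue: math.log2(((counts[residue] + pseudocount) / total) / background)
                     for residue in alphabet})
    return pssm


def score_site(pssm, site: str) -> float:
    """Sum of the position-specific scores along a candidate site."""
    return sum(column[residue] for column, residue in zip(pssm, site))


sites = ["TATAAT", "TATGAT", "TACAAT", "TATATT"]   # hypothetical aligned sites
pssm = build_pssm(sites)
print(round(score_site(pssm, "TATAAT"), 2), round(score_site(pssm, "GCGCGC"), 2))
print(pssm[0]["C"], pssm[2]["C"])   # the same residue scores differently by position
```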

Using this profile (another name for a PSSM), we can search the sequences in a database to identify those which are homologous. A PSSM search differs from BLAST because its score is biased towards important residues and domains, and so is not affected by a background score from residues which can be more freely substituted. Thus, it might be able to find less obvious sequences missed by BLAST.

PSI-BLAST is a program which first uses BLAST to search with an initial sequence; from the significant hits it then builds a multiple sequence alignment, from which a profile is derived. In subsequent iterations it searches with this PSSM to find new sequences. Any significant hits are incorporated into the existing profile, and the search is run again. The search is considered finished when there is convergence, that is, when no new sequences can be found in the database; this usually takes 3-4 iterations. PSI-BLAST is most effective if a broad range of homologous proteins from many species is used as the starting profile.

Hidden Markov Models (HMM)
We have seen the general substitution matrix, where the score is based on the amino acid regardless of position; the PSSM, whose score is based on amino acid and position; and patterns, whose score is based on amino acids relative to each other.

The Hidden Markov Model (HMM) takes all of these into account, plus a consideration for gaps, to generate the most sensitive homology search. First, it takes homologous sequences and performs a multiple sequence alignment. It then builds a probability map (from the frequencies in the MSA) of the likelihood of each possible transition from one residue to the next, taking into account preceding and following amino acids. For example, from position 102, a serine, there may be a 50% chance that it will be followed by a gap, a 20% chance that it will be followed by a hydrophobic residue, and a 30% chance that it will be followed by a deletion. A similarity score is generated by traversing the query through the probability map from the beginning of the sequence to the end, multiplying the probabilities together. If the resulting probability is above a threshold, the query can be deemed homologous.

HMMER is a program used to generate and/or search HMM databases.
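The traversal of a probability map can be sketched with a toy three-column model. All states and probabilities below are invented for illustration (M = match state, D = silent delete state), and the sketch scores one given path only; a real tool such as HMMER considers all possible paths. Log-probabilities are summed, which is equivalent to multiplying the probabilities but avoids numerical underflow on long sequences.

```python
import math

# Hypothetical transition probabilities between states.
TRANSITIONS = {
    ("start", "M1"): 0.9, ("start", "D1"): 0.1,
    ("M1", "M2"): 0.8, ("M1", "D2"): 0.2,
    ("D1", "M2"): 1.0,
    ("M2", "M3"): 0.7, ("M2", "D3"): 0.3,
    ("D2", "M3"): 1.0,
    ("M3", "end"): 1.0, ("D3", "end"): 1.0,
}
# Hypothetical emission probabilities for each match state.
EMISSIONS = {
    "M1": {"S": 0.6, "T": 0.3, "A": 0.1},
    "M2": {"L": 0.5, "I": 0.4, "V": 0.1},
    "M3": {"K": 0.7, "R": 0.3},
}


def path_log_probability(path) -> float:
    """path: list of (state, residue); residue is None for a silent delete state."""
    log_p = 0.0
    previous = "start"
    for state, residue in path:
        log_p += math.log(TRANSITIONS[(previous, state)])
        if residue is not None:
            log_p += math.log(EMISSIONS[state][residue])
        previous = state
    return log_p + math.log(TRANSITIONS[(previous, "end")])


full_match = [("M1", "S"), ("M2", "L"), ("M3", "K")]
with_deletion = [("M1", "S"), ("D2", None), ("M3", "K")]
print(math.exp(path_log_probability(full_match)))
print(math.exp(path_log_probability(with_deletion)))
```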

Overview
BLAST is the fastest but the least sensitive; patterns and profiles are more sensitive; HMMs are the most sensitive and are able to find more distantly related sequences.

Identification of biologically relevant transcription factor binding sites
Simple PSSM sequence alignment often leads to many false positive results. However, transcription factors often bind cooperatively, in a cluster with other transcription factors. Identifying binding sites that occur within clusters, alongside PSSM alignment, will give more biologically relevant transcription factor binding sites. There are also tissue-specific regulators; for example, Mef-2, SRF and MyoD binding sites are highly active in muscle cells.
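Filtering PSSM hits by clustering can be sketched as below. The input is assumed to be a list of (position, factor) hits from an earlier scan; the maximum gap of 100 bp, the requirement for at least two different factors and the example coordinates are all illustrative assumptions.

```python
def clustered_hits(hits, max_gap: int = 100, min_factors: int = 2):
    """Group hits lying within max_gap bp of their neighbour; keep mixed clusters only."""
    clusters, current = [], []
    for position, factor in sorted(hits):
        if current and position - current[-1][0] > max_gap:
            clusters.append(current)      # gap too large: close the current cluster
            current = []
        current.append((position, factor))
    if current:
        clusters.append(current)
    # Keep clusters in which several different factors could bind together.
    return [c for c in clusters if len({factor for _, factor in c}) >= min_factors]


hits = [(100, "MyoD"), (150, "Mef-2"), (190, "SRF"),
        (5000, "MyoD"), (9000, "SRF"), (9050, "SRF")]
print(clustered_hits(hits))
```

Isolated hits, and clusters made up of a single factor, are discarded as more likely to be false positives.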

Sequence logo
Consensus sequences are often represented by sequence logos, a graphical representation of the sequence conservation and variation of nucleotides (in a strand of DNA/RNA) or amino acids (in protein sequences). A sequence logo is constructed from a collection of aligned sequences, and consists of a stack of letters at each position of the sequence. The relative sizes of the letters indicate their frequency in the aligned sequences, thus representing the sequence conservation. The total height of the stack depicts the information content of the position, in bits. A sequence logo can be viewed as a graphical representation of a scoring matrix.
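The height of each stack can be calculated from one alignment column as the maximum possible entropy minus the observed entropy. The sketch below does this for DNA (maximum 2 bits) and ignores the small-sample correction that logo software normally applies; the function names are made up.

```python
import math
from collections import Counter


def information_content(column: str, alphabet_size: int = 4) -> float:
    """Information content of one alignment column, in bits."""
    n = len(column)
    entropy = -sum((c / n) * math.log2(c / n) for c in Counter(column).values())
    return math.log2(alphabet_size) - entropy


def letter_heights(column: str, alphabet_size: int = 4) -> dict:
    """Height of each letter: its frequency times the column's information content."""
    info = information_content(column, alphabet_size)
    return {base: info * count / len(column) for base, count in Counter(column).items()}


print(information_content("AAAA"))   # fully conserved position
print(information_content("ACGT"))   # no conservation
print(letter_heights("AAAT"))
```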

Phylogenetic footprinting
Bioinformatics based on the primary sequence allows us to characterize regulatory sequences and, to a certain extent, predict secondary structure, but it cannot yet reliably consider how higher-order structures, such as nucleosomes, chromatin, chromatin loops and nuclear scaffold/matrix attachment regions (S/MARs), affect regulation. However, the mode of regulation of some genes is conserved in other species, and so phylogenetic tools may allow us to identify complex regulatory sequences, on the notion that functional sequences will be more conserved than non-functional sequences. One widely used approach is to compare the human genome with the mouse genome; any regulatory region common to both is likely to have a function. However, it is often difficult to align them because of genetic drift, and the results will not be comprehensive because ~32-40% of functional regulatory sites in humans are not present in rodents, and vice versa. Thus, phylogenetic footprinting is best employed with closely-related species.
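Once two orthologous sequences have been aligned, conserved stretches can be picked out with a sliding window. The sketch below assumes the two rows are already aligned (same length, '-' for gaps); the window size, threshold and toy sequences are illustrative.

```python
def conserved_windows(aligned_a: str, aligned_b: str,
                      window: int = 10, min_identity: float = 0.8):
    """Return (start, identity) for each window at or above the identity threshold."""
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned rows must have the same length")
    hits = []
    for start in range(len(aligned_a) - window + 1):
        pairs = zip(aligned_a[start:start + window], aligned_b[start:start + window])
        # A column counts only if the bases are identical and neither is a gap.
        identity = sum(x == y and x != "-" for x, y in pairs) / window
        if identity >= min_identity:
            hits.append((start, identity))
    return hits


human = "ACGTACGTAC" + "TTTTTTTTTT"   # toy sequences, not real loci
mouse = "ACGTACGTAC" + "GGGGGGGGGG"
print(conserved_windows(human, mouse, window=10, min_identity=1.0))
```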

cis-regulatory elements
A cis-regulatory element or cis-element is a region of DNA or RNA that regulates the expression of genes located on that same molecule of DNA (often a chromosome). A cis-element may be located upstream, within, or downstream of a gene, including the untranscribed regions, exons as well as introns.

DNA and proteins/RNAs must interact specifically; they do this through electrostatic interactions, steric complementarity, hydrogen bonds, van der Waals forces and hydrophobic interactions.

Proteins which bind negatively-charged DNA usually have a positively-charged groove to bind DNA strongly, and a negatively-charged outer surface to repel other DNA; good examples of this are the papillomavirus E2 protein and the human TATA-binding protein (TBP). Electrostatic interactions provide general stability, but sequence specificity comes from

Binding of transcription factors to DNA does not unwind the DNA; instead, the TF must access the DNA sequence from outside the double helix. In a B-form DNA helix there is a 6Å-wide minor groove and a 12Å-wide major groove; a particular sequence can be accessed via either. Most transcription factors bind to the major groove more favourably because there is less repulsion from the sugar-phosphate backbone. If a factor reads the sequence through the minor groove, the hydrogen-bond acceptors and donors appear symmetrical, and so it cannot distinguish A/T from T/A, nor G/C from C/G; recognition via the major groove uses asymmetry to distinguish between A and T, and between G and C.
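The symmetry argument can be made concrete by writing out the chemical groups each base pair presents in each groove. The assignments below are the usual textbook ones (A = hydrogen-bond acceptor, D = donor, H = non-polar hydrogen, M = methyl group) and are not taken from these notes, so they should be treated as an assumption.

```python
# Groups exposed across each groove, read from the first base of the pair to the second.
MAJOR_GROOVE = {"A:T": "ADAM", "T:A": "MADA", "G:C": "AADH", "C:G": "HDAA"}
MINOR_GROOVE = {"A:T": "AHA", "T:A": "AHA", "G:C": "ADA", "C:G": "ADA"}


def distinguishable_pairs(groove: dict) -> int:
    """Number of base pairs a protein could tell apart using this groove alone."""
    return len(set(groove.values()))


print("major groove:", distinguishable_pairs(MAJOR_GROOVE))
print("minor groove:", distinguishable_pairs(MINOR_GROOVE))
```

Each major-groove pattern is unique, whereas the minor-groove patterns are palindromic, so A/T looks the same as T/A and G/C the same as C/G.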

Activation of transcription
Enhancers, Insulators, Enhancer modules, techniques

The rate of transcription of a particular gene depends on gene-specific transcription factors. In eukaryotes, because the DNA is packaged into nucleosomes and further packaged into 30nm fibers, the genes it encodes are usually already repressed; thus, most gene-specific transcription factors in eukaryotes are transcriptional activators. These activators usually act by unfolding the chromatin packaging to allow better access for the transcriptional machinery, removing repressors, and recruiting the transcriptional machinery, amongst others. In bacteria, there is a much higher proportion of transcriptional repressors.

The interactions of transcription factors are often regulated by post-translational modifications and/or by differential intracellular localization.

Gene-specific transcription factors can activate gene transcription by recruiting other factors to the initiation complex, or by stimulating phosphorylation of the RNAPII CTD and facilitating RNAPII promoter clearance.

Enhancers
Enhancers are DNA sequences which are bound by gene-specific transcription factors. An enhancer is typically several hundred base pairs in length and is a cluster of transcription factor binding sites. Some enhancers promote transcription of neighboring genes non-specifically and so are generic enhancers; examples include those bound by Sp1. Other enhancers are tissue/cell type-specific. There are proximal enhancers, which are no more than 1kb upstream of the transcriptional start site (TSS), and distal enhancers, which can be over 50kb upstream of the TSS. Distal enhancers are able to affect transcription a long way away in the primary sequence because of chromatin looping, where the distant sequence loops back and becomes spatially adjacent to the promoter. However, this might mean that one enhancer is able to activate another gene, depending on the way the chromatin is looped.

There are generally three rules which most enhancers adhere to: the gene in closest proximity to the enhancer will be activated most; the enhancer can only activate genes whose promoters are compatible with the enhancer and its transcription factors; and the presence of an insulator prevents the enhancer from affecting sequences beyond the insulator.

Insulators insulate transcription units from enhancers by looping an enhancer and the gene it modulates together, while separating other enhancers into another chromatin loop. Insulators can also flank a sequence of DNA to prevent it from taking up the conformation of neighbouring DNA. For example, a highly active gene can be found within a highly inactive DNA segment because, while the rest of the DNA is tightly packaged into 30nm fibers, the sequence flanked by the insulators remains as the 11nm fiber, which allows enhancer-promoter interaction.

Insulators are found to function in T-cell differentiation: before a T-cell matures, it must choose between bearing a γδ TCR and an αβ TCR. The enhancer module for the δ chain is in close proximity to the promoters of both the δ and α chains, and so could promote the expression of either. This is not desirable, and so there is an insulator module between the two promoters, meaning the δ-chain enhancer can only affect the δ-chain promoter.

The SV40 enhancer is a generic enhancer which contains many tandem repeats of GGGGCGGGGC, situated between the TATA box (if present) and ~200bp upstream of the TSS. For the SV40 enhancer to function, it must be bound by Sp1, which is expressed in essentially all cell types. In vertebrates, Sp1 binding sites are usually located in proximal enhancers, and so usually activate only the corresponding gene.

Enhancer regions can be detected experimentally by inserting a reporter gene at an appropriate location relative to the candidate sequence; if the sequence is in fact an enhancer, the reporter gene will be expressed and detected.

There can be multiple enhancers for the same promoter. Each enhancer has multiple binding sites, and different transcription factors can bind at these sites to regulate expression of the gene. Depending on the trans-elements present, the gene may be switched on or off. The expression patterns of certain genes during embryonic development are due to differential binding of transcription factors to these enhancers. The even skipped (eve) gene is expressed in a striped pattern along the anteroposterior axis. Where each enhancer module is active along the anteroposterior axis is determined by inserting a reporter gene at different locations in relation to the promoter, and observing in which region(s) the reporter gene is most highly expressed. The results show that enhancer modules often act independently of each other.

Medical Implications
Repeats of the codon CAG (coding for glutamine) in a gene can cause diseases such as Huntington's disease (affects the brain), Kennedy’s disease (affects the spine), Machado-Joseph disease, dentatorubral-pallidoluysian atrophy, and several spinocerebellar ataxias. The type of polyglutamine-based neurodegenerative disease depends on the location of the repeats, although their effects are similar. Proteins with polyglutamine repeats cluster and form insoluble aggregates and inclusion bodies, which have toxic effects, although some studies suggest a protective role for inclusion bodies. The polyglutamine repeat may also be cleaved into toxic fragments, which subsequently aggregate to produce similar toxic effects. These aggregates occur within the nucleus (but not the nucleolus). Furthermore, polyglutamine repeats can interact with a range of transcriptional regulators, such as the histone acetyltransferases CREB-binding protein (CBP) and p300/CBP-associated factor (p/CAF), p53 (a tumour suppressor), Sp1 (and its co-activator TAFII130) and PQBP-1, meaning the level of these free factors in the nucleus will decrease. GST pulldown assays show that the polyglutamine section of huntingtin binds to the acetyltransferase domains of CBP and p/CAF and inhibits their function. Usually, the larger the repeat (above the threshold), the earlier the onset and the greater the severity of the disease, although the size of the repeat appears to have no correlation with the rate of progression.

In yeast models, nuclear expression of an expanded polyglutamine repeat (75 repeats; 23 repeats were used as the control) led to the induction of chaperone and heat-shock factor transcription, but to the repression of other genes involved in metabolism and the transport of small metabolites; this expression profile matched that of mutants with a deletion in the histone acetyltransferase complex Spt/Ada/Gcn5 acetyltransferase (SAGA). SAGA contains GCN5, a key histone acetyltransferase (HAT), which acetylates chromatin to promote transcription; this supports the notion that polyglutamine repeats may also interact with and inhibit SAGA, leading to the repression of genes.

Kennedy’s disease is caused by CAG repeat expansion in the androgen receptor gene, leading to a polyglutamine repeat at the N-terminus of the androgen receptor protein (AR). AR is a transcription factor which stays in the cytoplasm after being translated. It binds androgenic hormones such as testosterone or dihydrotestosterone, which allows it to translocate into the nucleus to regulate gene expression. However, the polyglutamine repeats prevent the AR from interacting with coregulators, affecting gene transcription.

Recent studies have suggested that AR might also have a role in the cytoplasm by inducing the MAPK signal cascade, to promote rapid activation of kinase-signaling cascades and modulate intracellular calcium levels.

Huntington's disease (HD) is an autosomal dominant neurodegenerative disease caused by CAG repeat expansion in the huntingtin gene (at 4p16.3), leading to loss of neurons in the striatum and cortex (polyproline repeats are also found to the 3' side of the polyglutamine repeats). In patients with HD, the number of CAG repeats ranges from 37 to 100, compared to 10 to 34 in normal individuals.
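Counting the repeats in a sequence is straightforward. The sketch below finds the longest uninterrupted run of CAG and compares it with the ranges quoted above; the function names and the test sequence are made up, repeat counts between the two ranges are simply reported as unclassified, and this is an illustration rather than a diagnostic procedure.

```python
import re


def longest_cag_run(dna: str) -> int:
    """Number of repeats in the longest uninterrupted run of CAG."""
    runs = re.findall(r"(?:CAG)+", dna.upper())
    return max((len(run) // 3 for run in runs), default=0)


def classify_hd_allele(repeats: int) -> str:
    """Ranges from the notes: 37 or more in patients with HD, up to 34 in normal individuals."""
    if repeats >= 37:
        return "expanded (HD range)"
    if repeats <= 34:
        return "normal range"
    return "between the two ranges (not classified here)"


sequence = "ATGGCG" + "CAG" * 40 + "CCGCCA"   # hypothetical sequence
repeats = longest_cag_run(sequence)
print(repeats, classify_hd_allele(repeats))
```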

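But polyglutamine expansions are not the only trinucleotide repeats which can cause disease. CGG repeats cause fragile X syndrome, GAA repeats cause Friedreich's ataxia, and CTG repeats cause myotonic dystrophy. Although these triplets correspond to the codons for arginine, glutamic acid and leucine respectively, in these diseases the repeats lie in non-coding regions of the gene (the 5' UTR, an intron and the 3' UTR, respectively), and so are not translated into amino acid repeats.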

TBP polyglutamine disease/neurodegeneration

TFIIH (Xeroderma pigmentosum, Cockayne’s syndrome, and trichothiodystrophy)

Oncogenes

p53 and VP16 interactions with TAFs

High throughput assays
SAGE: colorectal cancer transcription

Tissue microarrays/laser capture dissection: molecular pathology

DNA microarrays - identify lung cancer

Artificial transcription factors
Artificial transcription factors can be engineered and introduced into cells to alter the expression of genes. If a disease is caused by the overexpression of a gene, a repressor of that gene, or an activator of a gene which opposes it, can be introduced into the cell to bring the level back down to normal. We can use this method to shut off oncogenes to treat cancer, and to shut off viral genes to treat viral infections. If a disease is caused by under- or null-expression of a gene, a transcription factor can be introduced to reinitiate transcription. β-thalassemia is a hereditary blood disorder in which the beta chains of haemoglobin are reduced or missing; it can be treated by reactivating the fetal γ-globin gene, which compensates for the lack of the healthy adult haemoglobin chain. Duchenne muscular dystrophy is caused by mutations in the dystrophin gene, which encodes dystrophin, an important component of the structure of muscle. Utrophin is a structurally and functionally similar protein whose expression may compensate for the lack of dystrophin.

Epigenetics
Chromatin modification includes DNA methylation at CpG dinucleotides and covalent histone modifications; the latter are more specific modifications to the amino-terminal tails of histones (which extend away from the core), including acetylation, methylation and phosphorylation, and they alter the availability of the gene spatially associated with the modification. Histone acetyltransferases (HATs) acetylate histones and make the gene more accessible; in contrast, histone deacetylases (HDACs) deacetylate histones and make the gene less accessible.

Nucleosome remodelling
Nucleosome remodelling alters the structure of chromatin and subsequently alters gene expression. In (constitutively) actively transcribed genes, the nucleosome is positioned away from the promoter, creating a nucleosome-depleted region (NDR, +/- 150 bp either side of the TSS) which allows activating transcription factors to bind; in an inactive gene, the nucleosome sits on the promoter region, meaning transcription cannot be initiated until remodelling has occurred, which can happen through factors binding to enhancer regions and/or the TATA box. A regular array of nucleosomes promotes the formation of higher-order structures such as the 30nm fiber, leading to tight packing of the DNA and reducing the rate of transcription of the genes packaged within it. Histone H1 stabilizes the condensed form of DNA, and thus high levels of H1 are associated with low levels of transcription. SWI/SNF remodelers slide or eject nucleosomes at promoter regions to allow activators to bind, activating transcription. ISWI chromatin remodelers space nucleosomes out into regular arrays, repressing transcription. SWR1 remodelers insert the H2A.Z variant, which is unstable and easily ejected, leading to easier and more rapid gene activation.

Heterochromatin used to be defined as a tight packing of nucleosomes, and euchromatin as a looser packing; this definition is too simple, and the two are now defined by the presence or absence of specific molecular markers. Furthermore, simply to say that genes are active in euchromatin and inactive in heterochromatin is an oversimplification, as some transcription is facilitated by heterochromatin. In addition, insulators are able to isolate a section of DNA within heterochromatin to allow its transcription.

Heterochromatin
There are different types of heterochromatin:

Constitutive heterochromatin is irreversibly silenced

Facultative heterochromatin is able to become transcriptionally active again. An example of facultative heterochromatin is the inactive X chromosome; only one X chromosome is active at any one time, and so the non-active X chromosome must remain in a heterochromatic state, but it can be reactivated.

Black chromatin - near the nuclear pores and on the nuclear lamina

The centromeres and telomeres are basically heterochromatin

Centromere
The chromatin state of the centromere is important for its function, as the correct state is required for the kinetochore to attach and separate the chromosomes. In S. pombe, the centromeres are surrounded by repetitive DNA elements (cen) which are transcribed by RNA pol II during S phase to produce noncoding cenRNAs. The centromeric DNA sequences, together with the cenRNAs, recruit histone methyltransferases to methylate H3K9, which subsequently leads to binding by HP1 and/or other complexes to maintain the centromere in a heterochromatic state. At the centromere, HP1 recruits cohesin to promote sister-chromatid cohesion. Knockdown of the RNAi pathway leads to an accumulation of cenRNA transcripts which cannot be processed; they thus cannot target the centromeric DNA sequences and induce their silencing. Additionally, H3K9 methylation is reduced; this impairs the centromere's function, and the chromosomes cannot separate properly.

Histone variants
Variant histones have evolved which serve specialized functions distinct from those of the canonical core histones. The most common examples are H3.3 and H3.1; H3.3 is associated with active gene transcription, and H3.1 is associated with gene silencing. macroH2A is associated with the inactive X chromosome (Xi) and gene repression in general; in the Xi, macroH2A is enriched in a region known as the macrochromatin body (MCB) during S phase and at the perinuclear structure centered at the centrosome.

Histone variant exchange is mediated by chaperones, each having a different specificity for the histone variant. First, a chromatin remodelling complex (e.g. the FACT complex) destabilizes the nucleosomes; the chaperones then exchange the exposed histone components. HIRA is a chaperone which recruits H3.3 and exchanges it into the nucleosomes; similarly, SWR1 specifically recruits and exchanges H2A.Z. The chromatin assembly factor-1 (CAF1) mediates the deposition of the H3.1 major core histone in a process that is coupled to DNA replication.

As histone variants serve specialized functions, they are distributed at different places along the chromosome to serve those functions. H3.3 is found on the telomeres, in pericentric heterochromatin and in actively transcribed gene bodies. CENPA facilitates kinetochore assembly and thus is found at the centromere. H2A.Z is easily ejected and thus is found near the promoters and enhancers of actively transcribed genes. H2AL is found at pericentric heterochromatin.

ncRNA
Most (>90%) of the genome is transcribed, but the large majority (~98%) of the RNAs identified do not appear to code for proteins; rRNAs and tRNAs are the most typical examples. Small nuclear RNAs (snRNAs) are part of the spliceosome and are involved in splicing of the pre-mRNA; small nucleolar RNAs (snoRNAs) guide chemical modifications of other RNAs, mainly rRNAs, tRNAs and snRNAs; telomerase RNA acts as a template for telomere elongation.

Other ncRNAs include small ncRNAs (<200nt), such as microRNA (miRNA) and Piwi-interacting RNA (piRNA), and long ncRNAs (lncRNA).

ncRNAs may be preferable to proteins as regulators because base-pairing gives them specificity for DNA sequences, while they still have a structural form with which other factors can interact.

miRNA
miRNA provides an extra level of regulation between the transcript level and the protein level.

lncRNA
The majority of lncRNAs are transcribed by RNAPII. lncRNAs can interact with regulatory protein complexes and are involved in dosage compensation (X chromosome inactivation) and imprinting (some of the genes inherited from the father behave differently from those inherited from the mother, depending on the imprinting).

lncRNAs can interfere with transcription, induce chromatin remodelling and histone modifications, affect splicing, generate endo-siRNAs, modulate protein activity, alter protein localization, or act as a primer RNA.

Evf2 is a lncRNA transcribed from the DLX-5/6 intergenic region and specifically expressed in the brain that regulates transcription in the developing forebrain. It activates gene transcription by recruiting DLX-2 to the i and ii enhancers, inducing the expression of the neighbouring gene clusters DLX5 and DLX6.

lincRNA
Long intergenic noncoding RNAs (lincRNAs) have been shown to be involved in the regulation of homeostasis and development. lncRNAs can act as molecular guides for protein complexes to dock on; act as decoys that divert these complexes away, resulting in the inhibition of transcription; act as temporary storage for miRNAs; and facilitate the action of enhancers by promoting chromatin looping nearer to the enhancers, resulting in the activation of transcription.

HOTTIP
HOTTIP is a lincRNA found at the 5' end of the HOXA gene cluster; it has been shown to act locally to activate transcription.

5C analysis of control and HOTTIP-depleted cells does not show any significant difference in higher-order chromatin structure: both show similar chromosomal looping. Thus, we can assume that HOTTIP does not act by encouraging looping. However, di/trimethylation of H3K4 (associated with activation) at local genes was lost in HOTTIP-depleted cells, and methylation of H3K27 (associated with inactivation) was increased. Thus, HOTTIP activates neighbouring genes by promoting methylation of H3K4.

H3K4 methylation is known to be carried out by the Myeloid/Lymphoid Leukemia (MLL) family; the most relevant genes for HOX expression are MLL1 and MLL2. The main component of MLL complexes is the histone-lysine N-methyltransferase HRX. These complexes interact with other proteins, such as WDR5, for substrate recognition and specific genomic targeting. Indeed, MLL1 and its associated protein WDR5 are found in abundance at the 5’ genes of distally-derived fibroblasts, in concert with increased transcription.

It is proposed that HOTTIP is transcribed and the HOTTIP RNA is brought into close proximity with its target genes through existing chromatin looping. HOTTIP RNA binds to WDR5-MLL1 complexes and promotes trimethylation of H3K4 at target genes, leading to transcriptional activation.

Thus, lincRNAs should not be viewed only as a silencing mechanism, but also as an activation mechanism.

DNA methylation
Modifications to DNA tend to be more stable than modifications to proteins, such as histones.

DNA methylation is a heritable epigenetic mark which stabilizes the repressed state of DNA when located at the transcriptional start sites of genes. Methylation does not occur before silencing of the gene; the gene is silenced first, and methylation then stabilizes the silenced state. DNA methylation correlates more with histone methylation than with the primary DNA sequence, suggesting that DNA methylation does not depend primarily on the primary sequence, and that other factors take precedence.

DNA methyltransferases (DNMTs) methylate DNA at (and only at) cytosine bases that are located 5' to a guanosine in a CpG dinucleotide; methyl-CpG-binding domain (MBD) proteins then recognize the methylated sites. This dinucleotide is not as common as first imagined, because methylated cytosine has been shown to be mutagenic, but it is abundant in CpG islands, 0.5–4 kb stretches with a high density of CpG dinucleotides. Once a CpG island is methylated, the corresponding gene is repressed. A CpG island can also repress genes by recruiting Polycomb proteins, which remodel chromatin such that epigenetic silencing of the genes takes place.
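
As a rough illustration of what "high density of CpG dinucleotides" means, the sketch below counts CpGs in a sequence and compares the count with what the C and G content alone would predict. The observed/expected ratio used here is a commonly used convention and is an assumption of this example, not something stated in these notes; the function name cpg_stats is hypothetical.

```python
def cpg_stats(seq):
    """Return length, CpG count, GC fraction and observed/expected CpG ratio."""
    s = seq.upper()
    n = len(s)
    c = s.count("C")
    g = s.count("G")
    # "CG" cannot overlap with itself, so str.count gives the exact number
    cpg = s.count("CG")
    # Expected CpG count if C and G were placed independently (assumed model)
    expected = (c * g) / n if n else 0.0
    return {
        "length": n,
        "cpg_count": cpg,
        "gc_fraction": (c + g) / n if n else 0.0,
        "obs_exp": cpg / expected if expected else 0.0,
    }


if __name__ == "__main__":
    # Toy sequences, not real genomic data
    print(cpg_stats("CGCGCGCG"))
    print(cpg_stats("ATATATAT"))
```

A CpG-depleted bulk genome would give a ratio well below 1, whereas a CpG island would give a ratio closer to 1; any actual threshold for calling an island would have to come from elsewhere.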

The state of methylation can be determined by bisulfite sequencing. Treatment of DNA with bisulfite converts cytosine residues to uracil, but leaves 5-methylcytosine residues unaffected; the difference between the untreated and treated sequence will give single-nucleotide-resolution data on the methylation status of the genome.
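
The comparison of treated and untreated sequence can be sketched in a few lines. This is a minimal toy model, assuming perfectly aligned reads of equal length and complete conversion; it also assumes that the uracil produced by bisulfite is read as T after amplification. The function names and the example sequence are hypothetical.

```python
def bisulfite_convert(seq, methylated):
    """Simulate bisulfite treatment: unmethylated C is read as T, 5mC stays C.

    methylated is a set of 0-based positions carrying 5-methylcytosine.
    """
    out = []
    for i, base in enumerate(seq.upper()):
        if base == "C" and i not in methylated:
            out.append("T")  # C -> U, assumed to be read as T after PCR
        else:
            out.append(base)
    return "".join(out)


def call_methylation(untreated, treated):
    """For every C in the untreated sequence, report whether it survived treatment."""
    if len(untreated) != len(treated):
        raise ValueError("sequences must be aligned and of equal length")
    calls = {}
    for i, (u, t) in enumerate(zip(untreated.upper(), treated.upper())):
        if u == "C":
            calls[i] = (t == "C")  # True means methylated
    return calls


if __name__ == "__main__":
    reference = "ACGTCCGA"  # toy sequence
    treated = bisulfite_convert(reference, methylated={1})
    print(treated)
    print(call_methylation(reference, treated))
```

Each cytosine gets its own call, which is where the single-nucleotide resolution comes from.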

Methylation blocks transcription initiation because many transcription factors do not bind methylated sequences. H3K4me3 and H2A.Z are antagonistic to DNA methyltransferases, and the nucleosome-depleted region (NDR) prevents methylation as there is nothing for the methyltransferase to dock on.

Defects in the correct methylation pattern can lead to genetic diseases such as Beckwith-Wiedemann syndrome, Silver–Russell syndrome, Angelman syndrome and Prader-Willi syndrome.

Histone post-translational modifications
Often a transcription factor will bind to the DNA through a DNA-binding domain (DBD) while recruiting other factors through its activator domain (AD). For example, Gcn4 binds to DNA specifically and recruits the histone acetyltransferase Gcn5, which hyperacetylates histones on their N-terminal tails. Conversely, these factors can also act to repress the gene; Ume6 recruits the histone deacetylase Sin3 complex to hypoacetylate histones on their N-terminal tails. Apart from acetylation, histones are subject to other post-translational modifications such as methylation, phosphorylation, ubiquitylation, biotinylation, SUMOylation, ADP ribosylation, deimination and proline isomerization, primarily on the N-terminal tails. These PTMs can be detected using mass spectrometry.

Heterochromatin protein 1 (HP1) is conserved across species, and consists of a chromodomain and a chromoshadow domain joined by a hinge. The chromodomain recognizes H3K9me3 (a hallmark of heterochromatin) and maintains the chromatin in a heterochromatic state.

and H4K20me3 - OFF - fairly evenly distributed

H3K9me upstream of the TSS inactivates the gene, whereas if it is downstream it activates the gene.

H3K27me inactivates the gene.

Histone modifications associated with heterochromatin can be bound by proteins such as PRC1, a Polycomb complex, and HP1 to stabilize the condensed structure. Other modifications can be read by proteins with recognition domains for those modifications; examples include the chromodomain, Tudor domain, MBT domain, PHD finger, bromodomain and 14-3-3 domain.

Methylation of H3K4, H3K36, H3K79, H4K20 - OFF

Clr4 methyltransferase complex (ClrC) is responsible for nucleation and spreading of heterochromatin. ClrC components are distributed throughout heterochromatic domains. To nucleate heterochromatin, Rik1, a WD domain–containing subunit of ClrC, is loaded onto the transcribed repeats via RNAi machinery including the RNA-induced transcriptional silencing (RITS) complex.

TFIID mediates H3K4me3 - ON

H3K4me2, H3K4me1, H3K9me1, H3ac, H4ac - ON

H3K9Ac, H3K14Ac - ON

SUMOylation - OFF

Acetylation is associated with activation because acetylated histones cannot pack together as tightly, giving a looser structure for transcription factors to bind; the looser structure also means acetylated histones are more easily displaced. Acetylation of H4K16 inhibits formation of compact 30 nm fibers.

Some PTMs only occur at certain places whereas others are pretty much ubiquitous throughout the gene. Acetylation of H3 and H4 (+) occurs upstream of the TSS, H3K27 methylation (-) occurs around the TSS, and histone SUMOylation (-) can occur ubiquitously throughout the whole gene and surrounding sequences.

One PTM can influence, or lead to, another. For example, histone H3K4 demethylation is negatively regulated by histone H3 acetylation in Saccharomyces cerevisiae, because H3K14 acetylation downregulates a histone demethylase called Jhd2. When H3-specific histone acetyltransferases were knocked down, the level of H3K4 methylation was depleted, probably due to the action of Jhd2.

Histone modifiers can be attached to RNAPII and modify the histones as the gene is being transcribed.

Nucleosome phasing sequences
Because cell types and activity states differ, the distance between nucleosomes in eukaryotic chromatin is not constant. It has also been found that different nucleosome spacings produce different chromatin folding, and thus there is likely to be a tight relationship between the two. Since chromatin folding also affects gene expression, differential nucleosome spacing may lead to differential expression. However, it was found that the DNA sequence has only "fine tuning" effects on nucleosome positioning and that the larger effect is established by other factors.

In vitro (cell-free), nucleosome spacing is determined by the electrostatic forces between the DNA and the histones. The linker DNA has its charge neutralized by free cations, or by the flexible arms of the histones.

Abortive transcription cycle
Depending on the type of promoter, only 1-5% of initiated transcripts will go to completion and be released from RNAPII; most (95-99%) of all initiated transcripts are aborted. After each abortion event, transcription must be initiated from the +1 TSS again. Most aborted transcripts are between 2 and 11 nucleotides in length; once beyond the 11th base, almost all transcripts are transcribed to completion. This behaviour is due to a mechanism called promoter scrunching.
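
To get a feel for these percentages, the sketch below converts the fraction of initiations that go to completion into the average number of initiation events needed per full-length transcript. It assumes every attempt is independent and has the same chance of success (a simple geometric model), which is an assumption of this example and not something stated in the notes; the function names are hypothetical.

```python
def initiations_per_transcript(completed_fraction):
    """Mean number of initiation events per completed transcript.

    Assumes independent attempts with a constant success probability,
    so the mean of the geometric distribution is 1 / p.
    """
    if not 0 < completed_fraction <= 1:
        raise ValueError("completed_fraction must be in (0, 1]")
    return 1.0 / completed_fraction


def aborts_per_transcript(completed_fraction):
    """Mean number of aborted attempts preceding each completed transcript."""
    return initiations_per_transcript(completed_fraction) - 1.0


if __name__ == "__main__":
    # The 1-5% range quoted in the text
    for p in (0.01, 0.05):
        print(p, initiations_per_transcript(p), aborts_per_transcript(p))
```

Under this model, the 1-5% range corresponds to roughly 100 down to 20 initiation events for each transcript that is completed.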

During initial transcription, RNAP remains attached to the promoter region and pulls in downstream DNA without releasing the promoter DNA. This causes tension as the DNA unwinds and is compacted; the tension can be used to push the RNAP away from the promoter and its associated factors, allowing promoter escape. However, the tension can also cause a jam in the active site, in which case the RNAP aborts, leaving aborted transcripts.

DNA sequence direct transcription
Using hepatocytes from an aneuploid mouse strain carrying human chromosome 21, and observing transcription factor–binding locations, landmarks of transcription initiation, and the resulting gene expression, it was found that despite the difference in epigenetic machinery, cellular environment, and transcription factors, the transcription profile of the human genes in mouse hepatocytes was very similar to the profile of those genes in human hepatocytes. Thus, the primary DNA sequence, and not the local environment, is the primary director of transcription.

DNA-binding domains
Domains which bind to DNA must be complementary to it to a certain degree with respect to shape and charge, amongst other factors. The helix-turn-helix (HTH) is a prominent DNA-binding motif in bacteria, used by the lac repressor, CAP protein, bacteriophage λ repressor and others. HTH motifs are found less prominently in eukaryotes, mostly in factors expressed during embryogenesis and regional differentiation. The HTH motif consists of two alpha-helices joined by a short turn; the second helix is known as the recognition helix, which fits into the major groove of DNA and uses its amino acid sequence to recognize specific nucleotide sequences. However, there are many variations on this simple structural model.

Another motif is the leucine zipper DNA-binding domain. Proteins which contain these domains are generally involved in the regulation of cell proliferation. Because of this property, many genes encoding leucine zipper-containing proteins are proto-oncogenes, as their mutation is likely to convert a normal cell into a cancer cell; notable examples include the c-jun/c-fos heterodimer (AP-1). This dimer contains two long alpha-helices arranged in a Y-shape, with the top of the Y binding to the major groove of DNA, stabilized by positively charged arginine and lysine side chains. The name 'leucine zipper' comes from the fact that the two alpha-helices are held together by hydrophobic interactions between regularly spaced leucines (usually every 7 residues, or ~2 turns of the helix).
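
The regular leucine spacing can be searched for directly in a protein sequence. The sketch below reports runs of leucines that recur at every seventh residue; it is a deliberately crude pattern match, assuming strict spacing and a hypothetical minimum of four leucines, and a hit does not by itself show that a real zipper forms. The function name and the test sequence are made up for illustration.

```python
def leucine_heptad_runs(protein, min_repeats=4):
    """Return (start_index, n_leucines) for each maximal run of leucines
    spaced exactly 7 residues apart (0-based start index)."""
    s = protein.upper()
    runs = []
    for start in range(len(s)):
        if s[start] != "L":
            continue
        # Skip leucines that merely continue a run starting 7 residues earlier
        if start >= 7 and s[start - 7] == "L":
            continue
        n = 0
        pos = start
        while pos < len(s) and s[pos] == "L":
            n += 1
            pos += 7
        if n >= min_repeats:
            runs.append((start, n))
    return runs


if __name__ == "__main__":
    synthetic = "LAAAAAA" * 4 + "GGG"  # artificial sequence, not a real protein
    print(leucine_heptad_runs(synthetic))
```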

Helix-loop-helix (HLH) is a DNA-binding domain found in eukaryotic transcription factors; notable examples include the Max-Max homodimer and the Max/c-myc heterodimer. The structure of an HLH domain involves two alpha-helices that cross paths with each other, held on a single amino acid chain through a loop in the middle; two such subunits then dimerize to form the HLH domain. DNA binding occurs through one side of the helix-loop-helix, and binding is stabilized, as with the leucine zipper, through positively charged arginines and lysines.

The zinc finger is a DNA-binding domain found mainly in eukaryotes. The zinc finger domain is made up of a beta-sheet connected to an alpha-helix through a turn, and held at the opposite end by a zinc atom coordinated by cysteine and/or histidine residues, usually two cysteines and two histidines (C2H2). It is the alpha-helical portion of the domain which inserts into the major groove to allow for specificity. However, each zinc finger domain can recognize at most a 3-nucleotide sequence, and so zinc finger domains typically work in tandem, with many domains linked together around the DNA to increase specificity.
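
The gain in specificity from linking fingers in tandem can be estimated with simple arithmetic. The sketch below takes up to 3 nucleotides per finger, as stated above, and estimates how often a site of that length would occur by chance on one strand. It assumes a random sequence with equal base frequencies, which real genomes do not have; the genome size in the example and the function names are hypothetical.

```python
NT_PER_FINGER = 3  # at most 3 nucleotides recognized per zinc finger


def site_length(n_fingers):
    """Length in nucleotides of the site read by n fingers in tandem."""
    return n_fingers * NT_PER_FINGER


def expected_random_sites(n_fingers, genome_length):
    """Expected chance occurrences of one specific site on one strand,
    assuming independent positions and equal base frequencies."""
    length = site_length(n_fingers)
    positions = max(genome_length - length + 1, 0)
    return positions / 4 ** length


if __name__ == "__main__":
    genome = 3_000_000_000  # illustrative genome size, an assumption
    for fingers in (1, 3, 6):
        print(fingers, site_length(fingers), expected_random_sites(fingers, genome))
```

Under these assumptions a single finger matches tens of millions of positions by chance in a genome of that size, whereas six fingers in tandem are expected to match less than once.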

Most of the examples so far place an alpha-helix into the major groove for recognition; the ribbon-helix-helix (RHH) motif places a beta-sheet into the major groove. RHH-containing proteins are involved in the uptake of metals, amino acid biosynthesis, cell division, the control of plasmid copy number, the lytic cycle of bacteriophages etc.

Estrogen receptor
The estrogen receptor (ER) is a protein normally found in the cytosol of cells. At either terminus there is an activator domain (AF1 and AF2); going towards the C-terminus from AF1, there are the DNA-binding and dimerization domain, the nuclear localization sequence and the hormone-binding domain. ER is kept in the cytoplasm by heat shock proteins (HSPs) and cyclophilins, which mask the ER's nuclear localization sequence. When it encounters estrogen, the HSPs and cyclophilins are displaced and estrogen binds the receptor via the hormone-binding domain. The estrogen-bound receptor now has its nuclear localization sequence exposed and can translocate across the nuclear membrane, where it becomes phosphorylated and homodimerized, and ultimately induces transcription. This process is essential in preparing the breasts to produce milk during and after pregnancy, preparing the uterus for the fetus, regulating cholesterol production, and maintaining homeostasis of bone turnover. However, too high an estrogen level can lead to uncontrolled cell proliferation, leading to cancer of the breast and uterus; too low a level can lead to brittle bone disease.

Estrogen-like antagonists, such as tamoxifen, are able to bind directly to the transcription factor ER and inhibit transcription directly; thus tamoxifen has been used for over 20 years to treat breast cancers that depend on estrogen (~50-60%). Taking tamoxifen for five years significantly reduces both breast cancer recurrence (42%) and mortality (22%) in those cases. However, tamoxifen cannot be used to treat estrogen-independent cancers, and ~40% of patients treated with tamoxifen will eventually develop resistance due to mutations in the ER. Alternatively, cancers can be treated by reducing the levels of estrogens with aromatase inhibitors, such as letrozole, anastrozole, or exemestane.

Tamoxifen is larger than the native ligand, estradiol; thus when tamoxifen binds, it pushes out a portion of the ER called the signalling loop on the surface of the receptor. The signalling loop is part of the activator domain (AF2), and when it loses its shape (as when tamoxifen binds), it loses its function. The ER goes from being a growth-promoting transcription factor (when estradiol binds) to a growth-inhibiting factor (when tamoxifen binds). Surprisingly, the N-terminal AD (AF1) is stimulated, with an inhibitory effect on the expression of the gene encoding TGF-α (transforming growth factor α) in breast cancer cells, and a stimulatory effect on TGF-β3, required for bone maintenance.

Types of chromatin
Types of chromatin are no longer defined simply as heterochromatin and euchromatin, or by the tightness of packing. They are now defined by the factors and epigenetic marks found on the chromatin. For ease of illustration, these groups are named after colours.

Green chromatin is characterized by HP1, and is thus heterochromatin; green chromatin is found at the pericentric part of the chromosome and is associated with both active and repressed genes.

Yellow chromatin contains mostly active housekeeping genes, such as DNA repair genes and ribosomal genes.

Red chromatin contains mostly active tissue-specific genes.

Blue chromatin is characterized by Polycomb proteins, which repress transcription and are employed in development to regulate transcription.

Black chromatin is transcriptionally silent, and is most probably associated with the nuclear lamina.

Nuclear transport
Ribosomal proteins, nucleolar proteins, histones, transcription factors, snRNPs, rRNPs, replication factors and viral genomes are produced or derived extranuclearly, and must be imported into the nucleus for action; conversely, ribosomal subunits, mRNAs, tRNAs, rRNAs, snRNAs and viral RNPs must be exported from the nucleus to function. Nuclear transport is facilitated via, and only via, the nuclear pore complex (NPC), which spans both the inner and outer nuclear membranes. Proteins smaller than 10 kDa are able to pass freely through the pore by passive diffusion; the ability to diffuse across decreases with increasing size, with an absolute limit of ~30 kDa.

The structure of the nuclear pore complex has been elucidated in yeast and humans, and is conserved across phyla. In yeast, the NPC is ~66 MDa and consists of ~50 different subunits; in humans, the NPC is ~125 MDa and consists of ~100 subunits. In either case, the NPC is asymmetrical and has a basket-shaped structure (nuclear basket) on the nuclear side, while filaments (cytoplasmic fibers) around the pore extend into the cytoplasm. The pore itself is a hydrophobic channel 10-25 nm in diameter, filled with a meshwork of fibers (central plug), which means small proteins can diffuse through but larger proteins cannot pass, as the meshwork is too tightly packed. A single NPC can mediate translocation of 1000 macromolecules, equivalent to roughly 100 MDa, every second, and macromolecules travel as fast as 0.5 µm/sec. Macromolecules destined for the nucleus have nuclear localization signals (NLSs), which are recognized by adapter proteins to initiate transport (note that the NLS does not grant access to the nucleolus, which is only accessible to the RNAPI machinery). Examples of NLSs include monopartite (PKKKRKV, recognized by the Impα/Impβ complex), bipartite (KRPAAIKKAGQAKKKK, recognized by the Impα/Impβ complex), HIV-1 Tat (arginine-rich, recognized by Impβ), and histone H1 (recognized by the Imp7/Impβ complex). Generally, any highly basic sequence can act as a good NLS.
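
Since highly basic stretches tend to act as NLSs, a first-pass search for candidate signals can simply slide a window along the protein and count lysines and arginines. The sketch below does this; the window size and threshold are arbitrary assumptions chosen so that the monopartite example above is found, and a hit is only a candidate, not evidence of import. The function name is hypothetical.

```python
BASIC = set("KR")  # lysine and arginine


def basic_windows(protein, window=7, min_basic=5):
    """Return (start, subsequence, n_basic) for every window containing
    at least min_basic K/R residues (0-based start index)."""
    s = protein.upper()
    hits = []
    for i in range(len(s) - window + 1):
        sub = s[i:i + window]
        n = sum(1 for aa in sub if aa in BASIC)
        if n >= min_basic:
            hits.append((i, sub, n))
    return hits


if __name__ == "__main__":
    print(basic_windows("PKKKRKV"))           # monopartite example from the text
    print(basic_windows("KRPAAIKKAGQAKKKK"))  # bipartite example from the text
```

With these settings the monopartite sequence is reported, but the bipartite one is not, because its basic residues are split into two clusters separated by a spacer; finding it would need a search for two shorter basic clusters, which this sketch does not attempt.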

The adapter proteins importin α, snurportin and RIP α all have an NLS-binding site and an importin β-binding site, allowing them to form a complex. Importin α is made up of a series of repeating units known as ARM repeats.

RNA products are exported using ribonucleoproteins.

Nuclear structure
The nuclear contents are enclosed by two membranes (inner and outer), and the only openings are through nuclear pore complexes (NPCs). The inner membrane is lined by the nuclear lamina, mostly consisting of a 2D mesh of lamins (60-75 kDa, strictly nuclear intermediate filaments), which forms the nuclear version of the cytoskeleton. During mitosis, the lamina is phosphorylated and this leads to the breakdown of the nucleus, highlighting its importance in maintaining nuclear structure.