LECTURES
BIOINFORMATICS
Bioinformatics is an interdisciplinary field that develops methods and software tools for understanding biological data. As an interdisciplinary field of science, bioinformatics combines computer science, statistics, mathematics, and engineering to analyze and interpret biological data. Bioinformatics has been used for in silico analyses of biological queries using mathematical and statistical techniques.
## Analysis ##
1) Analysis of gene expression
The expression of many genes can be determined by measuring mRNA levels with multiple techniques including microarrays, expressed cDNA sequence tag (EST) sequencing, serial analysis of gene expression (SAGE) tag sequencing, massively parallel signature sequencing (MPSS), RNA-Seq, also known as "Whole Transcriptome Shotgun Sequencing" (WTSS), or various applications of multiplexed in-situ hybridization. All of these techniques are extremely noise-prone and/or subject to bias in the biological measurement, and a major research area in computational biology involves developing statistical tools to separate signal from noise in high-throughput gene expression studies. Such studies are often used to determine the genes implicated in a disorder: one might compare microarray data from cancerous epithelial cells to data from non-cancerous cells to determine the transcripts that are up-regulated and down-regulated in a particular population of cancer cells.
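The cancerous versus non-cancerous comparison above can be sketched as a log2 fold-change calculation. This is a minimal illustration only: the gene names, expression values, pseudocount and threshold are all hypothetical, and a real study would add a statistical test across replicates and a multiple-testing correction to separate signal from noise.

```python
import math
from statistics import mean

def log2_fold_change(case, control, pseudocount=1.0):
    """log2 ratio of mean expression (case over control).

    The pseudocount is an assumption that avoids division by zero.
    """
    return math.log2((mean(case) + pseudocount) / (mean(control) + pseudocount))

def classify_transcripts(expression, threshold=1.0):
    """Label each gene 'up', 'down' or 'unchanged' by its log2 fold change."""
    calls = {}
    for gene, (case, control) in expression.items():
        lfc = log2_fold_change(case, control)
        if lfc >= threshold:
            calls[gene] = "up"
        elif lfc <= -threshold:
            calls[gene] = "down"
        else:
            calls[gene] = "unchanged"
    return calls

# Hypothetical expression values: (cancerous cells, non-cancerous cells).
expression = {
    "GENE_A": ([31, 31, 31], [7, 7, 7]),
    "GENE_B": ([3, 3, 3], [15, 15, 15]),
    "GENE_C": ([10, 12, 11], [11, 10, 12]),
}
calls = classify_transcripts(expression)
print(calls)
```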
2) Analysis of protein expression
Protein microarrays and high throughput (HT) mass spectrometry (MS) can provide a snapshot of the proteins present in a biological sample. Bioinformatics is central to making sense of protein microarray and HT MS data. The former approach faces problems similar to those of microarrays targeted at mRNA. The latter involves matching large amounts of mass data against predicted masses from protein sequence databases, and the complicated statistical analysis of samples in which multiple but incomplete peptides from each protein are detected.
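Matching observed masses against predicted masses from a sequence database can be sketched as follows. This is a simplified illustration under stated assumptions: the database, protein names and tolerance are hypothetical, peptides are predicted with a basic trypsin rule (cleave after K or R unless followed by P), and modifications, charge states and missed cleavages are ignored.

```python
import re

# Monoisotopic residue masses in daltons (standard values, 5 decimals).
RESIDUE_MASS = {
    "G": 57.02146, "A": 71.03711, "S": 87.03203, "P": 97.05276, "V": 99.06841,
    "T": 101.04768, "C": 103.00919, "L": 113.08406, "I": 113.08406, "N": 114.04293,
    "D": 115.02694, "Q": 128.05858, "K": 128.09496, "E": 129.04259, "M": 131.04049,
    "H": 137.05891, "F": 147.06841, "R": 156.10111, "Y": 163.06333, "W": 186.07931,
}
WATER = 18.01056  # one water per free peptide (N-terminal H plus C-terminal OH)

def peptide_mass(peptide):
    """Predicted monoisotopic mass of an unmodified peptide."""
    return sum(RESIDUE_MASS[aa] for aa in peptide) + WATER

def tryptic_peptides(protein):
    """Split after K or R, except when the next residue is P (Python 3.7+)."""
    return [p for p in re.split(r"(?<=[KR])(?!P)", protein) if p]

def match_masses(observed, database, tolerance=0.01):
    """Map each observed mass to the (protein, peptide) pairs within tolerance."""
    predicted = [(name, pep, peptide_mass(pep))
                 for name, seq in database.items()
                 for pep in tryptic_peptides(seq)]
    return {mass: [(name, pep) for name, pep, m in predicted
                   if abs(m - mass) <= tolerance]
            for mass in observed}

# Hypothetical two-protein database and two observed masses.
database = {"PROT_1": "GASKVLR", "PROT_2": "AKRPGK"}
observed = [peptide_mass("GASK"), 999.0]
hits = match_masses(observed, database)
print(hits)
```

An observed mass with an empty hit list, like the second one here, reflects the situation described above in which only some peptides of a protein are detected or matched.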
3) Analysis of regulation
Regulation is the complex orchestration of events starting with an extracellular signal such as a hormone and leading to an increase or decrease in the activity of one or more proteins. Bioinformatics techniques have been applied to explore various steps in this process. For example, promoter analysis involves the identification and study of sequence motifs in the DNA surrounding the coding region of a gene. These motifs influence the extent to which that region is transcribed into mRNA. Expression data can be used to infer gene regulation: one might compare microarray data from a wide variety of states of an organism to form hypotheses about the genes involved in each state. In a single-cell organism, one might compare stages of the cell cycle, along with various stress conditions. One can then apply clustering algorithms to that expression data to determine which genes are co-expressed. For example, the upstream regions (promoters) of co-expressed genes can be searched for over-represented regulatory elements.
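Searching the upstream regions of co-expressed genes for over-represented elements can be sketched with a simple k-mer enrichment score. The sequences below are invented, the motif is planted for illustration, and the ratio-with-pseudocount score is an assumption. Dedicated motif-finding tools use proper statistical models instead.

```python
from collections import Counter

def kmer_counts(sequences, k):
    """Count every k-mer across a list of sequences."""
    counts = Counter()
    for seq in sequences:
        for i in range(len(seq) - k + 1):
            counts[seq[i:i + k]] += 1
    return counts

def enriched_kmers(foreground, background, k=6, pseudocount=1.0):
    """Rank k-mers by their frequency in the foreground relative to the background."""
    fg = kmer_counts(foreground, k)
    bg = kmer_counts(background, k)
    fg_total = sum(fg.values()) + pseudocount
    bg_total = sum(bg.values()) + pseudocount
    scores = {
        kmer: ((n + pseudocount) / fg_total) / ((bg[kmer] + pseudocount) / bg_total)
        for kmer, n in fg.items()
    }
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)

# Hypothetical promoters of co-expressed genes (each contains TGACTC)
# and hypothetical promoters of unrelated genes used as background.
coexpressed = ["AAATGACTCAAA", "CCTGACTCGGTA", "GTTGACTCCTAG"]
background = ["AAACCCGGGTTT", "ACGTACGTACGT", "GGTACCTAGCAA"]
ranking = enriched_kmers(coexpressed, background)
print(ranking[0])
```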
GENOMICS
Genomics is a discipline in genetics that applies recombinant DNA, DNA sequencing methods, and bioinformatics to sequence, assemble, and analyze the function and structure of genomes. Advances in genomics have triggered a revolution in discovery-based research to understand even the most complex biological systems such as the brain. The field includes efforts to determine the entire DNA sequence of organisms and fine-scale genetic mapping. The field also includes studies of intragenomic phenomena such as heterosis, epistasis, pleiotropy and other interactions between loci and alleles within the genome. In contrast, the investigation of the roles and functions of single genes is a primary focus of molecular biology or genetics and is a common topic of modern medical and biological research. Research on single genes does not fall within the definition of genomics unless the aim of this genetic, pathway, and functional information analysis is to elucidate its effect on, place in, and response to the entire genome's networks.
## Analysis ##
1) Sequencing
Historically, sequencing was done in sequencing centers: centralized facilities (ranging from large independent institutions such as the Joint Genome Institute, which sequences dozens of terabases a year, to local molecular biology core facilities) that contain research laboratories with the costly instrumentation and technical support necessary. As sequencing technology continues to improve, however, a new generation of effective fast-turnaround benchtop sequencers has come within reach of the average academic laboratory. On the whole, genome sequencing approaches fall into two broad categories: shotgun and high-throughput (also known as next-generation) sequencing.
2) Assembly
Sequence assembly refers to aligning and merging fragments of a much longer DNA sequence in order to reconstruct the original sequence. This is needed as current DNA sequencing technology cannot read whole genomes as a continuous sequence, but rather reads small pieces of between 20 and 1000 bases, depending on the technology used. Typically the short fragments, called reads, result from shotgun sequencing genomic DNA, or gene transcripts.
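The idea of aligning and merging reads can be illustrated with a greedy overlap merger. This is a toy sketch, not how production assemblers work: it assumes error-free reads from one strand with no repeats, and the reads and minimum overlap length are hypothetical.

```python
def overlap(a, b, min_len=3):
    """Length of the longest suffix of a that equals a prefix of b."""
    for n in range(min(len(a), len(b)), min_len - 1, -1):
        if a.endswith(b[:n]):
            return n
    return 0

def greedy_assemble(reads, min_len=3):
    """Repeatedly merge the pair of reads with the longest overlap."""
    reads = list(dict.fromkeys(reads))  # drop duplicate reads, keep order
    while len(reads) > 1:
        best_len, best_i, best_j = 0, None, None
        for i, a in enumerate(reads):
            for j, b in enumerate(reads):
                if i != j:
                    n = overlap(a, b, min_len)
                    if n > best_len:
                        best_len, best_i, best_j = n, i, j
        if best_len == 0:
            break  # no remaining overlaps: leave separate contigs
        merged = reads[best_i] + reads[best_j][best_len:]
        reads = [r for k, r in enumerate(reads) if k not in (best_i, best_j)]
        reads.append(merged)
    return reads

# Hypothetical reads sampled from the sequence ATGGCGTACGTTAGC.
contigs = greedy_assemble(["ACGTTAGC", "ATGGCGTA", "GCGTACGT"])
print(contigs)
```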
3) Annotation
The DNA sequence assembly alone is of little value without additional analysis. Genome annotation is the process of attaching biological information to sequences, and consists of three main steps: ① identifying portions of the genome that do not code for proteins, ② identifying elements on the genome, a process called gene prediction, and ③ attaching biological information to these elements.
Automatic annotation tools try to perform these steps in silico, as opposed to manual annotation which involves human expertise and potential experimental verification. Ideally, these approaches co-exist and complement each other in the same annotation pipeline. Traditionally, the basic level of annotation is using BLAST for finding similarities, and then annotating genomes based on homologues. More recently, additional information is added to the annotation platform. The additional information allows manual annotators to deconvolute discrepancies between genes that are given the same annotation. Some databases use genome context information, similarity scores, experimental data, and integrations of other resources to provide genome annotations through their Subsystems approach. Other databases rely on both curated data sources as well as a range of software tools in their automated genome annotation pipeline. Structural annotation consists of the identification of genomic elements, primarily ORFs and their localisation, or gene structure. Functional annotation consists of attaching biological information to genomic elements.
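Structural annotation by ORF identification can be sketched as a scan for start and stop codons in each reading frame. This is a minimal sketch under simplifying assumptions: only the forward strand is scanned, only ATG is treated as a start codon, and the example sequence and minimum length are hypothetical. Real gene predictors also consider the reverse strand, splicing and statistical sequence models.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}

def find_orfs(seq, min_codons=3):
    """Return (start, end, sequence) for ORFs on the forward strand.

    Coordinates are 0-based and end-exclusive. The stop codon is included.
    """
    seq = seq.upper()
    orfs = []
    for frame in range(3):
        start = None
        for i in range(frame, len(seq) - 2, 3):
            codon = seq[i:i + 3]
            if start is None and codon == "ATG":
                start = i  # first start codon opens the ORF
            elif start is not None and codon in STOP_CODONS:
                if (i + 3 - start) // 3 >= min_codons:
                    orfs.append((start, i + 3, seq[start:i + 3]))
                start = None  # look for the next ORF in this frame
    return sorted(orfs)

# Hypothetical sequence with one ORF in the third reading frame.
orfs = find_orfs("CCATGAAATTTTAGGG")
print(orfs)
```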
TRANSCRIPTOMICS
Transcriptomics is the study of the transcriptome - the full set of RNA transcripts produced under specific circumstances in one cell or a population of cells - using high-throughput methods such as microarray analysis.
## Analysis ##
A number of organism-specific transcriptome databases have been constructed and annotated to aid in the identification of genes that are differentially expressed in distinct cell populations. RNA-Seq is emerging as the method of choice for measuring transcriptomes of organisms, though the older technique of DNA microarrays is still used.
1) RNA-Seq
RNA-Seq (RNA sequencing), also called whole transcriptome shotgun sequencing, uses next-generation sequencing to reveal the presence and quantity of RNA in a biological sample at a given moment in time. RNA-Seq is used to analyze the continually changing cellular transcriptome. Specifically, RNA-Seq makes it possible to look at alternatively spliced transcripts, post-transcriptional modifications, gene fusions, mutations, SNPs and changes in gene expression. In addition to mRNA transcripts, RNA-Seq can look at different populations of RNA, including total RNA and small RNA such as miRNA and tRNA, and can be used for ribosomal profiling. RNA-Seq can also be used to determine exon/intron boundaries and verify or amend previously annotated 5’ and 3’ gene boundaries. Prior to RNA-Seq, gene expression studies were done with hybridization-based microarrays. Issues with microarrays include cross-hybridization artifacts, poor quantification of lowly and highly expressed genes, and the need to know the sequence in advance. Because of these technical issues, transcriptomics transitioned to sequencing-based methods. These progressed from Sanger sequencing of Expressed Sequence Tag libraries, to chemical tag-based methods and finally to the current technology, NGS of cDNA.
2) DNA microarray
A DNA microarray is a collection of microscopic DNA spots attached to a solid surface. Scientists use DNA microarrays to measure the expression levels of large numbers of genes simultaneously or to genotype multiple regions of a genome. Each DNA spot contains picomoles of a specific DNA sequence, known as a probe. A probe can be a short section of a gene or other DNA element that is used to hybridize a cDNA or cRNA (also called anti-sense RNA) sample under high-stringency conditions. Probe-target hybridization is usually detected and quantified by detection of fluorophore-, silver-, or chemiluminescence-labeled targets to determine the relative abundance of nucleic acid sequences in the target.
EPIGENOMICS
Epigenomics is the study of the complete set of epigenetic modifications on the genetic material of a cell, known as the epigenome.
{ Epigenetic modifications = genomic modifications that alter gene expression, that cannot be attributed to modification
of the primary DNA sequence, and that are heritable mitotically and meiotically }
## Two major types of epigenomic modifications ##
1) DNA methylation
DNA methylation is the process by which a methyl group is added to DNA. The reaction is catalyzed by enzymes called DNA methyltransferases (DNMTs). In eukaryotes, methylation is most commonly found on the carbon 5 position of cytosine residues (5mC) adjacent to guanine. DNA methylation patterns vary greatly between species and even within the same organism.
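Cytosines adjacent to guanine (CpG sites) are the candidate positions for this kind of methylation, and they can be located with a simple scan. The sketch below is illustrative only: the sequence and the set of methylated positions are hypothetical, and only one strand is considered.

```python
def cpg_positions(seq):
    """0-based positions of cytosines that are immediately followed by guanine."""
    seq = seq.upper()
    return [i for i in range(len(seq) - 1) if seq[i:i + 2] == "CG"]

def methylated_fraction(seq, methylated):
    """Fraction of CpG cytosines that appear in a set of methylated positions."""
    sites = cpg_positions(seq)
    if not sites:
        return 0.0
    return sum(1 for pos in sites if pos in methylated) / len(sites)

# Hypothetical sequence, with hypothetical methylation calls at positions 1 and 8.
sequence = "ACGTCGGCCG"
sites = cpg_positions(sequence)
fraction = methylated_fraction(sequence, {1, 8})
print(sites, fraction)
```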
2) Histone modification
In eukaryotes, genomic DNA is coiled into protein-DNA complexes called chromatin. Histones, which are the most prevalent type of protein found in chromatin, function to condense the DNA; the net positive charge on histones facilitates their bonding with DNA, which is negatively charged. The basic and repeating units of chromatin, nucleosomes, consist of an octamer of histone proteins. Many different types of histone modification are known, including acetylation, methylation, phosphorylation and ubiquitination. The DNA region where histone modification occurs can elicit different effects. Histone modifications regulate gene expression by two mechanisms: by disrupting the contact between nucleosomes and by recruiting chromatin remodeling ATPases.
## Epigenomic methods ##
1) Histone modification assay
The cellular processes of transcription, DNA replication and DNA repair involve the interaction between genomic DNA and nuclear proteins. It had been known that certain regions within chromatin were extremely susceptible to digestion by DNase I, which cleaves DNA with low sequence specificity. Such hypersensitive sites were thought to be transcriptionally active regions, as evidenced by their association with RNA polymerase and topoisomerases I and II. It is now known that DNase I-sensitive regions correspond to regions of chromatin with loose DNA-histone association. Hypersensitive sites most often represent promoter regions, where the DNA must be accessible for the DNA-binding transcriptional machinery to function.
ChIP-Chip and ChIP-Seq
Histone modification was first detected on a genome-wide level through the coupling of chromatin immunoprecipitation (ChIP) technology with DNA microarrays, termed ChIP-Chip. However, instead of isolating a DNA-binding transcription factor or enhancer protein through chromatin immunoprecipitation, the proteins of interest are the modified histones themselves.
① Histones are cross-linked to DNA in vivo through light chemical treatment.
② The cells are next lysed, allowing for the chromatin to be extracted and fragmented, either by sonication or treatment with a non-specific restriction enzyme.
③ Modification-specific antibodies are then used to immunoprecipitate the DNA-histone complexes.
④ Following immunoprecipitation, the DNA is purified from the histones, amplified via PCR and labeled with a fluorescent tag.
⑤ The final step involves hybridization of labeled DNA, both immunoprecipitated and non-immunoprecipitated, onto a microarray containing immobilized gDNA.
⑥ Analysis of the relative signal intensity allows the sites of histone modification to be determined.
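The relative signal analysis in step ⑥ can be sketched as a per-probe log2 ratio of immunoprecipitated to non-immunoprecipitated signal. The probe names, intensities and cutoff below are hypothetical, and the sketch omits the background correction and normalization that real array analyses require.

```python
import math

def enriched_probes(ip_signal, control_signal, min_log2_ratio=1.0):
    """Return probes whose log2(IP / control) ratio reaches the cutoff."""
    enriched = {}
    for probe, ip in ip_signal.items():
        ratio = math.log2(ip / control_signal[probe])
        if ratio >= min_log2_ratio:
            enriched[probe] = ratio
    return enriched

# Hypothetical fluorescence intensities per microarray probe.
ip_signal = {"probe_1": 800.0, "probe_2": 210.0, "probe_3": 1600.0}
control_signal = {"probe_1": 200.0, "probe_2": 200.0, "probe_3": 400.0}
sites = enriched_probes(ip_signal, control_signal)
print(sites)
```

Probes that pass the cutoff mark candidate sites of the histone modification targeted by the antibody.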
2) DNA methylation arrays
Techniques for characterizing primary DNA sequences could not be directly applied to methylation assays. For example, when DNA was amplified by PCR or bacterial cloning techniques, the methylation pattern was not copied and thus the information was lost. The DNA hybridization technique used in DNA assays, in which radioactive probes were used to map and identify DNA sequences, could not be used to distinguish between methylated and non-methylated DNA.
Non genome-wide approaches
The earliest methylation detection assays used methylation-sensitive restriction endonucleases. Genomic DNA was digested with both methylation-sensitive and methylation-insensitive restriction enzymes recognizing the same restriction site. The idea was that whenever the site was methylated, only the methylation-insensitive enzyme could cleave at that position. By comparing restriction fragment sizes generated by the methylation-sensitive enzyme to those generated by the methylation-insensitive enzyme, it was possible to determine the methylation pattern of the region. This analysis step was done by amplifying the restriction fragments via PCR, separating them through gel electrophoresis and analyzing them via Southern blot with probes for the region of interest. Different regions of the gene were known to be expressed at different stages of development. Consistent with a role of DNA methylation in gene repression, regions that were associated with high levels of DNA methylation were not actively expressed.
This method was limited and not suitable for studies of the global methylation pattern, or methylome. Even within specific loci it was not fully representative of the true methylation pattern, as only those restriction sites with corresponding methylation-sensitive and methylation-insensitive restriction assays could provide useful information. Further complications could arise when incomplete digestion of DNA by restriction enzymes generated false negative results.
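The comparison of methylation-sensitive and methylation-insensitive digests described above can be simulated on a toy sequence. The text does not name specific enzymes. As an illustrative assumption, the sketch uses the recognition site CCGG cut after the first C, as for the enzyme pair HpaII (blocked by methylation of the internal cytosine) and MspI (not blocked by it). The sequence and the methylated site are hypothetical.

```python
def digest(seq, site, cut_offset, methylated_sites=frozenset(), sensitive=False):
    """Fragment lengths after cutting a linear sequence at each recognition site.

    methylated_sites holds the start positions of sites that are methylated.
    A methylation-sensitive enzyme skips those sites.
    """
    cuts = []
    pos = seq.find(site)
    while pos != -1:
        if not (sensitive and pos in methylated_sites):
            cuts.append(pos + cut_offset)
        pos = seq.find(site, pos + 1)
    bounds = [0] + cuts + [len(seq)]
    return [end - start for start, end in zip(bounds, bounds[1:])]

# Hypothetical 30-base locus with CCGG sites at positions 4, 14 and 24.
locus = "AAAA" + "CCGG" + "TTTTTT" + "CCGG" + "AAAAAA" + "CCGG" + "TT"
methylated = {14}  # assume only the middle site is methylated

insensitive = digest(locus, "CCGG", 1, methylated, sensitive=False)
sensitive = digest(locus, "CCGG", 1, methylated, sensitive=True)
print(insensitive, sensitive)
```

The 20-base fragment that appears only in the methylation-sensitive digest indicates that the site inside it was protected, which is the signal the fragment-size comparison relies on.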
Genome-wide approaches
DNA methylation profiling on a large scale was first made possible through the Restriction Landmark Genome Scanning (RLGS) technique. Like the locus-specific DNA methylation assay, the technique identified methylated DNA via its digestion with methylation-sensitive enzymes. However, it was the use of two-dimensional gel electrophoresis that allowed methylation to be characterized on a broader scale. It was not until the advent of microarray and next-generation sequencing technology that truly high-resolution, genome-wide DNA methylation profiling became possible. As with RLGS, the endonuclease component is retained in the method, but it is coupled to new technologies. One such approach is differential methylation hybridization (DMH), in which one set of genomic DNA is digested with methylation-sensitive restriction enzymes and a parallel set of DNA is not digested. Both sets of DNA are subsequently amplified, each labelled with a fluorescent dye, and used in two-colour array hybridization. The level of DNA methylation at a given locus is determined by the relative intensity ratio of the two dyes. Adapting next-generation sequencing to DNA methylation assays provides several advantages over array hybridization. Sequence-based technology provides higher resolution for allele-specific DNA methylation, can be performed on larger genomes, and does not require the creation of DNA microarrays, which need adjustments based on CpG density to function properly.
PROTEOMICS