LECTURES

From Biolecture.org

 

BIOINFORMATICS


Bioinformatics is an interdisciplinary field that develops methods and software tools for understanding biological data. It combines computer science, statistics, mathematics, and engineering to analyze and interpret biological data, and is widely used for in silico analyses of biological queries using mathematical and statistical techniques.

## Analysis ##

1) Analysis of gene expression

The expression of many genes can be determined by measuring mRNA levels with multiple techniques including microarrays, expressed cDNA sequence tag (EST) sequencing, serial analysis of gene expression (SAGE) tag sequencing, massively parallel signature sequencing (MPSS), RNA-Seq, also known as "Whole Transcriptome Shotgun Sequencing" (WTSS), or various applications of multiplexed in-situ hybridization. All of these techniques are extremely noise-prone and/or subject to bias in the biological measurement, and a major research area in computational biology involves developing statistical tools to separate signal from noise in high-throughput gene expression studies. Such studies are often used to determine the genes implicated in a disorder: one might compare microarray data from cancerous epithelial cells to data from non-cancerous cells to determine the transcripts that are up-regulated and down-regulated in a particular population of cancer cells.
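As a minimal sketch of the up-/down-regulation comparison described above, the snippet below computes per-gene log2 fold changes between cancerous and non-cancerous expression values. The gene names and numbers are illustrative assumptions only; a real analysis would also need replicates and statistical testing to separate signal from noise.

```python
import math

def log2_fold_changes(case, control):
    """Per-gene log2 fold change between two conditions (mean expression)."""
    return {g: math.log2(case[g] / control[g]) for g in case}

def classify(fold_changes, threshold=1.0):
    """Split genes into up- and down-regulated by |log2 FC| >= threshold."""
    up = sorted(g for g, fc in fold_changes.items() if fc >= threshold)
    down = sorted(g for g, fc in fold_changes.items() if fc <= -threshold)
    return up, down

# Hypothetical mean expression values, for illustration only
cancer = {"MYC": 40.0, "TP53": 5.0, "ACTB": 20.0}
normal = {"MYC": 10.0, "TP53": 20.0, "ACTB": 21.0}

fc = log2_fold_changes(cancer, normal)
up, down = classify(fc)
print(up, down)   # ['MYC'] ['TP53']
```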

2) Analysis of protein expression

Protein microarrays and high-throughput (HT) mass spectrometry (MS) can provide a snapshot of the proteins present in a biological sample. Bioinformatics is very much involved in making sense of protein microarray and HT MS data; the former approach faces problems similar to those of microarrays targeted at mRNA, while the latter involves the problem of matching large amounts of mass data against predicted masses from protein sequence databases, as well as the complicated statistical analysis of samples in which multiple, but incomplete, peptides from each protein are detected.
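The mass-matching problem described above can be illustrated with a toy peptide mass fingerprinting sketch: predicted tryptic peptide masses are computed from a protein sequence and compared to observed masses within a tolerance. The sequence, observed masses, and tolerance are illustrative assumptions; real pipelines score matches statistically over whole databases.

```python
# Monoisotopic residue masses (Da) for the amino acids used in this toy example
RESIDUE = {"A": 71.03711, "G": 57.02146, "K": 128.09496, "L": 113.08406,
           "S": 87.03203, "V": 99.06841, "R": 156.10111}
WATER = 18.01056   # mass of H2O added to a free peptide

def tryptic_peptides(seq):
    """Cleave after K or R, except when the next residue is P (trypsin rule)."""
    peptides, start = [], 0
    for i, aa in enumerate(seq):
        if aa in "KR" and (i + 1 == len(seq) or seq[i + 1] != "P"):
            peptides.append(seq[start:i + 1])
            start = i + 1
    if start < len(seq):
        peptides.append(seq[start:])
    return peptides

def peptide_mass(pep):
    return sum(RESIDUE[aa] for aa in pep) + WATER

def match(observed, peptides, tol=0.01):
    """Map each observed mass to peptides whose predicted mass is within tol."""
    table = {p: peptide_mass(p) for p in peptides}
    return {m: [p for p, pm in table.items() if abs(pm - m) <= tol]
            for m in observed}

peps = tryptic_peptides("AGKLSVR")     # hypothetical protein sequence
hits = match([274.164, 500.0], peps)   # hypothetical observed masses
print(peps, hits)
```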

3) Analysis of regulation

Regulation is the complex orchestration of events starting with an extracellular signal such as a hormone and leading to an increase or decrease in the activity of one or more proteins. Bioinformatics techniques have been applied to explore various steps in this process. For example, promoter analysis involves the identification and study of sequence motifs in the DNA surrounding the coding region of a gene. These motifs influence the extent to which that region is transcribed into mRNA. Expression data can be used to infer gene regulation: one might compare microarray data from a wide variety of states of an organism to form hypotheses about the genes involved in each state. In a single-cell organism, one might compare stages of the cell cycle, along with various stress conditions. One can then apply clustering algorithms to that expression data to determine which genes are co-expressed. For example, the upstream regions (promoters) of co-expressed genes can be searched for over-represented regulatory elements. 
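A minimal illustration of finding co-expressed genes: the sketch below computes Pearson correlations between hypothetical expression profiles and reports gene pairs above a correlation threshold. Real pipelines use full clustering algorithms (e.g. hierarchical or k-means clustering) rather than pairwise thresholding; the profiles here are invented.

```python
import math

def pearson(x, y):
    """Pearson correlation coefficient between two equal-length profiles."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

def coexpressed_pairs(profiles, r_min=0.9):
    """Gene pairs whose expression profiles correlate above r_min."""
    genes = sorted(profiles)
    return [(g1, g2) for i, g1 in enumerate(genes) for g2 in genes[i + 1:]
            if pearson(profiles[g1], profiles[g2]) >= r_min]

# Hypothetical expression across four states (e.g. cell-cycle stages)
profiles = {
    "geneA": [1.0, 2.0, 3.0, 4.0],
    "geneB": [2.1, 4.0, 6.2, 8.1],   # tracks geneA -> candidate co-regulation
    "geneC": [4.0, 3.0, 2.0, 1.0],   # anti-correlated with geneA
}
pairs = coexpressed_pairs(profiles)
print(pairs)   # [('geneA', 'geneB')]
```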

 

GENOMICS


Genomics is a discipline in genetics that applies recombinant DNA, DNA sequencing methods, and bioinformatics to sequence, assemble, and analyze the function and structure of genomes. Advances in genomics have triggered a revolution in discovery-based research to understand even the most complex biological systems such as the brain. The field includes efforts to determine the entire DNA sequence of organisms and fine-scale genetic mapping. The field also includes studies of intragenomic phenomena such as heterosis, epistasis, pleiotropy and other interactions between loci and alleles within the genome. In contrast, the investigation of the roles and functions of single genes is a primary focus of molecular biology or genetics and is a common topic of modern medical and biological research. Research of single genes does not fall into the definition of genomics unless the aim of this genetic, pathway, and functional information analysis is to elucidate its effect on, place in, and response to the entire genome's networks.

## Analysis ##

1) Sequencing

Historically, sequencing was done in sequencing centers, centralized facilities (ranging from large independent institutions such as Joint Genome Institute which sequence dozens of terabases a year, to local molecular biology core facilities) which contain research laboratories with the costly instrumentation and technical support necessary. As sequencing technology continues to improve, however, a new generation of effective fast turnaround benchtop sequencers has come within reach of the average academic laboratory. On the whole, genome sequencing approaches fall into two broad categories, shotgun and high-throughput (aka next-generation) sequencing.

2) Assembly

Sequence assembly refers to aligning and merging fragments of a much longer DNA sequence in order to reconstruct the original sequence. This is needed as current DNA sequencing technology cannot read whole genomes as a continuous sequence, but rather reads small pieces of between 20 and 1000 bases, depending on the technology used. Typically the short fragments, called reads, result from shotgun sequencing genomic DNA, or gene transcripts.
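The fragment-merging idea can be sketched with a toy greedy assembler that repeatedly merges the pair of reads with the longest suffix-prefix overlap. Real assemblers use overlap graphs or de Bruijn graphs and must cope with sequencing errors and repeats; the reads below are illustrative.

```python
def overlap(a, b, min_len=3):
    """Length of the longest suffix of `a` matching a prefix of `b`."""
    for k in range(min(len(a), len(b)), min_len - 1, -1):
        if a.endswith(b[:k]):
            return k
    return 0

def greedy_assemble(reads, min_len=3):
    """Repeatedly merge the pair of reads with the largest overlap."""
    reads = list(reads)
    while len(reads) > 1:
        best_len, best_a, best_b = 0, None, None
        for a in reads:
            for b in reads:
                if a is not b:
                    k = overlap(a, b, min_len)
                    if k > best_len:
                        best_len, best_a, best_b = k, a, b
        if best_len == 0:
            break                       # no overlaps left; stop merging
        reads.remove(best_a)
        reads.remove(best_b)
        reads.append(best_a + best_b[best_len:])
    return reads

# Toy reads shotgun-sampled from the sequence ATGGCGTGCAAT
contigs = greedy_assemble(["ATGGCGT", "GCGTGCA", "TGCAAT"])
print(contigs)   # ['ATGGCGTGCAAT']
```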

3) Annotation

The DNA sequence assembly alone is of little value without additional analysis. Genome annotation is the process of attaching biological information to sequences, and consists of three main steps: ① identifying portions of the genome that do not code for proteins, ② identifying elements on the genome, a process called gene prediction, and ③ attaching biological information to these elements. 

Automatic annotation tools try to perform these steps in silico, as opposed to manual annotation, which involves human expertise and potential experimental verification. Ideally, these approaches co-exist and complement each other in the same annotation pipeline. Traditionally, the basic level of annotation uses BLAST to find similarities, with genomes then annotated on the basis of homologues. More recently, additional information has been added to annotation platforms; it allows manual annotators to deconvolute discrepancies between genes that are given the same annotation. Some databases use genome context information, similarity scores, experimental data, and integrations of other resources to provide genome annotations through their Subsystems approach. Other databases rely on both curated data sources and a range of software tools in their automated genome annotation pipelines. Structural annotation consists of the identification of genomic elements, primarily ORFs and their localisation, or gene structure. Functional annotation consists of attaching biological information to genomic elements.
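The first task of structural annotation, finding candidate ORFs, can be sketched as a simple scan for ATG...stop codon stretches on the forward strand. Real gene predictors also handle the reverse strand, introns, and statistical gene models; the sequence and minimum length below are illustrative.

```python
STOP = {"TAA", "TAG", "TGA"}

def find_orfs(seq, min_codons=2):
    """Scan the forward strand for ATG...stop open reading frames."""
    orfs = []
    for frame in range(3):
        i = frame
        while i + 3 <= len(seq):
            if seq[i:i + 3] == "ATG":
                for j in range(i + 3, len(seq) - 2, 3):
                    if seq[j:j + 3] in STOP:
                        if (j - i) // 3 >= min_codons:
                            orfs.append((i, seq[i:j + 3]))
                        i = j          # resume scanning after this ORF
                        break
            i += 3
    return orfs

# Toy sequence: one ORF (ATG AAA TGC TAA) starting at position 3
orfs = find_orfs("GGGATGAAATGCTAA")
print(orfs)   # [(3, 'ATGAAATGCTAA')]
```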

 

TRANSCRIPTOMICS 


Transcriptomics is the study of the transcriptome - the complete set of RNA transcripts that are produced under specific circumstances in one cell or population of cells - using high-throughput methods such as microarray analysis.

## Analysis ##

A number of organism-specific transcriptome databases have been constructed and annotated to aid in the identification of genes that are differentially expressed in distinct cell populations. RNA-Seq is emerging as the method of choice for measuring transcriptomes of organisms, though the older technique of DNA microarrays is still used.

1) RNA-Seq

RNA-Seq (RNA sequencing), also called whole transcriptome shotgun sequencing, uses next-generation sequencing to reveal the presence and quantity of RNA in a biological sample at a given moment in time. RNA-Seq is used to analyze the continually changing cellular transcriptome. Specifically, RNA-Seq makes it possible to examine alternatively spliced transcripts, post-transcriptional modifications, gene fusions, mutations/SNPs and changes in gene expression. In addition to mRNA transcripts, RNA-Seq can examine different populations of RNA, including total RNA and small RNA such as miRNA and tRNA, as well as ribosomal profiling. RNA-Seq can also be used to determine exon/intron boundaries and verify or amend previously annotated 5’ and 3’ gene boundaries. Prior to RNA-Seq, gene expression studies were done with hybridization-based microarrays. Issues with microarrays include cross-hybridization artifacts, poor quantification of lowly and highly expressed genes, and the need for prior knowledge of the sequences being probed. Because of these technical issues, transcriptomics transitioned to sequencing-based methods. These progressed from Sanger sequencing of expressed sequence tag (EST) libraries, to chemical tag-based methods, and finally to the current technology, NGS of cDNA. 
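A common first step in RNA-Seq quantification is converting raw read counts into length-normalized units such as transcripts per million (TPM), since longer genes accumulate more reads at the same expression level. A minimal sketch, with hypothetical counts and gene lengths:

```python
def tpm(counts, lengths_bp):
    """Convert raw read counts to transcripts per million (TPM):
    1) divide each count by gene length in kilobases (reads per kilobase),
    2) scale so the values sum to one million."""
    rpk = {g: counts[g] / (lengths_bp[g] / 1000.0) for g in counts}
    per_million = sum(rpk.values()) / 1_000_000
    return {g: v / per_million for g, v in rpk.items()}

# Hypothetical counts and gene lengths, for illustration only
counts = {"geneA": 100, "geneB": 200, "geneC": 100}
lengths = {"geneA": 1000, "geneB": 2000, "geneC": 500}

result = tpm(counts, lengths)
print({g: round(v) for g, v in result.items()})
```

Note that geneB has twice the reads of geneA but the same TPM, because it is twice as long.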

2) DNA microarray

A DNA microarray is a collection of microscopic DNA spots attached to a solid surface. Scientists use DNA microarrays to measure the expression levels of large numbers of genes simultaneously or to genotype multiple regions of a genome. Each DNA spot contains picomoles of a specific DNA sequence, known as probes. These can be a short section of a gene or other DNA element used to hybridize a cDNA or cRNA (also called anti-sense RNA) sample under high-stringency conditions. Probe-target hybridization is usually detected and quantified by detection of fluorophore-, silver-, or chemiluminescence-labeled targets to determine the relative abundance of nucleic acid sequences in the target.
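Two-channel microarray analysis typically works with per-probe log2 intensity ratios, often followed by a global normalization step. A minimal sketch with hypothetical background-corrected intensities (the probe names, values, and median-centering normalization are illustrative assumptions):

```python
import math

def log2_ratios(cy5, cy3):
    """Per-probe log2(Cy5/Cy3) intensity ratios."""
    return {p: math.log2(cy5[p] / cy3[p]) for p in cy5}

def median_center(ratios):
    """Global normalization: subtract the median log-ratio, assuming most
    genes are unchanged so the typical probe should show a ratio of zero."""
    vals = sorted(ratios.values())
    n = len(vals)
    med = vals[n // 2] if n % 2 else (vals[n // 2 - 1] + vals[n // 2]) / 2
    return {p: r - med for p, r in ratios.items()}

# Hypothetical background-corrected intensities; Cy5 = sample, Cy3 = reference.
# A global dye bias inflates every Cy5 value here; median-centering removes it.
cy5 = {"probe1": 1600.0, "probe2": 400.0, "probe3": 800.0}
cy3 = {"probe1": 200.0, "probe2": 800.0, "probe3": 400.0}

ratios = log2_ratios(cy5, cy3)      # probe1: 3, probe2: -1, probe3: 1
centered = median_center(ratios)    # probe1: 2, probe2: -2, probe3: 0
print(centered)
```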

 

EPIGENOMICS


Epigenomics is the study of the complete set of epigenetic modifications on the genetic material of a cell, known as the epigenome.

{ Epigenetic modifications = genomic modifications that alter gene expression, cannot be attributed to changes in the primary DNA sequence, and are heritable mitotically and meiotically }

## Two major types of epigenomic modifications ##

1) DNA methylation

DNA methylation is the process by which a methyl group is added to DNA by DNA methyltransferases (DNMTs), the enzymes responsible for catalyzing this reaction. In eukaryotes, methylation is most commonly found on the carbon 5 position of cytosine residues (5mC) adjacent to guanine. DNA methylation patterns vary greatly between species and even within the same organism.
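Because methylation in eukaryotes targets cytosines adjacent to guanines (CpG dinucleotides), a common computational companion is scanning sequence for CpG-dense regions. The sketch below computes the observed/expected CpG ratio used (together with G+C content and length) in classic CpG island criteria; the sequence is a toy example, far shorter than a real island.

```python
def cpg_obs_exp(seq):
    """Observed/expected CpG ratio: (CpG count * length) / (C count * G count).
    In classic CpG island criteria, a ratio above ~0.6 together with high
    G+C content over a few hundred bases suggests an island."""
    n = len(seq)
    c, g = seq.count("C"), seq.count("G")
    cpg = sum(1 for i in range(n - 1) if seq[i:i + 2] == "CG")
    return (cpg * n) / (c * g) if c and g else 0.0

# Illustrative toy sequence (4 C's, 4 G's, 4 CpG dinucleotides, length 10)
ratio = cpg_obs_exp("CGCGTACGCG")
print(round(ratio, 2))
```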

2) Histone modification

In eukaryotes, genomic DNA is coiled into protein-DNA complexes called chromatin. Histones, the most prevalent type of protein found in chromatin, function to condense the DNA; the net positive charge on histones facilitates their bonding with DNA, which is negatively charged. The basic, repeating units of chromatin, nucleosomes, consist of an octamer of histone proteins. Many different types of histone modification are known, including acetylation, methylation, phosphorylation, and ubiquitination. The DNA region where a histone modification occurs can elicit different effects. Histone modifications regulate gene expression by two mechanisms: by disrupting the contact between nucleosomes and by recruiting chromatin-remodeling ATPases.

## Epigenomic methods ##

1) Histone modification assay

The cellular processes of transcription, DNA replication and DNA repair involve the interaction between genomic DNA and nuclear proteins. It had been known that certain regions within chromatin were extremely susceptible to DNase I digestion, which cleaves DNA with low sequence specificity. Such hypersensitive sites were thought to be transcriptionally active regions, as evidenced by their association with RNA polymerase and topoisomerases I and II. It is now known that these DNase I-sensitive regions correspond to regions of chromatin with loose DNA-histone association. Hypersensitive sites most often represent promoter regions, which require the DNA to be accessible for the DNA-binding transcriptional machinery to function.

ChIP-Chip and ChIP-Seq

Histone modification was first detected on a genome-wide level through the coupling of chromatin immunoprecipitation (ChIP) technology with DNA microarrays, termed ChIP-Chip. However, instead of isolating a DNA-binding transcription factor or enhancer protein through chromatin immunoprecipitation, the proteins of interest are the modified histones themselves. 

① Histones are cross-linked to DNA in vivo through light chemical treatment.

② The cells are next lysed, allowing for the chromatin to be extracted and fragmented, either by sonication or treatment with a non-specific restriction enzyme.

③ Modification-specific antibodies, in turn, are used to immunoprecipitate the DNA-histone complexes. 

④ Following immunoprecipitation, the DNA is purified from the histones, amplified via PCR and labeled with a fluorescent tag.

⑤ The final step involves hybridization of the labeled DNA, both immunoprecipitated and non-immunoprecipitated, onto a microarray containing immobilized genomic DNA.

⑥ Analysis of the relative signal intensity allows the sites of histone modification to be determined. 
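Step ⑥ can be sketched as computing per-probe log2(IP/input) ratios and calling sites above an enrichment threshold. The probe names, intensities and cutoff below are illustrative assumptions; real ChIP analysis uses dedicated peak-calling statistics.

```python
import math

def enriched_sites(ip, input_, min_log2=1.0):
    """Call probes carrying the histone mark when log2(IP/input) >= min_log2."""
    return sorted(p for p in ip if math.log2(ip[p] / input_[p]) >= min_log2)

# Hypothetical fluorescence intensities for three probes:
# the immunoprecipitated (IP) channel vs. the non-immunoprecipitated input
ip_signal = {"promoterA": 900.0, "promoterB": 210.0, "geneBodyC": 190.0}
input_signal = {"promoterA": 200.0, "promoterB": 200.0, "geneBodyC": 180.0}

sites = enriched_sites(ip_signal, input_signal)
print(sites)   # ['promoterA']
```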

2) DNA methylation arrays

Techniques for characterizing primary DNA sequences could not be directly applied to methylation assays. For example, when DNA was amplified by PCR or bacterial cloning techniques, the methylation pattern was not copied, and thus the information was lost. The DNA hybridization technique used in DNA assays, in which radioactive probes were used to map and identify DNA sequences, could not distinguish between methylated and non-methylated DNA.

Non-genome-wide approaches

The earliest methylation detection assays used methylation-sensitive restriction endonucleases. Genomic DNA was digested with both methylation-sensitive and -insensitive restriction enzymes recognizing the same restriction site; whenever the site was methylated, only the methylation-insensitive enzyme could cleave at that position. By comparing the restriction fragment sizes generated by the methylation-sensitive enzyme to those of the methylation-insensitive enzyme, it was possible to determine the methylation pattern of the region. This analysis step was done by amplifying the restriction fragments via PCR, separating them by gel electrophoresis and analyzing them via Southern blot with probes for the region of interest. Different regions of a gene were known to be expressed at different stages of development. Consistent with a role for DNA methylation in gene repression, regions associated with high levels of DNA methylation were not actively expressed.
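The fragment-size comparison described above can be simulated in silico: a methylation-insensitive digest cuts every recognition site, a methylation-sensitive digest skips methylated sites, and the differing fragment patterns reveal the methylation. The recognition sequence, cut position, and toy locus below are simplifications for illustration (real enzymes cut at defined positions within their sites).

```python
def digest(seq, site="CCGG", blocked=frozenset()):
    """Return fragment lengths after cutting at each occurrence of `site`.

    Positions in `blocked` model methylated sites that a methylation-
    sensitive enzyme cannot cleave. Cutting at the start of the site is a
    simplification of real enzyme behaviour."""
    cuts, i = [], seq.find(site)
    while i != -1:
        if i not in blocked:
            cuts.append(i)
        i = seq.find(site, i + 1)
    frags, prev = [], 0
    for c in cuts:
        frags.append(c - prev)
        prev = c
    frags.append(len(seq) - prev)
    return frags

# Toy locus with CCGG sites at positions 4 and 12; the site at 12 is methylated
seq = "AAAACCGGTTTTCCGGAAAA"
insensitive = digest(seq)              # cuts both sites -> fragments 4, 8, 8
sensitive = digest(seq, blocked={12})  # skips the methylated site -> 4, 16
print(insensitive, sensitive)
```

The fragment that stays uncut in the sensitive digest (16 vs. 8 + 8) pinpoints the methylated site.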

This method was limited and not suitable for studies of the global methylation pattern, or methylome. Even within specific loci it was not fully representative of the true methylation pattern, as only those restriction sites with corresponding methylation-sensitive and -insensitive enzymes could provide useful information. Further complications could arise when incomplete digestion of DNA by the restriction enzymes generated false-negative results.

Genome-wide approaches

DNA methylation profiling on a large scale was first made possible by the restriction landmark genomic scanning (RLGS) technique. Like the locus-specific DNA methylation assay, the technique identified methylated DNA via digestion with methylation-sensitive enzymes; it was the use of two-dimensional gel electrophoresis that allowed methylation to be characterized on a broader scale. However, it was not until the advent of microarray and next-generation sequencing technology that truly high-resolution, genome-wide DNA methylation profiling became possible. As with RLGS, the endonuclease component is retained in these methods but is coupled to the new technologies. One such approach is differential methylation hybridization (DMH), in which one set of genomic DNA is digested with methylation-sensitive restriction enzymes and a parallel set of DNA is not digested. Both sets of DNA are subsequently amplified, each labelled with fluorescent dyes, and used in two-colour array hybridization. The level of DNA methylation at a given locus is determined by the relative intensity ratio of the two dyes. Adaptation of next-generation sequencing to DNA methylation assays provides several advantages over array hybridization. Sequence-based technology provides higher resolution of allele-specific DNA methylation, can be performed on larger genomes, and does not require the creation of DNA microarrays, which require adjustments based on CpG density to function properly.

 

PROTEOMICS


Proteomics is the large-scale study of proteins, particularly their structures and functions. Proteins are vital parts of living organisms, as they are the main components of the physiological metabolic pathways of cells. The term proteomics was first coined in 1997 to make an analogy with genomics, the study of the genome. The word proteome is a portmanteau of protein and genome.

## Analysis ##

1) Protein detection with antibodies (immunoassays)

Antibodies to particular proteins or to their modified forms have been used in biochemistry and cell biology studies. These are among the most common tools used by molecular biologists today. There are several specific techniques and protocols that use antibodies for protein detection. The enzyme-linked immunosorbent assay (ELISA) has been used for decades to detect and quantitatively measure proteins in samples. The Western blot can be used for detection and quantification of individual proteins, where in an initial step a complex protein mixture is separated using SDS-PAGE and then the protein of interest is identified using an antibody. Modified proteins can be studied by developing an antibody specific to that modification. For example, there are antibodies that only recognize certain proteins when they are tyrosine-phosphorylated, known as phospho-specific antibodies. Also, there are antibodies specific to other modifications. These can be used to determine the set of proteins that have undergone the modification of interest.

2) Antibody-free protein detection

While protein detection with antibodies is still very common in molecular biology, other methods have also been developed that do not rely on an antibody. These methods offer various advantages: for instance, they are often able to determine the sequence of a protein or peptide, they may have higher throughput than antibody-based methods, and they can sometimes identify and quantify proteins for which no antibody exists.

Detection methods

One of the earliest methods for protein analysis was Edman degradation, in which a single peptide is subjected to multiple steps of chemical degradation to resolve its sequence. This approach has mostly been supplanted by technologies that offer higher throughput. More recent methods use mass spectrometry-based techniques, a development made possible by the discovery of "soft ionization" methods such as matrix-assisted laser desorption/ionization (MALDI) and electrospray ionization (ESI) developed in the 1980s. These methods gave rise to the top-down and bottom-up proteomics workflows, in which additional separation is often performed before analysis.

Separation methods

For the analysis of complex biological samples, a reduction of sample complexity is required. This can be performed off-line by one-dimensional or two-dimensional separation. More recently, on-line methods have been developed in which individual peptides (in bottom-up proteomics approaches) are separated using reversed-phase chromatography and then directly ionized using ESI; the direct coupling of separation and analysis explains the term "on-line" analysis.

3) High-throughput proteomic technologies

Proteomics has steadily gained momentum over the past decade with the evolution of several approaches. A few of these are new; others build on traditional methods. Mass spectrometry-based methods and microarrays are the most common technologies for the large-scale study of proteins.

Reverse-phased protein microarrays

This is a promising newer microarray application for the diagnosis, study and treatment of complex diseases such as cancer. The technology merges laser capture microdissection (LCM) with microarray technology to produce reverse-phase protein microarrays. In this type of microarray, the collection of proteins themselves is immobilized, with the intent of capturing various stages of disease within an individual patient. When used with LCM, reverse-phase arrays can monitor the fluctuating state of the proteome among different cell populations within a small area of human tissue. This is useful for profiling the status of cellular signaling molecules in a cross-section of tissue that includes both normal and cancerous cells. This approach has been used to monitor the status of key factors in normal prostate epithelium and invasive prostate cancer tissues: LCM dissects the tissues, and the protein lysates are arrayed onto nitrocellulose slides, which are probed with specific antibodies. This method can track many kinds of molecular events and can compare diseased and healthy tissues within the same patient, enabling the development of treatment strategies and diagnoses. The ability to acquire proteomic snapshots of neighboring cell populations using reverse-phase microarrays in conjunction with LCM has a number of applications beyond the study of tumors. The approach can provide insights into the normal physiology and pathology of all tissues and is invaluable for characterizing developmental processes and anomalies.