Gene finding typically refers to the area of computational biology that is concerned with algorithmically identifying stretches of sequence, usually genomic DNA, that are biologically functional. This especially includes protein-coding genes, but may also include other functional elements such as RNA genes and regulatory regions. Gene finding is one of the first and most important steps in understanding the genome of a species once it has been sequenced.
In its earliest days, "gene finding" was based on painstaking experimentation on living cells and organisms. Statistical analysis of the rates of homologous recombination of several different genes could determine their order on a certain chromosome, and information from many such experiments could be combined to create a genetic map specifying the rough location of known genes relative to each other. Today, with comprehensive genome sequence and powerful computational resources at the disposal of the research community, gene finding has been redefined as a largely computational problem.
Determining that a sequence "is functional" should be distinguished from determining "the function" of the gene or its product. The latter still demands "in vivo" experimentation through gene knockout and other assays, although frontiers of bioinformatics research are making it increasingly possible to predict the function of a gene based on its sequence alone.
In extrinsic (or evidence-based) gene finding systems, the target genome is searched for sequences that are similar to extrinsic evidence in the form of the known sequence of a messenger RNA (mRNA) or protein product. Given an mRNA sequence, it is trivial to derive a unique genomic DNA sequence from which it must have been transcribed. Given a protein sequence, a family of possible coding DNA sequences can be derived by reverse translation of the genetic code. Once candidate DNA sequences have been determined, it is a relatively straightforward algorithmic problem to efficiently search a target genome for matches, complete or partial, and exact or inexact. BLAST is a widely used system designed for this purpose.
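The reverse-translation step described above can be illustrated with a minimal Python sketch. It assumes the standard genetic code; the function names (`count_coding_sequences`, `reverse_translate`, `translate`) are hypothetical and chosen only for this illustration. Because most amino acids are encoded by several codons, the family of candidate DNA sequences grows multiplicatively with protein length, which is why practical tools tend to search with the protein directly rather than enumerate every candidate.

```python
from itertools import product
from math import prod

# Standard genetic code, codons ordered with T, C, A, G at each position.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}

# Invert the table: amino acid -> list of codons that encode it.
REVERSE = {}
for codon, aa in CODON_TABLE.items():
    REVERSE.setdefault(aa, []).append(codon)


def count_coding_sequences(protein):
    """Number of distinct DNA sequences that encode the given protein."""
    return prod(len(REVERSE[aa]) for aa in protein)


def reverse_translate(protein):
    """Lazily yield every DNA sequence that encodes the given protein."""
    for codons in product(*(REVERSE[aa] for aa in protein)):
        yield "".join(codons)


def translate(dna):
    """Translate a DNA string codon by codon (no frame or stop handling)."""
    return "".join(CODON_TABLE[dna[i:i + 3]] for i in range(0, len(dna) - 2, 3))
```

For the three-residue peptide `MKL`, for instance, `count_coding_sequences("MKL")` multiplies one codon for methionine, two for lysine and six for leucine.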
A high degree of similarity to a known messenger RNA or protein product is strong evidence that a region of a target genome is a protein-coding gene. However, applying this approach systematically requires extensive sequencing of mRNA and protein products. Not only is this expensive, but in complex organisms only a subset of the genes in the genome is expressed at any given time, meaning that extrinsic evidence for many genes is not readily accessible in any single cell culture. Thus, in order to collect extrinsic evidence for most or all of the genes in a complex organism, many hundreds or thousands of different cell types must be studied, which itself presents further difficulties. For example, some human genes may be expressed only during development as an embryo or fetus, which might be difficult to study for ethical reasons.
Despite these difficulties, extensive transcript and protein sequence databases have been generated for human as well as other important model organisms in biology, such as mice and yeast. For example, the RefSeq database contains transcript and protein sequences from many different species, and the Ensembl system comprehensively maps this evidence to human and several other genomes. It is, however, likely that these databases are both incomplete and contain small but significant amounts of erroneous data.
"Ab Initio" Approaches
Because of the inherent expense and difficulty in obtaining extrinsic evidence for many genes, it is also necessary to resort to "ab initio" gene finding, in which genomic DNA sequence alone is systematically searched for certain tell-tale signs of protein-coding genes. These signs can be broadly categorized as either "signals", specific sequences that indicate the presence of a gene nearby, or "content", statistical properties of protein-coding sequence itself. "Ab initio" gene finding might be more accurately characterized as gene "prediction", since extrinsic evidence is generally required to conclusively establish that a putative gene is functional.
In the genomes of prokaryotes, genes have specific and relatively well-understood promoter sequences (signals), such as the Pribnow box and transcription factor binding sites, which are easy to systematically identify. Also, the sequence coding for a protein occurs as one contiguous open reading frame (ORF), which is typically many hundreds or thousands of base pairs long. The statistics of stop codons are such that even finding an open reading frame of this length is a fairly informative sign. (Since 3 of the 64 possible codons in the genetic code are stop codons, one would expect a stop codon approximately every 20-25 codons, or 60-75 base pairs, in a random sequence.) Furthermore, protein-coding DNA has certain periodicities and other statistical properties that are easy to detect in sequence of this length. These characteristics make prokaryotic gene finding relatively straightforward, and well-designed systems are able to achieve high levels of accuracy.
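The stop-codon argument and the idea of scanning for long open reading frames can be sketched as follows. This is a simplified illustration, not the method of any particular gene finder: it assumes the standard stop codons, treats ATG as the only start codon, scans only the forward strand, and uses hypothetical names (`expected_codons_between_stops`, `find_orfs`) and an arbitrary default threshold of 100 codons.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}


def expected_codons_between_stops():
    """Mean spacing of stop codons if all 64 codons were equally likely."""
    return 64 / len(STOP_CODONS)


def find_orfs(dna, min_codons=100):
    """Return (start, end, frame) for forward-strand ORFs.

    An ORF here runs from the first ATG after the previous stop codon to
    the next in-frame stop codon; `end` is exclusive and includes the stop.
    `min_codons` counts the codons before the stop, start codon included.
    """
    dna = dna.upper()
    orfs = []
    for frame in range(3):
        start = None
        for i in range(frame, len(dna) - 2, 3):
            codon = dna[i:i + 3]
            if start is None and codon == "ATG":
                start = i  # open a candidate ORF
            elif start is not None and codon in STOP_CODONS:
                if (i - start) // 3 >= min_codons:
                    orfs.append((start, i + 3, frame))
                start = None  # close it and look for the next start
    return orfs
```

Under the uniform-codon assumption the expected spacing is 64/3, or roughly 21 codons, so an uninterrupted frame of several hundred codons is unlikely to arise by chance. A fuller scan would also examine the reverse complement and alternative start codons.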
"Ab initio" gene finding in eukaryotes, especially complex organisms like humans, is considerably more challenging for several reasons. First, the promoter and other regulatory signals in these genomes are more complex and less well-understood than in prokaryotes, making them more difficult to reliably recognize. Two classic examples of signals identified by eukaryotic gene finders are CpG islands and binding sites for a poly(A) tail. Second, splicing mechanisms employed by eukaryotic cells mean that a particular protein-coding sequence in the genome is divided into several parts (exons), separated by non-coding sequences (introns). (Splice sites are themselves another signal that eukaryotic gene finders are often designed to identify.) A typical protein-coding gene in humans might be divided into a dozen exons, each less than two hundred base pairs in length, and some as short as twenty to thirty. It is therefore much more difficult to detect periodicities and other known content properties of protein-coding DNA in eukaryotes.
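As an illustration of how one of the signals mentioned above might be scored, the sketch below computes the GC content and the observed/expected CpG ratio of a sequence window. The thresholds (GC content above 50% and an observed/expected ratio above 0.6) are commonly cited criteria for CpG islands and are used here as assumptions. The function names are hypothetical, and a real detector would also enforce a minimum island length and slide the window along the genome.

```python
def cpg_stats(window):
    """Return (gc_content, observed_over_expected_cpg) for a DNA window."""
    window = window.upper()
    n = len(window)
    c = window.count("C")
    g = window.count("G")
    cpg = window.count("CG")  # occurrences of the CpG dinucleotide
    gc_content = (c + g) / n if n else 0.0
    # Expected CpG count under independence is (c * g) / n.
    obs_exp = (cpg * n) / (c * g) if c and g else 0.0
    return gc_content, obs_exp


def looks_like_cpg_island(window, min_gc=0.5, min_obs_exp=0.6):
    """Apply the assumed thresholds to a single window."""
    gc_content, obs_exp = cpg_stats(window)
    return gc_content > min_gc and obs_exp > min_obs_exp
```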
Advanced gene finders for both prokaryotic and eukaryotic genomes typically use complex probabilistic models, such as Hidden Markov Models, in order to combine information from a variety of different signal and content measurements. The GLIMMER system is a widely used and highly accurate gene finder for prokaryotes. GeneMark is another popular approach. Eukaryotic "ab initio" gene finders, by comparison, have achieved only limited success; notable examples are the GENSCAN and geneid programs. A few programs like CONTRAST also use machine learning approaches like support vector machines for successful gene prediction.
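To give a flavour of how a Hidden Markov Model labels a sequence, the following toy sketch runs the Viterbi algorithm over a two-state model ("noncoding" and "coding"). Every probability in it is invented for illustration and does not come from any of the programs named above, whose models are far richer (codon position, splice sites, exon length distributions and so on).

```python
import math

# Hypothetical toy parameters: the coding state is GC-rich, the
# noncoding state is AT-rich, and both states tend to persist.
STATES = ("noncoding", "coding")
START = {"noncoding": 0.5, "coding": 0.5}
TRANS = {
    "noncoding": {"noncoding": 0.9, "coding": 0.1},
    "coding": {"noncoding": 0.1, "coding": 0.9},
}
EMIT = {
    "noncoding": {"A": 0.3, "C": 0.2, "G": 0.2, "T": 0.3},
    "coding": {"A": 0.2, "C": 0.3, "G": 0.3, "T": 0.2},
}


def viterbi(seq):
    """Return the most probable state path for seq (log-space Viterbi)."""
    if not seq:
        return []
    score = {s: math.log(START[s]) + math.log(EMIT[s][seq[0]]) for s in STATES}
    back = []  # back[t][s] = best predecessor of state s at position t + 1
    for base in seq[1:]:
        new_score, pointers = {}, {}
        for s in STATES:
            prev = max(STATES, key=lambda p: score[p] + math.log(TRANS[p][s]))
            new_score[s] = (score[prev] + math.log(TRANS[prev][s])
                            + math.log(EMIT[s][base]))
            pointers[s] = prev
        score = new_score
        back.append(pointers)
    # Trace back from the best final state.
    state = max(STATES, key=score.get)
    path = [state]
    for pointers in reversed(back):
        state = pointers[state]
        path.append(state)
    path.reverse()
    return path
```

With these made-up parameters, an AT-rich stretch followed by a GC-rich stretch is segmented at the boundary, because the one-off cost of switching states is outweighed by the better emission fit on each side.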
Among the derived signals used for prediction are sub-sequence statistics such as k-mer statistics, the Fourier transform of a pseudo-number-coded DNA sequence, Z-curve parameters and certain run features. [Saeys Y, Rouzé P, Van de Peer Y (2007). "In search of the small ones: improved prediction of short exons in vertebrates, plants, fungi and protists". Bioinformatics 23 (4): 414–420. doi:10.1093/bioinformatics/btl639. PMID 17204465. http://bioinformatics.oxfordjournals.org/cgi/content/abstract/23/4/414]
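Two of these derived signals can be sketched briefly in Python. The first function counts overlapping k-mers. The second encodes the sequence as four binary indicator sequences (one per base, which is one possible "pseudo-number" coding and an assumption here) and sums their spectral power at the frequency corresponding to a period of three bases, the periodicity associated with codon structure. Function names are hypothetical.

```python
import cmath
from collections import Counter


def kmer_counts(seq, k):
    """Count every overlapping k-mer in seq."""
    seq = seq.upper()
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))


def period3_power(seq):
    """Total spectral power at period 3 over the four base indicators.

    For each base b, the indicator is 1 where seq[n] == b and 0 elsewhere;
    its Fourier component at frequency 1/3 is the sum of exp(-2*pi*i*n/3)
    over the positions n where b occurs.
    """
    seq = seq.upper()
    total = 0.0
    for base in "ACGT":
        component = sum(
            cmath.exp(-2j * cmath.pi * n / 3)
            for n, x in enumerate(seq)
            if x == base
        )
        total += abs(component) ** 2
    return total
```

A perfectly periodic sequence such as a repeated codon gives a large value, whereas a homopolymer whose length is a multiple of three gives a value near zero. In practice the power would be computed in sliding windows and compared against a background level.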
It has been suggested that signals other than those directly detectable in sequences may improve gene prediction. For example, the role of secondary structure in the identification of regulatory motifs has been reported. [Hiller M, Pudimat R, Busch A, Backofen R (2006). "Using RNA secondary structures to guide sequence motif finding towards single-stranded regions". Nucleic Acids Res 34 (17): e117. doi:10.1093/nar/gkl544. PMID 16987907.] In addition, it has been suggested that RNA secondary structure prediction helps splice site prediction. [Patterson DJ, Yasuhara K, Ruzzo WL (2002). "Pre-mRNA secondary structure prediction aids splice site prediction". Pac Symp Biocomput: 223–234. PMID 11928478.] [Marashi SA, Goodarzi H, Sadeghi M, Eslahchi C, Pezeshk H (2006). "Importance of RNA secondary structure information for yeast donor and acceptor splice site predictions by neural networks". Comput Biol Chem 30 (1): 50–57. doi:10.1016/j.compbiolchem.2005.10.009. PMID 16386465.] [Marashi SA, Eslahchi C, Pezeshk H, Sadeghi M (2006). "Impact of RNA structure on the prediction of donor and acceptor splice sites". BMC Bioinformatics 7: 297. doi:10.1186/1471-2105-7-297. PMID 16772025.] [Rogic S (2006). "The role of pre-mRNA secondary structure in gene splicing in Saccharomyces cerevisiae". PhD Dissertation, University of British Columbia. http://www.cs.ubc.ca/grads/resources/thesis/Nov06/Rogic_Sanja.pdf]
Comparative Genomics Approaches
As the entire genomes of many different species are sequenced, a promising direction in current research on gene finding is a comparative genomics approach. This is based on the principle that the forces of natural selection cause genes and other functional elements to undergo mutation at a slower rate than the rest of the genome, since mutations in functional elements are more likely to negatively impact the organism than mutations elsewhere. Genes can thus be detected by comparing the genomes of related species to detect this evolutionary pressure for conservation. This approach was first applied to the mouse and human genomes, using programs such as SLAM, SGP and Twinscan/N-SCAN.
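The underlying idea of looking for conserved regions can be illustrated with a deliberately simple sketch: given two sequences that have already been aligned (equal length, with "-" marking gaps), it reports the fraction of identical positions in each window. This is only an assumption-laden toy with a hypothetical function name. The programs named above use probabilistic models of sequence evolution rather than raw percent identity.

```python
def window_identity(aln_a, aln_b, window=10, step=1):
    """Return (start, identity) for each window of a pairwise alignment.

    Identity is the fraction of columns in the window where both sequences
    carry the same base; columns containing a gap never count as matches.
    """
    if len(aln_a) != len(aln_b):
        raise ValueError("aligned sequences must have equal length")
    results = []
    for start in range(0, len(aln_a) - window + 1, step):
        cols = zip(aln_a[start:start + window], aln_b[start:start + window])
        matches = sum(1 for x, y in cols if x == y and x != "-")
        results.append((start, matches / window))
    return results
```

Windows whose identity stays well above the background level of the alignment would then be candidates for conserved, and possibly functional, elements.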
Comparative gene finding can also be used to project high quality annotations from one genome to another. Notable examples include Projector, GeneWise and GeneMapper. Such techniques now play a central role in the annotation of all genomes.
* [http://www.nslij-genetics.org/gene/ Bibliography on computational gene recognition by Wentian Li]
* [http://genome.imim.es/software/geneid/ geneid]
* [http://genome.imim.es/software/sgp2/ SGP2]
* [http://genes.mit.edu/GENSCAN.html GENSCAN]
* [http://mblab.wustl.edu/software/twinscan/ Twinscan/N-SCAN]
* [http://www.scfbio-iitd.res.in/research/genepredictor.htm CHEMGENOME]
* [http://opal.biology.gatech.edu/GeneMark/ GeneMark]
* [http://www.cebitec.uni-bielefeld.de/groups/brf/software/gismo/ Gismo]
Wikimedia Foundation. 2010.