Sequence profiling tool

A sequence profiling tool in bioinformatics is a type of software that presents information related to a genetic sequence, gene name, or keyword input. Such tools generally take a query, such as a DNA, RNA, or protein sequence or a keyword, and search one or more databases for information related to it. Summaries and aggregate results are presented in a standardized format, sparing the user the visits to many smaller sites or the direct literature searches that compiling the same information would otherwise require. Many sequence profiling tools are software portals or gateways that simplify the process of finding information about a query in the large and growing number of bioinformatics databases. These tools are accessed either through the web or as locally downloadable executables.

Introduction and usage

The "post-genomics" era has given rise to a range of web-based tools and software to compile, organize, and deliver large amounts of primary sequence information, protein structures, gene annotations, and sequence alignments, and to support other common bioinformatics tasks. In general, there are three types of databases and service providers. The first includes the popular public-domain or open-access databases supported by funding and grants, such as NCBI, ExPASy, Ensembl, and PDB. The second includes smaller or more specific databases organized and compiled by individual research groups. Examples include the [http://www.yeastgenome.org/ Yeast Genome Database] and the [http://www.rnabase.org/ RNA database]. The third includes private corporate or institutional databases that require payment or institutional affiliation to access. Such examples are rare given the globalization of the public databases, unless the service is still in development or the end point of the analysis is of commercial value. A profiling approach becomes particularly relevant for the first two groups, where researchers commonly wish to combine information derived from several sources about a single query or target sequence. For example, users might use the sequence alignment and search tool BLAST to identify homologs of their gene of interest in other species, and then use these results to locate a solved protein structure for one of the homologs. Similarly, they might want to know the likely secondary structure of the mRNA encoding the gene of interest, or whether a company sells a DNA construct containing the gene. Sequence profiling tools automate and integrate the search for such disparate information by making the process of querying several different external databases transparent to the user.

Many public databases are already extensively linked so that complementary information in another database is easily accessible; for example, GenBank and the PDB are closely intertwined. However, specialized tools organized and hosted by specific research groups can be difficult to integrate into this linkage effort because they are narrowly focused, are frequently modified, or use custom versions of common file formats. Advantages of sequence profiling tools include the ability to use several of these specialized tools in a single query and present the output through a common interface, the ability to direct the output of one set of tools or database searches into the input of another, and the capacity to distribute hosting and compilation obligations across a network of research groups and institutions rather than a single centralized repository.
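The chaining of one search into the next, with all results collected under a common interface, can be illustrated with a minimal Python sketch. Everything here is an assumption made for illustration: the function names `find_homologs`, `find_structures`, and `profile`, the `Report` class, and the placeholder identifiers are hypothetical, and the two lookup functions are local stubs standing in for calls to external services.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class Report:
    """Common container that gathers the output of every step."""
    query: str
    sections: Dict[str, List[str]] = field(default_factory=dict)


def find_homologs(sequence: str) -> List[str]:
    # Hypothetical stub for a similarity search; returns made-up identifiers.
    return [f"homolog_{i}_of_{len(sequence)}aa" for i in (1, 2)]


def find_structures(homolog_ids: List[str]) -> List[str]:
    # Hypothetical stub for a structure lookup keyed on homolog identifiers.
    return [f"structure_for_{h}" for h in homolog_ids]


def profile(sequence: str) -> Report:
    """Run the steps in order, feeding each result into the next step."""
    report = Report(query=sequence)
    homologs = find_homologs(sequence)
    report.sections["homologs"] = homologs
    # The output of the first search becomes the input of the second.
    report.sections["structures"] = find_structures(homologs)
    return report
```

Calling `profile("MKT")` returns a single `Report` whose `sections` dictionary holds the results of both steps, so the caller never handles the intermediate searches directly.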

Keyword based profilers

Most of the profiling tools available on the web today fall into this category. The user, upon visiting the site or tool, enters relevant information such as a keyword (e.g. dystrophy or diabetes), a GenBank accession number, or a PDB ID. The hits returned by the search are presented in a format shaped by each tool’s main focus. Profiling tools based on keyword searches are essentially search engines that are highly specialized for bioinformatics work, thereby eliminating the clutter of irrelevant or non-scholarly hits that might occur with a traditional search engine like Google. Most keyword-based profiling tools accept flexible input: accession numbers from indexed databases as well as traditional keyword descriptors.
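How a tool might tell these kinds of input apart can be sketched in Python. This is only an illustration under stated assumptions: the function name `classify_query` is hypothetical, and the two patterns are deliberately simplified (a PDB ID taken as one digit followed by three alphanumeric characters, and a GenBank nucleotide accession taken as one letter plus five digits or two letters plus six digits, with an optional version suffix). Real identifier formats are broader than this.

```python
import re

# Simplified patterns, assumed for illustration only.
PDB_ID = re.compile(r"[0-9][A-Za-z0-9]{3}")
GENBANK_ACCESSION = re.compile(r"(?:[A-Z]\d{5}|[A-Z]{2}\d{6})(?:\.\d+)?")


def classify_query(text: str) -> str:
    """Return 'pdb_id', 'genbank_accession', or 'keyword' for a query."""
    token = text.strip()
    if PDB_ID.fullmatch(token):
        return "pdb_id"
    if GENBANK_ACCESSION.fullmatch(token.upper()):
        return "genbank_accession"
    # Anything that does not look like an identifier is treated as free text.
    return "keyword"
```

Under these patterns, `classify_query("1TUP")` yields `"pdb_id"`, `classify_query("U49845")` yields `"genbank_accession"`, and `classify_query("diabetes")` yields `"keyword"`. A four-character input such as `2024` would also be read as a PDB ID, which shows why a pattern check alone is only a first guess.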

Each profiling tool has its own focus and area of interest. For example, the NCBI search engine Entrez segregates its hits by category, so that users looking for protein structure information can screen out sequences with no corresponding structure, while users interested in perusing the literature on a subject can view abstracts of papers published in scholarly journals without distraction from gene or sequence results. The PubMed biosciences literature database is a popular tool for literature searches, though this service is nearly equaled by the more general Google Scholar.

Keyword-based data aggregation services like the Bioinformatic Harvester provide reports from a variety of third-party servers in an "as-is" format, so that users need not visit the website or install the software for each individual component service. This is particularly valuable given the rapid emergence of sites providing different sequence analysis and manipulation tools. Another aggregative web portal, the Human Protein Reference Database (HPRD), contains manually annotated and curated entries for human proteins. The information provided is thus both selective and comprehensive, and the query format is flexible and intuitive. The advantages of manually curated databases include the presentation of proofread material and the concept of ‘molecule authorities’ who take responsibility for specific proteins. The disadvantages are that such databases are typically slower to update and may not contain very new or disputed data.

Sequence data based profilers

A typical sequence profiling tool carries this further by taking an actual DNA, RNA, or protein sequence as input and allowing the user to visit different web-based analysis tools to obtain the desired information. Such tools are also commonly supplied with commercial laboratory equipment like gene sequencers, or are sometimes sold as software applications for molecular biology. In another public-database example, the BLAST sequence search report from NCBI provides links from its alignment report to other relevant information in NCBI's own databases, if such information exists. For example, a retrieved record that contains a human sequence carries a separate link to its location on a human genome map; a record that contains a sequence for which a 3-D structure has been solved carries a link to the corresponding entry in the structure database. Sequerome, a public service tool, links the entire BLAST report to many third-party servers and sites that provide highly specific sequence manipulation services such as restriction enzyme maps, open reading frame analyses for nucleotide sequences, and secondary structure prediction. The tool has the added advantage of a tabbed browsing interface that tracks user operations, so that a project can be carried to completion within one browser interface. The natural evolution of such profilers would include the ability to customize and automate the processing of sets of sequence data. [http://bioinformatics.georgetown.edu/InstaSeq.htm InstaSeq] is a Google-powered search tool that allows the user to enter a sequence directly and search the entire World Wide Web. This unique search engine stands in contrast to searching specific databases such as GenBank. As a result, the user can end up with a privately hosted document or a page from a lesser-known database from just about anywhere in the world.
Though sequence-based profilers are currently few and far between, their key role will become evident when huge amounts of sequence data need to be cross-processed across portals and domains.
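A first step for any profiler that accepts raw sequence input is to guess what kind of sequence it has received, since that determines which downstream analyses make sense. The Python sketch below illustrates the idea under stated assumptions: the names `guess_sequence_type`, `suggest_analyses`, and `ANALYSES` are hypothetical, the alphabet test is a crude heuristic, and the pairing of analyses with sequence types is an illustrative choice rather than a description of how Sequerome or any other tool routes its queries.

```python
DNA = set("ACGTN")
RNA = set("ACGUN")
# One-letter codes of the twenty standard amino acids.
PROTEIN = set("ACDEFGHIKLMNPQRSTVWY")

# Illustrative mapping only; a real tool may offer different analyses.
ANALYSES = {
    "dna": ["restriction enzyme map", "open reading frame analysis"],
    "rna": ["secondary structure prediction"],
    "protein": ["secondary structure prediction"],
}


def guess_sequence_type(raw: str) -> str:
    """Guess 'dna', 'rna', 'protein', or 'unknown' from the letters used."""
    seq = "".join(raw.split()).upper()  # drop whitespace and line breaks
    if not seq:
        raise ValueError("empty sequence")
    letters = set(seq)
    # Nucleotide alphabets are checked first because they are subsets
    # of the protein alphabet apart from U.
    if letters <= DNA:
        return "dna"
    if letters <= RNA:
        return "rna"
    if letters <= PROTEIN:
        return "protein"
    return "unknown"


def suggest_analyses(raw: str) -> list:
    """List the analyses assumed to apply to the guessed sequence type."""
    return ANALYSES.get(guess_sequence_type(raw), [])
```

Because the nucleotide letters also occur in protein sequences, a short peptide made up only of A, C, G, and T would be misread as DNA; a production tool would need a more careful test or an explicit choice by the user.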

Future growth and directions

The proliferation of bioinformatics tools for genetic analysis aids researchers in identifying and categorizing genes and gene sets of interest in their work; however, the large variety of tools that perform substantially similar aggregative and analytical functions can also confuse and frustrate new users. The decentralization encouraged by aggregative tools allows individual research groups to maintain specialized servers dedicated to specific types of data analysis in the expectation that their output will be collected into a larger report on a gene or protein of interest to other researchers.

Data produced by microarray experiments, two-hybrid screening, and other high-throughput biological experiments is voluminous and difficult to analyze by hand; the efforts of structural genomics collaborations aimed at quickly solving large numbers of highly varied protein structures also increase the need for integration between sequence and structure databases and portals. This impetus toward more comprehensive and more user-friendly methods of sequence profiling makes it an active area of research in genomics.

See also

* [http://bind.ca/ Biomolecular Interaction Network Database (BIND)]
* [http://harvester.fzk.de Bioinformatic Harvester III] at KIT (Karlsruhe Institute of Technology)
* Entrez
* Metadata
* Sequence analysis
* Sequence motif

References

* "Biomedical language processing: what's beyond PubMed?" [http://www.ncbi.nlm.nih.gov/entrez/query.fcgi?cmd=Retrieve&db=pubmed&dopt=Abstract&list_uids=16507357&query_hl=1&itool=pubmed_docsum Mol Cell. 2006 Mar 3;21(5):589-94.]
* "Google versus PubMed", [http://www.ncbi.nlm.nih.gov/entrez/query.fcgi?cmd=Retrieve&db=pubmed&dopt=Abstract&list_uids=16263030&query_hl=1&itool=pubmed_docsum Ann R Coll Surg Engl. 2005 Nov;87(6):491-2.]
* "'Harvester': a fast meta search engine of human protein resources", [http://www.ncbi.nlm.nih.gov/entrez/query.fcgi?cmd=Retrieve&db=pubmed&dopt=Abstract&list_uids=14988114&query_hl=9&itool=pubmed_docsum Bioinformatics. 2004 Aug 12;20(12):1962-3.]
* "Human protein reference database as a discovery resource for proteomics", [http://www.ncbi.nlm.nih.gov/entrez/query.fcgi?cmd=Retrieve&db=pubmed&dopt=Abstract&list_uids=14681466&query_hl=11&itool=pubmed_docsum Nucleic Acids Res. 2004 Jan 1;32(Database issue):D497-501]
* "Web-based interface facilitating sequence-to-structure analysis of BLAST alignment reports", [http://www.ncbi.nlm.nih.gov/entrez/query.fcgi?cmd=Retrieve&db=pubmed&dopt=Abstract&list_uids=16116790&query_hl=15&itool=pubmed_docsum Biotechniques. 2005 Aug;39(2):186]


Wikimedia Foundation. 2010.
