Bioinformatics journey

A bioinformatics journey is a biological investigation that uses one or more bioinformatics tools, passes sequentially through several rounds of analysis and inference, and reaches an endpoint where testable self-consistencies in the results can provide convincing measures of in silico validation. The process is analogous to solving a crossword puzzle: the fully completed puzzle is far more convincing than a half-solved puzzle, even though more rounds of supposition were required to reach completion than to reach the middle stages.

It is important to recognize when a bioinformatics journey has been performed, because a journey that works on one system may work again on an unrelated system. Systems that are analogous, and thus obey the same bioinformatics grammar, can be investigated through the same reusable bioinformatics journey.

PSI-BLAST provides one of the most familiar types of bioinformatics journey. Weak sequence similarity reported by BLAST is not always a convincing indicator of sequence homology. However, iterated searching with PSI-BLAST can recruit more sequences round after round into a multiple alignment that becomes the source of a progressively more sensitive search tool. Eventually, no more sequences can be recruited and PSI-BLAST is said to have converged. At this endpoint, if the journey was successful, inspecting the resulting multiple alignment and some simple cross-validation can show that all sequences recruited really are homologous; steps performed early on as guesses become confirmed, and are not guesses any longer.
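
The recruit-until-nothing-new logic described above can be sketched in a few lines of Python. This is a toy illustration only, not PSI-BLAST itself: it replaces the position-specific scoring matrix with a crude shared 3-mer (Jaccard) similarity to any already-recruited sequence, and the function names, threshold, and sequences are all hypothetical choices made for the example.

```python
def kmers(seq, k=3):
    """Return the set of overlapping k-mers in a sequence."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def similarity(a, b, k=3):
    """Jaccard similarity of k-mer sets (a stand-in for a real alignment score)."""
    ka, kb = kmers(a, k), kmers(b, k)
    if not ka or not kb:
        return 0.0
    return len(ka & kb) / len(ka | kb)


def iterate_to_convergence(seed, database, threshold=0.3, max_rounds=20):
    """Recruit sequences round after round until a round adds nothing new.

    Returns (recruited names, list of names added per round, converged flag).
    """
    recruited = {seed}
    history = []
    for _ in range(max_rounds):
        new = {
            name
            for name, seq in database.items()
            if name not in recruited
            and any(similarity(seq, database[m]) >= threshold for m in recruited)
        }
        if not new:
            return recruited, history, True  # converged: no more recruits
        recruited |= new
        history.append(sorted(new))
    return recruited, history, False  # stopped by the round limit instead


# Hypothetical sequences: seqB links the seed to seqC; seqD is unrelated.
database = {
    "seqA": "MKTAYIAKQR",
    "seqB": "MKTAYIGGHW",
    "seqC": "PLNDYIGGHW",
    "seqD": "WWCCFFHHPP",
}

recruited, history, converged = iterate_to_convergence("seqA", database)
print(sorted(recruited), history, converged)
```

In this toy data set, seqC shares no 3-mers with the seed and is recruited only through seqB in the second round. That is the kind of early, indirect step that inspection of the converged alignment must later confirm.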

Besides PSI-BLAST to convergence, examples of bioinformatics journeys include:


 * using annotation walking to expand the guild of protein families for a system such as CRISPR-associated (cas) gene regions, and showing at the end that any gene between two members of the same guild is almost always another member itself.
 * using the HMM for a cofactor biosynthesis protein family to build a phylogenetic profile, and using the profile to discover that correlated proteins include an expanded paralogous family of enzymes seemingly dependent on that cofactor. A technique called SIMBAL performs a data mining exercise that finds which subsequences most closely follow a phylogenetic profile. Matching SIMBAL hotspots to homologous sequences from a solved crystal structure, and showing that these sequences correspond to cofactor-binding sites rather than substrate-binding sites, largely validates an initially weak hypothesis of cofactor-binding specificity.
 * observing that a putative peptide maturase is encoded next to a putative target peptide in one genome. Examination of additional genomes with close homologs of the maturase finds that some have a homolog of the putative target peptide nearby, while others do not. Follow-up examination shows that faulty gene-calling was masking the target peptide in some species; the family actually is universally present next to the maturase.
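
The endpoint check in the annotation-walking example above can be expressed as a simple count. The following is a minimal sketch under stated assumptions: a locus is represented as an ordered list of family labels, a gene counts as lying "between" guild members when a member occurs within a fixed window on each side of it, and the labels, window size, and function name are hypothetical.

```python
def flanked_gene_summary(gene_families, guild, window=2):
    """Count genes flanked by guild members and how many are members themselves.

    gene_families: family labels in chromosomal order for one locus.
    guild: set of family labels currently assigned to the guild.
    window: how many neighboring genes to scan on each side (an assumption).

    Returns (flanked members, flanked genes, flanked non-member candidates).
    """
    is_member = [family in guild for family in gene_families]
    flanked = 0
    flanked_members = 0
    candidates = []
    for i, family in enumerate(gene_families):
        upstream = any(is_member[max(0, i - window):i])
        downstream = any(is_member[i + 1:i + 1 + window])
        if upstream and downstream:
            flanked += 1
            if is_member[i]:
                flanked_members += 1
            else:
                # A flanked non-member is a candidate for the next walking round.
                candidates.append((i, family))
    return flanked_members, flanked, candidates


# Hypothetical locus: four guild families with one unassigned gene among them.
locus = ["other1", "g1", "g2", "unk", "g3", "g4", "other2"]
guild = {"g1", "g2", "g3", "g4"}

members, flanked, candidates = flanked_gene_summary(locus, guild)
print(members, flanked, candidates)
```

A flanked gene that is not yet in the guild, such as the label `unk` here, is a natural candidate for the next round of annotation walking. Once such candidates have been added, the ratio of flanked members to flanked genes should approach one if the journey has been successful.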