The bioinformatics grammars of cofactor dependency


SIMBAL heat map showing that, among short stretches of polypeptide (toward the bottom of the triangle) from a protein in the LLM family, strong correlations (red) with a metabolic background of F420 biosynthesis are confined to a few small regions.

As a rule, a cell provides (synthesizes or imports) a cofactor such as FMN or F420 (see Wikipedia:Cofactor_(biochemistry)) only if at least one protein made in the cell is an enzyme or carrier that requires the cofactor. The reverse constraint is the one that data mining can exploit: an enzyme that requires the cofactor is useful only where the cofactor is available. Such an enzyme's distribution across multiple species will therefore show an only-if relationship to cofactor availability; the enzyme is not expected to be encoded in a genome unless that genome also encodes the means, usually biosynthesis, to provide the cofactor.
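The only-if relationship can be checked directly on presence/absence data. The following is a minimal sketch in Python, assuming each protein family and the cofactor system are summarized as sets of genome identifiers. The function names and the toy genome labels are hypothetical and are not taken from any published tool.

```python
def only_if_holds(enzyme_genomes, cofactor_genomes):
    """True when every genome encoding the enzyme also provides the cofactor."""
    return set(enzyme_genomes) <= set(cofactor_genomes)


def only_if_violations(enzyme_genomes, cofactor_genomes):
    """Genomes that encode the enzyme but lack the cofactor system."""
    return sorted(set(enzyme_genomes) - set(cofactor_genomes))


# Hypothetical toy data: genomes that encode the cofactor biosynthesis marker.
cofactor_positive = {"genome_A", "genome_B", "genome_C"}

# A candidate cofactor-dependent family: found only where the cofactor is made.
candidate_family = {"genome_A", "genome_C"}

# An unrelated family: also found in a genome that lacks the cofactor.
unrelated_family = {"genome_A", "genome_D"}

candidate_ok = only_if_holds(candidate_family, cofactor_positive)
unrelated_ok = only_if_holds(unrelated_family, cofactor_positive)
unrelated_bad = only_if_violations(unrelated_family, cofactor_positive)
```

Note that the test is one-directional: the candidate family is allowed to be absent from genome_B even though genome_B makes the cofactor, because a cofactor may serve other enzymes there.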

For an unknown or incompletely described system, distinguishing a cofactor-dependent enzyme from a cofactor biosynthesis enzyme can be difficult. However, closely homologous enzymes reliably use the same cofactors. Therefore, the bioinformatics grammar of the cofactor dependency signature allows one or more expanded paralogous families of enzymes to be part of the signature.

If the cofactor biosynthesis system is not universal, but instead shows sporadic distribution, then the only-if relationship will be highly informative. Partial Phylogenetic Profiling (PPP) based on a marker for cofactor biosynthesis will give top scores to proteins from the biosynthesis operon itself, but will often show a second tier of high scores for enzymes that require the cofactor. A splendid example of this signature appears for coenzyme F420 biosynthesis in Mycobacterium smegmatis, where PPP's second tier is filled with dozens of F420-dependent enzymes from three expanded paralogous families (see PMID: 20675471). In this published example, the cofactor dependency signature helped reveal that the cofactor for each of these enzymes was actually F420 rather than the chemically similar flavin mononucleotide (FMN).
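The scoring idea behind this kind of profiling can be illustrated with a short Python sketch. It is not the published PPP implementation: it assumes the similarity search has already been run and reduced to a list of genomes ordered by the strength of their best hit to the query, and it uses a hypergeometric tail probability as an assumed stand-in for the published scoring. All names and the toy data are hypothetical.

```python
from math import comb


def hypergeom_tail(k, n, K, N):
    """P(X >= k) when n genomes are drawn from N, of which K are profile-positive."""
    total = comb(N, n)
    upper = min(n, K)
    return sum(comb(K, i) * comb(N - K, n - i) for i in range(k, upper + 1)) / total


def partial_profile_score(ranked_genomes, positive, n_genomes):
    """Scan down a ranked hit list and keep the depth with the smallest p-value.

    ranked_genomes: genomes ordered by decreasing similarity of their best hit
                    to the query protein, one entry per genome (assumption).
    positive:       set of genomes that carry the cofactor biosynthesis marker.
    n_genomes:      total number of genomes searched.
    """
    best_p, best_depth, hits = 1.0, 0, 0
    for depth, genome in enumerate(ranked_genomes, start=1):
        if genome in positive:
            hits += 1
        p = hypergeom_tail(hits, depth, len(positive), n_genomes)
        if p < best_p:
            best_p, best_depth = p, depth
    return best_p, best_depth


# Hypothetical toy data: 10 genomes, 4 of which make the cofactor.
positive = {"g1", "g2", "g3", "g4"}

# Query whose closest homologs sit in exactly the cofactor-positive genomes.
dependent_p, dependent_depth = partial_profile_score(
    ["g1", "g2", "g3", "g4", "g5", "g6"], positive, 10
)

# Query whose closest homologs ignore the cofactor profile.
other_p, other_depth = partial_profile_score(
    ["g5", "g6", "g1", "g7"], positive, 10
)
```

In this toy case the first query reaches its best score at depth 4, where all four hits fall in cofactor-positive genomes, while the second query never scores well at any depth. Scanning over depths is what makes the profile "partial": the query family does not have to match the marker's distribution beyond the cutoff that scores best.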

If an enzyme family is heterogeneous in its cofactor requirements, phylogenetic profiling-like approaches based on cofactor availability can drill down to the level of subsequences within a protein, where this data mining highlights cofactor-binding sites rather than substrate-binding sites. This technique, called SIMBAL (Sites Inferred by Metabolic Background Assertion Labeling), produces elegant heat maps that are most easily interpreted when a protein crystal structure is available.
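The triangular layout of such a heat map comes from scoring every window of every length along the protein. The Python sketch below illustrates only that layout and the labeling idea. It is a deliberately simplified stand-in, not the published SIMBAL procedure: it assumes the training sequences are already aligned to the query without gaps, ranks them by plain identity within each window, and scores a window by the fraction of cofactor-background sequences among its top matches. The sequences, names and scoring rule are all hypothetical.

```python
def window_identity(a, b, start, length):
    """Number of identical positions between a and b inside one window."""
    return sum(x == y for x, y in zip(a[start:start + length],
                                      b[start:start + length]))


def window_triangle(query, training, top_k, min_len=2):
    """Score every (start, length) window of the query.

    training: list of (aligned_sequence, is_positive) pairs, where is_positive
              marks proteins from genomes with the cofactor background.
    Returns a dict mapping (start, length) to the fraction of positives among
    the top_k training sequences most similar to the query in that window.
    """
    scores = {}
    for length in range(min_len, len(query) + 1):
        for start in range(len(query) - length + 1):
            ranked = sorted(
                training,
                key=lambda item: window_identity(query, item[0], start, length),
                reverse=True,
            )
            top = ranked[:top_k]
            scores[(start, length)] = sum(1 for _, pos in top if pos) / top_k
    return scores


# Hypothetical toy data: positives share only the "WG" motif with the query,
# negatives share everything except that motif.
query = "MKTWGSAV"
training = [
    ("LRSWGTCI", True),
    ("IQNWGPLL", True),
    ("MKTFASAV", False),
    ("MKTYDSAV", False),
]

triangle = window_triangle(query, training, top_k=2)
motif_score = triangle[(3, 2)]    # window covering "WG"
flank_score = triangle[(0, 2)]    # window covering "MK"
whole_score = triangle[(0, 8)]    # full-length window
```

In this toy data the short window over the shared motif scores high, while the flanking window and the full-length window do not, which mirrors how a signal confined to a small region shows up only near the bottom of the triangle.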

Caveats for working with the bioinformatics grammar of the cofactor dependency signature:

  • A protein with an only-if relationship to a known or suspected cofactor biosynthesis system may not use the cofactor at all. The relationship may instead represent non-orthologous gene displacement, or variably present upstream and auxiliary components in pathways for obtaining the cofactor.
  • One of the biosynthesis enzymes may belong to a paralogous family, and that family may in fact be a neofunctionalization donor family.
  • An enzyme obviously cannot show any correlation to the presence/absence of a cofactor that is universally present. An enzyme requiring such a cofactor may be indistinguishable from a one-gene system.
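The last caveat can be illustrated numerically. The sketch below uses the Shannon entropy of a presence/absence profile as an assumed, simple measure of how much a profile can discriminate at all; this measure is chosen here for illustration and is not part of PPP or SIMBAL. The function name is hypothetical.

```python
from math import log2


def profile_entropy(n_positive, n_genomes):
    """Shannon entropy (bits per genome) of a presence/absence profile."""
    p = n_positive / n_genomes
    if p in (0.0, 1.0):
        # A universally present (or universally absent) cofactor carries
        # no information for separating genomes.
        return 0.0
    return -(p * log2(p) + (1 - p) * log2(1 - p))


universal = profile_entropy(10, 10)   # cofactor present in every genome
sporadic = profile_entropy(5, 10)     # cofactor present in half the genomes
rare = profile_entropy(1, 10)         # cofactor present in one genome
```

A universal cofactor gives an entropy of zero, so no enzyme can correlate with it, while a sporadically distributed cofactor gives the profile the most room to discriminate.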