Genome Properties and SEED Subsystems

Completely sequenced bacterial genomes encode about 4000 proteins each on average, with wide variation from genome to genome, yet only 100 proteins or so are close to universal. Frequently, sets of genes form biochemical pathways, assemble into complexes, include one that modifies another, or form a regulatory cascade. A species is hard to understand as a bag of genes and easier to understand at the level of systems.

Genome Properties (PMID:15347579) provides systems-level organization for the bag of genes by defining a system as a set of components, with hidden Markov models (HMMs) as the most common evidence that a component is present. It describes pathways, complexes, regulatory systems, transport systems, sorting systems, mobile elements, and various genometric properties such as GC content and amino acid abundances. Genome Properties guide the construction of equivalog models for TIGRFAMs, although not formally as a requirement for use in a Genome Property.
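The idea of a system as a set of components, each supported by HMM evidence, can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the Genome Properties data model: the class names, the required/optional flag, the YES/PARTIAL/NO state labels, and the accessions in the usage note are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Component:
    """One step or part of a system (hypothetical structure)."""
    name: str
    evidence: frozenset      # HMM accessions, any one of which supports this component
    required: bool = True    # assumption: optional components do not affect the state


@dataclass(frozen=True)
class GenomeProperty:
    """A system defined as a set of components (hypothetical structure)."""
    name: str
    components: tuple


def evaluate(prop, hmm_hits):
    """Return (state, missing) for one genome.

    hmm_hits is the set of HMM accessions that scored a hit in the genome.
    The state labels are illustrative, not the database's own vocabulary.
    """
    required = [c for c in prop.components if c.required]
    # A required component is missing when none of its evidence HMMs was hit.
    missing = [c.name for c in required if not (c.evidence & hmm_hits)]
    if not missing:
        state = "YES"
    elif len(missing) < len(required):
        state = "PARTIAL"
    else:
        state = "NO"
    return state, missing
```

For example, a two-step property whose components are supported by the made-up accessions `HMM_A` and `HMM_B` evaluates to `("PARTIAL", ["step 2"])` for a genome in which only `HMM_A` has a hit.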

In the Subsystems Approach (PMID:16214803), an annotation environment called The SEED treats each example of a genome in which all the components of a subcellular system have been identified as a "populated subsystem" for iterative use in furthering the projection onto additional complete genomes. The associated collection of protein families, FIGfams, relies primarily on a voting algorithm, in contrast to the HMMs used in TIGRFAMs, the primary source of classifiers used in Genome Properties.

Genome Properties and the Populated Subsystems agree on several important principles, including this one: "The presence or absence of metabolic pathways and structures provides a context that makes protein annotation far more reliable."

The Genome Properties and SEED Subsystems approaches provide excellent means for using comparative genomics to project annotation from model systems described in the literature onto equivalent systems found in newly sequenced genomes. But they also provide a means of discovery: they reveal systems with a recurring pathway hole, which may point to a non-orthologous gene displacement, or describe an unknown system whose meaning can be inferred from its bioinformatics grammar.
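A recurring pathway hole can be pictured as the same component going undetected in several genomes that otherwise carry the complete system. The sketch below is a hypothetical illustration of that idea only, not a method taken from either resource: the function name, the "exactly one missing component" rule, and the `min_genomes` threshold are all assumptions.

```python
from collections import Counter


def recurring_holes(system, genome_components, min_genomes=2):
    """List components that are the sole gap in at least min_genomes genomes.

    system: ordered list of component names that define the system.
    genome_components: dict mapping genome name -> set of components detected.
    Assumption: a genome with exactly one missing component has a "hole";
    genomes missing more than one are treated as lacking the system.
    """
    holes = Counter()
    for found in genome_components.values():
        missing = [c for c in system if c not in found]
        if len(missing) == 1:
            holes[missing[0]] += 1
    # Keep only holes seen often enough to suggest a displaced gene.
    return sorted((c, n) for c, n in holes.items() if n >= min_genomes)
```

A component reported by this function is a candidate for follow-up, for instance by looking for an unrelated gene that is consistently found next to the rest of the system in those genomes.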