Bioinformatics grammar

A biological system can be categorized by the nature of the process it carries out, whether that process is consumption of a particular type of sugar, production of a particular kind of toxin, production of a cofactor, or processing of a nascent polypeptide according to a protein sorting signal. Within each category of process, the components of the system relate to each other in recognizable ways, even if the proteins taking corresponding roles in different instances have no similarity to each other. The pattern of relationships among the components, as revealed by examining the comparative genomics of the system, is its bioinformatics grammar. Features of the grammar include the basic pattern of design, which can be seen by inspecting the list of conserved components, but also computed characteristics such as sporadic distribution among closely related species.
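A computed characteristic such as sporadic distribution can be illustrated with a minimal sketch. The scoring rule here is an assumption made for illustration, not a method described in the text: given a presence/absence call for the system in each genome and a list of genome pairs judged to be closely related, it reports the fraction of pairs in which exactly one member carries the system. The function name, the input shapes, and the genome labels are all hypothetical.

```python
def sporadic_distribution_score(presence, close_pairs):
    """Fraction of closely related genome pairs that disagree on presence.

    presence    -- dict mapping genome name to True/False (system present?)
    close_pairs -- iterable of (genome_a, genome_b) tuples judged closely
                   related; how that judgment is made is outside this sketch
    """
    # Ignore pairs for which a presence call is missing for either genome.
    scored = [(a, b) for a, b in close_pairs if a in presence and b in presence]
    if not scored:
        return 0.0
    discordant = sum(presence[a] != presence[b] for a, b in scored)
    return discordant / len(scored)


# Hypothetical presence calls for three pairs of sister strains.
presence = {"A1": True, "A2": False, "B1": True, "B2": True,
            "C1": False, "C2": True}
pairs = [("A1", "A2"), ("B1", "B2"), ("C1", "C2")]
score = sporadic_distribution_score(presence, pairs)
```

Under this assumed scoring, a system present in every genome scores 0, while a system that is frequently gained or lost between close relatives scores nearer 1.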

In a bacteriocin-like peptide modification system, the leader peptide region (absent from the mature peptide) tends to be better conserved than the mature portion. The peptide will tend to be exported; a neighboring gene may resemble a transporter fused to a protease. The mix-and-match association often observed between precursor peptide families and their maturation cassettes can lead to the discovery of new precursor classes near orphan maturation cassettes, or of new maturation enzymes encoded near known precursor peptide genes.

A cofactor dependency signature can reveal not only that an enzyme, or a paralogous family of enzymes, depends on some non-universal cofactor, such as F420 or PQQ, but also which sites within the enzyme contribute to cofactor-binding specificity. In the discovery of the mycofactocin system (PMID: 21223593), the cofactor system grammar proved a better fit than the bacteriocin-like peptide modification system grammar.

The uniform functional organization, or design pattern, common to the typical sugar utilization pathway ties together catabolic enzyme cassettes, transporters, and regulators. Enzymes broadly assigned to categories such as aldolase or carbohydrate kinase can point to one of these pathways, and enable its study by comparative genomics methods.

In a protein sorting system, a common bioinformatics grammar is based on the recognition motif occurring within a short homology domain with a telltale architecture: a distinctive motif, a transmembrane alpha-helix, and a cluster of basic residues, all located at the protein C-terminus. This architectural grammar is coupled to a distribution pattern, with the homology domain occurring many times per complete genome if it occurs at all. The domain will occur in proteins that share no other regions of homology with one another, but all proteins with the domain will also have an N-terminal signal sequence. Genomes that encode such a sorting signal will also encode the enzyme that recognizes and processes it. This grammar describes the LPXTG motif and the protein-sorting transpeptidase called sortase. A matching grammar occurs for the PEP-CTERM domain, and was used to discover the putative protein-sorting enzyme called exosortase (PMID: 16930487).
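The sorting-signal grammar can be sketched in code, under stated assumptions. The sketch below checks a protein's C-terminal tail for the LPXTG motif followed by an unbroken hydrophobic run (a crude stand-in for the transmembrane helix) and a cluster of basic residues at the very end, then applies the genome-level rule that the signal should occur several times and alongside its processing enzyme. The residue alphabets, the tail length, and every threshold are illustrative placeholders, not published values, and the N-terminal signal sequence check is omitted for brevity.

```python
import re

# Illustrative assumptions, not values from the text: the residue alphabets,
# the tail length, and every threshold below are placeholders.
MOTIF = re.compile(r"LP.TG")      # the LPXTG motif, X = any residue
HYDROPHOBIC = set("AVLIFMWGC")
BASIC = set("KR")


def longest_hydrophobic_run(segment):
    """Length of the longest unbroken run of hydrophobic residues."""
    best = run = 0
    for residue in segment:
        run = run + 1 if residue in HYDROPHOBIC else 0
        best = max(best, run)
    return best


def has_sorting_signal(protein, tail_length=45, min_run=12,
                       basic_window=8, min_basic=2):
    """True if the C-terminal tail shows motif, hydrophobic run, basic cluster."""
    tail = protein[-tail_length:]
    match = MOTIF.search(tail)
    if match is None:
        return False
    after_motif = tail[match.end():]
    if len(after_motif) <= basic_window:
        return False
    helix_region = after_motif[:-basic_window]   # between motif and C-terminus
    c_terminus = after_motif[-basic_window:]     # last few residues
    basic_count = sum(residue in BASIC for residue in c_terminus)
    return (longest_hydrophobic_run(helix_region) >= min_run
            and basic_count >= min_basic)


def genome_fits_grammar(proteins, encodes_processing_enzyme, min_hits=3):
    """Genome-level rule: several signal-bearing proteins plus the enzyme."""
    hits = sum(has_sorting_signal(p) for p in proteins)
    return hits >= min_hits and encodes_processing_enzyme
```

In practice the homology domain would be detected with a profile model rather than a regular expression; the regular expression is used here only to keep the sketch self-contained.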

For the type 2 toxin-antitoxin system, the bioinformatics grammar specifies a toxin from one guild of protein families, and an antitoxin from another. The two proteins, both small, are encoded in two-gene modules that show fast evolution, the sporadic distribution characteristic of extensive lateral mobility, frequent occurrence on plasmids or in prophage regions, and some character of mix-and-match association between the two guilds (see Makarova, et al., PMID: 19493340). The antitoxin, although small, always has a protein-protein interaction domain and a DNA-binding domain. These features of the grammar have allowed data-mining by comparative genomics for new homology families of toxin-antitoxin systems.
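The two-gene module part of this grammar can be sketched as a scan over annotated genes. The sketch assumes each gene already carries a family assignment, and looks for adjacent, same-strand pairs of small genes in which one family comes from a toxin guild and the other from an antitoxin guild, in any combination (the mix-and-match property). The guild sets are a small illustrative subset of known families; the `Gene` record, the function name, and the size and spacing thresholds are hypothetical.

```python
from dataclasses import dataclass

# Small illustrative subsets of the two guilds; real guilds are much larger.
TOXIN_GUILD = {"RelE", "MazF", "VapC", "ParE", "HigB"}
ANTITOXIN_GUILD = {"RelB", "MazE", "VapB", "ParD", "HigA"}


@dataclass
class Gene:
    locus: str
    start: int      # 1-based coordinate of the first base
    end: int        # coordinate of the last base
    strand: str     # "+" or "-"
    family: str     # protein family assignment, assumed precomputed

    @property
    def length_aa(self):
        return (self.end - self.start + 1) // 3


def find_candidate_modules(genes, max_aa=150, max_gap=30):
    """Return locus pairs that fit the two-gene toxin-antitoxin pattern."""
    ordered = sorted(genes, key=lambda g: g.start)
    hits = []
    for left, right in zip(ordered, ordered[1:]):
        if left.strand != right.strand:
            continue                      # module genes share a strand
        if right.start - left.end > max_gap:
            continue                      # too far apart (overlap is allowed)
        if max(left.length_aa, right.length_aa) > max_aa:
            continue                      # both proteins must be small
        families = {left.family, right.family}
        # Mix-and-match: any toxin family may pair with any antitoxin family.
        if families & TOXIN_GUILD and families & ANTITOXIN_GUILD:
            hits.append((left.locus, right.locus))
    return hits
```

Because the guilds are matched as sets rather than as fixed pairs, a VapB-family antitoxin next to a RelE-family toxin is reported just as a cognate pairing would be.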

Most examples of one-gene systems present a degenerate, largely uninformative bioinformatics grammar.