Misannotation's greatest hits

Genome annotation goes wrong in a variety of ways, several of which recur often enough that the failure modes merit standard terms.

Original sin in characterization is an error in characterization or interpretation made prior to inspection by a biocurator. For example, the aluminum resistance gene ALU1-P from Arthrobacter viscosus (see PMID:9367855) appears actually to encode the queuosine biosynthesis protein QueC, involved in a tRNA modification pathway. The same appears to be true of exsB, described as a gene "which probably codes for a regulator of succinoglycan biosynthesis" (see PMID:8544814).

Original sin in biocuration is a post-publication error that affects the first sequence exemplar to be provided to annotation pipelines. For example, ALU2-P is reported in an unpublished database submission to be an aluminum resistance protein, but biocuration appears to have mistakenly linked the sequence of ALU2-P, a putative pyridoxal phosphate-dependent enzyme, to the characterization published for ALU1-P.

Most annotation does not originate with an experiment, but rather through propagation from a sequence with "known" function to some other sequence, usually a homolog.

Transitive annotation error is any misannotation created by carrying information forward from a sequence that already bears the information (rightly or wrongly) to a sequence that should not bear it. In extreme cases, called daisy chain transitive annotation, an annotation may reach a protein based on homology in one region, then propagate onward from that protein based on homology in some non-overlapping region.
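The daisy chain case can be illustrated with a simple interval check. The sketch below is hypothetical and not taken from any annotation pipeline: it assumes that, for an intermediate protein, we know the region whose homology brought the annotation in and the region whose homology would carry it onward, both in that protein's own residue coordinates. The names Region, overlap_length, is_daisy_chain, and the min_overlap threshold are all invented for illustration.

```python
from typing import NamedTuple


class Region(NamedTuple):
    """A stretch of residues on the intermediate protein (1-based, inclusive)."""
    start: int
    end: int


def overlap_length(a: Region, b: Region) -> int:
    """Number of residues shared by two regions on the same protein."""
    return max(0, min(a.end, b.end) - max(a.start, b.start) + 1)


def is_daisy_chain(evidence_in: Region, evidence_out: Region,
                   min_overlap: int = 1) -> bool:
    """Flag a propagation step whose outgoing homology region shares fewer
    than min_overlap residues with the region that justified the annotation.

    min_overlap is an illustrative threshold, not an established cutoff.
    """
    return overlap_length(evidence_in, evidence_out) < min_overlap


# Annotation arrived via residues 1-120 but would leave via residues 200-350.
print(is_daisy_chain(Region(1, 120), Region(200, 350)))  # True
# Here the two regions share residues 100-120, so the step is not flagged.
print(is_daisy_chain(Region(1, 120), Region(100, 350)))  # False
```

A check like this only detects the extreme, non-overlapping case; an overlapping region can still carry a wrong annotation forward.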

Overly specific prediction of function is the error of assigning an exact name (e.g., galactose-6-phosphate isomerase or glucose-6-phosphate isomerase) when homology and context support only a more general name (see PMID:19844580). A useful step toward removing this type of error would be to develop a protein naming tree (or directed acyclic graph), with more general, functionally descriptive names (e.g., hexose-phosphate isomerase) assigned to internal nodes.
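To make the naming-graph idea concrete, here is a minimal sketch of how such a structure might be queried. It assumes a hypothetical child-to-parents mapping; only the three isomerase names come from the text above, while the root name "isomerase" and the function names are invented for illustration. Given several candidate names that the evidence cannot distinguish, it returns the most specific name or names they all share.

```python
# Hypothetical naming graph: each name maps to its more general parent names.
PARENTS = {
    "galactose-6-phosphate isomerase": ["hexose-phosphate isomerase"],
    "glucose-6-phosphate isomerase": ["hexose-phosphate isomerase"],
    "hexose-phosphate isomerase": ["isomerase"],  # "isomerase" is an assumed root
}


def ancestors(name, parents):
    """Return the name itself plus every more general name reachable from it."""
    seen = {name}
    stack = [name]
    while stack:
        node = stack.pop()
        for parent in parents.get(node, ()):
            if parent not in seen:
                seen.add(parent)
                stack.append(parent)
    return seen


def most_specific_shared_names(candidates, parents):
    """Return the shared names that are not ancestors of another shared name."""
    if not candidates:
        return set()
    common = set.intersection(*(ancestors(c, parents) for c in candidates))
    return {
        n for n in common
        if not any(n != m and n in ancestors(m, parents) for m in common)
    }


candidates = ["galactose-6-phosphate isomerase", "glucose-6-phosphate isomerase"]
print(most_specific_shared_names(candidates, PARENTS))
# {'hexose-phosphate isomerase'}
```

The function returns a set because, in a directed acyclic graph, a name may have more than one parent, so two candidates can share several equally specific general names.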

Verschlimmbessern, in genome annotation, is the act of making a protein name worse through well-meaning attempts to improve it, perhaps to achieve better standardization (examples needed).

Erroneous computational analysis (see PMID:11790254).

Faulty spelling arises both through original sin in biocuration and through transitive annotation error. In protein names, "hypothetical" may become "hypotetical", "hypotethical", or "hyothetical", while "protien", with nearly 2000 instances, is one of misannotation's greatest hits.
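Known misspellings of this kind are easy to catch mechanically. The following is a minimal sketch, assuming a hand-maintained lookup table seeded only with the four misspellings quoted above; the table and function names are hypothetical, and a real cleanup would need a much larger list.

```python
import re

# Misspellings quoted in the text above, mapped to their corrections.
KNOWN_MISSPELLINGS = {
    "hypotetical": "hypothetical",
    "hypotethical": "hypothetical",
    "hyothetical": "hypothetical",
    "protien": "protein",
}

# Match any listed misspelling as a whole word, ignoring case.
_PATTERN = re.compile(
    r"\b(" + "|".join(map(re.escape, KNOWN_MISSPELLINGS)) + r")\b",
    re.IGNORECASE,
)


def correct_spelling(name: str) -> str:
    """Replace known misspellings in a protein name, keeping initial capitals."""
    def fix(match):
        word = match.group(0)
        fixed = KNOWN_MISSPELLINGS[word.lower()]
        return fixed.capitalize() if word[0].isupper() else fixed
    return _PATTERN.sub(fix, name)


print(correct_spelling("hypotetical protien"))  # hypothetical protein
print(correct_spelling("Hyothetical protein"))  # Hypothetical protein
```

Because misspellings spread by transitive annotation, correcting them only at the point of display would leave the source records able to propagate the error again.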