Combining learning and constraints for genome-wide protein annotation



This paper provides a framework for accurate, high-quality machine-generated protein annotation at the genome scale. It focuses on two tightly connected aspects of cell life: protein function and protein-protein interaction (PPI). Protein function is defined as a characterization of protein behavior, as formalized by the Gene Ontology Consortium (GO). Genome-wide prediction of function and interaction involves inferring the annotations of all proteins in a genome, and it can use prior biological knowledge to reconcile the individual predictions and improve their accuracy. OCELOT instantiates one task for every candidate GO term, i.e., deciding whether a given protein should be annotated with that term, plus a separate task for deciding whether a given protein pair interacts. The overall genome-wide annotations are obtained by imposing consistency across the predictions of all tasks.
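To make the task setup concrete, here is a minimal sketch of how the tasks and their yes/no decisions could be enumerated. The function names (`build_tasks`, `decision_variables`) and the toy protein and term identifiers are assumptions made for illustration; they are not taken from OCELOT's code.

```python
from itertools import combinations


def build_tasks(go_terms):
    """One binary task per candidate GO term, plus one pairwise interaction task."""
    return [f"has_term:{term}" for term in go_terms] + ["interacts"]


def decision_variables(proteins, go_terms):
    """Enumerate every yes/no decision a genome-wide annotation has to make."""
    # One decision per (protein, term) pair: should the protein carry the term?
    function_vars = [(p, t) for p in proteins for t in go_terms]
    # One decision per unordered protein pair: do the two proteins interact?
    interaction_vars = list(combinations(proteins, 2))
    return function_vars, interaction_vars


# Toy genome with two proteins and five candidate terms (hypothetical names).
proteins = ["p1", "p2"]
go_terms = ["f1", "f2", "f3", "f4", "f5"]

tasks = build_tasks(go_terms)
function_vars, interaction_vars = decision_variables(proteins, go_terms)
print(len(tasks), len(function_vars), len(interaction_vars))
```

Even in this toy case the number of decisions grows with both the number of proteins and the number of terms, which is why consistency across tasks has to be enforced jointly rather than one prediction at a time.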

Image of OCELOT decision making

Genome Prediction and constraint learning

The image above shows OCELOT's decision-making process. In the interaction network, circles are proteins and lines represent physical interactions. In the GO taxonomy, boxes are terms and arrows are IsA relations. The predicted annotations for proteins p1 and p2 are shown in black: p1 is annotated with terms f1, f4, and f5, and p2 with f2 and f4. The functional predictions are driven by the similarity between p1 and p2, and by consistency with respect to the GO taxonomy (e.g. f1 entails either f3 or f4, f2 entails f4, etc.). The interaction predictions are driven by similarity between protein pairs (i.e. (p1, p2) against all other pairs) and are mutually constrained by the functional ones. For instance, since p1 and p2 do interact, OCELOT aims at predicting at least one shared term at each level of the GO, e.g. f4 at the middle level. These constraints are not hard, and can be violated if doing so provides a better joint prediction. As an example of such a violation, p1 is annotated with f1 and p2 with f2 even though the two proteins interact.
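The idea of soft constraints can be illustrated with a toy scoring function. This is only a sketch under stated assumptions, not OCELOT's actual inference procedure: the taxonomy encodes just the two relations named above, the split into levels is assumed, f5 is left out because its position in the taxonomy is not described, and all local scores and the penalty weight are made-up numbers.

```python
# Hypothetical toy taxonomy: only the relations named in the text are encoded.
PARENTS = {"f1": {"f3", "f4"}, "f2": {"f4"}}  # child term -> admissible parents
LEVELS = [{"f1", "f2"}, {"f3", "f4"}]          # assumed lower and middle levels


def taxonomy_violations(terms):
    """Count terms predicted without any of their parent terms."""
    return sum(1 for t in terms if t in PARENTS and not (PARENTS[t] & terms))


def interaction_violations(terms_a, terms_b, interacts):
    """Count levels at which an interacting pair shares no term."""
    if not interacts:
        return 0
    return sum(1 for level in LEVELS if not (terms_a & terms_b & level))


def joint_score(annotations, pair, interacts, local_scores, penalty=1.0):
    """Local scores of the chosen labels minus a penalty per violated constraint."""
    a, b = pair
    score = sum(local_scores[(p, t)] for p, terms in annotations.items() for t in terms)
    if interacts:
        score += local_scores[pair]
    violations = sum(taxonomy_violations(terms) for terms in annotations.values())
    violations += interaction_violations(annotations[a], annotations[b], interacts)
    return score - penalty * violations, violations


# Made-up signed scores: positive means the local predictor favours the label.
local_scores = {
    ("p1", "f1"): 1.5, ("p1", "f2"): -2.0, ("p1", "f4"): 1.0,
    ("p2", "f2"): 1.2, ("p2", "f4"): 0.8,
    ("p1", "p2"): 0.9,
}

# Candidate A follows the figure: no shared term at the lower level.
candidate_a = {"p1": {"f1", "f4"}, "p2": {"f2", "f4"}}
# Candidate B satisfies every constraint by forcing the unlikely term f2 onto p1.
candidate_b = {"p1": {"f1", "f2", "f4"}, "p2": {"f2", "f4"}}

score_a, violations_a = joint_score(candidate_a, ("p1", "p2"), True, local_scores)
score_b, violations_b = joint_score(candidate_b, ("p1", "p2"), True, local_scores)
print(violations_a, violations_b, score_a > score_b)
```

With these numbers, candidate A violates one constraint yet scores higher than the fully consistent candidate B, which mirrors the point that a constraint may be violated when doing so gives a better joint prediction.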

Acknowledgement: Thanks to Irene Tortosa for explaining how genome sequencing works.