by Matthew Brush
In September 2013, the Phenotype RCN sponsored a three-day workshop at Oregon Health & Science University to align sequence feature and genetic variation representation and thereby support phenotype data integration. Participants included developers of the Sequence Ontology (SO) (Karen Eilbeck, Mike Bada, and Bret Heale), and members of the ontology team from the Monarch Initiative who have been developing a genotype ontology called GENO (Matthew Brush, Melissa Haendel, and Chris Mungall).
One of the goals of the Phenotype RCN is to promote coordination and standardization of phenotype-related data. A standardized representation of genotype information is required for integrating genetically linked phenotype data from diverse sources including model organism, human variation, livestock, and evolutionary databases. A particular challenge is harmonizing phenotype annotations that are linked to genetic variations at different levels of granularity, from complete strain genotypes, to specific gene alleles, to single nucleotide polymorphisms.
Monarch and SO Projects
The Monarch Initiative is a new effort that aims to integrate genotype-to-phenotype and related data from numerous sources under a common semantic framework, and develop tools and services for user-guided exploration and analysis. Towards this end, Monarch required development of new modeling for genotypes (housed in GENO), which was lacking in the ontology landscape. The scope of GENO necessarily overlaps with that of the Sequence Ontology, but has a unique perspective on sequence features as they relate to linking different scales of genetic variation and to organismal phenotypes. The need to align modeling between SO and GENO motivated our collaboration, which was particularly timely as the SO had recently initiated a refactoring to accommodate use cases beyond its initial charge of genome annotation. This refactoring aimed to define the context of the SO with respect to the Basic Formal Ontology (BFO) and other OBO ontologies, enhance representation of sequence variation, and develop a parallel representation of material sequence features (MSO) to complement the abstract feature representation in the existing SO. These goals were consistent with those of Monarch to support better phenotype data integration and therefore a workshop was funded by the Phenotype RCN.
Genetic Variation in GENO
The genotype information modeled in GENO is broadly conceived to include any variation in gene expression that is tied to an observed phenotypic effect. Two types of ‘genetic variation’ are explicitly distinguished in GENO: (1) ‘Sequence-variation’ describes changes in the sequence of an organism’s genome, which are captured in the traditional genotypes shared by biologists. In this context, ‘sequence variant genes’ are heritable changes in genomic DNA, and include point mutations, SNPs, and transgenic insertions that are represented in SO. (2) ‘Expression-variation’ relates to experimental alterations in the expression level of genes that are not due to changes in the sequence of the subject’s genome. Here, ‘expression variant genes’ are genes whose expression level is altered by some experimental intervention, for example targeted gene knock-down using reagents such as morpholinos and RNAi, or transient expression from DNA constructs. Like sequence variants, these expression variants change what is expressed in an organism and can lead to measurable phenotypic outcomes. The GENO ontology aims to re-use and co-develop the SO sequence variation model, but the notion of expression variation was concluded to be outside the SO scope. Modeling in GENO will extend and be logically consistent with the SO approach and will leverage links to orthogonal ontologies to place variation in a broader biological context. Additional information about the SO and GENO models and their interaction can be found in the presentation to the Phenotype RCN listed at the end of this report.
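To make the distinction concrete, the two kinds of genetic variation could be sketched as simple data types. This is an illustrative sketch only and not GENO's actual OWL model: the class names, the field names, the `in_so_scope` helper, and the placeholder gene `geneA` are all hypothetical.

```python
from dataclasses import dataclass
from typing import Union


@dataclass(frozen=True)
class SequenceVariation:
    """A heritable change in genomic DNA (point mutation, SNP, transgenic insertion)."""
    gene: str
    change: str  # free-text description of the sequence change
    alters_genome_sequence: bool = True


@dataclass(frozen=True)
class ExpressionVariation:
    """An experimentally induced change in expression level; the genome is unchanged."""
    gene: str
    intervention: str  # e.g. "morpholino", "RNAi", "transient expression"
    alters_genome_sequence: bool = False


GeneticVariation = Union[SequenceVariation, ExpressionVariation]


def in_so_scope(variation: GeneticVariation) -> bool:
    """Hypothetical helper: only sequence variation was judged to be within SO scope."""
    return isinstance(variation, SequenceVariation)


# Hypothetical examples; "geneA" is a placeholder, not a real annotation.
mutation = SequenceVariation(gene="geneA", change="point mutation")
knockdown = ExpressionVariation(gene="geneA", intervention="morpholino")
```

Both kinds of record can be attached to a phenotype in the same way, which is the reason for treating them under one ‘genetic variation’ umbrella while keeping only the first within SO.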
Workshop Goals and Outcomes
One of the immediate goals of our workshop was to find consensus on high-level ontological issues that have yet to be resolved in the development of these and other OBO Foundry ontologies, and to document these decisions for the community. Many such issues have been broadly debated for years, and our outcomes may be relevant for other domains or applications in biomedical research. Much progress was made in resolving key issues, and a plan was established for ongoing collaborative work. Some outcomes are summarized below, and more detailed notes can be found in the Google doc linked at the end of this report.
- Terminological standardization of core terms. Terms such as ‘sequence’, ‘gene’, ‘allele’, ‘variant’, ‘reference’, ‘mutant’, and ‘genetic’ are used variably and ambiguously across communities, and require precise definitions and consistent use. Work is ongoing to craft such definitions, which will be reflected in our respective ontologies as they are refined and vetted.
- The ontological nature of sequences and sequence features (and their place in the BFO/IAO framework). Specific topics included: (1) the merits and implications of modeling sequence features as generically dependent continuants, or more specifically as information content entities, (2) defining identity criteria for sequence features to include their sequence and their position (as opposed to sequence only), (3) how to model attributes of sequence features such as biological activity, experimental provenance, reference status, and zygosity, and (4) the ways in which sequence features are considered to vary with respect to each other (e.g. wild-type vs mutant sequences, reference vs alternate sequences).
- Gene representation, and modeling the central dogma. We debated strategies to provide an OWL-based ontological representation and identifiers for genes and their variants that would serve SO, Monarch, and the broader phenotype community. Related discussions focused on how to build from this gene representation to link to derived sequences at the RNA and protein levels, and to describe properties that emerge in this derivation.
- Variant representation. We established a precise and explicit account of how ‘sequence variation’ should be defined across SO and GENO. In this model, a ‘variant’ is any sequence feature that varies_with some other instance of the same feature. Sequence variants are thus considered to be ‘variant_with’ any other version of that feature, rather than ‘variants_of’ some reference. We will also represent more specific types of the ‘variant_with’ relation that describe the different ways biologists consider sequences to vary with each other, based on the roles that the variants in this relation hold (for example, where one is the reference and the other an alternate version, or one is wild-type and the other mutant). This is a critical facet of relating phenotypes to genotypes.
- Integration of expression-level variation modeling in GENO with sequence-variation modeling in SO. Here, the high level approach for representing expression variation in terms of genetic sequences that are altered in their expression was reviewed and vetted by members of Monarch and SO teams. Several approaches for conceptual integration of the expression and sequence variation models are under consideration.
- Technical approaches for coordinated development. Topics included how to manage parallel construction and coordination of the abstract SO and physical MSO ontologies, where strategies for automated derivation of the SO from the MSO were reviewed. In addition, we discussed how to manage community development of SO and GENO as integrated but separate ontologies, using existing platforms, tools, and standards for software development (Google projects, trackers, listservs, build and QA tools, etc.).
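The identity criteria and variant model discussed in the outcomes above can be sketched in a few lines of Python. This is a hypothetical illustration, not the SO or GENO implementation: the `SequenceFeature` class, its fields, the role labels, the specific relation names, and the example coordinates are all assumptions made for this example. It treats identity as sequence plus position, makes `variant_with` symmetric, and derives a more specific relation from the roles the two features hold.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass(frozen=True)
class SequenceFeature:
    # Identity is sequence plus position: these four fields take part in equality.
    reference_id: str  # hypothetical identifier of the containing sequence
    start: int
    end: int
    sequence: str
    # A role is an attribute of the feature, not part of its identity.
    role: Optional[str] = field(default=None, compare=False)


def variant_with(a: SequenceFeature, b: SequenceFeature) -> bool:
    """Symmetric: two features at the same position whose sequences differ.

    Simplification: only same-coordinate substitutions are handled here.
    """
    same_position = (a.reference_id, a.start, a.end) == (b.reference_id, b.start, b.end)
    return same_position and a.sequence != b.sequence


# Hypothetical names for the role-specific sub-relations of variant_with.
ROLE_PAIRS = {
    frozenset({"reference", "alternate"}): "reference_alternate_variant_with",
    frozenset({"wild-type", "mutant"}): "wild_type_mutant_variant_with",
}


def specific_relation(a: SequenceFeature, b: SequenceFeature) -> Optional[str]:
    """Return the most specific relation that holds, or None if the pair does not vary."""
    if not variant_with(a, b):
        return None
    return ROLE_PAIRS.get(frozenset({a.role, b.role}), "variant_with")


# Hypothetical single-nucleotide example on a made-up reference sequence.
ref = SequenceFeature("chr_example", 100, 100, "A", role="reference")
alt = SequenceFeature("chr_example", 100, 100, "G", role="alternate")
```

Because `variant_with` has no privileged direction, neither feature needs to be designated the reference for the generic relation to hold. The reference/alternate or wild-type/mutant reading is layered on top only when the roles are known.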
As noted above, more details on each of these topics, as well as many others, can be found in the Google doc linked below. Participation of the broader community is encouraged through feedback on this document or participation in ongoing coordination calls (contact firstname.lastname@example.org for info).
- ICBO 2013 conference paper – http://www2.unb.ca/csas/data/ws/icbo2013/papers/ec/icbo2013_submission_60.pdf
- Presentation to the Phenotype RCN, October 2013 – http://www.slideshare.net/mhb120/phenotype-rcn-sogenoworkshopshared
- Google doc summarizing workshop outcomes – https://docs.google.com/document/d/1AUEVX0Sx_iy9mTI6F59Yo7ZCXu4zv5uSk28AHid5zhc/edit#