Integrating the Paleobiology Database (PaleoDB) into our taxonomy workflow

Feb 14, 2012

In the original Phenoscape project, our focus was on asking comparative questions regarding living taxa. Although we added fossil taxa to the Teleost Taxonomy Ontology (TTO) when our publications included them, we had no general need to add fossil taxa to the contemporary groups provided by the Catalog of Fishes. However, in our renewal, the focus has both expanded taxonomically (to all vertebrates) and narrowed to the evolution of fins and limbs. The evolution of limbs from fins occurred over 300 million years ago, meaning the morphological data for this transition exist only in the fossil record. Therefore, including fossil data and taxonomy has become essential.

These fossil taxa are not available in the major online sources of names, whether taxon-specific, such as the Catalog of Fishes, or general, such as the Catalog of Life or the NCBI taxonomy. Although NCBI includes some fossil taxa, a taxon is only included when a related molecular sequence is submitted, which will never be the case for the vast majority of fossil taxa. These latter taxa will only ever be represented by morphological remains.

This need for fossil data, along with the absence of names from recognized sources, requires us to either add names (and hopefully plausible taxonomy) as curators encounter them in papers, or find an alternative source for names of fossil taxa. Although we have added and will continue to add fossil taxa to our taxonomy, we do not intend, and never intended, to become a name or taxonomy authority in our own right. In light of the strengths and weaknesses of the Phenoscape team, allying with a recognized source of fossil taxonomy seems the best option.

The Paleobiology Database, also called PaleoDB or simply PBDB, is an online repository covering a wide range of paleontological data across all taxa represented in the fossil record. These data include names as well as taxonomic opinions appearing in paleontology publications. They can be queried on the PBDB website and are also available for bulk download. As part of developing the Vertebrate Taxonomy Ontology (VTO), an expansion of the TTO to cover all vertebrates and several chordate groups of interest, I have implemented a tool that adds the content of these bulk downloads to a taxonomy ontology. The process of updating from PBDB was designed to minimize disruption to the existing taxonomy: it adds only taxa that are new, along with whatever taxonomic lineage is required to link each new taxon to a taxon already known to the existing taxonomy. This way, updating from PBDB does not disrupt any existing taxonomic hierarchy, whether it was incorporated from other resources or was the result of prior curators’ efforts.
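The update rule described above can be sketched in a few lines of Python. This is only an illustration of the idea, not the actual tool: the function name `merge_new_taxa`, the child-to-parent dictionaries, and the taxon names are all hypothetical, and a real bulk download carries far more information than a parent link.

```python
def merge_new_taxa(existing, source):
    """Add taxa from `source` to a copy of `existing` without changing
    any parent link that `existing` already defines.

    Both arguments map a taxon name to its parent name (None for a root).
    This flat representation is an assumption made for the sketch.
    """
    merged = dict(existing)
    for taxon in source:
        if taxon in merged:
            continue  # already known: keep the existing placement
        # Walk up the source lineage until we reach a known taxon.
        lineage, seen, current = [], set(), taxon
        while current is not None and current not in merged and current not in seen:
            seen.add(current)
            lineage.append(current)
            current = source.get(current)
        if current is None or current not in merged:
            continue  # lineage never connects to the existing taxonomy
        for name in lineage:
            merged[name] = source[name]
    return merged


# Hypothetical example data.
existing = {"Root": None, "Family X": "Root"}
source = {
    "Family X": "Some other parent",  # conflicts with existing; ignored
    "Genus Y": "Family X",
    "Species Y z": "Genus Y",
    "Orphan genus": "Unknown family",
    "Unknown family": None,
}
merged = merge_new_taxa(existing, source)
```

In this sketch a new taxon whose lineage never reaches a known taxon is simply skipped; whether such taxa should instead be reported to a curator is a design choice the sketch leaves open.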

However, no taxonomic resource is ever complete. As our team of curators annotates publications, they are encountering fossil taxa unknown to PBDB, and have begun contributing the publication and taxonomy information back to the PBDB. John Alroy and the PBDB board have accepted several project members as authorizers and enterers of data into the PBDB. This allows us to give back to the PBDB as well as to simplify the process of adding fossil taxa to our vertebrate taxonomy. We have developed a workflow where a curator can enter publications, names, and taxonomic opinions directly into the PBDB. This immediately makes our additions visible to a wider community and gives us the opportunity to engage expertise we may not have known existed. Subsequent PBDB bulk downloads will include these new names and reflect any changes to the taxonomic opinions entered during curation. These will then be added to the next update of the VTO.

Filed under: Data Curation, Taxonomy Ontology, Uncategorized Tagged: fossils, paleobiology database, taxonomy

Matching Phenotypes

Dec 17, 2010

An important goal for the Phenoscape project is to be able to suggest candidate genes that may have contributed to evolutionary change.  The way that we have proposed to do this is to search for changes in phenotype that appear as the result of mutations in model organisms and also appear as phenotype changes on an evolutionary tree.  There are several challenges in designing this search, apart from simply recognizing similar phenotypes, that we have been working on during the past few months.

The first issue is that we are interested in changes in phenotype, not simply matching phenotypes. For phenotypes associated with mutants of model organisms, it is understood that they vary with respect to the wild type. For taxa, however, this means looking for taxonomic nodes where variation in a phenotype is observed among the children of the node. For example, there are nine species within the genus Aspidoras with annotations for the shape of the opercle bone. Of these, eight exhibit opercle bones with round shape, but the ninth (A. pauciradiatus) is annotated with a triangular opercle. In contrast, all three annotated species of the related Hoplosternum are annotated with a triangular opercle. Thus there is detectable variation in opercle shape within the children of Aspidoras, but not within Hoplosternum, suggesting that change in opercle shape has occurred somewhere among the descendants of Aspidoras. For our analysis, identifying variation among descendants is important.
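As a rough illustration of the Aspidoras example, the following Python sketch flags a parent taxon whenever its annotated children show more than one state for the same character. The function name, the tuple layout, and the "sp. N" species labels are placeholders invented for the sketch (only A. pauciradiatus is named above), and it considers direct children only rather than all descendants.

```python
from collections import defaultdict


def varying_nodes(annotations):
    """Return the (parent, character) pairs whose children disagree.

    `annotations` is an iterable of (parent, child, character, state)
    tuples; this layout is an assumption made for the sketch.
    """
    states = defaultdict(set)
    for parent, _child, character, state in annotations:
        states[(parent, character)].add(state)
    return {key for key, observed in states.items() if len(observed) > 1}


# Eight Aspidoras species with a round opercle (placeholder names),
# plus A. pauciradiatus with a triangular one.
annotations = [
    ("Aspidoras", f"Aspidoras sp. {i}", "opercle shape", "round")
    for i in range(1, 9)
]
annotations.append(
    ("Aspidoras", "Aspidoras pauciradiatus", "opercle shape", "triangular")
)
# Three Hoplosternum species, all triangular (placeholder names).
annotations += [
    ("Hoplosternum", f"Hoplosternum sp. {i}", "opercle shape", "triangular")
    for i in range(1, 4)
]

varying = varying_nodes(annotations)
# varying == {("Aspidoras", "opercle shape")}
```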

Thus, our search for shared variation in phenotypes focuses on matching phenotypes associated with genes against phenotypes of taxa showing variation. However, we are looking for matches at a larger scale than single phenotypes; we are looking for matches across the set of phenotypes affected by a gene or the set of features that have changed among the descendants of a taxonomic node. We refer to these sets of phenotypes as the ‘phenotypic profile’ of a gene or taxon, following a seminal paper by Washington et al. 2009. Washington et al. propose four metrics (three based on ‘information content’) to score matches between the sets of phenotypes in a pair of profiles.
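For readers unfamiliar with the term, information content is commonly defined as the negative logarithm of how frequently a term is used in annotations, so that rare phenotypes count for more than common ones. The Python sketch below uses that common definition with made-up counts and term labels; it ignores the ontology hierarchy and is not a reproduction of any of the metrics in Washington et al.

```python
import math


def information_content(term, annotation_counts, total):
    """IC = -log2(frequency of the term among all annotated items)."""
    return -math.log2(annotation_counts[term] / total)


def shared_information(profile_a, profile_b, annotation_counts, total):
    """Sum the IC of the phenotypes that two profiles have in common.

    A deliberately naive score: it only counts exact matches between
    profiles, which is an assumption of this sketch.
    """
    shared = set(profile_a) & set(profile_b)
    return sum(information_content(t, annotation_counts, total) for t in shared)


# Hypothetical counts: how many of 16 annotated items carry each phenotype.
counts = {"opercle: triangular": 2, "fin: absent": 8}
gene_profile = {"opercle: triangular", "fin: absent"}
taxon_profile = {"opercle: triangular"}

rare = information_content("opercle: triangular", counts, 16)
common = information_content("fin: absent", counts, 16)
score = shared_information(gene_profile, taxon_profile, counts, 16)
```

With these counts the rarer phenotype carries more information than the common one, so a match on it contributes more to the profile score.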

In the course of developing the search, we have encountered several important differences in curation approach between ZFIN and Phenoscape. In some cases there are different uses of PATO to model the same phenotype, for example the absence of an entity. In other cases ZFIN uses a quality ‘abnormal’ that applies to mutants, but not in a taxonomic, comparative sense, which means these phenotypes will be inaccessible to us. Thus, implementing this search is helping us to better understand our data, our choices in modeling the data, and how it interoperates with other ontology-based data. Such reflection would have been difficult or impossible without the use of ontologies to represent the phenotypes.

Filed under: Informatics, Knowledge Base, Ontology, Science

What’s new in TTO

Jul 19, 2010

In past months, the TTO (Teleost Taxonomy Ontology) has undergone some changes that will, we hope, make it more useful by connecting it with other taxonomic resources. Here, I will discuss three changes that have been added since last January, but check back, as more (and important) connections will be coming soon.

When the TTO was first built, we followed the pattern of the NCBI taxonomic ontology that was generated from the NCBI taxonomy database. One design feature of this ontology was the inclusion of terms for taxonomic ranks (e.g., family, genus, etc.) as a separate ‘tree’ of terms within the same ontology. The ontology file contained two root nodes, one for taxon terms, the other for taxonomic rank terms. We had long felt that ranks should exist in a separate ontology (more correctly a vocabulary) that could be shared across ontologies for different taxonomic groups. After several rounds of discussion on the obo-discuss list, we were invited in January to add the taxonomic rank vocabulary to the OBO library of ontologies of interest.

This acceptance allowed us both to register the rank vocabulary and to finally strip out the tree of rank terms from TTO and replace the internal rank tags with ‘has_rank’ links to terms in the (external) rank vocabulary. However, the new rank vocabulary is more than just the set of ranks that we used in tagging taxa in TTO. The rank vocabulary incorporates rank terms from two additional sources: first, the rank terms that appear in the NCBI taxonomy itself, and second, terms from a rank vocabulary developed for TDWG. We hope that other taxonomic ontologies will be able to make use of this vocabulary.

More recently, we have gone back to the NCBI taxonomy and added cross references between our terms and lexically identical names in NCBI. As TTO’s names are mostly drawn from the Catalog of Fishes, the exact relation between TTO terms and NCBI names is not, in some cases, clear, which led to the decision to leave the relationship at the level of a cross reference.
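Matching on lexically identical names amounts to an exact string lookup. The Python sketch below shows the idea under stated assumptions: the function name, the dictionary layouts, and every identifier and name in the example are invented for illustration and are not real TTO or NCBI entries.

```python
def lexical_xrefs(our_terms, other_names):
    """Pair each of our term ids with the id of an identically spelled name.

    `our_terms` maps term id -> name and `other_names` maps name -> id in
    the other resource. Matching is exact and case-sensitive, so it says
    nothing about whether the two names denote the same taxon concept.
    """
    return {
        term_id: other_names[name]
        for term_id, name in our_terms.items()
        if name in other_names
    }


# Hypothetical identifiers and names.
our_terms = {"EX:0001": "Genus alpha", "EX:0002": "Genus beta"}
other_names = {"Genus alpha": "OTHER:42", "Genus gamma": "OTHER:43"}

xrefs = lexical_xrefs(our_terms, other_names)
# xrefs == {"EX:0001": "OTHER:42"}
```

Because identical spelling does not guarantee an identical taxon concept, the result is best treated as a cross reference rather than a statement of equivalence, which is the choice described above.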

In the same release (156), common names contributed by FishBase were added as synonyms. As of now, approximately 16,000 taxa have common names with cross references back to their source in FishBase. We hope to be able to add more common names and eventually attach appropriate language tags to these names.

I’ve already started work on our next integration target, but I’ll save that for a later post.

Filed under: Taxonomy Ontology