Tree of life and data integration challenges at the first FuturePhy workshop

Apr 6, 2016

What are the challenges in building, visualizing and using the Tree of Life? How can we best utilize and build on existing phylogenetic knowledge and look ahead to address the challenges of data integration? Recently, fellow Phenoscaper Jim Balhoff and I attended the first FuturePhy workshop in Gainesville, Florida (February 20-22, 2016). The workshop brought together three taxonomically defined working groups (catfish, beetles, barnacles) to build megatrees from existing phylogenetic studies, and to identify and begin applying diverse data layers for their respective groups. Open Tree and Arbor personnel were on hand to discuss and help solve issues in data integration.

The catfish team (John Lundberg, Mariangeles Arce, Jim Balhoff, Brian Sidlauskas, Ricardo Betancur, Laura Jackson, Kole Kubicek, Kyle Luckenbill, and myself, Wasila Dahdul) included participants with expertise in catfish anatomy, phylogenetics (molecular and morphological), development, bioinformatics, and digital imaging. We were motivated to build on the work of the All Catfish Species Inventory to achieve a more complete understanding of catfish diversification by integrating published phylogenies, 2D and 3D images in various online repositories, and thousands of computable phenotypes for catfishes in Phenoscape.


We held several hands-on sessions on tree grafting (using Mesquite, R, and Arbor), data annotation (using Phenex), and tree submission to Open Tree.  We also examined an automatically generated supermatrix for 18 published catfish matrices in the Phenoscape KB (generated using the OntoTrace tool), and prototype data visualizations for supermatrices developed by Curt Lisle in Arbor. We used Mesquite to manually create a draft megatree, and in parallel, uploaded trees to Open Tree, which automatically synthesized a megatree for catfishes. Our plan is to compare the output of manual tree-building in Mesquite with the automated tree from Open Tree.

Among the issues and priorities that emerged during the workshop were the need to include the authoritative Catalog of Fishes taxonomy in Open Tree and the need to allow the addition of unnamed or uncertainly identified taxa, which are commonly used in matrices. We also discussed challenges in automated character consolidation across multiple studies, and the reuse of images across multiple online archives.

We left with a plan to continue tree building and data layer integration post-workshop, with the aim of publishing the catfish megatree (including the methods and remaining challenges) and the integration of data layers via interactions between Arbor, Open Tree, and Phenoscape.

Filed under: Phenex, Phylogenetics, Teleosts, Workshops

Report from Tucson: from characters to annotations with text mining

Mar 30, 2013

There is a wealth of phenotypic information in the evolutionary literature that comes in the form of semi-structured character state descriptions. To get that information into computable form is, right now, an awfully slow process. In Phenoscape I, we estimated that it took about five person-years in total to curate semantic phenotype annotations from 47 papers. If we are to get computable evolutionary phenotypes from a larger slice of the literature, we really need to figure out ways to speed this up.

One promising approach is to use text-mining.  This could contribute in a few different ways.  First, one could efficiently identify all the terms in the text that are not currently represented in ontologies and add them en masse, so that data curation does not have to stop and resume whenever such terms are encountered. Second, one could present a human curator with suggestions for what terms to use and what relations those terms have to one another, speeding the process of composing an annotation.
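
To make the first idea concrete, here is a minimal sketch of how one might list the words in a set of character descriptions that have no matching ontology label. It is not how CharaParser works: the tokenizer is deliberately naive, the function name `find_unmatched_terms` is hypothetical, and the sample descriptions and single-word labels are made up for illustration (real ontology labels are often multi-word).

```python
import re
from collections import Counter


def find_unmatched_terms(descriptions, ontology_labels, stopwords=frozenset()):
    """Count words in character descriptions that match no ontology label."""
    known = {label.lower() for label in ontology_labels}
    counts = Counter()
    for text in descriptions:
        # Naive tokenization: lowercase alphabetic runs only (an assumption).
        for token in re.findall(r"[a-z]+", text.lower()):
            if token not in known and token not in stopwords:
                counts[token] += 1
    return counts


# Made-up character descriptions and a made-up, single-word label list.
descriptions = [
    "Dorsal fin spine serrated",
    "Dorsal fin spine smooth",
    "Maxillary barbel elongate",
]
ontology_labels = ["dorsal", "fin", "spine", "serrated", "smooth"]

missing = find_unmatched_terms(descriptions, ontology_labels)
for term, count in missing.most_common():
    print(term, count)
```

A list like this could be reviewed and added to the ontologies in one batch before curation starts, rather than term by term as curators run into gaps.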

CharaParser, developed by Hong Cui at the University of Arizona, is an expert-based system that decomposes character descriptions into recognizable grammatical components, and it is now being used in several different biodiversity informatics projects. Baseline evaluation results from BioCreative III showed that a naive workflow combining CharaParser and Phenex, the software curators use to compose ontological annotations and relate them to character states, was capable of identifying candidate entity and quality phrases (it outperformed biocurators by 20% in recall on average) but had difficulty translating those into ontological annotations. This first-iteration workflow also did not yet reduce curation time.

In March, a small contingent from NESCent (Jim Balhoff, Hilmar Lapp and Todd Vision) visited Hong Cui’s group in Tucson. We talked through improvements to CharaParser and the curation workflow, brainstormed plans for a more thorough set of evaluation tests, began refactoring of the code so that it can be more easily shared across projects, and gained a better understanding of what features make a character difficult to curate for humans vs. text-mining.  We made substantial progress on all fronts, and are looking forward to seeing how much improvement in the accuracy and efficiency of curation will be achieved in the next round of testing.

We are also pleased to report that the CharaParser codebase will now be available from GitHub under an open source (MIT) license.

Filed under: Data Curation, NLP, Ontology, Phenex, Software

Phenex 1.4.2 released

Aug 16, 2012

A new bugfix release of Phenex is available. Phenex 1.4.2 addresses the following issues:


Filed under: Data Curation, Phenex
Posted on August 16, 2012 at 9:20 pm

Collaborative editing in Phenex 1.2

Feb 13, 2012

We have recently released version 1.2.1 of our Phenex annotation software. This release adds some functionality for easier collaborative editing of data files. While our curators have used Subversion revision control software in the past, the new features make it more reliable to share Phenex data files with user-friendly file synchronization software such as Dropbox. While a NeXML document is open in Phenex, the application monitors for changes to the document file in the background. If the file is being shared via Dropbox and is simultaneously edited by someone else, Phenex will alert the user that the file has changed and offer to load the new version. If there are no unsaved edits then Phenex will reload the file automatically. Phenex 1.2 also provides an autosave feature which saves the document after every edit—this reduces the chance that the file might be edited elsewhere while one has unsaved changes, avoiding complicated file merges.
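
The reload behavior described above can be sketched roughly as follows. This is an illustration only, not Phenex's actual code: it assumes changes are detected by comparing file modification times, and the class name `DocumentWatcher` and the outcome strings are hypothetical.

```python
import os
import tempfile


class DocumentWatcher:
    """Track one open document and decide what to do when it changes on disk."""

    def __init__(self, path):
        self.path = path
        self.has_unsaved_edits = False
        self.last_mtime = os.path.getmtime(path)

    def check(self):
        """Return 'unchanged', 'reload', or 'ask_user' (hypothetical names)."""
        mtime = os.path.getmtime(self.path)
        if mtime == self.last_mtime:
            return "unchanged"
        self.last_mtime = mtime
        # Unsaved local edits: alert the user instead of silently reloading.
        return "ask_user" if self.has_unsaved_edits else "reload"


# Demonstration with a temporary file standing in for a shared NeXML document.
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "matrix.xml")
    with open(path, "w") as handle:
        handle.write("<nexml/>")

    watcher = DocumentWatcher(path)
    first = watcher.check()  # nothing has changed yet

    # Simulate an edit arriving from another machine by bumping the mtime.
    os.utime(path, (0, watcher.last_mtime + 10))
    second = watcher.check()  # no local edits, so reload automatically

    watcher.has_unsaved_edits = True
    os.utime(path, (0, watcher.last_mtime + 10))
    third = watcher.check()  # local edits pending, so ask the user
```

Saving after every edit keeps `has_unsaved_edits` false almost all of the time, which is why autosave makes the automatic reload path the common case.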

Filed under: Curation Tools, Phenex
Posted on February 13, 2012 at 5:35 pm

Phenoscape and colleagues meet with PATO on ontology and phenotype representation issues, Sept. 25-27, 2010

Nov 13, 2010

At the end of September, members of Phenoscape (Mabee, Balhoff), the Hymenoptera Anatomy Ontology (HAO) project (Yoder, Deans, Seltmann) and TAIR (Huala) met with developers of the Phenotype and Trait Ontology (PATO) (Gkoutos, Mungall, Westerfield, Lewis) at the University of Oregon. Our discussions focused on finding solutions to problems that have arisen from the structure of the PATO ontology, and to problems in representing phenotypes in the EQ model that have arisen in the course of annotating comparative phenotype data from the fish and hymenoptera literature. We prepared for this meeting by developing a list of common issues and, importantly, specific examples, in a Google doc shared among participants. We all co-edited this document during the meeting with notes, decisions and examples, and we ‘published’ this Google doc for you all to see. A number of important changes to the PATO hierarchy were proposed and subsequently made. We also clarified best practices for modelling some common but tricky phenotypic features. One additional outcome was the participants’ strong recommendation that a ‘shape jamboree’ be held to improve the usability of the shape branch of the PATO ontology.

Some proposed changes to PATO:

  • Consolidate relational and monadic branches: One of the most significant decisions was to remove the distinction between the “relational” and “monadic” branches of PATO. The relational terms can be descendants of the monadic terms which are pre-composed with a reference to a dependent entity. This change would be a major improvement, helping to relate terms that deal with similar concepts and supporting inference using these concepts.
  • Continuums: Add a relationship ‘ranges_from’ which can be used to specify a continuum of values between two indicated qualities.  Example: a color grading “from yellow to brown”.
  • Position: Remove term ‘position’ after moving its children to more appropriate places.
  • Enable directional references: Add classes describing directions a structure can be pointed.  These classes should include pre-composed logical definitions drawing on the spatial ontology. An example would be ‘directed posteriorly’.
  • Spatial term review: Review all existing PATO terms referencing spatial aspects, and verify that they are based on logical definitions using the spatial ontology.
  • Size vs. shape: Several children of ‘size’ were noted to actually be types of ‘shape’. Many of these were noted and several were immediately revised by George Gkoutos.  This discussion revealed that many free-text characters defined by biologists refer to size-sounding terms when they are actually describing changes in shape.
  • Changes in various term definitions:
    • PATO:1485 should be obsoleted and two new separate terms, ‘condensed’ and ‘compressed’ (as a synonym of ‘flattened’) should be added. ‘Condensed’ is considered to be a type of ‘structure’, while ‘flattened’ is a ‘shape’ instead of a ‘curvature’. All children of ‘flattened’ should be reviewed to remove references to ‘curvature’.
    • Improved definition for ‘morphology’ referencing “shape or size or structure”.
    • Clarified terms ‘surface feature shape’ and ‘texture’. The parent term ‘surface shape’ was obsoleted while ‘surface feature shape’ is retained as a shape with a repeated feature on a surface. A ‘has_repeated_part’ relation should be added to be used in pre-composed subclasses of ‘surface feature shape’, linking them to specific shapes.
    • Add a term defining ‘spatial density’. The existing ‘density’ term refers to the physics concept.
  • Spatial pattern: We proposed reworking the ‘spatial pattern’ term hierarchy so that each pattern logically references the other terms it is defined with respect to (structures, colors, etc.). The ‘color pattern’ term should be moved under ‘spatial pattern’.
  • Synonyms: We discussed some community-specific term labels for various PATO classes.  There is an existing OBO standard for how to implement these as synonyms. Community-specific applications would need to support display of the appropriate synonym.
  • Comparative relations: Generalize existing comparative relations in PATO. For example, instead of ‘increased_in_magnitude_relative_to’, it will be ‘increased_in_value_relative_to’.
  • Qualitative branch: Clean up and better document ‘qualitative’ hierarchy (which is used for various logical “shortcuts”).

EQ representation issues:

  • Size bins: Relative size characters (small, medium, large) can be represented by creating (within a given annotation application) anonymous subclasses of size which are related to each other in the appropriate way using relations such as ‘increased_in_magnitude_relative_to’.  This will provide the appropriate relative size inferencing for the given character states, but, as in the original paper, not be readily comparable to size classes created for characters in other studies.
  • Negation: When describing a phenotype that is simply “not something else”, e.g. ‘not round’, the complement_of operator should be used in an OWL class expression.
  • Comparative/relative qualities should not be conflated with relational (or, better, dependent) qualities. Comparative relations such as ‘increased_in_value_relative_to’ can be used to relate one EQ to another.
  • Phenex and other annotation tools should provide enhanced interfaces for these special representation issues: creating comparative EQs by simply entering a relative entity or taxon, a simple means to say things like ‘not round’, and a way to create local groups of relatively ordered qualities for a given character (e.g. small, medium, large).
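
As a rough illustration of the size bins idea, the sketch below creates local, ordered size classes for a character and checks their relative order. It is a toy model in Python rather than OWL: the function names and the character identifiers (`char12`, `char40`) are hypothetical, and treating the relation as transitive is an assumption made here so that "large" can be inferred to exceed "small".

```python
def size_bins(character_id, labels):
    """Return (larger, smaller) pairs for bins listed from smallest to largest.

    Each pair stands for one 'increased_in_magnitude_relative_to' assertion.
    Bin names are prefixed with the character id so they stay local to it.
    """
    names = [f"{character_id}:{label}" for label in labels]
    return {(larger, smaller) for smaller, larger in zip(names, names[1:])}


def is_increased_relative_to(relations, a, b):
    """True if `a` is larger than `b`, following the pairs transitively."""
    frontier, seen = [a], set()
    while frontier:
        current = frontier.pop()
        for larger, smaller in relations:
            if larger == current and smaller not in seen:
                if smaller == b:
                    return True
                seen.add(smaller)
                frontier.append(smaller)
    return False


# Two hypothetical characters, each with its own anonymous size classes.
relations = size_bins("char12", ["small", "medium", "large"])
relations |= size_bins("char40", ["small", "large"])

large_vs_small = is_increased_relative_to(relations, "char12:large", "char12:small")
small_vs_large = is_increased_relative_to(relations, "char12:small", "char12:large")
cross_study = is_increased_relative_to(relations, "char12:large", "char40:small")
```

Within one character the ordering can be inferred, but nothing connects the bins of `char12` to those of `char40`, which mirrors the point that such size classes are not readily comparable across studies.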

Filed under: Ontology, Phenex

Phenex 1.0.3 released

Feb 23, 2010

Phenex 1.0.3 is now available.  This release fixes a serious bug which caused Phenex to append modified phenotype annotations within files, instead of replacing the previous data. Phenex will now read and write NeXML files correctly. It should also automatically recover the latest data from files saved with older versions of Phenex.

All Phenex users should replace their current copy of Phenex with the latest release. It can be downloaded from the Phenex homepage on the Phenoscape wiki.

Filed under: Curation Tools, Informatics, Phenex, Software
Posted on February 23, 2010 at 9:14 pm