Tree of life and data integration challenges at the first FuturePhy workshop

Apr 06, 2016

What are the challenges in building, visualizing and using the Tree of Life? How can we best utilize and build on existing phylogenetic knowledge and look ahead to address the challenges of data integration? Recently, fellow Phenoscaper Jim Balhoff and I attended the first FuturePhy workshop in Gainesville, Florida (February 20-22, 2016). The workshop brought together three taxonomically defined working groups (catfish, beetles, barnacles) to build megatrees from existing phylogenetic studies, and identify and begin applying diverse data layers for their respective groups. Open Tree and Arbor personnel were on hand to discuss and help solve issues in data integration.

The catfish team (John Lundberg, Mariangeles Arce, Jim Balhoff, Brian Sidlauskas, Ricardo Betancur, Laura Jackson, Kole Kubicek, Kyle Luckenbill, and myself, Wasila Dahdul) included participants with expertise in catfish anatomy, phylogenetics (molecular and morphological), development, bioinformatics, and digital imaging. We were motivated to build on the work of the All Catfish Species Inventory to achieve a more complete understanding of catfish diversification by integrating published phylogenies, 2D and 3D images in various online repositories, and thousands of computable phenotypes for catfishes in Phenoscape.


We held several hands-on sessions on tree grafting (using Mesquite, R, and Arbor), data annotation (using Phenex), and tree submission to Open Tree.  We also examined an automatically generated supermatrix for 18 published catfish matrices in the Phenoscape KB (generated using the OntoTrace tool), and prototype data visualizations for supermatrices developed by Curt Lisle in Arbor. We used Mesquite to manually create a draft megatree, and in parallel, uploaded trees to Open Tree, which automatically synthesized a megatree for catfishes. Our plan is to compare the output of manual tree-building in Mesquite with the automated tree from Open Tree.

Among the issues and priorities that emerged during the workshop was the need for inclusion of the authoritative Catalog of Fishes taxonomy in Open Tree, and allowing the addition of unnamed or uncertainly identified taxa commonly used in matrices. We also discussed challenges in automated character consolidation across multiple studies, and the reuse of images across multiple online archives.

We left with a plan to continue tree building and data layer integration post-workshop, with the aim of publishing the catfish megatree (including the methods and remaining challenges) and the integration of data layers via interactions between Arbor, Open Tree, and Phenoscape.

Filed under: Phenex, Phylogenetics, Teleosts, Workshops

Ontology-based text markup tools

Jan 14, 2016

Efficiently extracting knowledge from the published literature is a challenge faced by many database projects in biology, and many of us are interested in tools that can assist and speed up the task of identifying concepts in free text. I’ve recently used two text markup tools that are helpful in keeping up with the literature and rapidly developing ontologies. As a participant in the Fifth BioCreative Challenge, in which biocurators test and evaluate text mining systems, I evaluated the EXTRACT bookmarklet tool. EXTRACT was developed for metagenomics data and provides full-page tagging of mapped terms from environment, disease, taxonomy, and tissue ontologies, and can also mark up shorter selections of text on an HTML page. The tool is immediately useful, particularly during the first stages of the curation process, as a curator is surveying the literature for relevant articles.

Annotating long, descriptive text has also been a challenge for Phenoscape. To assist curators in this task, we recently added a text annotator tool to the Phenoscape Knowledgebase that tags selected text passages copied in from a source with matched terms from anatomy (Uberon), taxon (VTO), and quality (PATO) ontologies. Viewing the annotated results, with color-coded text, has aided curators in the process of applying large, complex ontologies to equally complex text.
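To give a feel for what this kind of tagging involves, here is a minimal sketch of dictionary-based markup. It is not the code behind the Knowledgebase annotator or EXTRACT; the lexicon is a tiny hand-made stand-in for real ontology term lists, the function names are made up for illustration, and bracketed labels stand in for color coding.

```python
import re

# Toy lexicon (hypothetical): term label -> ontology it would be drawn from.
LEXICON = {
    "dorsal fin": "Uberon",
    "fin": "Uberon",
    "pectoral spine": "Uberon",
    "elongated": "PATO",
    "serrated": "PATO",
    "Ictalurus punctatus": "VTO",
}


def tag_text(text, lexicon=LEXICON):
    """Return (start, end, matched_text, ontology) for each lexicon hit."""
    # Try longer labels first so "dorsal fin" is preferred over "fin".
    labels = sorted(lexicon, key=len, reverse=True)
    pattern = re.compile(
        r"\b(" + "|".join(re.escape(label) for label in labels) + r")\b",
        re.IGNORECASE,
    )
    lookup = {label.lower(): onto for label, onto in lexicon.items()}
    return [
        (m.start(), m.end(), m.group(0), lookup[m.group(0).lower()])
        for m in pattern.finditer(text)
    ]


def render(text, lexicon=LEXICON):
    """Mark each hit as [match|Ontology], a plain-text stand-in for color."""
    out, pos = [], 0
    for start, end, matched, onto in tag_text(text, lexicon):
        out.append(text[pos:start])
        out.append(f"[{matched}|{onto}]")
        pos = end
    out.append(text[pos:])
    return "".join(out)


print(render("Dorsal fin elongated in Ictalurus punctatus."))
```

Exact string matching like this misses synonyms, plurals, and inflected forms, which is a large part of why real annotators lean on the synonym lists carried by the ontologies themselves.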

Filed under: Uncategorized

Half-duck, half-crocodile, and bigger than T. Rex: a giant semiaquatic predatory dinosaur

Sep 26, 2014

A team led by University of Chicago Phenoscapers Nizar Ibrahim and Paul Sereno has published new findings about the remarkable semiaquatic predatory dinosaur Spinosaurus aegyptiacus in the latest issue of Science. It has been receiving some nice coverage at NPR and other news outlets.

Workers at the National Geographic Museum in Washington grind the rough edges off a life-size replica of a spinosaurus skeleton.  Credit: Mike Hettwer/National Geographic.

From the abstract:

We describe adaptations for a semiaquatic lifestyle in the dinosaur Spinosaurus aegyptiacus. These adaptations include retraction of the fleshy nostrils to a position near the mid-region of the skull and an elongate neck and trunk that shift the center of body mass anterior to the knee joint. Unlike terrestrial theropods, the pelvic girdle is downsized, the hindlimbs are short, and all of the limb bones are solid without an open medullary cavity, for buoyancy control in water. The short, robust femur with hypertrophied flexor attachment and the low, flat-bottomed pedal claws are consistent with aquatic foot-propelled locomotion. Surface striations and bone microstructure suggest that the dorsal “sail” may have been enveloped in skin that functioned primarily for display on land and in water.

Citation: Ibrahim N, Sereno PC, Dal Sasso C, Maganuco S, Fabbri M, Martill DM, Zouhri S, Myhrvold N, Iurino DA (2014) Semiaquatic adaptations in a giant predatory dinosaur. Science.

Filed under: Uncategorized

Phenoscape poster at Evolution 2014

Aug 27, 2014

I attended the Evolution 2014 meeting a few months ago in Raleigh, NC, and presented a poster on Phenoscape’s curation effort: “Moving the mountain: How to transform comparative anatomy into computable anatomy?”, with coauthors A. Dececchi, N. Ibrahim, H. Lapp, and P. Mabee. In this work, we assessed the efficiency of our workflow for the curation of evolutionary phenotypes from the matrix-based phylogenetic literature. We identified the bottlenecks and areas for improvement in data preparation, phenotype annotation, and ontology development. Gains in efficiency, such as through improved community data practices and development of text-mining tools, are critical if we are to translate evolutionary phenotypes from an ever-growing literature. The poster was well received and several researchers at the meeting were interested in learning more about open source tools for phenotype annotation.

Filed under: Conferences, Curation Tools, Data Curation

The Vertebrate Taxonomy Ontology

Jan 25, 2014

Our paper describing the Vertebrate Taxonomy Ontology (VTO) is published!

One primary objective for Phenoscape and similar projects is to aggregate phenotypic data from multiple studies to named taxa, which in many phylogenetic studies are species but also might be at higher taxonomic levels such as genera or families. While there are many widely used taxonomies that include rich sampling of species and higher taxa, such as Bill Eschmeyer’s Catalog of Fishes, there are few vetted “bridging” taxonomies that allow for aggregating data across, say, fishes, amphibians, and mammals. This problem becomes even more acute when you consider integrating data for extinct taxa as well. As a first step towards addressing this issue for vertebrates, we created the Vertebrate Taxonomy Ontology (VTO) that brings together taxonomies from NCBI, AmphibiaWeb, the Catalog of Fishes (via the previously existing Teleost Taxonomy Ontology), and the Paleobiology Database. The resulting curated taxonomy contains more than 106,000 terms, more than 104,000 additional synonyms, and extensive cross-referencing to these existing taxonomies. The Phenoscape Knowledgebase will leverage this taxonomic ontology by allowing for phenotype statistics to be displayed by taxon, including coarse measures of the extent of annotation coverage and phenotypic variation. Though phenotypes may be annotated to a species, the use of an ontological framework for the taxonomic hierarchy facilitates aggregating phenotypes to higher levels, such as genera or families. In the future, we hope to be able to integrate other excellent and rich sources of taxon-specific taxonomies, such as that in the Reptile Database or the International Ornithologists’ Union Bird List. This is a work in progress, and the Phenoscape team is certainly interested in integrating new taxonomic sources as well as exploring different ways that such a resource can be used and developed by the larger community.
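The idea of rolling species-level annotations up to genera and families can be sketched in a few lines. This is only an illustration under stated assumptions: the parent links below are a hand-made fragment rather than the VTO itself, the phenotype strings are hypothetical placeholders rather than Knowledgebase annotations, and the function names are invented for the example.

```python
from collections import defaultdict

# Toy child -> parent links (a hand-made fragment, not the VTO itself).
PARENT = {
    "Ictalurus punctatus": "Ictalurus",
    "Ameiurus nebulosus": "Ameiurus",
    "Ictalurus": "Ictaluridae",
    "Ameiurus": "Ictaluridae",
    "Ictaluridae": "Siluriformes",
}

# Hypothetical species-level phenotype annotations.
ANNOTATIONS = {
    "Ictalurus punctatus": {"caudal fin: forked", "adipose fin: present"},
    "Ameiurus nebulosus": {"caudal fin: truncate", "adipose fin: present"},
}


def lineage(taxon, parent=PARENT):
    """Yield the taxon and then each of its ancestors, child to parent."""
    while taxon is not None:
        yield taxon
        taxon = parent.get(taxon)


def aggregate(annotations=ANNOTATIONS, parent=PARENT):
    """Map each taxon to the phenotypes annotated to it or any descendant."""
    by_taxon = defaultdict(set)
    for species, phenotypes in annotations.items():
        for taxon in lineage(species, parent):
            by_taxon[taxon] |= phenotypes
    return dict(by_taxon)


for taxon, phenotypes in aggregate().items():
    print(taxon, len(phenotypes))
```

Counting the distinct phenotypes per taxon in this way gives a coarse measure of annotation coverage and variation of the sort described above, although a real ontology-backed system would obtain the hierarchy from the taxonomy ontology rather than a hard-coded table.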


Filed under: Taxonomy Ontology, Vertebrates

Phenoscape makes a splash at SVP, TDWG

Jan 10, 2014

In an effort to expand the user community and to demonstrate what is possible using our infrastructure, members of the Phenoscape team gave multiple presentations across two continents on our recent developments. In late October, Paula Mabee gave an invited presentation on mapping phenotypes across phylogenies at the Muséum national d’Histoire naturelle in Paris. This was followed by presentations at the 73rd annual meeting of the Society of Vertebrate Paleontology (SVP) in Los Angeles and the 2013 meeting of the Taxonomic Databases Working Group (TDWG) in Florence, Italy. Phenoscape had a significant presence at SVP, with both a poster presented by Alex Dececchi demonstrating our progress in generating supermatrices from our annotations and a talk given by collaborator Karen Sears, using EQ supermatrices from Phenoscape fin/limb data to examine integration patterns across the fin to limb transition. Karen’s talk marks the first of the collaborations coming out of our 2013 San Francisco workshop. It also showed how data from Phenoscape can drive independent projects and is easily integrated with existing phylogenetic and statistical tools such as Mesquite and various R modules. The talks and poster were well received, with numerous researchers inquiring about how they could incorporate Phenoscape or use ontology-based annotations.

Filed under: Conferences

Report from Tucson: from characters to annotations with text mining

Mar 30, 2013

There is a wealth of phenotypic information in the evolutionary literature that comes in the form of semi-structured character state descriptions. To get that information into computable form is, right now, an awfully slow process. In Phenoscape I, we estimated that it took about five person-years in total to curate semantic phenotype annotations from 47 papers. If we are to get computable evolutionary phenotypes from a larger slice of the literature, we really need to figure out ways to speed this up.

One promising approach is to use text-mining.  This could contribute in a few different ways.  First, one could efficiently identify all the terms in the text that are not currently represented in ontologies and add them en masse, so that data curation does not have to stop and resume whenever such terms are encountered. Second, one could present a human curator with suggestions for what terms to use and what relations those terms have to one another, speeding the process of composing an annotation.

CharaParser, developed by Hong Cui at the University of Arizona, is an expert-based system that decomposes character descriptions into recognizable grammatical components, and it is now being used in several different biodiversity informatics projects. Baseline evaluation results from BioCreative III showed that a naive workflow combining CharaParser and Phenex, the software curators use to compose ontological annotations and relate them to character states, was capable of identifying candidate entity and quality phrases (it outperformed biocurators by 20% in recall on average) but had difficulty translating those into ontological annotations.  This first iteration workflow also was not yet reducing curation time.

In March, a small contingent from NESCent (Jim Balhoff, Hilmar Lapp and Todd Vision) visited Hong Cui’s group in Tucson. We talked through improvements to CharaParser and the curation workflow, brainstormed plans for a more thorough set of evaluation tests, began refactoring of the code so that it can be more easily shared across projects, and gained a better understanding of what features make a character difficult to curate for humans vs. text-mining.  We made substantial progress on all fronts, and are looking forward to seeing how much improvement in the accuracy and efficiency of curation will be achieved in the next round of testing.

We are also pleased to report that the CharaParser codebase will now be available from GitHub under an open source (MIT) license.

Filed under: Data Curation, NLP, Ontology, Phenex, Software

California Dreaming

Mar 28, 2013

Winner of a competition among participants to illustrate the essence of Phenoscape, from Paul Sereno

It’s easy to get caught up in the details when developing infrastructure. You know it will be useful – because the grant application said so! But there’s so much engineering to do. And no matter how thoughtful and deliberate a process you follow to anticipate the needs of your future users, once they have a complicated thing in their hands, who knows how they will actually use it?

Enter the Phenoscape Knowledgebase.  After a heroic data collection push this winter, our next release of the Knowledgebase will contain millions of evolutionary phenotypes from throughout the vertebrates, linked to genetic phenotypes from human, mouse, Xenopus, and zebrafish, and a particularly rich set of annotations for skeletal features of fins and limbs.  The Knowledgebase is far from comprehensive, and annotations do not capture the full richness of the original characters in the evolutionary literature, but we think it’s a pretty useful resource.

So, it’s time to see what capabilities our users are excited by and what limitations frustrate them. To that end, we brought a small group of experts who look at phenotypes in a variety of different ways (e.g. genetics, systematics, evo-devo, clinical biomedicine, paleontology, even zooarchaeology) to the California Academy of Sciences in February, and we asked them what questions they’d most like to address using the KB as it exists today.

To help us tap into the assembled brainpower, we enlisted KnowInnovation, facilitation pioneers who specialize in helping researchers self-organize into teams to tackle creative research challenges. This they did with amazing resourcefulness, milking ideas out of us that we wouldn’t have imagined we even had. The workshop was no ordinary parade of PowerPoints. We did speed-dating to bounce research ideas off each other, generated a staggering number of post-it notes, sculpted creatures and skeletal parts out of clay and engaged in a host of other seemingly contrived but strangely liberating activities. We watched in amazement as Karl Gude took visual minutes.


And we came up with some great collaborative ideas for research that leverage the Knowledgebase to ask questions that would have been difficult to impossible to answer without it, including questions about genetic convergence and parallelism, global comparisons of intra- and interspecific phenotypic variation, and the evolution of phenotypes affected by duplicated genes. These projects will now serve as driving applications for Phenoscape so that we know better what our users really need the Knowledgebase to do for them. We look forward to reporting on the outcome of those in due course.

A big thank you to David Blackburn and the Cal Academy for providing such an inspiring venue, being exquisite hosts, and for conveniently having an open museum night during our workshop. Thanks also to a great group of participants and facilitators, and to NSF for a supplemental award that helped to make the workshop a success.

Filed under: Evolution, Knowledge Base, Outreach, Science, Workshops

Homology in anatomy ontologies: Report from a Phenotype RCN meeting

Feb 26, 2013

At the end of October 2012, the working groups of the Phenotype Research Coordination Network (RCN) all met at the Asilomar Conference Center, in Pacific Grove, CA. One of the groups, the Vertebrate working group, made it their goal to discuss methods of representing phylogenetic and serial homology in anatomy ontologies, an issue that is central to Phenoscape as well. Though common ancestry is implicit in the semantics of many classes and subclass relationships (see for example the ‘homology_notes’ for digit in Uberon), most multispecies anatomy ontologies, including Uberon, VSAO, and TAO, do not assert homology relationships between anatomical entities.  Nonetheless, homology is central to comparative biology, and therefore to enriching computations across data types, species, and evolutionary change.

The working group used ontological relationships, phenotypes, and homology assertions across a small set of skeletal elements from vertebrate fins and limbs as a test case to identify requirements for making and reasoning over homology assertions. These included both positive (data expected to be returned) and negative (data expected not to be returned) results for particular queries involving phylogenetic and serial homology. The group developed a number of such queries across subtype (is_a) and partonomy (part_of) relationships. One example is that without homology assertions a query for phenotypes involving the ‘humerus’ would not retrieve phenotypes for ‘femur’. Asserting that the ‘forelimb skeleton’ is serially homologous to the ‘hindlimb skeleton’ would not remedy this, because doing so would not imply that their parts (humerus and femur, respectively) would be homologous as well. Instead, serial homology must be directly asserted between entities, even when they are parts of other already homologous structures (i.e., in this case humerus and femur have to also be directly asserted to be serial homologues). Conversely, it was determined that homology relations, both serial and phylogenetic, should propagate to subclasses. For example, to return phenotypes for types of both the ‘paired fin skeleton’ and the ‘skeleton of limb’ in a query for either requires asserting phylogenetic homology only for these high-level classes. With this assertion propagating to all their subclasses, such as ‘pectoral fin skeleton’, ‘hindlimb skeleton’, or ‘autopodial skeleton’, phenotypes for any of their subtypes would then also be returned. The group also discussed how to define the identity of elements of a series consistently and, ideally, universally. The consensus was to specify subsets of digits for different taxa with different conventions, e.g., a basal tetrapod subset and a bird subset.
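The positive and negative query expectations described above can be mimicked with a small toy model. The sketch below is one possible reading of the requirements, not the working group's implementation and not OWL reasoning: the class hierarchy is a hand-made fragment, the function names are invented for illustration, and homology is treated as a symmetric relation that is inherited down is_a links while part_of links are deliberately never traversed.

```python
# Toy subtype (is_a) links: child -> parent. A hand-made fragment.
IS_A = {
    "pectoral fin skeleton": "paired fin skeleton",
    "pelvic fin skeleton": "paired fin skeleton",
    "forelimb skeleton": "skeleton of limb",
    "hindlimb skeleton": "skeleton of limb",
}

# Toy partonomy (part_of) links, shown only to make the point that the
# query below never follows them.
PART_OF = {"humerus": "forelimb skeleton", "femur": "hindlimb skeleton"}

# Asserted homology pairs (treated as symmetric).
HOMOLOGY = [
    ("paired fin skeleton", "skeleton of limb"),   # phylogenetic
    ("forelimb skeleton", "hindlimb skeleton"),    # serial
]


def ancestors_or_self(term, is_a):
    """Return the term followed by its superclasses."""
    out = [term]
    while term in is_a:
        term = is_a[term]
        out.append(term)
    return out


def descendants_or_self(term, is_a):
    """Return the term together with all of its subclasses."""
    out = {term}
    changed = True
    while changed:
        changed = False
        for child, parent in is_a.items():
            if parent in out and child not in out:
                out.add(child)
                changed = True
    return out


def query_classes(term, is_a, homology):
    """Classes whose phenotypes a query for `term` would return."""
    result = descendants_or_self(term, is_a)
    # Homology asserted on the term itself or inherited from a superclass.
    inherited_from = set(ancestors_or_self(term, is_a))
    for a, b in homology:
        for near, far in ((a, b), (b, a)):
            if near in inherited_from:
                result |= descendants_or_self(far, is_a)
    return result


print(sorted(query_classes("paired fin skeleton", IS_A, HOMOLOGY)))
print(sorted(query_classes("humerus", IS_A, HOMOLOGY)))
```

In this model a query for ‘humerus’ returns only ‘humerus’, even though the skeletons it and ‘femur’ belong to are asserted to be serial homologues. Adding the pair ("humerus", "femur") to the homology list is what brings ‘femur’ into the result, which matches the conclusion that homology between parts has to be asserted directly.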

In summary, the requirements identified at the workshop for reasoning over both phylogenetic and serial homology turned out to be fully consistent with standard OWL property semantics. Furthermore, the recommendations that emerged from the workshop for defining elements in a repeated series are fully in line with the goal of defining classes in anatomy ontologies such that they can be applied unambiguously and in a manner consistent with knowledge of developmental and evolutionary origin.

Aside from several Phenoscape personnel (Jim Balhoff, David Blackburn, Alex Dececchi, Hilmar Lapp, Paula Mabee, Chris Mungall), participants in the meeting included Eric Kansa, Hans Larsson and Karen Sears, who were new to the RCN (and Phenoscape). We are grateful to them for helping us work through the questions in a way that kept it grounded in enabling science.

Filed under: Anatomy Ontology, Homology, Ontology, Vertebrates, Workshops

Society of Vertebrate Paleontology Annual Meeting 2012 (Raleigh, NC)

Nov 27, 2012

The Phenoscape project had a strong presence at the largest Vertebrate Paleontology/Comparative Anatomy conference in the world this year, the Society of Vertebrate Paleontology annual meeting. In one of the large conference halls, and in front of a packed audience, I gave a talk on the history, goals and background of the Phenoscape project (“Phenoscape: A New Anatomical Ontology of Vertebrates”). The authorship also included Paul Sereno, Paula Mabee, Todd Vision and Hilmar Lapp. The talk was well received, and several attendees expressed great interest in our work. The difficult part now is to make sure this first spark of interest is maintained. That can be hard when the community has not been exposed to ontologies before and the project appears so different from anything they have done before, but we’ll do our best to stay in contact with those who expressed strong interest.

Alex Dececchi presented a poster on Phenoscape at the same conference (“Phenoscape: bridging the gap between fossils and genes”; his co-authors were J. Balhoff, W. Dahdul, N. Ibrahim, H. Lapp, P. Midford, P. Sereno, T. Vision, M. Westerfield, P. Mabee and D. Blackburn), making sure that even those who could not attend the talk would get an opportunity to learn more about our exciting work.

Nizar Ibrahim, University of Chicago

Filed under: Conferences