[This post was written by Anne Thessen and originally appeared at datadetektiv.com; we thank Dr. Thessen for sharing it with our community as well!]
One of the more difficult aspects of trying to apply “big data” thinking in ecology is the massive heterogeneity of terms. I stumble over this issue every time I work on a data set for the Encyclopedia of Life. The many different ways to describe the same habitat (among other things) and the varying granularity with which people describe habitats make it very difficult for data consumers to find, for example, all the beetles that live in the desert. It’s doubly more difficult to go a step further and ask for traits of beetles that live in deserts, like color, for example.
As a side note, that example is very similar to some use cases I published with several colleagues about ways to combine phenotype and environment data.
Right now, we can ask Google “How much does a narwhal weigh?” and get the answer because of the fine work my EOL colleagues and I have been doing on TraitBank (go ahead, try it), but we’ve still got a way to go before we can ask “What color are beetles that live in the desert?”. We have a plan, though, and it involves semantic technology, i.e. ontologies.
Biology already has many ontologies available for use of varying quality. Most of them can be found at OBO Foundry. Not all domains of biology have good ontologies available, for example, ecology has been left out. That means there is no standard, machine-readable way of expressing which organisms are autotrophs, or nocturnal, or use camouflage, etc. Including terms such as these in an ontology is one of the many necessary steps before we can ask “Which organisms are nocturnal in an alpine forest habitat?” or, if we want to get more complicated, “Is there a relationship between the phylogeny of terrestrial, nocturnal organisms and latitude or elevation?”.
Building an ontology is a large, never-ending, hugely complicated task. One of my clients at University Colorado, Boulder, is the ClearEarth project. The goal of this project is to repurpose NLP and ML algorithms developed for biomedicine for use in geology and biology. These algorithms can read text and automatically generate ontologies. We’ve made a lot of progress annotating domain-specific text and will have some “auto-ontologies” by this summer. Very exciting! To support this effort and make sure the ontologies resulting from this project are meshed in with existing bio-ontologies, we are hosting an “ontology-a-thon” in Boulder this summer. Please take a look and apply, if you are interested in participating. We don’t have a detailed agenda just yet, but the idea is to get ontology and ecology experts in one room to curate the auto-ontology. All expenses paid, but space is limited.