Wednesday, 30 October 2013

Example uses of the Catalogue of Life: Media libraries

Anyone who has ever searched an online image database for a particular species by scientific name will know it can be a frustrating exercise. Often the name you enter brings back a completely different species from the one you are looking for, or else it returns nothing at all. Yet this may not be due to the species identification skills of the photographer, or the online library's lack of images, but rather to the limitations of the controlled vocabularies that index these media libraries. For specialists who want to find images by scientific name, and for nature photographers with identification expertise who want to sell them, the integration of the Catalogue of Life taxonomy into these systems could certainly help yield better results. The Catalogue of Life uniquely offers a simplified and unified hierarchical classification across all organisms plus one accepted name for each species, and as such has the potential to be used as an indexing mechanism for this kind of file management.

Commercial (and social) online contributor-based media libraries (such as iStockphoto, Shutterstock, Corbis Images, Getty Images, Alamy, Dreamstime) have had, and continue to experience, exponential growth. Indexing and retrieval effectiveness across an international market are key to their usability and profitability. Nature is arguably the most photographed and videoed area of life, and keywording or tagging such images for later retrieval ideally includes identifying key organisms present in each shot. So why is it that managers of these media libraries find it difficult to index nature-related information in an efficient and accurate way for their specialist users and contributors?

Fig 1: Just some of the international common names held in the Catalogue of Life for this well-known and widespread species of conifer, Pseudotsuga menziesii

花旗松 (hua qi song) - China
British Columbia fir - Canada, France
Douglas spruce - Canada, USA
Douglas-fir - UK, USA
Oregon pine - UK, USA
Douglas d'Orégon - France
sapin de Douglas - Canada, France
Douglasfichte - Germany
Amerikai duglászfenyo - Hungary
Abete di Douglas - Italy
Douglasgran - Norway

The naming of organisms, both scientific and common, is rife with synonymy (different name, same species) and homonymy (same name, different species). Common-use names are both language and location dependent (see Fig 1), and scientific names, while internationally recognised, can change periodically as competing academic views take precedence, as was shown in a previous blog post on elephants. In addition, the ability to name a subject depends on the expertise of the contributor or the index creator, leading to varying levels of specificity: an animal to one person may be a pig to another, and Sus scrofa domestica to someone else! The sheer number and complexity of names can cause a great deal of confusion, as the example at the end of this post shows.

For contributor-based media libraries, controlled vocabularies are one of the most effective ways to control synonyms, arrange terms into hierarchies (to broaden or narrow search terms based on level of expertise), and determine other related or associated terms. As museum collection managers have shown for centuries, the Linnaean taxonomy of binomials in a rank-based classification is the most effective controlled vocabulary to deal with these issues in relation to species. An accepted scientific name offers a unique and universal code for every species and can act as the indexing tag for all other names - for example with plants, it can index common names, horticultural cultivar names, food ingredients and natural products. Yet the current controlled vocabularies and tagging systems used by media libraries are not adequately curating species, leading to missing, limited or incorrect search results. More names need to be added to these vocabularies, existing content re-tagged and an expert-curated taxonomy decided upon, before they can hope to service expert users and handle the predicted continued expansion of content. Unfortunately no single, complete, electronic list of accepted species names exists anywhere, let alone with associated common names and synonyms. But adding the most comprehensive list of accepted species and rank names, and manoeuvring contributors through the controlled vocabulary to ultimately choose one (as the defining tag), would be a step forward.
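To make the idea of an accepted name acting as the indexing tag for all other names more concrete, here is a minimal Python sketch. It is an illustration only, not how any media library or the Catalogue of Life actually works: the tiny vocabulary is built solely from names quoted in this post, and the names VOCABULARY, build_name_index and resolve_tag are hypothetical.

```python
from typing import Dict, List, Optional

# Toy vocabulary using only names quoted in this post (Fig 1 and the
# Gentianella amarella example). A real system would load a full checklist.
VOCABULARY: Dict[str, List[str]] = {
    "Pseudotsuga menziesii": [
        "花旗松", "Douglas-fir", "Oregon pine", "Douglas spruce",
        "sapin de Douglas", "Douglasfichte", "Abete di Douglas", "Douglasgran",
    ],
    "Gentianella amarella": ["Autumn Gentian", "Northern Gentian"],
}


def build_name_index(vocabulary: Dict[str, List[str]]) -> Dict[str, str]:
    """Map every known name (case-insensitively) to its accepted name."""
    index: Dict[str, str] = {}
    for accepted, other_names in vocabulary.items():
        for name in [accepted, *other_names]:
            index[name.casefold()] = accepted
    return index


NAME_INDEX = build_name_index(VOCABULARY)


def resolve_tag(tag: str) -> Optional[str]:
    """Return the accepted scientific name for a contributor's tag, or None."""
    return NAME_INDEX.get(tag.strip().casefold())


if __name__ == "__main__":
    for tag in ["oregon pine", "Douglasgran", "Northern Gentian", "Lisianthus"]:
        print(tag, "->", resolve_tag(tag))
```

Whatever name a contributor types, the image ends up filed under one accepted name. Note that this simple one-to-one lookup cannot represent homonyms (the same name used for different species); handling those would need each name to map to a list of candidate species for the contributor to choose from.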

The Catalogue of Life is working to complete an inventory of life on Earth where all known species (~1.9m plants, animals, fungi and micro-organisms) are named, documented and made available on the web. This global quality-assured checklist currently holds over 1.4m accepted names, 1m synonyms and 0.5m common names (in multiple languages) and is expanding every month. The Catalogue of Life is already used as an indexing mechanism by the world’s largest online biodiversity providers (European Nucleotide Archive, Encyclopedia of Life, IUCN Red List, GBIF) and as a synonym expansion search tool (i.e. type in one name and it will find resources that include all known synonyms too) for text-based resources such as the Biodiversity Heritage Library and the Dictionary of Natural Products. While science has different motivations and methods for updating and curating its datasets from those of the commercial media library, both desire the same end product - a quality-controlled, up-to-date, sustainable, multilingual and internationally relevant taxonomy. Utilising the expertise of the curators that supply the Catalogue of Life is the best option for the rapid enhancement of controlled vocabularies for nature-related collections.
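The synonym expansion search mentioned above can be sketched in a few lines of Python. This is a rough illustration under stated assumptions, not the mechanism used by any of the services named: the name groups reuse names from Fig 1 and the Gentianella amarella example, while the image file names, their tags, and the functions expand_query and search are all made up for the example.

```python
from typing import Dict, List, Set

# Each set groups names that refer to the same species (names from this post).
NAME_GROUPS: List[Set[str]] = [
    {"Pseudotsuga menziesii", "Douglas-fir", "Oregon pine", "Abete di Douglas"},
    {"Gentianella amarella", "Autumn Gentian", "Northern Gentian"},
]

# Hypothetical image library: file name -> tags supplied by contributors.
IMAGE_TAGS: Dict[str, Set[str]] = {
    "img_001.jpg": {"Oregon pine", "forest"},
    "img_002.jpg": {"Abete di Douglas"},
    "img_003.jpg": {"Autumn Gentian", "meadow"},
}


def expand_query(name: str) -> Set[str]:
    """Return the query plus every other name known for the same species."""
    key = name.strip().casefold()
    for group in NAME_GROUPS:
        lowered = {n.casefold() for n in group}
        if key in lowered:
            return lowered
    return {key}  # unknown name: search for it as typed


def search(name: str) -> List[str]:
    """Find images tagged with the query or any of its known synonyms."""
    wanted = expand_query(name)
    return sorted(
        image
        for image, tags in IMAGE_TAGS.items()
        if wanted & {t.casefold() for t in tags}
    )


if __name__ == "__main__":
    print(search("Douglas-fir"))
    print(search("Gentianella amarella"))
```

Here a search for "Douglas-fir" finds both images of the species even though neither contributor used that name, which is the behaviour a specialist searching by scientific name would hope for.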

The example of 

Fig 2: Gentianella amarella
When submitting a species image, such as the plant in Fig 2, to the iStockphoto image library, you are asked to tag it with appropriate keywords. For this image it seemed logical to include the following:

  • Gentianella amarella - the scientific name
  • Autumn Gentian - the UK common name
  • Gentianaceae - the plant family
  • Gentian - the common name for the family, and lastly
  • Northern Gentian - another known common name for this plant from Canada. 

What is returned from the image manager is as follows:

  • Northern Gentian is 'unknown'
  • Autumn Gentian is 'unknown'
  • Gentianella amarella is 'unknown'
  • Gentianaceae is recognised but as a synonym of Lisianthus 
  • Gentian is recognised and can be included

This example shows the current limitations of the iStockphoto controlled vocabulary, which does not adequately deal with the indexing of this species. Apart from not recognising the common names of a relatively well-known wildflower in the UK and Canada, it is also unable to recognise the scientific species name. Furthermore, it erroneously matches the whole plant family Gentianaceae to the name Lisianthus. Lisianthus is a commonly used name for the cultivars of one species of Gentianaceae in the small genus Eustoma. However, in science Lisianthus is the name of a different genus in Gentianaceae, and Gentianaceae has 78 possible genera, of which Lisianthus is just one.

What this means for the Gentianella amarella image seeker is that they will probably experience the frustration noted at the start of this post. Only a broader search term (in this case 'Gentian') will help find their species, and it will return many unwanted images that they will need to wade through to find the one they want. If Gentianella amarella had been in the vocabulary, both image contributor and end user would have had a greater chance of success.
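As a rough sketch of how a rank-based hierarchy lets a search be broadened or narrowed, the Python below files each image under a single taxon and lets a query at any rank collect everything beneath it. It is a simplified illustration, not a description of any existing system: the parent links use only taxa named in this post, and the image file names and the functions lineage and images_under are hypothetical.

```python
from typing import Dict, List

# Child taxon -> parent taxon, using only taxa mentioned in this post.
PARENT: Dict[str, str] = {
    "Gentianella amarella": "Gentianella",
    "Gentianella": "Gentianaceae",
    "Eustoma": "Gentianaceae",
    "Lisianthus": "Gentianaceae",
}

# Hypothetical image library: file name -> the one taxon it is filed under.
IMAGE_TAXON: Dict[str, str] = {
    "gentian_a.jpg": "Gentianella amarella",
    "cut_flowers.jpg": "Eustoma",
}


def lineage(taxon: str) -> List[str]:
    """Return the taxon followed by each of its ancestors, lowest rank first."""
    chain = [taxon]
    while chain[-1] in PARENT:
        chain.append(PARENT[chain[-1]])
    return chain


def images_under(taxon: str) -> List[str]:
    """Find images filed under the taxon or under anything below it."""
    return sorted(
        image for image, filed in IMAGE_TAXON.items() if taxon in lineage(filed)
    )


if __name__ == "__main__":
    print(images_under("Gentianaceae"))          # broad: whole family
    print(images_under("Gentianella amarella"))  # narrow: one species
```

With the species in the vocabulary, the specialist can go straight to Gentianella amarella, while a less expert user can still browse the whole family.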
