Friday, 20 September 2013

i4Life Part 3: The Catalogue of Life

Catalogue of Life in i4Life data flow

Previous posts in this series:
Part 1: Improving the world's taxonomic data indexing
Part 2: Global Species Databases

The Catalogue of Life provides a unified taxonomic index of living species on Earth. The first Annual Checklist was produced on CD in 2001 and contained 204,216 species. Today it is available on DVD and online, and contains over 1.4 million species from over 135 contributing taxonomic databases. The Annual Checklist, as the name suggests, is a fixed edition published once per year, and so provides a referenceable version of the entire Catalogue. The Catalogue of Life now also produces a second edition, the Dynamic Checklist, which is updated monthly. It is not a fixed edition; instead it reflects updates to the supplier databases, or the inclusion of new databases, as and when the Catalogue of Life receives them. Accessing the Catalogue of Life is easy: you can use the browse and search interface online (shown by the 'End User' in the diagram above), while larger data consumers who want all or part of the Catalogue can access it programmatically. Whichever way you choose, the Catalogue of Life tries to make data sharing as easy as possible while ensuring that our contributors are fully credited for their work.

The Catalogue of Life has many users from different fields, including students, nature enthusiasts, ecologists, museum collection managers, publishers, commercial natural product manufacturers and policy makers, to name but a few. It is unlikely that taxonomists would use the Catalogue of Life for the taxa they have expertise in, but they do use it to investigate related taxa outside their own area. The Catalogue of Life also provides a taxonomic backbone for large biodiversity data suppliers (the i4Life Global Biodiversity Programme partners), where the 1.4 million species names act as an indexing mechanism within their databases, from which all other biodiversity data can radiate. It is this important user group that is central to the goals of i4Life and to the tools and processes it has created to achieve them. Using the Download, Piping and Cross-mapping tools developed during i4Life (which will be covered in later posts in this series), the Catalogue of Life and its global partners have been able to identify the differences and similarities in their taxonomic catalogues. What follows is a process of harmonisation between all catalogues. Once this process is complete, it will also give a clearer indication of the remaining gaps in the world's understanding of biodiversity that need to be filled.

Today, we look at the workflows that the Catalogue of Life team follows to create the monthly Dynamic Checklist before making it available online and to the Global Biodiversity Programme partners for use in their own data portals.

As outlined in the previous post in this series, the Catalogue does not produce data itself. Instead it acts like a publisher, assembling expert taxonomists' global species checklists side by side into a unified and simplified whole. Keeping this huge global taxonomic checklist well organised and up to date is a complex task. To achieve this, the Catalogue of Life team carries out a monthly process of taxonomic data integrity checks, editorial work and aggregation, known internally as 'data assembly'. Some of this assembly is automated, some semi-automated and some manual. Whichever method is used, the system in place (called the Workbench) allows the flexibility to move between the three as and when required. The editor has overall content control, the data assembly team carries out the updates, and the whole process of production and roll-out is overseen by the CoL Systems Manager.

Each month a number of steps, or workflows, take place. First, files are collected or received from the Global Species Databases (GSDs). These are then extracted and transformed into the Catalogue of Life Standard Dataset as delimited text files. Some arrive in this form already, whereas others need more work to get them into a usable format. The data are then transferred into MySQL, the database software that houses the Catalogue of Life. The next step is to run checks for data integrity and consistency. Some issues are detected and resolved automatically, whereas others, for example those that deal with nomenclatural and taxonomic questions, may need manual input. After all editorial decisions have been made and updates completed, the checklists are merged to form the complete Catalogue, which is then converted to a production schema and deployed on the University of Reading servers.
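The "transform and load" step above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the column layout is hypothetical (the Standard Dataset format is not described in this post), and an in-memory SQLite database stands in for the MySQL database the team actually uses.

```python
import csv
import io
import sqlite3

# Hypothetical column layout, chosen for illustration only.
COLUMNS = ["taxon_id", "scientific_name", "rank", "status", "accepted_id"]


def load_checklist(text, conn, source):
    """Parse tab-delimited checklist text and load it into a 'taxa' table."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS taxa ("
        "source TEXT, taxon_id TEXT, scientific_name TEXT, "
        "rank TEXT, status TEXT, accepted_id TEXT)"
    )
    reader = csv.DictReader(io.StringIO(text), delimiter="\t")
    # Tag every row with the supplying database so its origin is never lost.
    rows = [(source, *(row[col].strip() for col in COLUMNS)) for row in reader]
    conn.executemany("INSERT INTO taxa VALUES (?, ?, ?, ?, ?, ?)", rows)
    conn.commit()
    return len(rows)


# Tiny sample checklist: one accepted name and one synonym pointing at it.
SAMPLE = (
    "taxon_id\tscientific_name\trank\tstatus\taccepted_id\n"
    "1\tPanthera leo\tspecies\taccepted\t\n"
    "2\tFelis leo\tspecies\tsynonym\t1\n"
)

conn = sqlite3.connect(":memory:")  # stand-in for the real MySQL database
loaded = load_checklist(SAMPLE, conn, source="example-gsd")
print(f"Loaded {loaded} records")
```

Keeping the source alongside each record mirrors the Catalogue's aim of crediting every contributing database.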

The following team members have particular responsibility for different stages of this process:

Executive Editor (Dr Yuri Roskov) and Editorial Assistant (Dr Thomas Kunze)
The usability of the Catalogue of Life depends on the underlying management classification, which unifies and simplifies the contributed checklists. The biological classification systems of different kingdoms follow different Codes and nomenclatural practices, and some kingdoms also have alternative classifications within them. As noted above, the end user of the Catalogue of Life is generally not an expert in taxonomy, so presenting multiple taxonomies to choose from would decrease the usability of the Catalogue for one very large user group (although it may improve it for another!). By using a single taxonomy, the management classification brings all taxa across all kingdoms into a coherent master view and, where possible, enforces consistent nomenclature. To place a global species checklist within the Catalogue of Life, the editors may need to decide on a specific set of adjustments, covering where and how to insert it, to make it as consistent as possible without losing the essential taxonomic information it was created to provide.

The Catalogue of Life retains the GSD's own classification below the entry point and uses the management classification above it.
When a simplification occurs - for example using the management classification above a GSD insertion point (see picture above) or removal of ranks not recognised by the Catalogue of Life - it is done with the knowledge that the Catalogue of Life links every species to its source database, where a full classification and often extra, associated data can be found by the user.
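One way to picture the insertion point is as a splice between two lineages. The sketch below is not the Workbench's actual logic: the function name, the data layout and the example lineages are all assumptions made for illustration.

```python
def splice_classification(management_path, gsd_path, entry_taxon):
    """Combine the management classification above the entry point with
    the GSD's own classification below it.

    Both paths are lists of (rank, name) pairs ordered from the highest
    rank downwards. management_path must end at entry_taxon.
    """
    if management_path[-1][1] != entry_taxon:
        raise ValueError("management path must end at the entry taxon")
    gsd_names = [name for _, name in gsd_path]
    if entry_taxon not in gsd_names:
        raise ValueError("entry taxon not found in the GSD classification")
    # Keep only the GSD ranks that sit below the entry point.
    below = gsd_path[gsd_names.index(entry_taxon) + 1:]
    return management_path + below


# Illustrative lineages only; not taken from any real GSD.
management = [
    ("kingdom", "Animalia"),
    ("phylum", "Chordata"),
    ("class", "Mammalia"),
    ("order", "Carnivora"),
    ("family", "Felidae"),
]
gsd = [
    ("class", "Mammalia"),
    ("superorder", "Laurasiatheria"),
    ("order", "Carnivora"),
    ("family", "Felidae"),
    ("subfamily", "Pantherinae"),
    ("genus", "Panthera"),
    ("species", "Panthera leo"),
]

result = splice_classification(management, gsd, entry_taxon="Felidae")
for rank, name in result:
    print(f"{rank}: {name}")
```

In this example the GSD's superorder disappears from the unified view because it sits above the entry point, which is why the link back to the source database matters.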

The Executive Editor is also responsible for continually searching for and identifying new taxonomic data sources. When one is found, the Editor also facilitates the independent Peer Review process that is required before the source can be accepted into the Catalogue of Life.

Data Assembly (Luvie Paglinawan until July 2013 and Luisa Abucay)
Some data integrity checking does not need taxonomic input. For example, running a query to check that all synonyms in a checklist have an associated accepted species name, or making sure that abbreviations of taxonomic ranks are consistent within a checklist, can be carried out automatically. The results are passed back to the database supplier, who can consider incorporating them in the next update. The data assembly team does not just run automated checks; it also carries out data transformations as instructed by the editors and does most of the initial data gathering and combining.
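The two automated checks mentioned above might look something like the following. This is a minimal sketch, assuming a simple list-of-dictionaries record layout and a small hand-written table of rank abbreviations; neither is taken from the Catalogue's own tools, and the third sample record is made up to trigger the check.

```python
# Hypothetical mapping of rank markers to a canonical form.
RANK_ALIASES = {
    "sp.": "species",
    "species": "species",
    "ssp.": "subspecies",
    "subsp.": "subspecies",
    "subspecies": "subspecies",
    "var.": "variety",
    "variety": "variety",
}


def orphan_synonyms(records):
    """Return ids of synonyms that do not point at an accepted name."""
    accepted_ids = {r["id"] for r in records if r["status"] == "accepted"}
    return [
        r["id"]
        for r in records
        if r["status"] == "synonym" and r.get("accepted_id") not in accepted_ids
    ]


def inconsistent_rank_markers(records):
    """Return ranks that are written in more than one way in a checklist."""
    seen = {}
    for r in records:
        marker = r["rank"]
        canonical = RANK_ALIASES.get(marker.lower(), marker.lower())
        seen.setdefault(canonical, set()).add(marker)
    return {rank: sorted(forms) for rank, forms in seen.items() if len(forms) > 1}


records = [
    {"id": "1", "name": "Panthera leo", "rank": "species",
     "status": "accepted", "accepted_id": None},
    {"id": "2", "name": "Felis leo", "rank": "sp.",
     "status": "synonym", "accepted_id": "1"},
    # Made-up record whose accepted name is missing from the checklist.
    {"id": "3", "name": "Exampleus orphanus", "rank": "species",
     "status": "synonym", "accepted_id": "99"},
]

orphans = orphan_synonyms(records)
mixed_ranks = inconsistent_rank_markers(records)
print("Synonyms without an accepted name:", orphans)
print("Ranks written inconsistently:", mixed_ranks)
```

Checks like these only report problems; deciding how to resolve them is left to the supplier and the editors.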

Systems Manager (Viktor Didziulis)
The informatics processes are overseen by the Systems Manager. Once all the updates and assembly have been completed by the data assembly team and editors, the Systems Manager oversees the deployment of the new version of the Dynamic Checklist to the servers. There is more than one server, both for security (i.e. if one server goes down, another is still running) and for updates: one server can be updated while the other is still running, and then vice versa, so there is no interruption of service for online Catalogue of Life users.

Global Biodiversity Programmes receive an updated Catalogue of Life database via web services and the i4Life Download tool, which is the topic of the next post in this series.

Next up: i4Life Part 4: Download and Web Services
