Thursday, 8 August 2013

i4Life Part 1: Improving the world's taxonomic data indexing

The i4Life project is enabling a flow of names and taxonomic concepts between the Catalogue of Life, its supporting Global Species Databases and the internationally used biodiversity data portals.

What this means is that, whether you are looking for DNA sequences, distribution patterns or the conservation status of your chosen species, the shared and interlinked catalogues of organism names in i4Life will help you to find the same plant, animal, fungus or micro-organism under the same name in each data portal.
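To make the idea concrete, here is a minimal sketch of how a shared catalogue of names can link records held under different names in different portals. It is not the i4Life implementation: the backbone contents, the portal names, the record payloads and the functions `build_index`, `resolve` and `link_records` are all hypothetical, chosen only to illustrate resolving synonyms to one accepted name.

```python
# Hypothetical reference catalogue: accepted name -> known synonyms.
# Illustrative data only; a real catalogue would be far larger.
BACKBONE = {
    "Danio rerio": ["Brachydanio rerio", "Cyprinus rerio"],
}


def build_index(backbone):
    """Map every name (accepted or synonym), lower-cased, to its accepted name."""
    index = {}
    for accepted, synonyms in backbone.items():
        for name in [accepted, *synonyms]:
            index[name.lower()] = accepted
    return index


def resolve(name, index):
    """Return the accepted name for `name`, or None if the name is unknown."""
    # Collapse repeated whitespace and ignore case before looking the name up.
    return index.get(" ".join(name.split()).lower())


def link_records(portal_records, index):
    """Group records from several portals under their shared accepted name."""
    linked = {}
    for portal, records in portal_records.items():
        for name, payload in records:
            accepted = resolve(name, index)
            if accepted is None:
                continue  # unmatched names are skipped in this sketch
            linked.setdefault(accepted, {})[portal] = payload
    return linked


if __name__ == "__main__":
    index = build_index(BACKBONE)
    # Hypothetical portals that hold the same species under different names.
    portals = {
        "sequences": [("Brachydanio rerio", "seq-1")],
        "conservation": [("Danio rerio", "status-1")],
    }
    print(link_records(portals, index))
```

In this sketch a record filed under a synonym in one portal and a record filed under the accepted name in another end up under the same key, which is the sense in which a single reference list of names lets otherwise separate data be joined. A real workflow would also need to report unmatched names back to the data providers rather than silently skipping them.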

How is it doing this? It uses a number of informatics tools, developed during the lifespan of the project, to manage both the cycle of taxonomic data flow between data providers and the processes that enable taxonomic improvements along the way. This system, and the high level of collaboration required among project partners to make it work, is depicted in the i4Life workflow diagram below. Over the coming weeks on the blog, we will look at some of these tools in detail to see exactly what they are doing and how they are doing it. In the meantime, see if you can work it out for yourselves from the diagram!

The i4Life system diagram
Next up: i4Life Part 2: Catalogue of Life


  1. My first impression looking at this diagram is the emphasis on process. From my perspective the Catalogue of Life has never been terribly useful as it is not linked to data. Instead users are expected to trust its "authority". I want names connected to data, including the primary taxonomic literature. For example, why should I accept this name over that one? Why should I follow a CoL classification when other databases follow different classifications? For example, CoL is horribly out of date for groups such as frogs, and its data harvesting processes frequently mangle the taxonomic literature supplied by contributing databases.

    Don't get me wrong, CoL is an impressive achievement, but at its heart it's simply a set of digitised 5x3 index cards listing taxonomic names. I think we need a lot more than this if we are to make progress.

    1. Hi Rob, Catalogue of Life aims to deliver a taxonomic opinion for each group of living organisms. Those opinions are based on primary taxonomic data assessed by people. It does not aim to re-present the raw data on which those opinions are based. However, it does have the facility, and many contributors use this, to cite the relevant taxonomic literature on which the opinion is based. Take the Zebra Danio - Danio rerio as an example; look at the entry on Catalogue of Life and you will find that the current use of the name is cited against "Menon, A.G.K., 1999. Check list - fresh water fishes of India. Rec. Zool. Surv. India, Misc. Publ., Occas. Pap. No. 175, 366 p.", the common names are listed against a wide range of different original references, distribution is referenced and so on. Contributing databases have the opportunity to update their content on a monthly basis through the dynamic checklist. Not all databases are as comprehensive as that for fish, but that is an aspirational level of achievement for most contributors. Catalogue of Life has a review process to allow better databases from new sources to replace out-of-date ones.
      The diagram you see above IS a process diagram. It describes a process of data gathering and feedback that did not exist before i4Life. The process isn't about delivering data that allows taxonomy to be done, it is about delivering names to taxonomists who can evaluate them through their own research and about delivering names to data providers that allow them to cross reference their taxonomies with a single common one. Ultimately it isn't about delivering a 'correct' taxonomy, and I would argue there is no such thing, but delivering a single reference taxonomy that allows others to stick their differing data together more easily. We now have a system in place that performs cross mapping of names, and to an extent concepts based on synonymy, between some of the world's major data providers. You always say data should be sticky - in this case the project is giving the means for data on DNA sequence, distribution, conservation status etc to be linked through a common taxonomic backbone. To me that bears little resemblance to a 5x3 card index!

  2. My point with the "5x3 card index" remark is that CoL has no links to the literature. Simply having a text string as a citation is frustrating, and an anachronism in the age of the web (I have to cut and paste it into Google and see if I get lucky). The page for any taxon is essentially what you'd have if you'd scanned a bunch of 5x3 index cards. I think we need to aim higher than that. Obviously we will be limited by the extent to which literature is available online, but that is growing every day.

    An analogy would be how you'd tackle the task of digitising the contents of a library. In the early days we digitised the card catalogue, because that was the authoritative list of what was in the library. Then along came Google and said "let's scan all the books". Turns out that makes it a lot easier to find things than any card catalogue. I think CoL is stuck at the level of scanning index cards, when we can do so much more.

  3. Give me a place to stand and with a lever I will move the whole world.... But I'd settle for suitable grant funding and a computer to add hyperlinks to CoL. This might be an action for a Horizon2020 grant, but there is still a huge problem of scanned but poorly marked-up legacy literature, and a bigger one for the literature that is not yet digitised.