Today, I gave a presentation about ‘Data interoperability’ at the NBIC BioAssist meeting. I tried to recap the various things that were said at the mini-symposium on study capture I organized last week together with Morris Swertz.
There are many levels at which you can approach this, but I think most people will agree that we have at least an interoperability challenge when it comes to storing bioinformatics data. No matter what type of data you are generating, in many cases you will want to connect to other databases and information out there on the web – be it genomic annotations, protein sequences, metabolites, biological pathways or even PubMed publications. Most of those websites have web service connectors, either via SOAP or via RESTful services serving JSON or XML. And yes, there are workflow tools like Taverna or Galaxy that can take advantage of those connectors. But I don’t want to do workflows, I just want to enrich my data with, for example, BioPortal ontologies. So I can either write all those connectors myself, or I can take advantage of libraries such as OntoCAT or Biopython, depending on what’s available in my programming language. Either way, I will have to change my database schema to hold these foreign identifiers, not to mention the synchronization and dependency issues that come with that.
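To make that schema change concrete, here is a minimal sketch using Python’s built-in sqlite3 module. Everything in it is an assumption made for illustration: the table and column names (sample, external_term, sample_term), the term identifier and the date are hypothetical placeholders, and no real web service is called.

```python
import sqlite3

# Hypothetical schema: none of these names come from a real project.
conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")
conn.executescript("""
    CREATE TABLE sample (
        id   INTEGER PRIMARY KEY,
        name TEXT NOT NULL
    );
    -- Extra table that exists only to hold identifiers owned by someone else.
    CREATE TABLE external_term (
        id          INTEGER PRIMARY KEY,
        source      TEXT NOT NULL,   -- which external resource issued the ID
        external_id TEXT NOT NULL,   -- the foreign identifier itself
        label       TEXT,            -- local copy of the label; can go stale
        fetched_on  TEXT NOT NULL,   -- when the local copy was taken
        UNIQUE (source, external_id)
    );
    -- Link table connecting my own records to the foreign identifiers.
    CREATE TABLE sample_term (
        sample_id INTEGER NOT NULL REFERENCES sample(id),
        term_id   INTEGER NOT NULL REFERENCES external_term(id),
        PRIMARY KEY (sample_id, term_id)
    );
""")

conn.execute("INSERT INTO sample (id, name) VALUES (1, 'liver biopsy 01')")
conn.execute(
    "INSERT INTO external_term (id, source, external_id, label, fetched_on) "
    "VALUES (1, 'BioPortal', 'EXAMPLE:0001', 'liver', '2000-01-01')"  # placeholder values
)
conn.execute("INSERT INTO sample_term (sample_id, term_id) VALUES (1, 1)")

# Reading an annotated sample back now takes two joins.
rows = conn.execute("""
    SELECT s.name, t.source, t.external_id, t.label
    FROM sample s
    JOIN sample_term st ON st.sample_id = s.id
    JOIN external_term t ON t.id = st.term_id
""").fetchall()
print(rows)
```

The label and fetched_on columns are where the synchronization problem shows up: as soon as the external resource renames or retires a term, the local copy is out of date until someone refreshes it.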
Now that’s usually where the Semantic Web people come in. If you store your data as triples, they argue, you have a very flexible and extensible data model – actually, you don’t have structural limitations at all. But storing my data as triples introduces multiple issues as well. For example, in next-generation sequencing (NGS), performance is already an issue when working with well-designed conventional data tables, let alone if I had to extract the data for a specific sample from a triple store with billions of triples each time I do a calculation. Also, I would need a triple store that can actually store numbers of different types – integers or floats, depending on the property. Last but not least, you lose important database features such as mandatory fields and foreign keys, so keeping the data consistent and validated becomes much harder.
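The point about losing mandatory fields and foreign keys can be shown with a small sketch, again in Python’s sqlite3. Note the assumptions: the generic three-column triple table is only a crude stand-in for a real triple store, and all table names, column names and values are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")
conn.executescript("""
    CREATE TABLE sample (id INTEGER PRIMARY KEY, name TEXT NOT NULL);
    -- Conventional table: types, mandatory fields and references are declared.
    CREATE TABLE read_count (
        sample_id INTEGER NOT NULL REFERENCES sample(id),
        gene      TEXT    NOT NULL,
        n_reads   INTEGER NOT NULL CHECK (n_reads >= 0)
    );
    -- Naive stand-in for a triple store: anything goes in every column.
    CREATE TABLE triple (subject TEXT, predicate TEXT, object TEXT);
""")
conn.execute("INSERT INTO sample (id, name) VALUES (1, 'S1')")


def try_insert(sql, params):
    """Return True if the database accepts the row, False if a constraint rejects it."""
    try:
        conn.execute(sql, params)
        return True
    except sqlite3.IntegrityError:
        return False


# Sample 99 does not exist, so the foreign key rejects this row.
relational_ok = try_insert(
    "INSERT INTO read_count (sample_id, gene, n_reads) VALUES (?, ?, ?)",
    (99, "BRCA1", 12),
)
# The same dangling reference, with the number written as a word, is accepted.
triple_ok = try_insert(
    "INSERT INTO triple (subject, predicate, object) VALUES (?, ?, ?)",
    ("sample:99", "readCount:BRCA1", "twelve"),
)
print(relational_ok, triple_ok)
```

In the relational table the database itself refuses the bad row. In the triple table that check has to be written and maintained somewhere in application code.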
And there’s the issue of identifiers, which supposedly could be solved by Semantic Web practices. For example, the ConceptWiki has taken up the Tantalus job of creating identifiers for all unique concepts in the life sciences – people, publications, proteins, SNPs… And not only that, it also retains ‘also known as’ tables for each concept to keep track of all the public database identifiers lying around for that concept, which allows you to traverse the graphs in the Linked Web of Data. It creates concepts for relations (predicates) as well, which is important. In the W3C standard RDF, the predicate of a triple is just an identifier (a URI) that anyone can coin, so RDF itself does not check whether I use predicates consistently. I could easily intermix terms such as ‘name’, ‘hasName’ and ‘named’, and I will have a query headache when I want to get the names of a set of objects. ConceptWiki identifiers can solve this problem because the identifiers of the predicate concepts are opaque – not English, no disputable names, so they are supposed to be eternal. However, storing my data as triples with a ConceptWiki UUID as the predicate and either a number or another ConceptWiki UUID as the object makes me rely very heavily on the ConceptWiki being available to make any sense of what I just stored. If all bioinformaticians start doing that, they will bring the ConceptWiki down, no matter how many servers its maintainers purchase. So I will probably end up storing local caches – which introduces the synchronization issues all over again.
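The ‘name’, ‘hasName’, ‘named’ headache is easy to reproduce with plain Python tuples. This is only a sketch: the subjects, the opaque identifier and the alias table are hypothetical placeholders, not real ConceptWiki data or its API.

```python
# Three triples that mean the same thing but use three different predicates.
triples = [
    ("protein:1", "name", "insulin"),
    ("protein:2", "hasName", "glucagon"),
    ("protein:3", "named", "leptin"),
]


def names_by_predicate(triples, predicates):
    """Collect subject -> object for every triple whose predicate is in `predicates`."""
    return {s: o for s, p, o in triples if p in predicates}


# A query for one spelling silently misses two of the three proteins.
naive = names_by_predicate(triples, {"name"})

# The workaround is to enumerate every variant anyone ever used.
all_variants = names_by_predicate(triples, {"name", "hasName", "named"})

# With one opaque predicate identifier (placeholder value), a single lookup suffices.
NAME_PREDICATE = "concept:00000000-0000-0000-0000-000000000001"
# Local alias table: effectively a cache of the external mapping.
ALIASES = {"name": NAME_PREDICATE, "hasName": NAME_PREDICATE, "named": NAME_PREDICATE}
normalized = [(s, ALIASES.get(p, p), o) for s, p, o in triples]
by_id = names_by_predicate(normalized, {NAME_PREDICATE})

print(len(naive), len(all_variants), len(by_id))
```

The ALIASES dictionary is exactly the kind of local cache mentioned above: it makes the query simple, but it has to be kept in sync with whoever issues the identifiers.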
Finally, the most compelling argument for me to not store my data in triple stores or graph databases is common sense. Even Semantic Web freaks (see my presentation) agree that it just doesn’t make sense to store high-volume, well-structured quantitative data in anything other than a relational database, which was designed for this task. So the real question is how far to go in describing the metadata of your data according to Semantic Web principles. Because whether you come from the database world or from the Semantic Web world, if we want to be interoperable, it all comes down to using the same terms and ontologies. But that is food for a follow-up story.