Biobankers are people who generate massive amounts of data. Today I visited a biobanker (Frans van der Horst at Reinier de Graaf Gasthuis, Delft) who basically asked: how can I store my data, and everything I know about my data, in such a way that bioinformaticians profit the most from it? That is of course a question that gets any bioinformatician started. Use Semantic Web principles! Use ontologies! Use my tools! For me, it would be tempting to say: use GSCF!
The challenges that a biobanker faces are massive. Frans pointed out that there are two different processes in his lab: one for standardized assays, which are performed for e.g. general practitioners, and one for clinical trials. Obviously the latter is the one to talk about for us bioinformaticians, but it was certainly enlightening to see how quickly you can process clinical assays once the whole process is standardized in fixed protocols.
Talking about protocols, one of the most obvious challenges is the actual ‘sample tracking’ in the lab. Each clinical trial comes with its own protocols covering sample storage conditions, sample processing steps, the assays that have to be performed, and the exact measurement conditions and instruments involved. Even the parcel company that delivers the samples can be specified. Christine Chicester, my colleague from NBIC, and I were astonished when Frans showed us the current system to track all that, which consists of a single binder with the paperwork for each clinical trial. But it makes sense. All these protocols are slightly different, have their own conditions attached, etc. Automating and computerizing this process in a user-friendly way would be one hell of a task. There are situations where we simply cannot beat good old pen and paper, and I say that as a gadget enthusiast.
However, while pen and paper might be the most efficient way in the lab, things get complicated after the assays are performed. How do I report all the assay results and couple them to the right sample identifiers? Where are the samples from that clinical trial last year that had these odd readouts, and why was that again? Did we use this protocol earlier? What samples can we expect to be delivered this week? Do we still need to keep those samples in the fridge, and where are they from again? Fortunately there are already many software packages that can help with this kind of sample tracking. Frans pointed us to IDQuest, and a quick Google search gave me i3Cube, CryoTrack and ItemTracker. Wikipedia lists a dozen LIMS packages, and there are no doubt many more.
Coming back to the viewpoint of a bioinformatician: we would like to know everything about the sample measurements that could potentially influence our results, and that is a lot. Frans gave the example of sample age: proteins don’t simply stay put in a blood sample, even if you store it well below freezing point. So you can get strange results when you start comparing microarray results from samples that are fresh with samples that have been in the fridge for a month, let alone several years. Without going into the details of which information should be stored, and how to get it from, for example, the trial protocols: what would in the end be the most effective way to store this information?
At first glance, I see two different kinds of storage. There is the actual assay data, which inevitably ends up as a table of samples vs. measurements with the measured values in the cells. And there is all the metadata, such as measurement conditions, the equipment used, the calibration samples used, the trial or study which set the context for the assay, etc. Not to mention all the metadata that you could get on the samples themselves, even before they arrive at the lab: from which species they were taken (although RdG only takes human samples), under which conditions, etc. I would say that we should store the data in its most natural form, which would definitely be tables for the actual assay data (as is already the case), and probably some kind of interlinked data for the metadata. That leads us right to the old distinction between relational databases and graph databases. The real question is, of course, how we could make this work for RdG: to which ontologies can we connect to describe those metadata fields, how can we connect to compound or assay databases, and how can we connect to the internal databases that are already in place? That is what we have to find out.
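To make the split a bit more concrete, here is a minimal Python sketch of the idea: assay values in a plain table keyed by sample, and metadata as interlinked subject-predicate-object statements. Everything in it is an assumption for illustration only: the sample identifiers, the measurement and predicate names, the numbers, and the helper functions are made up and do not reflect how RdG stores anything.

```python
# Assay data: a table of samples vs. measurements (hypothetical values).
assay_table = {
    "sample-001": {"potassium_mmol_per_l": 4.1, "sodium_mmol_per_l": 140.0},
    "sample-002": {"potassium_mmol_per_l": 3.9, "sodium_mmol_per_l": 138.0},
    "sample-003": {"potassium_mmol_per_l": 4.4, "sodium_mmol_per_l": 141.0},
}

# Metadata: interlinked statements as (subject, predicate, object) triples.
# All names below are invented placeholders, not terms from a real ontology.
metadata = [
    ("sample-001", "part_of_trial", "trial-A"),
    ("sample-002", "part_of_trial", "trial-A"),
    ("sample-003", "part_of_trial", "trial-B"),
    ("sample-001", "measured_on", "instrument-7"),
    ("sample-002", "measured_on", "instrument-9"),
    ("trial-A", "storage_protocol", "protocol-12"),
]


def subjects(predicate, obj):
    """Return all subjects that have the given predicate pointing at obj."""
    return [s for s, p, o in metadata if p == predicate and o == obj]


def values_for_trial(trial, measurement):
    """Follow the metadata links to a trial, then read values from the table."""
    return {
        sample: assay_table[sample][measurement]
        for sample in subjects("part_of_trial", trial)
        if measurement in assay_table.get(sample, {})
    }


if __name__ == "__main__":
    print(values_for_trial("trial-A", "potassium_mmol_per_l"))
```

The point of the sketch is only that the two parts answer different questions: the table gives you the numbers, while the linked statements tell you which numbers belong together and under which conditions they were produced.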
Finally, one last thought. Frans pointed out that there is a lot of ‘biobanking expertise’ buried in papers and in the heads of experienced lab people. I think he shared a dream with many people on the Semantic Web front when he said: ‘wouldn’t it be nice to just be able to look up whether anyone knows the stability of potassium in tube type X?’ Now that sounds exactly like a nanopublication to me. I have the feeling that ConceptWiki can play a number of roles in this story.
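For readers who have not met nanopublications: the idea is a single small assertion packaged together with its provenance and publication information. A rough Python sketch of what Frans's question could look like in that shape follows. The class, field names, and placeholder strings are my own assumptions for illustration; it is not the nanopublication schema itself, and the stability value is deliberately left unfilled because I do not have it.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Nanopublication:
    """Hypothetical container: one assertion plus who/how/when around it."""
    assertion: tuple        # (subject, predicate, object)
    provenance: dict        # how the assertion was obtained
    publication_info: dict  # who published the nanopublication itself


def lookup(nanopubs, subject, predicate):
    """Return every nanopublication asserting something about subject/predicate."""
    return [
        n for n in nanopubs
        if n.assertion[0] == subject and n.assertion[1] == predicate
    ]


# Placeholder content only: no real measurement or source is implied.
store = [
    Nanopublication(
        assertion=("potassium in tube type X", "has_stability", "<value not known here>"),
        provenance={"derived_from": "<paper or lab notebook>"},
        publication_info={"published_by": "<lab or person>"},
    ),
]

if __name__ == "__main__":
    for hit in lookup(store, "potassium in tube type X", "has_stability"):
        print(hit.assertion, hit.provenance)
```

Because each assertion carries its own provenance, a lookup returns not just a value but also where that value came from, which is exactly what you want for expertise that currently lives in people's heads.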