Home Welcome

  • Leverage billions of government investments:
    Use open source products for bioinformatics in R&D.

  • Did you know that more than 20 petabytes of public bioinformatics data are available right now?
    Learn how to leverage it with open source tools.

  • Our core business is developing web applications that visualize large amounts of data.

News Latest Posts

Rewiring tranSMART from product to platform

A year ago, tranSMART was the internal research data warehouse for translational research at Johnson & Johnson, and for a number of projects in which J&J companies participated, such as IMI U-BIOPRED. It already had considerable traction at that point, including a CIO 100 award and a Bio-IT World Best Practices Award, both in 2010. But the really disruptive step was taken by J&J (or Janssen, as it’s called nowadays) and Recombinant Data (now Recombinant by Deloitte) when they finalized the legal process to make tranSMART open source and published the source code on the internet, now a little more than a year ago.

At first, not much seemed to happen with that source code. Sure, a lot was going on behind the scenes. Other major pharmaceutical companies started doing pilots with tranSMART. University medical centers in the United States and across the world started to investigate tranSMART and its possibilities. Disease foundations started dialogues about how they could leverage tranSMART for sharing data and knowledge. Standards organizations and regulating government agencies showed interest. Some companies even invested in implementing new functionality in tranSMART. But from an open source point of view, tranSMART still largely remained a proprietarily built software product which happened to have (parts of) its source code published.

This was bound to change, however. And it did! For me, as a long-term open source advocate, it is fascinating to see how the disruptive power of the sharing and collaboration values behind the open source philosophy works its magic even in the IP fortresses of this world – the pharmaceutical companies. The step that the J&J scientists took proved to be visionary. It was just a matter of time before large public and public-private initiatives started picking up the opportunities that an open source platform with the nature of tranSMART could provide. An important project in this space is IMI eTRIKS, a joint EFPIA and European Union project that has the mission of supporting all other IMI projects with a shared IT infrastructure. The eTRIKS consortium decided to leverage tranSMART to build this infrastructure. A comparable project in which we are heavily involved, CTMM TraIT, has the same mission and commitment for 20+ CTMM translational research projects funded by the Dutch government. And it is only a matter of time before other national and international initiatives around translational research realize the enormous potential that collaboration around a platform like tranSMART could provide, both in software and data interoperability, and will join in.

Initial Work on tranSMART’s “core”

Introduction

At the tranSMART Developer Workshop in London last February, there was a consensus that the future of tranSMART should include a core that would implement the essential functionality of the application. The rest would be built upon this core, opening the way for having a more modular architecture – with a stable API against which plugins could be developed – and for a better quality code base, since writing the core would imply at least a partial rewrite of tranSMART’s haphazard code.

The exact boundaries of this core were quite unclear after the workshop, but it could be inferred that it should include a data model and an interface for accessing and possibly submitting the data.

Modules core-api and core-db

Given the interest of the community in developing a core component, we decided to start development and see what this component could look like.

We decided to create two modules:

The module core-api is a Groovy Maven project with the interfaces needed to program against the core. It contains very little logic.

The module core-db is a Grails plugin that implements core-api and adds other functionality upon which the tranSMART Grails application relies. The transmartApp project, which is the current, monolithic tranSMART Grails application, was made to depend on core-db (transmartApp does support a plugin mechanism, which is used for an R plugin, but it is nevertheless tightly coupled to that “plugin”). This allows for a transitional phase where functionality can be moved gradually from transmartApp to core-db and other future plugins.

It should be noted that core-db probably has too large a scope right now. Arguably, it should only implement core-api and not include other functionality like controllers. This can be addressed in the future.

Elimination of i2b2 Application Dependency

The tranSMART application requires having i2b2 running on a JBoss server with which it can communicate. We decided to focus our initial core development effort on eliminating this runtime dependency. The task was small enough to be feasible in a few weeks.

Set up a tranSMART Postgres development environment

Earlier, I blogged about how to set up a tranSMART development environment – this was the GPL 1.0 release which has Oracle as a prerequisite. Nowadays, we have a version that runs on Postgres – it still has some problems, but it is promoted as the current master version.

Recently we had a tranSMART hackathon at Imperial College, organized by Yike Guo and colleagues and the eTRIKS project. I was invited to chair this session, even though I am not formally part of eTRIKS, which is just awesome – note how open source community building is not only preached but also practiced by eTRIKS here!

However, at this workshop I found out that many developers who are willing to have a look at tranSMART are struggling to even get it installed locally. That is a very bad situation, because the entry barrier to the open source community / code should be as low as possible, and I am sure that for many developers, getting it installed and running locally is just the first step toward getting a feeling for what tranSMART is.

So in this blog post, I would like to walk through the steps of installing a local development environment for the current Postgres-based master branch of tranSMART, using an installer that my colleague Pieter Lukasse wrote. I will use the Mac OS X operating system as an example, but the steps are very similar under Linux and even on Windows.

Decide where you want the stuff
I put all my custom applications in a folder called /app, and throughout this example I will use /app/transmart-dev as a base path.

Getting the prerequisites
The README specifies that we have to install a JDK, Git, PostgreSQL, Ant and Grails. Ant is already installed by default on OS X, and so is a JDK. If you are not using a package manager for OS X yet, I would recommend homebrew – it’s easy to use, flexible, and installs all tools in /usr/local. Using homebrew, we can easily install the other tools:

brew install git postgresql grails
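
To confirm that all prerequisites ended up on the PATH, a small check like the following can help. It is only a sketch: the command names (java, git, psql, ant, grails) are assumed to be the binaries that these packages install, and missing_tools is a helper made up for this example.

```shell
#!/bin/sh
# Print the name of every given command that cannot be found on the PATH.
# (missing_tools is a hypothetical helper, not part of the tranSMART installer.)
missing_tools() {
    for tool in "$@"; do
        command -v "$tool" >/dev/null 2>&1 || printf '%s\n' "$tool"
    done
}

# Check the prerequisites named in the README (binary names are assumptions).
missing=$(missing_tools java git psql ant grails)
if [ -n "$missing" ]; then
    echo "Missing prerequisites:" $missing
else
    echo "All prerequisites found."
fi
```

Anything that is reported as missing can then be installed with homebrew before continuing.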



We are looking for experienced frontend and backend developers

Does working with scientists, to create applications that visualize and explore large amounts of biomedical data, sound like your kind of game? Then you have come to the right place!

We are looking for experienced back-end and front-end developers:

Experienced Java Software Engineer

Experienced Front-end Developer



Set up a tranSMART development environment

In an earlier post, I shared my experiences with installing tranSMART on Amazon, but there are now AMIs available for the tranSMART 1.0 GA version, which makes that post obsolete. However, I like to be able to code on the road when I have to travel, so I wanted to have a local development environment on my MacBook. Since Oracle XE is not supported on Mac OS X, this is not trivial, but by setting up a virtual machine it is possible to make a transparent local tranSMART development setup.

First of all, we need virtualization software. You could just install VirtualBox if you haven’t already; Oracle provides it for free and in my experience it works well. Next, we need a virtual machine image. I hate install wizards, so I’m happy to re-use the ready-to-go VirtualBox images of CentOS from the virtualboxes.org team (http://virtualboxes.org/images/centos), and tweak that one. I used the following CentOS image:

http://sourceforge.net/projects/virtualboximage/files/CentOS/5.6/Centos-x86_64.7z/download

Just extract the downloaded image, import it into VirtualBox and you are almost ready to go. Before you start the image, configure the networking options for the network card in the virtual machine. I chose NAT, because I want to have all tranSMART stuff running as if it was on my local machine. To achieve that, we have to forward all the ports needed by the tranSMART software (see screenshot below). Also, we expose the virtual machine’s SSH port as 2222 on the MacBook, so we can use Terminal instead of the crappy VirtualBox console window to manage the virtual machine.
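
The same port forwarding can also be set up from the command line with VBoxManage instead of the settings dialog. The sketch below only prints the commands so you can review them first. The VM name and the Tomcat port 8080 are assumptions for illustration; only the SSH mapping to port 2222 comes from the text above.

```shell
#!/bin/sh
# Build a VirtualBox NAT port-forwarding rule: name,tcp,,host_port,,guest_port
# (natpf_rule is a hypothetical helper for this example).
natpf_rule() {
    printf '%s,tcp,,%s,,%s' "$1" "$2" "$3"
}

VM_NAME="centos-transmart"   # assumed name; use the name shown in VirtualBox

# Each entry: rule name, port on the MacBook, port inside the virtual machine.
# The tomcat entry is an assumption; add one entry per port tranSMART needs.
for spec in "ssh 2222 22" "tomcat 8080 8080"; do
    set -- $spec
    echo VBoxManage modifyvm "$VM_NAME" --natpf1 "$(natpf_rule "$1" "$2" "$3")"
done
```

Run the printed commands while the virtual machine is powered off. After booting it, ssh -p 2222 with your account on localhost should reach the guest.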


Phenotype Database installation guide, GSCF

This post will assist you with installing (parts of) the Phenotype Database project [1]. We will be using a CentOS installation [2]. CentOS uses the source code from the commercial Red Hat Enterprise Linux project [3]. Currently the project’s main component already has such an installation guide, written for a Debian GNU/Linux installation. That guide can be found in the repository [4].
The Phenotype Foundation’s software project consists of multiple components, which are called modules. These modules generally have the same requirements. We will start by looking at the main module, called the Generic Study Capture Framework (GSCF). The commands provided here assume that the account you will be using for the installation has the required permissions.

Java

The project uses the Java Virtual Machine. At this point in time, there is no dependency on a particular vendor. If no Java runtime has been installed yet, we will have to do so. We can test whether we have one by running the following command:

java -version

If this runs fine, then we have a Java version. We need Java version 1.6 at the least. If you don’t have the right version, or don’t have any version, you will need to install one. You can install the open source implementation OpenJDK or the commercial version by Oracle. We can install the open source version with the following command:

yum install java-1.6.0-openjdk
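
If you would rather have the version check scripted, something along these lines could work. It is a sketch that assumes java -version prints a line containing version "x.y.z", which is the usual format; version_at_least_1_6 is a helper made up for this example.

```shell
#!/bin/sh
# Succeed when a version string such as 1.6.0_24 or 11.0.2 is at least 1.6.
# (hypothetical helper; only the first two numeric fields are compared)
version_at_least_1_6() {
    major=${1%%.*}
    rest=${1#*.}
    minor=${rest%%.*}
    [ "$major" -gt 1 ] 2>/dev/null && return 0
    [ "$major" -eq 1 ] 2>/dev/null && [ "$minor" -ge 6 ] 2>/dev/null
}

if command -v java >/dev/null 2>&1; then
    # java -version writes to stderr, so redirect it before parsing.
    found=$(java -version 2>&1 | sed -n 's/.* version "\([^"]*\)".*/\1/p' | head -n 1)
    if version_at_least_1_6 "$found"; then
        echo "Java $found is recent enough."
    else
        echo "Java ${found:-(unknown version)} is older than 1.6 or could not be parsed."
    fi
else
    echo "No java command found."
fi
```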

PostgreSQL

The database we will be using is PostgreSQL. We can install this with the following command:

yum install postgresql-server

To see if the postgresql service is running, we do:

service postgresql status

You may get a negative answer, as I did, in which case we can issue the following command to start the service:

service postgresql start

Either way, since we are configuring a server, we will want this service to start when we start up the server. In my case, the installer did not configure this for me. You can easily test this out for yourself, either by restarting your server, logging in and then checking if the service has been started, or by looking at the output of

chkconfig --list postgresql

and confirming that runlevels 2, 3 and 4 are on.
This is one way to set the service to start at boot:

chkconfig --add postgresql; chkconfig --level 234 postgresql on
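
Since we will repeat this boot-time check for tomcat and httpd later on, it may be worth scripting. The following is a sketch that assumes the usual chkconfig --list output format, a service name followed by fields such as 2:on; runlevels_234_on is a helper made up for this example.

```shell
#!/bin/sh
# Succeed when a line of `chkconfig --list` output shows runlevels 2, 3 and 4 as on.
# (hypothetical helper; assumes fields of the form N:on / N:off)
runlevels_234_on() {
    for level in 2 3 4; do
        case "$1" in
            *"$level:on"*) ;;
            *) return 1 ;;
        esac
    done
}

# Only query chkconfig where it exists (e.g. on CentOS).
if command -v chkconfig >/dev/null 2>&1; then
    line=$(chkconfig --list postgresql 2>/dev/null)
    if runlevels_234_on "$line"; then
        echo "postgresql is set to start at boot."
    else
        echo "postgresql is not enabled for runlevels 2, 3 and 4."
    fi
fi
```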

If you were to reboot after running this command, you would likely be prompted by a ‘setup agent’ – you can safely ignore this and allow booting to resume by choosing the ‘exit’ option.
To configure postgres, we will switch to the postgres user account:

su - postgres

We will start the PostgreSQL interactive terminal:

psql

We will enter some commands into this terminal. If the terminal responds to you with text that starts with “ERROR:”, then, yes, something is going wrong. Make sure to use the plain " and ' characters, as shown in the following command examples.
We will create a postgres user called ‘mydbuser’ with password ‘mydbpassword’:

create user mydbuser password 'mydbpassword';

Of course, you should replace these authentication details with values that seem sensible to you.
Now we will create a database:

create database "mytestdb";

Finally, we will tell postgres that our new user has all privileges for the new database, and that our new user owns that database:

grant all privileges on database mytestdb to mydbuser;
alter database mytestdb owner to mydbuser;
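
If you set up more than one server, you may prefer to generate these statements from a script instead of typing them into psql. The sketch below only prints the SQL; db_setup_sql is a helper made up for this example, and it does no escaping, so it assumes that the user name, password and database name contain no quote characters.

```shell
#!/bin/sh
# Print the SQL that creates a user and a database owned by that user.
# Arguments: user, password, database (hypothetical helper, no escaping).
db_setup_sql() {
    cat <<EOF
create user $1 password '$2';
create database "$3";
grant all privileges on database "$3" to $1;
alter database "$3" owner to $1;
EOF
}

# Print the statements for the example values used in this guide.
db_setup_sql mydbuser mydbpassword mytestdb
```

When run as the postgres user, the output can be piped straight into psql, which reads statements from standard input.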

We are done with the psql program. We will exit psql:

\q

Now we will log out of the postgres account and go back to our own account:

exit

We will change some PostgreSQL settings, to make our install more secure. The file we will edit is /var/lib/pgsql/data/pg_hba.conf. We issue the following command:

nano /var/lib/pgsql/data/pg_hba.conf

Now we will scroll to the bottom. We should find something like this:

# TYPE  DATABASE    USER        CIDR-ADDRESS          METHOD

# "local" is for Unix domain socket connections only
local   all         all                               ident sameuser
# IPv4 local connections:
host    all         all         127.0.0.1/32          ident sameuser
# IPv6 local connections:
host    all         all         ::1/128               ident sameuser

We will change the authentication method of the last two entries so that connections over TCP require a password. The method md5 is assumed here, because the application logs in over localhost with a username and password; we end up with the following:

# TYPE  DATABASE    USER        CIDR-ADDRESS          METHOD

# "local" is for Unix domain socket connections only
local   all         all                               ident sameuser
# IPv4 local connections:
host    all         all         127.0.0.1/32          md5
# IPv6 local connections:
host    all         all         ::1/128               md5

To make sure that our postgresql service is aware of these changes, we restart it:

service postgresql restart
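
To double-check which authentication method the TCP entries ended up with, we can list the address and the method of every host line. This is a sketch that assumes the five-column layout shown above; host_auth_methods is a helper made up for this example.

```shell
#!/bin/sh
# Print "address method" for every host entry in a pg_hba.conf file.
# Reads the files given as arguments, or standard input when there are none.
# (hypothetical helper; assumes TYPE DATABASE USER CIDR-ADDRESS METHOD columns)
host_auth_methods() {
    awk '$1 == "host" { print $4, $5 }' "$@"
}

HBA=/var/lib/pgsql/data/pg_hba.conf
# The file is normally readable by postgres and root only.
if [ -r "$HBA" ]; then
    host_auth_methods "$HBA"
fi
```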

Tomcat

GSCF consists of several files, which will be wrapped in a so-called container. This container is a WAR file. WAR stands for web application archive. We need a program that can ‘serve’ the contents of such a container. We will be using the Apache Software Foundation’s Tomcat web server, version 7.

yum install tomcat7

This version may not be in the repositories that your CentOS version uses. In that case you will have to install it manually. In this guide we will assume that your OS does in fact have tomcat7 in one of its repositories. Either way, don’t forget to check if the service starts along with the server, as we did with the postgresql service. If not, make sure it does.

Installing our application

First, we will stop the tomcat service from running. If you have a proper install, you will probably use the following command to do so:

service tomcat7 stop

This script may not exist yet and you may need to drop the version number, depending on how exactly you installed tomcat. We will place the GSCF WAR-file (which can be downloaded from GitHub) in tomcat’s webapps directory. This directory is probably located at /var/lib/tomcat7/webapps. Confirm the location of the webapps folder. Next, copy the WAR-file to that location.

cp gscf-www.war /var/lib/tomcat7/webapps/gscf.war

GSCF Configuration file

GSCF requires a configuration file. The following is an example of what its contents could look like.

# server URL
grails.serverURL=http://test.dbxp.org

# DATABASE
dataSource.driverClassName=org.postgresql.Driver
dataSource.dialect=org.hibernate.dialect.PostgreSQLDialect
dataSource.url=jdbc:postgresql://localhost:5432/gscf-www
dataSource.dbCreate=update
dataSource.username=mydbuser
dataSource.password=mydbpassword
#dataSource.logSql=false

# SpringSecurity E-Mail Settings
grails.plugins.springsecurity.ui.forgotPassword.emailFrom=gscfproject@gmail.com

# module configuration
#modules.sam.url=http://sam.test.dbxp.org
#modules.metabolomics.url=http://metabolomics.test.dbxp.org
#modules.metagenomics.url=http://metagenomics.test.dbxp.org

# default application users
authentication.users.admin.username=admin
authentication.users.admin.password=admiN123!
authentication.users.admin.email=admin@dbnp.org
authentication.users.admin.administrator=true
authentication.users.user.username=user
authentication.users.user.password=useR123!
authentication.users.user.email=user@dbnp.org
authentication.users.user.administrator=false

# override application title
application.title=Phenotype Database

# use shibboleth authentication?
authentication.shibboleth=false

You will have to modify the contents of the file so that it corresponds with your setup: at least the server URL, and the database connection credentials if they differ from those mentioned above. The use of the modules is optional. The file has to be placed in a .gscf directory in the home directory of the user under which the tomcat process runs (e.g. /home/tomcat7/.gscf).
As you can see, we have set the grails.serverURL property to http://test.dbxp.org. We will be using this address to access our application, using the Apache webserver. If you wish to try the application locally or without Apache, then you should set this property to the server’s IP address, with port number 8080, followed by a slash and the application name. For example: 192.168.0.100:8080/gscf-0.9.0. It could also be 192.168.0.100:8080/gscf, if you created the symbolic link. If you are unsure what the application name is, we can find out in the next section. The locations of additional modules have been commented out in this example. This is not a problem, as modules can be added anytime through GSCF. Make sure the right database details are set. Change the authentication details that are listed under “default application users” into whatever authentication details you wish to use. New users can be added at any time through GSCF.
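
Before starting tomcat, it can save some log reading to verify that the settings you certainly need are present and not commented out. The sketch below checks a properties file for a list of keys. missing_keys is a helper made up for this example, and the path is a placeholder that you should replace with the location of your own configuration file.

```shell
#!/bin/sh
# Print every key that has no uncommented "key=value" line in the given file.
# Arguments: file, then one or more keys (hypothetical helper).
missing_keys() {
    file=$1
    shift
    for key in "$@"; do
        awk -F= -v k="$key" '$1 == k { found = 1 } END { exit !found }' "$file" \
            || printf '%s\n' "$key"
    done
}

# Placeholder path: pass the real location of your file as the first argument.
CONFIG=${1:-/path/to/your/gscf/config}
if [ -r "$CONFIG" ]; then
    missing_keys "$CONFIG" grails.serverURL dataSource.url \
        dataSource.username dataSource.password
fi
```

No output means that all four keys were found.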

Tomcat’s permissions

Your tomcat application should have been set up such that it is started by the tomcat user. We need to make sure that the tomcat user has all the permissions that it needs to. One way to do that could be as follows:

cd /usr/share/tomcat7/
chown tomcat:tomcat . -R
cd ./webapps
chmod gu+rx *.war -R

This particular chown command sets all files in and under this directory to be owned by user tomcat and to be associated with the tomcat group. This particular chmod command makes all .war files in the current directory (the webapps directory) readable and executable by the file’s owner and members of the file’s group. Remember that new .war files should be made readable and executable for the tomcat user and tomcat group too.

Starting our application

The logfiles for tomcat are probably located in the folder /usr/share/tomcat7/logs. We will open two sessions to our server, one to start tomcat and one to look at its main log. One way to look at a log file and be kept updated of changes to it, is to use the tail program with the -f option.

tail /usr/share/tomcat7/logs/catalina.out -n 500 -f

This command shows the last 500 lines of the catalina.out file and then keeps printing new lines as they are written. We will be looking at this output to see if our application starts properly. In a second session, we start tomcat. At some point, an entry like the following should appear.

INFO: Deploying web application directory /usr/share/tomcat7/webapps/gscf

Finally, the log should say something like this:

INFO: Server startup in 62939 ms
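
If you start tomcat from a script rather than from a second session, you could wait for this line instead of watching the log yourself. The following is a sketch: wait_for_startup is a helper made up for this example, and because it simply looks for the text anywhere in the file, it assumes a log that does not still contain the startup line of an earlier run.

```shell
#!/bin/sh
# Wait until the log file contains "Server startup in", or give up after the
# given number of seconds. Arguments: logfile, timeout in seconds.
# (hypothetical helper; polls once per second)
wait_for_startup() {
    waited=0
    while [ "$waited" -lt "$2" ]; do
        grep -q 'Server startup in' "$1" 2>/dev/null && return 0
        sleep 1
        waited=$((waited + 1))
    done
    return 1
}

# Example use, with the log location assumed earlier in this guide:
#   wait_for_startup /usr/share/tomcat7/logs/catalina.out 300 && echo "tomcat is up"
```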

We can test if our application’s homepage does indeed load, by browsing to the application’s address. It should be something like http://localhost:8080/gscf. If you want to confirm this from the server but don’t want to install any browser, you can probably do something like this:

wget http://localhost:8080/gscf --output-document="/dev/null"

This command should display several lines of output. If the application can be found, you will find lines like the following among the output:

--2012-08-21 13:38:24--  http://localhost:8080/gscf/
Connecting to localhost|127.0.0.1|:8080... connected.
HTTP request sent, awaiting response... 200 OK

The 200 OK indicates that the homepage could be loaded just fine.
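
When scripting this check, we can pull the status code out of wget’s output rather than reading it by eye. The sketch below assumes the “awaiting response...” wording shown above; last_http_status is a helper made up for this example. Because it keeps only the last status, a redirect followed by the real page is reported as the final code.

```shell
#!/bin/sh
# Read wget's log output on standard input and print the last HTTP status code.
# (hypothetical helper; relies on wget's "awaiting response... NNN" lines)
last_http_status() {
    sed -n 's/.*awaiting response\.\.\. \([0-9][0-9][0-9]\).*/\1/p' | tail -n 1
}

# Example use; wget writes its log to stderr, hence the redirect:
#   wget http://localhost:8080/gscf --output-document=/dev/null 2>&1 | last_http_status
```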

Setting up access to our application

We probably want our application to be accessible from outside our local network, at a specific URL. We have already set the DNS record for this URL to point to our server. We will be setting up the Apache webserver (httpd) for this. First, we will install it:

yum install httpd

Again, we should now make sure that the service is started on boot.

We will make sure httpd is not running:

service httpd stop

You may get an error message that says FAILED; this is fine. It just means that the service wasn’t running yet.
We will configure the Apache webserver to load the modules we need. The installation directory is assumed to be /etc/httpd/. First we will list out those modules we want:

ls /etc/httpd/modules | grep -e "_rewrite" -e "_proxy"

We will now check if these are listed somewhere at the top of the httpd.conf file. This file can be opened by issuing this command:

nano /etc/httpd/conf/httpd.conf

After pushing the “Page Down” key on our keyboard a few times, the “Dynamic Shared Object (DSO) Support” listing should scroll into view. We should check if the previously mentioned files are listed here. If not, we should list them in the same way that these other files are listed, e.g. the file located at /etc/httpd/modules/mod_rewrite.so should be listed as follows:

LoadModule rewrite_module modules/mod_rewrite.so

The pattern for these entries is as follows:

LoadModule NAME_module modules/mod_NAME.so
LoadModule EXTRA_LONG_NAME_module modules/mod_EXTRA_LONG_NAME.so
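
Following this pattern, the expected entries can be generated from the file names and compared with httpd.conf. This is a sketch under the directory assumptions made above; loadmodule_line is a helper made up for this example, and the comparison only recognizes entries written exactly in this form.

```shell
#!/bin/sh
# Turn a module file name such as mod_rewrite.so into its LoadModule entry.
# (hypothetical helper implementing the naming pattern described above)
loadmodule_line() {
    name=${1#mod_}
    name=${name%.so}
    printf 'LoadModule %s_module modules/%s\n' "$name" "$1"
}

MODULES=/etc/httpd/modules
CONF=/etc/httpd/conf/httpd.conf

# Print the entries for the rewrite and proxy modules that httpd.conf lacks.
if [ -d "$MODULES" ] && [ -r "$CONF" ]; then
    for so in "$MODULES"/mod_rewrite.so "$MODULES"/mod_proxy*.so; do
        [ -e "$so" ] || continue
        line=$(loadmodule_line "$(basename "$so")")
        # -x matches whole lines only, so commented-out entries do not count.
        grep -qxF "$line" "$CONF" || echo "$line"
    done
fi
```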

Now that Apache knows to load the modules we want, we will configure Apache to serve up our web application.
We will be using the address test.dbxp.org, and we will use that address to name our configuration file. We will now create the as yet nonexistent file by “touching” it:

touch /etc/httpd/conf/test.dbxp.org.conf

The following is an example of what this file could contain. It is set up to use the previously mentioned URLs and directories, so you should change this to reflect your changes.

# NOTE: the enclosing <VirtualHost>, <IfModule> and <Location> tags below are
# reconstructed from the indentation of the listing; adjust them to your setup.
<VirtualHost *:80>
    ServerName test.dbxp.org
    ServerAlias test.gscf.dbxp.org

    ErrorLog /var/log/httpd/gscf-test-error.log
    CustomLog /var/log/httpd/gscf-test-access.log combined

    <IfModule mod_rewrite.c>
        RewriteEngine on

        # keep listening for the serveralias, but redirect to
        # servername instead to make sure only one user session
        # is created (tomcat will create one user session per
        # domain which may lead to two (or more) usersessions
        # depending on the number of serveraliases)
        # see gscf ticket #321
        RewriteCond %{HTTP_HOST} ^test.gscf.dbxp.org$ [NC]
        RewriteRule ^(.*)$ http://test.dbxp.org$1 [R=301,L]

        # rewrite the /gscf-a.b.c-environment/ part of the url
        RewriteCond %{HTTP_HOST} ^test.dbxp.org$ [NC]
        RewriteRule ^/gscf/(.*)$ /$1 [L,PT,NC,NE]
    </IfModule>

    <IfModule mod_proxy.c>
        <Location />
            ProxyPass http://localhost:8080/gscf/
            ProxyPassReverse http://localhost:8080/gscf/
        </Location>
    </IfModule>
</VirtualHost>

Information on properly load balancing GSCF can be found at the bottom of the INSTALLATION.md file.

We will keep an eye on the logs while we issue the command to start the httpd.
These log files will be located in /var/log/httpd. Right now this directory might be empty, but it will soon contain at least the following files:

access_log  
error_log  
ssl_access_log  
ssl_error_log  
ssl_request_log

To look at those log files we might use a command like this, using a different session:

tail /var/log/httpd/error_log -n 500 -f

The error_log will be the most interesting one right now, so let’s look at that one.

Let’s start the webserver:

service httpd start

Your GSCF instance should now be up and running at the URL (and alias URL) you have chosen. Keep an eye on the tomcat and httpd logs; they may help you with troubleshooting.
httpd’s error_log probably tells us something like the following:

Apache/2.2.3 (CentOS) configured -- resuming normal operations

Combining Grails with less-structured data: GORM or no GORM?

Using Grails Object Relational Mapping (GORM) tends to work quite well when using conventional, structured data. At that point the type of database you are using, relational or otherwise, doesn’t really matter. GORM’s dynamic finders [1] are a pleasure to use. Also of note are the way in which one can add custom validation functions [2] or switch between deep and shallow validation on the fly [3], its cascading behaviour options, and its lazy and eager fetching options [4]. However, there is at least one reason why one might want to avoid using such a layer, and thus lose access to this functionality.

When writing in a highly iterative manner, as might occur when working on a prototype, it can be of benefit to use a schemaless database like MongoDB. One might find that using a more conventional approach to Grails-based development, with Domain objects that dictate the exact structure your persistent data can have, can take away a lot of the benefits of using a schemaless database, which doesn’t care much what structure your data has. Where one does use GORM, things may turn out to be a lot less easy to update or refactor than one had hoped. For one, this requires writing database upgrade scripts, as Grails will complain when handling Domain objects that do not have the expected fields set in the expected manner – and rightly so! Another thing is that some views may have been written in a less flexible way, requiring additional work. This additional work can slow down your progress considerably, and some of the advantages of using GORM don’t really surface when writing a prototype.

One might choose to skip creating Domain objects and have a service talk directly to the database instead. This can cut down on the amount of work required when making a change in how data is stored, because your database is schemaless. Just as important is that the related functionality should be written in a modular and layered manner, such that a minimum of code needs to be rewritten.


Using the same MongoDB database with Grails and non-Grails applications

In one of our projects we had reason to combine a Grails-based web application with non-web applications. We were using SpringSource’s MongoDB GORM plugin [1] for our Grails-based application. We also had applications that operated on the same database as the web application; for these we used the official MongoDB Java driver [2].

By default, Grails uses a sequence of increasing java.lang.Long-type values as the database identifier of its Domain objects. One easy way to change the type of the “id” property of your Domain object is to explicitly declare “id” as a different type. If you wish to use a Mongo database in combination with your Grails application as well as other applications, it may be a good idea to set the relevant Domain objects’ “id” property to be of type org.bson.types.ObjectId, as this is the default type that the official MongoDB drivers will use to set the “_id” field [3]. This would also scale better when one uses database clusters. However, if you wish to use the standard Grails database identifier, or the project is already well underway by the time you have reason to have other applications operate on the same database, changing the type of the database identifier may not be an option.

If you have both a Grails project with a Long as the identifier type and a project using the official Java driver inserting to and retrieving from the same collections, you will get a mix of both identifier types. This will most likely lead to problems, perhaps with your Grails project, as GORM will encounter ObjectId-based field values that it will try to convert to a Long.
To solve this problem from the non-Grails side we need to make sure that, when inserting a document, the “_id” field of the document is a Long value larger than the largest already present for this Collection. We might want to have a function that reserves the id for us, and makes note of this in a special Collection. This sort of solution is called ‘optimistic locking’.


How to run the tranSMART open source package for translational research on Amazon AWS

Great news for translational research: Johnson & Johnson recently open sourced their translational medicine data warehouse, called tranSMART. A paper on tranSMART was published in 2010, and just a few days ago a first version of the source code was put on GitHub (see also the tranSMART project home page).

In this blog post, I will do a walkthrough of how to get tranSMART running on Amazon Web Services. It is a Grails project, so that part is not too complex to deploy, but it also has a lot of dependencies, including an Oracle database, the open source i2b2 project, and various bioinformatics tools such as GenePattern.

Obtaining a host machine

First, we need a suitable host machine. A large part of the software runs on the JVM, so it should be relatively cross-platform. A Linux server would be an obvious choice. But we also need Oracle, so we are a little bit limited in which Linux flavors we can use here. For this walkthrough, I will use Oracle XE, since I would like to use “freely” available and open source components as much as possible. Oracle XE is available under the Oracle Technology Network Developer License Terms, and you need to create an account with Oracle for that. Licensing for this project is horribly complicated anyway: i2b2 has its own i2b2 Software License, Oracle has its OTN license (but you will want to get a commercial one for a serious deployment), GenePattern has its own license, and the tranSMART Grails application itself is released under GPLv3.


Why open source? (2)

So what does open source mean, really? I would like to skip the legal part and review the community side in this post. What defines and makes a successful open source project? Why have projects such as Linux, Firefox, Apache, MySQL, Postgres, Plone etc. changed the world to an extent that 20th century IT giants such as Microsoft now have to change strategy and claim to be committed to openness?

There are a number of great reads that go into these questions in depth. The classic read is of course Eric Raymond’s essay The Cathedral and the Bazaar. I also like the article The Transformation of Open Source Software by Brian Fitzgerald et al, published in MIS Quarterly in 2006, and this study on Plone. Fitzgerald emphasizes a difference between FOSS (‘Free and Open Source Software’) and what he calls OSS 2.0, ‘a more mainstream and commercially viable form’ of open source software. He makes the argument that contrary to the grassroots FOSS efforts, in OSS 2.0, open source is used as a deliberate product placement strategy by commercial vendors. If you have a good open source product and you can activate a community of developers around it, chances are you get more back from the community than you put in as a company in the first place. And best of all, you will be able to make money by providing commercial services around it, such as Service Level Agreements. The community also gains from that, because you ‘commit’ improvements back into the community project. This is by now a proven business model, as I pointed out in my last blog post.

But this is especially true if you create a platform, on which other parties can build their own tools. Think about the hugely popular games in Facebook – Farmville, Mafia Wars etc. By just providing application developers a good API and documentation on how to build applications for Facebook, the games – created by enthusiasts and companies from all over the world – now are a major contributing factor to the ever-increasing popularity of the platform, which still grows worldwide despite its questionable privacy policy decisions. Google can learn a lot from Facebook in this respect, as discussed in an internal letter by Google’s own Steve Yegge – see this blog post.


Read more

Why open source? (1)

When I explain to people what The Hyve does, I try to observe their reactions. This is fun, but it also tells me a lot about how they feel about our business model, which is important feedback to me. Often they smile and nod, until I get to the point where I explain that we deliver our services by using and providing open source software. “But how do you then make money?” is a question I get very often. “Isn’t that contradictory, that you give the products you develop away as open source, but at the same time you try to make money with them?” If I then go on to explain that most of our current projects are EU or national government funded projects, or come from generous clients like TNO Quality of Life who care about sharing and disseminating their research, they feel relieved. Their model of the world doesn’t have to fall apart, business is still business and IP is still crucial in making a biotech startup successful.

However, frankly, that’s not what I want to achieve in the end! Put very boldly, I would like to turn the bioinformatics business upside down. Many bioinformatics companies have become big by perfecting and maintaining their software, and generate a stable income by selling licenses for it. This is a perfectly valid business model, in which the money you earn from licenses allows you to maintain the products and keep adding new and exciting features to them. This worked in the nineties, and as of 2011, it still works. It is the safe choice for pharma and nutrition companies that are struggling to defend their R&D costs in bioinformatics and especially systems biology, which, as I learned yesterday, still makes only a poor to moderate contribution to actual drug development.

This isn’t the path I want to tread, though. My prime example is a technology startup called Cloudera. Cloudera has a strong relationship with an open source project called Apache Hadoop, recognizable by the elephant logo. Hadoop is a software solution which allows for storage of and computations on big data on commodity hardware, replicating data three times so that it doesn’t matter when a few machines fail.


Read more

A business case for semantic data integration

What is the major bottleneck in your bioinformatics research when it comes to data and software? I asked this of a number of bioinformatics researchers from the pharma and food industries at the ISMB conference. It’s a very general question, but it never failed to provoke an answer. “Are you kidding? Where do you want me to start?” And then the story comes, about the challenges of managing the in-house data, connecting it with public data and annotations, and especially, interpreting what it all means in the context of their target biology. Because that is where the real bottleneck resides – in the data interpretation.

It’s no coincidence that the theme of the EMBL Programme 2012-2016 is ‘Information Biology’. It’s also no coincidence that the first keynote talk of ISMB was titled ‘Computational biology in the 21st century: making sense out of massive data’. Yes, we still have a lot of challenges in data management – my favorite ISMB quote was Dominic Clark presenting the results of an EBI database user survey and showing that the most common activity of database users in industry is retrieving records from the databases. With a total of 10 petabytes storage at EMBL (of which I suspect EBI databases make up a significant part, see the ELIXIR business case) it makes sense that smaller businesses cannot afford a local mirror. They will have to use online database searches to carry out their research. However, keeping in sync with EBI databases is what I would view as a small, solvable infrastructural problem. Interpreting the data, finding relevant literature, finding relevant databases, and then coming up with a sensible, testable hypothesis of the biological mechanisms at hand, that’s where the real time and money goes.

So what is the ‘rate limiting step’ in this process? I would say there are a number of them, as always in biology. One of them has to do with finding relevant literature, and extracting the information you need from that, turning it into knowledge.


Read more

From compound spectra to online games

Battling standardization in metabolomics
BETHESDA, MD – Imagine a snow-covered Washington DC, just before Christmas. That was the scene where metabolomics researchers from around the world headed on December 15, 2010. They gathered at the National Institutes of Health to tackle a problem that has been on their radar for a few years: the standardization of reporting for metabolomics experiments. Granted, it’s an intrinsically complex task: differences in handling protocols, data acquisition, experimental designs and data processing all affect the experimental outcome, not to mention the effects of measurement and biological variation! And despite efforts, started already in 2005 by the Metabolomics Society, to organize the community and write up and implement those standards, the field still lacks the standardized repositories of compounds, spectra and experiments which would enable researchers to re-use each other’s data. However, the discussion is slowly moving from the content and procedures to the actual efforts and governance structures that need to be in place to really implement this standardization. If one thing became clear at this meeting, it is that the will and vision for pulling this off is present, but there are a few missing key pieces. For example, no publishers were present at the meeting. Also, actual collaborations on programming and database levels are still limited. And there is still a lot to develop on the level of standardization itself, within and outside the scope of this initiative: how to uniquely describe not only complete, but also partially known chemical structures, which ontologies to use for the description of biological samples, or how to organize the curation of uploaded chemical structures (let’s make a computer game out of it!) The workshop, organized by the Pharmacometabolomics Research Network, will have a number of follow-up sessions in the coming year on dedicated topics, so stay tuned via http://metaboknow.org!

Databases vs Semantic Web

Today, I gave a presentation about ‘Data interoperability’ at the NBIC BioAssist meeting. I tried to recap the various things that were said at the mini-symposium on study capture I organized last week together with Morris Swertz.

There are many levels at which you can approach this, but I think that most people will agree that we have at least an interoperability challenge when it comes to storing bioinformatics data. Because no matter what type of data you are generating, in many cases you will want to connect to other databases and information out there on the web – be it genomic annotations, protein sequences, metabolites, biological pathways or even PubMed publications. Now most of those websites will have webservice connectors, either via SOAP or RESTful services serving JSON or XML. And yes, there are workflow tools like Taverna or Galaxy that can take advantage of those connectors. But I don’t want to do workflows, I just want to enrich my data with e.g. BioPortal ontologies. So I can either write all those connectors myself, or I can take advantage of libraries such as Ontocat or BioPython, depending on what’s available in my coding language, but either way I will have to change my database schema to implement these foreign identifiers. Not to mention the synchronization and dependency issues involved in that.

Now that’s usually where the Semantic Web people come in. If you store your data as triples, they argue, you have a very flexible and extensible data model – actually, you don’t have structure limitations at all. But storing my data as triples introduces multiple issues as well. For example, in NGS, performance is already an issue when working with well-designed conventional data tables, let alone if I would have to extract the data for a specific sample from a triple store with billions of triples each time I do a calculation. Also, I would need a triple store that can actually store numbers of different types – integers or floats depending on the property. Last but not least, you lose important database features such as mandatory fields or foreign keys, and keeping the data consistent and validated will be much harder.
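The trade-off described above can be made concrete with a small sketch. It is only an illustration under stated assumptions: the `sample` table, its columns and the property names are made up for this example, and a plain Python list stands in for a real triple store.

```python
import sqlite3

# Hypothetical sample data; table, column and property names are illustrative only.
conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE sample (
        id      TEXT PRIMARY KEY,
        species TEXT NOT NULL,      -- mandatory field
        reads   INTEGER NOT NULL    -- typed column
    )
    """
)
conn.execute("INSERT INTO sample VALUES ('s1', 'Homo sapiens', 1200000)")


def insert_incomplete_row():
    """Try to store a sample without its mandatory species."""
    try:
        conn.execute("INSERT INTO sample (id, reads) VALUES ('s2', 5)")
        return "accepted"
    except sqlite3.IntegrityError:
        return "rejected"


# The same facts as (subject, predicate, object) triples.
triples = [
    ("s1", "species", "Homo sapiens"),
    ("s1", "reads", 1200000),
]
# Adding a new kind of property needs no schema change ...
triples.append(("s1", "tissue", "liver"))
# ... but nothing stops an incomplete or badly typed sample either.
triples.append(("s2", "reads", "five"))


def properties_of(subject):
    """Collect all properties of one subject by scanning every triple."""
    return {p: o for s, p, o in triples if s == subject}


table_result = insert_incomplete_row()
s1 = properties_of("s1")
s2 = properties_of("s2")
print(table_result, s1, s2)
```

The table refuses the incomplete row, while the triple list happily takes both the new `tissue` property and the sample without a species. Note also that `properties_of` scans every triple for a single lookup, which hints at the performance concern once the store holds billions of them.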


Read more

Biobanking as a Service

Biobankers are people that generate massive amounts of data. Today, I visited a biobanker (Frans van der Horst at Reinier de Graaf Gasthuis, Delft) who basically asked the question: how can I store my data and everything I know about my data in such a way that bioinformaticians would profit the most from it? Which is of course a question that would get any bioinformatician started. Use Semantic Web principles! Use ontologies! Use my tools! For me, it would be tempting to say: use GSCF!

The challenges that a biobanker faces are massive. Frans pointed out that there are two different processes in his lab: one for standardized assays which are performed for e.g. general practitioners, and one for clinical trials. Obviously the latter is the one to talk about for us bioinformaticians, but it was certainly enlightening to see at what a rate you can process clinical assays once the whole process is standardized in fixed protocols.

Talking about protocols, one of the most obvious challenges is the actual ‘sample tracking’ in the lab. Each clinical trial comes with its own protocols, about sample storage conditions, sample processing steps, the assays that have to be performed, the exact measurement conditions and instruments involved; even the parcel company that delivers the samples can be specified. Christine Chicester, my colleague from NBIC, and I were astonished when Frans showed us the current system to track all that, which consists of a single binder with the paperwork for each clinical trial. But it makes sense. All these protocols are slightly different, have their own conditions attached etc. Automating and computerizing this process in a user friendly way would be one hell of a task. There are situations where we simply cannot beat good old pen and paper – and I say that as a gadget enthusiast.

However, while pen and paper might be the most efficient way in the lab, things get complicated after the assays are performed. How do I report all the assay results and couple them to the right sample identifiers? Where are the samples from that clinical trial last year that had these odd readouts, and why was that again? Did we use this protocol earlier? What samples can we expect to be delivered this week? Do we still need to keep those samples in the fridge, where are they from again? Fortunately there are already many software packages that can help with this kind of sample tracking. Frans pointed us to IDQuest, but a quick Google search gave me i3Cube, CryoTrack and ItemTracker; Wikipedia lists a dozen LIMS systems, and there are no doubt many more.

Coming back to the viewpoint of a bioinformatician, we would like to know everything that relates to the sample measurements which could potentially influence our results, and that is a lot. Frans gave the example of sample age: proteins don’t simply stay put in a blood sample, even if you store them well below freezing point. So you can get strange results when you start comparing microarray results from samples that are fresh with samples that have been in the fridge for a month. Let alone several years. Without going into the details of which information should be stored and how to get that from for example the trial protocols, what would in the end be the most effective way to store this information?

At first glance, I see two different ways of storage. There is the actual assay data, which inevitably in the end takes the form of a table of samples vs. measurements with the measured values in the cells. And there is all the metadata, such as measurement conditions, the equipment used, the calibration samples used, the trial or study which set the context for the assay etc. Not to mention all the metadata that you could get on the samples themselves, even before they arrive at the lab: from which species were they taken (although RdG only takes human samples), under which conditions etc. I would say that we should store the data in the most natural form, which would definitely be tables for the actual assay data (as it is now already), and probably some kind of interlinked data for the metadata. Which would lead us right to the old distinction of relational databases and graph databases. The real question is, of course, how we could make this work for RdG: to which ontologies can we connect to describe those metadata fields, how can we connect to compound or assay databases, how can we connect to the internal databases that are already in place? That is what we have to find out.
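To illustrate the split between tabular assay data and interlinked metadata, here is a minimal sketch. All sample identifiers, relation names and values are invented for the example, and plain Python structures stand in for a relational table and a graph database.

```python
# Assay data: a table of samples vs. measurements (all names and values are made up).
measurements = ["glucose", "potassium"]
assay_table = {
    "sample-001": [5.1, 4.2],
    "sample-002": [6.3, 3.9],
}

# Metadata: interlinked facts stored as (subject, relation, object) edges.
metadata = [
    ("sample-001", "collected_in", "trial-A"),
    ("sample-002", "collected_in", "trial-A"),
    ("sample-001", "measured_on", "instrument-7"),
    ("sample-002", "measured_on", "instrument-9"),
    ("trial-A", "follows_protocol", "protocol-12"),
    ("instrument-7", "calibrated_with", "calibration-set-3"),
]


def value(sample, measurement):
    """Look up one cell of the assay table."""
    return assay_table[sample][measurements.index(measurement)]


def linked_metadata(start):
    """Follow metadata edges outward from `start` and return every edge reached."""
    seen, todo, found = {start}, [start], []
    while todo:
        node = todo.pop()
        for subject, relation, obj in metadata:
            if subject == node:
                found.append((subject, relation, obj))
                if obj not in seen:
                    seen.add(obj)
                    todo.append(obj)
    return found


potassium = value("sample-001", "potassium")
context = linked_metadata("sample-001")
print(potassium)
for edge in context:
    print(edge)
```

The measured value comes from a simple table lookup, while the context of that value (trial, protocol, instrument, calibration set) is found by walking links, without the table needing a column for each kind of metadata.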

Finally, one last thought. Frans pointed out that there is a lot of ‘biobanking expertise’ buried in papers and in the heads of experienced lab people. I think he shared a dream with many people on the Semantic Web front, when he said: ‘wouldn’t it be nice to just be able to look up if anyone knows the stability of potassium in tube type X?’ Now that sounds exactly like a nanopublication to me. I have the feeling that ConceptWiki can play a number of roles in this story.

LabKey installation

Today I want to check out LabKey, an open source software suite which claims to be ‘the best way to manage your biomedical research data’. Its website looks promising, and so does their list of participants and funding parties. When it comes to open source, the source code is indeed downloadable, although it is quite hidden and there is also no open source repository (but that’s not an OSI requirement). The license used for the code is the flexible and widely used Apache 2.0 license.

The data model of LabKey looks quite promising (because it seems flexible and study-oriented). I couldn’t find an overview document for the whole data model / architecture, but many things become clearer when you look at the table schemas of the internal database from within an installed LabKey instance (instructions). Also, I found a PDF document which describes the way in which property values are stored in the database. It was a pleasant surprise to read that they are actually stored as RDF triples. What struck me as a bit odd is that SQL queries are used to query those triples, which is indeed very hard to do – why not switch from Postgres as the underlying database to a native graph database implementation such as neo4j? In that case SPARQL could be used to query, which is much better at querying RDF data.
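Why SQL is awkward for triples can be shown with a short sketch. The `triple` table below is a hypothetical three-column layout chosen for illustration, not LabKey’s actual schema, and the SPARQL text is included only for comparison (it is not executed, and the `ex:` prefix is an assumption).

```python
import sqlite3

# Hypothetical triple table; this is NOT LabKey's real schema.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE triple (subject TEXT, predicate TEXT, object TEXT)")
conn.executemany(
    "INSERT INTO triple VALUES (?, ?, ?)",
    [
        ("sample1", "species", "human"),
        ("sample1", "tissue", "blood"),
        ("sample2", "species", "human"),
        ("sample2", "tissue", "liver"),
        ("sample3", "species", "mouse"),
        ("sample3", "tissue", "blood"),
    ],
)

# "Which human samples are blood samples?" needs one self-join per property.
sql_query = """
    SELECT a.subject
    FROM triple AS a
    JOIN triple AS b ON a.subject = b.subject
    WHERE a.predicate = 'species' AND a.object = 'human'
      AND b.predicate = 'tissue'  AND b.object = 'blood'
    ORDER BY a.subject
"""
human_blood = [row[0] for row in conn.execute(sql_query)]
print(human_blood)

# The same question as a SPARQL pattern, shown as text for comparison only.
sparql_equivalent = """
    PREFIX ex: <http://example.org/>
    SELECT ?s WHERE { ?s ex:species "human" ; ex:tissue "blood" . }
"""
```

Every extra property in the question adds another join on the same table in SQL, whereas the SPARQL pattern just gains one more predicate and object pair.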

Anyway, I decided to give LabKey a serious try, because I am very interested in any open source software suite that would help me support the biologists who want to do ‘multi-omics’ and are looking for software that can help them track that kind of data. The instructions for installing LabKey from source or even from compiled sources on Linux looked so frightening to me that I decided to go for the one-click Windows installer. I fired up a dedicated Windows VM for it, downloaded the installer, and it installed Postgres, Tomcat and all necessary LabKey binaries in one go – that was a very smooth first user experience! However, after I installed it, I could only reach the website from localhost. I tried to tune the installed Tomcat installation, but it seems to be configured as a virtual host for ‘localhost’ by default, refusing access via any other host names, including 127.0.0.1 on the same machine.


Read more

SADI framework and the Semantic Web

The Semantic Web is a very nice concept that has been around for more than a decade. Some people say that there can never be an economically viable big application for the Semantic Web, because its applications are too knowledge intensive, and therefore hard to set up and maintain on a big scale. Others say that we will run out of computation power long before we have gathered enough data and knowledge to compute something useful. While these are very general claims, they indeed expose the weak spot of the Semantic Web: the dependence on carefully laid out knowledge, expressed in a computer-readable format. But that’s also the strength of the idea.

If we look at reality, we could indeed say that the other approach has far more followers and business use cases: don’t care about the exact knowledge structure, but just use a huge amount of data and a fuzzy algorithm to infer knowledge. Just look at Google. I talked about this with David Weiss, and he mentioned the example of the spam algorithm that Google uses to shield your mailbox. It’s really good at classification. Why? Because of the sheer number of users that click ‘Mark as spam’ every time they see a spam message.

Today, I want to introduce what very well could be an important step in the Semantic Web story, especially for Life Sciences. No, I’m not talking about the long-awaited ConceptWiki. I mean the SADI framework, to which I was pointed by the myGrid guys at ECCB10 – they co-developed a SADI Taverna plugin.


Read more

My Nutrigenomics Quest

For about a year now, I have been technical project manager of the Nutritional Phenotype Database (dbNP) project, which was initiated by The European Nutrigenomics Organisation (NuGO). Nutrigenomics is what I would call ‘applied bioinformatics’. It is different from bioinformatics flavours such as next generation sequencing, metabolomics or transcriptomics in the sense that it has no data format of its own. The same goes for toxicogenomics or pharmacogenomics. Those bioinformatics disciplines just use whatever data they can use to answer biological research questions. Of course, you can still publish a nutrigenomics paper in which only metabolomics data is analyzed under for example two different diet conditions. But you won’t get a paper through just about ‘peak picking’.

So if one thing is important for nutrigenomics studies, it’s a clear description of your study design. Who or what did you measure, under which conditions, which events took place, when did you take your samples, etc. We would also need such a description of the study design to link all the data gathered over the course of such an experiment. These data sets can be huge in terms of diversity: in one NuGO study, there were over thirty different ‘omics assays’ performed on thousands of samples, all with different time points, groups and diet conditions. At that point, an Excel sheet simply can’t do the job anymore. So we decided to start out with this dearly needed part of the ‘database’. (Why do biologists call every computer application they need a database, especially when they don’t know exactly what they need?)

We want to build a fully open source solution. I was confident that there were many open source tools available which we could use for the description of study designs. We started out with the ISA tools from EBI – specifically created for this type of study descriptions. However, we soon encountered limitations, especially because the formats are tab-delimited. ISATAB is great for exchanging general metadata about a study between different platforms, but it falls short when it comes to storing a full-blown study design such as the NuGO study I referred to earlier. The same was true for all the other formats


Read more

Extending a Windows 7 NTFS partition using an Ubuntu live CD

One of my laptops has a dual boot configuration with Ubuntu and Windows 7, but Windows 7 ran out of hard disk space. I was able to shrink my Ubuntu partition and extend my Windows NTFS partition using GParted on a Xubuntu 8.04 live CD. However, after I restarted Windows, the volume still had the old size, even though the partition in the Disk Manager (under Computer Management) had the new size. I finally found out that I had to manually extend the volume to the new partition size using the Windows program DiskPart: just switch to the partition and enter the command ‘extend’ (see screenshot below).

tranSMART developer and user workshop

CTMM TraIT is organizing a tranSMART developer and user workshop in Amsterdam from June 17 to June 19. The main purpose of this meeting from the TraIT perspective is to line up efforts on tranSMART with the international community, among others IMI eTRIKS and the tranSMART Foundation, which is why these two parties are also co-organizing the event. Because The Hyve has both developers and users participating in the TraIT Foundation Team, we will attend both developer and user sessions.

Open PHACTS Community Workshop

The 4th Open PHACTS Community Workshop focused on the Open PHACTS API. The Hyve was present at this 2-day event.

Open PHACTS

Eagle Genomics symposium: Big Data

The Hyve was present at Eagle Genomics’ third symposium on Big Data:

Eagle's third symposium

CTMM TraIT symposium

The Hyve was present at the CTMM TraIT symposium ‘Connected in translation’, where updates on all work packages were given. Several demos were also given of tools that are being used within the various domains.

Details on the program can be found on the CTMM website. 

Partners in TraIT:

TraIT Partners

tranSMART Workshop & Hackathon

The Hyve attended the tranSMART developer workshop, hosted by eTRIKS Consortium / Imperial College.

Kees was also chair of the tranSMART hackathon, organized on the second and third day.

More details and updates on the hackathon can be found on the tranSMART wiki. 

Pistoia Alliance European Conference

This year, the European Conference of the Pistoia Alliance will be organised in conjunction with the gddif summit. The Hyve will be present at both events.

Pistoia Alliance European Conference

gddif summit 2013

The Hyve is one of the confirmed sponsors of the gddif summit 2013, which will be held in London. We will also send at least two delegates to the summit.

The Annual gddif is the leading R&D summit for decision makers, scientific leaders and strategists, and for us, this will be a great opportunity for networking!

Life Science Momentum

The Hyve attended the Life Science Momentum 2012, in Den Haag, last year.

Life Science Momentum 2012

Bio-IT World

The Hyve attended this year’s Bio-IT World Conference & Expo, in Boston, MA. The 2012 Expo unites 2,000+ life sciences, pharmaceutical, clinical, healthcare, and IT professionals from 30+ countries. The Expo provides the perfect venue to share information and discuss enabling technologies that are driving biomedical research and the drug development process.

tranSMART face-to-face meeting

Jamboree

From June 18 to June 20 a jamboree was organized in Hoofddorp, near Amsterdam Airport, Netherlands. The Hyve joined this 3-day get-together with the scientific and technical community.

Go to the Phenotype Foundation website to see more on this event.

Marriage Rene Voorberg

Our designer Rene is getting married!



Services What We Do

Services


Software Development

We specialize in developing software for data-intensive biology research (both fundamental and clinical research).
Software development is at the heart of most of our activities.
Typical projects are:

  • implementation of new functionality in existing open source software projects
  • integration of open source components with existing software stacks
  • customization and branding of open source components for private or in-house use

 

Consultancy

Because developing software for scientists is our main activity, we have experience with many aspects around IT and data governance in the life sciences.

We are regularly called upon to advise or deliver data services, and we would love to assist your research department with current data challenges.

We can help you leverage the value of existing public and private data sources, by developing ontologies and appropriate data models. We can also assist in choosing and implementing the respective data processing tools.

Hosting / SLA

The Hyve is able to provide on-demand hosting for software projects on a public cloud service of choice.
We have experience with Amazon AWS, Rackspace and several other hosting companies and cloud stacks. We can also offer:

  • deployment and configuration of software on your on-premises cloud
  • automation of deployment and configuration steps (e.g. by building Puppet scripts)

Solutions Our Works

Solutions


Examples of open source solutions we can deploy, improve and maintain:

Team Team

Team


We empower scientists by building on open source software

Meet the people that make up The Hyve!

  Riza
   Ruslan

Contact us for open source bioinformatics Get in Touch

Contact


Need to get in touch with us? You can email, write, or call us today. Fill out the form below and we will contact you.

Padualaan 8, Room W101
3584 CH Utrecht
The Netherlands
+31 (0)30 7009713
office@thehyve.nl