Great news for translational research: Johnson & Johnson recently open sourced their translational medicine data warehouse, called tranSMART. A paper on tranSMART was published in 2010, and just a few days ago a first version of the source code was put on GitHub (see also the tranSMART project home page).
In this blog post, I will do a walkthrough of how to get tranSMART running on Amazon Web Services. It is a Grails project, so that part is not too complex to deploy, but it also has a lot of dependencies, including an Oracle database, the open source i2b2 project, and various bioinformatics tools such as GenePattern.
Obtaining a host machine
First, we need a suitable host machine. A large part of the software runs on the JVM, so it should be relatively cross-platform. A Linux server would be an obvious choice. But we also need Oracle, so we are a little bit limited in which Linux flavors we can use here. For this walkthrough, I will use Oracle XE, since I would like to use “freely” available and open source components as much as possible. Oracle XE is available under the Oracle Technology Network Developer License Terms, and you need to create an account with Oracle to download it. Licensing for this project is horribly complicated anyway: i2b2 has its own i2b2 Software License, Oracle has its OTN license (but you will want to get a commercial one for a serious deployment), GenePattern has its own license, and the tranSMART Grails application itself is released under GPLv3.
Considering all this, I went for RedHat Enterprise Linux: it works well with Oracle, and it’s well supported by AWS. I used the 64-bit version and an m1.large instance. I’m assuming you are familiar with AWS. If not, you can of course also go for a RHEL flavour such as CentOS on a local cloud server, use a VM on your local computer with virtualization software like VirtualBox, or even try to install everything directly on your OS. I use RightScale for managing my AWS instances, and I tested this with RHEL 6.2 starter (ami-41d0052) on the AWS US East region, and the RightScale CentOS 5.6 (ami-6c0c3f18) on the AWS EU region. For convenience, I also entered the IP address of the newly started server in the DNS via Amazon Route53, so that I can just use transmarttest.thehyve.nl as hostname. You might want to configure the security group to at least close all ports except 22, because Oracle and the various web applications needed for tranSMART expose several ports.
Adding swap memory to the host
For Oracle and the various web applications to work smoothly, you also need swap memory, preferably several GBs of it. Vanilla AWS RedHat images don’t come with swap attached by default, so you first need to create, for example, a 4GB EBS volume and attach it to your tranSMART host machine. I like to call the volume /dev/sdm, m for memory. You can do this easily with RightScale; for the purpose of this tutorial, however, I used the AWS management console.
After attaching the volume, you have to enable it in the OS. We need to SSH into the machine for that:
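As a rough sketch, enabling the swap volume could look like the script below, run as root on the host. The device name is an assumption: on Xen-based instances a volume attached as /dev/sdm often shows up inside the guest as /dev/xvdm, so check `cat /proc/partitions` and adjust `SWAP_DEV` if needed. The `enable_swap` helper is just an illustrative name.

```shell
#!/bin/sh
# Sketch: turn an attached (empty!) EBS volume into swap space. Run as root.
# Assumption: the volume attached as /dev/sdm appears as /dev/xvdm in the guest.
SWAP_DEV="${SWAP_DEV:-/dev/xvdm}"

enable_swap() {
    dev="$1"
    if [ ! -b "$dev" ]; then
        echo "no block device at $dev; check the attachment and the device name" >&2
        return 1
    fi
    mkswap "$dev" || return 1    # write a swap signature (destroys existing data)
    swapon "$dev" || return 1    # activate the swap area for the running system
    # add an fstab entry so the swap area is re-enabled after a reboot
    grep -q "^$dev[[:space:]]" /etc/fstab || echo "$dev none swap sw 0 0" >> /etc/fstab
    swapon -s                    # list the active swap areas as a check
}

enable_swap "$SWAP_DEV" || echo "swap was not enabled" >&2
```

Afterwards, `free -m` should report the extra swap.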
Installing Oracle XE
Next, you need to obtain an Oracle XE (I used Oracle Database Express Edition 11g Release 2) RPM from the Oracle website, and get that onto your Linux server. If you copy the download link with the AuthParam you get after logging in, you can use wget or curl. Unzip the file, and then install the RPM:
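A minimal sketch of these steps, assuming the download was saved in the current directory under the file name Oracle uses for the 64-bit XE 11g R2 package, and that the archive unpacks into a Disk1 directory (adjust `XE_ZIP` if your file is named differently). The `install_xe` function is only an illustrative wrapper.

```shell
#!/bin/sh
# Sketch: unpack and install Oracle XE 11g R2, then run its configuration script.
# Run as root. Assumption: the zip file name below matches your download.
XE_ZIP="${XE_ZIP:-oracle-xe-11.2.0-1.0.x86_64.rpm.zip}"

install_xe() {
    zip="$1"
    if [ ! -f "$zip" ]; then
        echo "cannot find $zip; download it from the Oracle website first" >&2
        return 1
    fi
    unzip -o "$zip" || return 1                  # unpacks into Disk1/
    rpm -ivh Disk1/oracle-xe-*.rpm || return 1   # install the package
    /etc/init.d/oracle-xe configure              # interactive: ports, password, start on boot
}

install_xe "$XE_ZIP" || echo "Oracle XE was not installed" >&2
```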
Just accept the defaults, enter a password, and configure Oracle to start on boot.
Now, you should be able to log in to the database with the Oracle sqlplus shell:
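For example, something along these lines, assuming the default XE 11g R2 install location for the environment script (override `ORACLE_ENV` if yours differs). The `load_oracle_env` helper is a name made up for this sketch.

```shell
#!/bin/sh
# Sketch: load the Oracle environment and open a sqlplus session as SYSTEM.
# Assumption: Oracle XE 11g R2 was installed in its default location.
ORACLE_ENV="${ORACLE_ENV:-/u01/app/oracle/product/11.2.0/xe/bin/oracle_env.sh}"

load_oracle_env() {
    if [ ! -f "$1" ]; then
        echo "Oracle environment script not found at $1" >&2
        return 1
    fi
    . "$1"    # sets ORACLE_HOME, ORACLE_SID and PATH for this shell
}

if load_oracle_env "$ORACLE_ENV"; then
    sqlplus system    # prompts for the password chosen during configuration
fi
```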
Import of the tranSMART database
From this point on, you can import the tranSMART database. However, we first have to create a number of tablespaces that are referenced in the import, which we can do in sqlplus:
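As an illustration, the script below generates the CREATE TABLESPACE statements into a file that you can then run from sqlplus. The tablespace names here are placeholders, not the real ones: replace them with the tablespaces the tranSMART dump actually references. The datafile directory is assumed to be the XE default.

```shell
#!/bin/sh
# Sketch: generate CREATE TABLESPACE statements for sqlplus.
# Assumption: EXAMPLE_DATA and EXAMPLE_INDX are placeholder names; use the
# tablespaces that the tranSMART import actually references.
TABLESPACES="${TABLESPACES:-EXAMPLE_DATA EXAMPLE_INDX}"
# Assumption: default datafile directory of an Oracle XE installation.
DATA_DIR="${DATA_DIR:-/u01/app/oracle/oradata/XE}"

tablespace_sql() {
    for ts in $TABLESPACES; do
        lower=$(echo "$ts" | tr 'A-Z' 'a-z')
        echo "CREATE TABLESPACE $ts DATAFILE '$DATA_DIR/$lower.dbf' SIZE 100M AUTOEXTEND ON NEXT 100M;"
    done
    echo "EXIT;"
}

tablespace_sql > create_tablespaces.sql
cat create_tablespaces.sql
# Then, with the Oracle environment loaded:
#   sqlplus system @create_tablespaces.sql
```

Generating the file first makes it easy to review the statements before anything touches the database.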
Also, we can create an Oracle directory that will hold the import files:
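A sketch of this step: create a directory on disk and write the matching CREATE DIRECTORY statement to a file for sqlplus. Both the path and the directory object name `transmart_dump` are assumptions chosen for this example.

```shell
#!/bin/sh
# Sketch: prepare a filesystem directory and the SQL for an Oracle directory object.
# Assumptions: the path and the object name transmart_dump are examples only.
DUMP_DIR="${DUMP_DIR:-$PWD/transmart_dump}"

mkdir -p "$DUMP_DIR"
# The oracle OS user must be able to read and write here, for example (as root):
#   chown oracle:dba "$DUMP_DIR"

cat > create_directory.sql <<SQL
CREATE OR REPLACE DIRECTORY transmart_dump AS '$DUMP_DIR';
EXIT;
SQL

cat create_directory.sql
# Then, with the Oracle environment loaded:
#   sqlplus system @create_directory.sql
```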
Next, we download the tranSMART GPL 0.9 .dmp and .exp files:
Now that we have created the tablespaces and set up the database dump, the only thing that is left is to do the actual import:
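The import itself might look like the sketch below. It assumes the dump was produced with Oracle Data Pump, so that `impdp` is the right tool (a dump from the legacy `exp` utility would need `imp` instead), and it uses placeholder names for the directory object, dump file, and log file. Check these against the files you actually downloaded.

```shell
#!/bin/sh
# Sketch: write a Data Pump parameter file and start the import.
# Assumptions: transmart_dump, transmart.dmp and transmart_import.log are
# placeholder names, and the dump is a Data Pump export.
ORA_DIR="${ORA_DIR:-transmart_dump}"
DUMP_FILE="${DUMP_FILE:-transmart.dmp}"

cat > import.par <<PAR
DIRECTORY=$ORA_DIR
DUMPFILE=$DUMP_FILE
LOGFILE=transmart_import.log
FULL=Y
PAR

if command -v impdp >/dev/null 2>&1; then
    impdp system PARFILE=import.par    # prompts for the SYSTEM password
else
    echo "impdp not found; load the Oracle environment first" >&2
fi
```

The log file ends up in the same Oracle directory as the dump, which is handy for checking errors afterwards.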
This will take a while, but it should fill your database with a starting point for tranSMART.
The final step is to install and start JBoss (with i2b2), Tomcat (with the tranSMART Grails application), Solr, R, and optionally GenePattern and PLINK. These steps are documented in the Install Guide document, so I won’t repeat them here. Happy researching!