With the growing popularity of the tranSMART data mart and analysis platform, the quality standards for all its modules are continuously rising. Since the inception of the open source tool, the ETL framework (Extract, Transform, Load; the part that reshapes and uploads the data) has been developed as a combination of shell scripts, Kettle code and stored procedures. However, this leaves room for improvement on a few fronts:
- There is little Kettle experience in the community, and most Java developers are not familiar with debugging Kettle code.
- There are no tests in the current ETL pipeline and the current setup makes creating them very difficult.
- The stored procedures are maintained separately for each supported database (currently Oracle and Postgres). Because developers most often work on only one of the two databases, the two sets of procedures can easily diverge.
Starting from these points, we have begun developing a new ETL framework whose code is familiar to current tranSMART developers, and is therefore easy to debug, and which renders the stored procedures obsolete. Funding for this effort is provided by IMI EMIF, Janssen and CTMM TraIT.
The framework is based on Spring Batch. Spring is familiar to tranSMART developers because the Spring framework is a component of Grails, the web application framework on which tranSMART is built.
Currently, the upload of clinical data, including across-trial support, has been implemented in the transmart-batch project, which can be found in our GitHub repository at https://github.com/thehyve/transmart-batch.
On Tuesday, November 18th, our senior developer Gustavo Lopes, who has been the main developer involved together with our colleague Carlos Silva, gave a presentation on transmart-batch, introducing both his colleagues and the wider community to the effort. The screen capture of this presentation is on YouTube, embedded below.