Sjoerd van Hagen

Project Manager / Big Data Architect

@TheHyveNL

Big Data is currently one of the hottest topics in computer science, due to the rapid increase in the amount of data that we, as a society, produce and store. The driving force behind this increase is a dramatic drop in the cost of collecting and storing this data.

This decrease in costs is observed not only in social data (behavior) and the internet of things (sensors) but also in genomics. In fact, the price drop in genomics has become steeper than the price/performance drop for hardware. This means that the amount of genomics data is growing faster than the capacity to do computations on it. At the same time, we would like to have our results faster, for example to start a treatment sooner rather than later.

To obtain maximum value from this data, we need a system that can scale as the data grows. Another important aspect of working with big data is being able to use open source software, because the last thing you want is to lock your valuable data into a vendor-specific system and then find out that it does not support an analysis you would like to perform.

The best current implementation of the well-known MapReduce paradigm is Apache Spark, which optimizes computations by making better use of memory than Hadoop, the previous de facto open source choice for MapReduce. Spark is a general-purpose framework and can be used for any kind of MapReduce computation.
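To make the paradigm concrete, here is a minimal sketch of a map step and a reduce step in plain Python. It does not use Spark itself: the toy `reads` list, the function names, and the "count reads per chromosome" task are all assumptions chosen for illustration. In a framework such as Spark, the same two steps would be distributed over many machines instead of running in a single process.

```python
from functools import reduce

# Hypothetical toy input: (read_name, chromosome) pairs standing in for aligned reads.
reads = [
    ("read1", "chr1"),
    ("read2", "chr2"),
    ("read3", "chr1"),
    ("read4", "chrX"),
    ("read5", "chr1"),
]


def map_read(read):
    """Map step: turn one read into a (key, value) pair."""
    _name, chromosome = read
    return (chromosome, 1)


def reduce_counts(totals, pair):
    """Reduce step: fold one (key, value) pair into the running totals."""
    key, value = pair
    totals[key] = totals.get(key, 0) + value
    return totals


def count_reads_per_chromosome(records):
    """Apply the map step to every record, then reduce the pairs by key."""
    mapped = map(map_read, records)
    return reduce(reduce_counts, mapped, {})


counts = count_reads_per_chromosome(reads)
print(counts)  # {'chr1': 3, 'chr2': 1, 'chrX': 1}
```

Because the map step looks at one record at a time and the reduce step only combines partial results, the work can be split across machines, which is what lets this style of computation scale with the data.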

For genomics data we use ADAM, which is essentially a set of formats for storing genomics data together with a set of algorithms for doing various computations on that data. These include transformations, for example from BAM to VCF. The formats allow the data to be queried with a SQL-like syntax. Using Apache Spark, we can store huge datasets and run various types of computations on them within a short amount of time. For genomics data we can leverage ADAM, but we do not stop there: we can handle any type of biological data by creating algorithms and data structures for Spark. We expect that other types of bio-data, from wearables for example, will grow significantly in the near future.
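As a rough illustration of what querying genomics records with a SQL-like syntax can look like, the sketch below uses Python's built-in `sqlite3` module as a stand-in. This is not ADAM or Spark, and the `variants` table, its columns, and the sample rows are hypothetical; the point is only the shape of such a query.

```python
import sqlite3

# In-memory database standing in for a large columnar store (assumption for illustration).
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE variants ("
    "contig TEXT, position INTEGER, ref_allele TEXT, alt_allele TEXT)"
)

# Hypothetical sample rows, not real data.
rows = [
    ("chr1", 1000, "A", "G"),
    ("chr1", 2500, "C", "T"),
    ("chr2", 300, "G", "A"),
]
conn.executemany("INSERT INTO variants VALUES (?, ?, ?, ?)", rows)


def variants_in_region(connection, contig, start, end):
    """Return the variants on `contig` whose position lies in [start, end]."""
    cursor = connection.execute(
        "SELECT position, ref_allele, alt_allele FROM variants "
        "WHERE contig = ? AND position BETWEEN ? AND ? "
        "ORDER BY position",
        (contig, start, end),
    )
    return cursor.fetchall()


hits = variants_in_region(conn, "chr1", 0, 2000)
print(hits)  # [(1000, 'A', 'G')]
```

Expressing the question as a query rather than as a hand-written loop leaves it to the underlying system to decide how to read and filter the data, which matters more as datasets grow.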

If your data is also growing faster than your infrastructure, please contact us and we will see what we can do for you.

Read all our blog posts on ADAM in the News section.
