What are ETL and data curation?

The data curation process includes activities related to organising and integrating data collected from different sources, annotating it, and publishing and presenting it. The process of transforming disparate, non-standardised files into a centralised and uniform data structure is often referred to as the ETL process:

– E for `extract`: you have to extract the original data (e.g., from the source files or from a server)

– T for `transform`: you have to transform the various data types to make them recognisable by the endpoint of your choice (the system, software package or database you wish to use for accessing the data)

– L for `load`: you have to load the transformed data into the endpoint of your choice (see the sketch below)
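
To make the three steps concrete, here is a minimal sketch of what an ETL script could look like in Python with pandas and SQLite. The file, column and table names are assumptions made for illustration, and the real transformation logic will depend entirely on your source data and the endpoint you load into.

```python
# A minimal, hypothetical ETL sketch using pandas and SQLite.
# The file name, column names and target table are illustrative only.
import sqlite3

import pandas as pd

# Extract: read the original data from a source file
raw = pd.read_csv("samples_export.csv")

# Transform: harmonise column names and data types so the endpoint
# (here a relational table) can recognise them
clean = raw.rename(columns=lambda c: c.strip().lower().replace(" ", "_"))
clean["collection_date"] = pd.to_datetime(clean["collection_date"], errors="coerce")

# Load: write the transformed data into the endpoint of your choice
with sqlite3.connect("curated.db") as conn:
    clean.to_sql("samples", conn, if_exists="replace", index=False)
```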

Data curation and ETL

Why are ETL and data curation important?

Having a well-established and well-maintained ETL pipeline is essential for anyone working in translational research today. You often work with multiple data providers and receive data in a variety of formats, yet you want to integrate and analyse it as a whole. Even when you work with a single data vendor, a multitude of files in various formats needs to be processed and loaded into a database or software package before you can look at, analyse and leverage the data.

Excel files will have different formats and column headers, text files will be tab-, comma- or space-delimited, and imaging and histology data will be saved as PNG, IMG or JPG. Processing such diverse data manually will take a lot of your time and effort! That is why your research will benefit from an automated pipeline that formats the original data and imports it into your tools of choice, so that you can analyse it directly, without reformatting it by hand. Designing an ETL process that fits your requirements and your specific use case is often, if not always, a laborious and delicate task that requires a background in bioinformatics.
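
As a hedged illustration of such an automated formatting step, the sketch below reads text exports that use different delimiters and maps their headers onto one canonical set before combining them. The file names, delimiters and header mapping are invented for this example and would need to match your actual sources.

```python
# Hypothetical example: combine delimited text files from different providers.
import pandas as pd

# Each (made-up) source file uses a different delimiter
SOURCES = {
    "vendor_a.tsv": "\t",    # tab-delimited
    "vendor_b.csv": ",",     # comma-delimited
    "vendor_c.txt": r"\s+",  # whitespace-delimited
}

# Map each provider's header variants onto one canonical column name
HEADER_MAP = {"PatientID": "patient_id", "patient id": "patient_id", "ID": "patient_id"}

frames = []
for path, sep in SOURCES.items():
    # engine="python" is needed for regex separators such as r"\s+"
    df = pd.read_csv(path, sep=sep, engine="python")
    df = df.rename(columns=lambda c: HEADER_MAP.get(c.strip(), c.strip().lower()))
    frames.append(df)

# One uniform table, ready to be loaded into the tool of your choice
combined = pd.concat(frames, ignore_index=True)
```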

How The Hyve helps

Here at The Hyve we have a lot of experience with designing ETL processes and curating data. Our team of data scientists has previously loaded both public data, e.g., TCGA, CCLE and 1000 Genomes, and private client data into cBioPortal, tranSMART, OHDSI, CKAN and many more tools and data marts. We have ample expertise in developing ETL for centralised systems as well as for federated queries. Contact us if you are interested in our data curation services, if you want consultancy on your ETL tooling, or if you have any other questions related to translational data!