Automating cBioPortal ETL with Apache Airflow

Did you know that 75% of companies that use workflow automation report that it gives them a solid competitive advantage? In this article, The Hyve’s cBioPortal specialists share what makes Apache Airflow such an attractive open-source platform for automating complex workflows, and how The Hyve uses it for cBioPortal data management and Extract-Transform-Load (ETL) processes.

Workflow Automation with Apache Airflow

Workflow automation streamlines processes by increasing their efficiency, accuracy, and scalability, all while reducing the rate of manual errors and operational overhead. Apache Airflow brings this approach to data engineering pipelines in an open-source platform that uses Directed Acyclic Graphs (DAGs) as its foundational architecture.

In Apache Airflow, a DAG organizes the tasks that make up a workflow. Each task is defined by an operator, and operators cover a rich spectrum of functionality, including running Bash scripts, managing Docker containers, interacting with cloud storage, and executing Python scripts, to name a few. As such, operators are able to carry out a wide range of ETL tasks. Moreover, by default a task in a DAG runs only after the preceding tasks have completed successfully, ensuring a well-orchestrated sequence of operations.

Figure 1 illustrates this sequence of operations and its directed acyclic nature. In this example, the first three tasks run in parallel and all feed into the same downstream task. That task belongs to a second set of tasks that also run in parallel, and it starts only once the first set has finished. This combination of parallelism and explicit dependencies makes the DAGs used by Apache Airflow a good choice for automating complex workflows.

Figure 1: Example of a Directed Acyclic Graph (DAG) comprised of Operators (tasks)
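
The fan-in pattern described for Figure 1 can be illustrated without Airflow itself. The sketch below uses Python's standard-library `graphlib` module (Python 3.9 or later), not the Airflow API, to group tasks into batches that could run in parallel. The task names and the exact shape of the graph are hypothetical and only loosely modelled on the figure.

```python
from graphlib import TopologicalSorter

# Hypothetical tasks: each key maps to the set of tasks it depends on.
dependencies = {
    "extract_a": set(),
    "extract_b": set(),
    "extract_c": set(),
    "combine": {"extract_a", "extract_b", "extract_c"},  # waits for all three
    "validate": {"combine"},
    "report": {"combine"},
}


def execution_batches(deps):
    """Group tasks into batches whose members have no unmet dependencies."""
    sorter = TopologicalSorter(deps)
    sorter.prepare()  # raises graphlib.CycleError if the graph has a cycle
    batches = []
    while sorter.is_active():
        ready = sorted(sorter.get_ready())  # tasks that may run in parallel
        batches.append(ready)
        sorter.done(*ready)  # mark the batch as finished to unlock the next one
    return batches


batches = execution_batches(dependencies)
print(batches)
# [['extract_a', 'extract_b', 'extract_c'], ['combine'], ['report', 'validate']]
```

The "acyclic" part matters here: if two tasks depended on each other, no valid execution order would exist, and `prepare()` would raise an error instead of returning a plan.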

Moreover, Apache Airflow provides a user-friendly web-based interface with a comprehensive dashboard that shows workflows, current task statuses, and detailed logs. This visibility enables robust monitoring and troubleshooting of DAG runs.

Figure 2 shows an example of the user interface grid view for an operator of a DAG, where information like the status, start time, duration, and the number of runs can be found.

Figure 2: Apache Airflow’s user-friendly web-based interface

Additionally, workflow execution can be scheduled at predetermined intervals or initiated manually, offering flexibility and control to users. This structured yet dynamic approach enhances operational efficiency and error resilience and provides a clear path for managing and optimizing data workflows. On top of that, Apache Airflow is built to handle large-scale data pipelines and is horizontally scalable, meaning that capacity can be added to accommodate growing workloads.

Overall, Apache Airflow simplifies the creation and management of complex workflows by providing a robust framework for automating data pipelines, making it a popular choice in data science.

cBioPortal and Apache Airflow integration at The Hyve

The cBioPortal for Cancer Genomics is an open-access and open-source platform that enables the interactive exploratory data analysis of multidimensional cancer genomics datasets. cBioPortal aims to make cancer genomics datasets accessible and interpretable by providing a user-friendly interface to allow researchers to get the most out of their data.

To get the most out of the features the platform offers, The Hyve assists its clients in loading the data in the correct format. Depending on the source data and the client, this process can include ingesting the data from the source, transforming it to the appropriate cBioPortal format, validating the data, and then loading it onto the platform (Figure 3). The process can be interspersed with several smaller tasks specific to the client's needs, such as sending out email alerts, pre-processing the data before transformation, and configuring the loading process.

Figure 3: ETL pipeline to load data onto cBioPortal

The Hyve's cBioPortal ETL experts have found Apache Airflow to be one of the top choices for streamlining data engineering pipelines that can otherwise involve a fair amount of manual intervention. With the help of its user-friendly interface and built-in operators, ETL scripts are linked together in sequence to carry out efficient data loading.

The Hyve has successfully set up Airflow for several clients. For one of our clients, we employ an Airflow DAG to execute daily tasks. This DAG encompasses the following key steps (Figure 4):

(i) Download new patient files: Identify and fetch any recently added patient files from the server that are not already part of the existing study.

(ii) Transform to cBioPortal format: Convert the acquired files into the required cBioPortal format.

(iii) Merge with previous output: Integrate the newly transformed files with the existing output.

(iv) Reload study with new patients: Update the study by incorporating the information from the new patients.

(v) Email notification on successful completion: Upon successful execution, an email notification is automatically dispatched, providing the names of newly processed files.

If any of these steps fails, an email containing the error log message is promptly sent out. This automatic alerting eliminates the need for manual checks on the DAG’s health and streamlines debugging.
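
The five steps and the failure alert could be wired together roughly as follows. This is a minimal sketch, assuming an Airflow 2.x release (2.4 or later) with SMTP configured for outgoing mail. The DAG id, task ids, email address, and placeholder callables are all hypothetical and do not represent The Hyve's actual pipeline code. Airflow is imported inside `build_dag` so that the helper `find_new_files` can be tried out without an Airflow installation.

```python
from datetime import datetime

# Hypothetical task ids mirroring steps (i) to (v) above.
STEP_IDS = [
    "download_new_patient_files",
    "transform_to_cbioportal_format",
    "merge_with_previous_output",
    "reload_study_with_new_patients",
    "email_success_notification",
]


def find_new_files(server_files, loaded_files):
    """Step (i): files on the server that are not yet part of the study."""
    return sorted(set(server_files) - set(loaded_files))


def placeholder_step(step_name):
    """Stand-in for a real ETL script; returns a callable for a task to run."""
    def run():
        print(f"running {step_name}")
    return run


def build_dag():
    # Imported lazily so the module also loads where Airflow is not installed.
    from airflow import DAG
    from airflow.operators.python import PythonOperator

    with DAG(
        dag_id="cbioportal_daily_load",  # hypothetical name
        start_date=datetime(2024, 1, 1),
        schedule="@daily",  # run once a day
        catchup=False,  # do not backfill runs for past dates
        default_args={
            "email": ["etl-alerts@example.org"],  # hypothetical address
            "email_on_failure": True,  # mail the error when a task fails
        },
    ) as dag:
        previous = None
        for step_id in STEP_IDS:
            task = PythonOperator(
                task_id=step_id,
                python_callable=placeholder_step(step_id),
            )
            if previous is not None:
                previous >> task  # each step waits for the one before it
            previous = task
    return dag
```

In a real DAG file, `dag = build_dag()` would be called at module level so that Airflow can discover the DAG, and each placeholder would be replaced by the corresponding ETL script or a more specific operator.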

Figure 4: Example workflow automation pipeline using Airflow

Each DAG can be set to be triggered periodically (daily, weekly, monthly, etc.) using the Scheduler, allowing the pipeline to keep running; however, it can also be triggered manually to allow for testing.
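
As a small illustration of the two trigger modes, the helper below returns keyword arguments that could be passed to a DAG definition. It is a sketch assuming Airflow 2.4 or later, where the `schedule` argument accepts presets such as `@daily` and where `schedule=None` means the DAG runs only when triggered manually. The function name `schedule_kwargs` is hypothetical.

```python
def schedule_kwargs(frequency=None):
    """Return DAG keyword arguments for a periodic or manual-only DAG.

    frequency: "daily", "weekly", "monthly", or None for manual triggering.
    """
    presets = {"daily": "@daily", "weekly": "@weekly", "monthly": "@monthly"}
    if frequency is None:
        # No schedule: the DAG runs only when someone triggers it.
        return {"schedule": None, "catchup": False}
    return {"schedule": presets[frequency], "catchup": False}


print(schedule_kwargs("weekly"))
print(schedule_kwargs())
```

A DAG with a periodic schedule can still be triggered by hand from the web interface or the command line, which is convenient for testing changes before the next scheduled run.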

Moreover, Airflow can integrate with existing identity providers (IDPs), allowing centralized access management across its workflows. For instance, roles can be created on Keycloak, which is The Hyve’s chosen tool for authorization and authentication for cBioPortal, giving permission to selected users to pause and unpause DAGs when required.

In summary, Apache Airflow offers robust features tailored to cBioPortal’s needs, ranging from support for existing scripts to efficient task management, monitoring, scalability, and seamless integration with external services. We would like to encourage the open science community to consider Airflow for workflow orchestration in the field of cancer genomics and target discovery.

Closing remarks

We hope this article inspires you to elevate your data management experience with Apache Airflow. For inquiries on cBioPortal workflow automation or other cBioPortal services, feel free to contact us.

cBioPortal

The Hyve manages the largest number of active cBioPortal installations in the world, for a wide variety of clients, including pharma companies, hospitals, research institutes, data providers and research collaborations. Our contributions to the open-source code base can be found in our articles.
