Making a custom-made clinical data model

Azadeh, Hernando, Liam, Stefan

19-09-2024 10 min read

This project has received funding from the European Union’s Horizon 2020 research and innovation programme under grant agreement No. 101016851.

Rome was not built in a day, and neither was the PANCAIM clinical data model! The Hyve designed and developed it in close collaboration with a large interdisciplinary team from various technical and clinical organisations across Europe, tailoring it to the research purposes of the PANCAIM project.

Introduction

Pancreatic cancer AI for genomics and personalised Medicine (PANCAIM) is an ambitious European consortium that aims to improve cancer treatment with artificial intelligence by optimising and integrating genomics and medical imaging. The project is funded by the European Union’s Horizon 2020 research and innovation programme and kicked off in January 2021. It brings together 9 participating organisations with academic, clinical and technical expertise from across Europe.

The aim is to help clinical decision-makers give the right treatment to pancreatic ductal adenocarcinoma (PDAC) patients at the right time. The intended effect is to improve treatment outcomes for pancreatic cancer patients by avoiding the current costly trial-and-error use of expensive drugs with strong side effects. To achieve this goal, the project pools various types of data (e.g. pathology, imaging, genomics and EHR data) and develops trusted, impactful AI applications that can parse and combine information from these data sources.

Six expert clinical partners provide eleven pan-European repositories covering almost 6,000 patients, all open to ongoing accrual. These clinical centres treat more than 2,000 PDAC cases each year. Once combined across centres, this data is an invaluable resource for training AI algorithms and for making progress on rare cases and complex decisions. However, each medical centre has its own conventions and tools for capturing clinical data, which makes combining the data in its source format infeasible. To make the data interoperable, all clinical data is converted into a common data model (CDM), which is then used for research.

The Hyve OHDSI team designed and developed the PANCAIM clinical data model in collaboration with all the PANCAIM data partners and Collective Minds Radiology, and is currently harmonising the clinical data provided for PANCAIM into this data model.

To harmonise all the clinical data intended for PANCAIM, the Observational Medical Outcomes Partnership (OMOP) CDM was initially considered. However, it was passed over in favour of a more compact custom-made model. The scope of variables needed for the project's research objectives is limited, so a custom model gives more flexibility to work with the available data, whereas an existing data model carries its inherent constraints.

Method

To put together a minimalistic data model for the PANCAIM clinical data, we needed to identify the essential variables that would be analysed in later stages of the project. The consortium partners had previously gathered a comprehensive list of the variables available in each data source. The first task was to better understand the types and meanings of the variables in each source database. Under the data privacy and security constraints of the PANCAIM project, The Hyve does not have direct access to the source data. That posed no problem, since we have extensive experience developing ETLs and data models without access to the data.

The first step was to ask each source data custodian to provide a White Rabbit scan of the data intended for the project. White Rabbit is an open-source OHDSI tool that reports what types of data, tables and sizes are present in the scanned dataset. The scan contains no personally identifiable data and is therefore compatible with the GDPR. The Hyve team examined the scan provided by each hospital to identify the data components that needed clarification.
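To give a feel for the kind of aggregate information such a scan conveys, here is a minimal Python sketch that profiles a small in-memory table. It is not White Rabbit itself and does not reproduce its report format. The function name `profile_table`, the field names and the example rows are all hypothetical.

```python
from collections import Counter

def profile_table(table_name, rows):
    """Summarise each field of a table without returning any row-level data.

    `rows` is a list of dicts (one dict per record). The result holds one
    summary per field: row count, fraction of empty values, most common
    Python type and number of distinct non-empty values.
    """
    fields = sorted({key for row in rows for key in row})
    summary = []
    for field in fields:
        values = [row.get(field) for row in rows]
        filled = [v for v in values if v not in (None, "")]
        types = Counter(type(v).__name__ for v in filled)
        summary.append({
            "table": table_name,
            "field": field,
            "n_rows": len(rows),
            "fraction_empty": 1 - len(filled) / len(rows),
            "inferred_type": types.most_common(1)[0][0] if types else "unknown",
            "n_distinct": len(set(filled)),
        })
    return summary

# Made-up example records, for illustration only.
example_rows = [
    {"patient_id": 1, "ca19_9": 37.0, "ecog": 1},
    {"patient_id": 2, "ca19_9": None, "ecog": 0},
    {"patient_id": 3, "ca19_9": 120.5, "ecog": ""},
    {"patient_id": 4, "ca19_9": 88.0, "ecog": 1},
]

for field_summary in profile_table("lab_tests", example_rows):
    print(field_summary)
```

A summary of this kind is enough to spot fields that are mostly empty or that have an unexpected type, which are the kinds of points that need clarification with the data custodian.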

The next step was to hold a mapping workshop with representatives of each hospital to discuss what we found in the White Rabbit scans, whether any legal or technical constraints applied to particular variables, and so on. A clinical expert was also available to clarify what each variable means, so that it could be mapped to the correct table in the target data model.

The final step was to reach consensus among all clinical data partners on the minimum set of variables and tables to include in the data model, given the objectives of the research and the shareability of the data. For example, we decided that, in addition to real dates, the data model would include the time intervals between events, so that partners do not have to share actual dates.
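The idea of sharing intervals rather than dates can be illustrated with a short Python sketch. This is only an illustration under stated assumptions: the function `days_between`, the choice of diagnosis as the anchor event, and the example dates are hypothetical and not taken from the PANCAIM pipelines.

```python
from datetime import date

def days_between(events, anchor="diagnosis"):
    """Return the offset in days of every event relative to the anchor event.

    `events` maps an event name to a `datetime.date`; the anchor event must
    have a date. Events without a recorded date (None) are kept as None so
    that missing data stays visible.
    """
    anchor_date = events[anchor]
    intervals = {}
    for name, when in events.items():
        if name == anchor:
            continue
        intervals[f"days_{anchor}_to_{name}"] = (
            None if when is None else (when - anchor_date).days
        )
    return intervals

# Made-up dates for a fictional patient.
patient_events = {
    "diagnosis": date(2021, 3, 1),
    "surgery": date(2021, 3, 29),
    "chemotherapy": date(2021, 5, 10),
    "radiotherapy": None,
}

print(days_between(patient_events))
```

The output contains only relative offsets (for instance, surgery 28 days after diagnosis), so the order and spacing of events are preserved without revealing any calendar date.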

Result

The PANCAIM CDM is composed of 9 tables, with the patient table at the centre of the model. In this table, every patient has a unique record populated with characteristics relevant to pancreatic cancer research, such as age at diagnosis and ECOG/ASA/WHO scores. The other 8 tables store the patients' clinical data, and each of their records is linked back to the patient table. These 8 tables arrange the data by type: body measurements, past medical history, surgery, lab tests, follow-up/progression, chemotherapy, radiotherapy and time intervals. The time intervals table stores the time in days between important clinical events (such as surgery and chemotherapy). In this way, researchers can build cohorts in a wide variety of ways according to their objectives.
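The following Python sketch shows how a patient-centred layout of this kind supports cohort building. It uses an in-memory SQLite database and, for brevity, only three of the nine tables. All table names, column names and values are assumptions made for illustration; the actual PANCAIM CDM definitions are not reproduced here.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite enforces foreign keys only on request

# Hypothetical, simplified tables: one central patient table plus two of the
# clinical tables, each linked back to the patient through patient_id.
conn.executescript("""
CREATE TABLE patient (
    patient_id       INTEGER PRIMARY KEY,
    age_at_diagnosis INTEGER,   -- nullable: a site may not record it
    ecog_score       INTEGER
);
CREATE TABLE surgery (
    surgery_id     INTEGER PRIMARY KEY,
    patient_id     INTEGER NOT NULL REFERENCES patient(patient_id),
    procedure_name TEXT
);
CREATE TABLE time_interval (
    interval_id INTEGER PRIMARY KEY,
    patient_id  INTEGER NOT NULL REFERENCES patient(patient_id),
    from_event  TEXT NOT NULL,
    to_event    TEXT NOT NULL,
    days        INTEGER NOT NULL
);
""")

# Made-up records for three fictional patients.
conn.executemany("INSERT INTO patient VALUES (?, ?, ?)",
                 [(1, 64, 1), (2, 58, 0), (3, 71, None)])
conn.executemany("INSERT INTO surgery (patient_id, procedure_name) VALUES (?, ?)",
                 [(1, "pancreatoduodenectomy"), (3, "distal pancreatectomy")])
conn.executemany(
    "INSERT INTO time_interval (patient_id, from_event, to_event, days) "
    "VALUES (?, ?, ?, ?)",
    [(1, "surgery", "chemotherapy", 42), (3, "surgery", "chemotherapy", 95)])

# Cohort: operated patients aged 60 or older at diagnosis who started
# chemotherapy within 60 days of surgery.
cohort = [row[0] for row in conn.execute("""
    SELECT DISTINCT p.patient_id
    FROM patient p
    JOIN surgery s       ON s.patient_id = p.patient_id
    JOIN time_interval t ON t.patient_id = p.patient_id
    WHERE p.age_at_diagnosis >= 60
      AND t.from_event = 'surgery' AND t.to_event = 'chemotherapy'
      AND t.days <= 60
    ORDER BY p.patient_id
""")]
print(cohort)  # prints [1]
```

Because every clinical table carries the patient identifier, a cohort definition reduces to joins against the central table plus filters on the variables of interest.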

Advantages

The PANCAIM clinical data model is a comprehensive framework that standardises a range of clinical observational measurements across European hospitals. Standardisation makes it easier for current and future data partners to organise their data in a way that makes it accessible to fellow clinicians, researchers and other hospitals, and in turn lets them access data from other hospitals. The model was developed with the input of oncology, clinical and data engineering experts to investigate pancreatic cancer alongside other patient features using a range of multimodal AI tools. It was also designed with enough flexibility to handle missing variables, which makes it adaptable when a partner does not have all of the clinical observations recorded.

Limitations

The PANCAIM clinical data model is tailored to pancreatic adenocarcinoma research, and that specificity has a price. Any update to the model requires consulting both clinical and technical experts to ensure the necessary changes are made correctly. When new data sources join the consortium, revisions may be required to accommodate them. And if the model were to be extended to other types of pancreatic cancer, further work would be necessary.

Future developments

Under PANCAIM, the source clinical data is harmonised to the custom CDM and used in conjunction with the other types of data (e.g. radiomics, pathomics, genomics and imaging) to support faster diagnosis of the cancer. For continuity throughout the lifetime of the project and beyond, the CDM would ideally remain unchanged. However, some situations might necessitate a change. Here we outline a few such scenarios and the considerations to take into account when changing the CDM.

One such scenario is the addition of new data sources to the consortium. A rigorous analysis is then required to ensure that their patient data fits the constraints of the model. If it does not, the CDM can be modified to relax the constraints that the new data source cannot fulfil. Such modifications can be made backwards-compatible, so that no records in the existing database need to be changed.
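Why relaxing a constraint is backwards-compatible can be shown with a small Python sketch: every record that was valid under the stricter rules remains valid under the relaxed ones. The `validate` function, the two model versions and the field names (including `resection_margin`) are hypothetical and serve only to illustrate the principle.

```python
def validate(record, required, optional):
    """Return a list of problems found in `record` (an empty list means valid)."""
    problems = [f"missing required field: {field}"
                for field in sorted(required) if record.get(field) is None]
    unknown = set(record) - set(required) - set(optional)
    problems += [f"unknown field: {field}" for field in sorted(unknown)]
    return problems

# Version 1 of a hypothetical surgery table: the resection margin is required.
V1 = {"required": {"patient_id", "procedure_name", "resection_margin"},
      "optional": set()}
# Version 2 relaxes that constraint: the same field becomes optional.
V2 = {"required": {"patient_id", "procedure_name"},
      "optional": {"resection_margin"}}

# Made-up records: one already in the database, one from a new data source
# that does not record the resection margin.
existing = {"patient_id": 1, "procedure_name": "pancreatoduodenectomy",
            "resection_margin": "R0"}
new_site = {"patient_id": 7, "procedure_name": "distal pancreatectomy"}

print(validate(existing, **V1))  # valid before the change
print(validate(existing, **V2))  # still valid after it
print(validate(new_site, **V1))  # rejected by the stricter model
print(validate(new_site, **V2))  # accepted once the constraint is relaxed
```

The reverse change, turning an optional field into a required one, would not have this property, which is one reason tightening the model needs more care than relaxing it.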

As the medical community learns more about pancreatic cancer and its treatments, new discoveries could demand changes to the CDM, such as new variables or tables to record types of clinical data that are found to be associated with the disease.

Finally, as researchers use the PANCAIM database, the insights it generates could also lead to changes in the CDM. Data analysts may find that increasing the granularity of the data (such as tracking more types of surgery or cardiovascular disease) yields more informative results.

Considerations

Once the initial datasets have been converted to the PANCAIM CDM, alterations to the data model will require more effort (e.g. because existing data transformation ETL pipelines need to be updated). Careful consideration will therefore be required to weigh the potential analytical benefits of changing the model against the cost of implementing new CDM changes.

Impact

Despite numerous advances in cancer treatment, most patients die within one year of a pancreatic cancer diagnosis (ref), highlighting the need for more research focused specifically on pancreatic cancer. One obstacle is the small amount of data that a single health centre can collect: pancreatic cancer is roughly the 10th most common cancer in the US, representing 3.3% of new cancer diagnoses there (ref). Most meaningful computational analyses require hundreds of data points to generate accurate insights, so a database with a large number of patients is necessary to build large and varied cohorts.

The development of the clinical data model is a crucial milestone for the PANCAIM consortium and its objectives. The CDM enables the harmonisation of clinical observational data with other modalities such as radiological images, omics data and pathological slides, streamlining data integration and improving interoperability through its standardised format. This provides the foundation for advanced analytics and for training machine learning algorithms that use multi-modal pancreatic cancer data to build and improve survival prediction, early diagnosis, subtyping and image segmentation.

Beyond improving interoperability, the PANCAIM consortium’s CDM and objectives support the other FAIR principles as well. By enabling easy discovery of data across diverse sources within the consortium and from future data partners, the CDM promotes the findability of pancreatic cancer data. Its structured framework enhances accessibility by ensuring that data can be readily retrieved and used by both humans and machines. Finally, the CDM contributes to reusability by establishing clear semantics and formats, allowing efficient data sharing and reuse across research efforts within and beyond the PANCAIM consortium.

The PANCAIM CDM is a novel development in pancreatic cancer care. It is one of the first of its kind and addresses the challenge of combining multi-modal data across multiple European hospital sites. This offers researchers and clinicians a unified platform to explore complex datasets and drive innovation in oncological research and patient care.

Conclusions

The challenges of working with clinical data from different sites are multifaceted. They include heterogeneity in how the data is collected and stored, privacy concerns and interoperability barriers, and they are only more severe in international and interdisciplinary projects like PANCAIM. The CDM developed by The Hyve addresses these challenges. By standardising data formats and semantics, it facilitates the harmonisation of clinical data across multiple pan-European sites, enables integration with other modalities such as imaging and genomic data, and enhances the accessibility, findability and reusability of the data. With this, The Hyve and the PANCAIM consortium at large are strengthening the foundation for collaborative research and innovation in pancreatic cancer care, ultimately paving the way for more effective treatments and improved outcomes for patients worldwide.

OHDSI

The OHDSI suite is an open-source, modular solution that enables organizations to explore 360° patient journeys and turn data into evidence. The ecosystem provides a broad range of tools that cover all aspects of real world data and evidence, from data characterization to a standardized data model (OMOP CDM). This enables large scale cross-database analytics with OHDSI.

Our mapping experts can work with EHR, EMR, registry data and most popular commercial / claims datasets.
Training all the new mapping service providers in EU (EHDEN)
Integrating OHDSI with semantic standards


Azadeh Tafreshiha

Hernando Suarez

Liam Glück

Stefan Payralbe
