Data quality assessment: a worthwhile effort

OHDSI approach to ensuring data readiness for federated analysis

The importance of data quality assessment

It is a widely accepted concept in computer science and mathematics that the quality of the output is determined by the quality of the input. Our work on harmonization of observational health data to common data standards is no exception to this rule.

Standardizing observational health data to a common data model (CDM) is an essential prerequisite for enabling federated analysis amongst collaborators where multiple organizations provide and share data. In practice, for CDM-harmonized data to be usable for such a united effort, all stakeholders involved need to understand and trust the quality of the harmonized data.

As part of our Real World Data team services, The Hyve can help you attain the highest quality of harmonized data and ensure that such data is fit for purpose. We support the conversion of medical source datasets (e.g. electronic health record (EHR) data, clinical registry data and insurance claims) to the widely adopted Observational Medical Outcomes Partnership (OMOP) CDM, thus enabling the generation of Real World Evidence (RWE).

Two key aspects of the data harmonization process are iterative optimization of the extract-transform-load (ETL) data conversion pipelines and comprehensive data quality assessments. The Hyve team leverages open-source tools of the OHDSI suite to design and implement the necessary data quality checks. These checks are based on the source and target data models, as well as on the specific use cases for the harmonized data. Our data experts always work in close collaboration with source data owners to review the outcome of data quality (DQ) checks, and provide ongoing consultancy and technical support in implementing potential data quality improvements.

There are various aspects worth evaluating when determining the quality of data in OMOP CDM format (OMOPed data). In all data quality assessments, we take a step-by-step approach that builds on the extensive work and experience of the OHDSI community, in particular its tools and processes for data quality verification. The Hyve's in-house data quality process guarantees that the exacting standards of all users of the final OMOPed data are met. At the same time, the approach is flexible enough to adapt to every instance of observational health data or CDM we encounter.

Ensuring high data quality

Data Quality can be defined as follows:

“The state of completeness, validity, consistency, timeliness and accuracy that makes data appropriate for a specific use.” (Roebuck 2012) [1]

DQ cannot be observed directly, but a methodology has been developed to assess it. Two types of DQ assessments can be distinguished (Weiskopf and Weng 2013) [2]: assessments to evaluate DQ in general and assessments to evaluate DQ in the context of a specific study.

Many things can compromise data quality throughout the life cycle of data. Dasu and Johnson (2003) [3] distinguish four steps in this life cycle and recommend that DQ be integrated into each step. They refer to this as the DQ continuum:

  1. Data gathering and integration. Possible problems include fallible manual entry, biases (e.g. upcoding in claims), erroneous joining of tables in an EHR, and replacing missing values with default ones.

  2. Data storage and knowledge sharing. Potential problems are a lack of documentation of the data model and a lack of meta-data.

  3. Data analysis. Problems may include incorrect data transformations, incorrect data interpretation, and the use of the inappropriate methodology.

  4. Data publishing. Problems can arise when publishing data for downstream use.

In this article, we will focus on the first two steps of the DQ continuum, since these play a crucial role in the harmonization of source data to the OMOP CDM.

Key aspects investigated in a Data Quality Review

The ultimate test of Real World Evidence generation is whether a study can be replicated and its results shown to be repeatable. To achieve this goal, our Data Quality review covers four key aspects of data generation and management.

  1. Source data: Is the data fit for the ETL and for the research question?

  2. Quality of the ETL: Is the ETL tested and does it perform well?

  3. Target data: Does the data comply with DQ checks?

  4. Source-to-target: Do we see the same number of records in source and target data?

Figure 1: Key aspects checked in the data quality assessment process

1. Source data: Is the data fit for the ETL and for the research question?

In order to ensure a high-quality end result, it is important to first assess the quality of the source data. Converting variable source data to a standard model, such as the OMOP CDM, changes the format in which the data is structured. However, it will not improve poor quality or correct mistakes in the underlying source data.

One can implement checks on the source dataset to investigate whether the source tables comply with the input format of the ETL data conversion pipeline.

The following example rules can be applied:

  • Existence of all tables used in ETL

  • Existence of all fields used in ETL

  • Data type as expected in the ETL

  • Referential integrity between source tables used in ETL
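The four example rules above can be sketched as a small script. This is only an illustration, assuming the source data sits in a SQL database (SQLite is used here so the example runs on its own). The table names, field names and the EXPECTED_SCHEMA and EXPECTED_REFERENCES structures are hypothetical and would be replaced by the actual input specification of the ETL.

```python
import sqlite3

# Hypothetical ETL input specification: table -> {field: declared type}.
EXPECTED_SCHEMA = {
    "patients": {"patient_id": "INTEGER", "birth_date": "TEXT"},
    "visits": {"visit_id": "INTEGER", "patient_id": "INTEGER", "visit_date": "TEXT"},
}
# Hypothetical references: (child table, child field, parent table, parent field).
EXPECTED_REFERENCES = [("visits", "patient_id", "patients", "patient_id")]


def check_source(conn):
    """Return a list of human-readable problems found in the source tables."""
    problems = []
    present = set()  # (table, field) pairs that exist in the database
    for table, fields in EXPECTED_SCHEMA.items():
        # PRAGMA table_info yields one row per field: (cid, name, type, ...).
        info = conn.execute(f"PRAGMA table_info({table})").fetchall()
        if not info:
            problems.append(f"missing table: {table}")
            continue
        actual = {row[1]: row[2].upper() for row in info}
        for field, expected_type in fields.items():
            if field not in actual:
                problems.append(f"missing field: {table}.{field}")
                continue
            present.add((table, field))
            if actual[field] != expected_type:
                problems.append(
                    f"unexpected type: {table}.{field} is {actual[field]}, "
                    f"expected {expected_type}"
                )
    for child, child_field, parent, parent_field in EXPECTED_REFERENCES:
        if (child, child_field) not in present or (parent, parent_field) not in present:
            continue  # already reported as missing above
        # Count child rows whose key has no matching row in the parent table.
        orphans = conn.execute(
            f"SELECT COUNT(*) FROM {child} c LEFT JOIN {parent} p "
            f"ON c.{child_field} = p.{parent_field} "
            f"WHERE c.{child_field} IS NOT NULL AND p.{parent_field} IS NULL"
        ).fetchone()[0]
        if orphans:
            problems.append(
                f"referential integrity: {orphans} orphan row(s) in {child}.{child_field}"
            )
    return problems


# Tiny synthetic source database with two deliberate defects:
# visits.visit_date is absent and visit 11 points to a non-existent patient.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE patients (patient_id INTEGER, birth_date TEXT);
CREATE TABLE visits (visit_id INTEGER, patient_id INTEGER);
INSERT INTO patients VALUES (1, '1980-01-01');
INSERT INTO visits VALUES (10, 1), (11, 2);
""")
source_problems = check_source(conn)
for problem in source_problems:
    print(problem)
```

Keeping the expected schema in one data structure means the same specification can be reviewed together with the source data owners and reused every time a new data delivery arrives.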

A sufficient understanding of the strengths and weaknesses of the source data is essential for making strategic decisions about the data harmonization process and about the feasibility of analytic outcomes (i.e. what types of RWE questions can be answered once the data is harmonized in OMOP CDM). Two decisions are crucial. The first is how the data will be mapped to standard OMOP CDM concepts, which covers syntactic and semantic mapping as well as the design, development, implementation and testing of the ETL pipelines. The second is how the harmonized (OMOPed) data will be used, since the study objectives determine whether the data is fit for purpose and can answer specific research questions. Overall, understanding the source data is crucial for selecting the appropriate data quality verification steps and setting the final analytic goals.

2. Quality of ETL pipelines: end-to-end testing

The next essential step towards achieving high data quality in the target CDM is to carefully design the data conversion process, including syntactic and semantic mapping, and development of the accompanying ETL data conversion pipeline. Key requirements for this process include sufficient understanding of the source data and knowledge of the OMOP CDM, the model transformations and best data mapping practices.

Here, it is important to develop end-to-end tests to check the continuously evolving ETL pipelines. These checks test whether the transformation rules in the ETL pipelines have been implemented correctly. Given that we work with known input data, the expected OMOPed results can be defined and checked.

For the end-to-end tests, we use the testing framework of Rabbit in a Hat, an OHDSI tool maintained by The Hyve. The following steps are executed to test the ETL:

  1. The testing framework creates a skeleton based on the scan report of the data, which outlines the most frequently occurring variables in the source dataset.

  2. Based on the scan report, we create new synthetic data (e.g. a new patient) from which we would expect specific concepts or measurements.

  3. We compare the expected results with the actual outcomes to determine whether the ETL functions as expected.

Figure 2 illustrates test examples as part of the testing framework
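The idea behind these steps (known synthetic input, expected OMOPed output, comparison) can be sketched in a few lines. The example below is a minimal illustration and does not use the Rabbit in a Hat testing framework itself. The etl_person function and its mapping rules are hypothetical stand-ins for a real ETL pipeline; the gender concept IDs 8507 (male) and 8532 (female) are OMOP standard concepts, and 0 denotes "no matching concept".

```python
# Toy transformation standing in for an ETL pipeline (hypothetical mapping rules).
GENDER_CONCEPTS = {"M": 8507, "F": 8532}  # OMOP standard concepts: male, female


def etl_person(source_row):
    """Convert one synthetic source patient record into an OMOP-style person record."""
    return {
        "person_id": source_row["patient_id"],
        # Unmapped source values fall back to concept 0 (no matching concept).
        "gender_concept_id": GENDER_CONCEPTS.get(source_row["sex"], 0),
        "year_of_birth": int(source_row["birth_date"][:4]),
    }


def run_test(description, source_row, expected):
    """Run the ETL on known input and compare selected fields with expectations."""
    actual = etl_person(source_row)
    mismatches = {
        field: {"expected": value, "actual": actual.get(field)}
        for field, value in expected.items()
        if actual.get(field) != value
    }
    return {"test": description, "passed": not mismatches, "mismatches": mismatches}


# Each test declares a synthetic patient and the OMOPed values we expect for it.
results = [
    run_test(
        "female patient maps to concept 8532",
        {"patient_id": 1, "sex": "F", "birth_date": "1975-06-30"},
        {"gender_concept_id": 8532, "year_of_birth": 1975},
    ),
    run_test(
        "unknown sex code maps to concept 0",
        {"patient_id": 2, "sex": "X", "birth_date": "2001-01-15"},
        {"gender_concept_id": 0},
    ),
]
for result in results:
    status = "PASS" if result["passed"] else "FAIL"
    print(status, result["test"], result["mismatches"])
```

Because the input is synthetic and fully known, a failing test points at the transformation logic rather than at the source data, which is what makes such tests useful while the ETL is still evolving.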

3. Target data: Does the harmonized data comply with DQ checks?

Once the ETL pipeline is run on the source data and the data has been successfully converted to the OMOP CDM, there are a number of OHDSI data quality tools available to assess the quality of the OMOPed data. Such tools can be used to assess the conformance, completeness and plausibility of the output OMOPed data. This ensures that data conforms to the OMOP model and conventions.

To check the quality of the OMOPed data, we use the OHDSI Data Quality Dashboard (DQD) and Achilles tools, which we run in parallel on the OMOP database. The DQD includes no fewer than 4,000 customizable checks for OMOPed data. The checks are based on the rules set out by Kahn et al. (2016) [4] and score the three core components that define generic DQ. Each check tests whether the data complies with a given requirement, e.g. as specified in the OMOP CDM and its standard vocabularies.

  1. Conformance: Do data values adhere to specific standards and formats?

  2. Completeness: Is a particular variable present and do variables contain all recorded values?

  3. Plausibility: Are data values believable?

Each component can be evaluated based on internal (verification) and external (validation) references (see Figure 3), such as constraints, assumptions, knowledge and benchmarks.
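To make the three components more concrete, the sketch below applies one simplified check per component to a handful of person records and reports the fraction of violating rows. It is an illustration of the concepts only, not the DQD implementation. The records, the check wording and the plausibility bound of 1900 are assumptions chosen for the example.

```python
from datetime import date

# Synthetic OMOP-style person records with one deliberate defect per component.
persons = [
    {"person_id": 1, "gender_concept_id": 8507, "year_of_birth": 1980},
    {"person_id": 2, "gender_concept_id": 8532, "year_of_birth": None},  # incomplete
    {"person_id": 3, "gender_concept_id": "F", "year_of_birth": 1850},   # wrong type, implausible year
    {"person_id": 4, "gender_concept_id": 8532, "year_of_birth": 1999},
]

MIN_PLAUSIBLE_YEAR = 1900  # hypothetical lower bound, to be agreed per dataset


def failure_rate(rows, is_violation):
    """Fraction of rows that violate a rule."""
    return sum(1 for row in rows if is_violation(row)) / len(rows)


# One rule per component; each rule returns True when a row violates it.
checks = {
    "conformance: gender_concept_id is an integer":
        lambda r: not isinstance(r["gender_concept_id"], int),
    "completeness: year_of_birth is present":
        lambda r: r["year_of_birth"] is None,
    "plausibility: year_of_birth is within a believable range":
        lambda r: r["year_of_birth"] is not None
        and not (MIN_PLAUSIBLE_YEAR <= r["year_of_birth"] <= date.today().year),
}

dq_results = {name: failure_rate(persons, rule) for name, rule in checks.items()}
for name, rate in dq_results.items():
    print(f"{name}: {rate:.0%} of rows fail")
```

Reporting a failure rate rather than a plain pass or fail lets a team decide per check how many violations are acceptable for the intended use of the data.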

For more in-depth descriptions and examples of checks and evaluation rules, please take a look at the relevant chapter in the Book-of-OHDSI.

It is noteworthy that sensitive source data cannot be extracted from this tool, so sensitive patient information is not exposed. Achilles produces descriptive statistics of the data together with useful visualizations. This information also helps to spot and investigate inconsistencies.

Figure 3: Example of DQD Outcome

4. Source-to-target data quality checks

With the target data checked, it is time to proceed with checks that determine how well the source data and mapped data compare. These can be simple sanity checks, for example whether the number of patients in the mapped data matches the number in the source data. There are also more elaborate inquiries, such as checking whether extreme cases (e.g. 120-year-old persons or a blood pressure of 700/900 mmHg) have been detected.
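Both kinds of checks can be expressed as simple queries. The sketch below is a minimal illustration, assuming source and target tables are reachable from one SQL connection (SQLite is used so the example runs on its own). The source_patients table, its fields, and the REFERENCE_YEAR and MAX_AGE values are hypothetical; person, person_id and year_of_birth follow the OMOP CDM.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE source_patients (patient_id INTEGER, birth_year INTEGER);
CREATE TABLE person (person_id INTEGER, year_of_birth INTEGER);
INSERT INTO source_patients VALUES (1, 1950), (2, 1900), (3, 1988);
INSERT INTO person VALUES (1, 1950), (2, 1900);  -- patient 3 was lost in the ETL
""")

REFERENCE_YEAR = 2024  # hypothetical year of data extraction
MAX_AGE = 120          # hypothetical threshold for an extreme age


def count_rows(conn, table):
    """Number of rows in a table."""
    return conn.execute(f"SELECT COUNT(*) FROM {table}").fetchone()[0]


# Simple sanity check: do source and target hold the same number of patients?
source_count = count_rows(conn, "source_patients")
target_count = count_rows(conn, "person")
counts_match = source_count == target_count

# If the counts differ, list the source patients that did not arrive in the target.
missing_ids = [
    row[0]
    for row in conn.execute(
        "SELECT patient_id FROM source_patients "
        "WHERE patient_id NOT IN (SELECT person_id FROM person) "
        "ORDER BY patient_id"
    )
]

# More elaborate check: flag persons whose age in the reference year is extreme.
extreme_age_ids = [
    row[0]
    for row in conn.execute(
        "SELECT person_id FROM person WHERE ? - year_of_birth >= ? ORDER BY person_id",
        (REFERENCE_YEAR, MAX_AGE),
    )
]

print("source:", source_count, "target:", target_count, "match:", counts_match)
print("missing in target:", missing_ids)
print("extreme ages:", extreme_age_ids)
```

A count mismatch is not necessarily an error, since an ETL may exclude records on purpose. The list of missing identifiers is what allows the team to confirm with the source data owner whether each exclusion was intended.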

5. Replicate a study to compare results

One of the ultimate goals of data harmonization is unlocking the potential of Real World Data to better understand population health and patient journeys. To that end, organizations conduct large-scale studies using harmonized data. Such studies make use of highly variable source data collected by and available within one organization or data integrated from multiple organizations.

When CDM-harmonized datasets are to be used in studies (internal or external), certain issues might arise as the researchers progress with developing the study protocol and design. Commonly, a team of physicians and epidemiologists will create an initial study protocol and design the desired patient cohorts, for example using the OHDSI cohort development tool ATLAS. Here, the study team usually spends a considerable amount of time deciding on the correct vocabulary terms and translating the protocol phrasing into cohort-specific criteria. The decisions the team makes will depend partly on the vocabulary and the community guidelines and partly on any Data Partners participating in the study.

Cohort Diagnostics is an efficient tool from the OHDSI HADES suite for evaluating the created cohorts and for gaining insight into whether the data are fit for the envisaged study. It allows one to explore the patient counts for the various cohorts, and it helps determine whether transformation logic choices within the ETL result in the misplacement of patients in the OMOP CDM database. For example, if an outdated concept has been used for the main disease of interest, no patients may appear in the target cohort. This is a clear indication that something needs to be changed, either in the transformation logic or in the cohort definition.
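The outdated-concept scenario can be illustrated with a small query against the vocabulary and a clinical table. This is a simplified sketch and not how Cohort Diagnostics works internally. The table and column names follow the OMOP CDM (concept, condition_occurrence), while the concept IDs 1001 and 1002 and their names are made up for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE concept (
    concept_id INTEGER, concept_name TEXT,
    standard_concept TEXT, invalid_reason TEXT
);
CREATE TABLE condition_occurrence (person_id INTEGER, condition_concept_id INTEGER);
-- Hypothetical concepts: 1001 is the current standard concept, 1002 was retired.
INSERT INTO concept VALUES
    (1001, 'Example disease (current)', 'S', NULL),
    (1002, 'Example disease (retired)', NULL, 'U');
INSERT INTO condition_occurrence VALUES (1, 1001), (2, 1001);
""")


def diagnose_concepts(conn, concept_ids):
    """For each concept used in a cohort definition, report its status and patient count."""
    report = []
    for concept_id in concept_ids:
        row = conn.execute(
            "SELECT standard_concept, invalid_reason FROM concept WHERE concept_id = ?",
            (concept_id,),
        ).fetchone()
        persons = conn.execute(
            "SELECT COUNT(DISTINCT person_id) FROM condition_occurrence "
            "WHERE condition_concept_id = ?",
            (concept_id,),
        ).fetchone()[0]
        report.append({
            "concept_id": concept_id,
            # 'S' marks a standard concept in the OMOP vocabulary.
            "is_standard": row is not None and row[0] == "S",
            # A non-empty invalid_reason marks a deprecated or upgraded concept.
            "is_valid": row is not None and row[1] is None,
            "persons": persons,
        })
    return report


# A cohort definition built on the retired concept finds no patients at all.
cohort_report = diagnose_concepts(conn, [1002, 1001])
for entry in cohort_report:
    print(entry)
```

An empty cohort combined with a non-standard or invalid concept suggests the cohort definition should be updated; an empty cohort on a valid standard concept points back to the transformation logic in the ETL.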

In a nutshell

Checking and improving data quality is a critical step in harmonizing any source dataset to a common data model. Data quality assessment is key to an optimal data harmonization process and to fostering trust in the final OMOP databases. In turn, this enables RWE-generating collaborations with minimal obstacles related to data readiness. The Hyve team can support you in such a process by thoroughly investigating the quality of data in the CDM, employing both automated and manual approaches. Over the years, we have gained extensive knowledge of and expertise in the data quality assessment tools within the OHDSI toolset. This allows us to verify the quality of the data after each instance of data conversion into the OMOP CDM. By following this approach, we guarantee our customers end up with the highest quality of (federated) analysis-ready CDM-harmonized data.


[1] Roebuck, 2012. Data Quality: High-Impact Strategies - What You Need to Know: Definitions, Adoptions, Impact, Benefits, Maturity, Vendors.

[2] Weiskopf and Weng, 2013. Methods and dimensions of electronic health record data quality assessment: enabling reuse for clinical research. J Am Med Inform Assoc. 2013 Jan-Feb; 20(1): 144–151. doi: 10.1136/amiajnl-2011-000681

[3] Dasu and Johnson, 2003. Exploratory Data Mining and Data Cleaning.

[4] Kahn et al., 2016. A Harmonized Data Quality Assessment Terminology and Framework for the Secondary Use of Electronic Health Record Data. 2016 Sep 11;4(1):1244. doi: 10.13063/2327-9214.1244. eCollection 2016

