Gaps in data quality, data management and data governance are often cited as major barriers to the adoption of AI and machine learning in organizations. This article explores how the FAIR principles, together with data quality and reproducibility frameworks, can address these challenges and help companies implement data stewardship practices that facilitate data science.
Is FAIR data useful for machine learning?
At The Hyve, we have executed dozens of projects in the last few years to help customers make their data more FAIR in various ways: sometimes by assessing the FAIRness of data, sometimes by defining conventions for FAIRification, sometimes by building knowledge graphs. In such projects, the question inevitably arises: what are the tangible benefits of implementing the FAIR principles? In this blog, I want to highlight one aspect of the answer: how FAIR data benefits machine learning applications or, in other words, the ‘machine actionability’ of data. But first, let’s quickly revisit what the FAIR principles have to say about machine actionability.
Relation with the FAIR Principles
Which aspects of machine actionability of data are in the scope of the FAIR principles?
The latest ‘authoritative’ paper regarding the interpretation of the FAIR principles is Jacobsen et al. 2020. It features the maxim introduced by Barend Mons et al.: FAIR requires that “the machine knows what we mean”. If we zoom in on the semantic aspects of machine actionability, the following principles are of particular importance:
- F2: Rich metadata: machine-actionable metadata allow the machine to discover relevant data and services
- I3: Qualified references to other data: allow the machine to detect and merge digital resources that relate to the same topic or entity, and to detect that a digital entity is compatible with an online service
- Reusability: “Digital resources are sufficiently well described for both humans and computers, such that a machine is capable of deciding: (R1) if a digital resource should be reused (i.e., is it relevant to the task at-hand?); (R1.1) if a digital resource can be reused, and under what conditions (i.e., do I fulfill the conditions of reuse?); and (R1.2) who to credit if it is reused.”
The interpretation of reusability is especially interesting. The original definition of R1 is “(Meta)data are richly described with a plurality of accurate and relevant attributes”. The question ‘what does a machine need to decide whether a dataset is relevant to the task at hand?’ is broader. Even if we assume that attributes are sufficient to facilitate this decision making, the next question is: which attributes are needed for a given class of machine actors, or for a particular purpose? To date, little work seems to have been done to answer this question. R1 typically has few or no FAIR Maturity Indicators associated with it yet, so this would be a good avenue to explore further.
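To make this concrete, here is a minimal sketch of what a ‘plurality of accurate and relevant attributes’ could look like as a schema.org Dataset record in JSON-LD. The attribute selection, identifiers and values are illustrative assumptions, not a prescribed metadata profile.

```python
# Hypothetical example of 'rich metadata' (F2, R1) for a dataset, expressed
# as a schema.org Dataset in JSON-LD. All identifiers and values are made up.
import json

dataset_metadata = {
    "@context": "https://schema.org/",
    "@type": "Dataset",
    "@id": "https://example.org/dataset/42",  # globally unique, persistent identifier (F1)
    "name": "Cohort blood-pressure measurements 2020",
    "license": "https://creativecommons.org/licenses/by/4.0/",  # conditions of reuse (R1.1)
    "creator": {"@type": "Organization", "name": "Example Lab"},  # who to credit (R1.2)
    "variableMeasured": ["systolic blood pressure", "diastolic blood pressure"],
    "measurementTechnique": "oscillometric monitor",
}

print(json.dumps(dataset_metadata, indent=2))
```

A machine actor looking for training data could match attributes such as variableMeasured and measurementTechnique against its requirements before ever downloading the data.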
Which aspects of machine actionability of data are out of scope of the FAIR principles?
The FAIR principles do not cover aspects such as “ethics, privacy, reproducibility, data or software quality per se” (Mons et al. 2020). Some of these aspects are crucial to, and inherent in, machine actionability, in particular reproducibility and data quality. It may seem a mistake that the FAIR principles do not explicitly include these aspects, and people have called this out in the past, even asking for a ‘new letter’, for example a ‘U’ for Utility (now covered in the explanation of R1). However, because the utility of data is so dependent on context, especially the intended use of the dataset, these topics may require guidelines and frameworks of their own. Now, let’s take a look at reproducibility and data quality in a bit more detail.
Reproducibility
In their consensus report on reproducibility and replicability in science, the U.S. National Academies of Sciences, Engineering, and Medicine defined reproducibility as “obtaining consistent computational results using the same input data, computational steps, methods, code, and conditions of analysis”, and replicability as “obtaining consistent results across studies aimed at answering the same scientific question, each of which has obtained its own data”.
This definition of reproducibility focuses on computational reproducibility. When it comes to measuring data from instruments, the field of metrology has developed its own ‘vocabulary of metrology’ and has attempted to systematically identify uncertainty in measurement. To achieve replicability of, for instance, omics measurements, biological variability, uncertainty in measurement and computational reproducibility all contribute to the experimental results. Making these factors explicit in the metadata of the dataset – preferably by design and at the source – will greatly help data scientists who need to work with the datasets. A major part of reproducibility is thus actually covered by the FAIR principles (R1.2, F2). Some of our recent projects focused on making these factors explicit in experimental datasets, for example using PROV-O and the OBO Foundry ontologies.
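As an illustration of this approach, the snippet below records basic provenance for a dataset with PROV-O using rdflib. The dataset, activity and agent identifiers are hypothetical placeholders, not taken from an actual project.

```python
# Minimal sketch: attaching PROV-O provenance to a dataset with rdflib.
# All identifiers under the 'ex' namespace are hypothetical placeholders.
from rdflib import Graph, Literal, Namespace
from rdflib.namespace import RDF, XSD

PROV = Namespace("http://www.w3.org/ns/prov#")
EX = Namespace("https://example.org/")

g = Graph()
g.bind("prov", PROV)
g.bind("ex", EX)

# The dataset is an entity generated by a concrete measurement activity.
g.add((EX.dataset1, RDF.type, PROV.Entity))
g.add((EX.rnaseq_run_42, RDF.type, PROV.Activity))
g.add((EX.dataset1, PROV.wasGeneratedBy, EX.rnaseq_run_42))

# Record inputs, the responsible agent and timing, so a machine can trace
# measurement conditions and computational steps back to their source.
g.add((EX.rnaseq_run_42, PROV.used, EX.raw_fastq_batch_7))
g.add((EX.rnaseq_run_42, PROV.wasAssociatedWith, EX.sequencing_facility))
g.add((EX.rnaseq_run_42, PROV.endedAtTime,
       Literal("2021-06-01T12:00:00", datatype=XSD.dateTime)))

print(g.serialize(format="turtle"))
```

From such a graph, a machine can trace which activity generated the data, what its inputs were and when it ran – exactly the factors that determine whether a result can be reproduced.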
Data quality
Data quality is a term that can be understood in many ways. In an enterprise context, it often refers to master data management as defined by the ISO 8000 standards. In science, the quality of data is closely linked to the suitability of the data for (re)use for a particular purpose. Many frameworks for overall data quality have been proposed in the literature, such as the four data quality dimensions (Accuracy, Relevancy, Representation, Accessibility) by Wang, the five C’s of Sherman (Clean, Consistent, Conformed, Current and Comprehensive), and the three categories from Kahn (Conformance, Completeness and Plausibility). Kahn also proposes two different modes to evaluate these components: verification (focusing on intrinsic consistency, such as adherence to a format or a specified value range) and validation (focusing on the alignment of values with external benchmarks).
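As a toy illustration of this distinction, the sketch below implements one verification check and one validation check over a tabular dataset. The column names, value ranges and benchmark value are assumptions made for the example, not part of the Kahn framework itself.

```python
# Minimal sketch of Kahn-style data quality checks using pandas.
# 'Verification' tests intrinsic consistency (formats, value ranges);
# 'validation' compares observed values against an external benchmark.
import pandas as pd

def verify(df: pd.DataFrame) -> dict:
    """Verification: intrinsic checks, reported as the fraction of conforming rows."""
    return {
        "birth_date_parseable": pd.to_datetime(df["birth_date"], errors="coerce").notna().mean(),
        "heart_rate_in_range": df["heart_rate"].between(20, 250).mean(),
        "patient_id_complete": df["patient_id"].notna().mean(),
    }

def validate(df: pd.DataFrame, benchmark_prevalence: float) -> dict:
    """Validation: compare an observed statistic with an external benchmark."""
    observed = (df["diagnosis"] == "T2DM").mean()
    return {
        "t2dm_prevalence_observed": observed,
        "t2dm_prevalence_expected": benchmark_prevalence,
        "plausible": abs(observed - benchmark_prevalence) < 0.05,  # arbitrary tolerance
    }
```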
In the OHDSI community, we have documented in the Book of OHDSI how we understand these terms and how data quality, clinical validity, software validity and method validity all contribute to the eventual quality of the generated medical evidence. Again, for any specific intended use and context, some of these frameworks could be used to create computational representations of the data quality of a dataset, such as the one visualized in the OHDSI Data Quality Dashboard, which leverages the Kahn framework referenced above.
Machine learning
In machine learning, an important balance to maintain is that between adjusting the search space and environment specification (see e.g. Gym environment specifications in reinforcement learning) on the one hand, and optimizing the learning method and model on the other. If the problem is not well defined, the learning outcomes will likely not be useful. In practice, however, defining the search space is a large part of the work of a data scientist, and another approach could be to formulate the data requirements for every model version. This would mean starting from the learning goals rather than from the existing data, and describing the requirements for data content, quantity and quality in a systematic way. These requirements can then be used to identify or generate the datasets needed to develop the model. The FAIR principles are an excellent starting point to facilitate the creation of this feedback loop between data generators (data entry systems, lab equipment, data processing pipelines, et cetera) and data scientists. Popular open science machine learning websites such as PapersWithCode and OpenML demonstrate how powerful even a basic annotation of datasets, workflows and model runs can be.
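As a thought experiment, such data requirements could be expressed as a small machine-readable specification and checked against the metadata a FAIR data catalogue exposes. Everything in this sketch (field names, thresholds, the shape of the metadata record) is a hypothetical illustration.

```python
# Hypothetical sketch: data requirements for a model version, checked against
# dataset metadata. Field names and thresholds are illustrative assumptions.
from dataclasses import dataclass

@dataclass
class DataRequirements:
    model_version: str
    required_variables: set[str]   # content: which attributes must be present
    min_records: int               # quantity: how many examples are needed
    max_missing_fraction: float    # quality: completeness threshold

    def satisfied_by(self, metadata: dict) -> bool:
        """Check a dataset's metadata record against these requirements."""
        return (self.required_variables <= set(metadata.get("variables", []))
                and metadata.get("record_count", 0) >= self.min_records
                and metadata.get("missing_fraction", 1.0) <= self.max_missing_fraction)

# Example: metadata as it might be harvested from a FAIR data catalogue.
reqs = DataRequirements("churn-model-v3", {"age", "tenure", "plan"}, 10_000, 0.05)
dataset_metadata = {"variables": ["age", "tenure", "plan", "region"],
                    "record_count": 25_000, "missing_fraction": 0.02}
print(reqs.satisfied_by(dataset_metadata))  # True
```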
Conclusion
As we have shown, almost all aspects of data utility for machine learning are covered explicitly or implicitly by the FAIR principles. The aspects of machine actionability of data can be ordered by whether they are mostly intrinsic to the dataset (e.g. does the dataset have a globally unique and persistent identifier?) or mostly dependent on the intended use. This ordering results in a ‘gradient’ that starts with elements that are explicit FAIR principles and can then be augmented by domain-specific frameworks for data quality, reproducibility and fitness for purpose (see visualization below).
In this visualization, the machine actionability aspects are not only ordered, but also grouped, revealing three different angles from which you could improve the machine actionability of data in your organization:
- Framed by the FAIR principles, the question is which attributes (R1) of datasets are necessary for a given class of machine learning algorithms, to decide whether a digital resource can be used for the problem at hand.
- Framed by data quality aspects, we could explore what a data quality framework that informs machine actionability of data would look like.
- Framed by data requirements for machine learning, we could explore ontologies for explicitly expressing the data content, quantity and quality requirements for machine learning algorithms.
Let's start collaborating
The Hyve’s consultants can help you get started with a FAIR assessment or help you develop and implement data quality and data reproducibility frameworks for your organization and domain. To discuss the options, please reach out to The Hyve and book a free consultation with one of our FAIR experts!