Data Collaborations in Life Sciences: Comparing Centralized, Federated, and TRE Architectures

In the ever-modernizing landscape of data-driven research in the Life Sciences, the way health-related information is stored, accessed, and processed plays a crucial role in determining the efficacy and ethicality of research efforts. The traditional centralized approach has long been the go-to model, but with growing concerns around privacy, scalability, and inclusivity, federated approaches are gaining traction, especially in areas involving distributed analysis and machine learning algorithms, which often require harmonized datasets across a distributed collaborative network.

This article primarily explores the differences between centralized and federated approaches in Life Science research, with a specific focus on their key features and implementation considerations. We also explore Trusted Research Environments (TREs), which are in essence a “hybrid model” that includes both centralized and decentralized components.

We will not address fully decentralized models in this article; they are not commonplace in life science research, but they are becoming increasingly popular in cutting-edge technology fields that use technologies such as blockchain and advanced AI.

We will focus on the three main data collaboration architectures commonly used within Life Science research: centralized (data centralization), TREs (Trusted Research Environments with data minimization), and federated architectures, illustrated in the following image:

Data Collaborations: 3 Leading Architectures

Let us take a look at each architecture one at a time and talk about their key features and implementation considerations.

The Centralized Research Approach

In the centralized research approach, data and computational resources are aggregated and controlled by a single, central entity or location. Researchers transfer or upload data to a central server or platform, where algorithms are trained, models are built, and analyses are performed. This has been the conventional model in many industries, from academia to private sector research, but it is increasingly being challenged in the Life Sciences sector due to privacy, scalability, and stability concerns.

Let’s look at the key features and implementation considerations of the centralized research approach:

Key Features:

  • Single Point of Control: All data flows into and is processed by a central authority, whether a university, company, or government organization.
  • Efficient Management: Centralized systems can be highly efficient in terms of data management, computation, and decision-making processes, as everything occurs in one place.
  • Unified Standards: Researchers operate under a common framework, ensuring standardization of methodologies, tools, and data formats.

Implementation Considerations:

  • Ease of Coordination: It is easier than in decentralized models to organize and manage research efforts, as a single entity controls the resources.
  • Strong Analytical Power: With access to large, aggregated datasets, Centralized systems can harness powerful computational tools and AI to extract insights.
  • Regulatory Control: A central authority can ensure that data security and privacy regulations and ethical standards are adhered to.
  • Privacy Concerns: Aggregating data in a single location can lead to concerns over data breaches or misuse, especially when sensitive information is linked (e.g., medical or personal health records).
  • Scalability Issues: As datasets grow, centralized systems may struggle to scale efficiently without massive investments in infrastructure.
  • Single Point of Failure: If the central system is compromised or malfunctions, the entire research effort could be jeopardized.

The Trusted Research Environment

A Trusted Research Environment (TRE), also known as a Secure Research Environment (SRE), is a controlled digital infrastructure that provides researchers with secure access to sensitive data while adhering to privacy, legal, and ethical regulations. TREs are commonly used in fields such as Life Sciences, healthcare, finance, and public policy, where sensitive data, such as patient records or personal financial information, must be analyzed under strict security conditions.

Rather than giving researchers direct access to raw data, TREs offer a highly controlled environment where researchers can apply to access specific datasets for analysis and processing, while minimizing risks of data leakage, unauthorized access, or misuse. This allows organizations to strike a balance between enabling research and safeguarding individual privacy.

Key Features:

  • Data Access Control and Auditability: TREs enforce strict data access control, ensuring that only authorized users can access sensitive datasets, and provide auditability by maintaining detailed logs of user activity.
  • Secure and Isolated Computing Environment: TREs offer researchers access to a secure, isolated computing environment, where all analysis and computation take place within the protected system, usually a dedicated analysis platform.
  • De-identification and Data Minimization: TREs often incorporate de-identification or anonymization techniques, ensuring that personally identifiable information (PII) is either removed or masked before data is made available for analysis. They also apply data minimization, whereby researchers are only given access to the specific datasets and variables necessary for their study, rather than full, unrestricted datasets (see the sketch after this list).
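
To make the de-identification and data minimization steps more concrete, here is a minimal, illustrative sketch in Python using pandas. The dataset, column names, and salting scheme are hypothetical and greatly simplified compared to what a production TRE would do; in practice these steps are performed and audited by the data custodian or TRE operator, not by the individual researcher.

```python
import hashlib

import pandas as pd

# Hypothetical source extract held by a data custodian; column names are illustrative.
records = pd.DataFrame({
    "patient_name": ["A. Jansen", "B. de Vries"],
    "national_id":  ["123456789", "987654321"],
    "birth_date":   ["1956-03-02", "1971-11-15"],
    "diagnosis":    ["E11", "I10"],          # ICD-10 codes
    "postcode":     ["3511", "1012"],
})

# De-identification: drop direct identifiers and replace them with a salted,
# one-way pseudonym so records can still be linked within the study.
SALT = "study-specific-secret"               # would be managed by the TRE, not the researcher

def pseudonymize(value: str) -> str:
    return hashlib.sha256((SALT + value).encode()).hexdigest()[:12]

records["pseudonym"] = records["national_id"].map(pseudonymize)
records = records.drop(columns=["patient_name", "national_id"])

# Data minimization: the researcher only receives the variables approved
# in the study protocol, e.g. a generalized year of birth and the diagnosis.
records["birth_year"] = pd.to_datetime(records["birth_date"]).dt.year
approved_columns = ["pseudonym", "birth_year", "diagnosis"]
minimized_view = records[approved_columns]

print(minimized_view)
```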

Implementation Considerations:

  • Enhanced Data Privacy and Security: TREs safeguard data privacy by keeping sensitive data within a secure environment and applying robust encryption, de-identification, and access control measures. TREs ensure compliance with data protection regulations such as GDPR (General Data Protection Regulation) and HIPAA (Health Insurance Portability and Accountability Act).
  • Compliance with Ethical and Legal Standards: TREs are built to meet rigorous ethical and legal standards, making them ideal for research that requires compliance with data protection laws and research ethics.
  • Trust and Collaboration: TREs foster trust among data custodians, researchers, and the public by creating an environment where sensitive data can be used responsibly, and they facilitate collaborative research by allowing multiple researchers or institutions to access the same dataset within a secure, shared environment.
  • Limited Flexibility for Researchers: TREs can be restrictive in terms of data access and analytical flexibility. Researchers often have to work with pre-approved software tools and data configurations, which may limit their ability to customize analyses or apply novel techniques.
  • Resource-Intensive Setup and Maintenance: TREs require significant investment in both technology and administrative oversight. TREs must be equipped with advanced security features, encryption, authentication protocols, and continuous monitoring to ensure compliance with data protection laws. The cost and complexity of setting up a TRE can be a barrier to adoption, especially for smaller organizations.
  • Challenges in Data Availability and Sharing: TREs can complicate and limit data sharing between organizations and researchers in different locations. Since data cannot leave the secure environment, researchers must collaborate within the confines of the TRE, which can slow down the process and limit access to external data sources. Moreover, data custodians may impose additional restrictions on data access, further complicating research workflows.

The Federated Research Approach

The Federated Research Approach distributes computation across multiple decentralized entities or nodes. A key distinguishing factor from other models is that the data remains in its original location, while only the necessary insights or model updates are shared with the central coordinating entity. This method is commonly used in applications such as federated learning (distributed training of algorithms), where privacy and data ownership are critical, since the data remains at the site of a data custodian (e.g. a hospital, biobank, patient registry or academic research center).

It is worth distinguishing at this point between the two main types of federated research applications. Federated analysis (or federated analytics) involves running an analysis at each federated data node and sharing only the resulting statistics, while federated learning involves training algorithms on federated data nodes and sharing only model updates to improve performance. We discuss both, with examples of each, a little later in this article.
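
As a minimal illustration of the federated analysis side of this distinction, the sketch below shows how a coordinator can compute a network-wide statistic while each node returns only aggregates. The node names and data are invented for the example; the federated learning pattern, where model parameters rather than summary statistics are exchanged, is sketched later in the Vantage6 section.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical local datasets held by three data custodians. In a real federated
# network the coordinator never sees these arrays, only the aggregates returned below.
node_data = {
    "hospital_a": rng.normal(62, 8, size=500),    # e.g. age at diagnosis
    "registry_b": rng.normal(58, 9, size=1200),
    "biobank_c":  rng.normal(65, 7, size=300),
}

# Federated analysis: each node runs the same analysis locally and shares only
# aggregate results (a count and a sum), never patient-level records.
local_results = {name: {"n": len(ages), "sum": float(ages.sum())}
                 for name, ages in node_data.items()}

# The coordinator combines the aggregates into a network-wide statistic.
total_n = sum(r["n"] for r in local_results.values())
network_mean = sum(r["sum"] for r in local_results.values()) / total_n
print(f"Pooled mean age across the network: {network_mean:.1f} (n={total_n})")
```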

Key Features:

  • Distributed Data: Data is stored locally at individual nodes (e.g., hospitals, universities, or personal devices), which allows research to be conducted without centralizing sensitive information.
  • Analysis and Model Training: Instead of sending data to a central location, analyses are run and machine learning models are trained locally at each node, and only insights or model updates are shared centrally.
  • Privacy by Design: Data never leaves its local environment, reducing the risk of breaches or privacy violations.

Implementation Considerations:

  • Enhanced Privacy: Since raw data never leaves the original location, federated approaches significantly reduce privacy risks, making them ideal for sensitive fields such as healthcare research.
  • Better Scalability: Federated systems can scale more easily, as they distribute computational workloads across multiple entities.
  • Reduced Latency: Processing occurs locally, which can improve response times and reduce the load on central servers.
  • Complex Coordination: Managing a federated system is more complex than a centralized one, as the central authority must ensure that all nodes operate under the same legal and governance guidelines and that updates are correctly synchronized.
  • Heterogeneous Data Sources: Data across different nodes may be heterogeneous, requiring additional work to harmonize inputs and ensure meaningful analysis.
  • Computation Costs: Since processing is done locally, each node needs sufficient computational power, which may not always be available, especially in resource-constrained environments.

Federated network approaches represent a semi-decentralized system due to the distributed nature of the data nodes in the network, which are often in different geographic locations and managed by different data custodians (e.g. hospitals, registries, research universities, biobanks, etc.). They retain some centralized features, such as a central coordinating authority that ensures the same legal and governance guidelines are adhered to and that updates are correctly coordinated and synchronized. They also require data to be mapped to a Common Data Model (CDM) so that the ontologies and vocabularies of the data sources are harmonized for coordinated analysis.

Choosing the Best Approach for your Research Needs

The choice between a centralized, TRE, or federated architecture will depend on the specific goals and constraints of the project. Above, we have discussed the key features and implementation considerations you will need to be aware of for each approach.

In summary, centralized approaches offer simplicity and strong analytical power, but they may falter when dealing with privacy concerns or large-scale, distributed datasets. Trusted Research Environments (TREs) with data minimization are common in research because they provide researchers with secure access to sensitive data, while adhering to privacy, legal, and ethical regulations within a central platform. Federated approaches provide a compromise, allowing for collaboration without sacrificing data privacy and security, and they have the ability to scale globally as the federated network grows and gains adoption by researchers.

The future of research will most certainly be a blend of the three approaches we have outlined in this article, with fields like the life sciences and global collaborative efforts embracing federated models, which are gaining popularity as they enable the adoption of advanced analytical methods such as artificial intelligence and machine learning. This is an area that will undoubtedly transform the way we identify, develop, and commercialize drugs and devices to bring more innovative, life-improving therapies into the hands of patients who need them.

Example of a Federated Research Network: The IMI EHDEN Project

Let’s take a deeper dive into federated approaches within Life Sciences Research, specifically with a leading example that anyone working in the Life Sciences with the ambition to generate scientific evidence from Real World Data (RWD) should be aware of.

The Hyve is a key contributor to the IMI EHDEN (Innovative Medicines Initiative European Health Data & Evidence Network) project and consortium (https://www.ehden.eu/), which is committed to developing a European ecosystem that allows large-scale analysis of Real World Data. It is achieving this through data harmonization to a common data model (CDM) and by developing the required infrastructure for a federated network at scale across Europe. The consortium also develops innovative research methods and provides education in an open science collaboration. EHDEN is therefore a perfect example of an initiative that is building the foundations for a growing federated network that will increasingly enable federated analysis across European real world health datasets.

Specifically, EHDEN has created a growing network currently consisting of 198 Data Partners (e.g. hospitals and registries) in 30 countries across the European region, with more than 850 million anonymous health records being harmonized to the OMOP CDM. In parallel, 64 Small and Medium-sized Enterprises (SMEs) across 22 countries have been trained and certified to facilitate the OMOP CDM conversion of the Data Partners' source data.

In essence, the harmonization of the different data sources that reside securely at the different contributing data partners (e.g., hospitals, research universities, registries, biobanks, etc.) is one of the key enablers of the federated approach. In this case the CDM is the OMOP CDM, the Observational Medical Outcomes Partnership Common Data Model: an open community data standard designed to standardize the structure and content of observational data and to enable efficient analyses that can produce reliable evidence (see the OHDSI open source community, https://www.ohdsi.org/, for tools to work with and analyze OMOP datasets).
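
To give a feel for what harmonization to the OMOP CDM involves, here is a deliberately simplified, hypothetical sketch of mapping one source EHR record into OMOP-style PERSON and CONDITION_OCCURRENCE tables. The source field names and mapping tables are invented for illustration; real ETL pipelines use the full OHDSI standardized vocabularies and tooling (e.g. White Rabbit and Rabbit-in-a-Hat), and the concept IDs shown should be treated as examples only.

```python
import pandas as pd

# Hypothetical source extract from a hospital EHR; field names are illustrative.
source = pd.DataFrame([
    {"mrn": "A-0042", "sex": "F", "dob": "1971-11-15",
     "dx_code": "E11.9", "dx_date": "2020-06-01"},   # ICD-10-CM code for type 2 diabetes
])

# Tiny stand-in mapping tables; in practice these come from the OHDSI
# standardized vocabularies rather than being hard-coded.
GENDER_CONCEPTS = {"M": 8507, "F": 8532}              # OMOP standard gender concept IDs
CONDITION_CONCEPTS = {"E11.9": 201826}                # illustrative standard concept for type 2 diabetes

# OMOP CDM PERSON table (subset of columns)
person = pd.DataFrame({
    "person_id": [1],
    "gender_concept_id": [GENDER_CONCEPTS[source.loc[0, "sex"]]],
    "year_of_birth": [pd.to_datetime(source.loc[0, "dob"]).year],
})

# OMOP CDM CONDITION_OCCURRENCE table (subset of columns)
condition_occurrence = pd.DataFrame({
    "condition_occurrence_id": [1],
    "person_id": [1],
    "condition_concept_id": [CONDITION_CONCEPTS[source.loc[0, "dx_code"]]],
    "condition_start_date": [pd.to_datetime(source.loc[0, "dx_date"])],
    "condition_source_value": [source.loc[0, "dx_code"]],   # original source code is retained
})

print(person)
print(condition_occurrence)
```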

Performing Federated Analysis on a Federated Network

Now let’s switch gears a little and talk about performing distributed and federated analysis on a federated network, and some of the key advantages this brings to researchers and other stakeholders. Recall that one of the key features of a federated approach is that the data no longer has to move, reducing security and privacy risks. Instead, the software moves to the different data nodes to perform an analysis (or to be trained, in the case of machine learning algorithms).

Because federated networks like IMI EHDEN harmonize the data at each site to a common data model (CDM), analyses performed by study code (packaged as analysis packages built with input from a study protocol) work across the entire network regardless of the source of the data or the location of the site. This enables large-scale distributed analysis across a federated network and allows researchers to efficiently and effectively generate evidence from raw data at sites they would normally not be able to access without a federated network. You can see how this works in the slide below, where multiple study packages can run on all data sources thanks to the implementation of a Common Data Model across the different data sources (sites).
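
The sketch below illustrates this idea under simplified assumptions: because every site exposes the same OMOP-style table structure, the identical study query can run locally at each data partner, and only aggregate counts ever leave the sites. The in-memory databases, data, and concept IDs are invented for the example and stand in for what a real OHDSI study package would do at much larger scale.

```python
import sqlite3

# The same study code runs unchanged at every site because each site exposes
# the same OMOP CDM tables; only the aggregate result leaves the site.
STUDY_QUERY = """
    SELECT COUNT(DISTINCT person_id)
    FROM condition_occurrence
    WHERE condition_concept_id = 201826   -- illustrative concept ID for type 2 diabetes
"""

def load_demo_site(rows):
    """Build a tiny in-memory OMOP-style database standing in for one data partner."""
    con = sqlite3.connect(":memory:")
    con.execute("CREATE TABLE condition_occurrence (person_id INTEGER, condition_concept_id INTEGER)")
    con.executemany("INSERT INTO condition_occurrence VALUES (?, ?)", rows)
    return con

# Hypothetical, already-harmonized data at three sites (never pooled centrally).
sites = {
    "hospital_a": load_demo_site([(1, 201826), (2, 201826), (3, 4329847)]),
    "registry_b": load_demo_site([(1, 201826), (2, 320128)]),
    "biobank_c":  load_demo_site([(1, 4329847)]),
}

# Each site executes the identical study code locally and shares only its count.
site_counts = {name: con.execute(STUDY_QUERY).fetchone()[0] for name, con in sites.items()}
print("Per-site case counts:", site_counts)
print("Network-wide total:", sum(site_counts.values()))
```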

Federated Approach: From Data to Evidence

Enabling Federated Learning on Federated Networks

A great example of an open-source platform designed to facilitate secure, privacy-preserving, and decentralized data analysis and machine learning across multiple organizations (i.e. data nodes) is Vantage6 (https://vantage6.ai/). It enables researchers and institutions to collaborate on large-scale studies without the need to centralize sensitive data. Vantage6 allows different organizations (e.g., hospitals, universities, research institutes, biotech & pharma companies) to perform machine learning and data analysis on distributed datasets without sharing raw data. Only insights and model updates are exchanged between participants.

This federated approach is built around Privacy by Design: data remains securely within each organization. Instead of moving data to a central location, Vantage6 moves the software (analytical packages or algorithms) to where the data is located, ensuring that sensitive information such as medical or personal data is protected throughout the research process.

It supports a decentralized Federated Architecture, allowing multiple nodes (organizations) to work together on complex research problems while maintaining full control over their own data. Once there is an agreement between the partners in a federated network, each participant contributes to model training or analysis by processing the data locally.
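
As a conceptual sketch of what such a platform orchestrates (not the Vantage6 API itself), the example below implements a toy federated averaging loop: each node trains a simple linear model on its own data and returns only the updated weights, which the coordinator combines, weighted by local sample size. All names, data, and the model are invented for illustration.

```python
import numpy as np

rng = np.random.default_rng(42)

def make_local_dataset(n, true_w=np.array([0.03, -0.5]), noise=0.1):
    """Hypothetical local dataset: predict a continuous outcome from two features."""
    X = rng.normal(size=(n, 2))
    y = X @ true_w + rng.normal(scale=noise, size=n)
    return X, y

# Each organization keeps its own data; dataset sizes differ per node.
nodes = {"hospital_a": make_local_dataset(400),
         "clinic_b":   make_local_dataset(150),
         "pharma_c":   make_local_dataset(800)}

def local_update(w, X, y, lr=0.1, epochs=5):
    """Train locally (here: gradient descent on a linear model) and return new weights."""
    for _ in range(epochs):
        grad = 2 * X.T @ (X @ w - y) / len(y)
        w = w - lr * grad
    return w

# Federated averaging: the coordinator sends the current global model to every
# node, receives only updated weights back, and averages them by sample size.
global_w = np.zeros(2)
for _ in range(10):
    updates = {name: (len(y), local_update(global_w, X, y)) for name, (X, y) in nodes.items()}
    total = sum(n for n, _ in updates.values())
    global_w = sum(n * w for n, w in updates.values()) / total

print("Global model weights after 10 federated rounds:", global_w)
```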

It is Scalable and Flexible as the technology can be applied across various industries and research areas. It's flexible enough to accommodate different types of data and analytical tasks, making it ideal for large-scale collaborative projects in fields like healthcare, genomics, proteomics and any form of data that is harmonized and adheres to FAIR principles - Findable, Accessible, Interoperable and Reusable.

Vantage6 even incorporates Advanced Security Measures, including blockchain technology, to ensure transparency and trust in data collaboration processes, while mitigating risks of data breaches or unauthorized access.

And possibly its best feature: Vantage6 is open source, allowing researchers and developers to customize and extend the platform according to their specific needs, ideally while aligning with the open-source community so that feature improvements can be worked on collaboratively.

Conclusion

In summary, this article has covered the key features and implementation considerations of centralized, Trusted Research Environment, and federated approaches to research within the Life Sciences.

While the future of data collaborations in scientific research will inevitably be a combination of these different approaches, chosen to fit a project's specific needs, we have emphasized the growing popularity of the federated approach to enabling data collaborations, especially in areas where there is an increasing demand for data to become machine learning-ready.

Finally, with the exponential growth and potential of generating evidence from real world data, these federated network approaches are going to be key for scientific researchers, and for the life science industry as a whole, to create novel and improved therapies that deliver more value, benefit patients, and improve quality of life, and ultimately to collaborate better in an open science ecosystem that enables better health.

If you would like to learn more about how The Hyve can help your organization implement a future proof data collaboration approach and strategy, feel free to contact Chris Baldwin at chris@thehyve.nl or via our website contact form.

Let's start collaborating

We offer:

  • Customized open source software without license costs, with little or no data transfer
  • Expertise in FAIRification of biomedical data
  • Tailored data analytics to fit your needs
  • Expertise on the implementation of data collaborations and data modeling to maximize data value

Fill in the form and we will get in touch