Internship assignment

When analysing health research data from different data sources in a single data warehouse, a flexible data model is required that still enables high performance for aggregate queries.

Currently, i2b2 [1] and tranSMART [2] use a star schema database design for storing observation data, with subjects, variables, hospital visits, date and observed value as dimensions. With a row-oriented relational database such as PostgreSQL or Oracle, performance on large datasets becomes problematic for complex queries (e.g. select all patients who smoke and have high blood pressure readings) combined with aggregate and sorting operations.
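To make this query pattern concrete, here is a minimal sketch in Python using the standard-library sqlite3 module. The table names are borrowed from i2b2, but the columns are heavily simplified, and the concept names, sample rows and the 140 mmHg threshold are illustrative assumptions, not taken from i2b2, tranSMART or any real dataset.

```python
import sqlite3


def build_demo_warehouse() -> sqlite3.Connection:
    """Create a tiny in-memory star schema with hypothetical sample rows."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(
        """
        CREATE TABLE patient_dimension (patient_id INTEGER PRIMARY KEY);
        CREATE TABLE concept_dimension (
            concept_id INTEGER PRIMARY KEY,
            name       TEXT
        );
        -- Central fact table: one row per observation (simplified columns).
        CREATE TABLE observation_fact (
            patient_id    INTEGER REFERENCES patient_dimension(patient_id),
            concept_id    INTEGER REFERENCES concept_dimension(concept_id),
            start_date    TEXT,
            text_value    TEXT,
            numeric_value REAL
        );
        """
    )
    conn.executemany(
        "INSERT INTO patient_dimension VALUES (?)", [(1,), (2,), (3,), (4,)]
    )
    conn.executemany(
        "INSERT INTO concept_dimension VALUES (?, ?)",
        [(1, "smoking_status"), (2, "systolic_bp")],
    )
    conn.executemany(
        "INSERT INTO observation_fact VALUES (?, ?, ?, ?, ?)",
        [
            (1, 1, "2020-01-01", "yes", None),
            (1, 2, "2020-01-01", None, 150.0),
            (1, 2, "2020-02-01", None, 145.0),
            (2, 1, "2020-01-01", "yes", None),
            (2, 2, "2020-01-01", None, 120.0),
            (3, 1, "2020-01-01", "no", None),
            (3, 2, "2020-01-01", None, 160.0),
            (4, 1, "2020-01-01", "yes", None),
            (4, 2, "2020-01-01", None, 155.0),
        ],
    )
    return conn


def smokers_with_high_bp(conn: sqlite3.Connection, threshold: float = 140.0):
    """Select smokers with high readings, then aggregate and sort per patient."""
    query = """
        SELECT bp.patient_id,
               COUNT(*)              AS n_high,
               AVG(bp.numeric_value) AS mean_sbp
        FROM observation_fact bp
        JOIN concept_dimension bc ON bc.concept_id = bp.concept_id
        WHERE bc.name = 'systolic_bp'
          AND bp.numeric_value > ?
          AND bp.patient_id IN (
              SELECT s.patient_id
              FROM observation_fact s
              JOIN concept_dimension sc ON sc.concept_id = s.concept_id
              WHERE sc.name = 'smoking_status' AND s.text_value = 'yes'
          )
        GROUP BY bp.patient_id
        ORDER BY mean_sbp DESC
    """
    return conn.execute(query, (threshold,)).fetchall()


if __name__ == "__main__":
    for row in smokers_with_high_bp(build_demo_warehouse()):
        print(row)
```

Each additional criterion adds another pass over the fact table, which is the kind of access pattern the benchmark is meant to stress.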

The literature provides evidence that column-oriented databases may perform better [3]. This assignment involves preparing a realistic, anonymised benchmark set and comparing the performance of different database implementations on this problem.

 

Description

  • Benchmark
    • Implement a scalable synthetic data generator, sufficient for benchmarking, that takes as input the statistical properties and structure of the Netherlands Twin Register (NTR), a population study dataset with around 200k individuals and over 1000 variables
    • Gather and define queries commonly causing performance bottlenecks on the NTR dataset
    • Extract and clean a large quantity of numerical and categorical variables, including demographics, vital signs, laboratory tests and medications, from health data associated with ~60,000 intensive care unit admissions. Transform these into a data warehousing schema.
    • Define a set of varied and meaningful queries on the hospital data warehouse schema
  • Database system performance comparison
    • Investigate alternative database systems and select a few for benchmarking based on their particular characteristics (e.g. MonetDB, ClickHouse, Druid)
    • Define performance metrics and ensure they will be reported consistently and accurately
    • Implement automated data loading and benchmarking for each of the selected database systems
    • Run benchmarking for a few database sizes
    • Investigate setting adjustments for each of the database systems individually and their effect on performance
    • Confirm that the synthetic data yields performance results equivalent to those on real data by running the same benchmark on the hospital dataset
  • PoC running tranSMART – Glowing Bear
    • Replace the PostgreSQL implementation in tranSMART based on the results of the benchmarking
  • Publication
    • Report a summary of the findings in a blog post for The Hyve
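
As a rough illustration of the benchmark items above, the sketch below shows one possible starting point in Python: a seeded generator that draws observations from per-variable statistical properties, and a small timing helper that reports the same metrics for every query. The variable specification (names, means, standard deviations, category weights) is entirely made up for illustration and is not derived from the NTR; the function names are hypothetical as well.

```python
import random
import statistics
import time
from typing import Callable, Iterator

# Hypothetical variable specification; the values are NOT real NTR statistics.
EXAMPLE_VARIABLES = {
    "systolic_bp": {"kind": "numeric", "mean": 125.0, "sd": 15.0},
    "smoking_status": {
        "kind": "categorical",
        "categories": ["yes", "no"],
        "weights": [0.2, 0.8],
    },
}


def generate_observations(
    n_patients: int, variables: dict, seed: int = 0
) -> Iterator[tuple]:
    """Yield (patient_id, variable, value) rows; scales linearly with n_patients."""
    rng = random.Random(seed)  # fixed seed keeps benchmark data reproducible
    for patient_id in range(1, n_patients + 1):
        for name, spec in variables.items():
            if spec["kind"] == "numeric":
                value = rng.gauss(spec["mean"], spec["sd"])
            else:
                value = rng.choices(spec["categories"], weights=spec["weights"])[0]
            yield (patient_id, name, value)


def time_query(run: Callable[[], object], repeats: int = 5) -> dict:
    """Run a query callable several times and report wall-clock statistics."""
    timings = []
    for _ in range(repeats):
        start = time.perf_counter()
        run()
        timings.append(time.perf_counter() - start)
    return {
        "median_s": statistics.median(timings),
        "min_s": min(timings),
        "max_s": max(timings),
    }


if __name__ == "__main__":
    rows = list(generate_observations(1000, EXAMPLE_VARIABLES))
    # Stand-in for a real database query: count smokers in memory.
    print(time_query(lambda: sum(1 for r in rows if r[2] == "yes")))
```

In a real benchmark, the callable passed to `time_query` would execute a query against each database system under test, so that all systems are measured with the same procedure.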

 

Profile of the candidate and some project characteristics

  • Computer science student
  • Graduation or internship assignment
  • University or higher professional education
  • Specific expertise:
    • experience with database schema design
    • experience with performance optimisations (indexes, query optimisation)
    • familiarity with relational databases, e.g., PostgreSQL
    • familiarity with column-oriented databases, e.g., MonetDB
  • Minimum duration of the internship
    • preferably an MSc project of at least 6 months, to cover the full scope, or
    • an internship of at least 3 months

 

The Hyve

The Hyve is a young, international IT company, employing more than 40 software developers, solutions architects and data engineers. We are dedicated to delivering solutions that support scientists in life sciences and healthcare R&D using open source software, open data and open standards. Our vision is to enhance the quality and impact of biomedical research by providing biomedical informatics solutions.

The Hyve is based in Utrecht, The Netherlands. The internship contributes to improving the open source software we work with. It is also part of a PhD project funded by a Marie Curie ITN.

 

Information

For more information, please contact:

 

———-

[1] Shawn N Murphy et al. (2010), Serving the enterprise and beyond with informatics for integrating biology and the bedside (i2b2). J Am Med Inform Assoc. 17(2). https://doi.org/10.1136/jamia.2009.000893.

[2] Elisabeth Scheufele et al. (2014), tranSMART: An Open Source Knowledge Management and High Content Data Analytics Platform. AMIA Jt Summits Transl Sci Proc. 2014. https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4333702/.

[3] Michael Stonebraker et al. (2005), C-Store: A Column-oriented DBMS. VLDB 2005. https://www.vldb.org/archives/website/2005/program/paper/thu/p553-stonebraker.pdf.