Schema and Data Validation of T1D Exchange Mapped Data Using the Pandera Framework

Authors: Brent Lockee; Mitchell Barnes; Emily Dewit; Mark C. Clements, MD, Ph.D.; Diana Ferro, Ph.D.

Children’s Mercy Hospitals
Kansas City, Missouri, United States
bclockee@cmh.edu

Background/Objective:
Data registries such as the T1D Exchange advance data-driven innovation by compiling data from centers across the nation and using those data to answer complex questions and develop strategies to improve patient outcomes. This requires member institutions to map, transform, and validate their data, which is a challenging and detail-oriented task. Our institution reduced submission errors and improved data quality by using the open-source framework Pandera (Bantilan, 2020) to validate data before submission.

Methods:
Pandera is a lightweight schema and data validation framework for Python (versions 3.7, 3.8, and 3.9). It allows users to define a schema for their data and to specify a wide variety of data quality checks. Our team translated the mapping documentation provided by the T1D Exchange into data tests in Python using Pandera. This validation step runs after data extraction and mapping and before any data are submitted to the T1D Exchange.

Results:
In the submission prior to using Pandera, the T1D Exchange reported 26 data schema and validation errors back to our institution. In the first monthly submission after adding Pandera schema validation to the workflow, only one error was reported.

Conclusions:
Adding a Pandera schema and data validation step to the data extraction, mapping, and validation pipeline reduced the number of errors reported per submission, from 26 to one in the first submission after the change. Pandera also provides a flexible framework that can be adapted as the T1D Exchange changes its requirements. Because the validation step is built with open-source tools, it can be easily shared with other member institutions.

Key Words:
Data Quality; Data Processing, Automatic; Data Sharing

Bantilan, N. (2020). pandera: Statistical Data Validation of Pandas Dataframes. Proceedings of the 19th Python in Science Conference, 116–124. https://doi.org/10.25080/MAJORA-342D178E-010
