Like many other industries, healthcare is experiencing exponential growth in data, its “new oil”. With the success of the Affordable Care Act and the digitization of health records, healthcare systems are more unified than ever – EHR systems are integrated with LIMS, PACS, and RIS; interoperability standards have grown significantly in both maturity and adoption; and end users are hungry to interact with more data at the right time in a unified manner. More data means more complex operations and more problems – specifically in the quality of the data. In this article, I’ll be focusing on one small but mighty part of the data quality problem in healthcare – third-party data.
At Qualytics, we work with various healthcare organizations whose data operations and data quality issues vary in complexity. One theme that is consistent across these organizations is the overall lack of quality in the data they get from third parties, whether that data comes from payors, health systems, wearables or other data generators. Root causes for data quality issues can often be grouped into six high-level categories:
- Operational – software releases, product evolution and KPI calculations that change over time and impact our data
- Human Error – data entry is inherently flawed, especially with systems not always consistently validating data inputs. Fat-fingering is a real problem
- Nature of Data – missing data, duplicate data, non-conformant data with inherent flaws
- Technical Issues – dataops is complex! We have hundreds, if not thousands, of disparate systems – whether SaaS or proprietary – with complex integrations
- Nature of third-parties – by nature, third-party data means limited control over the data – and should not be trusted by default
- Governance – often inadequate & decentralized; execution lags behind strategy
The issues we experience with incoming data from third parties can fall into any of these categories, but especially the Operational, Human Error, Nature of Data and Nature of third-parties categories. Clearly, we need to do something about our third-party data – but what’s the best practice here?
A default line of thinking would be to bring the data in, land it in your systems, and run some QA scripts to validate the data. This line of thinking is inherently flawed for two reasons: 1. Once data has landed in operational systems, it is nearly impossible to truly delete a bad data record; 2. Relying on manually developed QA scripts puts the onus on those scripts being current, relevant and broad enough to cover all use cases – a very difficult task to accomplish manually. So, this approach doesn’t really work. What else can we do?
An alternative approach would be to first bring the data into a clearinghouse environment, where data is validated with QA scripts before being let into operational systems. This is a better approach, but it is not scalable. Problem #2 from above persists, since manually developed scripts still need to be managed, and it is exacerbated by the fact that we have now added complexity to our data ingress: a clearinghouse plus additional data pipelines and ops that need to be maintained.
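To make the maintenance burden concrete, here is a minimal sketch of what a hand-written clearinghouse gate might look like. The record layout (`member_id`, `date_of_birth`, `payer_code`) and the function names are hypothetical, chosen only for illustration; they are not taken from any particular feed or product.

```python
from datetime import date

# Hypothetical required fields for a third-party eligibility feed (illustration only).
REQUIRED_FIELDS = ("member_id", "date_of_birth", "payer_code")


def validate_record(record: dict) -> list[str]:
    """Return a list of problems found in one record; an empty list means it passes."""
    problems = []
    for field in REQUIRED_FIELDS:
        if not record.get(field):
            problems.append(f"missing {field}")
    dob = record.get("date_of_birth")
    if dob:
        try:
            # Assumes dates arrive as ISO strings such as "1980-02-29".
            if date.fromisoformat(dob) > date.today():
                problems.append("date_of_birth is in the future")
        except ValueError:
            problems.append("date_of_birth is not an ISO date")
    return problems


def clearinghouse_gate(batch):
    """Split a batch into records allowed into operational systems and rejects."""
    accepted, rejected = [], []
    for record in batch:
        problems = validate_record(record)
        if problems:
            rejected.append((record, problems))
        else:
            accepted.append(record)
    return accepted, rejected


batch = [
    {"member_id": "M1", "date_of_birth": "1980-02-29", "payer_code": "P1"},
    {"member_id": "", "date_of_birth": "03/14/1985", "payer_code": "P1"},
]
accepted, rejected = clearinghouse_gate(batch)
```

Every rule in `validate_record` was written by hand, and every new field, format change or business rule from a third party means another manual edit, which is exactly the scaling problem described above.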
Having faced these issues many times, we have devised a better approach to this problem – and it comes in three distinct steps.
- Flexible 1st mile: incoming data can come in many shapes, sizes and locations. Whether it is files on object storage (think AWS S3), APIs, or shared data frames, we need to assert the ruleset in-place as part of the processing steps, enabling #2.
- Ability to infer data quality rules from historical data: the shapes and patterns of historical data can often be used by ML methods to automatically infer what the data should look like going forward. This often gets us to 60-70% coverage of the rule base from the get-go, enabling staff to focus on authoring the more complex business logic checks.
- Route anomalous data to an enrichment data store: when the ruleset is asserted on data in-flight, records failing assertions can be routed to a different location than the original destination – an enrichment table – where anomalous data is segregated from the good data. Staff can then focus their efforts on taking corrective action on these anomalies: applying light transformations, quarantining or dropping records, or kicking them back to the originating third party to address.
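The three steps above can be sketched in a few lines of Python. This is a simplified illustration under stated assumptions, not a description of any product's implementation: plain summary statistics (mean plus or minus three standard deviations, and the set of previously seen values) stand in for the ML methods mentioned above, and the field names (`heart_rate`, `unit`) and function names are hypothetical. Because the functions take any iterable of dictionaries, the same checks can run whether the records came from a file, an API or a data frame.

```python
from statistics import mean, stdev


def infer_rules(history, numeric_fields, categorical_fields, z=3.0):
    """Profile historical records into simple per-field rules.

    Needs at least two historical values per numeric field (stdev requirement).
    """
    rules = {}
    for field in numeric_fields:
        values = [r[field] for r in history if r.get(field) is not None]
        mu, sigma = mean(values), stdev(values)
        rules[field] = ("range", mu - z * sigma, mu + z * sigma)
    for field in categorical_fields:
        seen = {r[field] for r in history if r.get(field) is not None}
        rules[field] = ("allowed", seen)
    return rules


def failed_checks(record, rules):
    """Assert every inferred rule on one in-flight record; return the failures."""
    failures = []
    for field, rule in rules.items():
        value = record.get(field)
        if value is None:
            failures.append(f"{field}: missing")
        elif rule[0] == "range" and not (rule[1] <= value <= rule[2]):
            failures.append(f"{field}: outside inferred range")
        elif rule[0] == "allowed" and value not in rule[1]:
            failures.append(f"{field}: unseen value")
    return failures


def route(records, rules):
    """Send passing records to the destination and the rest to an enrichment store."""
    destination, enrichment = [], []
    for record in records:
        failures = failed_checks(record, rules)
        if failures:
            enrichment.append({"record": record, "failures": failures})
        else:
            destination.append(record)
    return destination, enrichment


# Hypothetical wearable readings used as history, then a new incoming batch.
history = [{"heart_rate": hr, "unit": "bpm"} for hr in (60, 62, 64, 66, 68)]
rules = infer_rules(history, ["heart_rate"], ["unit"])
incoming = [
    {"heart_rate": 65, "unit": "bpm"},
    {"heart_rate": 650, "unit": "bpm"},
    {"heart_rate": 70, "unit": "BPM"},
]
destination, enrichment = route(incoming, rules)
```

Each entry in `enrichment` keeps the original record alongside the reasons it failed, so staff reviewing anomalies can decide whether to transform, quarantine, drop or return the record without re-running the checks.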
Third-party data will continue to be a core part of interoperability between systems, as will the need for data quality checks in complex data ingress / egress workflows.