A Taxonomy of Healthcare Data Quality

August 29, 2023

Healthcare data supports many initiatives and decisions beyond clinical decision making. Data is relied upon for population metrics, value-based care and quality measure reporting, public health reporting and clinical research. High quality data is necessary for providers to make better diagnoses, choose the most appropriate treatment and provide better overall care. But how does an organization know if the data is reliable?

Charlie Harp, CEO of Clinical Architecture, presented a webinar on the Taxonomy of Healthcare Data Quality as a tool to assess the quality of healthcare data. He discussed the benefits of using a standard classification to assess data and shared a practical example using real-world medication data.

Watch the video to learn more:

The presentation concluded with a question-and-answer session. Below are additional questions asked during the presentation that could not be addressed in the time allotted.


I appreciate the measurability and application to standards. Have you considered incorporating the FAIR data principles about machine actionability in the taxonomy? They state that data should be findable, accessible, interoperable, and reusable with some more requirements.

Charlie Harp:

I did not consider the FAIR data principles while contemplating the HDQT, probably because they seem tangential to me. The FAIR principles aim to enhance data sharing and collaboration within the scientific community and enable broader use of data for many research purposes. In other words, FAIR is a guideline for how a data source can proactively organize data to enrich collaboration. What I was looking for was an objective taxonomy for organizing healthcare data quality dimensions based upon the nature of the issue, one that allows me to measure the quality of data I am receiving from others, regardless of their data management practices, so that I can determine the reliability of the data and provide logical and actionable feedback to the source for remediation. One of the challenges we have in healthcare, outside of research-oriented use cases, is that the people collecting data do not prioritize the collection of the data itself, as this activity is seen as a byproduct of the process of providing care. There is also the challenge that the software used to collect the data is not under the control of the people entering and managing the data, so implementing a thoughtful, data-oriented approach like the FAIR principles is not always pragmatically possible. I wish it were, as it would make our jobs much easier.


Have you started to gain insights into quantitative (computable) thresholds that can speak to the level of quality data elements have based on their use case? That is, after using your taxonomy and seeing the results, how does the data quality assessment get interpreted (quantified) in a clinical vs. financial vs. research context?

Charlie Harp:

We are still experimenting with the HDQT and have just scratched the surface of how it can be applied to various evaluation criteria (clinical, financial, research, etc.). Remember, the HDQT is a taxonomy of dimensions intended to organize quantitative results based on the nature of the qualitative issue. In theory, when establishing an evaluation criterion, you would decide which dimension or dimensions apply to that evaluation. When you execute the evaluation, the resulting percentages would be presented by HDQT dimension, and you would see patterns that provide insights into the root cause and potential remediation of those issues at the source or in transit. We are just starting to do this and, initially, are using USCDI (United States Core Data for Interoperability) as our evaluation criteria guinea pig.
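As a rough illustration of what organizing results by dimension could look like, the sketch below tallies the percentage of failed checks per dimension. The function name, the input shape and the dimension labels ("dimension_a", "dimension_b") are hypothetical placeholders, not the actual HDQT dimensions or any Clinical Architecture tooling.

```python
from collections import Counter

def failure_rates_by_dimension(results):
    """Return {dimension: percent of checks that failed}.

    `results` is an iterable of (dimension, passed) pairs, one per
    executed check. The dimension labels are whatever the evaluation
    criterion assigned; this sketch does not define them.
    """
    totals = Counter()
    failures = Counter()
    for dimension, passed in results:
        totals[dimension] += 1
        if not passed:
            failures[dimension] += 1
    return {d: 100.0 * failures[d] / totals[d] for d in totals}

# Hypothetical check outcomes; the labels are placeholders only.
results = [
    ("dimension_a", True),
    ("dimension_a", False),
    ("dimension_b", True),
    ("dimension_b", True),
    ("dimension_b", True),
    ("dimension_b", False),
]
rates = failure_rates_by_dimension(results)
for dimension, pct in sorted(rates.items()):
    print(f"{dimension}: {pct:.1f}% of checks failed")
```

Grouping by dimension rather than by individual check is what would let patterns stand out, for example one dimension failing far more often for a particular source.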


In previous versions of Clinical Architecture Symedical, there were suggestions for manual concept mapping. How can this enrich data sources beyond the scope of interoperability, but also for data quality at rest for analytics, etc.? Can you speak to use cases in research, big data, etc.?

Charlie Harp:

While semantic normalization is necessary for interoperability, it is likewise necessary for enterprise scale analytics. In any use case where you want to combine data from multiple sources and do something meaningful, everything you are dealing with must be normalized to a common set of terminologies and a common frame of reference. If you fail to do this, you will be doing analytics on apples and oranges and the results will be problematic.

Harmonization of understanding is the goal of semantic normalization.

Another common use case is semantic summarization, a close cousin of semantic normalization. Instead of mapping data to its equal, you map it to the parent or grandparent of its equal, which allows you to intelligently roll things up into broader concepts. That can also facilitate a level of understanding of things that share the same characteristics.
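The roll-up idea can be sketched with a small parent lookup. The hierarchy below is a hand-written, hypothetical example using plain labels rather than codes from any real terminology, and `roll_up` is an illustrative helper, not a Symedical function.

```python
# Hypothetical child -> parent hierarchy, for illustration only.
PARENTS = {
    "metformin 500 mg tablet": "metformin",
    "metformin": "biguanides",
    "biguanides": "antidiabetic agents",
}

def roll_up(concept, parents, levels):
    """Walk up `levels` steps from `concept` (0 = the concept itself,
    1 = its parent, 2 = its grandparent). Stops early at the top of
    the hierarchy instead of failing."""
    current = concept
    for _ in range(levels):
        parent = parents.get(current)
        if parent is None:
            break
        current = parent
    return current

print(roll_up("metformin 500 mg tablet", PARENTS, 1))
print(roll_up("metformin 500 mg tablet", PARENTS, 2))
```

Rolling several specific concepts up to the same ancestor is what allows them to be counted or analyzed together as one broader concept.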

The bottom line is that semantic understanding is a large part of how we organize, share and understand data in healthcare. The ability to understand the semantic meaning of data can be significantly destabilized by quality failures. The hope is that by identifying quality issues and classifying them by nature and likely cause, we can either quarantine bad data, remediate it in flight or report back to the source so that they can incrementally improve.


Medication adherence may be documented in the notes, but not translated into a code…SNOMED or ICD. Similar to the diabetes/metformin example provided by Mr. Harp.

Charlie Harp:

That’s absolutely true. However, the context in which I was talking about medication adherence was relative to the requirement in USCDI which required a code and required that code to be in SNOMED.

So in order to meet the standard of USCDI, the sender either has to supply a code in SNOMED to document the level of adherence or someone would have to use a technology like natural language processing to find the unstructured data and turn that into a code so that it would adhere to the USCDI requirement of having a SNOMED code.


Are there any standards that can be used to assess the accuracy of these NLP tools? Extracting data from text would seem to have a significant margin of error. Also, the data used to train the AI/ML model needs to be high quality in the first place.

Charlie Harp:

I am not aware of a set of standards to specifically validate NLP output in the healthcare setting. The primary way I have seen organizations do this is with human subject matter experts reviewing the results or a sampling of the results.
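A sampling-based review of this kind might be sketched as follows. This is only one possible approach under stated assumptions: the helper names, the sample size and the idea of summarizing reviewer verdicts as a simple agreement rate are illustrative choices, not an established validation standard.

```python
import random

def draw_review_sample(outputs, k, seed=0):
    """Pick up to k NLP outputs at random for expert review.
    A fixed seed keeps the sample reproducible for auditing."""
    outputs = list(outputs)
    rng = random.Random(seed)
    return rng.sample(outputs, min(k, len(outputs)))

def agreement_rate(verdicts):
    """Fraction of reviewed outputs the expert judged correct.
    `verdicts` is a list of booleans, one per reviewed output."""
    if not verdicts:
        raise ValueError("no reviewed outputs")
    return sum(verdicts) / len(verdicts)

# Hypothetical usage: 100 extracted codes, 4 sent to a reviewer.
nlp_outputs = [f"extracted_code_{i}" for i in range(100)]
sample = draw_review_sample(nlp_outputs, k=4)
verdicts = [True, True, False, True]  # reviewer judgments (made up)
print(f"Reviewer agreed with {agreement_rate(verdicts):.0%} of the sample")
```

A small sample only gives a rough estimate, so the sample size would need to grow with how much confidence the organization requires.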

With regard to the quality of the data being provided for input, that’s absolutely true. As I said in my presentation, if the data is bad, the artificial intelligence will be bad.

For other questions, please contact us or leave a comment below.
