
The Informonster Podcast

Episode 26: Discussing Data Quality with Charlie Harp

April 27, 2023

On this episode of the Informonster Podcast, Charlie discusses the importance of data quality and how it impacts the results of any analysis or decision-making process. He also shares insights from our inaugural survey which indicates that the industry recognizes the importance of data quality but acknowledges that it is a challenging effort.

 Download the Data Quality Survey Report here.





I’m Charlie Harp and this is The Informonster Podcast. Today, on The Informonster Podcast, I’m going to talk to you about data quality and healthcare. So to start, let me take you back, way back, to the mid-19th century, let’s say 1860-ish, and Charles Babbage. Now Charles Babbage was a polymath, which meant that he was a physicist, a mathematician, and an inventor. Now, he invented the difference engine, which was the precursor to the modern computers we use every day today. Now, when he was presenting the difference engine to members of the then English Parliament, more than one time he was asked the following question: “Mr. Babbage, if you put into the machine wrong figures, will the right answers come out?” When recounting that later, he said, “I am not able rightly to apprehend the kind of confusion of ideas that could provoke such a question,” which is a very polite, mid-19th century way of saying, how could you ask that question?

Even back when the very first precursor of the computer was created, the inventor knew that if you put wrong data in, you get bad data out. So let’s jump ahead. Let’s jump ahead a hundred years to about 1958 and a gentleman whose name was George Fuechsel, and I might be pronouncing that wrong. He was an IBM trainer who trained people on programming the IBM 305 RAMAC. And one of the things that was part of his teaching mantra was this concept that if we put bad information into our computer models, we will get bad information out of them, which once again is not news. This is something that Babbage knew a hundred years before. So why do people remember George Fuechsel? Because he coined a phrase. Now, this is also debated, but I’m going to go with this. He coined the phrase, “Garbage In, Garbage Out.”

So that got me thinking. I decided to take the phrase “Garbage In, Garbage Out” and look it up on PubMed and see how many articles have the phrase Garbage In, Garbage Out, or even just the word garbage, in them. And I found, starting in 1972, 472 references in PubMed, just PubMed. And what’s interesting is when you look at those references and the articles that came through, the articles happened like onesie-twosie, almost. Starting in 1972, it was one article every two to three years until 2001. Around the early 2000s, suddenly the numbers started to increase on an annual basis, and they increased over and over and over again until, when you look at 2022, the number is 26. I want to say I don’t have the number in front of me, but it’s like 26. So I started thinking, “Well, why all of a sudden the increase starting in 2000?”

And then I went in and overlaid the regulatory initiatives, starting with the establishment of NCQA back in 1990. But in the early 2000s, quality measures were introduced. MIPPA, the value-based care penalties, MIPS, MACRA; Cures in 2022 was rolled out. So what’s interesting is that the more we started to have to use the data that we produce in healthcare for some meaningful purpose, the more we started to understand this concept of Garbage In, Garbage Out. The reason why I’m talking about Garbage In, Garbage Out is because when we think about data quality and healthcare, that’s really what we’re talking about. We want to make meaningful decisions, and we want to do things, but if the data is not good, we’re not going to get a good result. So let’s talk a little bit more about data quality and healthcare. And so what I’m going to do is I’m going to break it down. If you look at the terms data and quality, data in healthcare is primarily patient data.

We have master data that articulates what are our clinics and our providers and our specialties and what kind of equipment we have, what’s our chargemaster. And we have reference data that we get from other places, like standards, RxNorm, SNOMED and things of that nature. But the vast majority of what we have to reckon with in healthcare is patient data, whether it’s their clinical data, demographic data, genomic data, vital signs, lab results, or insurance information. The vast tsunami of data that we have to reckon with in healthcare is patient data. And that’s fair because that’s what healthcare is about, it’s about caring for patients. So as a byproduct of doing that, it’s natural that that would be the biggest contributor to the data that we have to wrangle and make use of in healthcare. When it comes to the word quality, I did one of those things where you go to the dictionary and you look words up, and I don’t normally do that, but in this case it was interesting because quality, which is kind of a fuzzy word, has two main definitions.

One is as a noun. So quality as a noun, I summarize it like this. It’s the totality of features and characteristics of something that affect its ability to satisfy given needs. So quality as a noun is like a gauge that says is it good or not good. And really it’s saying is it fit for purpose or not fit for purpose? So for example, a Phillips screwdriver might be an excellent screwdriver, but if the purpose you’re trying to satisfy is hammering a nail into a piece of wood, it’s a really poor-quality hammer. So when I think of quality in healthcare, it resonates with this noun definition of quality. It’s a measure like height or volume. When it comes to quality, there’s another definition. And that definition is as an adjective, and as an adjective, it tends to be used to imply something is of good quality, like “that’s a quality used car,” or when your significant other says they want to spend quality time and you end up in an artsy theater watching a Finnish adaptation of Mamma Mia.

So to me, quality as an adjective is something somebody says when they’re trying to sell you something and it’s much more arbitrary than when you’re measuring the quality of something which is the noun version. So for me, I’m not a fan of using quality as an adjective in healthcare. I think that in healthcare quality should be a noun, it should be something that we measure, something we improve, not just something that we label as quality. So we’ve got data which we’ve talked about. We’ve got quality.

So the next question is can we just slap a gauge on the side of our servers and have it tell us the quality level of that data? And somebody asked me this question in an interview and I said, yes, but really the answer is no. I was being cheeky at the time. And the reason that the answer is no is that when you look at things like data and you look at healthcare writ large, you can’t really apply a measurement like quality at that level. If a majority of the data we have to deal with is patient data, the unit of quality when it comes to healthcare, on the largest set of data that we have to work with, is at the patient level.

So let’s for example, take a patient, I’m going to call her Mabel. Mabel is an elderly woman. She has some health issues and she’s seeing Dr. Jones to make sure she’s being well taken care of. Dr. Jones knows all about Mabel. She knows where her grandkids live, she knows about her diabetes, and she knows about her osteoporosis. She’s managing all that stuff. She knows her surgical history, she knows what she’s allergic to, and that is the provider-patient relationship that goes back a long, long way. But now today what we do is we take what we know about Mabel and we put it into a computer.

So we try to establish this digital twin of Mabel so that software can use that data to do stuff, whether it’s to generate a quality report or do clinical decision support or remind Mabel to get her foot exam. The idea is that the software operating against Mabel’s digital twin is going to help Dr. Jones make sure that everything’s accounted for and nothing slips through the cracks, because that’s what computers are good at. They’re good at looking at a lot of data much faster than a human brain can.

The problem we have is the way we put data into the computer in healthcare is often pretty flawed. We put in unstructured data which the computer can’t organically process. We put in data that’s broader because we don’t have a way to articulate the full notion of what’s happening with Mabel. And so the issue we have at a Mabel level when it comes to data quality is a different flavor of quality, which is fidelity: the idea that Mabel’s digital twin, which the computer is using to provide feedback to Dr. Jones and to the health system that’s utilizing that data, is an accurate representation of Mabel. Fidelity basically means the degree to which the detail and the quality of the copy match the original.

And once again, it’s got the word quality in there and that goes back to being fit for purpose. So really when we’re looking at healthcare data, the ultimate thing you measure in terms of quality is the fidelity of this information. And is it a solid, accurate representation of what’s actually happening with Mabel? Is it current, is it complete and is it actually Mabel’s data? So when you start to think about managing data quality in healthcare, you have to take Mabel and all the other Mabels and combine them together because each one of them could have variability in their quality. They could have mixed-quality data, and they could be a terrible picture of that particular Mabel, but the bottom line is you collect all that data and that becomes your measure of quality. That kind of aggregate summary of the quality of all the Mabels is what you’re looking at.
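To make the idea of rolling every Mabel up into one measure concrete, here is a minimal sketch in Python. It is only an illustration under stated assumptions: the `TwinCheck` record, its four yes/no checks (accurate, current, complete, actually the patient’s own data), and the equal weighting are all hypothetical and are not described in the episode or taken from any real product.

```python
from dataclasses import dataclass
from statistics import mean


@dataclass
class TwinCheck:
    """Hypothetical fidelity checks for one patient's digital twin."""
    patient_id: str
    is_accurate: bool   # record matches what is actually happening with the patient
    is_current: bool    # record reflects the patient's present state
    is_complete: bool   # nothing relevant is missing
    is_own_data: bool   # the data really belongs to this patient


def fidelity(check: TwinCheck) -> float:
    """Fraction of the four checks that pass (equal weights are an assumption)."""
    flags = [check.is_accurate, check.is_current, check.is_complete, check.is_own_data]
    return sum(flags) / len(flags)


def aggregate_fidelity(checks: list[TwinCheck]) -> float:
    """Average the per-patient scores into one enterprise-level number."""
    return mean(fidelity(c) for c in checks)


patients = [
    TwinCheck("mabel-1", True, True, True, True),    # high-fidelity twin
    TwinCheck("mabel-2", True, False, False, True),  # stale and incomplete twin
]
print(aggregate_fidelity(patients))  # 0.75
```

The aggregate hides the fact that one of the two records is a poor picture of its patient, which is why any fix still has to happen record by record.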

And that’s why data quality in healthcare is hard, because it’s not managing the quality of one thing, it’s managing the quality of millions of things and making sure that those millions of things are an accurate representation of the patient that you’re trying to care for. So we’ve got data and we’ve got quality and we’ve got this idea, and I always tell my team that there’s no such thing as people, it’s individuals in a group, but you still have to wrangle and manage the data at an individual level. I can’t fix the quality of data writ large, I can only fix data at an individual level. So you have that concept. The next thing that I’m going to do is I’m going to take you even further back than Charles Babbage. I’m going to take you back to 338 BC-ish, to Aristotle. Now, Aristotle is often quoted as saying the following, “We are what we repeatedly do. Excellence then is not an act but a habit.”

Now the truth is Aristotle didn’t say that. There was a philosopher who said that, who kind of philosopher-explained it, but what Aristotle actually said is this: “It is not one swallow or a fine day that makes a spring. So it is not one day or a short time that makes a man blessed and happy.” So the concept is there, but that’s not what he said. I’m going to Charlie-explain it and say this, and this is an original quote, so you can quote me on this if you want: “Ensuring that your healthcare data is high quality is a journey and a cultural commitment. It’s not a project or a product.” Nobody can come in and fix the data quality of an organization. They can help, they can guide, but the data quality of that organization is up to that organization to wrangle.

So on the topic of quality, Clinical Architecture at the beginning of this year did what we’re calling our annual healthcare data quality survey. The intent of the survey was really to answer four questions this year. And the first question is, do we as an industry realize that we have a data quality problem? The next question is, do we understand the impact of patient data on our individual enterprise and industry objectives? The third is, do we know what factors are contributing to the degradation of our patient data so we can address them? And the fourth is, with the launch of TEFCA and the QHINs, how do we feel about the quality of data from other organizations and are we willing to integrate that data into the work that we’re doing in our organizations? So we put the question out there, it was open to anyone.

We pushed it out through AMIA, we pushed it out obviously to our customers. We promoted it on LinkedIn with the idea of opening it up to anybody that wants to respond. We had 83 respondents and the largest group of them, about 39%, were from care provider organizations, 13% identified as academics, 12% from the vendor community, 12% from other, 8% from public health, 7% from value-based care, 5% from payer, and 4% from life sciences. Now, obviously, it’s not a super high-resolution accurate picture of the population of healthcare stakeholders, so we probably can consider the results somewhat anecdotal, but at the same time it does resonate with what we’ve seen working with our clients and what we’ve seen happening across the industry. So let’s talk through what the results were at a high level, and if you go to our website, you can download the full quality report with some analysis.

So feel free to click the link for the data quality survey and pull down a copy for yourself. So the first question we asked is, what is the impact of poor quality data on enterprise objectives? And the response we got was interesting. Very few, almost no one, said that there was no impact. About 28% said the impact of poor quality data was of moderate impact on their objectives, and 71% said that the impact on enterprise objectives was high. So what that tells me is that the industry knows that poor data quality is a problem, which is basically just saying Garbage In, Garbage Out, right? When you look at the next question, we drilled into the detailed objectives around healthcare. And so we asked about care quality, patient satisfaction, performance measures, workforce efficiency, effective use of technology, public health reporting, and financial impact.

And when you look at those detailed areas, and once again download the report and you can check out the details, the number one area of impact was performance measures. The number two area of impact was public health reporting. Number three was the effective use of technology. Four was workforce efficiency, five was financial impact, six was care quality and last was patient satisfaction. But it’s notable to say that on a weighted average score, none of those areas were below moderate impact. They were all between moderate impact and high impact, which once again tells me that the people that participated in the survey, a majority of them felt like data quality had an impact on these things that we think are important in healthcare. The next question we asked was, what is the overall quality of the patient data in your enterprise? And we categorized it in the following way: poor quality, mixed quality, high quality, and I don’t know.

And once again, this is an aggregate question, what is the quality of the patient data period in your enterprise? And 6% of the people said poor quality, 63% said mixed quality. 23% said high quality and 8% said, I don’t know. So what we’re really saying there is that 69% of people surveyed said the data in their enterprise was mixed or poor. And what that tells us is the people that responded to the survey were being honest that they didn’t really have a good feel or they didn’t feel good about the data in their enterprise. Then we dug in a little bit deeper and we started asking about specific domains of patient data. So we asked, “Well, how do you feel about patient medication data, patient lab results, patient problem and diagnoses, patient procedures, patient demographics, allergies and social determinants of health?”

The result we got back was, of those domains, the one they felt the best about in terms of poor quality, mixed quality and high quality was demographics, which scored between mixed and high. None of these on a weighted average got a score of high quality. Most of them ranged between mixed and high quality. So number one was demographics; labs and meds were tied for second place. Then problems and diagnoses just above mixed quality, procedures right at mixed quality, and allergies between poor and mixed quality.

And the last one was social determinants of health, which surprises no one. We all know we’re struggling with SDOH information in our industry. I was a little surprised by allergies being low, but then I remembered that we still collect a lot of allergy data and intolerance data as free text or discrete free-text values. And one more thing, it was interesting that when they evaluated the individual domains of data, if you totaled those up and you compared them to what they said about data overall, the summary of the detailed domains was slightly better than how they originally rated themselves on an overall basis.
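For anyone who wants to see how a weighted average score like the ones mentioned above can be derived from the response percentages, here is a minimal sketch in Python using the overall-quality figures quoted in this episode. The 1 to 3 scale and the choice to leave out the “I don’t know” answers are assumptions made for illustration; the episode does not say how the survey report actually scores responses.

```python
# Response shares for overall patient data quality, as quoted in the episode (percent).
overall = {"poor": 6, "mixed": 63, "high": 23, "dont_know": 8}

# Assumed scoring scale; the real survey's weighting is not stated in the episode.
SCALE = {"poor": 1, "mixed": 2, "high": 3}


def weighted_average(shares: dict[str, float]) -> float:
    """Average score over respondents who gave a rating ("dont_know" is excluded)."""
    rated = {answer: share for answer, share in shares.items() if answer in SCALE}
    total = sum(rated.values())
    return sum(SCALE[answer] * share for answer, share in rated.items()) / total


def mixed_or_poor(shares: dict[str, float]) -> float:
    """Combined share of respondents who answered poor or mixed."""
    return shares["poor"] + shares["mixed"]


print(mixed_or_poor(overall))               # 69
print(round(weighted_average(overall), 2))  # 2.18
```

Under these assumptions the overall score lands a little above “mixed” (2) and well short of “high” (3), which matches the way the results are described here.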

So the next question was, what is the perceived quality of patient data from external sources? So we asked them about their data. Now we asked them about the other data that comes from outside their organization, and it won’t surprise you to know that 17% said the data coming from elsewhere was poor quality, 63% said it was mixed quality, 6% rated it high quality and 14% said, “I don’t know.” So the only thing that rated worse than the data quality in my enterprise is the data quality in their enterprise, with essentially 80% saying the data coming from outside was either mixed or poor quality and 14% saying, “I don’t know if it’s good or not,” which doesn’t surprise me. When you’re getting data from somewhere else, you’re always wondering about the quality of that data because you don’t control it. It didn’t come from you.

That didn’t actually surprise me. But the thing that’s interesting is the next question, and the next question we asked was, what is the likelihood you’ll integrate external patient data into your enterprise? So the first question is, what do you think of external patient data? And people said, “Eh, not so good.” The second question is, would you integrate it into your data? And in this category, 30% of the participants said, “We’re already doing it.” 29% said, “We’re very likely to do it.” 22% said, “We’re somewhat likely to do it.” And then you flip over to 10% saying, “I’m somewhat unlikely to do it,” 9% saying, “I’m very unlikely to do it,” and 0% saying, “I’ll never do it.” So what they’re really saying is, I don’t know that I trust the data that comes from outside my organization, but I’m going to take advantage of it. And that makes sense because there’s a lot of momentum around TEFCA and the QHINs and there are a lot of incentives driving people to get data from elsewhere.

But if you don’t trust the data but you feel compelled to take the data, the real question is what are you going to do with that data? Are you going to integrate it into your data to try to improve the fidelity of Mabel’s information or are you going to allow people to view it? So you have your data and if you want to see the data from the thousands of other sources that have touched Mabel in her lifetime, you can see that as a PDF in a viewer, but you can’t actually see it combined with the data that we’ve built because the truth is that kind of external data, unless somebody’s really looking for something, is not typically something that really gets taken advantage of because it takes a lot of time to read, manually read through a PDF of information that came from somewhere else.

And so it has, let’s say, questionable value compared to integrating the data and using it in a leverageable, meaningful way within your enterprise. So the last question we asked was, what contributes to poor quality patient data? And here are the categories we provided: money and human resources or effort; the standards, the standards cause problems; the software design and system design; interoperability; too much information captured as free text; and human error on data input. So those are the categories we provided, and we basically said there’s no contribution, moderate contribution and high contribution. The number one contributor according to the participants of the survey was the amount of effort it takes to manage the quality of the data, which resonates. Go back to all the Mabels: to wrangle millions of patients’ worth of data and improve the quality is a monumental effort, especially if you’re leveraging human beings to do the work.

So that didn’t surprise me at all. The number two was interoperability, and I would argue that we’re really talking about semantic interoperability because that’s what people struggle with the most. So the fact is that when I get data from somewhere else, it’s in terminologies I don’t understand, and I’m getting so much data from so many places that doing the semantic normalization, the code mapping from their codes to my codes, is a challenge. Number three was software design. The design of the software makes managing the quality of the data a challenge. Number four was too much information in free text. Number five was data error and people putting in the data incorrectly. And the last one was standards. Now, even though they were rated in this way, it’s still notable that every single one of these, when you look at the weighted average score, every one of them was higher than moderate contribution, which tells you that nobody got off scot-free.

But those are the rankings: effort number one, interoperability number two, and the rest kind of fell below those. So let me wrap things up and let’s do a summary of what we’ve learned in this episode of The Informonster Podcast. And I’ll start with my eminence-based insights, my experience in working with folks and what I’ve seen. So you can take it for what it’s worth. If the quality of the data is poor, your results will also be poor. That’s Garbage In, Garbage Out. The next thing is, for healthcare, data quality is a noun, not an adjective. If somebody uses quality as an adjective in healthcare, they’re trying to sell you something. The next thing is patient data is the largest contributor, so the biggest challenge in managing the quality of the data is patient data. The next thing is improving the quality of that healthcare patient data happens at the patient level.

You can’t improve the quality of data writ large in healthcare. You have to do it at a molecular level, at the patient level. The yardstick of quality at the patient level is fidelity. How complete and accurate is the representation, the digital twin of the patient, so that when the software comes to a conclusion or suggests an action, the human provider doesn’t say, “What the heck are you doing?” Because you lose credibility and you may not get a second chance to be helpful.

The last thing for me is improving data quality is a journey and a cultural commitment. And as somebody who sells products in this space, I have to tell you, improving your data quality is not a project. There is no product that magically delivers it. You as an organization have to be committed to not just creating data as a byproduct, but creating data and thoughtfully curating that data to provide valuable insights that can make your job easier and the care you provide to patients better.

Going to the survey-based insights, I think when we ask those questions, the first question is, does the industry understand that poor quality impacts the results, Garbage In, Garbage Out? From the participants in the survey, I think the answer is yes, they get it. They know that data quality’s important. The next question was, do you understand that the quality of your data isn’t great? And once again, the participants of the survey, they get it. They know that the data quality is not where it needs to be. The next one was, where do data quality problems come from? And as I said, even though there was a mix of responses, and there might have been answers that we didn’t even ask in the survey, the bottom line is people say data quality is a hard effort. It takes a lot of effort. And if you flip the question of what contributes to it, you can also turn that question into what’s stopping you from improving the quality of data.

And what they’re saying is, it’s hard work, which is why at Clinical Architecture, we put a lot of our attention into how we reduce the amount of human effort in doing the things that improve the quality of the data. And then, of course, interoperability, which doesn’t surprise anybody. I don’t think I saw a single presentation at HIMSS that didn’t have the word interoperability in it. The next question is, how do we feel about data that comes from other places? We don’t trust it, we don’t. We’re not sure that it’s that great, but as the next question implies, we understand that we’re going to have to accept it and we’re going to have to work with it. So, ladies and gentlemen, that is the end of today’s Informonster Podcast. Thank you so much for tuning in. If you have any comments or questions or feedback, please don’t be shy, and I look forward to talking to you again on the next episode of The Informonster Podcast. I am Charlie Harp and this has been The Informonster Podcast. Thank you.