What is Data Normalization?

Recently a friend of mine asked me a question.  “What is normalization?” 

One formal definition is “Normalization is the process of reducing data to its canonical (normal) form.  Doing so removes duplicated, invalid, and potentially pre-coordinated data (depending on your definition of the canonical form).” 

While this definition might be technically correct, I think that when we think of normalization in the trenches of healthcare we are actually talking about something slightly different.

In healthcare we deal with data.  This data generally falls into three categories: Data intended for humans (free text information, images, audio, video), data intended for algorithms (data tables, indexes and graphs) and data intended for both (terminology).  The last category, terminology, binds language (words) to codes and allows us to bridge the human world and the algorithm world.  This works fairly well, as long as it stays confined within the walls of my information ecosystem.  When you try to share codes across systems you find out that, more often than not, different systems understand different codes and even though the information is coded, it is not using the codes that the receiving systems understand.

So, we find ourselves in a situation where we need to translate the meaning of a term from one terminology into another. This task has many names: “Mapping”, “Mediation”, “Coordination”, “Transcoding”, “Interoperability”, and “Normalization”.  In every case you are taking a term from the source terminology and trying to find the most appropriate match in the destination terminology.  The “rules” that you use to match terms across the semantic rift are not universal.  They will change based on the goals and objectives of the exchange.  This is called “purpose-driven mapping”, and my colleague, Shaun, is working on an excellent article on that topic, so I will leave the full explanation of that to him, stop stalling and talk about normalization.

I typically use “Normalization” when referring to a situation where there are many sources of terminology feeding into a single (normal) target environment.  This is typically the pattern you see when you are dealing with some type of clinical aggregation environment where you are collecting patient information from many sources so that you can combine and reason over the data in a central location using shared logic and “normal” clinical knowledge.  So the first rule of normalization is “Many sources going to a single (normal) terminology, for a given domain.”
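The many-to-one pattern can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the source names (hospital_a, clinic_b, lab_c), their local codes, and the three-term allergy severity terminology are all invented here, not taken from any real system or standard.

```python
# All source names and codes below are invented for illustration.
# One domain (allergy severity), one "normal" target terminology.
NORMAL_SEVERITY = {"MILD", "MODERATE", "SEVERE"}

# Each source gets its own map into the same normal terminology.
SOURCE_MAPS = {
    "hospital_a": {"1": "MILD", "2": "MODERATE", "3": "SEVERE"},
    "clinic_b": {"LOW": "MILD", "MED": "MODERATE", "HIGH": "SEVERE"},
    "lab_c": {"S1": "MILD", "S2": "MODERATE", "S3": "SEVERE"},
}


def normalize(source, code):
    """Translate a source code into the normal terminology, or None if unmapped."""
    return SOURCE_MAPS.get(source, {}).get(code)


# Three differently coded inbound values land on one shared term,
# so central logic only has to reason over NORMAL_SEVERITY.
inbound = [("hospital_a", "3"), ("clinic_b", "HIGH"), ("lab_c", "S3")]
aggregated = [normalize(source, code) for source, code in inbound]
```

Returning None for an unmapped code is a design choice of this sketch; a real aggregation environment would need its own policy for terms that have no map yet.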

“Normalization” also implies that you started with something not normal and made it normal.  If this is true, then it makes no sense to reverse the process, as that would result in “deviation”.  So Normalization is typically considered a one-way trip.  You always want to preserve the original terminology from the source, but the goal is to use the normalized data, not to provide a pivot between the deviant and normal worlds.  So the second rule of normalization is “Normalization flows in one direction”.
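One way to picture the one-way trip is a record that carries the original source code untouched next to its normalized value, with no function for going back. The sketch below assumes a hypothetical NormalizedCode record and an invented source map; the names are mine, for illustration only.

```python
from dataclasses import dataclass
from typing import Dict, Optional


@dataclass(frozen=True)
class NormalizedCode:
    """Hypothetical record: the original is preserved, the normal code is used."""
    source_system: str
    source_code: str             # kept exactly as received from the source
    normal_code: Optional[str]   # None when the source code has no map


def normalize_record(source_system: str, source_code: str,
                     source_map: Dict[str, str]) -> NormalizedCode:
    """Map source -> normal. There is deliberately no normal -> source function."""
    return NormalizedCode(source_system, source_code, source_map.get(source_code))


# Invented local codes for an invented source system.
clinic_map = {"LOW": "MILD", "HIGH": "SEVERE"}
record = normalize_record("clinic_b", "HIGH", clinic_map)
```

Downstream logic would read record.normal_code, while record.source_code stays available for audit or for re-mapping if the map is later revised.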

Now, much of mapping is about conceptual equivalence, but not all of it.  You should always make sure that you have considered the objective of a map before you start mapping, because the only thing worse than mapping is mapping again.  When you are normalizing terminology, the rules of mapping are dictated by the nature and specificity of the “Normal” terminology. So the third rule is “There are no fixed rules when it comes to normalization mapping rules; you have to determine the most appropriate way to normalize each inbound terminology”.

Lastly, many people confuse “Normalization” with “Standardization”, but the two are not interchangeable in both directions.  They are the same if you are normalizing to a given standard, but you can normalize to any terminology that you deem normal for your purposes.  For example, suppose I have inbound medications and I want to determine if they are solids or liquids and that is all I care about. I could create a target terminology with three terms, 1=Solid, 2=Liquid and 3=Other.  I would then build a map that normalizes all of my inbound terms to one of those values.  This would be a proper, albeit limited, normalization pattern, using a terminology that I built to suit my purposes.  So the fourth rule is “Standardization is always Normalization but Normalization is not always Standardization”.
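The solid/liquid example is small enough to write out. In this sketch the three target terms come from the example above, while the inbound dose-form strings and the choice to send anything unmapped to 3=Other are my assumptions for illustration.

```python
# Target terminology from the example: 1=Solid, 2=Liquid, 3=Other.
TARGET_TERMS = {1: "Solid", 2: "Liquid", 3: "Other"}

# Hypothetical inbound dose-form terms mapped to the target codes.
FORM_MAP = {
    "tablet": 1,
    "capsule": 1,
    "syrup": 2,
    "oral solution": 2,
    "inhaler": 3,
}


def normalize_form(inbound_term):
    """Return the target code for an inbound term.

    Assumption of this sketch: unmapped terms fall back to 3 (Other).
    """
    return FORM_MAP.get(inbound_term.strip().lower(), 3)


codes = [normalize_form(t) for t in ["Tablet", "Syrup", "Transdermal patch"]]
labels = [TARGET_TERMS[c] for c in codes]
```

Nothing here refers to an external standard, which is the point: the target is normal only because it was declared normal for this purpose.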

To summarize, a pragmatic definition of clinical terminology “Normalization” is:

  1. Many sources going to a single ‘normal’ terminology, for a given domain.
  2. Normalization flows in one direction.
  3. There are no fixed rules when it comes to normalization mapping rules; you have to determine the most appropriate way to normalize each inbound terminology, depending on purpose.
  4. Standardization is always Normalization but Normalization is not always Standardization.
