Turns Out Both Bad Data and a Teaspoon of Dirt May Be Good For You

I think there used to be a saying that ingesting a teaspoon of dirt would actually keep the immune system strong. (Before my time in case you are wondering.)

Whether or not this is true, I have come to conclude that, with respect to context engines, poor quality data (or "dirt") can in fact be quite helpful. Just to be clear, I am not talking about a date of birth value incorrectly placed in a middle name field, or a phone number field containing a non-phone value like the phrase "who put the ear muffs on the cookie?"

When incorrect data actually expresses "natural variability", this kind of data error can be helpful to context assembling systems. What do I mean by "natural variability", you might ask? Well, I am referring to plausible variations. For example, the month and day in a date of birth may be transposed. Or an address will sometimes include the word "Drive", while other times the same address is recorded without it. If someone’s first name is "Marek" (a fairly uncommon name here in the United States), it may periodically be recorded as "Mark" by a confused data entry operator. The list goes on.
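To make "plausible variation" a little more concrete, here is a minimal Python sketch of two of the checks described above: a transposed month and day, and an address recorded with or without its street-type word. The function names and the short suffix list are my own assumptions for illustration, not part of any particular context engine.

```python
from datetime import date

# Assumed, deliberately short list of street-type words that are often dropped.
STREET_SUFFIXES = {"drive", "dr", "street", "st", "avenue", "ave", "road", "rd"}


def is_day_month_transposition(a: date, b: date) -> bool:
    """True when b is a with month and day swapped (same year, different date)."""
    return a != b and a.year == b.year and a.month == b.day and a.day == b.month


def strip_street_suffix(address: str) -> str:
    """Lowercase the address and drop a trailing street-type word, if present."""
    words = address.lower().replace(".", "").split()
    if words and words[-1] in STREET_SUFFIXES:
        words = words[:-1]
    return " ".join(words)


def same_address_ignoring_suffix(a: str, b: str) -> bool:
    """True when two addresses differ only by a trailing street-type word."""
    return strip_street_suffix(a) == strip_street_suffix(b)


# Illustrative values only.
print(is_day_month_transposition(date(1970, 3, 4), date(1970, 4, 3)))
print(same_address_ignoring_suffix("123 Maple Drive", "123 Maple"))
```

A check like this only says that two values are plausibly the same thing recorded differently. Whether two records really describe the same person would still depend on the rest of the evidence.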

When context accumulating systems keep this natural variability, accuracy goes up when they later try to recognize like objects, because the system has been able to learn from the natural variability of the past. For example, recognizing that Marek is sometimes also recorded as Mark is in fact helpful.
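Here is a rough sketch of what keeping the variability might look like, assuming a hypothetical `Entity` class that remembers every value it has ever been seen with instead of overwriting the old value with a "corrected" one. The class, its methods, the simple field-count score, and the sample records are all invented for illustration.

```python
from collections import defaultdict


class Entity:
    """Hypothetical entity that retains every observed value per field."""

    def __init__(self):
        # field name -> set of normalized values seen so far
        self.observed = defaultdict(set)

    def absorb(self, record: dict) -> None:
        """Add a record already judged to belong to this entity, keeping all variants."""
        for field, value in record.items():
            self.observed[field].add(value.strip().lower())

    def score(self, record: dict) -> int:
        """Count the fields whose value matches any variant seen so far."""
        return sum(
            1
            for field, value in record.items()
            if value.strip().lower() in self.observed.get(field, ())
        )


# Made-up records for the example.
person = Entity()
person.absorb({"first_name": "Marek", "last_name": "Novak", "dob": "1970-03-04"})

incoming = {"first_name": "Mark", "last_name": "Novak", "dob": "1970-03-04"}
before = person.score(incoming)  # 2: "mark" has not been seen yet

# A record carrying the "Mark" variant is resolved to this entity and kept.
person.absorb({"first_name": "Mark", "last_name": "Novak", "dob": "1970-03-04"})
after = person.score(incoming)   # 3: the retained variant now matches

print(before, after)
```

Had the system "cleaned" Mark back to Marek before storing it, the later record would keep scoring lower, which is the point: the dirt is what lets the system recognize the next occurrence.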

The other funny thing about this is: how would one know if someone named Marek has now decided to go by the nickname Mark? Well, in most cases you will never know this, other than by observing over time that he used to go by one name and in recent years seems to be going by another.

This is yet another reason why, with respect to context engines, there is no such thing as a single version of truth.

Jeff Jonas is the chief scientist of IBM Software Group’s Threat and Fraud Intelligence unit and works on technologies designed to maximize enterprise awareness. Jeff also spends a large chunk of his time working on privacy and civil liberty protections. He will be writing a series of guest posts for Netcrime Blog.
