To Anonymize or Not to Anonymize, That is the Question

I see a future in which organizations planning to transfer sensitive information from one system of record to some other destination will first ask themselves the question: "Can our data be shared in an anonymized form while achieving results materially similar to those we would get had the data been transferred in clear text?" And if the answer to this question is "yes," I would then argue, "Why would that organization ever share that sensitive information any other way?"

A new class of technology, "Analytics in the Anonymized Data Space", is making this possible. With this type of technology, information can be anonymized before being transferred between parties, while still permitting sophisticated analysis to be performed on the data even though the data is in a non-human-readable and irreversible (i.e., anonymized) form.

I think this will become a best practice. When? I don’t know, maybe two years, five years or maybe even twenty years, but someday for sure. It will start with early adopters (already beginning to happen), its use will grow, and finally at some point in time anonymization-based analytics will achieve a critical mass. Thereafter, anonymization will likely be viewed as a best practice. From that moment on, if an organization is not handling its data in such a manner, I would submit they could be considered negligent.

Here is an anonymization scenario:

To stay competitive, banks must understand their customers at least as well as their competition. So, banks send their customer information to data aggregators. The data aggregators then match the bank’s customer data with their private collection of demographics (e.g., marital status) and lifestyle data (e.g., magazine subscriptions). This information is appended to the original file and returned to the bank (thus this practice is often called "database marketing appends"). The bank then uses this new information to profile its customers – using this newly found knowledge to improve its customer acquisition and retention programs.

But transferring all customer data to a secondary party causes organizational heartburn. In the example above, the bank’s management recognizes that sending its customer data to another party comes with some risk: What if an employee at the data aggregator makes an illegal copy of the customer file and secretly sells it? What if a hacker breaks into the data aggregator’s systems and extracts all or portions of the bank’s customer file? What if an employee at the aggregator uses the bank’s customer file to answer very specific questions posed by "outsiders" about specific people? What if the aggregator quietly retains portions of the bank’s customer file for use later in unanticipated ways?

As gut-wrenching as these risks are, most banks find themselves doing this anyway in an effort to remain competitive.

Emerging innovations that enable advanced analytics to be performed on encrypted or anonymized data will enable the bank to pass non-human-readable customer data to the data aggregator. And the data aggregator will then match the bank’s anonymized customer data with their own records – while the bank’s customer records remain anonymized! The demographic and lifestyle data would then be passed back to the bank along with a non-personally identifying value (e.g., a customer number).
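To make the matching step concrete, here is a minimal sketch in Python. It is only one simple way such a scheme could work, not a description of any particular product: it assumes both parties normalize the identifying fields the same way and apply a keyed one-way hash (HMAC-SHA256) with a shared key. The key, the field names, the customer numbers and the sample records are all hypothetical.

```python
import hashlib
import hmac

# Hypothetical secret agreed on by the bank and the aggregator (assumption).
SHARED_KEY = b"example-shared-key"


def normalize(name: str, dob: str) -> bytes:
    """Canonicalize the identifying fields so both parties hash identical bytes."""
    clean_name = " ".join(name.lower().split())
    return f"{clean_name}|{dob.strip()}".encode("utf-8")


def anonymize(name: str, dob: str, key: bytes = SHARED_KEY) -> str:
    """Return a one-way keyed hash of the normalized identity."""
    return hmac.new(key, normalize(name, dob), hashlib.sha256).hexdigest()


# Bank side: only the customer number and the anonymized token leave the bank.
bank_customers = [
    {"customer_no": "C-1001", "name": "Alice  Smith", "dob": "1970-01-31"},
    {"customer_no": "C-1002", "name": "Bob Jones", "dob": "1985-06-15"},
]
bank_file = [(c["customer_no"], anonymize(c["name"], c["dob"])) for c in bank_customers]

# Aggregator side: anonymize its own records the same way and index them by token.
aggregator_records = [
    {"name": "alice smith", "dob": "1970-01-31", "marital_status": "married"},
    {"name": "Carol White", "dob": "1990-03-02", "marital_status": "single"},
]
aggregator_index = {
    anonymize(r["name"], r["dob"]): {"marital_status": r["marital_status"]}
    for r in aggregator_records
}


def append_demographics(anonymized_file, index):
    """Match on tokens only and return the appended data keyed by customer number."""
    return {
        customer_no: index[token]
        for customer_no, token in anonymized_file
        if token in index
    }


appends = append_demographics(bank_file, aggregator_index)
print(appends)
```

In this sketch the aggregator only ever sees customer numbers and tokens, and the bank gets back demographics for the one customer both sides know. Because the aggregator holds the key, it could still hash guesses of its own, which is one of the risks of this simplified model.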

What is gained? In short, if the data is stolen by a hacker or an agent of the aggregator, they learn nothing useful. A corrupt employee at the data aggregator cannot peruse the customer file for selected information. The aggregator does not learn new information, like a new address or phone number, that the bank knew but the aggregator did not.

What are the risks? Well, there are lots of risks, especially in this simplified embodiment (e.g., something called a dictionary attack, in which an attacker anonymizes a list of likely values and looks for matches). But the basic principle is this: if one is going to share information in clear text anyway, then even this simple model reduces, to some degree, the risk of unintended disclosure.
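For readers who have not seen a dictionary attack, here is a minimal sketch of the idea in Python. It assumes the weakest possible setup, an unkeyed SHA-256 hash over a small, guessable value space (made-up four-digit phone extensions); the function names and data are hypothetical and chosen only for illustration.

```python
import hashlib


def naive_anonymize(value: str) -> str:
    """Unkeyed one-way hash: anyone can recompute it for any guessed value."""
    return hashlib.sha256(value.encode("utf-8")).hexdigest()


def dictionary_attack(tokens, candidates):
    """Hash every candidate value and report which stolen tokens it explains."""
    lookup = {naive_anonymize(candidate): candidate for candidate in candidates}
    return {token: lookup[token] for token in tokens if token in lookup}


# Tokens an attacker might obtain from an anonymized file (hypothetical data).
stolen_tokens = {naive_anonymize("555-0142"), naive_anonymize("555-0199")}

# The "dictionary": every value the attacker considers plausible.
candidates = (f"555-{n:04d}" for n in range(10000))

recovered = dictionary_attack(stolen_tokens, candidates)
print(sorted(recovered.values()))
```

The hash is never reversed. The attacker simply recomputes it for every plausible input, which is cheap whenever the space of likely values is small.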

Luckily, there are a variety of cryptographic and architectural extensions one can use to harden this information sharing model against many different kinds of attacks. [Techie interjection: Commutative encryption, for example, makes it more difficult for any one user to dictionary attack the anonymized values.]
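As a rough illustration of the commutative property, here is a toy sketch in Python based on modular exponentiation (in the style of the Pohlig-Hellman/SRA ciphers). This is my choice of example, not necessarily what any given system uses. The prime is far too small for real use, the way values are mapped to numbers is simplified, and the sample values are hypothetical.

```python
import hashlib
import math
import secrets

# Toy modulus: the Mersenne prime 2**127 - 1. Far too small for real use (assumption).
P = 2**127 - 1


def to_group(value: str) -> int:
    """Map a string to an integer in [1, P - 1] via SHA-256."""
    digest = hashlib.sha256(value.encode("utf-8")).digest()
    return int.from_bytes(digest, "big") % (P - 1) + 1


def new_key() -> int:
    """Pick a random exponent coprime to P - 1 so encryption is one-to-one."""
    while True:
        key = secrets.randbelow(P - 3) + 2
        if math.gcd(key, P - 1) == 1:
            return key


def encrypt(x: int, key: int) -> int:
    """Commutative step: (x**a)**b == (x**b)**a modulo P."""
    return pow(x, key, P)


# Each party keeps its own key private.
bank_key = new_key()
aggregator_key = new_key()

bank_values = ["alice smith|1970-01-31", "bob jones|1985-06-15"]
aggregator_values = ["alice smith|1970-01-31", "carol white|1990-03-02"]

# Each side encrypts its own values, then the other side encrypts them again.
bank_once = [encrypt(to_group(v), bank_key) for v in bank_values]
bank_twice = [encrypt(c, aggregator_key) for c in bank_once]

aggregator_once = [encrypt(to_group(v), aggregator_key) for v in aggregator_values]
aggregator_twice = set(encrypt(c, bank_key) for c in aggregator_once)

# Matching happens only on doubly encrypted values.
matches = [v for v, c in zip(bank_values, bank_twice) if c in aggregator_twice]
print(matches)
```

Because the order of encryption does not matter, equal inputs end up as equal doubly encrypted values. Since neither party holds both keys, neither can produce those values alone for a list of guesses.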

[Another technical note: Anonymization systems that prevent any possible re-identification (e.g., pointers to the original record) come with additional risks, like the inability to fully audit the system and the inability to correctly process deletions. This being the case, I think certain classes of anonymization-based systems must include Source Attribution and Data Tethering, so that the original holder of the data can control whether any re-identification is permitted within law and policy.]

Postings on this site don't necessarily represent IBM's positions, strategies or opinions. Jeff Jonas is the chief scientist of IBM Software Group's Threat and Fraud Intelligence unit and works on technologies designed to maximize enterprise awareness; Jeff also spends a large chunk of his time working on privacy and civil liberty protections. He will be writing a series of guest posts for Security Blog.
