GDPR compliance in legacy environments

The General Data Protection Regulation (GDPR) that takes effect next May will require businesses to protect the privacy of any personal data that they manage. There are many ways to do this, but the GDPR strongly encourages the use of pseudonymization, which, depending on how the business currently manages the personal data it acquires, may or may not be easy to accomplish. In particular, the implementation of pseudonymization is likely to be harder in legacy IT environments. But there are ways to do it, even in these challenging cases. 

Before looking at the ways to do this, it is useful to address exactly what pseudonymization is. One way to understand it is that it adds shades of gray between what used to be just black and white. What we know about the identity of a particular person can range from enough to uniquely identify them to absolutely nothing about them. If we know nothing at all about a person's identity, they have perfect anonymity but absolutely no accountability. If we know everything about their identity, then we have perfect accountability but absolutely no anonymity. Pseudonymity is the range of possibilities between these two cases (including both of the extreme ones), so it may be useful to think of it as implementing a trade-off between anonymity and accountability. Much of the data available about a particular person falls between the two extremes, even if the data is strongly protected.

While it may be generally considered to be non-identifying information, the very fact that a person is a citizen of France, for example, represents significant identity information because only about 13 percent of EU citizens are French. The set of potential identities has been reduced by approximately 87 percent. Similarly, the simple fact that a person has an account at a particular bank is identifying information. Of the roughly 510 million EU citizens, only a small fraction are likely to have an account with any particular bank. Perfect anonymity is very uncommon, perhaps even impossible. Most cases of what we think of as being anonymity are more appropriately considered to be a form of pseudonymity, and many forms of anonymization of personal information are more appropriately considered to be forms of pseudonymization.

To keep business information useful, the ability to reverse any form of pseudonymization used to protect personal information is often necessary. If we accept this limitation, then there are essentially two ways to implement useful pseudonymization: encryption and tokenization. Encryption provides a way to compute the transformation from unprotected personal information to a pseudonymized version of it while tokenization uses a data store to record the transformation instead of computing it directly. The two techniques are more similar than some technology vendors would like you to think. Tokenization is equivalent to a form of encryption known as the “one-time pad” that was invented by Frank Miller in 1882. Because the two are so similar, we will use the single term “encryption” to refer to both.   
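To make the "data store instead of computation" idea concrete, here is a minimal sketch of a token vault in Python. The class name `TokenVault` and its methods are hypothetical, invented for this illustration, and an in-memory dictionary stands in for what would be a hardened, persistent data store in practice.

```python
import secrets


class TokenVault:
    """Toy token vault: records the mapping instead of computing it."""

    def __init__(self):
        self._token_by_value = {}
        self._value_by_token = {}

    def tokenize(self, value: str) -> str:
        # Only decimal-digit strings are handled in this sketch.
        if not value or any(c not in "0123456789" for c in value):
            raise ValueError("expected a non-empty string of decimal digits")
        # Reuse the existing token so the same value always maps to the same token.
        if value in self._token_by_value:
            return self._token_by_value[value]
        # Draw random digit strings of the same length until an unused one is found.
        # (A real vault would also have to deal with running out of tokens.)
        while True:
            token = "".join(secrets.choice("0123456789") for _ in range(len(value)))
            if token not in self._value_by_token:
                break
        self._token_by_value[value] = token
        self._value_by_token[token] = value
        return token

    def detokenize(self, token: str) -> str:
        # Reversal is a lookup, not a computation.
        return self._value_by_token[token]


vault = TokenVault()
token = vault.tokenize("5610591081018250")
original = vault.detokenize(token)
```

The token is random, so nothing about the original value can be derived from it without access to the vault. That is also why the vault itself becomes the thing that has to be protected.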

One problem with encryption is that it can change the format of data. In IT environments that have many older components, this can cause lots of unforeseen problems. Some systems expect a credit card number to be 16 digits, for example, and will fail in undesirable ways if they try to process a credit card number that is not a 16-digit value. Suppose that you encrypt the value “5610591081018250” (one of the public values that PayPal uses for testing credit card processing). If you encrypt this value using many forms of encryption, you get an output that looks like random zeroes and ones. In general, this output will not even correspond to printable characters, so we would have to use some additional form of encoding (such as Base64 encoding) to represent the encrypted value which might end up looking something like “DuMRdTVZdd2J05D9ns6WWg==” for example.   
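The effect is easy to reproduce. The sketch below uses a deliberately simple toy cipher (XOR with a keystream derived from SHA-256) purely as a stand-in for a real cipher such as AES, because it needs nothing beyond the Python standard library. It is not secure, the key is made up, and its output will not match the Base64 string quoted above, but the change in format is the same.

```python
import base64
import hashlib


def toy_encrypt(key: bytes, data: bytes) -> bytes:
    """Toy cipher for illustration only: XOR with a SHA-256-derived keystream.

    NOT secure. It stands in for a real cipher so the example runs
    with the standard library alone. Applying it twice restores the input.
    """
    keystream = hashlib.sha256(key).digest()  # 32 bytes of keystream
    if len(data) > len(keystream):
        raise ValueError("toy cipher handles at most 32 bytes")
    return bytes(d ^ k for d, k in zip(data, keystream))


card_number = "5610591081018250"
key = b"hypothetical demo key"

ciphertext = toy_encrypt(key, card_number.encode("ascii"))  # 16 raw bytes
encoded = base64.b64encode(ciphertext).decode("ascii")      # printable form

print(len(card_number), card_number.isdigit())
print(len(encoded), encoded.isdigit())
```

Sixteen bytes of ciphertext always become 24 Base64 characters, including two trailing "=" padding characters, so the encoded value can never pass a check for 16 decimal digits.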

Note how the encryption and subsequent Base64 encoding have changed the format of the input data, which was 16 decimal digits. The output is no longer just decimal digits. It is also longer than 16 characters. As a consequence, many legacy applications will fail very ungracefully if they need to process such an output as a credit card number because the source code and data structures they use were designed to expect only 16 decimal digits. Changing those applications and data stores to accept the encrypted values, if it is feasible at all, can be a very costly and time-consuming exercise. The good news, however, is that it is possible to encrypt data in a way that keeps the format of the data unchanged.

This approach is called “format-preserving encryption” (FPE), and it has been around since 1981, when a US government guide (FIPS 74) described a use of the venerable DES encryption algorithm to encrypt a digit at a time in a way that kept the encrypted value as a digit. This early version of FPE would not be considered a good approach with today’s understanding of what makes encryption secure, but it shows that there has been interest in the general problem of preserving the format of data through encryption for many years.

FPE is an elegant solution to the problem caused by encryption changing the format of data. It avoids the expensive problems of modifying legacy systems to give them the ability to process the encrypted data. Instead of adapting the environment to the data, this approach adapts the data to the environment. And by doing this, you dramatically reduce the need to make any modifications to the legacy environment. Using FPE makes it possible to protect your customers’ data using pseudonymization while still keeping costs as low as possible. 
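To show how a cipher can keep digits as digits, here is a minimal sketch of one common construction: a Feistel network that works on the two halves of a digit string using modular addition. This is an illustration of the general idea only. It is not one of the standardized FPE modes, it has not been analyzed for security, and the function names, round count and key are assumptions made for the example. A real deployment should use a vetted, standards-based implementation.

```python
import hashlib
import hmac

ROUNDS = 10  # arbitrary choice for this sketch


def _round_value(key: bytes, round_no: int, half: str, modulus: int) -> int:
    # Keyed round function: HMAC-SHA256 of the round number and one half,
    # reduced to a number with the same count of digits as the other half.
    msg = f"{round_no}:{half}".encode("ascii")
    digest = hmac.new(key, msg, hashlib.sha256).digest()
    return int.from_bytes(digest, "big") % modulus


def _split(digits: str):
    if not digits or len(digits) % 2 or any(c not in "0123456789" for c in digits):
        raise ValueError("expected an even-length string of decimal digits")
    half_len = len(digits) // 2
    return digits[:half_len], digits[half_len:], half_len, 10 ** half_len


def fpe_encrypt(key: bytes, digits: str) -> str:
    left, right, half_len, modulus = _split(digits)
    for r in range(ROUNDS):
        # (L, R) -> (R, (L + F(R)) mod 10^n), so both halves stay n digits long.
        mixed = (int(left) + _round_value(key, r, right, modulus)) % modulus
        left, right = right, str(mixed).zfill(half_len)
    return left + right


def fpe_decrypt(key: bytes, digits: str) -> str:
    left, right, half_len, modulus = _split(digits)
    for r in reversed(range(ROUNDS)):
        # Undo one round: the old R is the current L; subtract F(old R) to get old L.
        unmixed = (int(right) - _round_value(key, r, left, modulus)) % modulus
        left, right = str(unmixed).zfill(half_len), left
    return left + right


key = b"hypothetical demo key"
protected = fpe_encrypt(key, "5610591081018250")
recovered = fpe_decrypt(key, protected)
```

The protected value is still 16 decimal digits, so a legacy field that was sized and validated for a card number can store it without modification, and the holder of the key can still recover the original.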

So when you investigate a way to pseudonymize data to ensure that you are complying with the GDPR, FPE is a technology worth considering. It has been vetted and approved by the US National Institute of Standards and Technology (NIST), and it is possible to use this approach in ways that comply with standards like FIPS 140-2 or ISO/IEC 19790, which means that you should have an easy job convincing your auditors that this approach is sound.

Complying with the GDPR can be hard and expensive. Do not make it harder and more expensive than it needs to be. Protecting your customers’ personal data is good. Protecting it in a way that will minimize the changes required to your current operating environment is even better, and using FPE to implement pseudonymization will help you do just that.

Luther Martin, Distinguished Technologist for Voltage at Micro Focus

Image Credit: Wright Studio / Shutterstock