GDPR Technical Series #1: Anonymization and Pseudonymization

Subra Ramesh

published: August 14, 2018

A fundamental tenet of the European Union’s General Data Protection Regulation (GDPR), which went into effect on May 25, 2018, is the recommendation to pseudonymize personal data wherever possible. Articles 4, 6, 25, 32, 40, and 89, as well as Recitals 28, 29, 75, 78, 85, and 156 of the GDPR, explicitly call out pseudonymization. Interestingly enough, the GDPR does not use the term “anonymization” anywhere in the text. While the GDPR deems pseudonymization sufficient, it does not preclude other means of data protection (Recital 28).

In this blog post, we examine differences between these terms. In the next blog post of our GDPR Series, we cover common element-level protection techniques and map anonymization and pseudonymization to those techniques.

Anonymization vs. Pseudonymization

First, let’s take the ideal case: anonymization. Anonymization is the transformation of data so that the data can no longer be associated with a particular person. For anonymization to be effective, identifying the person associated with the data must not be possible, even when other knowledge is combined with the anonymized data. The problem for data controllers and data processors is that, in most cases, perfect anonymization also renders the data useless for analytics. Losing the ability to do valuable analytics could be one explanation for the GDPR’s omission of anonymization. Even then, however, anonymized data could still be useful for development and testing use cases.

Consider the original table below (Fig 1).

Last Name	First Name	Employee ID	Email Address	Title	Start Date	Department	Salary	Vacation Available (days)
Jones	Edward	3565486	ejones234@kmail.com	Technical Manager	09/01/2013	IT	$130,000	20
Xu	Jason	56544884	j.xu.563@zahoo.com	Architect	07/01/2010	Engineering	$125,000	15
Stanton	Joseph	2484686	Joseph.stanton4599@kmail.com	CEO	01/03/2008	HQ	$400,000	12
Powers	Rebecca	4856459	Beckyp43543@kmail.com	Director	02/02/2011	Sales	$140,000	18

Fig 1 — Table before Anonymization

After anonymization, the table would look like the one below (Fig 2).

Last Name	First Name	Employee ID	Email Address	Title	Start Date	Department	Salary	Vacation Available (days)
Jenkins	David	34543593	djenkins546@kmail.com	General Manager	07/02/2010	IT	$170,000	15
Cortes	Ramona	63458245	ramonacortes234@zahoo.com	Director	05/02/2008	Engineering	$145,000	15
Helleboid	Jean	56455344	jean.helleboid764@kmail.com	Manager	04/02/2011	HQ	$155,000	10
Watson	Brian	34534887	BrianWatson7679@kmail.com	Manager	06/07/2004	Sales	$123,000	19

Fig 2 — Table after Anonymization

Note that every field is transformed in Fig 2, except “Department.” Assuming each department consists of more than one person, recovering the original data will not be possible, even with additional external information. However, if IT is a single-person department, then the department value alone tells us whose record this is. In our example, because all the values other than “Department” are anonymized, knowing whose record it is reveals nothing further about that person.
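The single-person department risk can be checked mechanically by counting how many records share each untransformed value. The snippet below is a minimal sketch of that idea, not a full anonymity measure; the helper names smallest_group and singled_out are hypothetical, and the four-row list simply copies the “Department” column of the sample table, which is why every department shows up as a group of one here.

```python
from collections import Counter

# "Department" values as they appear, untransformed, in the sample table.
departments = ["IT", "Engineering", "HQ", "Sales"]

def smallest_group(values):
    """Return the size of the smallest set of records sharing one value."""
    return min(Counter(values).values())

def singled_out(values):
    """Return the values that occur exactly once, i.e. point to one record."""
    counts = Counter(values)
    return sorted(value for value, n in counts.items() if n == 1)

print(smallest_group(departments))
print(singled_out(departments))
```

On a full employee roster, a smallest group size above 1 would match the assumption that each department has more than one person.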

Now consider pseudonymization. Let us assume that in addition to the “Department” column the “Salary” column is also not transformed. The following table (Fig 3) results from that action rather than the table in Fig 2.

Last Name	First Name	Employee ID	Email Address	Title	Start Date	Department	Salary	Vacation Available (days)
Jenkins	David	34543593	djenkins546@kmail.com	General Manager	07/02/2010	IT	$130,000	15
Cortes	Ramona	63458245	ramonacortes234@zahoo.com	Director	05/02/2008	Engineering	$125,000	15
Helleboid	Jean	56455344	jean.helleboid764@kmail.com	Manager	04/02/2011	HQ	$400,000	10
Watson	Brian	34534887	BrianWatson7679@kmail.com	Manager	06/07/2004	Sales	$140,000	19

Fig 3 — Table after Pseudonymization

In this case, if CEO Joseph Stanton (row 3) had not been in the data set, the table would have been equivalent to an anonymized set. However, since the “Salary” column has not been transformed, the outlier in that column ($400,000) gives away the CEO’s identity and, with it, the rest of that record. In other words, the knowledge that the CEO is likely to be paid well above everyone else essentially re-identifies the record after pseudonymization.
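The re-identification described here takes very little effort. As a rough sketch, the rows below copy the surrogate last name, department, and salary from Fig 3, and the hypothetical helper highest_paid applies the one piece of outside knowledge the attacker is assumed to have: that the CEO earns the most.

```python
# Rows from Fig 3: surrogate last name, department, untransformed salary.
fig3 = [
    ("Jenkins", "IT", 130_000),
    ("Cortes", "Engineering", 125_000),
    ("Helleboid", "HQ", 400_000),
    ("Watson", "Sales", 140_000),
]

def highest_paid(rows):
    """Return the row with the largest salary (the third field)."""
    return max(rows, key=lambda row: row[2])

# Outside knowledge (an assumption of this sketch): the CEO is the top earner.
suspect = highest_paid(fig3)
print(suspect)  # ('Helleboid', 'HQ', 400000)
```

The surrogate name “Helleboid” hides nothing once the salary points to the CEO, so the remaining untransformed values in that row are exposed.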

In the examples above, since the number of fields transformed is substantial in comparison with the total number of fields, the data, while usable for testing, is rendered useless for meaningful analysis. To be able to draw meaningful conclusions, the fields of interest in analysis need to be available without transformation—or at least be in the same range—so that aggregate results are the same.
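To make the point about aggregates concrete, the sketch below compares the mean of the “Salary” column across the three tables. The values are copied from Fig 1, Fig 2, and Fig 3, reading the first salary in Fig 1 as $130,000 (the value that also appears in Fig 3); the helper name mean is only illustrative.

```python
def mean(values):
    """Arithmetic mean of a non-empty list of numbers."""
    return sum(values) / len(values)

salary_fig1 = [130_000, 125_000, 400_000, 140_000]  # original table
salary_fig2 = [170_000, 145_000, 155_000, 123_000]  # salary transformed
salary_fig3 = [130_000, 125_000, 400_000, 140_000]  # salary left as-is

for label, column in [("Fig 1", salary_fig1),
                      ("Fig 2", salary_fig2),
                      ("Fig 3", salary_fig3)]:
    print(f"{label}: mean salary = {mean(column):,.0f}")
# Fig 1: mean salary = 198,750
# Fig 2: mean salary = 148,250
# Fig 3: mean salary = 198,750
```

Only the table that leaves “Salary” untransformed reproduces the original average, which is the trade-off between analytic value and re-identification risk.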

Degrees of Anonymization

The degree of anonymization, and indeed whether a data set is anonymized or pseudonymized, depends on the nature of the untransformed data and how much it might reveal. Once there is enough additional information to give clues that identify the original values, the transformation is pseudonymization rather than anonymization. In the example above, the sufficient additional information is the knowledge that the CEO is likely the highest-paid employee in the company. Additional information might be public information or data available in other tables or data stores in the organization.

There are measures of the degree of anonymization, such as the family of data similarity measures, including k-anonymity, l-diversity, t-closeness, and other criteria. There are also newer techniques such as differential privacy that de-identify the data. We will go into these measures and techniques in another blog post devoted to that subject.

As we saw in the discussion above, anonymization and pseudonymization are distinct approaches that protect the data as a whole, in the aggregate. The anonymization and pseudonymization effects are achieved by applying transformations at the element level. We will delve into these element-level techniques in the next blog post and map those techniques to anonymization and pseudonymization.
