Finding EU Personal Data With Regular Expressions (Regexes)

If there is one very important but under-appreciated point to make about complying with tough data security regulations such as the General Data Protection Regulation (GDPR), it’s the importance of finding and classifying the personally identifiable information, or personal data as it’s referred to in the EU. Discovering where personal data is located in file systems and the permissions used to protect it should be the first step in any action plan.

You don’t have to necessarily take our word for it, you can look at GDPR to-do lists from law firms and consulting groups that are heavily involved with advising companies on compliance.

Get the Free Essential Guide to US Data Protection Compliance and Regulations

We’ve already given you a heads up about Varonis GDPR Patterns, which helps you spot this personal data, and now that I’ve chatted and learned more from Sarah and the Varonis product development team, I’ve more to share.

Nobody Does It Better

GDPR Patterns is, of course, built on our Data Classification Framework or DCF. For those new to Varonis, DCF has an enormous advantage over other classification solutions, since it implements true incremental scanning. After the initial scan of the file system, DCF can quickly identify any changes, and then selectively scan those directories or folders that have been accessed. This makes far more sense than starting scanning from scratch!

By the way, for those crazy enough to think they can try rolling their own data scanning software, they can refer to my series of posts on a DIY classification system based on PowerShell. Please learn from my craziness and avoid the urge.

With DCF doing the heavy lifting, GDPR Patterns can focus on spotting EU-style personal data within files. According to the GDPR definition, personal data is effectively anything related to an individual that can identify that person. The definition’s very broad and deceptively vague language covers a lot of territory! (For more excruciating details, please refer to this official EU document.)

Obviously, we’re talking about all the usual suspects: names, addresses, phone numbers, credit card, bank and other account numbers. GDPR personal data also encompasses internet-era identifiers such as IP and email addresses, and futuristic biometric identifiers (DNA, retinal scans) as well.

Many EU Identifiers

The EU comprises 28 countries, and that means many identifiers vary by country. This is where the Varonis product team did the hard work of research, spending months analyzing phone numbers, license plate numbers, VAT codes, passports, driver’s licenses, and national identification numbers across the EU.

Does anybody know what the Hungarian personal identification code, known as Születési szám, looks like?

That would be an 11-digit sequence based on date of birth, gender, a unique number to separate those born on the same date, and a checksum.

Or what about a Slovakian passport number?

That’s 9-characters: 2-digits followed by 7-letters.

Varonis has worked all this out!

We use regular expressions or regexes to do pattern matching when possible. It’s not as easy to craft these regexes as you might think.

If you want to match wits against the people who devised the Dutch license plate numbering scheme, you can click here to see a regex analysis of one sample number. And then you can try a few out on your own to see if you’ve got it. Enjoy!

regex gdpr — A regular expression representing Dutch license plates. Think you understand it? Try your luck with the link above!

Patterns Are More Than Regexes

The research and effort we put into the regular expressions only forms part of the GDPR Patterns solution. Sure, it’s conceivable that someone could work out regexes for a few countries or do Google searches to find these expressions on the web.

However, we’ve crafted our regexes by looking at real-world data samples, and not automatically accepting what’s provided by government agencies and others. Our GDPR regexes have proven themselves in the field!

With so many different alphanumeric patterns, it shouldn’t be surprising there’d be occasional “collisions” — sequences that could be classified into several types of personal data. For example, EU passport numbers vary between 8 and 10 consecutive numbers, so they’d also be caught by an EU phone number regex.

That is why we’ve also added validator algorithms to supplement the regexes. Specifically, GDPR Patterns scans for special keywords that are near or in proximity to the EU personal data: if we find the keyword, it helps zero in on the right GDPR pattern.

For example, when GDPR Patterns finds an 11-digit number, it looks for additional keywords to determine if this represents a national personal ID: “IK” or “ISIKUKOOD” implies Esontia; “Születési szám” or “Személyi szám” or “Személyi azonosító” would of course mean Hungary, etc.

If we don’t find the extra keywords, then we can’t assume the 11 digits are an identification code, and so it would not be classified as GDPR personal data. In other words, the validation algorithms reduce false positives.

In case you’re asking, we do use negative keywords as well. If GDPR Patterns finds one of these types of keywords, it means that personal data caught by the regex expression can’t be classified under that pattern.

More GDPR Patterns Details

The Varonis developers have dived deep into EU identification numbers, driver’s licenses, license plates, and phone numbers, looking at real-world samples to come up with both positive and negative keywords and proximity information.

We’ve integrated GDPR Patterns into our DatAdvantage reports to show which files contain a specific Pattern based on a hit count.

GDPR Patterns is also integrated with DatAlerts so that notifications can be delivered when files are accessed containing personal data. We’ll help you meet the GPDR 72-hour breach notification requirement.

Data Transport Engine will also use GDPR Patterns to archive or remove stale or no longer useful EU personal data, another requirement in GDPR.

Have questions? Contact us for more information.

What should I do now?

Below are three ways you can continue your journey to reduce data risk at your company:

Schedule a demo with us to see Varonis in action. We'll personalize the session to your org's data security needs and answer any questions.

See a sample of our Data Risk Assessment and learn the risks that could be lingering in your environment. Varonis' DRA is completely free and offers a clear path to automated remediation.

Follow us on LinkedIn, YouTube, and X (Twitter) for bite-sized insights on all things data security, including DSPM, threat detection, AI security, and more.

Michael Buckbee Michael has worked as a sysadmin and software developer for Silicon Valley startups, the US Navy, and everything in between.