Varonis announces strategic partnership with Microsoft to accelerate the secure adoption of Copilot.

Learn more

Revealed: Secret PIIs in your Unstructured Data!

Personally identifiable information or PII is pretty intuitive. If you know someone’s phone, social security, or credit card number, you have a direct link to their identity. Hackers use these...
Michael Buckbee
3 min read
Published March 21, 2013
Last updated October 6, 2022

Personally identifiable information or PII is pretty intuitive. If you know someone’s phone, social security, or credit card number, you have a direct link to their identity. Hackers use these identifiers, along with a few more personal details, as keys to unlock data, steal identities, and ultimately take your money. In some of my recent blogging, I’ve referred to the blurring of lines between PII and non-PII data. Case in point: it’s been known for at least 10 years that there are specific pieces of data, which in isolation may appear anonymous, but when taken together they’re just as effective at identifying a person as traditional PII.

The easiest to understand of these so called quasi-PIIs is the trio of full birth date, zip code, and gender. If a company  published a dataset that had been “de-identified” by removing all the standard PIIs, but left those three data items alone, a smart hacker could with very high likelihood find the name and address of the person behind that data.

Get the Free Pentesting Active
Directory Environments e-book

Why would this work?  At a very basic level, the identity thief is effectively doing the work of a detective–essentially going through lists looking for matches. The lists in this case are voting records, which are available from most US towns and counties at a nominal fee– typically around $40. Voting records contain name, address, and most importantly full birth date; zip codes can be easily determined from the address.

By looking for matching birth dates and zip codes, savvy hackers narrow down the search to a few names. Add gender information and for most zip codes in the US, hackers can arrive at a unique name. Of course, the more additional information or clues gathered, especially taken from social media and other web sites, the easier it is to filter out names when there’s more than one candidate.

A quick back of the envelope calculation tells you why one might do very well with this approach. Taking 365 days—ignoring leap years—and multiplying by an average age of 80, it works out that a complete birth date gives 29,200 “bins” to place a zip code’s worth of people. If you have gender information, you double the number of slots, to 58,400.

I can hear nitpickers out there saying that voting rolls contain names of those over the age of 18, so you would have to remove 6570 slots. True enough, but researchers have shown it’s possible to exploit Facebook’s leaky handling of data on school age minors to partially address this gap.

In any case, based on the last US census, there are over 40,000 zip codes, with an average of only 7000 people per zip code. On a gut level, it seems there’s a good chance most of those 7000 people will find themselves alone in one of those 58,400 slots. In other words, the odds are very good that most of them won’t share the same date of birth, zip code, and gender.

The real validation of this type of  hacking attack came from Carnegie Mellon University computer science professor and data privacy expert Latanya Sweeney, who ran the numbers back in 2000. Using then current census data (broken down by zip codes and age groups), she was able to identify 87% of the people in the US working with just those three non-PIIs.

Fortunately, Sweeney’s research and results from other experts have made their way to policy makers. For example, when medical research on patients is published, HIPAA’s Safe Harbor de-identification rules say that no geographic unit smaller than a state can be included in the public data. Full dates (e.g., admission, birth) must also have the year removed.

With US regulations on PII varying by the particular legislation, this is by no means a universal rule. However, the Federal Trade Commission, an influential regulatory agency on privacy matters, has recently issued new best practices on data de-identification. They’ve called for all companies to achieve a “reasonable level of confidence” that their public data can’t be linked back to an individual. Clearly, the combination of birth date, zip code, and gender would fail that test.

Are there other quasi-PII’s out there? Of course! The larger problem is that consumers are sharing all kinds of information about themselves on web sites and social forums. In a possible scenario, think of an online retailer collecting preference data about its customers—sports interests, hobbies, etc.—along with geographic data and perhaps income information.

These data items would not be considered traditional PII.  If hackers pulled this “anonymous” data from a poorly permissioned file on a server, you could imagine them mining various special interest sites, looking for names that match up based on those interests and geo data.  Once they have a match, the next step might be a phishing attack, with the hackers pretending to be the retailer.

For companies that want to stay ahead of the coming stricter de-identification rules—that are being considered here in the US  and will likely become law in the EU—it would be worth their while to start carefully reviewing their non-PII data. Wherever that data might be on their file system.

What you should do now

Below are three ways we can help you begin your journey to reducing data risk at your company:

  1. Schedule a demo session with us, where we can show you around, answer your questions, and help you see if Varonis is right for you.
  2. Download our free report and learn the risks associated with SaaS data exposure.
  3. Share this blog post with someone you know who'd enjoy reading it. Share it with them via email, LinkedIn, Reddit, or Facebook.

Try Varonis free.

Get a detailed data risk report based on your company’s data.
Deploys in minutes.

Keep reading

Varonis tackles hundreds of use cases, making it the ultimate platform to stop data breaches and ensure compliance.

personally-identifiable-information-hides-in-dark-data
Personally Identifiable Information Hides in Dark Data
To my mind, HIPAA has the most sophisticated view of PII of all the US laws on the books. Their working definition encompasses vanilla identifiers: social security and credit card...
new-pii-discovered:-license-plate-pictures
New PII Discovered: License Plate Pictures
After finishing up some research on personally identifiable information I thought, mistakenly, that I was familiar with the most exotic forms of PII uncovered in recent years, including zip code-birth...
add-varonis-to-iam-for-better-access-governance
Add Varonis to IAM for Better Access Governance
Managing permissions is a colossal job fraught with peril, and over-permissive folders are the bane of InfoSec and a hacker’s delight. Many organizations employ IAM (Identity Access Management) to help...
2021-saas-risk-report-reveals-44%-of-cloud-privileges-are-misconfigured
2021 SaaS Risk Report Reveals 44% of Cloud Privileges are Misconfigured
Cloud apps make collaboration a breeze, but unless you’re keeping a close watch on identities, behavior, and privileges across each and every SaaS and IaaS you rely on, you’re a sitting duck.