

What is Data Classification? Guidelines and Process


In order to protect your sensitive data, you have to know what it is and where it lives.

Data Classification Defined

Data classification is the process of analyzing structured or unstructured data and organizing it into categories based on the file type and contents.


In practice, data classification often means searching files for specific strings of data. You might want to find every reference to “Szechuan Sauce” on your network, locate all HIPAA-protected data, or prepare for data privacy regulations by identifying any personally identifiable information (PII) in your data stores.


Data classification is usually based on a file parser combined with a string analysis system. A file parser allows the data classification engine to read the contents of several different types of files. A string analysis system then matches data in the files to defined search parameters.
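The two-part design described above can be sketched in a few lines of Python. This is only an illustration, not how any particular product works: the `PARSERS` table, the `PATTERNS` dictionary, and the function names are all hypothetical, and the "parser" here handles plain-text formats only.

```python
import re
from pathlib import Path

# Hypothetical file parsers: each one turns a file into plain text.
# A real engine would add parsers for documents, spreadsheets, etc.
def parse_text(path: Path) -> str:
    return path.read_text(encoding="utf-8", errors="ignore")

PARSERS = {".txt": parse_text, ".csv": parse_text, ".log": parse_text}

# Hypothetical search parameters: label -> compiled pattern.
PATTERNS = {
    "us_ssn": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
    "email": re.compile(r"\b[\w.+-]+@[\w-]+\.[\w.-]+\b"),
}

def classify_text(text: str) -> dict:
    """String analysis: count matches for each pattern, dropping zero counts."""
    counts = {label: len(rx.findall(text)) for label, rx in PATTERNS.items()}
    return {label: n for label, n in counts.items() if n}

def classify_file(path: Path) -> dict:
    """Pick a parser by file extension, then run string analysis on the text."""
    parser = PARSERS.get(path.suffix.lower())
    if parser is None:
        return {}  # unsupported file type: nothing to report
    return classify_text(parser(path))
```

Keeping the parser and the pattern matching separate means new file types and new search parameters can be added independently of each other.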

RegEx (short for regular expression) is one of the more common string analysis systems; it defines the specifics of a search pattern. For example, if I wanted to find all Visa credit card numbers in my data, the RegEx would look like:

\b(?<![:$._'-])(4\d{3}[ -]\d{4}[ -]\d{4}[ -]\d{4}\b|4\d{12}(?:\d{3})?)\b

That sequence tells the RegEx system to look for a 4-digit group starting with the number 4, followed by a space or a dash and a second 4-digit group, and… you get the idea. The second alternative catches the same numbers written as an unbroken run of 13 or 16 digits. Only a string of characters that matches the RegEx directly generates a positive result.
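To see the pattern at work, here is a small Python sketch that runs it over a sample string. The pattern is the one shown above, written with a straight apostrophe in the lookbehind character class; the `find_visa_candidates` helper and the sample text are made up for illustration.

```python
import re

# The article's Visa pattern, with a straight apostrophe in the lookbehind class.
VISA_RE = re.compile(
    r"\b(?<![:$._'-])(4\d{3}[ -]\d{4}[ -]\d{4}[ -]\d{4}\b|4\d{12}(?:\d{3})?)\b"
)

def find_visa_candidates(text: str) -> list:
    """Return every substring that matches the Visa pattern."""
    return [m.group(1) for m in VISA_RE.finditer(text)]

sample = (
    "Card on file: 4111 1111 1111 1111. "
    "Backup: 4012-8888-8888-1881. "
    "Not Visa: 5500 0000 0000 0004. "
    "Order id: 12345."
)
print(find_visa_candidates(sample))
# → ['4111 1111 1111 1111', '4012-8888-8888-1881']
```

The negative lookbehind rejects numbers that directly follow characters such as `$` or `:`, which tend to signal a price or an identifier rather than a card number.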

Although there are some parallels between the two, data classification is not the same as data indexing. Classification looks for identifiers based on patterns and returns a list of files and how many matches it found for each pattern. It doesn’t necessarily index those files. Indexing enables search, and you’ll need to search those matches to fulfill data subject access requests and right-to-be-forgotten requests.
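The difference is easier to see side by side. In this hedged sketch, `classify` produces the kind of per-file match counts described above, while `build_index` produces a simple inverted index that supports search. The `DOCS` sample data, the single SSN pattern, and both function names are assumptions made for the example.

```python
import re
from collections import defaultdict

# Hypothetical in-memory "files": name -> extracted text.
DOCS = {
    "hr/notes.txt": "Employee SSN 123-45-6789 and 987-65-4321 on record.",
    "sales/leads.txt": "Call Dana Smith about the renewal.",
}

SSN_RE = re.compile(r"\b\d{3}-\d{2}-\d{4}\b")

def classify(docs):
    """Classification: which files match the pattern, and how many times."""
    report = {}
    for name, text in docs.items():
        hits = len(SSN_RE.findall(text))
        if hits:
            report[name] = {"us_ssn": hits}
    return report

def build_index(docs):
    """Indexing: map every lowercase word to the files that contain it."""
    index = defaultdict(set)
    for name, text in docs.items():
        for word in re.findall(r"\w+", text.lower()):
            index[word].add(name)
    return index
```

The classification report says that `hr/notes.txt` holds two SSN-like strings, but only the index can answer a question like "which files mention Dana?", which is the kind of lookup a data subject access request calls for.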

Reasons for Data Classification


The Center for Internet Security (CIS), which devotes an entire section to data classification protections, says data classification is important because “in several high-profile breaches over the past two years, attackers were able to gain access to sensitive data stored on the same servers with the same level of access as far less important data.”

Beyond data security concerns, there are several other reasons to implement a data classification process:

  • Identify sensitive files, intellectual property, and trade secrets
  • Secure (and lock down) critical data
  • Track regulated data to comply with regulations like HIPAA, PCI, or GDPR
  • Optimize search capabilities with data indexing
  • Discover statistically significant patterns or trends inside data
  • Optimize storage by identifying duplicate or stale data

Data Classification Process: 4 Steps

Data classification processes differ slightly depending on the objectives of the project. Any data classification project requires automation to process the astonishing amount of data that companies create every day. In general, every data classification process needs to cover the same four steps:

  1. Define the objectives of the data classification process. What are you looking for? Why?
  2. Create workflows based on the selected classification tools. How does the classification process work? Is there a process in place to scan new data? Is there a process to create new classification criteria?
  3. Define the categories and classification criteria. What kinds of data should you search for? What process will you follow to validate the classification results?
  4. Define outcomes and usage of classified data. How are the results organized – and how do you plan to make business decisions based on those results?
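One of the workflow questions in step 2 is whether there is a process in place to scan new data. A minimal sketch of one possible approach is to compare file modification times against the previous scan, so only new or changed files get re-classified. The function names are hypothetical, and relying on modification time is an assumption; a real tool might use content hashes or file system events instead.

```python
import os

def snapshot(root: str) -> dict:
    """Map each file path under root to its last-modified time."""
    state = {}
    for dirpath, _dirs, files in os.walk(root):
        for name in files:
            path = os.path.join(dirpath, name)
            state[path] = os.stat(path).st_mtime
    return state

def files_to_scan(current: dict, previous: dict) -> list:
    """Return paths that are new, or whose modified time changed, since the last scan."""
    return sorted(
        path for path, mtime in current.items() if previous.get(path) != mtime
    )
```

A scheduled job could call `snapshot` on each run, pass the result and the saved previous snapshot to `files_to_scan`, classify only the returned paths, and then store the new snapshot for next time.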

Data Classification Tips

  • Use automated tools to process large volumes of data quickly
  • Leverage RegExes and the Luhn checksum (the check-digit algorithm used by payment card numbers): create custom classification patterns or implement software that does the heavy lifting for you
  • Validate your classification results: nobody likes a false positive.
  • Figure out how to best use your results and apply classification to everything from data security to business intelligence.
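The tips about Luhn and false positives work well together: a RegEx finds candidate numbers, and the Luhn checksum weeds out digit strings that cannot be real card numbers. The sketch below is a generic illustration under that assumption; `CANDIDATE_RE`, `luhn_valid`, and `validated_card_numbers` are names invented for this example, and the candidate pattern is deliberately simpler than the Visa pattern shown earlier.

```python
import re

def luhn_valid(number: str) -> bool:
    """Return True if the digits in `number` pass the Luhn checksum."""
    digits = [int(ch) for ch in number if ch.isdigit()]
    if len(digits) < 2:
        return False
    total = 0
    # Walk from the rightmost digit and double every second digit.
    for i, d in enumerate(reversed(digits)):
        if i % 2 == 1:
            d *= 2
            if d > 9:
                d -= 9  # same as adding the two digits of the product
        total += d
    return total % 10 == 0

# Hypothetical candidate pattern: 16 digits, optionally grouped in fours.
CANDIDATE_RE = re.compile(r"\b\d{4}[ -]?\d{4}[ -]?\d{4}[ -]?\d{4}\b")

def validated_card_numbers(text: str) -> list:
    """Keep only the RegEx matches that also pass the Luhn check."""
    return [
        m.group(0) for m in CANDIDATE_RE.finditer(text) if luhn_valid(m.group(0))
    ]
```

For example, `validated_card_numbers("a 4111 1111 1111 1111 b 1234 5678 9012 3456")` keeps the first number and drops the second, which matches the pattern but fails the checksum.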

Data Classification FAQ

How does Varonis do Data Classification differently?

Varonis has over 400 pre-configured RegExes to discover all manner of PII, PHI, and GDPR data, with a fully customizable classification engine you can configure for any business purpose. Varonis monitors over 60 file types out of the box (including documents, spreadsheets, and more) and identifies new data that needs to be re-scanned (without starting the whole thing over) to catch new and recently added sensitive files, including:

  • Personal information: credit card numbers, passport numbers, driver’s license numbers, social security numbers, IBAN, and more
  • Financial records
  • Security file types (.cer, .crt, .p7b, etc.)
  • Regulated data (GDPR, HIPAA, PII, PHI, PCI, Sarbanes-Oxley, GLBA, etc.)

The Varonis Data Classification Engine can process ~100 GB of data in an hour (depending on your own hardware and network capacity) and includes rigorous false positive checks that reduce the work of analyzing the classification results. Not every 16-digit numeric string is a credit card number, for instance, and Varonis knows the difference.

What Comes After Data Classification?

Varonis brings context to that classification. Varonis not only identifies the data that you’re looking for, but also shows you who can access that data and who is actually accessing it. Once you identify and classify sensitive data, you can take action on it: apply labels, lock down permissions, monitor access, alert on suspicious activity, and meet compliance requirements like right-to-be-forgotten. The Varonis Data Classification Engine allows you to protect your most sensitive and important data from unwanted access, accidental data leaks, and security attacks.

See the Data Classification Engine in action with a 1:1 demo.

Jeff Petters


Jeff has been working on computers since his Dad brought home an IBM PC 8086 with dual disk drives. Researching and writing about data security is his dream job.

