How to Do Data Classification at Scale

One of the important points we make in our recently published Information Entropy report is that you can’t just decide you have intellectual property, issue NDAs to employees, and leave it at that. Confidential information requires real ongoing work on the company’s part. This is especially true for a class of IP known as trade secrets. By the way, do you know where your trade secrets are at this exact moment?

Trade secrets are essentially non-public information that has economic value. A trade secret, for example, can be software, formulas, strategic business documents, and even customer lists. Think of it as information that competitors would profit from if they got their hands on it.

But trade secrets require a commitment by the company. According to US law, you need to indicate that your secret is valuable by marking it as such, for example “Internal”, “Not for outside use”, or “Confidential”. You then need to guard this information through a combination of mechanisms, including physical security, network protections, encryption, access controls, and employee training. Mileage may vary depending on your company’s particular situation.

With much trade secret information in digital form, the Varonis Data Governance Suite has an important role to play through its file classification engine, known as the IDU Classification Framework. By scanning for sensitive data contained on your file servers and SharePoint sites, it can help answer the question I posed at the beginning of this post, and ultimately bring you in alignment with a proper IP protection program.

Data Classification Fundamentals

The IDU Classification Framework provides a powerful set of pattern matching capabilities through its own special classification interface. Of course, regular expressions are supported, and matching conditions can be combined with the standard Boolean operators. Conditions can reference file types, names, Windows file metadata, as well as the actual contents of the files. Algorithmic verification is performed to mitigate false positives.
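To make the idea of algorithmic verification concrete, here is a minimal Python sketch. It is not the IDU Classification Framework's implementation; it assumes the data being hunted is payment card numbers and uses the well-known Luhn checksum to discard digit runs that merely look like card numbers. The names `CARD_CANDIDATE`, `luhn_valid`, and `find_card_numbers` are made up for this illustration.

```python
import re

# Candidate pattern: 13 to 16 digits, optionally separated by single spaces or hyphens.
CARD_CANDIDATE = re.compile(r"\b\d(?:[ -]?\d){12,15}\b")


def luhn_valid(number: str) -> bool:
    """Return True if the digits in `number` pass the Luhn checksum."""
    digits = [int(c) for c in number if c.isdigit()]
    total = 0
    # Walk from the rightmost digit, doubling every second one.
    for i, d in enumerate(reversed(digits)):
        if i % 2 == 1:
            d *= 2
            if d > 9:
                d -= 9
        total += d
    return len(digits) > 0 and total % 10 == 0


def find_card_numbers(text: str) -> list[str]:
    """Regex finds candidates; the checksum filters out false positives."""
    return [m.group() for m in CARD_CANDIDATE.finditer(text) if luhn_valid(m.group())]
```

Given the text "Card 4111 1111 1111 1111 and order 1234 5678 9012 3456", the pattern matches both digit runs, but only the first survives the checksum.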

Getting back to the problem of finding corporate IP, one could quickly develop an expression for matching against, say, the word “confidential” in files with standard text suffixes: ‘.doc’, ‘.txt’, ‘.rtf’, ‘.pdf’, ‘.xls’, etc. The engine can also scan non-text formats such as images through IFilter interfaces.
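For readers who want to see the shape of such a rule outside the product, the following Python sketch combines a suffix filter with a content match. It is only an illustration, not the classification interface itself. It assumes plain-text files only (‘.txt’ and ‘.rtf’), since formats like ‘.doc’, ‘.pdf’, and ‘.xls’ need a text-extraction step first. The function name `find_marked_files` is hypothetical.

```python
import re
from pathlib import Path

# Assumption: only suffixes that can be read directly as text.
TEXT_SUFFIXES = {".txt", ".rtf"}
CONFIDENTIAL = re.compile(r"\bconfidential\b", re.IGNORECASE)


def find_marked_files(root: str) -> list[Path]:
    """Return files under `root` with a text suffix whose contents mention 'confidential'."""
    hits = []
    for path in sorted(Path(root).rglob("*")):
        # Condition 1: file type (by suffix). Condition 2: file contents.
        if path.is_file() and path.suffix.lower() in TEXT_SUFFIXES:
            text = path.read_text(encoding="utf-8", errors="ignore")
            if CONFIDENTIAL.search(text):
                hits.append(path)
    return hits
```

Note that this sketch walks every file under the root, which is exactly the brute-force crawl that becomes slow at scale.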

But where the engine really shines is how it leverages the permissions and activity metadata collected and stored in the underlying Metadata Framework. In the above example, searching every file for the word “confidential” will almost certainly turn up far too many results to be useful. If your classification engine can instead limit the search to files that were created or modified by your CEO, executive team, or those most likely to be working with documents that contain trade secrets, your result set becomes far smaller and more usable. You’ll also get results much more quickly, because the scope of what’s being scanned is so much smaller: you don’t need to scan all those files that weren’t created or modified by your target group.
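The scoping idea can be sketched in a few lines of Python. This is an illustration under assumptions, not how the Metadata Framework is queried: `FileRecord`, `scoped_scan`, and the `read_text` callback are hypothetical names, and the activity metadata is assumed to be available as a set of user names per file.

```python
from dataclasses import dataclass


@dataclass
class FileRecord:
    path: str
    modified_by: frozenset  # users who created or modified the file (assumed metadata)


def scoped_scan(records, target_users, read_text, keyword="confidential"):
    """Scan contents only for files touched by someone in the target group."""
    targets = set(target_users)
    hits = []
    for rec in records:
        # Cheap metadata check first: skip files the target group never touched.
        if targets.isdisjoint(rec.modified_by):
            continue
        # Expensive content read happens only for in-scope files.
        if keyword in read_text(rec.path).lower():
            hits.append(rec.path)
    return hits
```

The design point is the ordering: the metadata test runs before any file is opened, so out-of-scope files cost almost nothing.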

Speed, Scale, and Flexibility

Making full use of the complete audit trail that DatAdvantage users are familiar with, the IDU Classification Framework does true incremental scanning. This means that it doesn’t have to check every single file’s modification date to see if it changed and requires rescanning. Instead, it works from a known list of changed objects provided by DatAdvantage’s powerful auditing functionality. This is a far better approach than a standard (but slow) crawl of the entire file system every single day.
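As a rough illustration of incremental scanning, the Python sketch below updates a set of classification results from a list of change events instead of crawling every file. The event format (a `(path, action)` tuple with actions "create", "modify", and "delete") and the names `incremental_scan` and `classify` are assumptions made for this example, not DatAdvantage's audit format.

```python
def incremental_scan(results, events, classify):
    """Update `results` (path -> label) using only the paths named in `events`."""
    # Collapse the event stream so each path is handled once; the last event wins.
    pending = {}
    for path, action in events:
        pending[path] = action

    for path, action in pending.items():
        if action == "delete":
            results.pop(path, None)  # file is gone, drop its classification
        else:
            results[path] = classify(path)  # created or modified: rescan
    return results
```

Files that never appear in the event list keep their previous classification and are never opened.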

For environments that already have a scanning engine in place, the IDU Classification Framework can absorb classification information as a feed from another product. The UI will display that classification information in a seamless manner, as if the IDU Classification Framework performed the scanning itself.

How to Find the Secrets

Assuming you’ve marked files—perhaps based on input from the legal department—with appropriate header or footer confidentiality warnings, it would be easy, as I mentioned above, to craft a series of IDU Classification expressions to spot all the documents containing your most critical IP.

Employees, though, are very good at cutting and pasting information from existing documents and creating their own files. So how would you spot confidential information without the help of the boilerplate warning?

The solution to this very common scenario requires a little more thought.

Some cases are obvious: any reference to, for example, a code-named project that only a few executives know about (say, SkyFall MX) could be tracked with regular expressions.
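A regular expression for a code name usually needs to tolerate the ways people actually type it. The Python sketch below is one possible pattern for the SkyFall MX example; the choice of tolerated variants (any letter case, with an optional space, hyphen, or underscore between the parts) is an assumption for illustration.

```python
import re

# Matches "SkyFall MX", "skyfall-mx", "Sky Fall MX", "SKYFALLMX", and similar variants.
CODE_NAME = re.compile(r"\bsky[\s_-]?fall[\s_-]?mx\b", re.IGNORECASE)


def mentions_code_name(text: str) -> bool:
    """Return True if the text refers to the code-named project."""
    return CODE_NAME.search(text) is not None
```

The word boundaries keep the pattern from firing on longer tokens that merely contain the same letters.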

There are other situations that are more difficult to handle.

Customer and other contact info can be a trade secret if the relationship to the company is not common knowledge. This data may involve many names, addresses, emails, and phone numbers, all of which you’d want to keep secret.

The IDU Classification Framework provides a solution for handling large numbers of keywords through its dictionary option. Using the Classification interface, one can upload a ‘.csv’ file containing all this secret contact data. The Classification software would then match file contents against the entire contact corpus.
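To show what dictionary matching involves, here is a small Python sketch that loads every cell of a CSV file as a search term and reports which terms appear in a piece of text. It is an illustration only: the function names are hypothetical, the CSV is assumed to have no header row, and a single compiled alternation is used for simplicity (a very large dictionary would likely call for a dedicated multi-pattern algorithm).

```python
import csv
import re


def load_dictionary(csv_file) -> set[str]:
    """Collect every non-empty cell of the CSV as a lowercase search term."""
    terms = set()
    for row in csv.reader(csv_file):
        for cell in row:
            cell = cell.strip().lower()
            if cell:
                terms.add(cell)
    return terms


def build_matcher(terms):
    """Compile all terms into one case-insensitive pattern."""
    if not terms:
        raise ValueError("dictionary is empty")
    # Longest terms first, so a longer term wins over a shorter one it starts with.
    ordered = sorted(terms, key=len, reverse=True)
    alternation = "|".join(re.escape(t) for t in ordered)
    # Lookarounds stop a term from matching inside a longer word.
    return re.compile(r"(?<!\w)(?:" + alternation + r")(?!\w)", re.IGNORECASE)


def match_dictionary(text: str, matcher) -> set[str]:
    """Return the set of dictionary terms found in `text`."""
    return {m.group().lower() for m in matcher.finditer(text)}
```

A caller would open the ‘.csv’ file, pass it to `load_dictionary`, build the matcher once, and then reuse it for every file scanned.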

Lastly, many IT security teams won’t necessarily know what they should be scanning for. Some regulations, like HIPAA and PCI, require the protection of a wide array of data types. Version 5.9 of the Varonis Data Governance Suite provides a way to pick which regulations you need to comply with, and it will automatically enable the requisite classification rules for you, eliminating guesswork and cutting down deployment times.

Next Steps

With the results of the included classifications and priorities report, your IT security group would have a list of all the file locations where the confidential data resides, prioritized by risk and exposure. Ideally, this information will be found in folders with appropriate permissions—only C-level executives, or the legal department, or the CFO’s office.

Very likely some of the IP has accidentally jumped over the wall and has made its way into areas of your file server with less restrictive permissions. The root causes here can be entirely benign: executives not aware they have created or copied confidential information.

Through its comprehensive audit logging functions, DatAdvantage even provides a way to trace how the data leaked. DatAdvantage can tell IT admins who created the file, when it was created, and then also offer clues as to the source of the confidential information. For example, the audit log would show all the files the user was looking at just prior to creating the leaky file. The audit log can also help identify the owner of the data, so you can work with the person who knows the data best.
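The tracing logic described above can be approximated in a few lines of Python. This is a sketch under assumptions, not DatAdvantage's audit schema: each event is assumed to be a dict with "time", "user", "action", and "path" keys, and the 30-minute lookback window and the name `likely_sources` are arbitrary choices for illustration.

```python
from datetime import datetime, timedelta


def likely_sources(events, leaked_path, window=timedelta(minutes=30)):
    """Return (creator, files the creator opened shortly before creating the leaked file)."""
    # Find the event that created the leaked file.
    created = next(
        (e for e in events if e["path"] == leaked_path and e["action"] == "create"),
        None,
    )
    if created is None:
        return None, []

    # Collect what the same user opened in the window just before the creation.
    start = created["time"] - window
    sources = [
        e["path"]
        for e in events
        if e["user"] == created["user"]
        and e["action"] == "open"
        and start <= e["time"] < created["time"]
    ]
    return created["user"], sources
```

The returned paths are only clues for a human reviewer; opening a file shortly before creating another does not prove content was copied.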

Most of these IP leaks can be addressed through better employee training, more restrictive permissions, and perhaps encryption as a last resort. The larger point is that document classification should have an important part in your IP protection plans and strategies.

The IDU Classification Framework doesn’t take the traditional DLP route (i.e., “Here’s 10,000 sensitive files, good luck!”). Instead, it provides context around the data it classifies, telling you who is using it, who it belongs to, and who should and shouldn’t have access, and it can alert on abuse. The Data Governance Suite will not only show you what corrective actions to take, it will allow you to simulate and execute them directly from the DatAdvantage interface.

Welcome to the next generation of data classification and loss prevention.

Michael Buckbee Michael has worked as a sysadmin and software developer for Silicon Valley startups, the US Navy, and everything in between.