Machine Learning and Generative AI Use in Varonis

Responsible AI Guiding Principles

AI is a broad term that includes machine learning, large and small language models (LLMs and SLMs), and statistical algorithms.

Varonis has adopted a set of responsible AI guiding principles that govern all of its AI-based technology.

1. Customer Data Is Not Used for Training AI Models: Varonis does not train models on customer data. Instead, it trains on data sets that are not based on customer data, or on anonymized data and derived indicators.

2. Transparency: Varonis documents which AI models are used for which features and how they are used. This transparency is provided both through the information Varonis gives its customers and through the product itself.

3. Data Residency: As a principle, Varonis ensures that customer data sent to AI is handled in line with the customer’s chosen data locality, with some exceptions. See details in the privacy whitepaper and in the Varonis Data Processing Addendum.

4. Opt-out: The use of GenAI is tied to specific features and products. Customers that do not wish to use GenAI-based features can disable the functionality (for GenAI features used by the customer) or avoid licensing specific services or platforms (for GenAI used by Varonis).

5. Explainability: When technically feasible, Varonis provides an explanation of each model decision as part of the model's evaluation of the data.

6. Model Development Lifecycle: Varonis uses a structured model development lifecycle, which includes:

a. Tracking which data was used to train each model version
b. Versioning the models
c. Testing models for precision
d. Penetration testing of models
e. Testing models for fairness and safety
f. Auditing all model evaluations
g. Monitoring model performance in production to detect drift in precision
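
To make the last lifecycle step more concrete, here is a minimal Python sketch of precision monitoring over a sliding window of reviewed predictions. It is an illustration only, not Varonis's implementation: the class name `PrecisionMonitor`, the `baseline`, `tolerance`, and `window` parameters, and the idea that each positive prediction is later confirmed or rejected by a reviewer are all assumptions made for the example.

```python
from collections import deque
from typing import Optional


class PrecisionMonitor:
    """Hypothetical helper: tracks precision over the most recent reviewed alerts."""

    def __init__(self, baseline: float, tolerance: float, window: int) -> None:
        self.baseline = baseline    # precision measured at release time (assumed known)
        self.tolerance = tolerance  # allowed drop before drift is reported
        # Each entry is True when a positive prediction was confirmed as correct.
        self.outcomes = deque(maxlen=window)

    def record(self, confirmed: bool) -> None:
        """Store the review outcome of one positive prediction."""
        self.outcomes.append(confirmed)

    def precision(self) -> Optional[float]:
        """Share of confirmed predictions in the window, or None if it is empty."""
        if not self.outcomes:
            return None
        return sum(self.outcomes) / len(self.outcomes)

    def has_drifted(self) -> bool:
        """True when windowed precision fell below baseline by more than tolerance."""
        current = self.precision()
        return current is not None and self.baseline - current > self.tolerance


monitor = PrecisionMonitor(baseline=0.9, tolerance=0.05, window=4)
for confirmed in [True, True, False, False]:
    monitor.record(confirmed)
print(monitor.precision(), monitor.has_drifted())
```

A bounded window keeps the check focused on recent behavior, so an older run of correct predictions cannot mask a recent decline.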

Varonis Features that Utilize AI

Supervised Machine Learning for Threat Detection

Varonis employs machine learning, a subset of artificial intelligence (AI), to enhance cybersecurity in various domains. Machine learning uses algorithms and statistical techniques to enable computer systems to learn from data and make predictions or decisions without explicit programming.

1. Generating Security Alerts Based on Abnormal User Behavior: One crucial application of machine learning in cybersecurity is detecting security threats by analyzing user behavior. Varonis uses machine learning algorithms to monitor and analyze user or service account activities within an organization's network. By learning what constitutes "normal" behavior for users and service accounts, the system can identify deviations from these patterns, flagging them as potential security threats. These deviations can include unusual file access, login times, or data transfers, helping organizations respond promptly to potential breaches.

2. Learning Peer Associations: Another important aspect of threat detection is understanding the relationships between users and their peers within the network. Varonis' machine learning models can learn and identify which users typically interact or share data with each other. This information helps in detecting suspicious activities, such as unauthorized data sharing or privilege escalation.

3. Learning Normal Working Hours: Machine learning is used to establish typical working hours for users within an organization. By analyzing historical data, the system can identify when users typically access resources, and any deviations from these patterns can trigger alerts. For example, if a user logs in during an unusual time, it could indicate a security incident.

4. Personal Devices Identification: Varonis' machine learning algorithms can also learn and keep track of the devices being used by each user. This information is crucial for detecting unauthorized access or compromised devices. The system can raise a security alert if a user suddenly logs in from an unfamiliar device.
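
As a rough illustration of items 3 and 4 above, the following Python sketch learns each user's usual login hours and devices from historical events and then checks a new login against that baseline. It is a simplified frequency count, not the model Varonis uses; the class name `UserBaseline`, the `min_hour_share` threshold, and the finding strings are hypothetical.

```python
from collections import Counter, defaultdict
from datetime import datetime


class UserBaseline:
    """Hypothetical per-user baseline of login hours and known devices."""

    def __init__(self, min_hour_share: float = 0.05) -> None:
        # An hour counts as "normal" if it holds at least this share of past logins.
        self.min_hour_share = min_hour_share
        self.hour_counts = defaultdict(Counter)  # user -> Counter of hour-of-day
        self.devices = defaultdict(set)          # user -> set of device identifiers

    def learn(self, user: str, when: datetime, device: str) -> None:
        """Add one historical login to the user's baseline."""
        self.hour_counts[user][when.hour] += 1
        self.devices[user].add(device)

    def check(self, user: str, when: datetime, device: str) -> list:
        """Return a list of findings for a new login; empty means nothing unusual."""
        counts = self.hour_counts.get(user)
        if not counts:
            return ["no baseline for user"]
        findings = []
        total = sum(counts.values())
        if counts[when.hour] / total < self.min_hour_share:
            findings.append("unusual hour")
        if device not in self.devices[user]:
            findings.append("unfamiliar device")
        return findings


baseline = UserBaseline()
for day in range(1, 21):
    baseline.learn("alice", datetime(2024, 1, day, 9, 30), "laptop-1")
print(baseline.check("alice", datetime(2024, 2, 1, 3, 0), "phone-9"))
```

A production system would typically also weight recent activity more heavily and account for time zones, weekends, and shared devices, none of which this sketch attempts.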

Unsupervised Machine Learning for Threat Detection

Varonis also uses unsupervised machine learning, which works with unlabeled data. The machine learning (ML) algorithms are not given explicit output labels. Instead, they discover patterns, structures, or relationships within the data.

Unsupervised machine learning plays a crucial role in cybersecurity threat detection, and it is the preferred option due to its ability to address certain challenges unique to the cybersecurity domain.

Some examples are:

1. Anomaly Detection: Unsupervised machine learning is particularly well-suited for anomaly detection in cybersecurity. Anomalies in network traffic, system behavior, or user activities can indicate security threats such as intrusions or breaches. Since these anomalies often represent previously unknown attack patterns, using supervised learning with pre-labeled data is impractical. Unsupervised learning algorithms can identify deviations from established patterns without knowing what constitutes a threat.

2. Data Exploration and Discovery: Cybersecurity datasets are vast and diverse, containing a wide range of data types, including logs, network traffic, system configurations, and user behavior. Unsupervised learning techniques, like clustering and dimensionality reduction, help to understand these complex datasets. They can reveal patterns, group similar events together, and reduce the data's dimensionality to focus on the most relevant features.

3. Zero-Day Attacks: Zero-day attacks exploit previously unknown vulnerabilities. Since there are no predefined attack signatures or labels for these threats, unsupervised learning is crucial. Anomaly detection algorithms can identify unexpected and potentially malicious activities, providing an early warning system against emerging threats.

4. Insider Threat Detection: Detecting insider threats, where legitimate users abuse their access, often requires unsupervised learning. These threats are challenging to identify using supervised learning because malicious insiders can act within the bounds of their legitimate permissions. Unsupervised techniques can help by modeling normal user behavior and flagging deviations that might indicate insider threats.

5. Continuous Learning: The threat landscape is constantly evolving, and new attack techniques emerge regularly. Unsupervised learning models can adapt to changing conditions and identify novel threats as they arise, making them suitable for continuous monitoring and detection.
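
The anomaly detection idea in the list above can be sketched without any labels. The Python example below flags outliers using the modified z-score (median and median absolute deviation), a common robust statistic. It is a generic stand-in chosen for brevity, not the algorithm Varonis uses; the function name `robust_anomalies` and the sample data are hypothetical, and 3.5 is the conventional cutoff for this score.

```python
from statistics import median


def robust_anomalies(values: list, threshold: float = 3.5) -> list:
    """Return the indexes of values whose modified z-score exceeds the threshold."""
    if not values:
        return []
    center = median(values)
    # Median absolute deviation: a spread estimate that outliers barely affect.
    mad = median([abs(v - center) for v in values])
    if mad == 0:
        # No spread to compare against; flag anything that differs from the median.
        return [i for i, v in enumerate(values) if v != center]
    # 0.6745 rescales MAD so the score is comparable to a standard z-score
    # when the data is roughly normally distributed.
    return [
        i for i, v in enumerate(values)
        if 0.6745 * abs(v - center) / mad > threshold
    ]


# Hypothetical daily file-access counts for one account.
daily_file_reads = [10, 12, 11, 9, 10, 11, 250, 10]
print(robust_anomalies(daily_file_reads))
```

The median and MAD are used instead of the mean and standard deviation because a single extreme value would otherwise inflate the spread estimate and hide itself.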

Which data is used for threat detection based on pre-trained models?

The main categories of inputs are elaborated below:

1. Monitored Events:

  • File Servers: Events raised by file servers provide insights into file access, modification, and sharing activities. This data is vital for detecting suspicious or unauthorized file access, insider threats, and data exfiltration attempts.
  • Network Traffic: Events from network devices such as DNS, firewalls, proxies, and VPNs offer information about network communication patterns. Analyzing these events can reveal potential anomalies, intrusions, or malicious traffic attempting to breach the network.
  • SaaS Services: Events from SaaS services like Microsoft 365, Zoom, Okta, and others are essential for monitoring user activities within cloud-based applications. Detecting unusual activities, login attempts, or data access patterns in these services helps identify potential cyber threats in the cloud environment.
  • Mail Servers: Events from mail servers like Exchange and Exchange Online provide insights into email communications. Analyzing these events can help identify email-based threats such as phishing, malware attachments, or suspicious email forwarding.

2. Entity Data:

  • User Information: User data, such as login activity, access privileges, and historical behavior, is used to create user profiles. Machine learning models can detect anomalies in user behavior, identifying potentially compromised accounts or insider threats.
  • Device Information: Information about servers and endpoints, including their configurations and patch levels, helps assess vulnerabilities and potential attack surfaces. Anomalies in device behavior or configurations can indicate security risks.
  • IP/Domain/URL Information: Monitoring IP addresses, domains, and URLs is essential for detecting malicious network activity, including connections to known malicious hosts, suspicious domain registrations, or attempts to access malicious websites.
  • Threat Intelligence: Incorporating threat intelligence feeds, which provide real-time information on known threats and vulnerabilities, helps cybersecurity systems stay updated and respond promptly to emerging threats. Machine learning models can leverage this intelligence to identify and mitigate threats based on known attack patterns.

Athena AI – Generative AI Assisted Alert Investigation

Athena AI helps SOC teams use Varonis systems through natural language queries, both to understand the posture of their data and to investigate security alerts. To improve accuracy, when Athena AI is asked a question, Varonis determines the organizational context by retrieving the customer metadata that is relevant to the query: information about accounts, roles, resources, permissions, classification, and known alerts. The question sent to the AI is then augmented with this retrieved “organizational context,” so that Athena AI is aware of the specific customer environment and can provide relevant, precise, and tailored answers.
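
The retrieve-then-augment flow described above can be sketched in a few lines of Python. This is an assumption-laden illustration, not Athena AI's implementation: the `MetadataRecord` type, the keyword-overlap scoring, the `top_k` parameter, and the prompt layout are all hypothetical, and no model is actually called.

```python
import re
from dataclasses import dataclass


@dataclass
class MetadataRecord:
    kind: str  # hypothetical category, e.g. "account", "resource", "alert"
    text: str  # short description of the metadata item


def tokenize(text: str) -> set:
    """Lowercase the text and split it into a set of alphanumeric tokens."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def retrieve(question: str, records: list, top_k: int = 3) -> list:
    """Return up to top_k records sharing the most tokens with the question."""
    question_tokens = tokenize(question)
    scored = [(len(question_tokens & tokenize(r.text)), r) for r in records]
    relevant = [pair for pair in scored if pair[0] > 0]
    # The sort is stable, so records with equal scores keep their original order.
    relevant.sort(key=lambda pair: pair[0], reverse=True)
    return [record for _, record in relevant[:top_k]]


def build_prompt(question: str, records: list) -> str:
    """Prepend the retrieved organizational context to the user's question."""
    context_lines = [f"- [{r.kind}] {r.text}" for r in retrieve(question, records)]
    return (
        "Organizational context:\n"
        + "\n".join(context_lines)
        + f"\n\nQuestion: {question}"
    )


records = [
    MetadataRecord("account", "jsmith is a member of the Finance group"),
    MetadataRecord("resource", "Payroll share contains files classified as PII"),
    MetadataRecord("alert", "Unusual access to Payroll share by jsmith"),
    MetadataRecord("resource", "Marketing wiki is public"),
]
print(build_prompt("Why did jsmith trigger an alert on the Payroll share?", records))
```

Real retrieval would more likely rely on structured queries or embeddings than on token overlap; the point of the sketch is only that the model receives the question together with a small, relevant slice of metadata rather than the whole environment.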

Managed Data Detection and Response

Customers that purchase an MDDR license from Varonis rely on the Varonis security organization to monitor their data and investigate alerts.

As part of alert investigations, MDDR specialists use Athena AI to investigate alerts more quickly, verify the correctness of Athena AI's recommendations, and implement remediation.

The use of Athena AI cannot be turned off for MDDR. Customers that do not wish to use GenAI should avoid purchasing an MDDR license from Varonis.

AI Based Data Classification

Model Training

Varonis creates classification models based on public and licensed data sets and never trains models on customer data. Both the models and the data sets are versioned to ensure that each model is traceable.

Classification

Analyzing the content of files is crucial for assessing their risk and sensitivity. Machine learning and Generative AI models can classify files based on content, identifying documents that contain sensitive information, personally identifiable information (PII), or that violate data policies like GDPR, HIPAA, PCI DSS, or CCPA. This classification enables proactive data protection and compliance enforcement.

Explainability – File Analysis

Varonis uses a combination of traditional machine learning and Large Language Model queries to classify data contents. As part of this process, Varonis sends samples of the data to OpenAI LLMs. The samples are not stored in either Varonis SaaS or OpenAI.

Each classification has explainability features, provided via File Analysis, which explain why the models reached a specific classification decision.
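
To illustrate what it means for a classification to carry its own explanation, here is a small Python sketch in which every decision is returned together with the evidence that led to it. It uses simple regular expressions as a stand-in for the machine learning and LLM models described above and does not reflect how File Analysis works internally; the pattern set, the `Classification` type, and the labels are hypothetical.

```python
import re
from dataclasses import dataclass, field

# Hypothetical, deliberately simple patterns used only for this illustration.
PATTERNS = {
    "email address": re.compile(r"\b[\w.+-]+@[\w-]+\.[\w.-]+\b"),
    "US SSN-like number": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
}


@dataclass
class Classification:
    label: str
    evidence: list = field(default_factory=list)  # human-readable reasons


def classify(text: str) -> Classification:
    """Label the text and record which pattern matched and where."""
    evidence = []
    for name, pattern in PATTERNS.items():
        for match in pattern.finditer(text):
            # Report the offset rather than the matched value, so the
            # explanation does not repeat the sensitive data itself.
            evidence.append(f"{name} at offset {match.start()}")
    label = "sensitive" if evidence else "not sensitive"
    return Classification(label, evidence)


result = classify("Contact: jane@example.com, SSN 123-45-6789")
print(result.label)
for reason in result.evidence:
    print(" ", reason)
```

Keeping the evidence alongside the label lets a reviewer verify or dispute a decision without rerunning the classifier.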

Which Generative AI Models Does Athena AI Use?

Varonis does not currently train custom Generative AI (LLM) models and relies on Azure OpenAI.

For Large Language Models, Varonis is currently using OpenAI gpt4, gpt4o, gpt4o_mini and gpt4o1.

FAQs

Varonis uses unsupervised machine learning, which deals with unlabeled data. Unsupervised machine learning plays a crucial role in cybersecurity threat detection and is the preferred option due to its ability to address certain challenges unique to the domain.

Have questions? Contact us.

Report a vulnerability
https://hackerone.com/varonis

Report security issue
soc@varonis.com

 
