Machine Learning and Generative AI Use in Varonis

Using machine learning for cybersecurity in Varonis

Varonis employs machine learning, a subset of artificial intelligence (AI), to enhance cybersecurity in various domains. Machine learning involves using algorithms and statistical techniques to enable computer systems to learn from data and make predictions or decisions without explicit programming.

Examples of machine learning use at Varonis:

  1. Generating Security Alerts Based on Abnormal User Behavior: One crucial application of machine learning in cybersecurity is detecting security threats by analyzing user behavior. Varonis uses machine learning algorithms to monitor and analyze user or service account activities within an organization's network. By learning what constitutes "normal" behavior for users and service accounts, the system can identify deviations from these patterns, flagging them as potential security threats. These deviations can include unusual file access, login times, or data transfers, helping organizations respond promptly to potential breaches.
  2. Learning Peers’ Association: Another important aspect of threat detection is understanding the relationships between users and their peers within the network. Varonis' machine learning models can learn and identify which users typically interact or share data with each other. This information helps in detecting suspicious activities, such as unauthorized data sharing or privilege escalation.
  3. Learning Normal Working Hours: Machine learning is used to establish typical working hours for users within an organization. By analyzing historical data, the system can identify when users typically access resources, and any deviations from these patterns can trigger alerts. For example, if a user logs in during an unusual time, it could indicate a security incident.
  4. Personal Devices Identification: Varonis' machine learning algorithms can also learn and keep track of the devices being used by each user. This information is crucial for detecting unauthorized access or compromised devices. The system can raise a security alert if a user suddenly logs in from an unfamiliar device.
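The working-hours and device examples above can be illustrated with a minimal sketch. It is not Varonis' implementation: the class name `UserBaseline`, the `min_hour_share` threshold, and the sample users and devices are all hypothetical, and a simple frequency count stands in for whatever models the product actually uses.

```python
from collections import Counter, defaultdict
from datetime import datetime


class UserBaseline:
    """Learns typical login hours and known devices per user from history."""

    def __init__(self, min_hour_share=0.05):
        # Assumption: an hour is "normal" if it holds at least this share
        # of the user's past logins.
        self.min_hour_share = min_hour_share
        self.hour_counts = defaultdict(Counter)
        self.devices = defaultdict(set)

    def learn(self, user, timestamp, device):
        """Record one historical login."""
        self.hour_counts[user][timestamp.hour] += 1
        self.devices[user].add(device)

    def check(self, user, timestamp, device):
        """Return the reasons this login deviates from the learned baseline."""
        counts = self.hour_counts[user]
        total = sum(counts.values())
        if total == 0:
            return ["no history for user"]
        reasons = []
        if counts[timestamp.hour] / total < self.min_hour_share:
            reasons.append("unusual hour")
        if device not in self.devices[user]:
            reasons.append("unfamiliar device")
        return reasons


baseline = UserBaseline()
# Hypothetical history: daytime logins from a single laptop.
for day in range(1, 21):
    for hour in (9, 11, 14, 17):
        baseline.learn("alice", datetime(2024, 5, day, hour), "laptop-01")

normal = baseline.check("alice", datetime(2024, 5, 22, 11), "laptop-01")
odd = baseline.check("alice", datetime(2024, 5, 22, 3), "tablet-77")
print(normal)  # []
print(odd)     # ['unusual hour', 'unfamiliar device']
```

The baseline is built only from observed activity, with no labels describing what an attack looks like, which is the property the list above relies on.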

Application of learned properties for security alert generation

The learned properties mentioned above serve as the foundation for generating security alerts. Varonis' machine learning models continuously analyze user behavior, device usage, and peer associations. When any anomaly or suspicious activity is detected, the system triggers security alerts, which can prompt further investigation and response from cybersecurity professionals.
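As one illustration of how a learned property could feed an alert, the sketch below derives peer associations from shared folder access and then checks whether a new access falls outside what a user and their peers normally touch. The function names, the Jaccard similarity measure, the `min_similarity` threshold, and the sample data are assumptions made for this example, not a description of the product's models.

```python
from itertools import combinations


def jaccard(a, b):
    """Overlap between two sets: 0.0 (disjoint) to 1.0 (identical)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def learn_peers(access_history, min_similarity=0.5):
    """Map each user to the users whose accessed folders overlap strongly."""
    peers = {user: set() for user in access_history}
    for u, v in combinations(access_history, 2):
        if jaccard(access_history[u], access_history[v]) >= min_similarity:
            peers[u].add(v)
            peers[v].add(u)
    return peers


def is_outside_peer_scope(user, folder, access_history, peers):
    """True if neither the user nor any learned peer has used this folder."""
    known = set(access_history[user])
    for peer in peers[user]:
        known |= access_history[peer]
    return folder not in known


# Hypothetical history: folders each user has accessed.
history = {
    "alice": {"/finance/q1", "/finance/q2", "/finance/budget"},
    "bob": {"/finance/q1", "/finance/q2", "/finance/payroll"},
    "carol": {"/eng/specs", "/eng/builds"},
}
peers = learn_peers(history)
print(peers["alice"])  # {'bob'}
print(is_outside_peer_scope("alice", "/finance/payroll", history, peers))  # False
print(is_outside_peer_scope("alice", "/eng/specs", history, peers))        # True
```

In this sketch an access that a peer already makes is treated as expected, while an access that no peer makes becomes a candidate for further investigation.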

In summary, Varonis uses machine learning to bolster its cybersecurity efforts by continuously monitoring and analyzing various aspects of user behavior and system activity. By identifying anomalies and deviations from established norms, Varonis helps organizations proactively detect and respond to potential security threats, ultimately enhancing their overall cybersecurity posture.

What is the model type (supervised, unsupervised, reinforcement learning)?

We use unsupervised machine learning, which deals with unlabeled data. The machine learning (ML) algorithms are not provided with explicit output labels. Instead, they discover patterns, structures, or relationships within the data.

Unsupervised machine learning plays a crucial role in cybersecurity threat detection, and it is the preferred option due to its ability to address certain challenges unique to the cybersecurity domain.

Some examples are:

  1. Anomaly Detection: Unsupervised machine learning is particularly well-suited for anomaly detection in cybersecurity. Anomalies in network traffic, system behavior, or user activities can indicate security threats such as intrusions or breaches. Since these anomalies often represent previously unknown attack patterns, using supervised learning with pre-labeled data is impractical. Unsupervised learning algorithms can identify deviations from established patterns without knowing what constitutes a threat.
  2. Data Exploration and Discovery: Cybersecurity datasets are vast and diverse, containing a wide range of data types, including logs, network traffic, system configurations, and user behavior. Unsupervised learning techniques, like clustering and dimensionality reduction, help to understand these complex datasets. They can reveal patterns, group similar events together, and reduce the data's dimensionality to focus on the most relevant features.
  3. Zero-Day Attacks: Zero-day attacks exploit previously unknown vulnerabilities. Since there are no predefined attack signatures or labels for these threats, unsupervised learning is crucial. Anomaly detection algorithms can identify unexpected and potentially malicious activities, providing an early warning system against emerging threats.
  4. Insider Threat Detection: Detecting insider threats, where legitimate users abuse their access, often requires unsupervised learning. These threats are challenging to identify using supervised learning because malicious insiders can act within the bounds of their legitimate permissions. Unsupervised techniques can help by modeling normal user behavior and flagging deviations that might indicate insider threats.
  5. Continuous Learning: The threat landscape is constantly evolving, and new attack techniques emerge regularly. Unsupervised learning models can adapt to changing conditions and identify novel threats as they arise, making them suitable for continuous monitoring and detection.
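To make the idea of label-free anomaly detection concrete, here is a minimal sketch that flags outliers in a series of activity counts using a robust z-score (median and median absolute deviation). This statistical technique is chosen for illustration only; the source does not state which algorithms Varonis uses, and the threshold of 3.5 and the sample counts are assumptions.

```python
from statistics import median


def robust_z_scores(values):
    """Score each value by its distance from the median, scaled by the MAD."""
    med = median(values)
    mad = median(abs(v - med) for v in values)
    if mad == 0:
        # All typical values are identical, so any different value stands out.
        return [0.0 if v == med else float("inf") for v in values]
    # 0.6745 makes the score comparable to a standard z-score for normal data.
    return [0.6745 * (v - med) / mad for v in values]


def find_anomalies(values, threshold=3.5):
    """Return the indices whose absolute robust z-score exceeds the threshold."""
    scores = robust_z_scores(values)
    return [i for i, z in enumerate(scores) if abs(z) > threshold]


# Hypothetical daily counts of files opened by one account.
daily_file_opens = [42, 38, 45, 40, 39, 44, 41, 43, 37, 950, 40, 42]
anomalous_days = find_anomalies(daily_file_opens)
print(anomalous_days)  # [9]
```

No example of an attack is supplied to the function. The day with 950 file opens is flagged purely because it deviates from the account's own history.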

What are the inputs to the models?

The main categories of inputs are elaborated below:

1. Monitored Events:

  • File Servers: Events raised by file servers provide insights into file access, modification, and sharing activities. This data is vital for detecting suspicious or unauthorized file access, insider threats, and data exfiltration attempts.
  • Network Traffic: Events from network devices such as DNS, firewalls, proxies, and VPNs offer information about network communication patterns. Analyzing these events can reveal potential anomalies, intrusions, or malicious traffic attempting to breach the network.
  • SaaS Services: Events from SaaS services like Microsoft 365, Zoom, Okta, and others are essential for monitoring user activities within cloud-based applications. Detecting unusual activities, login attempts, or data access patterns in these services helps identify potential cyber threats in the cloud environment.
  • Mail Servers: Events from mail servers like Exchange and Exchange Online provide insights into email communications. Analyzing these events can help identify email-based threats such as phishing, malware attachments, or suspicious email forwarding.

2. File Content:

  • Content Analysis: Analyzing the content of files is crucial for assessing their risk and sensitivity. Machine learning models can classify files based on content, identifying documents that contain sensitive information, personally identifiable information (PII), or that violate data policies like GDPR, HIPAA, PCI DSS, or CCPA. This classification enables proactive data protection and compliance enforcement.
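The sketch below shows the general shape of content classification: scan text and attach sensitivity labels. It is deliberately a rule-based stand-in using regular expressions and a Luhn checksum, not a machine learning model, and the patterns, label names, and sample strings are hypothetical. Real policies for GDPR, HIPAA, PCI DSS, or CCPA cover far more than these three patterns.

```python
import re

# Illustrative patterns only (assumptions for this example).
EMAIL_RE = re.compile(r"\b[\w.+-]+@[\w-]+\.[\w.-]+\b")
SSN_RE = re.compile(r"\b\d{3}-\d{2}-\d{4}\b")
CARD_RE = re.compile(r"\b(?:\d[ -]?){13,19}\b")


def luhn_valid(number):
    """Check a candidate card number with the Luhn checksum."""
    digits = [int(c) for c in number if c.isdigit()]
    total = 0
    for i, d in enumerate(reversed(digits)):
        if i % 2 == 1:  # double every second digit from the right
            d *= 2
            if d > 9:
                d -= 9
        total += d
    return total % 10 == 0


def classify(text):
    """Return the set of sensitivity labels that apply to the text."""
    labels = set()
    if EMAIL_RE.search(text) or SSN_RE.search(text):
        labels.add("PII")
    if any(luhn_valid(m.group()) for m in CARD_RE.finditer(text)):
        labels.add("PCI")
    return labels


print(classify("Contact jane.doe@example.com, SSN 123-45-6789."))  # {'PII'}
print(classify("Card on file: 4111 1111 1111 1111"))               # {'PCI'}
print(classify("Quarterly roadmap notes"))                         # set()
```

The checksum step shows why pattern matching alone is not enough: a run of 16 digits is only labeled as card data if it also passes validation, which reduces false positives.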

3. Entity Data:

  • User Information: User data, such as login activity, access privileges, and historical behavior, is used to create user profiles. Machine learning models can detect anomalies in user behavior, identifying potentially compromised accounts or insider threats.
  • Device Information: Information about servers and endpoints, including their configurations and patch levels, helps assess vulnerabilities and potential attack surfaces. Anomalies in device behavior or configurations can indicate security risks.
  • IP/Domain/URL Information: Monitoring IP addresses, domains, and URLs is essential for detecting malicious network activity, including connections to known malicious hosts, suspicious domain registrations, or attempts to access malicious websites.
  • Threat Intelligence: Incorporating threat intelligence feeds, which provide real-time information on known threats and vulnerabilities, helps cybersecurity systems stay updated and respond promptly to emerging threats. Machine learning models can leverage this intelligence to identify and mitigate threats based on known attack patterns. 
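As a small illustration of how entity data and threat intelligence can be combined, the sketch below checks whether a hostname seen in a network event belongs to a listed malicious domain, including its subdomains. The function name, the feed contents, and the `.example` domains are hypothetical; a production feed would be far larger and refreshed continuously.

```python
def matches_threat_intel(hostname, bad_domains):
    """True if hostname equals a listed domain or is a subdomain of one."""
    labels = hostname.lower().rstrip(".").split(".")
    # Check every suffix: a.b.bad.example, b.bad.example, bad.example, example.
    return any(".".join(labels[i:]) in bad_domains for i in range(len(labels)))


# Hypothetical threat-intelligence feed.
feed = {"malware.example", "phish.example"}

print(matches_threat_intel("cdn.malware.example", feed))  # True
print(matches_threat_intel("Malware.Example.", feed))     # True
print(matches_threat_intel("notmalware.example", feed))   # False
```

Matching on whole domain labels rather than on substrings avoids flagging unrelated hosts whose names merely contain a listed string.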

Frequently Asked Questions

Are the model outputs predicted via a quantitative method, technique, or theory?

We use ML algorithms with quantitative methods. Quantitative methods involve using numerical data and mathematical techniques to arrive at conclusions or predictions. Techniques could include various algorithms and statistical methods, while theories might involve established principles or frameworks used to guide the modeling process. 

Is the model performance data available (including performance metrics, thresholds, performance test results, etc.)?

Model performance data is available internally and is continuously analyzed by the Varonis Security Research department. Based on this analysis, the team creates new threat models and prepares model fixes, updates, and tuning, which are periodically delivered to customers. This ensures that alerts are high fidelity.

Are there any model development documents, such as whitepapers and/or technical documentation, for these models available to share with customers?

No, those documents are internal because they contain intellectual property.

Does Varonis use machine learning models trained on customer data for other customers?

Cybersecurity is a dynamic field due to the continuous evolution of cyber threats. Hackers are always coming up with new techniques and strategies to bypass security measures. Hence, the attack patterns change constantly. This means that the type of cyberattack one customer experiences could be very different from what another customer encounters. Even large customers or organizations, with their vast and complex networks, may not face every type of cyberattack.

To build robust defense systems against this wide array of cyber threats, we use machine learning models. These models are designed to learn from data and improve over time. However, the models need to be trained on relevant data to be effective.

This is where the aggregated and anonymized data from all customers comes into play. By collecting data about cyberattacks from a large number of customers, we build a comprehensive picture of the various attack patterns. This data is anonymized to ensure customer privacy is maintained.

The machine learning models are then trained on this anonymized and aggregated data. This way, the models can learn from a wide range of attack patterns, not just those experienced by a single customer. As a result, these models can better predict and identify different types of cyberattacks, thereby providing stronger protection.
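The following sketch illustrates the general idea of aggregating attack observations across customers without keeping raw customer identifiers. It is an assumption-laden example, not Varonis' pipeline: the function names, the event fields, and the key are hypothetical, and keyed hashing is strictly pseudonymization, a weaker guarantee than full anonymization.

```python
import hashlib
import hmac
from collections import Counter


def pseudonymize(value, secret_key):
    """Replace an identifier with a keyed hash so the raw value is not kept."""
    digest = hmac.new(secret_key, value.encode("utf-8"), hashlib.sha256)
    return digest.hexdigest()[:16]


def aggregate_patterns(events, secret_key):
    """Count attack patterns overall and the distinct tenants seeing each."""
    pattern_counts = Counter()
    tenants_per_pattern = {}
    for event in events:
        tenant = pseudonymize(event["tenant"], secret_key)
        pattern_counts[event["pattern"]] += 1
        tenants_per_pattern.setdefault(event["pattern"], set()).add(tenant)
    tenant_counts = {p: len(t) for p, t in tenants_per_pattern.items()}
    return pattern_counts, tenant_counts


# Hypothetical events from three customers.
events = [
    {"tenant": "customer-a", "pattern": "password spraying"},
    {"tenant": "customer-b", "pattern": "password spraying"},
    {"tenant": "customer-b", "pattern": "mass file deletion"},
    {"tenant": "customer-c", "pattern": "password spraying"},
]
counts, spread = aggregate_patterns(events, secret_key=b"demo-key-not-for-production")
print(counts["password spraying"], spread["password spraying"])  # 3 3
print(counts["mass file deletion"], spread["mass file deletion"])  # 1 1
```

The aggregated output contains only pattern names and counts, so a model trained on it sees how common a pattern is across customers rather than which customer experienced it.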

Once trained, these models are then used to protect all customers. Regardless of the type of attacks a customer has previously faced, they benefit from the collective learning of the models. This approach allows Varonis to stay one step ahead of hackers and offer its customers the best possible protection against cyber threats.

Varonis generative AI capabilities

What is my data used for and can it leak to others?

Customer textual data is sent to LLMs to provide product functionality for your benefit, and we may also use it for support. We do not retain it or use it for other purposes. No customer textual data is ever used directly to train our models. Instead, Varonis derives risk indicators from the raw textual data, and models are trained on those indicators.

How does Varonis ensure response accuracy and minimize hallucinations?

Varonis performs extensive testing before releasing AI features to verify the precision and accuracy of each feature. In addition, because models can drift over time, Varonis monitors their performance in production to detect and correct such drift. Varonis also controls the “temperature” setting of the models to reduce hallucinations while balancing this against precision.
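For readers unfamiliar with temperature, the sketch below shows the standard way it reshapes a language model's next-token probabilities: scores are divided by the temperature before the softmax, so lower values concentrate probability on the top candidate and make output more deterministic. The scores and temperature values are hypothetical and do not reflect the settings Varonis uses.

```python
import math


def softmax_with_temperature(logits, temperature):
    """Convert raw scores to probabilities; lower temperature sharpens them."""
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


# Hypothetical scores for three candidate next tokens.
logits = [2.0, 1.0, 0.5]
cool = softmax_with_temperature(logits, 0.2)
warm = softmax_with_temperature(logits, 1.5)
print([round(p, 3) for p in cool])
print([round(p, 3) for p in warm])
```

At the lower temperature almost all probability lands on the highest-scoring token, while at the higher temperature the alternatives remain likely, which is why temperature is one lever for trading response variety against consistency.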

That said, the nature of generative AI is that responses may be inaccurate or even plain wrong. Therefore, we advise that a human always review the recommended remediation actions before implementing them.

What data sets or LLMs does the AI use?

Varonis SaaS is built on top of Microsoft Azure OpenAI and uses the latest and best public LLMs provided via the Azure OpenAI platform. We have a written contract with Microsoft, which ensures that they, as a sub-processor, comply with our security policies and do not retain our customer data or use it to train models.

How does Varonis ensure AI doesn’t divulge information that a user shouldn’t see?

Varonis relies on the safeguards built into Azure OpenAI models, which apply a layer of protection at both the training stage and the serving stage.

Does Varonis intelligence respect data residency?

Yes, Varonis always stores data in the geography selected by customers and never transfers any customer metadata outside of that geography.

Can we opt out of Varonis’ generative AI features?

Yes, generative AI features are currently available in preview mode for selected customers. You can opt in and out of generative AI features at any time via the settings screen in Varonis’ Web Data Analytics user interface.

Does Varonis sell its models to anyone?

Varonis does not license its models, and only uses them to protect its customers.

What security measures are in place to protect the AI and ML models?

  1. Varonis AI and ML models are continuously scanned against OWASP Machine Learning Security:
    1. https://mltop10.info/about_owasp.html
    2. https://owasp.org/www-project-ai-security-and-privacy-guide/
  2. Our entire infrastructure is monitored by the Varonis SOC and compliant with ISO, HIPAA, PCI, and AICPA requirements.

Have questions? Contact us.

Report a vulnerability
https://hackerone.com/varonis

Report security issue
soc@varonis.com

 
