This article is part of the series "My Big Fat Data Breach Cost Series". Check out the rest:
Data breach costs are very expensive. No, wait they’re not. Over 60% of companies go bankrupt after a data breach! But probably not. What about reputational harm to a company? It could be over-hyped but after Equifax, it could also be significant. And aren’t credit card fraud costs for consumers a serious matter? Maybe not! Is this post starting to sound confusing?
When I was tasked with looking into data breach costs, I was already familiar with the great Verizon DBIR vs. Ponemon debate: based on data from 2014, Ponemon derived an average cost per record of $201 while Verizon pegged it at $.58 per record. In my book, that’s an enormous difference. But it can be explained if you dive deeper.
After looking at one too many research paper, presentation and blog post on the subject of data breach costs, I started to see that once you absorb a few underlying ideas, you understand what everyone is yakking about.
That’s a roundabout way of saying that this will be a multi-part series.
Averages Can Cause Non-Average Problems
The first issue to take up is the average of a data sample. In fact, this blog’s favorite statistician Kaiser Fung lectured us on this point a while back. When looking at a data set, a simple average of the numbers works well enough as long as the distribution of the number is not too skewed – has a spike or clump at the tail end.
But as Fung points out, when this is not the case, the average leads to inconsistencies, as in the following hypothetical data set of breach record counts over two years:
|Company||Number of records breached (2015)||Number of records breached (2016 )|
For 2015, the average of 1172 is off by several multiples for seven of the ten companies! And if we compare this average to the following year’s average of 930, we could incorrectly conclude that breach counts are down.
Why? If we look at those seven companies, we see all their breach counts went, ahem, up.
This usually leads to a discussion of how numbers are distributed in a dataset, and that the median number, where 50% or less of the data can be found, is a better representation than an average — especially for skewed data sets. Kaiser is very good at explaining this.
For those who want to get a head start on the next post in this series, they can scan this paper, which has the best title on a data security topic I’ve come across, Sex, Lies and Cyber-crime Surveys. This was written by those crazy folks at Microsoft. If you don’t want to read it, the point is this: for skewed data, it’s important to analyze how each percentile contributes to the overall average.
Guesstimating Data Breach Costs
How does Ponemon determine the cost of a data breach? Generally, this information is not easily available. However, in recent years, theses costs have started to show up in annual reports for public companies.
But for private companies and for public companies that are not breaking breach costs out in their public financial reporting, you have to do more creative number crunching.
Ponemon surveys companies, asking them to rate the costs for common post-breach activities, including auditing & consulting, legal services, and identity protection fees. Ponemon then categorizes costs into whether they’re direct — for example, credit monitoring — or fuzzier indirect or opportunity costs — extra employee time or potential lost business.
It turns out that these indirect costs represent about 40% of the average cost of a breach based on their 2015 survey. These costs mean something, but they’re not really accounting costs. More on that next time.
Recently, other researchers have been able to get a hold of far better estimate of the direct breach costs by examining actual cyber insurance claims. Companies, such as Advisen and NetDiligence, have this insurance payout data and have been willing to share it.
The cyber insurance market is still immature and the actual payouts after deductibles and other fine print don’t represent the full direct cost of the breach. But this is, for the first time, evidence of direct costs.
Anyway, the friendly people over at RAND — yes, the very same company who worked this out — used these data sets to guesstimate an average breach cost per incident of about $6 million – wonks should review their paper. This tracks very closely with Ponemon’s $6.5 million per incident estimate for roughly the same period.
Before you start shouting into your browser, I realize I used an average above to estimate a very skewed (and as we’ll see heavy-tailed) set.
In any case, several studies including the RAND one, have focused on per incident costs rather than per record costs. At some point, the Verizon DBIR team also began to de-emphasize the count of records exposed, realizing that it’s hard to get reliable numbers from their own forensic data.
In the 2015 DBIR report, the one where they announced their provocative $.58 per record breach cost claims, the researchers relied on, for the first time, a dataset of insurance claim data from NetDiligence.
Let me just say that the DBIR’s average cost ratio is heavily influenced by a few companies with humongous breached record counts — likely in the millions — reflected in the denominator and smaller total insurance payouts for the numerator. As we saw in my made-up example above, the average in this case is not very revealing.
Why not use multiple averages customized over different breach count ranges? I hope you’re beginning to see it’s far better to segment the cost data by record count: you look up in a table to find the costs appropriate for your case. And Verizon did something close to that in the 2015 DBIR to come with a table of data that’s nearer Ponemon’s average for the lower tiers:
Counting breach data record provides some insight into understanding total costs, but there are other factors: the particular industry the company is in, regulations they’re under, credit protection costs for consumers, and company size. For example, take a look at this breach cost calculator based on Ponemon’s own data.
Linear Thinking and Its Limits
You can understand why the average breach cost per record number is so popular: it provides a quick although unreliable answer for the total cost of a particular breach.
To derive the $201 average cost per record, Ponemon simply added up the costs (both direct and indirect) from their survey and divided by the number of records breached as reported by the companies.
This may be convenient for calculations but as a predictor, it’s not very good. I’m gently walking around the topic of linear regressions, which is one way to draw a “good” straight line through the dataset.
Wonks can check out Jay Jacobs’ great post on this in his Data Driven Security blog. He shows a linear regression beating out the simple Ponemon line with its slope of 201 — by the way, he gained direct access to Ponemon’s survey results. Jacobs’ beta is $103, which you can interpret as the marginal cost of an additional breached record. But even his regression model is not all that accurate.
I want to end this post with this thought: we want the world to look linear, but that’s not the way it ticks.
Why should breach costs go up by a fixed amount for each additional record stolen? And for that matter, why do we assume that 10% of the companies in a data breach survey will contribute 10% to the total costs, the next 10% will add another 10%, etc.
Sure for paying out credit monitoring costs for consumer and replacing credit cards that were reissued by litigious credit card companies, costs add up on a per record basis.
On the other hand, I don’t know too many attorneys, security consultants, developers, or pen testers who say to new clients, “We charge $50 a data record to analyze or remediate your breach.”
Jacobs found a better non-linear model — technically log-linear which is fancy way of saying the record count variable has an exponent in it. In the graph below — thank you Wolfram Alpha! — I compared the regression line (based on the Ponemon data) against the more sophisticated model from Jacobs. You can gaze upon the divergence or else click here to explore on your own.
If you made it this far, congratulations!
In the next post, I hope all this background will payoff as I try to connect these ideas to come up with a more nuanced way to understand data breach costs.