This article is part of the series "My Big Fat Data Breach Cost Series".

If I had to summarize the first post in this series in one sentence, it’s this: as a single number, the average is not the best way to understand a dataset. Breach cost averages are no exception! And when that dataset is skewed or “heavy tailed”, the average is even less meaningful.

With this background, it’s easier to understand what’s going on with the breach cost controversy as it’s being played out in the business press. For example, this article in Fortune magazine does a good job of explaining the difference between Ponemon’s breach cost per record stolen and Verizon’s statistic.

**Regressions Are Better**

The author points out that Ponemon does two things that overstate their cost per record average. One, they include indirect costs in their model: potential lost business, brand damage, and other opportunity costs. While I’ll get to this in the next post, Ponemon’s qualitative survey technique is not necessarily bad, but their numbers have to be interpreted differently.

The second point is that Ponemon’s $201 per record average, like any raw average, is not a good predictor, and for skewed datasets it’s an especially poor one.

According to our friends at the Identity Theft Resource Center (ITRC), which tracks breach stats, we’ve now reached over 1,000 breach incidents with over *171 million* records taken. Yikes!

Based on Ponemon’s calculations, American business has experienced $201 x 171 million or about $34 *billion* worth of data security damage. That doesn’t make any financial sense.

Verizon’s average of $.58 per record is based on reviewing actual insurance claim data provided by NetDiligence. This average is also deficient because it likely understates the problem — high deductibles and restrictive coverage policies play a role.

Verizon, by the way, has said this number is also way off! They were making a point about averages being unreliable (and taking a little dig at Ponemon).
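To see how far apart the two averages land, here’s a quick back-of-the-envelope sketch in Python. It uses only the figures quoted above; the variable names are mine, and multiplying Verizon’s average by the ITRC record count is my own extrapolation, not something Verizon did.

```python
# Figures quoted in this post; the names are illustrative.
records_taken = 171_000_000   # ITRC record count
ponemon_per_record = 201.0    # Ponemon average, dollars per record
verizon_per_record = 0.58     # Verizon average, dollars per record

# A flat per-record average scales linearly with the record count.
ponemon_total = ponemon_per_record * records_taken
verizon_total = verizon_per_record * records_taken

print(f"Ponemon-style estimate: ${ponemon_total / 1e9:.1f} billion")
print(f"Verizon-style estimate: ${verizon_total / 1e6:.0f} million")
print(f"Ratio between the two:  {ponemon_total / verizon_total:.0f}x")
```

Two "averages" that disagree by a factor of several hundred are a good hint that neither is telling the whole story.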

The Fortune article then discusses Verizon’s log-linear regression, and reminds us that breach costs don’t grow at a linear rate. We agree on that point! The article also excerpts the table from Verizon that shows how different per record costs would apply for various ranges. I showed that same table in the previous post, and further below we’ll try to do something similar with incident costs.

In the last post, we covered the RAND model’s non-linear regression, which incorporates other factors besides record counts. Jay Jacobs also has a very simple model that’s better than a strictly linear fit. Verizon, RAND, and Jacobs’ regressions are all far *better* at predicting costs than just a single average number.

I’ll make one last point.

The number of data records involved in a breach can be hard to nail down. The data forensics often can’t accurately say what was taken: was it 10,000 records or 100,000? The difference may amount to whether a single file was touched, and a factor of ten difference can change $201 per record to $20!
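Here’s that sensitivity as a tiny sketch. The breach is hypothetical: I’ve picked a total cost so that the low record count gives exactly $201 per record, then held the total fixed while the record count changes.

```python
def cost_per_record(total_cost, records):
    """Per-record cost for a breach with a known total cost."""
    return total_cost / records

# Hypothetical breach: total chosen so 10,000 records works out to $201 each.
total_cost = 201 * 10_000

for records in (10_000, 100_000):
    per_record = cost_per_record(total_cost, records)
    print(f"{records:>7,} records -> ${per_record:,.2f} per record")
```

Same incident, same bill, and the per-record number moves by an order of magnitude purely because of forensic uncertainty.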

A more sensible approach is to look at the costs *per incident*. This average, as I wrote about last time, is a little more consistent, and is roughly in the $6 million range based on several different datasets.

**The Power of Power Laws**

Let’s get back to the core issue of averages. Unfortunately, data security stats are very skewed, and in fact the distributions are likely represented by power laws. The Microsoft paper, Sex, Lies and Cyber-Crime Surveys, makes this case, and also discusses major problems (under-sampling and misreporting) of datasets that are based on power laws: in short, a few data points have a *disproportionate* effect on the average.

Those who are math phobic and curl up into a fetal position when they see an equation or hear the word “exponent” can skip to the next section without losing too much.

Let’s now look at the table from the RAND study, which I showed last time.

Note that the median cost for an incident (see the bottom total) is $250,000, while the average cost of $7.84 million is an astonishing 30 times as great! And the maximum value for this dataset is a monster-ish $750 million incident. We ain’t dealing with a garden variety bell-shaped or normal curve.

When the data is guided by power law curves, these leviathans exist, but they wouldn’t show up in data conforming to the friendlier and more familiar bell curve.

I’m now going to fit a power law curve to the above stats, or at least to the average — it’s a close enough fit for my purpose. The larger point is that you can have a fat-tailed dataset with the same average!

A brief word from our sponsor. Have I mentioned lately how great Wolfram Alpha is? I couldn’t have written this post without it. If I only had this app in high school. Back to the show.

The power law has a very simple form: it’s just the variable x, representing in this case the cost of an incident, raised to the negative power of alpha: x^{-α}.

Simple. (Please don’t shout into your browser: I know there’s a normalizing constant, but I left it out to make things easier.)

I worked out an alpha of about 2.15 (that is, an exponent of -2.15) based on stats in the above table. The alpha, by the way, is the key to all the math that you have to do.
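For readers who do want the normalizing constant, here’s one standard way to write the model down: a Pareto-style density with a lower cutoff. Treat it as a sketch under my own assumptions. The cutoff x_min is something I’m introducing for illustration (the post doesn’t state one), and α is taken as the positive value 2.15, so that the exponent on x is negative.

```latex
\begin{align*}
p(x) &= \frac{\alpha - 1}{x_{\min}} \left( \frac{x}{x_{\min}} \right)^{-\alpha},
       \qquad x \ge x_{\min},\ \alpha > 1 \\[4pt]
\text{mean} &= \frac{\alpha - 1}{\alpha - 2}\, x_{\min}
       \qquad (\text{finite only when } \alpha > 2) \\[4pt]
\text{median} &= 2^{1/(\alpha - 1)}\, x_{\min}
\end{align*}
```

With α just above 2, the factor (α - 1)/(α - 2) is large, which is the algebraic way of saying that the tail drags the mean far above the typical incident.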

However, what I really want to know is the weight or percentage of the total costs for all breach incidents that *each* segment of the sample contributes. I’m looking for a representative average for each slice of the incident population.

For example, I know that under my fitted curve, the bottom 50% of the sample (that’s about 460 incidents) has incident costs below $1.8 million. Can I calculate the average costs for this group? It’s certainly not $7.84 million!
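The $1.8 million cutoff appears to be the median of the fitted curve rather than the $250,000 median in the RAND table, since the curve was matched to the average only. A minimal sketch of how such a number can be recovered, assuming a Pareto density with a lower cutoff `x_min` (my assumption, as are all the names below):

```python
ALPHA = 2.15         # density exponent from the post (density ~ x ** -ALPHA)
MEAN_COST = 7.84e6   # average incident cost from the RAND table, in dollars

# Assumed model: Pareto density with lower cutoff x_min, for which
# mean = (ALPHA - 1) / (ALPHA - 2) * x_min when ALPHA > 2.
# Solve that relation for the cutoff implied by the published mean.
x_min = MEAN_COST * (ALPHA - 2) / (ALPHA - 1)

# Median of the same density: x_min * 2 ** (1 / (ALPHA - 1)).
model_median = x_min * 2 ** (1 / (ALPHA - 1))

print(f"implied cutoff x_min: ${x_min / 1e6:.2f} million")
print(f"model median:         ${model_median / 1e6:.2f} million")
```

Under these assumptions the model median comes out between $1.8 and $1.9 million, in line with the figure quoted above.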

There’s a little bit more math involved, and if you’re interested, you can learn about the Lorenz curve here. The graph below compares the unequal distribution of total incident costs (the blue curve) for my dataset versus a truly equal distribution (the 45-degree red line).

As you ponder this graph — and play with it here — you see that the blue curve doesn’t really change all that much up to around the 80% or .8 mark.

For example, the median at .5 and below represents 9% of the total breach costs. Based on the stats in the above table, the total breach cost for all incidents is about $7.2 billion ($7.84 million x 921). So the first 50% of my sample represents a mere $648 million ($7.2 billion x .09). If you do a little more arithmetic, you find the average is about $1.4 million per incident for this group.
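If you’d rather not do the arithmetic by hand, the sketch below reproduces it. It assumes the blue curve is the Lorenz curve of a Pareto density with exponent 2.15, which is my reading of the setup rather than something spelled out above; the function and variable names are illustrative.

```python
ALPHA = 2.15         # density exponent from the post
MEAN_COST = 7.84e6   # average incident cost from the RAND table, in dollars
INCIDENTS = 921      # incident count used in the post


def lorenz_share(p, alpha=ALPHA):
    """Share of total cost carried by the cheapest fraction p of incidents.

    Assumes a Pareto density ~ x ** -alpha with alpha > 2, whose Lorenz
    curve is L(p) = 1 - (1 - p) ** ((alpha - 2) / (alpha - 1)).
    """
    return 1 - (1 - p) ** ((alpha - 2) / (alpha - 1))


total_cost = MEAN_COST * INCIDENTS
bottom_half_share = lorenz_share(0.5)
bottom_half_total = bottom_half_share * total_cost
bottom_half_avg = bottom_half_total / (0.5 * INCIDENTS)

print(f"total cost:         ${total_cost / 1e9:.2f} billion")
print(f"bottom 50% share:   {bottom_half_share:.1%}")
print(f"bottom 50% average: ${bottom_half_avg / 1e6:.2f} million per incident")
print(f"top 10% share:      {1 - lorenz_share(0.9):.1%}")
```

The bottom-half share rounds to the 9% used above, and the average lands near $1.4 million. Small differences come from rounding the share and the total before multiplying.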

The takeaway for this section is that most of the sample is not seeing an average incident cost close to $7.8 million! This also implies that at the tail there are monster data incidents pushing up the numbers.

## The Amazing IOS Blog Data Incident Cost Table

I want to end this post with a simple table (below) that breaks average breach costs into three groups: let’s call it Economy, Economy Plus, and Business Class. This refers to the first 50% of the data incidents, the next 40%, and the last 10%. It’s similar to what Verizon did in their 2015 DBIR for per record costs.

| | Economy | Economy Plus | Business Class |
|---|---|---|---|
| Data incidents | 460 | 368 | 92 |
| Percent of total cost | 9% | 15% | 74% |
| Total costs | $648 million | $1 billion | $5.33 billion |
| Average costs | $1.4 million/incident | $2.7 million/incident | $58 million/incident |
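The bottom row is just each group’s total cost divided by its incident count. A short sketch using only the numbers in the table (the names are mine):

```python
# Rows taken from the table: (group, incidents, total cost in dollars).
groups = [
    ("Economy", 460, 648e6),
    ("Economy Plus", 368, 1e9),
    ("Business Class", 92, 5.33e9),
]


def average_cost(total, incidents):
    """Average cost per incident for one group."""
    return total / incidents


for name, incidents, total in groups:
    avg = average_cost(total, incidents)
    print(f"{name:<15} ${avg / 1e6:5.1f} million per incident")
```

Note how the 92 incidents in Business Class carry an average roughly forty times that of Economy.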

If you’ve made it this far, you deserve some kind of blog medal. Maybe we’ll give you a few decks of Cards Against IT if you can summarize this whole post in a single, concise paragraph and also explain my Lorenz curve.

In the next, and (I promise) last post in this series, I’ll try to tell a story based on the above table, and then offer further thoughts on the Verizon vs. Ponemon breach cost battle.

Storytelling with just numbers can be dangerous. There are limits to “data-driven” journalism, and that’s where Ponemon’s qualitative approach has some significant advantages!