

[Podcast] Statistician Kaiser Fung: Investigate The Process Behind A Numerical Finding (Part 1)




In the business world, if we’re looking for actionable insights, many think they’re found using an algorithm.

However, statistician Kaiser Fung disagrees. With degrees in engineering and statistics and an MBA from Harvard, Fung believes that both algorithms and humans are needed, as the whole is greater than the sum of its parts.

Moreover, the worldview he suggests one should cultivate is numbersense. How? When presented with a numerical finding, go the extra mile and investigate the methodology, biases, and sources.

For more tips, listen to part one of our interview with Kaiser as he uses recent headlines to dissect the problems with how data is analyzed and presented to the general public.


Cindy Ng: Numbersense essentially teaches us how to make simple sense out of complex statistics. However, statistician Kaiser Fung said that cultivating numbersense isn’t something you can learn from a book. But there are three things you can do. First is you shouldn’t take published data at face value. Second is to know what questions to ask. And third is to have a nose for doctored statistics.

And so, the first bullet is you shouldn’t take published data at face value. And so to me, that means it takes more time to get to the truth that matters, to the matter, to the issue at hand. And I’m wondering also to what extent does the volume of data, big data, affect fidelity, because that certainly affects your final result?

Kaiser Fung: There are lots of aspects to this. I would say, let’s start with the idea that, well, it’s kind of a hopeless situation because you pretty much have to replicate everything or check everything that somebody has done in order to decide whether you want to believe the work or not. I would say, well, in a way that’s true but then over time you develop kind of a shortcut. Part of it is that if you have done your homework on one type of study, then you can apply all the lessons very easily to a different study, so you don’t have to actually repeat all that.

And also organizations and research groups tend to favor certain types of methodologies. So once you’ve understood what they are actually doing and what the assumptions behind the methodologies are, then you could…you know, you have developed some idea about whether you’re a believer in the assumptions or their method. Also, over time, you know, I have certain people whose work I have come to appreciate. I’ve studied their work, they share some of my own beliefs about how to read data and how to analyze data.

And it’s this sense of, it also depends on who is publishing the work. So, I think part one of the question is to encourage people to not just take what you’re told but to really think about what you’re being told. So there are some shortcuts to that over time. Going back to your other issue related to the volume of data, I mean I think that is really causing a lot of issues. And it’s not just the volume of data but the fact that the data today is not collected with any design or plan in mind. And oftentimes, the people collecting the data are really divorced from any kind of business problem or divorced from the business side of the house. And the data has just been collected and now people are trying to make sense of it. And I think you end up with many challenges.

One big challenge is you don’t end up solving any problems of interest. So I just had a write-up on my blog, that will be something just like this weekend. And this is related to somebody’s analysis of the…I think this is Tour de France data. And there was this whole thing about, “Well, nowadays we have Garmin and we have all these devices, they’re collecting a lot of data about these cyclists. And there’s nothing much done in terms of analysis,” they say.

Which is probably true because again, all of that data has been collected with no particular design in mind or problem in mind. So what do they do? Well, they basically then say, “Well, I’m going to analyze the color of the bikes that have actually won the Tour de France over the years.” But then that’s kind of the state of the world that we’re in. We have the data, then we try to force it to answer some questions that we create.

And oftentimes these questions are actually very silly and don’t really solve any real problems, like the color of the bike. I don’t think anyone believes it impacts whether you win or not.

I mean, that’s just an example of the types of problems that we end up solving. And many of them are very trivial. And I think the reason why we are there is that when you just collect the data like that, you know, let’s say you have a lot of this data about…I mean, let’s assume that this data measures how fast the wheels are turning, the speed of your bike, you know, all that type of stuff. I mean, the problem is that when you don’t have an actual problem in mind, you don’t actually have all of the pieces of the data that you need to solve a problem. And most often what you don’t have is an outcome metric.

You have a lot of this sort of expensive data but there’s no measurement of the thing that you want to impact. And then in order to do that, you have to actually merge in a lot of data or try to collect data from other sources. And oftentimes you probably cannot find appropriate data, so you’re kind of stuck in this loop of not having any ability to do anything. So I think the paradox of the big data age is we have all this data but it is almost impossible to make it useful in a lot of cases. There are many other reasons why the volume of data is not helping us. But I think…what flashed in my head right now because of … is that one of the biggest issues is that the data is not solving any important problems.

Andy Green: Kaiser, so getting back to what you said earlier about not sort of accepting what you’re told, and I’ve also now become a big fan of your blog, Junk Charts. And there was one, I think it’s pretty recent, you commented on a New York Times article on CEO executives, CEO pay. And then you actually sort of looked a little deeper into it and you came to sort of an opposite conclusion. In fact, can you just talk about that a little bit because the whole approach there is kind of having to do with Numbersense?

Kaiser Fung: Yeah. So basically what happened was there was this big headline about CEO pay. And it was one of these sort of counter-intuitive headlines that basically said, “Hey, surprise…” Sort of a surprise, CEO pay has dropped. And it even gives a particular percentage and I can’t remember what it was in the headline. And I think the sort of Numbersense part of this is that when I read something like that, because it’s sort of like the…for certain topics like this particular topic, since I have an MBA and I’ve been exposed to this type of analysis, I kind of have some idea, there’s some preconceived notion in my head about where CEO pay is going. And so it kind of triggers a bit of a doubt in my head.

So then what you want to do in these cases, and oftentimes, I think this is an example of very simple things you can do, is just click on the link that is in the article and go to the original report and start reading what they say. And in this particular case, you actually only need to read literally the first two bullet points of the executive summary of the report. Because then immediately you’ll notice that CEO pay has actually gone up, not down. And it all depends on what metric people use.

And they’re both actually accurate from a statistical perspective. So, the metric that went up was the median pay. So the middle person. And then the number that went down was the average pay. And then here you basically need a little bit of statistical briefing because you have to realize that CEO pay is an extremely skewed number. Even at the very top, I think they only talk about the top 200 CEOs, even at the very top the top person is making something like twice the second person. Like, this is a very, very steep curve. So the average is really meaningless in this particular case and the median is really the way to go.

And so, you know, I basically blogged about it and said, you know, that that’s a really poor choice of a headline because it doesn’t represent the real picture of what is actually going on. So that’s the story. I mean, that’s a great…yes, so that’s a great example of what I like to tell people. In order to get to that level of reasoning, you don’t really need to take a lot of math classes, you don’t need to know calculus, you know…I think it’s sort of the misnomer perpetuated by many, many decades of college instruction that statistics is all about math and you have to learn all these formulas in order to go anywhere.

Andy Green: Right. Now, I love the explanation. And also, it seems that if the Times had just shown a bar chart and it would have been a little difficult but what you’re saying is that at the upper end, there are CEOs making a lot of money and that they just dropped a little bit. And correct me if I’m wrong, but everyone else did better, or most like 80% of the CEOs or whatever the percentile is, did better. But those at the top, because they’re making so much, lost a little bit and that sort of dropped the average. But meanwhile, if you polled CEOs, whatever the numbers, 80% or 90% would say, “Yes, my pay has gone up.”

Kaiser Fung: Right. So yeah. So I did look at the exact numbers there. I don’t remember what those numbers are but conceptually speaking, given this type of distribution, it’s possible that just the very top guy having dropped by a bit will be sufficient to make the average move. And the median, the middle guy, has actually moved up. So what that implies is that the bulk, you know, the weight of the distribution has actually gone up.
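The mean-versus-median point can be checked with a few lines of Python. This is only a minimal sketch: the pay figures below are made up for illustration and are not the numbers from the report or the article. It shows how a drop at the very top of a skewed distribution can pull the average down even when most people, and the median, moved up.

```python
from statistics import mean, median

# Hypothetical pay figures in millions of dollars (illustrative only,
# NOT taken from the report discussed in the interview).
pay_last_year = [10, 11, 12, 13, 14, 15, 16, 18, 50, 100]
pay_this_year = [11, 12, 13, 14, 15, 16, 17, 19, 50, 80]

# Average and middle value for each year.
mean_before, median_before = mean(pay_last_year), median(pay_last_year)
mean_after, median_after = mean(pay_this_year), median(pay_this_year)

# Count how many of the ten hypothetical CEOs were paid more this year.
raised = sum(after > before
             for before, after in zip(pay_last_year, pay_this_year))

print(f"mean:   {mean_before} -> {mean_after}")
print(f"median: {median_before} -> {median_after}")
print(f"{raised} of {len(pay_last_year)} got a raise")
```

In this toy data set eight of the ten people are paid more, yet the average falls because the single largest figure dropped. A headline built on the average alone would say pay went down.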

There are many different levels of this that you can talk about. That’s the first level, getting the idea that you really should look at the median. And if you really want to dig deeper, which I did in my blog post, you also have to think about what components drive the CEO pay, because it’s accounting for not just the fixed base salary but maybe also bonuses, and maybe they even price in the stock components, and you know the stock components are going to be much more volatile.

I mean it all points to the fact that you really shouldn’t be looking at the averages because they’re so affected by all these other ups and downs. So to me, it’s a basic level of statistical reasoning that unfortunately doesn’t seem to have improved in the journalistic world. I mean, even in this day and age when there’s so much data, they really need to improve their ability to draw conclusions. I mean…that’s a pretty simple example of something that can be improved. Now we also have a lot of examples of things that are much more subtle.

I’d like to give an example, a different example of this, and it also comes from something that showed up in the New York Times some years ago. This is a very simple scatter plot that was plotting or trying to explain or trying to correlate the average happiness of people in different countries. And that’s typically measured by survey results. So you rate your happiness on a scale of zero to ten or stuff like that. And then they want to correlate that with what they call the progressiveness of the tax system in each of these countries.

So, the thing that people don’t understand is that by making this scatter plot, you have actually imposed upon your reader a particular model of the data. And in this particular case, it is the model that says that happiness can be explained by just one factor, which is the tax system. In reality, there are a gazillion other factors that affect somebody’s happiness. And you really…and if you know anything about statistics, you would learn that it’s multivariable regression which would actually control for all the other factors. But when you do a scatter plot, you haven’t adjusted for anything else. So a very simple analysis like this could be extremely misleading.
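The scatter-plot problem can also be simulated. The sketch below is an assumption-laden toy, not the New York Times analysis: the data is randomly generated, "income" is a hypothetical stand-in for the omitted factors, and happiness is constructed to depend on income only. Because tax progressiveness is made to correlate with income, a one-factor fit still shows a strong slope, while a two-factor least-squares fit that controls for income pushes the tax coefficient toward zero.

```python
import random

random.seed(0)  # fixed seed so the simulation is repeatable

# Simulated "countries" (entirely hypothetical data).
n = 500
income = [random.gauss(0, 1) for _ in range(n)]
# Tax progressiveness is correlated with income by construction.
tax = [0.8 * x + random.gauss(0, 0.6) for x in income]
# Happiness depends on income only; tax has NO direct effect here.
happiness = [1.0 * x + random.gauss(0, 0.5) for x in income]


def centered(values):
    """Subtract the mean so the regressions need no intercept term."""
    m = sum(values) / len(values)
    return [v - m for v in values]


def dot(a, b):
    """Sum of element-wise products."""
    return sum(x * y for x, y in zip(a, b))


t, i, h = centered(tax), centered(income), centered(happiness)

# One-factor model: what a plain scatter plot of happiness vs. tax implies.
naive_slope = dot(t, h) / dot(t, t)

# Two-factor least squares (happiness ~ tax + income), solved from the
# 2x2 normal equations; this is the tax coefficient after adjusting for income.
det = dot(t, t) * dot(i, i) - dot(t, i) ** 2
adjusted_tax_coef = (dot(i, i) * dot(t, h) - dot(t, i) * dot(i, h)) / det

print(f"slope from scatter plot alone: {naive_slope:.2f}")
print(f"tax coefficient after adjusting for income: {adjusted_tax_coef:.2f}")
```

The one-factor slope looks substantial even though, in this simulation, the tax system has no effect at all. That is the sense in which an unadjusted scatter plot imposes a model on the reader.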

Andy Green

Andy blogs about data privacy and security regulations. He also loves writing about malware threats and what it means for IT security.

Cindy Ng

Cindy is the host of the Inside Out Security podcast.

