
Help

I Am Not A Math Person! What do I Do?

In the immortal words of Douglas Adams, don’t panic. Most people, including math people, have very poor statistical intuition. There’s a reason betting games, scratch cards, and casinos make a decent amount of money for their operators.

The idea here is to have quick access to tools that point you towards a more advanced analysis, or maybe confirm/deny your intuition about some numbers you’ve been told.

When I deal with a brand new piece of data, there are a few general questions that I ask myself:

  • Does the data seem weird in any way?
  • Do these numbers follow a “good sense” rule for what I know should be in there?
  • Is there something in there that should exist but doesn’t?
  • Is there something in there that shouldn’t exist but does?

Let’s look at a few pointers.

Is The Data Natural, Synthetic, or just Weird?

“Natural” data tends to follow rules: for example, most points should be close to the average, and few should be at the extremes. It looks like a bell curve. So if I’m looking at the sales for an e-commerce site, and I know the average checkout order to be around 4 items, the curve should have very few carts with 1 item, a few more with 2, climbing up to 4, then dropping at 5, and trending towards “no one buys more than 7 items”.

Note that I didn’t include the empty baskets, because if I include them (in the case of conversions, for instance), the large majority will just bounce off the site and buy nothing, and so my bell curve will have a huge spike at zero, and a steep downwards slope from there.

So I just load the CSV, check the Gaussian curve for the basket size, and see what’s what.
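If you’re curious what that check boils down to outside of any tool, here’s a minimal sketch in plain Python. The CSV layout and the basket_size column name are made up for illustration:

```python
import csv
import io
from collections import Counter

# Hypothetical CSV export: one row per order, with a basket_size column.
sample = """order_id,basket_size
1,3
2,4
3,4
4,5
5,4
6,2
7,3
8,6
9,4
10,1
"""

# Count how many orders fall into each basket size.
sizes = [int(row["basket_size"]) for row in csv.DictReader(io.StringIO(sample))]
counts = Counter(sizes)

# A rough "bell curve" sanity check: the most common size should sit near
# the mean, and the extremes (1 item, 6+ items) should be rare.
mean = sum(sizes) / len(sizes)
mode, mode_count = counts.most_common(1)[0]

print(f"mean={mean:.1f}, mode={mode}, counts={dict(sorted(counts.items()))}")
```

If the mode and the mean land far apart, or the counts show two separate peaks, that’s exactly the kind of oddity worth digging into.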

If I see two spikes in my graph, then either something’s off or my intuition was wrong about the site. In both cases, it’s an interesting fact to check up on.

Another example is growth (or decay): if I see an exponential growth in revenue, I should see some corresponding growth in sales. So I fire up the Regression analysis and look at my two curves.
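For the curious, a linear regression boils down to a least-squares fit. Here’s a toy version in plain Python; the sales and revenue figures are invented for illustration:

```python
# Two aligned columns: units sold and revenue, invented numbers.
sales   = [10, 20, 30, 40, 50]
revenue = [105, 198, 310, 395, 502]  # roughly 10x sales, with some noise

n = len(sales)
mean_x = sum(sales) / n
mean_y = sum(revenue) / n

# slope = cov(x, y) / var(x); the intercept follows from the means.
slope = sum((x - mean_x) * (y - mean_y) for x, y in zip(sales, revenue)) / \
        sum((x - mean_x) ** 2 for x in sales)
intercept = mean_y - slope * mean_x

print(f"revenue = {slope:.2f} * sales + {intercept:.2f}")
```

If revenue grows exponentially while the fitted slope against sales stays flat, the two curves don’t tell the same story, and that’s worth a closer look.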

That’s also how you can spot data that has been doctored or is synthetic: the curve you should see is there, but it’s a bit too perfect, and matches the predictions almost exactly. Nothing in nature (human or actual) is that smooth. It should prickle your ear or thumb or whatever metaphor you use for your intuition.

And there’s data that doesn’t make any sense. Visualizing it might help you put into words why it’s weird. Maybe an obvious connection between two pieces of data isn’t there? When we worked on bike sharing datasets with some friends, we naturally assumed there would be a difference between weekday usage and weekend usage. Turns out there was none: the numbers were roughly the same. That’s weird, so we needed to dig deeper.
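That weekday/weekend check is easy to sketch by hand. Here’s one way in plain Python; the ride counts are invented:

```python
from datetime import date

# Hypothetical daily ride counts for a bike-sharing service, keyed by date.
rides = {
    date(2024, 6, 3): 120,  # Monday
    date(2024, 6, 4): 118,
    date(2024, 6, 5): 125,
    date(2024, 6, 6): 121,
    date(2024, 6, 7): 119,
    date(2024, 6, 8): 122,  # Saturday
    date(2024, 6, 9): 117,  # Sunday
}

# weekday() returns 5 and 6 for Saturday and Sunday.
weekday = [n for d, n in rides.items() if d.weekday() < 5]
weekend = [n for d, n in rides.items() if d.weekday() >= 5]

avg_weekday = sum(weekday) / len(weekday)
avg_weekend = sum(weekend) / len(weekend)

print(f"weekday avg: {avg_weekday:.1f}, weekend avg: {avg_weekend:.1f}")
```

With these made-up numbers the two averages come out nearly identical, which is precisely the surprise described above.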

What Sort of Visualization Should I Start With?

I don’t know your data. No one does but you and your team. That’s why the only valid answer here is “it depends”. However, there are a few things that can help you with the first steps.

Is The Column I’m Looking At Independent From Everything Else?

Am I looking at a number in relation to another or on its own? For example, if I want to stock up my t-shirt store, I might want to look at the sizes of the t-shirts I’ve already sold, then look at the Gaussian curve and stock up mostly on what the average buyer gets. This is unrelated to anything else: I have all the information I need just by looking at the past sizes sold. But if I’m looking at growth, I’m looking at my sales in relation to time, so there are two columns I need to look at.
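Here’s a rough sketch of that t-shirt reasoning in plain Python; the sizes and quantities are invented:

```python
from collections import Counter

# Hypothetical past sales: one entry per t-shirt sold.
sold_sizes = ["S", "M", "M", "L", "M", "S", "XL", "L", "M", "L"]

counts = Counter(sold_sizes)
best_seller, n_sold = counts.most_common(1)[0]

# Stock proportionally to past sales rather than evenly across sizes.
total = sum(counts.values())
stock_plan = {size: round(100 * n / total) for size, n in counts.items()}

print(f"best seller: {best_seller}, plan per 100 shirts: {stock_plan}")
```

One column, no context needed: the past sales alone tell you what to restock.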

Most of the visualizations will take 1 or 2 columns to look at; just remember that it’s pretty rare for a number to mean anything without context.

Am I Starting With An Assumption?

Is your goal here to verify something you think to be true? Then specify it. You have an inkling that there is a seasonal rhythm to your data? Chart it with the date as the X value. You think two things are strongly tied together? Chart one according to the other.
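For the seasonal-rhythm case, a common trick is to group by month while ignoring the year, so that repeating patterns stack up. A small Python sketch with made-up sales figures:

```python
from collections import defaultdict
from datetime import date

# Hypothetical daily sales across two years, with a deliberate summer bump.
daily_sales = {}
for year in (2022, 2023):
    for month in range(1, 13):
        # Flat baseline of 100, plus a bump peaking in July (month 7).
        daily_sales[date(year, month, 15)] = 100 + max(0, 30 - 10 * abs(month - 7))

# Group by month, ignoring the year, to expose a seasonal rhythm.
by_month = defaultdict(list)
for d, sales in daily_sales.items():
    by_month[d.month].append(sales)

monthly_avg = {m: sum(v) / len(v) for m, v in sorted(by_month.items())}
peak_month = max(monthly_avg, key=monthly_avg.get)
print(f"peak month: {peak_month}, averages: {monthly_avg}")
```

If your real data has a rhythm, the per-month averages will show a clear peak; if they come out flat, your inkling was wrong, which is also worth knowing.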

Just remember it’s your data. It should mean something to you.

And if you’re exploring a dataset (the fancy term for a long list of numbers), well, maybe start by looking at the Covariance to see if two columns seem to be strongly correlated or not. Then look for patterns: maybe there are two spikes in the graph that indicate anomalies in an otherwise smooth curve? Maybe almost all of the data is noise, and the dataset is useless as-is so you need to request more numbers?
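If you’re curious what that correlation check does under the hood, here’s a toy Pearson correlation in plain Python. The columns are invented: temperature vs ice-cream sales, plus an unrelated column of dice rolls:

```python
temperature = [12, 15, 19, 24, 28, 31]
ice_cream   = [30, 38, 52, 70, 85, 95]
dice_rolls  = [4, 1, 6, 2, 5, 3]

def pearson(xs, ys):
    """Pearson correlation: covariance normalized by both standard deviations."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / (vx * vy) ** 0.5

strong = pearson(temperature, ice_cream)  # the two columns move together
weak = pearson(temperature, dice_rolls)   # no real relationship

print(f"temp vs ice cream: {strong:.3f}, temp vs dice: {weak:.3f}")
```

Values near 1 (or -1) mean the columns move together; values near 0 mean one tells you nothing about the other, which is a quick way to separate signal from noise.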

Am I Looking At A Specific Indicator?

Maybe you’re on a mission. Maybe the only number of interest to you is the number of red pandas born this month. OK, then. Look at that number in relation to every other column in your dataset! On its own, is it a bell curve? It’s a natural phenomenon, so it should probably look like one, with a lot of average months and few exceptional ones. What does it mean if it doesn’t follow a bell curve? What if I look at the number of births in relation to the temperature that month? Oh, it seems like another bell curve where most births happen around a certain temperature? Or if I look at it in relation to the date, is it going up or down? Is it stable? Is the species in danger of going extinct?

Just explore, see the relationships between your “main” interest and all the other “potentially interesting” things in there.

I’m Having Fun! I Find Interesting Things! Where Do I Go From There?

As usual, it depends…

If your data analysis is enough to help you make decisions, make your decisions, and come back to see if the data shows the change you wanted.

If your data analysis triggered your curiosity and you need to dig deeper, maybe you should combine multiple datasets together to check things out? Like, OK, you found out the temperature seems to follow a seasonal pattern, but what if it also followed a pattern according to the latitude? Or the distance to the ocean?

If what you need are more powerful math tools, it’s a somewhat long journey. Statistics is not intuitive at all (despite what most people think - ha! see what I did there?), and the math isn’t that hard, but the methodology is. If you make an error because you want to go too fast, it tends to compound into bigger and bigger mistakes, so it’s all about slow, steady, and methodical work, and there’s no substitute for the 10,000-hour rule in that regard. But I promise the math isn’t super hard!