This post explains, in clear, non-technical terms, why you should consider using the median instead of the mean when analyzing data.
You are looking at your sales averages and are beyond excited about the average amount of sales per customer. This is the average you use to build KPIs and forecast sales in many enterprise reports. However, the truth is that this average does not necessarily reflect what each of your customers actually spent.
Now you might ask yourself how and why that is, and I will explain that in one moment. First, let’s revisit the definitions of average (mean) and median. The average is the sum of a group of numbers divided by how many numbers were added. The median is the actual middle number in a sorted list of numbers; when the list has an even count, it is the average of the two middle numbers.
That was easy, right? Now let’s hop back into the question of, “Why is my average wrong?”
Let’s first look at a group of numbers: 100, 80, 60, 30, 500, 30, 1000, 70
The average of these numbers is 233.75. (We added them up, then divided the sum by how many numbers we used; in this case, eight.)
The issue with the average is that it is affected by very large, or sometimes very small, numbers that can inflate or deflate it. We call these numbers outliers.
If we solve for the median of the same list, we land at 75. Sorted, the list is 30, 30, 60, 70, 80, 100, 500, 1000, and the two middle numbers give (70 + 80) / 2 = 75. Look at the drastic difference between the average of 233.75 and the median of 75! Which one best represents the middle of our data?
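If you want to check these two numbers yourself, here is a minimal sketch using Python's built-in statistics module. The variable names (sales, average, middle) are my own choices for illustration; the values are the eight numbers from the list above.

```python
from statistics import mean, median

# The eight example values from this post
sales = [100, 80, 60, 30, 500, 30, 1000, 70]

# Mean: the sum of the values divided by how many there are
average = mean(sales)

# Median: the middle of the sorted values; with an even count,
# it is the average of the two middle values (70 and 80 here)
middle = median(sales)

print(average)  # 233.75
print(middle)   # 75.0
```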
The median is the better reflection of the middle point of this data. The reason is that the mean is being pulled far away from the true middle point by the outliers. In the technical world, we would say our data is skewed. If we were to remove the outliers (1,000 and 500) via an outlier test, the average of the remaining numbers would be approximately 61.67, which is much closer to the median of 75 than the original mean of 233.75.
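A quick way to see the effect of removing the outliers is sketched below. Note the assumption: this code does not run an outlier test. It simply drops the two values identified above (500 and 1,000), and the names outliers and trimmed are hypothetical.

```python
from statistics import mean, median

sales = [100, 80, 60, 30, 500, 30, 1000, 70]

# Assumption: these are the two values this post treats as outliers.
# They are hard-coded here, not detected by a statistical test.
outliers = {500, 1000}

# Keep every value that is not one of the named outliers
trimmed = [x for x in sales if x not in outliers]

print(round(mean(trimmed), 2))  # 61.67
print(median(trimmed))          # 65.0
```

With the two extreme values gone, the mean and the median of the remaining numbers sit close together.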
So, when do we use median over mean?
The simple answer: if your data is not skewed, then the mean is the most appropriate method for approximating your dataset’s middle point. However, if your data is skewed, then you should use the median in its place. Standard ways to tell whether your data is skewed include looking at a frequency distribution or a boxplot, or comparing your median and mean. Still, you will always want to explore your data before deciding which method fits your desired outcome better.
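The last of those checks, comparing the mean to the median, can be sketched in a few lines. Treat this as a rough illustration only: the function name describe_skew and the 10% tolerance are assumptions I picked for the example, not a standard rule, and the check assumes the median is not zero.

```python
from statistics import mean, median

def describe_skew(values, tolerance=0.10):
    """Rough skew check: how far is the mean from the median?

    tolerance is an arbitrary illustrative cutoff, expressed as a
    fraction of the median. It is not a standard threshold.
    """
    avg = mean(values)
    mid = median(values)
    if mid == 0:
        raise ValueError("this rough check needs a nonzero median")
    gap = (avg - mid) / abs(mid)
    if gap > tolerance:
        return "right"   # mean pulled up by large values
    if gap < -tolerance:
        return "left"    # mean pulled down by small values
    return "none"        # mean and median roughly agree

print(describe_skew([100, 80, 60, 30, 500, 30, 1000, 70]))  # right
print(describe_skew([100, 80, 60, 30, 30, 70]))             # none
```

A large gap is a hint to reach for the median, not a verdict; a frequency distribution or boxplot is still worth a look.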
Next week, we will review the above methods to see if outliers affect your dataset!
-Shahid