Thursday, February 25, 2010

Mean, Mode, Median

For some this is well-known information, for some a reminder, and for anyone in the media something brand new ;-)

Newspapers are full of statistics, and the biggest statistical tool they like to throw about is the "average", as in 'People in the UK eat an average of 12g of salt a day'. Now we all know what the average is and how it's calculated: you add up the amounts you're dealing with and divide by the number of amounts.

The proper term for this average is the arithmetic mean which, as with "average", is normally shortened to a single word: "mean". There is also a geometric mean, but let's not blow too many minds.

Another term that occasionally pops up is "median". The statistics office likes this one, so it often appears in the media as part of a quoted publication. The median is a simpler concept than the mean: take the amounts, arrange them in order, and pick a value such that half of them fall on either side.

Finally, one that almost never appears is the mode. This is even easier: take all the amounts, group identical values together, and count the members of each group. The value of the group with the most members is the mode.
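
For anyone who prefers to see the three definitions spelled out, here is a rough sketch in Python that follows the descriptions above step by step. The function names and the sample amounts are made up purely for illustration, and ties for the mode are simply resolved in favour of whichever value turns up first.

```python
from collections import Counter

def arithmetic_mean(amounts):
    # Add the amounts up and divide by how many there are.
    return sum(amounts) / len(amounts)

def median(amounts):
    # Arrange in order and pick the value with half on either side.
    ordered = sorted(amounts)
    middle = len(ordered) // 2
    if len(ordered) % 2:
        return ordered[middle]
    # Even count: take the point halfway between the two middle values.
    return (ordered[middle - 1] + ordered[middle]) / 2

def mode(amounts):
    # Group identical values, count each group, keep the biggest group.
    value, _count = Counter(amounts).most_common(1)[0]
    return value

sample = [2, 3, 3, 5, 7]  # illustrative amounts, not from any real survey
print(arithmetic_mean(sample), median(sample), mode(sample))
```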

So what's the point of knowing this? Well, as mentioned, the media loves to deal with the 'average' and occasionally the median, but you rarely see both in the same article. So why's that?

Let me take the simplest option: you have six people in a room and they each have an hourly wage that corresponds to their name. So Mr One earns £1/hour, Mr Two £2/hour, etc. What's the mean?

1+2+3+4+5+6=21 21/6=£3.50/hour


So on average the room earns £3.50/hour. What's the median?

1, 2, 3...4, 5, 6

Three on each side leaves the median halfway between 3 and 4; hey, it's £3.50/hour again. So that was easy, but what if Mr Six leaves the room and Mr OneHundred enters? How does that change things?

Mean 1+2+3+4+5+100=115 115/6=£19.17/hour
Median 1, 2, 3...4, 5, 100 =£3.50/hour

Now that's a big difference. If you were just quoted the average wage of £19.17/hour, would that be an accurate representation of that room? Would £3.50/hour? Hopefully you've answered "No"; the true picture only appears when you have both figures. In this instance the lower median figure shows that there's something pulling the mean higher, and that'd be our Mr OneHundred. What happens if most of the people leave and Mr OneHundred has some friends over?

1+96+97+98+99+100=491 491/6=£81.83/hour
1,96,97...98,99,100 = £97.50/hour

In this instance we see the higher median means something is pulling the figures lower, in this case Mr One. Let's try one more.
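
If you'd rather let a computer do the sums, the comparison can be sketched with Python's standard `statistics` module. The `pull_direction` helper and its £1/hour threshold are my own assumptions for illustration, not any standard test for skew.

```python
from statistics import mean, median

def pull_direction(wages):
    """Say which way the mean has been dragged relative to the median."""
    gap = mean(wages) - median(wages)
    if gap > 1:       # the £1/hour threshold is an arbitrary choice for this sketch
        return "higher"
    if gap < -1:
        return "lower"
    return "neither way"

rooms = {
    "Mr One to Mr Six": [1, 2, 3, 4, 5, 6],
    "Mr OneHundred arrives": [1, 2, 3, 4, 5, 100],
    "Mr OneHundred's friends": [1, 96, 97, 98, 99, 100],
}

for name, wages in rooms.items():
    print(f"{name}: mean £{mean(wages):.2f}/hour, "
          f"median £{median(wages):.2f}/hour, "
          f"mean pulled {pull_direction(wages)}")
```

A mean well above the median hints at a few large values; a mean well below it hints at a few small ones.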

1+2+2+98+99+100=302 302/6=£50.33/hour
1,2,2...98,99,100 = £50/hour

So our mean and median come out roughly the same, so roughly £50/hour tells us all we need to know, right? So what's the mode?

1 (×1), 2 (×2), 98 (×1), 99 (×1), 100 (×1)


The group with the most members is 2, so £2/hour is our mode. So a mean of £50.33/hour, a median of £50/hour and a mode of £2/hour mean that, for most people in that room, what might get quoted is nothing like what they earn.
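
The grouping and counting for this last room can be sketched with `collections.Counter`, alongside the standard `statistics` functions. The variable names are just my own choices.

```python
from collections import Counter
from statistics import mean, median, mode

wages = [1, 2, 2, 98, 99, 100]  # hourly wages (£) in the last room

# Group identical wages together and count the members of each group.
groups = Counter(wages)
for wage, people in sorted(groups.items()):
    print(f"£{wage}/hour: {people} person(s)")

print(f"mean £{mean(wages):.2f}, median £{median(wages):.2f}, mode £{mode(wages)}")
```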

It's a neat trick to play on people who don't understand the figures being used. Going back to the first room with Mr OneHundred in it: if I wanted to show how the room was prospering I'd use the average of £19.17/hour; if I were calculating poverty rates I'd use 60% of the median, i.e. 60% of £3.50/hour, which is £2.10/hour.
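
As a quick sketch of that poverty-line sum, assuming the room with Mr OneHundred in it and the 60%-of-median rule exactly as described above:

```python
from statistics import median

wages = [1, 2, 3, 4, 5, 100]         # hourly wages (£), Mr OneHundred's room
poverty_line = 0.6 * median(wages)   # 60% of the median wage
below_line = [w for w in wages if w < poverty_line]

print(f"poverty line £{poverty_line:.2f}/hour, wages below it: {below_line}")
```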

So next time someone quotes either the 'average' or the median, ask why they're not telling you the other one.


Don B said...

The other statistic that used to come across a lot when I was at work was the decile.

Teachers, health staff and social services staff were often concerned about the pupils, patients and clients who were in the bottom decile of whatever statistic they were quoting.

I was always concerned about the distortions that the upper decile created, particularly in wealth and income. I always considered myself fortunate in what I earned, in that I could see that most of my junior staff earned only half what I earned, but even they earned more than twice what my clients did. However, when I looked the other way, it irritated me that even after 45 years working in Local Government I was still not earning the "average" wage. The reason of course lies in what the press were quoting as the average wage, and the total, total distortion of the top decile on the average, as you have outlined. When you have individuals earning 10 and 20 times more in a week than I as a Principal Officer earned in a year, the average wage becomes meaningless, and it engenders greed and envy when people see the average being so abused. I have never seen any statistics on the mean wage being quoted.

(Sorry, rant over)

Orphi said...

Don't forget the harmonic mean as well. ;-) And then there are weighted means, generalised functional means, RMS and so forth.

The mean (of whatever type), median and mode are all measures of central tendency. They all attempt to tell you what the “typical” value is. But that's just one number. A valid question is to ask how much variation there is.

The usual way to measure variation is the standard deviation. The text books will tell you that 68% of the stuff they measured is within the range (mean − standard deviation) to (mean + standard deviation). If you widen the range, 95% are within ± 2 standard deviations of the [arithmetic] mean, and 99.7% of the stuff is within ± 3 standard deviations.

So, for example, if you have a mean of £3.20 and an SD of £0.20, then “most” people earn a few quid. But if you have a mean of £3.20 and an SD of £100, that means some people earn a hell of a lot more than £3.20!

However, the percentages I quoted above apply only if the data follows a normal distribution, which brings me nicely to my next point.
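
Those percentages can be checked with a small sketch using `statistics.NormalDist` (available from Python 3.8). It assumes a perfect normal distribution, which is exactly the assumption the percentages depend on.

```python
from statistics import NormalDist

standard_normal = NormalDist(mu=0, sigma=1)

# Share of a normal distribution lying within ±k standard deviations of the mean.
shares = {k: standard_normal.cdf(k) - standard_normal.cdf(-k) for k in (1, 2, 3)}

for k, share in shares.items():
    print(f"within ±{k} SD: {share:.1%}")
```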

The different means, the mode and median, the SD, these are all just single numbers. The histogram gives you the complete picture; the other measurements are merely single points on this graph.

For those that don't know, a histogram is a graph of value versus frequency. So supposing we're still fussing over wages, the histogram would have wage along one axis, and the number of people earning that wage along the other axis. The distribution refers to the shape of this graph; the normal distribution is the well-known bell-curve that no real-world dataset actually follows. ;-)

You will of course already be aware that the mean of something can be an impossible value. (E.g., apparently the mean number of children a family has is 1.8, despite the self-evident fact that nobody actually has 1.8 children.) But there is a thing called a bimodal distribution.

For example, it could be that 100 employees earn £3.00, and 100 employees earn £300. The mean is then £151.50, and the median is identical (!!), and the standard deviation is about 150 also. This suggests that most people earn somewhere near £150, even though in this example nobody earns anywhere near this amount!
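
Here is a rough check of those figures using Python's `statistics` module. I've used the population standard deviation (`pstdev`) on the assumption that these 200 employees are the whole workforce rather than a sample.

```python
from statistics import mean, median, pstdev

wages = [3] * 100 + [300] * 100   # 100 employees on £3.00, 100 on £300

centre = mean(wages)
spread = pstdev(wages)            # population SD: these 200 are everyone
closest_gap = min(abs(w - centre) for w in wages)

print(f"mean £{centre:.2f}, median £{median(wages):.2f}, SD £{spread:.2f}")
print(f"the nearest anyone gets to the mean is £{closest_gap:.2f} away")
```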

This fact becomes patently obvious from a cursory glance at the histogram, which would show a large spike at £3 and another at £300, with nothing in between. Trouble is, no newspaper article will ever include a histogram. They will include individual statistics like the mean, or perhaps “X% of people earn more than £Y”.
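
A crude text histogram of that same two-spike example can be sketched as follows. The `bin_counts` helper, the £50 bin width and the one-`#`-per-ten-people scale are all arbitrary choices of mine, and the helper assumes non-negative whole-pound wages.

```python
from collections import Counter

def bin_counts(values, bin_width):
    """Count how many values fall into each bin [start, start + bin_width)."""
    counts = Counter((v // bin_width) * bin_width for v in values)
    top = max(counts)
    # Include the empty bins too, so the gaps show up in the picture.
    return {start: counts[start] for start in range(0, top + bin_width, bin_width)}

wages = [3] * 100 + [300] * 100
width = 50
histogram = bin_counts(wages, width)

for start, people in histogram.items():
    bar = "#" * (people // 10)    # one '#' per ten people
    print(f"£{start:>3} to £{start + width - 1:<3} | {bar}")
```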

In fairness, the mean, standard deviation and other parameters can be estimated. To draw a histogram, you'd have to actually ask every single person how much they earn — which is usually infeasible. But simple numerical parameters like the mean can be estimated by only asking a few people. (Although obviously, the more people you ask, the better the estimate. Note also that if you don't pick the people randomly enough, you get very inaccurate results — useful if you want the results to come out a specific way to prove the point you want to make!)
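
To make the sampling point concrete, here is a sketch that reuses the £3/£300 workforce from earlier as the full population. The seed, the sample size of 50 and the "only ask the top earners" rule are all assumptions picked for illustration.

```python
import random
from statistics import mean

rng = random.Random(2010)            # fixed seed so the run is repeatable
population = [3] * 100 + [300] * 100
true_mean = mean(population)

# Ask 50 people chosen at random...
random_sample = rng.sample(population, 50)
# ...versus asking only the 50 best-paid people.
biased_sample = sorted(population)[-50:]

print(f"true mean:          £{true_mean:.2f}")
print(f"random sample mean: £{mean(random_sample):.2f}")
print(f"biased sample mean: £{mean(biased_sample):.2f}")
```

The random sample should land somewhere near the true mean, while the hand-picked one reports £300 and misses half the workforce entirely.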

In summary, the statistics you read in a newspaper are approximately useless. Usually you have no idea what was measured, how big a sample was used, what assumptions were applied, and hence how reliable the number you're seeing actually is. The newspaper just presents it as a black and white fact, when usually it's an estimate, and one which is only part of the whole picture anyway…

FlipC said...

@Orphi - and there I was trying to keep it simple :-) Yes, in theory the media should show SD and chi-squared etc., but these are the same people who calculate the odds of having three (non-triplet) children on the same day as 1 in 50,000,000; baby steps my friend, baby steps.

@Don - I'd like to say that great minds think alike, but I don't want to elevate myself that high so I'll attribute it to coincidence. After I'd finished this entry I thought "Perhaps I should deal with quintiles (and tertiles and deciles and...) next?".

Ah yes, I did give an example of how to misuse the mean or median which, in retrospect, was a little misleading itself. As I mentioned, the statistics office does like the median (and -tiles), whereas it's other 'reports' that use the mean. I just tied the two together; please treat it as a purely hypothetical example.

Orphi said...

My point is that the mean and all those other numbers are just one data point; the histogram is the whole picture. Unfortunately, it's quite hard to estimate a histogram, whereas a mean can be easily estimated.

OK, I rephrase: My real point is that newspaper statistics are mostly nonsense anyway. ;-)

FlipC said...

It's sad to say that the media seem intent on only giving you the minimum amount of information they think you'll require to form an opinion. So yes a full histogram would be nice, but let's try to ease them along nice and gently :-P

Orphi said...

Even a histogram requires explanation.

OK, so this histogram shows how much people earn. Before or after tax? In which year? Which region(s)? How big a sample? Are the people you surveyed all Daily Mail readers?

Statistics is tricky stuff. It's like that claim “80% of our customers saved money by switching to us!” Wait, WTF? You mean 20% of your customers bothered to switch to you even though it was more expensive?! So… you're telling me only stupid people bring you business?