WEBVTT
1
00:00:00.000 --> 00:00:04.960
Alright, the five number summary, what's it all about?
2
00:00:04.960 --> 00:00:11.080
This extends the median and range that we looked at in a previous module one stage further.
3
00:00:11.080 --> 00:00:14.400
Let's just recall what the median's all about.
4
00:00:14.400 --> 00:00:18.720
Here I've got some figures on 10 patients who've had their blood pressure measured while
5
00:00:18.720 --> 00:00:21.880
they were resting, which is what supine means.
6
00:00:21.880 --> 00:00:26.160
And as you can see, there's a range of numbers, and I've put them in order of size for you.
7
00:00:26.160 --> 00:00:31.320
To find the median in this set of data, because it's ten numbers, you have to find the middle
8
00:00:31.320 --> 00:00:34.960
two and then you add them up and divide by two.
9
00:00:34.960 --> 00:00:40.920
So the median is 120.5 millimeters of mercury, which is the units of blood pressure.
10
00:00:40.920 --> 00:00:47.800
There is a more systematic way of knowing where to find the median in a set of numbers.
11
00:00:47.800 --> 00:00:51.640
You basically just add one to the number of results.
12
00:00:51.640 --> 00:00:57.680
So 10 plus 1 is 11. You divide the answer by two and then the answer to that sum tells
13
00:00:57.680 --> 00:01:00.120
you the position of the median.
14
00:01:00.120 --> 00:01:04.520
So 10 plus 1 is 11; 11 divided by 2 is 5.5.
15
00:01:04.520 --> 00:01:09.760
And we interpret the 5.5 number as being halfway between the fifth and sixth number, which
16
00:01:09.760 --> 00:01:12.560
means that sum we did before.
17
00:01:12.560 --> 00:01:16.080
That's a way of finding the median by procedure, by recipe.
18
00:01:16.080 --> 00:01:19.880
When you've got like hundreds of results, you tend to use that method rather than trying
19
00:01:19.880 --> 00:01:23.200
to count through them and find the middle.
20
00:01:23.200 --> 00:01:25.760
Now in the case with nine patients, it still works.
21
00:01:25.760 --> 00:01:27.160
The same rule still works.
22
00:01:27.160 --> 00:01:28.640
Add one to the number of results.
23
00:01:28.640 --> 00:01:30.520
9 plus 1 is 10.
24
00:01:30.520 --> 00:01:33.720
Divide the answer by two, 10 divided by 2 is 5.
25
00:01:33.720 --> 00:01:37.360
Well, the fifth number is the median and there we go.
26
00:01:37.360 --> 00:01:40.920
So that's kind of easy to spot when it's a small number of results, but it's quite hard
27
00:01:40.920 --> 00:01:42.400
when you get hundreds.
28
00:01:42.400 --> 00:01:46.200
So that's the median of an odd number of results.
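
NOTE
The add-one-then-halve rule described above can be sketched in Python, roughly as follows.
This is only an illustration: the function name median_by_position is hypothetical, and the
readings are made up (they are not the slide data), chosen so the ten-value median comes out at 120.5.
```python
def median_by_position(sorted_values):
    # Position rule from the video: add one to the number of results, then halve.
    n = len(sorted_values)
    position = (n + 1) / 2      # 1-based position; ends in .5 when n is even
    lower = int(position)       # whole-number part, e.g. 5 for 5.5
    if position == lower:
        # Odd count: the position lands exactly on one value.
        return sorted_values[lower - 1]
    # Even count: take the value halfway between the two neighbours.
    return (sorted_values[lower - 1] + sorted_values[lower]) / 2
# Hypothetical, already sorted readings (assumed for illustration only).
ten = [108, 112, 116, 118, 120, 121, 121, 122, 126, 134]
nine = ten[:9]
print(median_by_position(ten))   # position 5.5, halfway between 120 and 121
print(median_by_position(nine))  # position 5, the fifth value
```
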
29
00:01:46.200 --> 00:01:48.280
Now we're going to extend the idea.
30
00:01:48.280 --> 00:01:55.520
The five numbers are the maximum, the minimum, the median, and then two new kinds of number.
31
00:01:55.520 --> 00:01:57.200
Let's have a look at those.
32
00:01:57.200 --> 00:01:59.920
Here's my data set again, my 10 patients.
33
00:01:59.920 --> 00:02:03.840
I've divided them into two halves, the bottom half and the top half.
34
00:02:03.840 --> 00:02:06.640
Bottom half's got five results, top half's got five results.
35
00:02:06.640 --> 00:02:09.880
The dividing line is the median, of course.
36
00:02:09.880 --> 00:02:13.320
We can find the median of the bottom half.
37
00:02:13.320 --> 00:02:14.320
That's quite easy.
38
00:02:14.320 --> 00:02:19.120
You know, 5 plus 1 is 6, 6 divided by 2 is 3, so it's the third number.
39
00:02:19.120 --> 00:02:26.400
We can also find the median of the top half by the same principle: the third number in the top half.
40
00:02:26.400 --> 00:02:29.760
Those are actually given names: they're called the quartiles.
41
00:02:29.760 --> 00:02:31.480
Why quartile?
42
00:02:31.480 --> 00:02:36.040
Because those numbers divide the data set into quarters if you think about it.
43
00:02:36.040 --> 00:02:42.800
So the lower quartile, 25% of the results are less than that or equal to it and 75% of
44
00:02:42.800 --> 00:02:45.680
the results are larger.
45
00:02:45.680 --> 00:02:49.680
The 122 millimeters of mercury is the upper quartile.
46
00:02:49.680 --> 00:02:55.920
So that's where 75% of people are less and 25% of people are more.
47
00:02:55.920 --> 00:02:58.200
Now those values are pretty stable.
48
00:02:58.200 --> 00:03:04.280
Suppose we have somebody with massive hypertension included in this group and they had an enormous
49
00:03:04.280 --> 00:03:06.080
systolic blood pressure.
50
00:03:06.080 --> 00:03:09.600
That wouldn't affect the quartiles that much, but it would affect the maximum in the
51
00:03:09.600 --> 00:03:14.480
data set quite a lot, we'll come back to that later on.
52
00:03:14.480 --> 00:03:19.800
So to label up my numbers, I've got my lower quartile, the median and the upper quartile,
53
00:03:19.800 --> 00:03:22.800
and we can of course include the minimum and the maximum.
54
00:03:22.800 --> 00:03:24.800
And that's your five number summary.
55
00:03:24.800 --> 00:03:31.520
So to recap, you've got the minimum value, the maximum value, you've got the lower quartile
56
00:03:31.520 --> 00:03:34.920
and the upper quartile and you've got the median.
57
00:03:34.920 --> 00:03:40.280
See how the quartiles have two different kinds of abbreviation. LQ and UQ are the common
58
00:03:40.280 --> 00:03:44.560
sense ones that you'll find in most GCSE textbooks.
59
00:03:44.560 --> 00:03:48.000
More advanced textbooks tend to use Q1 and Q3.
60
00:03:48.000 --> 00:03:52.040
Q1 is the first quartile and Q3 is the third quartile.
61
00:03:52.040 --> 00:03:57.760
Logically, the median is actually the second quartile and you may manage to find a book that
62
00:03:57.760 --> 00:04:01.160
refers to it as that, but it's very unusual.
63
00:04:01.160 --> 00:04:08.000
Now, here's one for you to try: an even number of data items, heights of children. Just stop
64
00:04:08.000 --> 00:04:11.280
the video, take a couple of seconds and see if you can work it out.
65
00:04:11.280 --> 00:04:15.720
Again, I've put the numbers in order of size for you.
66
00:04:15.720 --> 00:04:28.320
Okay, I got 156 and 157 because it's an even number of data sets, even number of results.
67
00:04:28.320 --> 00:04:35.640
Now add those up, oh, my lower quartile is 153, the upper quartile is 163 and I pressed
68
00:04:35.640 --> 00:04:38.640
the button too quick and got two slides at once.
69
00:04:38.640 --> 00:04:43.560
My bottom number is 144 and my top number is 172.
70
00:04:43.560 --> 00:04:50.440
The median is actually 156.5 because it's halfway between the two data items, halfway between
71
00:04:50.440 --> 00:04:54.640
the two data values because it's an even number of data.
72
00:04:54.640 --> 00:04:55.640
Okay.
73
00:04:55.640 --> 00:05:00.880
Now, we do have a problem, what happens when you start off with an odd number of values?
74
00:05:00.880 --> 00:05:03.040
What do you do about the median?
75
00:05:03.040 --> 00:05:08.160
Let's have a look at a data set here with an odd number of results in it.
76
00:05:08.160 --> 00:05:10.960
It's actually got 13 heights.
77
00:05:10.960 --> 00:05:13.160
Okay, of children.
78
00:05:13.160 --> 00:05:21.600
Now, the median is at position 13 plus 1 divided by 2, the seventh number in the list, so 156.
79
00:05:21.600 --> 00:05:27.880
Some textbooks will find the quartiles by excluding the median.
80
00:05:27.880 --> 00:05:32.720
They'll define the bottom half and they'll define the top half.
81
00:05:32.720 --> 00:05:36.560
The bottom half in that definition has six numbers.
82
00:05:36.560 --> 00:05:43.400
So the median of the bottom half or the lower quartile of the whole data set will be the
83
00:05:43.400 --> 00:05:47.680
average, the mean of the third and fourth numbers.
84
00:05:47.680 --> 00:05:52.560
In fact, they're actually both the same, so it'll be 153 centimeters.
85
00:05:52.560 --> 00:05:59.560
The median of the top half will equally be halfway between 163 and 163, so that's going
86
00:05:59.560 --> 00:06:05.960
to be 163 because by sheer coincidence in this data set, they happen to be the same numbers.
87
00:06:05.960 --> 00:06:11.680
Okay, now, the other convention that some other textbooks use is to include the median
88
00:06:11.680 --> 00:06:13.760
in both halves of the data set.
89
00:06:13.760 --> 00:06:18.120
That's actually the convention that I'm going to use because it helps when you've got quite
90
00:06:18.120 --> 00:06:19.400
small data sets.
91
00:06:19.400 --> 00:06:21.720
It helps even things out a bit.
92
00:06:21.720 --> 00:06:25.800
The difference between that and the other method gets less important when you've got large
93
00:06:25.800 --> 00:06:27.320
data sets.
94
00:06:27.320 --> 00:06:34.600
So now the bottom half has seven numbers in it, so the median is the fourth number and
95
00:06:34.600 --> 00:06:36.680
that's 153.
96
00:06:36.680 --> 00:06:44.800
The top half has seven numbers in it and the median of that one is 163.
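
NOTE
The two textbook conventions just described can be compared with a short Python sketch.
The function name five_number_summary and its include_median flag are hypothetical, and the
13 heights are made up: they are assumed only so that the results match the figures quoted
in the video (144, 153, 156, 163, 168), and they are not the actual slide data.
```python
from statistics import median
def five_number_summary(values, include_median=True):
    data = sorted(values)
    n = len(data)
    half = n // 2
    if n % 2 == 0:
        # Even count: the data splits cleanly into two halves.
        lower, upper = data[:half], data[half:]
    elif include_median:
        # Odd count, inclusive convention: the median sits in both halves.
        lower, upper = data[:half + 1], data[half:]
    else:
        # Odd count, exclusive convention: the median is left out of both halves.
        lower, upper = data[:half], data[half + 1:]
    # Order: minimum, lower quartile, median, upper quartile, maximum.
    return (data[0], median(lower), median(data), median(upper), data[-1])
# Hypothetical heights in centimetres (assumed for illustration only).
heights = [144, 150, 153, 153, 154, 155, 156, 158, 160, 163, 163, 165, 168]
print(five_number_summary(heights, include_median=True))
print(five_number_summary(heights, include_median=False))
```
For this particular list both conventions agree; with other small data sets they can differ.
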
97
00:06:44.800 --> 00:06:47.680
Interpreting the five number summary, just a few suggestions.
98
00:06:47.680 --> 00:06:49.000
What does it all mean?
99
00:06:49.000 --> 00:06:51.520
We'll spend a lot more time on this later on.
100
00:06:51.520 --> 00:06:55.560
Well, here's the summary for the height data set you've just seen.
101
00:06:55.560 --> 00:06:59.240
Here's a little table organised with the five numbers in it.
102
00:06:59.240 --> 00:07:01.240
The first thing you can do is work out the range.
103
00:07:01.240 --> 00:07:07.640
The range is the max minus the minimum value, which I make to be 24 centimetres.
104
00:07:07.640 --> 00:07:09.440
That's quite unstable.
105
00:07:09.440 --> 00:07:16.640
It would only take the 168 centimetre child to be replaced by a basketball player of 200 centimetres
106
00:07:16.640 --> 00:07:19.160
and that range figure would more than double.
107
00:07:19.160 --> 00:07:23.880
So that's not a particularly good measure of how spread out your data is.
108
00:07:23.880 --> 00:07:29.280
A better one has the resplendent title of interquartile range.
109
00:07:29.280 --> 00:07:33.560
So that's the difference between the third quartile and the first quartile.
110
00:07:33.560 --> 00:07:37.080
Range is always used in statistics to mean the difference between a large and a small
111
00:07:37.080 --> 00:07:38.080
number.
112
00:07:38.080 --> 00:07:40.080
So that one is 10 centimetres.
113
00:07:40.080 --> 00:07:41.960
That's a much more stable measurement.
114
00:07:41.960 --> 00:07:46.760
You can add quite a few basketball players before the upper quartile starts to change.
115
00:07:46.760 --> 00:07:51.640
That's a more meaningful measure of how spread out your data is.
116
00:07:51.640 --> 00:07:58.160
It's also easier to visualize because the range, if you like, between the upper quartile
117
00:07:58.160 --> 00:08:02.160
and the lower quartile includes the middle half of your group.
118
00:08:02.160 --> 00:08:08.280
So you know that the middle half of your group varies by 10 centimetres.
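
NOTE
The stability argument above can be checked with a small Python sketch. The function name
spread is hypothetical, the quartiles use the inclusive convention adopted in the video, and
the heights are made up to match the quoted summary figures (they are not the slide data).
```python
from statistics import median
def spread(values):
    data = sorted(values)
    n = len(data)
    # Inclusive convention: for an odd count the median belongs to both halves.
    lower = data[:(n + 1) // 2]
    upper = data[n // 2:]
    data_range = data[-1] - data[0]
    interquartile_range = median(upper) - median(lower)
    return data_range, interquartile_range
# Hypothetical heights in centimetres (assumed for illustration only).
heights = [144, 150, 153, 153, 154, 155, 156, 158, 160, 163, 163, 165, 168]
# Swap the tallest child for a 200 cm basketball player.
with_outlier = heights[:-1] + [200]
print(spread(heights))       # range 24, interquartile range 10
print(spread(with_outlier))  # range 56, interquartile range still 10
```
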
119
00:08:08.280 --> 00:08:10.600
Now the median is 156.
120
00:08:10.600 --> 00:08:12.560
That's a pretty typical result.
121
00:08:12.560 --> 00:08:14.840
That's a good typical result for your data.
122
00:08:14.840 --> 00:08:21.440
But notice how it's much nearer to the first quartile than it is to the third quartile,
123
00:08:21.440 --> 00:08:24.000
Q3, the upper quartile.
124
00:08:24.000 --> 00:08:29.200
That tells you something about the shape of the distribution of the heights.
125
00:08:29.200 --> 00:08:33.160
Now later on we're going to look at something called the box and whisker plot.
126
00:08:33.160 --> 00:08:38.040
That thing that looks a bit like a hypodermic syringe is called a box and whisker plot.
127
00:08:38.040 --> 00:08:42.160
As you might have guessed, the lines on the outer part of the diagram, the two vertical
128
00:08:42.160 --> 00:08:48.600
lines at just about 144 and just about 168 are the minimum and the maximum.
129
00:08:48.600 --> 00:08:55.880
The box in the middle, the red box, the lower part of the red box is the first quartile.
130
00:08:55.880 --> 00:09:01.360
The end of the red box is the upper quartile or third quartile and the line in the middle
131
00:09:01.360 --> 00:09:02.360
is the median.
132
00:09:02.360 --> 00:09:06.640
That's a very simple visual representation of the distribution of your data.
133
00:09:06.640 --> 00:09:10.040
We'll look at drawing one of those in a bit.
134
00:09:10.040 --> 00:09:11.040
Your turn now.
135
00:09:11.040 --> 00:09:15.760
I want you to go and find a dataset with 12 or 16 or some other multiple of four values
136
00:09:15.760 --> 00:09:17.160
and find the five number summary.
137
00:09:17.160 --> 00:09:19.720
You'll see what I'm on about when you do it.
138
00:09:19.720 --> 00:09:24.240
Then find a dataset with an even number of results which has an odd number in each half and then
139
00:09:24.240 --> 00:09:27.520
find the five number summary of that one.
140
00:09:27.520 --> 00:09:31.320
Then finally, if you're in for a challenge, can you make up a dataset?
141
00:09:31.320 --> 00:09:32.320
Just make one up.
142
00:09:32.320 --> 00:09:33.640
Get a spreadsheet and make one up.
143
00:09:33.640 --> 00:10:03.600
With 12 numbers, where the median is exactly twice the mean.