Calculating Medians With SQL
Given that median is such a valuable statistical measure, it’s baffling that Microsoft’s SQL Server and other relational databases (MySQL, PostgreSQL) don’t have a built-in MEDIAN function. Well, this week, after working through a data set in SQL Server — and deciding I didn’t want to push the data into SPSS to find medians — I hit the web to find a T-SQL workaround.
I found a ton of solutions (some from people with no clue about the difference between median and average), but the one below — adapted from a post by Adam Machanic at sqlblog.com — was the best. It produces accurate results and is fairly speedy to boot.
Here’s an example. Consider this table with student grades from two courses:
We’d like to find the median grade in each class. Here’s the script:
DECLARE @tmp TABLE ( ID INT IDENTITY(1,1), Class VARCHAR(50), FirstName VARCHAR(50), Grade DECIMAL(5,1) ) INSERT INTO @tmp (Class, FirstName, Grade) VALUES ('Math', 'Bob', 65) INSERT INTO @tmp (Class, FirstName, Grade) VALUES ('Math', 'Joe', 72) INSERT INTO @tmp (Class, FirstName, Grade) VALUES ('Math', 'Sally', 95) INSERT INTO @tmp (Class, FirstName, Grade) VALUES ('Science', 'Bob', 65) INSERT INTO @tmp (Class, FirstName, Grade) VALUES ('Science', 'Joe', 81) INSERT INTO @tmp (Class, FirstName, Grade) VALUES ('Science', 'Sally', 81) INSERT INTO @tmp (Class, FirstName, Grade) VALUES ('Science', 'Mike', 72)</code> SELECT x.Class, CAST(AVG(x.Grade) AS DECIMAL(6,1)) AS 'Median' FROM ( SELECT Class, FirstName, Grade, ROW_NUMBER() OVER ( PARTITION BY Class ORDER BY Grade ASC, ID ASC) AS RowAsc, ROW_NUMBER() OVER ( PARTITION BY Class ORDER BY Grade DESC, ID DESC) AS RowDesc FROM @tmp ) AS x WHERE RowAsc IN (RowDesc, RowDesc - 1, RowDesc + 1) GROUP BY x.Class ORDER BY x.Class
That’s a lot. Let’s break down what’s happening.
First, to set up the data, we’re creating a temporary table and inserting seven rows.
Then comes the main SELECT, which includes a subquery that’s the meat of the goodness behind this. The subquery uses the ROW_NUMBER function to create ascending and descending row identifiers based on the ordering of the grades. It’s easier to visualize if you see what the subquery’s creating:
The RowAsc and RowDesc fields reflect the ordered row numbers of the grades per each class (i.e., partition). Given those, it’s easy to find the median. For the math class, with an odd number of students, the median occurs where RowAsc equals RowDesc. For the science class, with an even number of students, the median is the average of the two grades where RowAsc and RowDesc are within one of each other.
(Whenever you’re looking for the median in an ordered list that has an even number of values, averaging the two middle numbers gives the answer.)
That’s what our main query does when it pulls from the subquery result set. It looks for:
RowAsc IN (RowDesc, RowDesc - 1, RowDesc + 1)
and then averages what it finds. The answer is 72 for the math class and 76.5 for science.
This solution works with T-SQL and Microsoft SQL Server. With minor tweaks, it also runs fine in PostgreSQL 8.4, which also implements the ROW_NUMBER function. MySQL does not support that function, so you’ll have to search for another option.
Good stuff and pretty handy for extracting large numbers of median values without having to resort to a stats program.