Calculating Medians With SQL

Given that median is such a valuable statistical measure, it’s baffling that Microsoft’s SQL Server and other relational databases (MySQL, PostgreSQL) don’t have a built-in MEDIAN function. Well, this week, after working through a data set in SQL Server — and deciding I didn’t want to push the data into SPSS to find medians — I hit the web to find a T-SQL workaround.

I found a ton of solutions (some from people with no clue about the difference between median and average), but the one below — adapted from a post by Adam Machanic at sqlblog.com — was the best. It produces accurate results and is fairly speedy to boot.

Here’s an example. Consider this table with student grades from two courses:
.

IDClassFirstNameGrade
1MathBob65
2MathJoe72
3MathSally95
4ScienceBob65
5ScienceJoe81
6ScienceSally81
7ScienceMike72

We’d like to find the median grade in each class. Here’s the script:

 

 
DECLARE @tmp TABLE (
   ID INT IDENTITY(1,1),
   Class VARCHAR(50),
   FirstName VARCHAR(50),
   Grade DECIMAL(5,1)
)
 
INSERT INTO @tmp (Class, FirstName, Grade)
   VALUES ('Math', 'Bob', 65)
INSERT INTO @tmp (Class, FirstName, Grade)
   VALUES ('Math', 'Joe', 72)
INSERT INTO @tmp (Class, FirstName, Grade)
   VALUES ('Math', 'Sally', 95)
INSERT INTO @tmp (Class, FirstName, Grade)
   VALUES ('Science', 'Bob', 65)
INSERT INTO @tmp (Class, FirstName, Grade)
   VALUES ('Science', 'Joe', 81)
INSERT INTO @tmp (Class, FirstName, Grade)
   VALUES ('Science', 'Sally', 81)
INSERT INTO @tmp (Class, FirstName, Grade)
   VALUES ('Science', 'Mike', 72)</code>
 
SELECT
x.Class,
CAST(AVG(x.Grade) AS DECIMAL(6,1)) AS 'Median'
FROM
   (
     SELECT
     Class,
     FirstName,
     Grade,
     ROW_NUMBER() OVER (
     PARTITION BY Class
     ORDER BY Grade ASC, ID ASC) AS RowAsc,
     ROW_NUMBER() OVER (
     PARTITION BY Class
     ORDER BY Grade DESC, ID DESC) AS RowDesc
     FROM @tmp
   ) AS x
WHERE
RowAsc IN (RowDesc, RowDesc - 1, RowDesc + 1)
GROUP BY x.Class
ORDER BY x.Class

That’s a lot. Let’s break down what’s happening.

First, to set up the data, we’re creating a temporary table and inserting seven rows.

Then comes the main SELECT, which includes a subquery that’s the meat of the goodness behind this. The subquery uses the ROW_NUMBER function to create ascending and descending row identifiers based on the ordering of the grades. It’s easier to visualize if you see what the subquery’s creating:
.

ClassFirstNameGradeRowAscRowDesc
MathSally95.031
MathJoe72.022
MathBob65.013
ScienceSally81.041
ScienceJoe81.032
ScienceMike72.023
ScienceBob65.014

The RowAsc and RowDesc fields reflect the ordered row numbers of the grades per each class (i.e., partition). Given those, it’s easy to find the median. For the math class, with an odd number of students, the median occurs where RowAsc equals RowDesc. For the science class, with an even number of students, the median is the average of the two grades where RowAsc and RowDesc are within one of each other.

(Whenever you’re looking for the median in an ordered list that has an even number of values, averaging the two middle numbers gives the answer.)

That’s what our main query does when it pulls from the subquery result set. It looks for:
 

RowAsc IN (RowDesc, RowDesc - 1, RowDesc + 1)

and then averages what it finds. The answer is 72 for the math class and 76.5 for science.

This solution works with T-SQL and Microsoft SQL Server. With minor tweaks, it also runs fine in PostgreSQL 8.4, which also implements the ROW_NUMBER function. MySQL does not support that function, so you’ll have to search for another option.

Good stuff and pretty handy for extracting large numbers of median values without having to resort to a stats program.

2 comments on “Calculating Medians With SQL” »

  1. Here’s the PostgreSQL code:

    CREATE TEMPORARY TABLE tmp (
       ID SERIAL,
        Class varchar(50),
        FirstName varchar(50),
        Grade decimal(5,1)
    );

    INSERT INTO tmp (Class, FirstName, Grade)
        VALUES (‘Math’, ‘Bob’, 65);
    INSERT INTO tmp (Class, FirstName, Grade)
        VALUES (‘Math’, ‘Joe’, 72);
    INSERT INTO tmp (Class, FirstName, Grade)
        VALUES (‘Math’, ‘Sally’, 95);
    INSERT INTO tmp (Class, FirstName, Grade)
        VALUES (‘Science’, ‘Bob’, 65);
    INSERT INTO tmp (Class, FirstName, Grade)
        VALUES (‘Science’, ‘Joe’, 81);
    INSERT INTO tmp (Class, FirstName, Grade)
        VALUES (‘Science’, ‘Sally’, 81);
    INSERT INTO tmp (Class, FirstName, Grade)
        VALUES (‘Science’, ‘Mike’, 72);

    SELECT
        x.Class,
        CAST(AVG(x.Grade) AS DECIMAL(6,1)) AS Median
    FROM
    (
        SELECT
        Class,
        FirstName,
        Grade,
        ROW_NUMBER() OVER (
           PARTITION BY Class
           ORDER BY Grade ASC, ID ASC) AS RowAsc,
        ROW_NUMBER() OVER (
           PARTITION BY Class
           ORDER BY Grade DESC, ID DESC) AS RowDesc
        FROM tmp
    ) AS x
    WHERE
        RowAsc IN (RowDesc, RowDesc – 1, RowDesc + 1)
    GROUP BY x.Class
    ORDER BY x.Class

  2. Check the PostgreSQL-wiki as well:
    http://wiki.postgresql.org/wiki/Aggregate_Median



Comments? Questions?