Calculating Medians With SQL

Given that median is such a valuable statistical measure, it’s baffling that Microsoft’s SQL Server and other relational databases (MySQL, PostgreSQL) don’t have a built-in MEDIAN function. Well, this week, after working through a data set in SQL Server — and deciding I didn’t want to push the data into SPSS to find medians — I hit the web to find a T-SQL workaround.

I found a ton of solutions (some from people with no clue about the difference between median and average), but the one below — adapted from a post by Adam Machanic at sqlblog.com — was the best. It produces accurate results and is fairly speedy to boot.

Here’s an example. Consider this table with student grades from two courses:
.

ID Class FirstName Grade
1 Math Bob 65
2 Math Joe 72
3 Math Sally 95
4 Science Bob 65
5 Science Joe 81
6 Science Sally 81
7 Science Mike 72

We’d like to find the median grade in each class. Here’s the script:

 

 
DECLARE @tmp TABLE (
   ID INT IDENTITY(1,1),
   Class VARCHAR(50),
   FirstName VARCHAR(50),
   Grade DECIMAL(5,1)
)
 
INSERT INTO @tmp (Class, FirstName, Grade)
   VALUES ('Math', 'Bob', 65)
INSERT INTO @tmp (Class, FirstName, Grade)
   VALUES ('Math', 'Joe', 72)
INSERT INTO @tmp (Class, FirstName, Grade)
   VALUES ('Math', 'Sally', 95)
INSERT INTO @tmp (Class, FirstName, Grade)
   VALUES ('Science', 'Bob', 65)
INSERT INTO @tmp (Class, FirstName, Grade)
   VALUES ('Science', 'Joe', 81)
INSERT INTO @tmp (Class, FirstName, Grade)
   VALUES ('Science', 'Sally', 81)
INSERT INTO @tmp (Class, FirstName, Grade)
   VALUES ('Science', 'Mike', 72)</code>
 
SELECT
x.Class,
CAST(AVG(x.Grade) AS DECIMAL(6,1)) AS 'Median'
FROM
   (
     SELECT
     Class,
     FirstName,
     Grade,
     ROW_NUMBER() OVER (
     PARTITION BY Class
     ORDER BY Grade ASC, ID ASC) AS RowAsc,
     ROW_NUMBER() OVER (
     PARTITION BY Class
     ORDER BY Grade DESC, ID DESC) AS RowDesc
     FROM @tmp
   ) AS x
WHERE
RowAsc IN (RowDesc, RowDesc - 1, RowDesc + 1)
GROUP BY x.Class
ORDER BY x.Class

That’s a lot. Let’s break down what’s happening.

First, to set up the data, we’re creating a temporary table and inserting seven rows.

Then comes the main SELECT, which includes a subquery that’s the meat of the goodness behind this. The subquery uses the ROW_NUMBER function to create ascending and descending row identifiers based on the ordering of the grades. It’s easier to visualize if you see what the subquery’s creating:
.

Class FirstName Grade RowAsc RowDesc
Math Sally 95.0 3 1
Math Joe 72.0 2 2
Math Bob 65.0 1 3
Science Sally 81.0 4 1
Science Joe 81.0 3 2
Science Mike 72.0 2 3
Science Bob 65.0 1 4

The RowAsc and RowDesc fields reflect the ordered row numbers of the grades per each class (i.e., partition). Given those, it’s easy to find the median. For the math class, with an odd number of students, the median occurs where RowAsc equals RowDesc. For the science class, with an even number of students, the median is the average of the two grades where RowAsc and RowDesc are within one of each other.

(Whenever you’re looking for the median in an ordered list that has an even number of values, averaging the two middle numbers gives the answer.)

That’s what our main query does when it pulls from the subquery result set. It looks for:
 

RowAsc IN (RowDesc, RowDesc - 1, RowDesc + 1)

and then averages what it finds. The answer is 72 for the math class and 76.5 for science.

This solution works with T-SQL and Microsoft SQL Server. With minor tweaks, it also runs fine in PostgreSQL 8.4, which also implements the ROW_NUMBER function. MySQL does not support that function, so you’ll have to search for another option.

Good stuff and pretty handy for extracting large numbers of median values without having to resort to a stats program.

2 Responses to “Calculating Medians With SQL”

  1. Anthony says:

    Here’s the PostgreSQL code:

    CREATE TEMPORARY TABLE tmp (
       ID SERIAL,
          Class varchar(50),
          FirstName varchar(50),
          Grade decimal(5,1)
    );
    
    INSERT INTO tmp (Class, FirstName, Grade)
          VALUES ('Math', 'Bob', 65);
    INSERT INTO tmp (Class, FirstName, Grade)
          VALUES ('Math', 'Joe', 72);
    INSERT INTO tmp (Class, FirstName, Grade)
          VALUES ('Math', 'Sally', 95);
    INSERT INTO tmp (Class, FirstName, Grade)
          VALUES ('Science', 'Bob', 65);
    INSERT INTO tmp (Class, FirstName, Grade)
          VALUES ('Science', 'Joe', 81);
    INSERT INTO tmp (Class, FirstName, Grade)
          VALUES ('Science', 'Sally', 81);
    INSERT INTO tmp (Class, FirstName, Grade)
          VALUES ('Science', 'Mike', 72);
       
    SELECT
          x.Class,
          CAST(AVG(x.Grade) AS DECIMAL(6,1)) AS Median
    FROM
    (
          SELECT
          Class,
          FirstName,
          Grade,
          ROW_NUMBER() OVER (
                PARTITION BY Class
                ORDER BY Grade ASC, ID ASC) AS RowAsc,
          ROW_NUMBER() OVER (
                PARTITION BY Class
                ORDER BY Grade DESC, ID DESC) AS RowDesc
          FROM tmp
    ) AS x
    WHERE
          RowAsc IN (RowDesc, RowDesc - 1, RowDesc + 1)
    GROUP BY x.Class
    ORDER BY x.Class
  2. Frank says:

    Check the PostgreSQL-wiki as well:
    http://wiki.postgresql.org/wiki/Aggregate_Median

Leave a Reply