‘Practical SQL’ Book in Early Release

My first book, Practical SQL: A Beginner’s Guide to Storytelling with Data, is out in early release from No Starch Press starting today! If you pre-order from No Starch, you can download the Introduction and first four chapters now. You’ll get additional chapters regularly until the final version comes out in February 2018.

Practical SQL is for people who encounter data in their everyday lives and want to know how to analyze or transform it. The book covers real-world data and scenarios, from analyzing U.S. Census demographics to the duration of taxi rides in New York City. I’ve aimed the exercises at beginning SQL coders, and all the code and data can be downloaded via No Starch’s site.

That database you’ll use is the free, open-source PostgreSQL, along with the pgAdmin 4 graphical user interface. We cover all the basics you’ll find in standard ANSI SQL along with PostgreSQL-specific features such as full text search and GIS.

More to come as additional chapters hit early release!

App launch: 2014 elections forecast

Election Forecast


With more than 1,300 candidates, 507 races, top-line campaign finance data and poll averages for select races, the 2014 midterm elections forecast app we launched in early September is probably the most complex mash-up of data, APIs and home-grown content built yet by our Interactive Applications team at Gannett Digital.

We’re happy with the results — even more because the app is live not only at USA TODAY’s desktop and mobile websites but across Gannett. With the rollout of a company-wide web framework this year, we’re able to publish simultaneously to sites ranging from the Indianapolis Star to my alma mater, The Poughkeepsie Journal.

What’s in the forecast? Every U.S. House and Senate race plus the 36 gubernatorial races up in November with bios, photos, total receipts and current poll averages. For each race, USA TODAY’s politics team weighed in on a forecast for how it will likely swing in November. Check out the Iowa Senate for an example of a race detail page.

Finally, depending on whether you open the app with a desktop, tablet or phone, you’ll get a version specifically designed for that device. Mobile-first was our guiding principle.

Building the backend

This was a complex project with heavy lifts both on design/development and data/backend coding. As usual, I handled the data/server side for our team with assists from Sarah Frostenson.

As source data, I used three APIs plus home-grown content:

— The Project Vote Smart API supplies all the candidate names, party affiliations and professional, educational and political experience. Most of the photos are via Vote Smart, though we supplemented where missing.

— The Sunlight Foundation’s Realtime Influence Explorer API supplies total receipts for House and Senate candidates via the Federal Election Commission.

— From Real Clear Politics, we’re fetching polling averages and projections for the House (USAT’s politics team is providing governor and Senate projections).

The route from APIs to the JSON files that fuel the Backbone.js-powered app goes something like this:

  1. Python scrapers fetch data into Postgres, running on an Amazon EC2 Linux box.
  2. A basic Django app lets the USAT politics team write race summaries, projections and other text. Postgres is the DB here also.
  3. Python scripts query Postgres and spits out the JSON files, combining all the data for various views.
  4. We upload those files to a cached file server, so we’re never dynamically hitting a database.

Meanwhile, at the front

Front-end work was a mix of data-viz and app framework lifting. For the maps and balance-of-power bars, Maureen Linke (now at AP) and Amanda Kirby used D3.js. Getting data viz to work well across mobile and desktop is a chore, and Amanda in particular spent a chunk of time getting the polling and campaign finance bar charts to flex well across platforms.

For the app itself, Jon Dang and Rob Berthold — working from a design by Kristin DeRamus — used Backbone.js for URL routing and views. Rob also wrote a custom search tool to let readers quickly find candidates. Everything then was loaded into a basic template in our company CMS.

This one featured a lot of moving parts, and anyone who’s done elections knows there always are the edge cases that make life interesting. In the end, though, I’m proud of what we pulled off — and really happy to serve readers valuable info to help them decide at the polls in November.

Generate JSON From SQL Using Python

Let’s say you want to generate a few hundred — or even a thousand — flat JSON files from a SQL database. Maybe you want to power an interactive graphic but have neither the time nor the desire to spin up a server to dynamically generate the data. Or you think a server adds one more piece of unnecessary complexity and administrative headache. So, you want flat files, each one small for quick loading. And a lot of them.

A few lines of Python is all you need.

I’ve gone this route lately for a few data-driven interactives at USA TODAY, creating JSON files out of large data sets living in SQL Server. Python works well for this, with its JSON encoder/decoder offering a flexible set of tools for converting Python objects to JSON.

Here’s a brief tutorial:

1. If you haven’t already, install Python. Here’s my guide to setup on Windows 7; if you’re on Linux or Mac you should have it already.

2. In your Python script, import a database connector. This example uses pyodbc, which supports connections to SQL Server, MySQL, Microsoft Access and other databases. If you’re using PostgreSQL, try psycopg2.

3. Create a table or tables to query in your SQL database and write and test your query. In this example, I have a table called Students that has a few fields for each student. The query is simple:

SELECT ID, FirstName, LastName, Street, City, ST, Zip
FROM Students

4. Here’s an example script that generates two JSON files from that query. One file contains JSON row arrays, and the other JSON key-value objects. Below, we’ll walk through it step-by-step.

Calculating Medians With SQL

Given that median is such a valuable statistical measure, it’s baffling that Microsoft’s SQL Server and other relational databases (MySQL, PostgreSQL) don’t have a built-in MEDIAN function. Well, this week, after working through a data set in SQL Server — and deciding I didn’t want to push the data into SPSS to find medians — I hit the web to find a T-SQL workaround.

I found a ton of solutions (some from people with no clue about the difference between median and average), but the one below — adapted from a post by Adam Machanic at sqlblog.com — was the best. It produces accurate results and is fairly speedy to boot.

Here’s an example. Consider this table with student grades from two courses:

ID Class FirstName Grade
1 Math Bob 65
2 Math Joe 72
3 Math Sally 95
4 Science Bob 65
5 Science Joe 81
6 Science Sally 81
7 Science Mike 72

We’d like to find the median grade in each class. Here’s the script:

Essential SQL Queries (To Me, At Least)

For a session of five-minute “lightning talks” at this week’s 2010 Investigative Reporters and Editors conference in Phoenix, I contributed “Five Essential Queries for SQL Server.” Aside from the basic SELECT statement, these are five techniques that, at least for me, either solved a tricky problem or made coding life more efficient. They came to me after some trial and error or from using the coder’s best friend, Google.

I realize that many journalists prefer the open source (free) MySQL to Microsoft’s product, so I’ve replicated the five queries below in MySQL syntax. You can download script files for either syntax here:

Five essential queries (MS SQL Server)
Five essential queries (MySQL)

Feedback and your ideas are welcome. Here they are:

1. Create a temporary table with identity column.
Temp tables are handy for storing and manipulating data when you need a table but don’t want to make it part of your actual schema. In SQL Server, the table variable is held in memory and disappears once the query finishes executing.

   id INT IDENTITY(1,1), 
   FirstName VARCHAR(50)
INSERT INTO @tmp (FirstName) VALUES ('Bob')
INSERT INTO @tmp (FirstName) VALUES ('Joe')
INSERT INTO @tmp (FirstName) VALUES ('Sally')