Exploring Relationships with the Census

The folks at the Knight/Mozilla OpenNews Source blog recently asked me to write about a Census topic of my choosing, and I chose to focus on a lesser-traveled piece of Census data: relationships.

The post, Understanding Households and Relationships in Census Data, walks through the definitions the Census Bureau uses for householders and relatives, how it asks the questions and tabulates the results, and some of the key tables that report the data. Thanks to the OpenNews team for letting me dust off my Census know-how!

I’m joining DocumentCloud!

Career news! I’ve been named Director of Product Development for DocumentCloud, the open source tool that hundreds of newsrooms worldwide use to catalog, analyze and publish PDF files and other documents. The platform — created via a Knight News Challenge grant — is now part of the non-profit Investigative Reporters and Editors, which in turn is housed in the Missouri School of Journalism at the University of Missouri.

In my role, I’ll work with an expanded DocumentCloud team and advisory board to improve the basic platform and add premium features. Support for this effort comes via a grant from the John S. and James L. Knight Foundation, announced last summer.

I’m super excited about the project. I’ve been a long-time fan of DocumentCloud — at USA TODAY, we used it extensively, for example, in Ghost Factories and other investigative projects — and it’s become an indispensable tool. Beyond that, I have been involved with IRE via conferences, bootcamps and teaching for many years, and it’s an honor to join the staff.

More news on the project to come!



Enter the Rift: Taking journalism to VR

As I write, my voice is hoarse from three days showing Harvest of Change — a Des Moines Register/Gannett Digital series that used the Oculus Rift and 360-degree video — to hundreds of journalists at the Online News Association conference in Chicago.

The demos capped a two-week sprint that included a media day in New York City, publishing five versions of the software and then catching some media buzz, which alternately praised and scoffed at the effort. Such whirlwinds are fleeting, but highlights are milestones. So, while it’s fresh, here’s a recap.

First, a scene from the Midway at ONA:

That’s Rosental Alves, director of the Knight Center for Journalism in the Americas at the University of Texas at Austin, trying out the project. We set up three Oculus workstations, and for three days the chairs were rarely empty. On the last day, as we packed up, we figured between 400 and 500 people had tried it.

Most people came out of curiosity, or with skepticism, but left impressed. Some were compelled by Amy Webb, who said in a Saturday ONA session that our experience was a must-see. Apparently, we even made the unofficial ONA bingo card.

The story behind this story

The project came together over the summer. When I wasn’t writing backend code for an election forecast, I was heading a small team visiting the dusty back roads of Iowa, both in person and in the Oculus headset. Lots has been written about the Oculus Rift, especially since its $2 billion acquisition by Facebook, but the focus so far has been on gaming. After journalism innovation professor Dan Pacheco of Syracuse University introduced us to the Rift, though, Gannett Digital decided to build its first VR explanatory journalism project. Continue…

App launch: 2014 elections forecast

Election Forecast


With more than 1,300 candidates, 507 races, top-line campaign finance data and poll averages for select races, the 2014 midterm elections forecast app we launched in early September is probably the most complex mash-up of data, APIs and home-grown content yet built by our Interactive Applications team at Gannett Digital.

We’re happy with the results — even more because the app is live not only at USA TODAY’s desktop and mobile websites but across Gannett. With the rollout of a company-wide web framework this year, we’re able to publish simultaneously to sites ranging from the Indianapolis Star to my alma mater, The Poughkeepsie Journal.

What’s in the forecast? Every U.S. House and Senate race plus the 36 gubernatorial races up in November with bios, photos, total receipts and current poll averages. For each race, USA TODAY’s politics team weighed in on a forecast for how it will likely swing in November. Check out the Iowa Senate for an example of a race detail page.

Finally, depending on whether you open the app with a desktop, tablet or phone, you’ll get a version specifically designed for that device. Mobile-first was our guiding principle.

Building the backend

This was a complex project with heavy lifts both on design/development and data/backend coding. As usual, I handled the data/server side for our team with assists from Sarah Frostenson.

As source data, I used three APIs plus home-grown content:

— The Project Vote Smart API supplies all the candidate names, party affiliations and professional, educational and political experience. Most of the photos are via Vote Smart, though we supplemented where missing.

— The Sunlight Foundation’s Realtime Influence Explorer API supplies total receipts for House and Senate candidates via the Federal Election Commission.

— From Real Clear Politics, we’re fetching polling averages and projections for the House (USAT’s politics team is providing governor and Senate projections).

The route from APIs to the JSON files that fuel the Backbone.js-powered app goes something like this:

  1. Python scrapers fetch data into Postgres, running on an Amazon EC2 Linux box.
  2. A basic Django app lets the USAT politics team write race summaries, projections and other text. Postgres is the DB here also.
  3. Python scripts query Postgres and spit out the JSON files, combining all the data for various views.
  4. We upload those files to a cached file server, so we’re never dynamically hitting a database.
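
The steps above can be sketched in miniature. Here's a simplified illustration of step 3, using Python's built-in sqlite3 in place of Postgres; the table and column names are invented for illustration, not the app's actual schema:

```python
import json
import sqlite3

# Stand-in database: sqlite3 instead of Postgres, with an invented
# schema -- one row per race
conn = sqlite3.connect(':memory:')
cur = conn.cursor()
cur.execute('CREATE TABLE races (state TEXT, office TEXT, forecast TEXT)')
cur.execute("INSERT INTO races VALUES ('IA', 'Senate', 'Tossup')")
conn.commit()

# Combine query results into a list of dicts, then dump to JSON --
# the flat file the front end reads, so no live database calls
cur.execute('SELECT state, office, forecast FROM races')
races = [
    {'state': state, 'office': office, 'forecast': forecast}
    for state, office, forecast in cur.fetchall()
]
with open('races.json', 'w') as outfile:
    json.dump(races, outfile)
conn.close()
```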

Meanwhile, at the front

Front-end work was a mix of data-viz and app framework lifting. For the maps and balance-of-power bars, Maureen Linke (now at AP) and Amanda Kirby used D3.js. Getting data viz to work well across mobile and desktop is a chore, and Amanda in particular spent a chunk of time getting the polling and campaign finance bar charts to flex well across platforms.

For the app itself, Jon Dang and Rob Berthold — working from a design by Kristin DeRamus — used Backbone.js for URL routing and views. Rob also wrote a custom search tool to let readers quickly find candidates. Everything then was loaded into a basic template in our company CMS.

This one featured a lot of moving parts, and anyone who’s done elections knows there are always edge cases that make life interesting. In the end, though, I’m proud of what we pulled off — and really happy to serve readers valuable info to help them decide at the polls in November.

Updates from the Lands of Life & Work

Apologies for the lengthy radio silence. It’s been a busy and complicated couple of months — so busy that I never did write the 2013 year-end wrap I’d planned. Life and work served up some changes from the predictable, and writing fell off the table. A dose of reality.

In the past, each of these nuggets might have been posts of their own, but to get caught up here’s a mix of work and life highlights in the old USA TODAY Newsline format:

Mass killings interactive: After a year-long effort, last December we published an immersive data viz called Behind the Bloodshed: The Untold Story of America’s Mass Killings. Inspired by the events surrounding the Newtown, Conn., school shooting, it lays out the facts about mass killings over the last 8+ years: They happen often and are most often the result of family issues. My team at Gannett Digital collaborated with USA TODAY’s database team, and a post I wrote for Knight-Mozilla OpenNews’ Source blog explains our tech and process. We and our readers were super-happy with the results. We won the journalistic innovation category of the National Headliner Awards and made the short-lists for the Data Journalism Awards and the Online Media Awards.



NICAR 2014: The annual IRE data journalism conference, held in Baltimore this year, was great. About 1,000 attendees made for the largest turnout ever, and a “Getting Started With Python” session I taught was packed (here’s the Github repo). Highlights always include catching up with friends and colleagues, and as usual I focused on sessions with practical takeaways, such as learning more about d3.js and Twitter bots. Chrys Wu, as always, rounds up everything at her site. Next, I’m hoping to catch the IRE conference in San Francisco in June.

Relaunching our platform: For the last four months at work, I’ve taken a detour away from interactives to help our team that’s extending our publishing platform across all our community news and TV station properties. In short, versions of the complete makeover USA TODAY got in 2012 are now appearing on sites ranging from the Indianapolis Star to Denver’s KUSA-TV. It’s more than cosmetic, though, as Gannett Digital’s also moving all the sites to a shared CMS and Django- and Backbone-powered UX. In addition to desktop, there’s all-new mobile web, Android, iPad and iPhone apps. It’s been tiring but rewarding. In the process of personally launching the Wilmington News Journal, Springfield News-Leader, Montgomery Advertiser and several other sites, I’ve gained a better view of the breadth of Gannett’s journalism and found some great opportunities for collaboration.

Other cool work things: While I was relaunching websites, the rest of our interactives team collaborated with USA TODAY’s Brad Heath on his project exploring how felons can escape justice by crossing state lines. I’ve started refactoring the scraper behind our tropical storm tracker to get it ready for the upcoming season. We’ve been bringing Mapbox training to our newsrooms, which has given me the chance to finally dig deeper into TileMill and the Mapbox API. And you might have heard we have some big elections coming up in November. Finally, I recently tried both Google Glass and the Oculus Rift. Check back in five years on whether they’ve changed/saved journalism, but overall the experience reminded me of how I felt when I began using a web browser. Clunky but filled with potential.

Family life: The biggest event of the last while was another detour. At the end of 2013, my wife was hit by a devastating illness that required a lengthy hospital stay and convalescence. In the interests of privacy, I’m not going to post details. But it’s true that these events are life-changers — sitting in the hospital ICU with a first-row seat to a life-and-death drama changes perspectives and priorities quickly. I am sure the lessons we learned from and about people and life will play a big role in how the rest of 2014 plays out for us. (Oh, and before you ask: she’s doing better now.)

Goals for the rest of the year: Between illness and detours, it feels like the year is just getting started. I hope to post more often with Python, data and tech tips. I’ve bought Two Scoops of Django and JavaScript: The Definitive Guide for light summer reading, and I continue to plug away on a writing project that I hope to finish soon. And that’s in addition to lots of family and fun stuff we have in sight.

Thanks for hanging in, and please stay in touch!

NICAR ’14: Getting Started With Python

For a hands-on intro to Python at IRE’s 2014 NICAR conference, I put together a Github repo with code snippets just for beginners.

Find it here: https://github.com/anthonydb/python-get-started

For more Python snippets I’ve found useful, see my python-snippets repository on GitHub.

Finally, if you’d like an even deeper dive, check out journalist-coder Tom Meagher’s repository for the Python mini bootcamp held at this year’s conference.

Thanks to everyone who showed up!

Setting up Python in Windows 8.1

One of my family members recently acquired a Windows 8.1 laptop, and I was curious as to whether Python setup was as easy as when I wrote about installing it on Windows 7. Turns out, it is — and not much different. Which could spawn a whole conversation about Windows OS development, but that’s for another day …

Here’s your quick guide, modified from my earlier Win 7 post:

Set up Python on Windows 8.1

1. Visit the official Python download page and grab the Windows installer. Choose the 32-bit version. A 64-bit version is available, but there are compatibility issues with some modules you may want to install later.

Note: Python currently exists in two versions, the older 2.x series and newer 3.x series (for a discussion of the differences, see this). This tutorial focuses on the 2.x series.

2. Run the installer and accept all the default settings, including the “C:\Python27” directory it creates.


FTP with Python: Quick file stashing

Recently at work, our team[1] built an app to track Atlantic and Pacific tropical storms during hurricane season. To keep the map current — and avoid manual processing — we run a Python script at regular intervals to check shapefiles at the National Hurricane Center and convert them to GeoJSON. After conversion, we push the GeoJSON to our cached storage environment. Using flat files keeps things fast and tidy.

To publish the files, we ship via FTP. A colleague told me about Python’s built-in FTP support, and I was really happy with how easy it was for me to write a script to handle the task.

Here’s a sample script that logs into an FTP server and uploads all files in a local directory. Grab it, along with other useful bits, at my python-snippets repository on GitHub.

# Store all files in a local directory to a server
# using FTP
from ftplib import FTP
import os

# ftp settings
settings = {
    'ftp': {
        'url': 'ftp.some-server.com',
        'username': 'your-account-name',
        'password': 'your-password',
        'remote-directory': '/path/to/files'
    }
}

# local paths
paths = {
    'local-directory': 'my-files/'
}

# list of local files
files = os.listdir(paths['local-directory'])

# connect once, switch to the remote directory and store each file
ftp = FTP(settings['ftp']['url'])
ftp.login(settings['ftp']['username'], settings['ftp']['password'])
ftp.cwd(settings['ftp']['remote-directory'])
for f in files:
    with open(os.path.join(paths['local-directory'], f), 'rb') as local_file:
        ftp.storbinary('STOR ' + f, local_file)
ftp.quit()

[1] Python hackery by Andrea Fuller, mapping by Juan Thomassie, design/production direction by Kristin DeRamus and Maureen Linke. I rode devops/producer this time around.

Diary Entry, May 6, 1971

This is a page from my elementary school diary. In some ways, life hasn’t changed much. Also: Thank you, teachers, for letting me run the Bell & Howell projector. That was huge.



Favorite Albums List, 1963-?

I bought my first record sometime in the late 1960s/early 1970s. It was a 7-inch, 45 RPM disc featuring “I Like Science” b/w “We Need the Rain.” Artist unknown.

Nerd from the start is what I say. I also ran the projector in elementary school.

Since then, some of the music I bought has endured — as in my beloved progressive rock — but a lot hasn’t. I regret telling my boss at WPDH-FM, where I played music in the ’80s, that I would always listen to Judas Priest.

Today, here’s what has lasted — for me. To make this list, the album in general or the song specifically still has to raise the heart rate:

A Map of the World — Pat Metheny
The Fire Theft — The Fire Theft
Seconds Out — Genesis

Python: Pull Twitter, Facebook User Data

Updated July 4, 2013: The Twitter portion of this post has been revised to reflect Twitter’s retirement of v1 of its API and its move to v1.1. Since writing the original, I also discovered the excellent python-twitter library, which is extremely useful for handling the now-required authentication with the API and also providing a wrapper around the API itself.

Original post (updated):

The APIs from the two social media giants, Twitter and Facebook, offer plenty of possibilities for data gathering and analysis. From tweets and status messages to numbers of followers and friends, photos, locations and more, there’s a lot of information waiting.

Given my nascent interest in Python, I decided to explore the APIs via some simple scripts, fetching Twitter profile and Facebook page data and writing the values to a SQLite database.

These examples are simple but offer a framework for you (and me) to build upon. SQLite support is built into Python, but for the Facebook script you must install the Requests library if you don’t have it.

Facebook Page Data

This script (available on Github) pulls the number of “likes” and “talking about this” for each Facebook page specified in the list called names_list. It creates a SQLite database called social_data.db if none exists and also creates a table to hold the data.

# Fetch Facebook page metrics via Social Graph API into a SQLite DB
# Grabs the number of likes and "talking about" numbers
import requests
import sqlite3
from datetime import datetime

# These are the accounts for which you will fetch data
# (placeholder names -- swap in the pages you want to track)
names_list = [
    'page-name-one',
    'page-name-two'
]

# API base URL
base_url = 'https://graph.facebook.com/'

# Function to add row to accounts table
def insert_db(handle, likes, talking):
    conn = sqlite3.connect('social_data.db')
    cur = conn.cursor()
    cur.execute('''
        INSERT INTO fbaccounts VALUES (?,?,?,?);
        ''', (datetime.now(), handle, likes, talking))
    conn.commit()
    conn.close()

# Create the database and table if they don't exist
conn = sqlite3.connect('social_data.db')
cur = conn.cursor()
cur.execute('''CREATE TABLE IF NOT EXISTS fbaccounts
    (FetchDate Date, Handle Text, Likes Integer, Talking Integer)''')
conn.commit()
conn.close()

# Iterate over handles and hit the API with each
for user in names_list:
    url = base_url + user
    print 'Fetching ' + user
    response = requests.get(url)
    profile = response.json()
    handle = profile['name']
    likes = profile['likes']
    talking = profile['talking_about_count']
    insert_db(handle, likes, talking)

Twitter profile data

This script (also on Github) uses the python-twitter library to fetch some basic profile data — screen name, followers and description — into a SQLite DB. To get the keys and access tokens required now by version 1.1 of the Twitter API, you’ll need to register an application under your profile. Start at https://dev.twitter.com/.

Packing The Stands at Nationals Park

Bryce Harper’s two home runs and Stephen Strasburg’s seven scoreless innings rightly earned the headlines in Monday’s opening-day win for the Washington Nationals. But the data journalist in me couldn’t help but want to apply a little percent change action to the proceedings.

So, I did, especially after I noticed in the boxscore that Nationals Park, on a Monday, was jammed to 108% capacity. A few minutes of research and Excel later, I had some findings:

  • The Nats’ opening day attendance of 45,274 was 11% higher than the team drew at last year’s home opener, when they beat the Reds 3-2 in 10 innings.
  • The attendance wasn’t a record for Nationals Park, but it was close — about 700 below the record set on the last, heartbreaking (if you’re a Nats fan) game of the 2012 NL division series against the St. Louis Cardinals.
  • The day was, however, a regular-season record for the park, which opened in 2008.
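
The percent change math behind these findings is simple. Here's a generic sketch; the attendance figures below are made up for illustration, not the actual numbers from the boxscore:

```python
# Percent change: (new - old) / old * 100. The numbers here are
# illustrative stand-ins, not the Nats' actual attendance figures.
def pct_change(old, new):
    return (new - old) / float(old) * 100

print(round(pct_change(40000, 44400), 1))  # 11.0, an 11% increase
```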

After I figured this out, I went a-Googling to see if anyone else had the same scoop. Didn’t find the percent change, but I did see a mention of the attendance record in a post on We Love DC and a mention on a MASN Sports blog that didn’t qualify it against the post-season record. Nothing from major sports media (please comment below if I missed some).

I’m not about to play a baseball writer — it’s one of the few jobs I haven’t had in journalism — but the basics of ballpark attendance get too little attention, I think. And yet the money flowing through the turnstiles means a lot for a team and a city, as does the mental boost for the players who hear the cheers.

Can the Nats keep it up? I’ll be watching this chart at baseball-reference.com.


Get JSON from Excel using Python, xlrd

Powering interactive news applications off flat files rather than a call to a database server is an option worth considering. Cutting a production database and data access layer out of the mix eliminates a whole slice of complexity and trims development time. Flat files aren’t right for every situation, but for small apps they’re often all you need.

These days, most of the apps I help build at Gannett Digital consume JSON. Simpler apps — such as the table/modal displays we deployed in February for our Oscar Scorecard and Princeton Review Best Value Colleges — run off one or two JSON files. The more complex — e.g., last year’s Ghost Factories: Poison in the Ground — run off hundreds of them. Updating content is as easy as generating new JSON and uploading it to our file server.

I wrote last year how to use Python to generate JSON files from a SQL database. Lately, I’ve been using Python to make JSON out of Excel spreadsheets.

The key ingredient is the Python library xlrd. It’s part of a suite of Excel-related tools available from www.python-excel.org, including the xlwt library to create Excel files.

(Another choice is openpyxl, which has similar features and works with newer .xlsx formatted Excel files. I’ve used both with equal success. Take your pick.)

Basic xlrd operations

Let’s say we have an Excel workbook containing a small table repeated over three worksheets: a few rows of car data, with columns for car ID, make, model and miles.


Here are some snippets of code — just scratching the surface — to interact with it programmatically:

# Demonstrates basic xlrd functions for working with Excel files
# (Excel 97-2003)
import xlrd
# Open the workbook
wb = xlrd.open_workbook('excel-xlrd-sample.xls')
# Print the sheet names
print wb.sheet_names()
# Get the first sheet either by index or by name
sh = wb.sheet_by_index(0)
# Iterate through rows, returning each as a list that you can index:
for rownum in range(sh.nrows):
    print sh.row_values(rownum)
# If you just want the first column:
first_column = sh.col_values(0)
print first_column
# Index individual cells:
cell_c4 = sh.cell(3, 2).value
# Or you can use:
#cell_c4 = sh.cell(rowx=3, colx=2).value
print cell_c4
# Let's say you want the same cell from x identical sheets in a workbook:
x = 2
while x >= 0:
    sh = wb.sheet_by_index(x)
    cell_x = sh.cell(2, 3).value
    print cell_x
    x = x - 1

From Excel to JSON

Pretty cool stuff. Now, let’s convert our sample spreadsheet to JSON. I’ll borrow some of the techniques I discussed when outlining how to use Python to build JSON from a SQL database:

import xlrd
from collections import OrderedDict
import simplejson as json
# Open the workbook and select the first worksheet
wb = xlrd.open_workbook('excel-xlrd-sample.xls')
sh = wb.sheet_by_index(0)
# List to hold dictionaries
cars_list = []
# Iterate through each row in worksheet and fetch values into dict
for rownum in range(1, sh.nrows):
    cars = OrderedDict()
    row_values = sh.row_values(rownum)
    cars['car-id'] = row_values[0]
    cars['make'] = row_values[1]
    cars['model'] = row_values[2]
    cars['miles'] = row_values[3]
    cars_list.append(cars)
# Serialize the list of dicts to JSON
j = json.dumps(cars_list)
# Write to file
with open('data.json', 'w') as f:
    f.write(j)

Here’s the breakdown: We open the workbook, select the sheet and iterate through the available rows (which xlrd conveniently counts in its nrows attribute).

Add each cell to a key/value pair in a dictionary, then add each dictionary to a list. Dump the list to JSON and write to a file.

Of course, a spreadsheet this simple doesn’t need a Python script to make its way to JSON. Just use Mr. Data Converter for something like this. But as soon as your JSON requirements gain complexity, the ability to use Python to nest nodes, build strings and transform data on the fly makes this approach very appealing.
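
For instance, a script might group flat spreadsheet rows into nested nodes. A rough sketch, with hard-coded rows standing in for values read via xlrd:

```python
# Build nested JSON from flat rows by grouping cars under each make.
# The rows below are stand-ins for xlrd's sh.row_values() output.
import json
from collections import OrderedDict

rows = [
    (1, 'Ford', 'Taurus', 12000),
    (2, 'Ford', 'Focus', 34000),
    (3, 'Honda', 'Civic', 29000),
]

makes = OrderedDict()
for car_id, make, model, miles in rows:
    makes.setdefault(make, []).append(
        {'car-id': car_id, 'model': model, 'miles': miles}
    )

# One node per make, each holding a nested list of its cars
nested = [{'make': m, 'cars': cars} for m, cars in makes.items()]
print(json.dumps(nested, indent=2))
```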

The 2012 Best-Selling Books

Watching book sales has been an endlessly fascinating piece of my journalistic life, and the recent analysis I published of 2012’s top-selling titles from USA TODAY’s Best-Selling Books list is no exception.

The top 10 (posted via a nifty datatables.js embedded table) reflects the year’s most prominent trend — two series from two authors accounted for seven of the top 10 slots and about 25% of the sales the list tracked.

The table also includes annual top 100 lists back to 2007.

One tidbit that didn’t get a lot of notice: constant favorite “To Kill a Mockingbird” passed 800 weeks on the USA TODAY list. From my view, the most welcome books news of the year.


Tracking Hurricane Sandy

Update 7:30 p.m. Tuesday, Oct. 30: At my place. Sandy brought a sleepless night in the basement, with winds howling like a freight train and a battery-powered backup sump pump the only thing standing between us and a flooded house. We are thankful that the battery held out all night, even as the water poured in.

All day Monday and Tuesday, my colleagues at USA TODAY did an incredible job with Sandy coverage: Check it out.

Excel: Extract text with FIND and MID

Data analysis begins with usable data, and that means every piece organized nicely into its own field where we can count, sort and otherwise test it out.

What if you get a spreadsheet where the pieces of data are all packed in one field? Say, something like this (which I cobbled together from Major League Baseball data in honor of the Nationals’ first playoff appearance):

NAME: Sean Burnett POS: RP AGE: 30 WT: 200 BORN: Dunedin, FL SALARY: 2350000
NAME: Tyler Clippard POS: RP AGE: 27 WT: 200 BORN: Lexington, KY SALARY: 1650000
NAME: Ross Detwiler POS: SP AGE: 26 WT: 174 BORN: St. Louis, MO SALARY: 485000
NAME: Christian Garcia POS: RP AGE: 27 WT: 215 BORN: Miami, FL SALARY: N/A
NAME: Gio Gonzalez POS: SP AGE: 27 WT: 205 BORN: Hialeah, FL SALARY: 3335000
NAME: Mike Gonzalez POS: RP AGE: 34 WT: 215 BORN: Robstown, TX SALARY: N/A
NAME: Ryan Mattheus POS: RP AGE: 28 WT: 215 BORN: Sacramento, CA SALARY: 481000
NAME: Craig Stammen POS: RP AGE: 28 WT: 200 BORN: Coldwater, OH SALARY: 485000
NAME: Drew Storen POS: RP AGE: 25 WT: 180 BORN: Indianapolis, IN SALARY: 498750
NAME: Jordan Zimmermann POS: SP AGE: 26 WT: 218 BORN: Auburndale, WI SALARY: 2300000

Let’s say you want to extract the city of each player’s birth into a separate field. The varying length of each player’s name means the birth place isn’t always in the same position in the string, so a typical text-to-columns operation won’t work. So, how to do it?

The answer lies in two very handy Excel functions: FIND and MID.

FIND locates characters you specify and returns their numeric position in the string.

MID returns a given number of characters from a string, beginning at a position you specify.

For example, we can locate the position where each city name begins by using FIND to locate the string “BORN:” in each cell. The city name itself always starts six characters after the start of that string, so we add six to the result. With the data in column A, the first row’s formula looks like this:

=FIND("BORN:", A1) + 6

In the first row above, the function returns 50. In the second row, 52. We’ll feed that value to the MID function as the starting point for our extraction.

MID takes three arguments: the text (or a cell reference), the position to start, and the number of characters to return. So we use the FIND function above as the second argument and, for now, extract 10 characters:

=MID(A1, FIND("BORN:", A1) + 6, 10)

That gets us part of the way there. We’re starting in the right spot, but 10 characters isn’t always the combined length of the city and state, so we get choppy results:

Dunedin, F
St. Louis,
Miami, FL 
Hialeah, F

What we need to do is tell MID the exact number of characters to return each time even though the length of the city varies. We can figure this out using FIND again.
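
The same logic translates neatly to Python, by the way, where str.find plays the role of FIND and string slicing plays the role of MID. A sketch using rows from the sample data above (the helper function's name is ours, not Excel's):

```python
# str.find stands in for FIND; slicing stands in for MID.
def extract_city_state(row):
    # FIND("BORN:", row) + 6 -> where the city name starts
    start = row.find('BORN:') + len('BORN: ')
    # The city/state ends one character before "SALARY:"
    end = row.find('SALARY:') - 1
    # MID(row, start, end - start)
    return row[start:end]

row = 'NAME: Sean Burnett POS: RP AGE: 30 WT: 200 BORN: Dunedin, FL SALARY: 2350000'
print(extract_city_state(row))  # Dunedin, FL
```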

Catch the Cat: A Simple Python Game

New cat in the house? Check.

Son interested in programming? Check.

Dad eager to learn more Python? Check.

Add it up and you get a little project the two of us worked up over a couple of nights this summer — a Python-based text game called Catch the Cat. You can download the code from its repository on Github. To use it, from a command prompt run:

python catchthecat.py

The game’s simple: The cat’s on the loose in the house, and you have to catch it. Every time you move to a new room, the cat moves too — at random. To catch the cat, you both have to arrive in the same place.

OK, so we’re not going to unseat EA Sports here, but we met our goals: we had fun and learned some tricks. While I was using the exercise to help my son work through the logic choices required in game play, I got to add a few things to my Python toolkit:

— Using Python’s SQLite integration to build a leaderboard database.

— Writing classes for the cat and player objects.
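
For the curious, the core mechanic can be sketched in a few lines. This is an illustration of the game loop described above, not the actual catchthecat.py code, and the room names are invented:

```python
# A minimal sketch of the game loop: the cat moves to a random room
# each turn, and the game ends when player and cat share a room.
import random

ROOMS = ['kitchen', 'den', 'bedroom', 'bathroom', 'garage']

class Cat(object):
    """The cat starts in a random room and wanders at random."""
    def __init__(self):
        self.room = random.choice(ROOMS)

    def move(self):
        self.room = random.choice(ROOMS)

def play(guesses):
    """Try each room in guesses; return the turn the cat was caught, or None."""
    cat = Cat()
    for turn, room in enumerate(guesses, start=1):
        if room == cat.room:
            return turn
        cat.move()
    return None
```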

Have you written Python-based games? Suggestions for improvement here?

Ghost Factories: Behind the Project

This is a cross-post of a recent item I wrote for Investigative Reporters and Editors’ On the Road blog. “Ghost Factories” was perhaps the most fun, interesting and well-executed project I’ve done at USA TODAY, largely because the people and process worked so well. This covers all the moving parts:

*  *  *

In April, after USA TODAY published its Ghost Factories investigation into forgotten lead smelters, we heard from several people who wanted to know more about how the project came together — particularly the online package that included details on more than 230 of the former factories.

The following is an expanded version of a post originally sent to IRE’s NICAR-L mailing list:

Alison Young was the lead reporter who conceived the idea for the project. In late 2010, she came to me with a couple of PDFs showing a list of suspected lead smelter sites, which I parsed into a spreadsheet and plotted on a Google map for her to research. Then she started digging, as one of our editors said, “Armed only with faded photographs, tattered phone directories, obscure zoning records, archival maps, fuzzy memories of residents and shockingly incomplete EPA studies.”

Ghost Factories

In December 2010, she began filing the first of more than 140 FOIA requests. The requests produced thousands of pages of government documents related to the sites, and to catalog them she created a project inside DocumentCloud. The product was extremely helpful both for organizing documents and for presentation. Brad Heath of our investigative team would later use the DocumentCloud API to integrate metadata from the documents — particularly their titles —  into our database so we could present them online. He also used the API to batch-publish all 372 documents that were included in the project. (He did most of the work using python-documentcloud, a Python wrapper by the Los Angeles Times’ Ben Welsh that makes it easy to interact with the API programmatically.)