Ghost Factories: Behind the Project

This is a cross-post of a recent item I wrote for Investigative Reporters and Editors’ On the Road blog. “Ghost Factories” was perhaps the most fun, interesting and well-executed project I’ve done at USA TODAY, largely because the people and process worked so well. This covers all the moving parts:

*  *  *

In April, after USA TODAY published its Ghost Factories investigation into forgotten lead smelters, we heard from several people who wanted to know more about how the project came together — particularly the online package that included details on more than 230 of the former factories.

The following is an expanded version of a post originally sent to IRE’s NICAR-L mailing list:

Alison Young was the lead reporter who conceived the idea for the project. In late 2010, she came to me with a couple of PDFs showing a list of suspected lead smelter sites, which I parsed into a spreadsheet and plotted on a Google map for her to research. Then she started digging, as one of our editors said, “Armed only with faded photographs, tattered phone directories, obscure zoning records, archival maps, fuzzy memories of residents and shockingly incomplete EPA studies.”


In December 2010, she began filing the first of more than 140 FOIA requests. The requests produced thousands of pages of government documents related to the sites, and to catalog them she created a project inside DocumentCloud. The product was extremely helpful both for organizing documents and for presentation. Brad Heath of our investigative team would later use the DocumentCloud API to integrate metadata from the documents — particularly their titles —  into our database so we could present them online. He also used the API to batch-publish all 372 documents that were included in the project. (He did most of the work using python-documentcloud, a Python wrapper by the Los Angeles Times’ Ben Welsh that makes it easy to interact with the API programmatically.)

12 Tangents Later, I Publish a Django Site

Last week, I deployed my first live Django app. Time from start to finish: three years.

Cue the sound of snickers and a thousand eye-rolls. Go ahead. But I confess: From the moment I said, “I want to build something using Django” to the moment I restarted Apache on my WebFaction server and watched the site load for real in my browser, 36 months passed through the hourglass of time.

You see, I got diverted along the way. I’ll tell you why. But first, two things:

1. Learning is wonderful, thrilling, maddening and rewarding. If you’re a journalist and want to see new worlds, let me encourage you to take a journey into code.

2. The site is right here and the code is here. It falls way short in the Awesome Dept., and it will not save journalism. But that’s not why I built it, really.

* * *

The tale began in March 2009 in Indianapolis at the Investigative Reporters and Editors Computer-Assisted Reporting conference. That’s the annual data journalism hoedown that draws investigative journalists, app coders and academics for a couple of days of nerdish talk about finding and telling stories with data.

Generate JSON From SQL Using Python

Let’s say you want to generate a few hundred — or even a thousand — flat JSON files from a SQL database. Maybe you want to power an interactive graphic but have neither the time nor the desire to spin up a server to dynamically generate the data. Or you think a server adds one more piece of unnecessary complexity and administrative headache. So, you want flat files, each one small for quick loading. And a lot of them.

A few lines of Python are all you need.

I’ve gone this route lately for a few data-driven interactives at USA TODAY, creating JSON files out of large data sets living in SQL Server. Python works well for this, with its JSON encoder/decoder offering a flexible set of tools for converting Python objects to JSON.
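To make the flat-file idea concrete, here is a minimal sketch of writing one small JSON file per record. The sample records, the `write_flat_files` and `to_jsonable` helpers and the output folder are all hypothetical, not the code behind the USA TODAY interactives. The `default` hook is one way the standard `json` module can cope with values such as dates and decimals, which often come back from a SQL query and which the encoder won't handle on its own.

```python
import datetime
import decimal
import json
import os


def to_jsonable(value):
    """Convert types the json module can't encode on its own."""
    if isinstance(value, (datetime.date, datetime.datetime)):
        return value.isoformat()
    if isinstance(value, decimal.Decimal):
        return float(value)
    raise TypeError("Cannot serialize %r" % (value,))


def write_flat_files(records, out_dir, key="ID"):
    """Write one JSON file per record, named after the record's key."""
    if not os.path.isdir(out_dir):
        os.makedirs(out_dir)
    paths = []
    for record in records:
        path = os.path.join(out_dir, "%s.json" % record[key])
        with open(path, "w") as f:
            json.dump(record, f, default=to_jsonable)
        paths.append(path)
    return paths


# Hypothetical sample rows, shaped like dicts built from a query result.
sample = [
    {"ID": 1, "Name": "Ada", "Enrolled": datetime.date(2011, 9, 1),
     "Balance": decimal.Decimal("12.50")},
    {"ID": 2, "Name": "Grace", "Enrolled": datetime.date(2011, 9, 2),
     "Balance": decimal.Decimal("0")},
]
```

Calling `write_flat_files(sample, "json_out")` would leave `1.json` and `2.json` in that folder, each small enough to load quickly from a static host.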

Here’s a brief tutorial:

1. If you haven’t already, install Python. Here’s my guide to setup on Windows 7; if you’re on Linux or Mac you should have it already.

2. In your Python script, import a database connector. This example uses pyodbc, which supports connections to SQL Server, MySQL, Microsoft Access and other databases. If you’re using PostgreSQL, try psycopg2.

3. Create a table or tables to query in your SQL database and write and test your query. In this example, I have a table called Students that has a few fields for each student. The query is simple:

SELECT ID, FirstName, LastName, Street, City, ST, Zip
FROM Students

4. Here’s an example script that generates two JSON files from that query. One file contains JSON row arrays, and the other JSON key-value objects. Below, we’ll walk through it step-by-step.
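A minimal sketch of such a script follows. It is an illustration under stated assumptions, not the original code: to stay runnable anywhere it uses the standard library's `sqlite3` with an in-memory Students table and made-up rows, where the tutorial uses pyodbc. The `export_query` and `make_demo_db` names and the file paths are hypothetical. Because both connectors follow Python's DB-API, a pyodbc connection should be able to take the place of the SQLite one.

```python
import json
import sqlite3

QUERY = """
    SELECT ID, FirstName, LastName, Street, City, ST, Zip
    FROM Students
"""


def export_query(conn, arrays_path, objects_path, query=QUERY):
    """Run the query and write the results as two JSON files."""
    cursor = conn.cursor()
    cursor.execute(query)
    # DB-API cursors expose each column's name as the first item
    # of its entry in cursor.description.
    columns = [col[0] for col in cursor.description]
    rows = cursor.fetchall()

    # File 1: each row as a plain JSON array of values.
    row_arrays = [list(row) for row in rows]
    with open(arrays_path, "w") as f:
        json.dump(row_arrays, f)

    # File 2: each row as a JSON object keyed by column name.
    row_objects = [dict(zip(columns, row)) for row in rows]
    with open(objects_path, "w") as f:
        json.dump(row_objects, f)

    return row_arrays, row_objects


def make_demo_db():
    """Build an in-memory Students table with made-up rows (assumption)."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE Students (ID INTEGER, FirstName TEXT, LastName TEXT,"
        " Street TEXT, City TEXT, ST TEXT, Zip TEXT)"
    )
    conn.executemany(
        "INSERT INTO Students VALUES (?, ?, ?, ?, ?, ?, ?)",
        [
            (1, "Samantha", "Baker", "9 Main St", "Hyde Park", "NY", "12538"),
            (2, "Mark", "Salomon", "12 Elm St", "Hyde Park", "NY", "12538"),
        ],
    )
    return conn
```

Row arrays make for smaller files because the column names aren't repeated; key-value objects are larger but easier to read from JavaScript. Which one suits a given graphic is a judgment call.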

NICAR 2012: Words and Nerds

Briefly, some recaps from my week at the 2012 National Institute for Computer-Assisted Reporting conference, held in late February in St. Louis:

The basics: 2012 marked my 10th NICAR conference, an annual gathering of journalists who work with data and, increasingly, with code to find and tell stories. It’s sponsored by Investigative Reporters and Editors, a nonprofit devoted to improving investigative journalism. Panels ranged from data transparency to regular expressions.

Catch up: The best way to review what you learned (or find out what you missed) is to read Chrys Wu’s excellent collection of presentation links and IRE’s conference blog.

Busy times: Our USA TODAY data journalism team served on a half-dozen panels and demos. With Ron Nixon of The New York Times and Ben Welsh of the Los Angeles Times, I led “Making Sure You Tell a Story,” a reminder to elevate our reporting, graphics and news apps. (Here are the slides from me and Ben.) I also joined Christopher Groskopf for a demo of his super-utility csvkit, which I’ve written about. And, finally, I spoke about USA TODAY’s public APIs and how building them helps newsrooms push content anywhere.

Award!: Our team was excited to pick up the second-place prize in the 2011 Philip Meyer Awards for the Testing the System series by Jack Gillum, Jodi Upton, Marisol Bello and Greg Toppo. Truly an honor.

Surprise Award!: At the Friday evening reception, I received an IRE Service Award for contributing 2010 Census data to IRE, first for sharing with members on deadline and eventually for use on IRE’s site. Colleague and master of all things Census Paul Overberg also was honored, along with the NYT’s Aron Pilhofer, the Chicago Tribune’s Brian Boyer and others. Out of the blue and humbling.

On the Radar: I ran into O’Reilly Radar’s Alex Howard at the conference — the side conversations are always a bonus of these things — and he later emailed me some questions about data journalism. My responses ended up in two pieces he wrote: “In the age of big data, data journalism has profound importance for society” and “Profile of the data journalist: the storyteller and the teacher.”

The 2011 Best-Selling Books

In 2011, a year when consumers unboxed millions of e-readers, fiction dominated even more of USA TODAY’s Best-Selling Books list. Colleague Carol Memmott and I reported today that 78% of the titles in the weekly book lists last year were fiction, up from 67% in 2007. The finding is one of several covered in our annual look at trends off the book list:

“People are interested in escape,” says Carol Fitzgerald of the Book Report Network, websites for book discussions. “In a number of pages, the story will open, evolve and close, and a lot of what’s going on in the world today is not like that. You’ve got this encapsulated escape that you can enjoy.”

We’ve posted the 100 top-selling titles of 2011 in a handy data table that includes the annual lists back to 2007.

Again Towards The Analog

The feeling came a few weeks ago as I drove along a back road near the Potomac River. I was in the lowlands, about to cross from Virginia to Maryland, driving alone during a day in which I’d purposely disconnected from email, Twitter and most things digital.

I think we see things differently on those days.

My car rounded a bend, and through the trees I could see the river. The scene was perfection: bare trees arrayed on a grassy plain, standing watch next to the Potomac. If I’d shot a photo, it would have brushed up against Ansel Adams in intent if not quality. It took my breath, and I gave thanks.

Soon I was on a bridge crossing the river and then into Maryland. But the scene stayed in mind as I drove toward my destination, the road now winding through rustic small towns that seemed to take me even farther from the office.

I’ve thought back on those minutes often as 2011 disappeared into time past. I’ve thought how I need many more of those minutes.

And In Local News … Editor’s Acquitted

So, you’re the 67-year-old editor of a small-town newspaper who also happens to do the books for a local businessman.

The local businessman’s not just your boss. He’s also the owner/landlord of your newspaper’s office, your residence, your son’s residence and your daughter’s business. You live in one of those in-grown places that dot America, a place where everyone whispers everyone’s business.

One day, you’re arrested. The charge: embezzling $9,000 from this businessman-boss-landlord.

The arrest happens in the middle of the day. Somehow, the local police chief decides to give you a perp walk in handcuffs down a main street of your little town, where everyone knows you and you know everyone. And, somehow, a freelance photographer just happens to be there, takes photos of you perp-walking, and sells them to a rival weekly newspaper, which of course publishes them.

You, the newspaper editor, say it’s all a mistake. Of course you didn’t steal anything … it was an accident!

The town’s in an uproar. Scandal! And on top of it a perp walk right in town for a 67-year-old lady!

‘Goshen’ WordPress Theme on Github

At the start of 2011, I simplified the first WordPress theme I’d built for this site and turned it into something far more minimalistic. I went from two sidebars to one, lost the bulky header and turned from color to black and white. Part of this was a desire for simplicity; part was my reaction to my lack of design sense. Color is not my strong suit, and I shouldn’t be caught trying to pretend.

Since then, I’ve made a few tweaks, but one thing I hadn’t done all year was post the theme — which I call Goshen — for anyone to use. Today I fixed that and pushed the files up to their own repository on Github. You can download the files and hack away. (In your WordPress install, under /wp-content/themes/, create a folder called Goshen and unzip the files there; then you can activate the theme via the dashboard.)

I’ll continue to tweak when I have time. I can’t say enough about how much WordPress theme hacking has taught me about HTML, CSS, templates and web design. If you want to start from scratch, I recommend this excellent tutorial. You’ll discover that WordPress themes have only a few moving parts. Mastering them will let you make your site exactly what you want it to be.


Scraping CDC flu data with Python

Getting my flu shot this week reminded me about weekly surveillance data the Centers for Disease Control and Prevention provides on flu prevalence across the nation. I’d been planning to do some Python training for my team at work, so it seemed like a natural to write a quick Python scraper that grabs the main table on the site and turns it into a delimited text file.

So I did, and I’m sharing. You can grab the code for the CDC-flu-scraper on Github.

The code uses the Mechanize and BeautifulSoup modules for web browsing and html parsing, respectively. Much of what I demonstrate here I started learning via Ben Welsh’s fine tutorial on web scraping.
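For a sense of the table-to-text step, here is a minimal Python 3 sketch. It is not the scraper from the repository: it swaps in the standard library's `html.parser` for BeautifulSoup, skips the page fetch that Mechanize handles, and parses an inline sample table whose column names and numbers are placeholders, not CDC figures. The `TableParser` and `table_to_delimited` names are hypothetical.

```python
import csv
import io
from html.parser import HTMLParser


class TableParser(HTMLParser):
    """Collect the cell text of every table row in a page."""

    def __init__(self):
        super().__init__()
        self.rows = []
        self._row = None   # cells of the row being read, if any
        self._cell = None  # text fragments of the cell being read

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        # Keep text only while inside a cell.
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th"):
            if self._cell is not None and self._row is not None:
                self._row.append("".join(self._cell).strip())
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self.rows.append(self._row)
            self._row = None
            self._cell = None


def table_to_delimited(html, delimiter="|"):
    """Return the page's table rows as delimited text."""
    parser = TableParser()
    parser.feed(html)
    out = io.StringIO()
    writer = csv.writer(out, delimiter=delimiter, lineterminator="\n")
    writer.writerows(parser.rows)
    return out.getvalue()


# Placeholder table for illustration only.
SAMPLE_HTML = """
<table>
  <tr><th>Week</th><th>Percent positive</th></tr>
  <tr><td>40</td><td>1.2</td></tr>
  <tr><td>41</td><td>1.5</td></tr>
</table>
"""
```

`table_to_delimited(SAMPLE_HTML)` returns three pipe-delimited lines, a header and two data rows, ready to write to a text file.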

We’re still early in flu season, but if you watch this data each week you’ll see the activity pick up quickly.

Update 10/22/2011: Ben Welsh has lent some contributions to this scraper, adding JSON output and turning it into a function. Benefits of social coding 101 …

Setting up Python in Windows 7

An all-wise journalist once told me that “everything is easier in Linux,” and after working with it for a few years I’d have to agree — especially when it comes to software setup for data journalism. But …

Many newsroom types spend the day in Windows without the option of Ubuntu or another Linux OS. I’ve been planning some training around Python soon, so I compiled this quick setup guide as a reference. I hope you find it helpful.

Set up Python on Windows 7

Get started:

1. Visit the official Python download page and grab the Windows installer. Choose the 32-bit version. A 64-bit version is available, but there are compatibility issues with some modules you may want to install later. (Thanks to commenters for pointing this out.)

Note: Python currently exists in two versions, the older 2.x series and newer 3.x series (for a discussion of the differences, see this). This tutorial focuses on the 2.x series.

2. Run the installer and accept all the default settings, including the “C:\Python27” directory it creates.
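Once the installer finishes, a quick sanity check can confirm which interpreter you ended up with. This is a minimal sketch, assuming you save it under a hypothetical name such as check_python.py and run it with the interpreter you just installed; the `describe_interpreter` helper is illustrative. It should behave the same under the 2.x and 3.x series.

```python
import platform
import struct
import sys


def describe_interpreter():
    """Return the running interpreter's version, bitness and location."""
    return {
        "version": platform.python_version(),
        "major": sys.version_info[0],
        # Pointer size in bytes times 8 gives 32 or 64.
        "bits": struct.calcsize("P") * 8,
        "executable": sys.executable,
    }


if __name__ == "__main__":
    info = describe_interpreter()
    print("Python %(version)s, %(bits)d-bit, at %(executable)s" % info)
```

If the bit count isn't what you chose in step 1, a different Python on the machine is probably answering the call.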


csvkit: A Swiss Army Knife for Comma-Delimited Files

If you’ve ever stared into the abyss of a big, uncooperative comma-delimited text file, it won’t take long to appreciate the value and potential of csvkit.

csvkit is a Python-based Swiss Army knife of utilities for dealing with, as its documentation says, “the king of tabular file formats.” It lets you examine, fix, slice, transform and otherwise master text-based data files (and not only the comma-delimited variety, as its name implies, but tab-delimited and fixed-width as well). Christopher Groskopf, lead developer on the Knight News Challenge-winning Panda project and recently a member of the Chicago Tribune’s news apps team, is the primary coder and architect, but the code’s hosted on Github and has a growing list of contributors.

As of version 0.3.0, csvkit comprises 11 utilities. The documentation describes them well, so rather than rehash it, here are highlights of three of the utilities I found interesting during a recent test drive:

My First Earthquake

I was looking at my watch because the meeting was scheduled for an hour, and the hour was nearly over.

We were in a second-floor conference room in the USA TODAY building in McLean, Va. That side of our glass-enclosed HQ faces the intersection of the Dulles Toll Road and the Capital Beltway, and for the last few years we’ve been front-row-center to the construction of new HOT lanes for the Beltway and the work going on for the new Metro Silver Line.

Loud noises are not uncommon.

At 1:50 p.m. I checked the time. I have a bad habit of frequently and obviously looking at my watch, which implies that I am bored or impatient. I’m not; I just like to know what time it is. I’ve always been a clock-watcher. I’m always on time. So, I looked, mentally noting that I had a free hour until my next meeting at 3.

A moment later, the floor began to vibrate. There was a sound, rumbling, like the bulldozers and cranes that had been outside for months, but somehow different.

“Is that a crane coming toward the building?”

I stood to push back the shade and look out the window. I never got that far. The room began shaking from side to side, and people in the next room started exclaiming.

Earthquake, I thought. I dove under the conference table and lay on my side while the room pulsed.

Part of me was in disbelief. They always said earthquakes don’t happen here.

And then it was over, and someone said, “Let’s get out of here!” And then we were outside, everyone trying to make a call on a cell phone and no one getting through.

Some Favorite WordPress Plugins

With the 100-degree heat broiling the East Coast this weekend, I decided to stay inside and make some design and performance tweaks to my site. I added Google +1 buttons to posts and the index page, and I also tweaked some of the settings in my plugins.

Speaking of those, here’s what I’ve been using to make life easier:

Akismet: Gets rid of a ton of comment spam for various Russian “services” so I can spend my time doing other things. You’ll need to sign up for an API key, but otherwise it’s simple and effective.

Contact Form 7: After trying a few contact plugins, I settled on Contact Form 7 and have had great results. It powers my Contacts page, which I prefer to use instead of posting an email address. For spam filtering, I implemented the quiz feature, but the plugin also supports CAPTCHA. I rarely get spam.

Google XML Sitemaps: Generates a sitemap.xml file that Google and other search engines use to index the site. Lets me include or exclude content and control how often to update the file.

A Facelift for a Book List

The USA TODAY Best-Selling Books list has a new look and added interactivity, part of a relaunch of books coverage. It’s been a fun project that has been on my front burner for about three months.

I get to work with all kinds of data at USA TODAY, but the book list has been a constant. When I arrived at USAT in 1997, one of the first projects I took on was to build and analyze an archive of the list to mark its fifth anniversary. Since then, as that archive grew to hold nearly 18 years of data, we’ve used it to anchor stories about authors and trends in publishing. We’re awfully proud of the list, and people in the publishing industry tell us it’s one of the most accurate accounts of Americans’ weekly reading habits.

Last year, we opened the archives up to developers via a Best-Selling Books API. This year, giving the list itself a facelift was the next logical step.

We were fortunate to assemble a crack team of designers, developers and product managers who, in a short time, conceptualized, designed, redesigned, and coded an entirely new collection of book-related pages for our site. What’s new:

A Price That Minimizes Risk

Do pricing trends in music and books have any resonance for news and, in particular, investigative journalists?

When a new album by Explosions in the Sky was recently made available for $2.99 for 24 hours, it caught my attention.

Until then, I hadn’t bought any of the band’s albums. I’d been mildly interested in EitS since it played an episode of Austin City Limits, but given my limited music-purchase budget, I hadn’t prioritized one of its albums over buying new releases by my favorite artists.

But $2.99 made it too easy. I clicked “buy.”

Later, I thought about the psychology of the buy. Why did $2.99 win me when $4.99 or $5.99 might not have? As I type, the price is back up to $7.99 for a download. Had I stumbled on that title today at that price, I would have passed.

But $2.99 hooked me. Why?


For the last many years, I’ve had an idea for a project. At work, in meetings and casual conversations, if an opening came up for me to tout my vision, I’d take it. Launch the pitch, follow up with an email.

“I’ve said it before, but we really should …”

Sometimes, I wondered whether people were thinking not about my grand idea but rather, “How can I get away from this man?” Mostly, they encouraged me — even though at the end of our talk it would be clear that other priorities held sway, and my pet idea had to go back to the shelf.

And so it did. Until about two weeks ago.

That’s when a spark out of nowhere set fire to the pile of kindling I’d been setting up all that time. Suddenly I found myself giving my pitch and hearing, “Let’s do this.”

And so for the last two weeks I’ve found myself in a room with the very people I’ve been bugging — some of the smartest, most creative people in my company — each one focused on turning this idea into something you’ll be able to see.

And the best part is that the end product is going to be way better than I ever imagined. Because now it won’t be my idea, but OUR idea.

A pile of kindling. A random spark.

Never give up.

Lessons From a Census Factory

After two months of processing Census data and writing about it here, I’m ready for a nice break. But before I go off to explore other topics, I thought I’d wrap this episode of Census 2010 with a look at how my teammates and I processed the data. My deepest thanks to my colleagues for doing such a great job. And many thanks to the journalists across the U.S. who offered encouragement as we shared our work with the journalism community.

*   *   *   *

On a Thursday afternoon in the first week of February, three of us from our newsroom’s database team gathered at my computer and tried our best to subdue the butterflies swarming in our stomachs. What we were about to do, we hoped, would not only help us cover the year’s biggest demographic story but also help journalists across the country do the same.

That’s because weeks earlier, somewhere in the midst of poring through Census technical manuals and writing a few thousand lines of SAS code, we’d had a bright idea:

Let’s share this.


Census 2010 State Stories: Week 8

The eighth and final (phew!) week of Census 2010 P.L. 94 redistricting data releases brought data nerds back to east coast states — including one of the largest, New York. Here’s my final roundup of interesting stories and data applications made by journalists for this round of the Census:

District of Columbia: With 39,000 fewer black people since 2000, the nation’s capital is on the verge of seeing blacks lose majority status there, The Washington Post wrote. Its story explained:

The demographic change is the result of almost 15 years of gentrification that has transformed large swaths of Washington, especially downtown. As housing prices soared, white professionals priced out of neighborhoods such as Dupont Circle began migrating to predominantly black areas such as Petworth and Brookland.

The Post offered a ward-by-ward graphic explaining the city’s population changes, and its interactive map was updated to include D.C. along with Maryland and Virginia.

Maine: The state, which is 94% white, lost population in its northern and eastern counties, The Bangor Daily News reported. On that page, note the BDN’s use of a Census Bureau-provided interactive map — one of many cases where news orgs picked up a government-issued graphic.

Census 2010 State Stories: Week 7

This week’s release of nine states’ worth of Census data took us from corner to corner of the U.S. — from Alaska to Florida — with a bunch of upper Midwest states thrown in. Only eight states plus Washington, D.C., are left.

My USA TODAY colleague Paul Overberg and I continued pulling each state’s data for our interactive map and state profile pages, and our shop continued to write at least one story about each state. This week, reporter Dennis Cauchon’s story on North Dakota’s population boom was picked up by the Drudge Report and became our site’s top story for a day and a half. Who’d have thought?

Here’s a rundown of interesting stories and interactives:

Smart story: Rob Chaney of Montana’s The Missoulian wrote about Huson, one of 85 new “places” designated by the Census Bureau in the 2010 count. Shows what you can do if you can think non-numbers about a numbers story. Don’t miss the final quote.