NoVa-Py Talk: Building a Python Package

One of the most popular uses of the API for DocumentCloud, the document research/publishing platform where I work, is to bulk-upload hundreds or thousands of documents. People usually hack their own code together to do this, sometimes using the Python or Ruby wrappers for the API.

After talking with users and hearing their thoughts about the workflow — a desire to have a record of each file’s URL once uploaded, for example — I saw an opportunity to add some luxury to the process. A couple of months, a lot of research, and a few bruises later, I had my first Python package: pneumatic.

pneumatic does a few things to make life easier. It grabs information about each uploaded file and saves it in a SQLite database, which you can dump to a CSV file. It uses Python’s multiprocessing module to try to add some speed (recognizing that this is a network-bound task). And it scans all subfolders for files, which is handy when you obtain a collection of files organized that way.
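For a rough sense of how those three pieces might fit together, here is a minimal sketch. It is not pneumatic's actual code: the function names, the table layout and the example.org URL are all made up for illustration, and the upload is a stand-in that never touches the network. It uses the thread-backed pool that ships in the multiprocessing package, which is one reasonable choice for work that mostly waits on the network.

```python
import csv
import os
import sqlite3
from multiprocessing.pool import ThreadPool


def find_files(root):
    """Return sorted paths of every file under root, including subfolders."""
    paths = []
    for folder, _subfolders, names in os.walk(root):
        for name in names:
            paths.append(os.path.join(folder, name))
    return sorted(paths)


def fake_upload(path):
    """Hypothetical stand-in for an API upload; returns (path, url)."""
    return path, "https://example.org/documents/" + os.path.basename(path)


def upload_all(root, db_path, workers=4):
    """Run the stand-in upload on every file and record results in SQLite."""
    with ThreadPool(workers) as pool:
        results = pool.map(fake_upload, find_files(root))
    conn = sqlite3.connect(db_path)
    with conn:  # commits the inserts when the block exits cleanly
        conn.execute("CREATE TABLE IF NOT EXISTS uploads (path TEXT, url TEXT)")
        conn.executemany("INSERT INTO uploads VALUES (?, ?)", results)
    conn.close()
    return len(results)


def dump_csv(db_path, csv_path):
    """Write the uploads table to a CSV file with a header row."""
    conn = sqlite3.connect(db_path)
    rows = conn.execute("SELECT path, url FROM uploads ORDER BY path").fetchall()
    conn.close()
    with open(csv_path, "w", newline="") as handle:
        writer = csv.writer(handle)
        writer.writerow(["path", "url"])
        writer.writerows(rows)
    return len(rows)
```

Calling upload_all("my_documents", "uploads.db") and then dump_csv("uploads.db", "uploads.csv") would leave one CSV row per file found, each with its path and the placeholder URL.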

Learning about Python packaging was as much a part of the project as creating the library itself. The folks at the Northern Virginia Python Users Group were kind enough to invite me recently to share what I learned. Click through the title card to view the slides.

I’m joining DocumentCloud!

Career news! I’ve been named Director of Product Development for DocumentCloud, the open source tool that hundreds of newsrooms worldwide use to catalog, analyze and publish PDF files and other documents. The platform — created via a Knight News Challenge grant — is now part of the non-profit Investigative Reporters and Editors, which in turn is housed in the Missouri School of Journalism at the University of Missouri.

In my role, I’ll work with an expanded DocumentCloud team and advisory board to improve the basic platform and add premium features. Support for this effort comes via a grant from the John S. and James L. Knight Foundation, announced last summer.

I’m super excited about the project. I’ve been a long-time fan of DocumentCloud — at USA TODAY, we used it extensively, for example, in Ghost Factories and other investigative projects — and it’s become an indispensable tool. Beyond that, I have been involved with IRE via conferences, bootcamps and teaching for many years, and it’s an honor to join the staff.

More news on the project to come!

Enter the Rift: Taking journalism to VR

As I write, my voice is hoarse from three days showing Harvest of Change — a Des Moines Register/Gannett Digital series that used the Oculus Rift and 360-degree video — to hundreds of journalists at the Online News Association conference in Chicago.

The demos capped a two-week sprint that included a media day in New York City, publishing five versions of the software and then catching some media buzz, which alternately praised and scoffed at the effort. Such whirlwinds are fleeting, but highlights are milestones. So, while it’s fresh, here’s a recap.

First, a scene from the Midway at ONA:

That’s Rosental Alves, director of the Knight Center for Journalism in the Americas at the University of Texas at Austin, trying out the project. We set up three Oculus workstations, and for three days the chairs were rarely empty. On the last day, as we packed up, we figured between 400 and 500 people had tried it.

Most people came out of curiosity, or with skepticism, but left impressed. Some were compelled by Amy Webb, who said in a Saturday ONA session that our experience was a must-see. Apparently, we even made the unofficial ONA bingo card.

The story behind this story

The project came together over the summer. When I wasn’t coding backend data for an election forecast, I was heading a small team visiting the dusty back roads of Iowa, both in person and in the Oculus headset. Much has been written about the Oculus Rift, especially since Facebook acquired it for $2 billion, but the focus so far has been on gaming. After journalism innovation professor Dan Pacheco of Syracuse University introduced us to the Rift, Gannett Digital decided to build its first VR explanatory journalism project. Continue…

csvkit: A Swiss Army Knife for Comma-Delimited Files

If you’ve ever stared into the abyss of a big, uncooperative comma-delimited text file, it won’t take long to appreciate the value and potential of csvkit.

csvkit is a Python-based Swiss Army knife of utilities for dealing with, as its documentation says, “the king of tabular file formats.” It lets you examine, fix, slice, transform and otherwise master text-based data files (and not only the comma-delimited variety, as its name implies, but tab-delimited and fixed-width as well). Christopher Groskopf, lead developer on the Knight News Challenge-winning Panda project and recently a member of the Chicago Tribune’s news apps team, is the primary coder and architect, but the code’s hosted on GitHub and has a growing list of contributors.

As of version 0.3.0, csvkit comprises 11 utilities. The documentation describes them well, so rather than rehash it, here are highlights of three of the utilities I found interesting during a recent test drive:
Continue…

Free Software and APIs: NICAR 2011 slides

I had the privilege this week of speaking on two panels at the 2011 Investigative Reporters and Editors Computer-Assisted Reporting* conference in Raleigh, N.C. Here are the slides my co-presenters and I put together:

- “Free Software: From Spreadsheets to GIS” with Jacob Fenton of the Investigative Reporting Workshop. Here is part 1, and here’s part 2.

- “APIs: Making the Web a Data Medium” with Derek Willis of The New York Times.

* Those of us with a few miles on the tires remember that the conference used to go by the name NICAR — for National Institute for Computer-Assisted Reporting. People still call it that.