Test Drive: Freebase Gridworks 1.1
Data journalists spend lots of time wrestling dirty data, so when I heard the News Applications team at the Chicago Tribune raving about the data-handling abilities of Freebase Gridworks, my interest was piqued. Anything that can lessen the pain of cleaning data is worth a closer look!
Freebase Gridworks is a Java-based app that runs locally in your web browser. The makers’ pitch describes it best:
… A power tool that allows you to load data, understand it, clean it up, reconcile it internally, augment it with data coming from Freebase, and optionally contribute your data to Freebase for others to use. All in the comfort and privacy of your own computer.
Installation is simple. I chose to load Gridworks on my Windows XP-based work laptop, although you can download Mac and Linux versions from the code page. I was up and running in about five minutes, which included loading a new version of Java. Once running, the opening screen looks like so (click for larger version):
You can open an existing project or create a new one by importing a data file — and Gridworks hints at its utility by providing options to parse delimited or non-delimited files, limit the import to specific rows, etc. For testing, I grabbed the Academic Libraries: 2008 Public Use Data file from the National Center for Education Statistics — a tab-delimited text file of about 4,100 rows.
Import was a cinch. Gridworks guessed correctly at the file format and split the columns perfectly:
First thing I tried was data cleanup. Some of the cities in the “CITY_M” field were in uppercase and some were capitalized normally. Each column header has a menu of manipulation options, so I chose Edit Cells > Common Transforms > To Titlecase:
Gridworks chugged along for a few seconds (a progress bar might be handy), but soon enough it returned all the cities in the correct case. Nice!
Next, the ZIP_M field (and ZIP also) had a mix of five-digit zips and some with the “plus 4″ extension. To separate the plus 4’s into their own field, I chose Edit Column > Split Into Several Columns. It produced this dialog:
I opted to split the column by field length and typed the values “5,4” for the string lengths. To preserve the leading zeros in the zips and extensions, I unchecked the box “guess cell type” to keep the fields as text. Gridworks chugged along again, then produced the result, automatically renaming the fields in the process:
Another handy feature of Gridworks is its ability to edit field values en masse. If you hover your mouse over a cell, an “edit” button appears:
Clicking it brings up a dialog box where you can change the cell’s value — and apply that change to all other cells with the same content. Handy! Here’s how you could change all the state names of “AL” to “Alabama”:
Data cleanup is clearly a strength, but Gridworks also offers plenty of ways to explore data by creating facets, or summaries of data (think using COUNT and GROUP BY in SQL). It produces summary tables that let you quickly find all the unique values in a column — and edit them if you need to create consistency (i.e. company names spelled several ways).
Finally, Gridworks lets you export your revised data back to Excel or tab/comma-delimited text files, among other options. Very, very useful.
Judging by its revision history, Freebase Gridworks is very much an evolving tool but one worth keeping tabs on. This little test drive has probably just scratched the surface of the ways you can use it to standardize your data, but you can get more ideas via the demo videos on the product’s home page.