If you’ve ever stared into the abyss of a big, uncooperative comma-delimited text file, it won’t take long to appreciate the value and potential of csvkit.
csvkit is a Python-based Swiss Army knife of utilities for dealing with, as its documentation says, “the king of tabular file formats.” It lets you examine, fix, slice, transform and otherwise master text-based data files (and not only the comma-delimited variety, as its name implies, but tab-delimited and fixed-width as well). Christopher Groskopf, lead developer on the Knight News Challenge-winning Panda project and recently a member of the Chicago Tribune’s news apps team, is the primary coder and architect, but the code’s hosted on Github and has a growing list of contributors.
As of version 0.3.0, csvkit comprises 11 utilities. The documentation describes them well, so rather than rehash it, here are highlights of three of the utilities I found interesting during a recent test drive:
csvcut: Henceforth, this utility will likely meet every csv file I get. To start, it will describe the file contents for me: If I want a quick scan of the column names and their order in the file, I just type:
csvcut -n filename
The output is an indexed list of column names, assuming the first row of the file is a header row. If not, you get the first row of data, which can be handy as well.
Still, the “cut” part of this utility is its killer feature, extracting columns from the file in the order you choose. You might use this to subset the file before importing to a database or to quickly reorder columns before embarking on analysis.
To extract the seventh, first and second columns from the file, in that order, it’s as simple as:
csvcut -c 7,1,2 filename
csvsql: Send in a csv, and it returns a CREATE TABLE statement for your SQL database. The first time I ran this and saw the result, I did a double-take of pure joy. Then I got slightly depressed thinking about times I wrote code to import 256-column csv files into SQL Server. No more. You just type a statement like this:
csvsql -i postgresql filename
That produces a CREATE TABLE statement with syntax appropriate to PostgreSQL. Plenty of SQL flavors are available too — the utility uses SQLAlchemy’s dialect collection to offer syntax options for SQL Server, MySQL, Oracle and others.
Another killer feature: You can add an
"inserts" argument to have csvsql generate a SQL INSERT statement for each row of the CSV. Having been flummoxed by SQL Server’s import wizard more than once, I can tell you that inserting data by row is a great alternative, especially if you’re trying to isolate a problem row.
csvstat: Returns basic descriptive statistics for each column in the file. Results include overall row count, the data type for each column, and descriptives including min, max, sum, median, most frequent values, etc. Very handy for a quick read on what you have in the file.
Those three jumped out at me, but there are more. Other utilities will convert files to csv, output a csv as JSON, or merge, clean and stack files. The fact you can pipe output from one utility to another creates a powerful scenario.
It’s great work and an example of the kinds of tools journalists can build to deal with common problems we face. I’ll be watching this develop with great anticipation.