csvkit: A Swiss Army Knife for Comma-Delimited Files

If you’ve ever stared into the abyss of a big, uncooperative comma-delimited text file, it won’t take long to appreciate the value and potential of csvkit.

csvkit is a Python-based Swiss Army knife of utilities for dealing with, as its documentation says, “the king of tabular file formats.” It lets you examine, fix, slice, transform and otherwise master text-based data files (and not only the comma-delimited variety, as its name implies, but tab-delimited and fixed-width as well). Christopher Groskopf, lead developer on the Knight News Challenge-winning Panda project and recently a member of the Chicago Tribune’s news apps team, is the primary coder and architect, but the code’s hosted on Github and has a growing list of contributors.

As of version 0.3.0, csvkit comprises 11 utilities. The documentation describes them well, so rather than rehash it, here are highlights of three of the utilities I found interesting during a recent test drive:

csvcut: Henceforth, this utility will likely meet every csv file I get. To start, it will describe the file contents for me: If I want a quick scan of the column names and their order in the file, I just type:

csvcut -n filename

The output is an indexed list of column names, assuming the first row of the file is a header row. If not, you get the first row of data, which can be handy as well.

Still, the “cut” part of this utility is its killer feature, extracting columns from the file in the order you choose. You might use this to subset the file before importing to a database or to quickly reorder columns before embarking on analysis.

To extract the seventh, first and second columns from the file, in that order, it’s as simple as:

csvcut -c 7,1,2 filename

Really nice.

csvsql: Send in a csv, and it returns a CREATE TABLE statement for your SQL database. The first time I ran this and saw the result, I did a double-take of pure joy. Then I got slightly depressed thinking about times I wrote code to import 256-column csv files into SQL Server. No more. You just type a statement like this:

csvsql -i postgresql filename

That produces a CREATE TABLE statement with syntax appropriate to PostgreSQL. Plenty of SQL flavors are available too — the utility uses SQLAlchemy’s dialect collection to offer syntax options for SQL Server, MySQL, Oracle and others.

Another killer feature: You can add an "inserts" argument to have csvsql generate a SQL INSERT statement for each row of the CSV. Having been flummoxed by SQL Server’s import wizard more than once, I can tell you that inserting data by row is a great alternative, especially if you’re trying to isolate a problem row.

csvstat: Returns basic descriptive statistics for each column in the file. Results include overall row count, the data type for each column, and descriptives including min, max, sum, median, most frequent values, etc. Very handy for a quick read on what you have in the file.

Those three jumped out at me, but there are more. Other utilities will convert files to csv, output a csv as JSON, or merge, clean and stack files. The fact you can pipe output from one utility to another creates a powerful scenario.

It’s great work and an example of the kinds of tools journalists can build to deal with common problems we face. I’ll be watching this develop with great anticipation.

7 Responses to “csvkit: A Swiss Army Knife for Comma-Delimited Files”

  1. [...] you’re already lost, read up on what it can do for you. It’s well worth your [...]

  2. Leslie says:

    Hi Anthony. Did you follow the csvkit tutorial? It uses unix piping and ‘head’ commands which don’t run on Windows. (I use Windows Vista 32-bit.) Instead of piping, I can redirect to a file. But what about reading the head of a file? Thanks.

  3. Anthony says:

    Leslie,

    If you want to emulate Linux command-line functionality on Windows, your first stop should be Cygwin:

    http://cygwin.com/index.html

  4. Iain Elder says:

    “The first time I ran this and saw the result, I did a double-take of pure joy. Then I got slightly depressed thinking about times I wrote code to import 256-column csv files into SQL Server.”

    So true.

    For me, csvsql is the killer feature.

    Everything else I can do one way or another, but I have to get the bloody data into the tables first!

    SQL Server, sqlite, PostgreSQL, MySQL all have their quirks and limitations for importing tabular data. I’m glad to discover something that will hide all that from me!

  5. Iain Elder says:

    What kind of “problem row” does csvsql help you detect?

    As of version 0.6.2, the –insert option doesn’t generate visible insert statements.

    I fed it a malformed CSV (missing delimiter) and it managed to load something into sqlite, but the result was a mess.

    $ echo '"a","b","c"
    > "1","2","3"
    > "4","5,"6"
    > "7","8","9"' > malformed.csv
    $ csvsql malformed.csv --insert --db 'sqlite:///malformed.db'
    $ sqlite3 malformed.db 
    sqlite> SELECT * FROM malformed;
    1|2|3
    4|5,6"|9

    I was expecting at least a warning or something.

  6. Anthony says:

    Iain,

    Did you try running that malformed.csv through CSVkit’s csvclean module?
    http://csvkit.readthedocs.org/en/latest/scripts/csvclean.html

    Nevertheless, your scenario is probably worth a bug report to the code’s author.

Leave a Reply