NICAR ’14: Getting Started With Python

For a hands-on intro to Python at IRE’s 2014 NICAR conference, I put together a Github repo with code snippets just for beginners.

Find it here: https://github.com/anthonydb/python-get-started

For more Python snippets I’ve found useful, see:
https://github.com/anthonydb/python-snippets

Finally, if you’d like an even deeper dive, check out journalist-coder Tom Meagher’s repository for the Python mini bootcamp held at this year’s conference.

Thanks to everyone who showed up!

Setting up Python in Windows 8.1

One of my family members recently acquired a Windows 8.1 laptop, and I was curious whether Python setup would be as easy as it was when I wrote about installing it on Windows 7. Turns out, it is — and not much different. That could spawn a whole conversation about Windows OS development, but that’s for another day …

Here’s your quick guide, modified from my earlier Win 7 post:

Set up Python on Windows 8.1

1. Visit the official Python download page and grab the Windows installer. Choose the 32-bit version. A 64-bit version is available, but there are compatibility issues with some modules you may want to install later.

Note: Python currently exists in two versions, the older 2.x series and newer 3.x series (for a discussion of the differences, see this). This tutorial focuses on the 2.x series.

2. Run the installer and accept all the default settings, including the "C:\Python27" directory it creates.


FTP with Python: Quick file stashing

Recently at work, our team[1] built an app to track Atlantic and Pacific tropical storms during hurricane season. To keep the map current — and avoid manual processing — we run a Python script at regular intervals to check shapefiles at the National Hurricane Center and convert them to GeoJSON. After conversion, we push the GeoJSON to our cached storage environment. Using flat files keeps things fast and tidy.

To publish the files, we ship via FTP. A colleague told me about Python’s built-in FTP support, and I was happy with how easy it was to write a script to handle the task.

Here’s a sample script that logs into an FTP server and uploads all files in a local directory. Grab it, along with other useful bits, at my python-snippets repository on GitHub.

# Store all files in a local directory to a server
# using FTP
 
from ftplib import FTP
import os
 
# ftp settings
settings = {
    'ftp': {
        'url': 'ftp.some-server.com',
        'username': 'your-account-name',
        'password': 'your-password',
        'remote-directory': '/path/to/files'
    }
}
 
# local paths
paths = {
    'local-directory': 'my-files/'
}
 
# list of local files
files = os.listdir(paths['local-directory'])
 
# connect once, then store each file
ftp = FTP(settings['ftp']['url'])
ftp.login(settings['ftp']['username'], settings['ftp']['password'])
ftp.cwd(settings['ftp']['remote-directory'])

for f in files:
    with open(paths['local-directory'] + f, 'rb') as local_file:
        ftp.storbinary('STOR ' + f, local_file)

ftp.quit()

[1] Python hackery by Andrea Fuller, mapping by Juan Thomassie, design/production direction by Kristin DeRamus and Maureen Linke. I rode devops/producer this time around.

Python: Pull Twitter, Facebook User Data

Updated July 4, 2013: The Twitter portion of this post has been revised to reflect Twitter’s retirement of v1 of its API and its move to v1.1. Since writing the original, I also discovered the excellent python-twitter library, which handles the now-required authentication and provides a handy wrapper around the API itself.

Original post (updated):

The APIs from the two social media giants, Twitter and Facebook, offer plenty of possibilities for data gathering and analysis. From tweets and status messages to numbers of followers and friends, photos, locations and more, there’s a lot of information waiting.

Given my nascent interest in Python, I decided to explore the APIs via some simple scripts, fetching Twitter profile and Facebook page data and writing the values to a SQLite database.

These examples are simple but offer a framework for you (and me) to build upon. SQLite support is built into Python, but for the Facebook script you must install the Requests library if you don’t have it.

Facebook Page Data

This script (available on Github) pulls the number of “likes” and “talking about this” for each Facebook page specified in the list called names_list. It creates a SQLite database called social_data.db if none exists and also creates a table to hold the data.

# Fetch Facebook page metrics via Social Graph API into a SQLite DB
# Grabs the number of likes and "talking about" numbers
 
import requests
import sqlite3
import os
from datetime import datetime
 
# These are the accounts for which you will fetch data
names_list = [
    'fallingskies',
    'usatoday'
]
 
# API base URL
base_url = 'https://graph.facebook.com/'
 
# Function to add row to accounts table
def insert_db(handle, likes, talking):
    conn = sqlite3.connect('social_data.db')
    cur = conn.cursor()
    cur.execute('''
        INSERT INTO fbaccounts VALUES (?,?,?,?);
        ''', (datetime.now(), handle, likes, talking))
    conn.commit()
    conn.close()
 
# Create the database file if it doesn't exist
# (sqlite3.connect creates the file when it's missing)
if not os.path.exists('social_data.db'):
    conn = sqlite3.connect('social_data.db')
    conn.close()
 
# Create the table if it's not in the db
conn = sqlite3.connect('social_data.db')
cur = conn.cursor()
cur.execute('''CREATE TABLE IF NOT EXISTS fbaccounts 
    (FetchDate Date, Handle Text, Likes Integer, Talking Integer)
    ''')
conn.commit()
conn.close()
 
# Iterate over handles and hit the API with each
for user in names_list:
    url = base_url + user 
    print 'Fetching ' + user
    response = requests.get(url)
    profile = response.json()
    handle = profile['name']
    likes = profile['likes']
    talking = profile['talking_about_count']
    insert_db(handle, likes, talking)

Twitter profile data

This script (also on Github) uses the python-twitter library to fetch some basic profile data — screen name, followers and description — into a SQLite DB. To get the keys and access tokens now required by version 1.1 of the Twitter API, you’ll need to register an application under your profile. Start at https://dev.twitter.com/
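
For a quick sense of what the library gives you, here’s a minimal sketch of the kind of call the script makes (this assumes python-twitter is installed; the key and token strings and the sample screen name below are placeholders):

# Minimal python-twitter sketch: fetch a few profile fields
# (the keys, tokens and screen name here are placeholders)

import twitter

api = twitter.Api(consumer_key='YOUR_CONSUMER_KEY',
                  consumer_secret='YOUR_CONSUMER_SECRET',
                  access_token_key='YOUR_ACCESS_TOKEN',
                  access_token_secret='YOUR_ACCESS_TOKEN_SECRET')

# GetUser returns a User object carrying the profile fields we want
user = api.GetUser(screen_name='usatoday')
print user.screen_name, user.followers_count, user.description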

Get JSON from Excel using Python, xlrd

Powering interactive news applications off flat files rather than a call to a database server is an option worth considering. Cutting a production database and data access layer out of the mix eliminates a whole slice of complexity and trims development time. Flat files aren’t right for every situation, but for small apps they’re often all you need.

These days, most of the apps I help build at Gannett Digital consume JSON. Simpler apps — such as the table/modal displays we deployed in February for our Oscar Scorecard and Princeton Review Best Value Colleges — run off one or two JSON files. The more complex — e.g., last year’s Ghost Factories: Poison in the Ground — run off hundreds of them. Updating content is as easy as generating new JSON and uploading it to our file server.

I wrote last year how to use Python to generate JSON files from a SQL database. Lately, I’ve been using Python to make JSON out of Excel spreadsheets.

The key ingredient is the Python library xlrd. It’s part of a suite of Excel-related tools available from www.python-excel.org, including the xlwt library to create Excel files.

(Another choice is openpyxl, which has similar features and works with the newer .xlsx Excel format. I’ve used both with equal success. Take your pick; a quick openpyxl comparison follows the xlrd snippets below.)

Basic xlrd operations

Let’s say we have an Excel workbook containing a small table repeated over three worksheets. The table in each case looks like this:

[Table image: car-id, make, model and miles columns for a handful of cars]

Here are some snippets of code — just scratching the surface — to interact with it programmatically:

# Demonstrates basic xlrd functions for working with Excel files
# (Excel 97-2003)
 
import xlrd
 
# Open the workbook
wb = xlrd.open_workbook('excel-xlrd-sample.xls')
 
# Print the sheet names
print wb.sheet_names()
 
# Get the first sheet either by index or by name
sh = wb.sheet_by_index(0)
 
# Iterate through rows, returning each as a list that you can index:
for rownum in range(sh.nrows):
    print sh.row_values(rownum)
 
# If you just want the first column:
first_column = sh.col_values(0)
print first_column
 
# Index individual cells:
cell_c4 = sh.cell(3, 2).value
# Or you can use:
#cell_c4 = sh.cell(rowx=3, colx=2).value
print cell_c4
 
# Let's say you want the same cell from x identical sheets in a workbook:
x = 2
while x >= 0:
    sh = wb.sheet_by_index(x)
    cell_x = sh.cell(2, 3).value
    print cell_x
    x = x - 1
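
For comparison, here’s a rough sketch of the same kinds of reads in openpyxl (this assumes a .xlsx copy of the workbook; the file name below is a placeholder, and method names vary a bit between openpyxl versions):

# Rough openpyxl equivalent of the xlrd basics above
# (works on .xlsx files; the file name is a placeholder)

from openpyxl import load_workbook

wb = load_workbook('excel-openpyxl-sample.xlsx')

# Print the sheet names and grab the first worksheet
print wb.sheetnames
sh = wb[wb.sheetnames[0]]

# Iterate through rows, pulling each cell's value into a list
for row in sh.iter_rows():
    print [cell.value for cell in row]

# Index an individual cell by its spreadsheet-style reference
print sh['C4'].value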

From Excel to JSON

Pretty cool stuff. Now, let’s convert our sample spreadsheet to JSON. I’ll borrow some of the techniques I discussed when outlining how to use Python to build JSON from a SQL database:

import xlrd
from collections import OrderedDict
import simplejson as json
 
# Open the workbook and select the first worksheet
wb = xlrd.open_workbook('excel-xlrd-sample.xls')
sh = wb.sheet_by_index(0)
 
# List to hold dictionaries
cars_list = []
 
# Iterate through each row in worksheet and fetch values into dict
for rownum in range(1, sh.nrows):
    cars = OrderedDict()
    row_values = sh.row_values(rownum)
    cars['car-id'] = row_values[0]
    cars['make'] = row_values[1]
    cars['model'] = row_values[2]
    cars['miles'] = row_values[3]
 
    cars_list.append(cars)
 
# Serialize the list of dicts to JSON
j = json.dumps(cars_list)
 
# Write to file
with open('data.json', 'w') as f:
    f.write(j)

Here’s the breakdown: We open the workbook, select the sheet and iterate through the available rows (which xlrd conveniently tallies in its nrows attribute).

Add each cell to a key/value pair in a dictionary, then add each dictionary to a list. Dump the list to JSON and write to a file.

Of course, a spreadsheet this simple doesn’t need a Python script to make its way to JSON. Just use Mr. Data Converter for something like this. But as soon as your JSON requirements gain complexity, the ability to use Python to nest nodes, build strings and transform data on the fly makes this approach very appealing.

Catch the Cat: A Simple Python Game

New cat in the house? Check.

Son interested in programming? Check.

Dad eager to learn more Python? Check.

Add it up and you get a little project the two of us worked up over a couple of nights this summer — a Python-based text game called Catch the Cat. You can download the code from its repository on Github. To use it, from a command prompt run:
 

python catchthecat.py

The game’s simple: The cat’s on the loose in the house, and you have to catch it. Every time you move to a new room, the cat moves too — at random. To catch the cat, you both have to arrive in the same place.
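
The real game adds a SQLite leaderboard and cat/player classes, but the core loop boils down to something like this sketch (the room names and messages here are made up for illustration, not lifted from the repo):

# Illustrative core loop only -- not the actual Catch the Cat code
import random

rooms = ['kitchen', 'living room', 'bedroom', 'bathroom', 'garage']
cat_room = random.choice(rooms)
moves = 0

while True:
    player_room = raw_input('Which room do you search? ').strip().lower()
    if player_room not in rooms:
        print 'No such room. Try one of: ' + ', '.join(rooms)
        continue
    moves += 1
    # The cat moves at random every time you do
    cat_room = random.choice(rooms)
    if player_room == cat_room:
        print 'You caught the cat in the ' + cat_room + ' after ' + str(moves) + ' moves!'
        break
    print 'No cat in the ' + player_room + '. Keep looking...'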

OK, so we’re not going to unseat EA Sports here, but we met our goals: we had fun and learned some tricks. While I was using the exercise to help my son work through the logic choices required in game play, I got to add a few things to my Python toolkit:

– Using Python’s SQLite integration to build a leaderboard database.

– Writing classes for the cat and player objects.

Have you written Python-based games? Suggestions for improvement here?

12 Tangents Later, I Publish a Django Site

Last week, I deployed my first live Django app. Time from start to finish: three years.

Cue the sound of snickers and a thousand eye-rolls. Go ahead. But I confess: From the moment I said, “I want to build something using Django” to the moment I restarted Apache on my WebFaction server and watched the site load for real in my browser, 36 months passed through the hourglass of time.

You see, I got diverted along the way. I’ll tell you why. But first, two things:

1. Learning is wonderful, thrilling, maddening and rewarding. If you’re a journalist and want to see new worlds, let me encourage you to take a journey into code.

2. The site is right here and the code is here. It falls way short in the Awesome Dept., and it will not save journalism. But that’s not why I built it, really.

* * *

The tale began March 2009 in Indianapolis at the Investigative Reporters and Editors Computer-Assisted Reporting conference. That’s the annual data journalism hoedown that draws investigative journalists, app coders and academics for a couple of days of nerdish talk about finding and telling stories with data.

Generate JSON From SQL Using Python

Let’s say you want to generate a few hundred — or even a thousand — flat JSON files from a SQL database. Maybe you want to power an interactive graphic but have neither the time nor the desire to spin up a server to dynamically generate the data. Or you think a server adds one more piece of unnecessary complexity and administrative headache. So, you want flat files, each one small for quick loading. And a lot of them.

A few lines of Python is all you need.

I’ve gone this route lately for a few data-driven interactives at USA TODAY, creating JSON files out of large data sets living in SQL Server. Python works well for this, with its JSON encoder/decoder offering a flexible set of tools for converting Python objects to JSON.

Here’s a brief tutorial:

1. If you haven’t already, install Python. Here’s my guide to setup on Windows 7; if you’re on Linux or Mac you should have it already.

2. In your Python script, import a database connector. This example uses pyodbc, which supports connections to SQL Server, MySQL, Microsoft Access and other databases. If you’re using PostgreSQL, try psycopg2.

3. Create a table or tables to query in your SQL database and write and test your query. In this example, I have a table called Students that has a few fields for each student. The query is simple:
 

SELECT ID, FirstName, LastName, Street, City, ST, Zip
FROM Students

4. Here’s an example script that generates two JSON files from that query. One file contains JSON row arrays, and the other JSON key-value objects. Below, we’ll walk through it step-by-step.
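
In condensed form, the approach looks roughly like this (a sketch only: the pyodbc connection string, output file names and field labels below are placeholders, not the exact script):

# Condensed sketch: query SQL Server with pyodbc, write two JSON files
# (connection string, file names and field labels are placeholders)

import pyodbc
import json
from collections import OrderedDict

# Connect and run the query
conn = pyodbc.connect('DRIVER={SQL Server};SERVER=localhost;'
                      'DATABASE=School;Trusted_Connection=yes')
cur = conn.cursor()
cur.execute('SELECT ID, FirstName, LastName, Street, City, ST, Zip FROM Students')
rows = cur.fetchall()
conn.close()

# One file of JSON row arrays ...
with open('students_rows.json', 'w') as f:
    json.dump([list(row) for row in rows], f)

# ... and one file of JSON key/value objects
fields = ['id', 'first', 'last', 'street', 'city', 'st', 'zip']
with open('students_objects.json', 'w') as f:
    json.dump([OrderedDict(zip(fields, row)) for row in rows], f)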

Scraping CDC flu data with Python

Getting my flu shot this week reminded me about the weekly surveillance data the Centers for Disease Control and Prevention provides on flu prevalence across the nation. I’d been planning to do some Python training for my team at work, so it seemed like a natural fit to write a quick Python scraper that grabs the main table on the site and turns it into a delimited text file.

So I did, and I’m sharing. You can grab the code for the CDC-flu-scraper on Github.

The code uses the Mechanize and BeautifulSoup modules for web browsing and html parsing, respectively. Much of what I demonstrate here I started learning via Ben Welsh’s fine tutorial on web scraping.
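
The pattern, in rough sketch form (this assumes the bs4 flavor of BeautifulSoup; the URL and output file name below are placeholders, not the scraper’s actual ones):

# Rough sketch of the mechanize + BeautifulSoup pattern
# (the URL and output file name are placeholders)

import mechanize
from bs4 import BeautifulSoup

# Fetch the page with mechanize
br = mechanize.Browser()
br.set_handle_robots(False)
html = br.open('http://www.cdc.gov/flu/weekly/').read()

# Parse the main table and write pipe-delimited rows
soup = BeautifulSoup(html)
table = soup.find('table')

with open('flu_data.txt', 'w') as f:
    for tr in table.find_all('tr'):
        cells = [cell.get_text(strip=True) for cell in tr.find_all(['th', 'td'])]
        f.write('|'.join(cells) + '\n')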

We’re still early in flu season, but if you watch this data each week you’ll see the activity pick up quickly.

Update 10/22/2011: Ben Welsh has contributed to this scraper, adding JSON output and turning it into a function. Benefits of social coding 101 …

Setting up Python in Windows 7

An all-wise journalist once told me that “everything is easier in Linux,” and after working with it for a few years I’d have to agree — especially when it comes to software setup for data journalism. But …

Many newsroom types spend the day in Windows without the option of Ubuntu or another Linux OS. I’ve been planning some training around Python soon, so I compiled this quick setup guide as a reference. I hope you find it helpful.

Set up Python on Windows 7

Get started:

1. Visit the official Python download page and grab the Windows installer. Choose the 32-bit version. A 64-bit version is available, but there are compatibility issues with some modules you may want to install later. (Thanks to commenters for pointing this out.)

Note: Python currently exists in two versions, the older 2.x series and newer 3.x series (for a discussion of the differences, see this). This tutorial focuses on the 2.x series.

2. Run the installer and accept all the default settings, including the "C:\Python27" directory it creates.


csvkit: A Swiss Army Knife for Comma-Delimited Files

If you’ve ever stared into the abyss of a big, uncooperative comma-delimited text file, you’ll quickly appreciate the value and potential of csvkit.

csvkit is a Python-based Swiss Army knife of utilities for dealing with, as its documentation says, “the king of tabular file formats.” It lets you examine, fix, slice, transform and otherwise master text-based data files (not just the comma-delimited variety its name implies, but tab-delimited and fixed-width files as well). Christopher Groskopf, lead developer on the Knight News Challenge-winning Panda project and recently a member of the Chicago Tribune’s news apps team, is the primary coder and architect, but the code’s hosted on Github and has a growing list of contributors.

As of version 0.3.0, csvkit comprises 11 utilities. The documentation describes them well, so rather than rehash it, here are highlights of three of the utilities I found interesting during a recent test drive: