Packing The Stands at Nationals Park

Bryce Harper’s two home runs and Stephen Strasburg’s seven scoreless innings rightly earned the headlines in Monday’s opening-day win for the Washington Nationals. But the data journalist in me couldn’t help but want to apply a little percent change action to the proceedings.

So, I did, especially after I noticed in the boxscore that Nationals Park, on a Monday, was jammed to 108% capacity. A few minutes of research and Excel later, I had some findings:

  • The Nats’ opening day attendance of 45,274 was 11% higher than the team drew at last year’s home opener, when they beat the Reds 3-2 in 10 innings.
  • The attendance wasn’t a record for Nationals Park, but it was close — about 700 below the record set on the last, heartbreaking (if you’re a Nats fan) game of the 2012 NL division series against the St. Louis Cardinals.
  • The day was, however, a regular-season record for the park, which opened in 2008.

After I figured this out, I went a-Googling to see if anyone else had the same scoop. Didn’t find the percent change, but I did see a mention of the attendance record in a post on We Love DC and a mention on a MASN Sports blog that didn’t qualify it against the post-season record. Nothing from major sports media (please comment below if I missed some).

I’m not about to play a baseball writer — it’s one of the few jobs I haven’t had in journalism — but the basics of ballpark attendance gets too little attention, I think. And yet the money flowing through the turnstiles means a lot for a team and a city, as does the mental boost for the players who hear the cheers.

Can the Nats keep it up? I’ll be watching this chart at baseball-reference.com.

 

Excel: Extract text with FIND and MID

Data analysis begins with usable data, and that means every piece organized nicely into its own field where we can count, sort and otherwise test it out.

What if you get a spreadsheet where the pieces of data are all packed in one field? Say, something like this (which I cobbled together from Major League Baseball data in honor of the Nationals’ first playoff appearance):

NAME: Sean Burnett POS: RP AGE: 30 WT: 200 BORN: Dunedin, FL SALARY: 2350000
NAME: Tyler Clippard POS: RP AGE: 27 WT: 200 BORN: Lexington, KY SALARY: 1650000
NAME: Ross Detwiler POS: SP AGE: 26 WT: 174 BORN: St. Louis, MO SALARY: 485000
NAME: Christian Garcia POS: RP AGE: 27 WT: 215 BORN: Miami, FL SALARY: N/A
NAME: Gio Gonzalez POS: SP AGE: 27 WT: 205 BORN: Hialeah, FL SALARY: 3335000
NAME: Mike Gonzalez POS: RP AGE: 34 WT: 215 BORN: Robstown, TX SALARY: N/A
NAME: Ryan Mattheus POS: RP AGE: 28 WT: 215 BORN: Sacramento, CA SALARY: 481000
NAME: Craig Stammen POS: RP AGE: 28 WT: 200 BORN: Coldwater, OH SALARY: 485000
NAME: Drew Storen POS: RP AGE: 25 WT: 180 BORN: Indianapolis, IN SALARY: 498750
NAME: Jordan Zimmermann POS: SP AGE: 26 WT: 218 BORN: Auburndale, WI SALARY: 2300000

Let’s say you want to extract the city of each player’s birth into a separate field. The varying length of each player’s name means the birth place isn’t always in the same position in the string, so a typical text-to-columns operation won’t work. So, how to do it?

The answer lies in two very handy Excel functions: FIND and MID.

FIND locates characters you specify and returns its numeric place in the string.

MID returns X characters from a string beginning at a location X you specify.

For example, we can locate the position where each city name begins by using FIND to locate the string “BORN:” in each cell. The city name itself always starts six characters after the position of that string, so we add six to the result:

=FIND("BORN:",A2)+6

In the first row above, the functions returns 50. In the second row, 52. We’ll feed that value to the MID function as the starting point for our extraction.

MID takes three arguments: Text or cell, position to start, number of characters to return. So, we use the above FIND function as the second argument and, for now, extract 10 characters:

=MID(A2,FIND("BORN:",A2)+6,10)

That gets us part of the way there. We’re starting in the right spot, but 10 characters isn’t always the length of the city and state, so it leads to choppy results:

Dunedin, F
Lexington,
St. Louis,
Miami, FL 
Hialeah, F
Robstown, 
Sacramento
Coldwater,
Indianapol
Auburndale

What we need to do is tell MID the exact number of characters to return each time even though the length of the city varies. We can figure this out using FIND again.
Continue…

Which web browsers do journalists favor?

After I started playing with Internet Explorer 9 tonight — and knowing that most developers, including Microsoft, want to wean the world from IE6 as soon as possible — I grew curious about the browsers favored by my site’s visitors. A quick dig into Google Analytics gave me the data for the last few months, and the Google Charts API let me build a quick pie:

Site visits by browser, November 2010-March 2011

I can’t know for sure, but I suspect that most people who read my site are journalists or developers. Most traffic comes from links I post on Twitter or via search keywords that tend toward journalism, data, math and, lately, the Census.

Generally, you’re not an IE-centric crowd — just 12%. That’s lower than overall metrics, which tend to place Internet Explorer at anywhere from 40% or more of the overall market.

Oh, and the percent using IE6? Less than 0.4%.

Story hunting in birth, death data

Note to readers: This post, originally published in 2010, has been updated and recast to reflect U.S. data available as of May 2014.

* * *

The U.S. government’s annual count of births and deaths is among the most basic of demographics, but tracking it is one of my little obsessions. I keep annual totals in a spreadsheet and get all gooey inside when I can add another year to the pile.

In May 2014, for example, the National Center for Health Statistics released data showing the number of births in the U.S. had leveled off after several years of decline. In my mind, a yearly change like that in a longitudinal data set certainly is worth covering.

But sometimes these basic demographics — including Census data — reveal even more when we take a long-term view.

For example, below are the annual number of births and deaths from 1933 to 2013 plotted via the Google Charts API. Hover over the lines for the data:

It’s simple — just two fever lines. But it’s chock full of generational milestones that bear watching:

  • The first baby boomers — those born in 1946 — will turn 70 starting in January 2016.
  • The Gen Xers that follow are marching toward middle age, now in their early 30s to just about 50. (Gen X poster boy Eddie Vedder of Pearl Jam hits the half-century mark in 2014.)
  • Meanwhile, the first of the Millennials — the “echo boomers” whose numbers peaked in 1990 — have passed age 30.

Each generation brings a new sensibility to the stages of life, and the relative size and makeup of each one — not to mention its cultural context — gives journalists plenty of opportunity for storytelling. Two examples:

  • Much has been written about the big bump of post-World War II babies marching closer to retirement (maybe), Social Security, and the years where health care becomes a major concern. But what about the inevitable? Notice that the number of deaths in the U.S. has plateaued at about 2.5 million a year. That won’t last long with Boomers heading into the years where death rates rise dramatically. How will 4 million deaths annually affect the funeral home business, the ability to buy a cemetary plot, and the overall industry around end-of-life care?
  • Along with Gen X came the “baby bust,” the years of rapidly declining birth rates that led to all kinds of prognostications about the shrinking of America. That means our workforce now has a relative shortage of thirtysomethings. Does that mean more opportunity for Millennials to advance in the business world and less pressure for boomers to retire?

These sorts of trends are slow-burning, but they reflect data trends that exert hidden but massive force on our culture, much like the tides. The savvy data journalist keeps an eye on them not just for what they say this year but what they reveal over time.

Sorting Data in Excel: Simple Analysis

Sorting a data set helps answer a basic question journalists like to ask: “Which ____ has the highest (or lowest) ______?”

Excel (and other spreadsheets such as the open source Calc) make sorting data easy. In fact, I often make sorting my first step when “interviewing” data because it quickly reveals high and low values and often highlights some that may seem questionable.

Let’s work through a simple sort in Excel. I’ll be using Excel 2007, but older versions have similar functions. Start by downloading the file “sorting.xls” and saving it to your computer. Open it and follow along:

1. We have a table of Census data from the 2006-2008 American Community Survey. It shows the median age of the population for each of 79 school districts in Virginia plus the state itself.

We want to know which district has the oldest and youngest populations. Let’s sort it!

2. Click once on one cell anywhere in the table. This will help Excel auto-discover your table in the next step.

Continue…

Mean vs. Median: A Beginner’s Guide

A common way to summarize a group of numbers — one most of us learned in grade school — is to find its mean, commonly called the average. But it’s not always the best measure.

Let’s say six kids go on a field trip, ages 10, 11, 10, 9, 13 and 12. It’s easy to add the ages and divide by six to get the group’s average age:
 

(10 + 11 + 10 + 9 + 13 + 12) / 6 = 10.8

Because all the ages are close, the average of 10.8 gives us a good picture of the group as a whole. But averages are less helpful when the values are skewed toward one end or if they include outliers.

For example, what if we add a much older chaperone to our field trip? With ages of 10, 11, 10, 9, 13, 12 and 46, the average age of the group rises considerably:
 

(10 + 11 + 10 + 9 + 13 + 12 + 46) / 7 = 15.9

Now the mean is not an accurate representation. The outlier skews the average, and no journalist should feel comfortable reporting it.

This is where calculating a median is handy. The median is the midpoint in an ordered list of values — the point at which half the values are higher and half lower. If the median household income in East Middletownburg is $50,000, then half the households earn more and half less.

Continue…

26,500 school cafeterias uninspected

Thousands of school cafeterias went uninspected in the 2007-08 school year, we report today in the fourth major installment of our “Trouble on the Tray” investigation into school lunch safety.

In today’s story, reporters Blake Morrison and Peter Eisler worked with me to examine data on the number of schools in each state that met a federal requirement to have two cafeteria inspections annually. We found that in eight states, more than half of schools reporting failed to meet that standard in 2006-07 and 2007-08 school years.

Meanwhile, the series continues to draw attention on Capitol Hill. This week, Sen. Kirsten Gillibrand, D-N.Y., called on the federal government to increase its standards for meat used in school lunches and to cut contracts with companies that repeatedly did not meet standards.

When chickens stop laying eggs …

In part three of USA TODAY’s investigation into the quality of government-bought food for school lunches, we examine how its standards for microbial testing of school lunch beef are less stringent than those employed by fast food chains such as McDonald’s, Jack In the Box and KFC.

We also write about “spent hens” — birds that are past their egg-laying prime. Tough and stringy, these old birds typically are turned into pet food or compost. But egg producers struggling to find a market for all of them have had help from the federal government:

From 2001 though the first half of 2009, USA TODAY found, the government spent more than $145 million on spent-hen meat for schools — a total of more than 77 million pounds served in chicken patties and salads. Since 2007, 13.6 million pounds were purchased.

Both stories were heavily informed by analysis of data sets obtained from the USDA under the Freedom of Information Act. They included hundreds of thousands of orders from a federal inventory system and about 150,000 results of microbial tests of beef destined for school lunches.

Prompted in part by our series but also by last week’s recall of beef by a company we identified in Part Two of our series, one lawmaker has called for the government to investigate a supplier to the school lunch program. From another story of ours today:

Rep. Rosa DeLauro, D-Conn., called on the U.S. Department of Agriculture to “undertake a comprehensive examination” of the facility, Beef Packers, to “identify and correct any major problems” before it produces more beef.

To see all the stories in our series thus far, click here.

Project: School lunch safety

Today, my colleagues Blake Morrison, Peter Eisler and I published the second part of our investigation into the safety of food used in the National School Lunch Program. Today’s installment focuses on a California firm that kept receiving government contracts even after  it had been suspended from the program several times — twice because of failure to produce ground beef that was free of salmonella.

When the firm, Beef Packers, recalled beef last summer because of a salmonella outbreak in 11 states, the government decided not to recall beef made for school lunches that the company made around the same time:

The recall, announced by the government Aug. 6, covered only ground beef sent to certain retailers. In the days after it was announced, government and company spokesmen said meat sent to schools was not included. Documents obtained by USA TODAY through the Freedom of Information Act reveal a more complicated story — one that raises questions about whether the government took adequate steps to ensure that meat it bought for schoolchildren during the same period was safe.

To get at the story, we filed FOIA requests for several government data sets. They included the results of hundreds of thousands of microbial tests conducted by the USDA as well as a dump from an inventory system the government uses to track orders for the school lunch program.

Update, 12/7/2009: Morrison and Eisler report that Beef Packers issued its second recall this year for beef tainted with salmonella.

Bookshelf: Numbers in the Newsroom

Never underestimate the value of a compact guide to math, especially if you’re one of those journalists who thought  you could avoid numbers by becoming a writer. You shouldn’t — understanding numbers will help you get stories  others miss because of innumeracy.

One of the handiest resources I’ve found — and recommended just this week to a roomful of colleagues — is Sarah Cohen’s “Numbers in the Newsroom.” It’s a 108-page guide that covers the basics on percent change, rates, graphics, probability and much more. Cohen is a Pulitzer-winning former Washington Post staffer and one-time training director for Investigative Reporters and Editors. She’s now at Duke University, where she is the Knight Professor of the Practice of Journalism and Public Policy.

The book is a few years old, but its lessons are timeless. You can pick it up through IRE’s online store.

Have your own math book recommendations? List them below …

Project: NCAA football coach contracts

This morning, my colleagues and I at USA TODAY launch part one of a three-day series on NCAA football coaches contracts. The centerpiece is a database analysis by my teammate Jodi Upton, who worked with the sports staff to collect and analyze hundreds of documents. My contribution is the database programming behind our interactive graphic.

Among the key findings, straight from today’s lead story:

— At least 25 college head football coaches are making $2 million or more this season, slightly more than double the number two years ago.

— The average pay for a head coach in the NCAA’s top-level, 120-school Football Bowl Subdivision is up 28% in that time and up 46% in three years, to $1.36 million.

— Our first look at the salaries of assistant coaches finds many approaching and even exceeding presidents’ compensation and most eclipsing that of full professors.

Adjusting for inflation: A beginner’s guide

When Daniel Craig hit theaters last year in Quantum of Solace, the 22nd film in the James Bond spy series, his ability to dispatch bad guys (and charming good looks, no doubt) helped it earn $168.4 million. That was enough to rank Solace among the top 10 grossing films of 2008.

But how did Solace fare against the rest of the Bond canon, which stretches back to 1963’s Dr. No? The answer depends on whether you adjust for inflation.

We all know that the price of a loaf of bread isn’t what it used to be. The cost of consumer goods tends to rise each year, except during downturns or various calamities. So, taking inflation (or deflation) into account is the only way to  meaningfully compare dollar amounts over time.

There are plenty of apps just for this. The Bureau of Labor Statistics offers one basic calculator, and there’s another at this site. They’re fine for a quick check, but I’d rather do my own calculations. A web app might not have the latest data. And if you’re adjusting more than a couple of amounts, using a spreadsheet will save time. Here’s an exercise from Bond-land:

Continue…

Percent change: Know the formula

Here’s a question I posed to some college students recently:

Let’s say you cover the Town of East Middleburgtown. The mayor announces that this year’s town budget comes in at $12.6 million. Last year’s budget was $11.4 million. What is the percent change? Better yet, what’s the formula for figuring it out?

If you don’t know the answer, or how to obtain it, you’re not alone. This kind of problem — which is in my son’s 7th grade math textbook — routinely stumps most journalists in most of the newsrooms across America.

I’ll avoid the temptation to moralize here. If you’re a journalist — if you have a pulse — you need to know this very basic operation. With it, you’ll have the power to analyze all kinds of data and even double-check the mayor’s math.

Here it is:
 

(the_new_number - the_original_number) / the_original_number

or, in the case of East Middletownburg:
 

(12.6-11.4) / 11.4

Remember (and you learned this in fifth grade) that operations in parentheses come first. That gives you this:
 

1.2 / 11.4 = .105 = 10.5%

So, the mayor’s new budget is a 10.5% increase over last year’s. Now you have something to write about!

Excel: Combine text and formulas in a cell

Whenever I analyze data in Excel, I format the spreadsheet to make it easier to read. A little attention to fonts, boxes and shading can help people understand the key data faster.

One way to give yourself some flexibility with formatting is to combine text and the results of a formula in a single cell. Just use the “&” operator to concatenate the text and the formula.

Consider this formula:

="Quantity: "&SUM(A1:A20)

Enter it into a cell, press enter and (assuming you have numeric values in cells A1 through A20) it will present this result in a single cell:

Quantity: 23

That kind of output’s pretty handy when you want to create a worksheet in your spreadsheet that aggregates data from other sheets while keeping the formatting simple.