One of my old jokes is that newsroom IT puts the No in Innovation, so I’m always on the lookout for ways to get around them. And I’ve been playing around with a good one: The Raspberry Pi.
Unfamiliar with the Pi? The Model B Pi is a $35 computer that’s about the size of a deck of cards. It’s got an ethernet port, and you supply the hard drive in the form of an SD card, the keyboard, mouse and monitor. Now, for $35, you’re not getting a ton of horsepower, but for simple repetitive tasks it works great.
What kind of simple, repetitive tasks? Let’s pretend for a second that you wanted to set up a scraper that dumped data into a database every hour. Ideally, you’d have a server somewhere and you’d set up a task on it (I like using ‘nix’s cron for things like this) and off it would go, mindlessly gathering data for you and putting it into a database. You could then go about your life, stopping by from time to time to get that data and do whatever you’re going to do with it. So you ask newsroom IT for this and, of course, the answer is no. And no, we won’t give you the money to run this in the cloud for a few bucks a month either.
Enter the Pi.
For $35, you can write your scripts, put them in a cron job and off it’ll go, gathering your data for you. No need for a server, no need for a server administrator, no need to make sure your work computer stays on and running the whole time, just some elbow grease to get the script running and an ethernet connection to the internet.
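To make that concrete, here is a rough sketch of the kind of script cron could run every hour. It is not my actual scraper: the fetch_records function, the readings table and the file paths are all placeholders invented for illustration, and SQLite is just an assumption about where you might want the data to land.

```python
import sqlite3
from datetime import datetime, timezone

# A crontab entry along these lines would run the script at the top of
# every hour (both paths are assumptions about your setup):
#   0 * * * * /usr/bin/python3 /home/pi/scraper.py


def fetch_records():
    """Placeholder scraper: replace with code that pulls your real data."""
    return [("example_source", 42.0)]


def save_records(conn, records):
    """Append scraped rows to a table, stamped with the time of the run."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS readings "
        "(scraped_at TEXT, source TEXT, value REAL)"
    )
    now = datetime.now(timezone.utc).isoformat()
    conn.executemany(
        "INSERT INTO readings VALUES (?, ?, ?)",
        [(now, source, value) for source, value in records],
    )
    conn.commit()


def run(db_path="scraper.db"):
    """One scrape-and-store pass. End the real script with a call to run()."""
    conn = sqlite3.connect(db_path)
    try:
        save_records(conn, fetch_records())
    finally:
        conn.close()
```

Each run appends rows rather than overwriting them, so the database simply grows every hour until you stop by to collect it.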
I’ve had my Pi running a repetitive task for two weeks now and it’s plugging along without issue, having gathered 50,000 records without me having to do anything. In a month, I’ll have a dataset worth analyzing, and it will only ever cost me $35. And I can use it for other things as well.
A cheap scraper bot. Useful!
Things we have done today in my office:
- Stolen two cups of dirt from a construction site.
- Made a pot of mud with one cup of dirt.
- Microwaved the second cup of dirt in the faculty lounge.
- Measured the sensor output of totally dry dirt versus a soaking wet pot of mud.
- Used the point-slope form of a linear equation to determine the formula for the line between dry and wet. In a journalism school. And it worked.
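For the curious, the point-slope idea boils down to something like the sketch below. The readings of 200 for dry dirt and 800 for mud are invented numbers, not our measurements, and treating dry as 0 percent and soaking wet as 100 percent is my assumption about how you would want to scale things. The function names are made up for illustration.

```python
def make_moisture_scale(dry_reading, wet_reading):
    """Build a function that maps a raw sensor reading to percent moisture.

    The line runs through two calibration points:
    (dry_reading, 0) and (wet_reading, 100).
    """
    slope = (100.0 - 0.0) / (wet_reading - dry_reading)

    def to_percent(reading):
        # point-slope form: y - y1 = m * (x - x1), with (x1, y1) = (dry_reading, 0)
        return slope * (reading - dry_reading)

    return to_percent


# Hypothetical calibration values, for illustration only
to_percent = make_moisture_scale(dry_reading=200, wet_reading=800)
print(to_percent(500))  # a reading halfway between dry and wet
```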
More about this later, including what it’s all about and code.
When we last left off, we had a script that would loop through a list of data and write a news lead out of it. All that the script did was look at two numbers and decide if the crime rate went up or down and then wrote an appropriate sentence. Something like this:
Lexington police reported more violent crime in 2010 compared to 2009, according to federal statistics.
But, sometimes, just one year isn’t enough. Sometimes a city gets on a roll, crime goes down for several years in a row, and that’s noteworthy enough to change the lead. So, let’s do that. How?
Well, not to continue a trend here, but it’s really simple.
Here’s the code:
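# determine the duration of the trend (this goes inside the loop over cities)
# city[2] is the 2008 rate, city[3] is 2009 and city[4] is 2010
if city[4] > city[3] > city[2]:
    trend_length_clause = ", the second year in a row crime has increased"
elif city[4] < city[3] < city[2]:
    trend_length_clause = ", the second year in a row crime has declined"
else:
    trend_length_clause = ""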
What does that say? It’s simple greater than, less than logic. If 2010 (or city[4], the last number in each row of our data) is greater than 2009, which is greater than 2008, then crime has gone up two straight years. If you flip the sign, you get that crime has declined two straight years. So, if one of those conditions is true, then let’s set a variable called trend_length_clause to some words that work in our lead. If neither is true, we set it to an empty string. Then, add that to our lead generating code:
lead = "%s police reported %s violent crime in 2010 compared to 2009%s, according to federal statistics." % (clean_city, direction, trend_length_clause)
Our lead now says insert the city where the first %s is, the direction of the trend where the next %s is and finally jam that clause in. Since we set it to blank if crime didn’t go up or down for two consecutive years, jamming it in there will do nothing if those trends don’t exist. When you run the script, now you see leads like this:
Lincoln police reported more violent crime in 2010 compared to 2009, according to federal statistics.
Norfolk police reported more violent crime in 2010 compared to 2009, the second year in a row crime has increased, according to federal statistics.
That’s a little better. Still not award winning, but at least it varies it up based on the news.
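If you want to try it without piecing the snippets together, here is one way the whole thing could be wrapped up. Putting it in a function called write_lead is my choice for this sketch, not something the original script did. The two rows come straight from the data list.

```python
def write_lead(city):
    """Return a lead for one row: [name, state, rate_2008, rate_2009, rate_2010]."""
    clean_city = city[0].replace(" Police Dept", "").replace(" City", "")
    rate_2008, rate_2009, rate_2010 = city[2], city[3], city[4]

    # year over year direction
    if rate_2010 > rate_2009:
        direction = "more"
    elif rate_2010 < rate_2009:
        direction = "less"
    else:
        direction = "the same"

    # two years moving the same way earns an extra clause
    if rate_2010 > rate_2009 > rate_2008:
        trend_length_clause = ", the second year in a row crime has increased"
    elif rate_2010 < rate_2009 < rate_2008:
        trend_length_clause = ", the second year in a row crime has declined"
    else:
        trend_length_clause = ""

    return (
        "%s police reported %s violent crime in 2010 compared to 2009%s, "
        "according to federal statistics."
        % (clean_city, direction, trend_length_clause)
    )


rows = [
    ["Lincoln Police Dept", "NE", 509.6, 457.9, 486.9],
    ["Norfolk City Police Dept", "NE", 147.2, 148.5, 181.7],
]
for row in rows:
    print(write_lead(row))
```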
Let’s call it a lead for now. We need a second paragraph, one that starts putting some numbers to this trend. So that’s what we’ll do in the next post. We’ll write a second paragraph that does some percent change math and spells out the data a little further.
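As a preview, the percent change math is nothing fancy. This is only a sketch of the arithmetic, with a function name I made up, and it assumes the earlier year's rate is never zero.

```python
def percent_change(old, new):
    """Percent change from old to new, e.g. from the 2009 rate to the 2010 rate."""
    return (new - old) / old * 100


norfolk = ["Norfolk City Police Dept", "NE", 147.2, 148.5, 181.7]
change = percent_change(norfolk[3], norfolk[4])
print("%.1f percent" % change)
```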
There’s a lot of interest and attention right now around the idea of computers writing stories from data. As newsrooms shrink and business models implode, managers are casting about for anything that can keep the pages filled/updates flowing. Stories about automated news all seem to ask the straw-man question: “Can software replace humans?”
I have two thoughts about this:
1. Software bots will never be able to write the most compelling stories, because telling a story is an inherently human act that requires real humanity to do well. There is no algorithm for humanity.
2. It is trivially simple for a software bot to write mundane, data-based (as opposed to databased) stories that fill a lot of news sections these days. For boring, grind it out, have-to-do-it-but-no-one-wants-to stories, a bored developer can bang out a bot that’ll write a decently nuanced story on that topic using ultra-basic programming logic in a day.
How trivially simple? How ultra basic?
Let’s write a software bot that can write the annual “Crime is up/crime is down” story from the FBI’s Uniform Crime Reports release. It’s a simple story to write and very little changes from year to year, other than a few numbers and a couple of quotes.
To do this, I’m going to use Python and a simple list of data I got from the UCR’s data tool.
First, I need a list of data to iterate over. For this example, I’m actually using a Python list. Most of the time, you’d have an array of objects from a database or a row of data from a csv. It really doesn’t matter, but here’s what I’m starting with.
cities = [["Beatrice Police Dept","NE",433.7,281.4,280.9], ["Bellevue City Police Dept","NE",159.5,125.2,139.6], ["Columbus Police Dept","NE",107.3,69.3,122.1], ["Fremont Police Dept","NE",209.0,130.8,189.4], ["Grand Island Police Dept","NE",417.6,486.1,346.2], ["Hastings Police Dept","NE",188.8,204.1,132.5], ["Kearney Police Dept","NE",219.9,204.8,201.4], ["Lavista Police Dept","NE",58.5,52.0,101.5], ["Lexington Police Dept","NE",255.6,207.1,303.0], ["Lincoln Police Dept","NE",509.6,457.9,486.9], ["Norfolk City Police Dept","NE",147.2,148.5,181.7], ["Omaha Police Dept","NE",605.6,533.4,556.0], ["Papillion Police Dept","NE",61.3,77.9,142.9], ["Scotts Bluff Police Dept","NE",375.0,358.8,232.7], ["South Sioux Police Dept","NE",124.9,134.1,142.3]]
As you can see, it’s a list of cities in Nebraska with the violent crime rates for the three most recent years (in this case, 2008, 2009 and 2010).
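If your data lives in a csv instead, getting it into the same shape takes only a few lines. This is a sketch under assumptions: the header names are invented, and the two sample rows are copied from the list above so the example runs without a file on disk.

```python
import csv
import io

# Two rows from the list above, laid out as they might appear in a csv file.
# The column names in the header are hypothetical.
sample_csv = """agency,state,rate_2008,rate_2009,rate_2010
Beatrice Police Dept,NE,433.7,281.4,280.9
Bellevue City Police Dept,NE,159.5,125.2,139.6
"""


def load_cities(csv_file):
    """Read rows into the same list-of-lists shape used in this post."""
    reader = csv.reader(csv_file)
    next(reader)  # skip the header row
    # csv gives us strings, so convert the three rates to numbers
    return [[row[0], row[1]] + [float(v) for v in row[2:5]] for row in reader]


cities_from_csv = load_cities(io.StringIO(sample_csv))
print(cities_from_csv[0])
```

With a real file, you would pass in an open file handle in place of the io.StringIO object.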
The most common form of story a journalist can write is called the inverted pyramid: most important thing first, second most second, third most third, and so on. So, let’s use that common structure to write our story. So what is the most important thing? I’d say it’s the trend. Is crime up or down? That would make our lead something in the form of “(city) police reported (more/less/same) violent crime in 2010 than 2009, according to federal statistics.”
So, how do we write that in code? Well, like anything, there are a lot of ways to do it. I’m sure companies that do this for a living are using much, much more sophisticated methods, but the job doesn’t require them. Take a look:
#first, import the string library that we'll need later and loop through our list of cities
import string

for city in cities:
    #clean up the city name, which is the first item in each row
    clean_city = city[0].replace(" Police Dept", "")
    #Let's get rid of that City business in a couple of names
    clean_city = clean_city.replace(" City", "")
    #determine the year over year trend: city[4] is 2010, city[3] is 2009
    if city[4] > city[3]:
        direction = "more"
    elif city[4] < city[3]:
        direction = "less"
    else:
        direction = "the same"
    #write the lead
    lead = "%s police reported %s violent crime in 2010 compared to 2009, according to federal statistics." % (clean_city, direction)
    print(lead)
What do you get? Something like this:
Beatrice police reported less violent crime in 2010 compared to 2009, according to federal statistics.
Bellevue police reported more violent crime in 2010 compared to 2009, according to federal statistics.
Columbus police reported more violent crime in 2010 compared to 2009, according to federal statistics.
Award winning? Hardly. Gripping narrative? No way. But, with a national dataset, I just wrote a lead for every city in America. And it would take less than a second to do so.
— Joi Ito, head of MIT’s Media Lab, speaking truths news organizations still haven’t learned.