There’s a lot of interest and attention right now around the idea of computers writing stories from data. As newsrooms shrink and business models implode, managers are casting about for anything that can keep the pages filled and the updates flowing. Stories about automated news all seem to ask the straw-man question: “Can software replace humans?”
I have two thoughts about this:
1. Software bots will never be able to write the most compelling stories, because telling a story is an inherently human act that requires real humanity to do well. There is no algorithm for humanity.
2. It is trivially simple for a software bot to write the mundane, data-based (as opposed to databased) stories that fill a lot of news sections these days. For boring, grind-it-out, have-to-do-it-but-no-one-wants-to stories, a bored developer can bang out a bot in a day that’ll write a decently nuanced story on that topic using ultra-basic programming logic.
How trivially simple? How ultra basic?
Let’s write a software bot that can write the annual “Crime is up/crime is down” story from the FBI’s Uniform Crime Reports release. It’s a simple story to write and very little changes from year to year, other than a few numbers and a couple of quotes.
To do this, I’m going to use Python and a simple list of data I got from the UCR’s data tool.
First, I need a list of data to iterate over. For this example, I’m actually using a Python list. Most of the time, you’d have an array of objects from a database or rows of data from a CSV file. It really doesn’t matter, but here’s what I’m starting with.
cities = [
    ["Beatrice Police Dept","NE",433.7,281.4,280.9],
    ["Bellevue City Police Dept","NE",159.5,125.2,139.6],
    ["Columbus Police Dept","NE",107.3,69.3,122.1],
    ["Fremont Police Dept","NE",209.0,130.8,189.4],
    ["Grand Island Police Dept","NE",417.6,486.1,346.2],
    ["Hastings Police Dept","NE",188.8,204.1,132.5],
    ["Kearney Police Dept","NE",219.9,204.8,201.4],
    ["Lavista Police Dept","NE",58.5,52.0,101.5],
    ["Lexington Police Dept","NE",255.6,207.1,303.0],
    ["Lincoln Police Dept","NE",509.6,457.9,486.9],
    ["Norfolk City Police Dept","NE",147.2,148.5,181.7],
    ["Omaha Police Dept","NE",605.6,533.4,556.0],
    ["Papillion Police Dept","NE",61.3,77.9,142.9],
    ["Scotts Bluff Police Dept","NE",375.0,358.8,232.7],
    ["South Sioux Police Dept","NE",124.9,134.1,142.3],
]
As you can see, it’s a list of cities in Nebraska with the violent crime rates for the three most recent years (in this case, 2008, 2009 and 2010).
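If the numbers arrive as a CSV file rather than a hand-built list, Python’s csv module can produce the same list-of-lists shape. This is only a minimal sketch: the column order (agency, state, then the 2008, 2009 and 2010 rates) is assumed from the list above, and the inline sample text stands in for a real file. It is not the format of the UCR export.

```python
import csv
import io

# Assumed layout: agency, state, 2008 rate, 2009 rate, 2010 rate.
# The inline text stands in for a real file; with one on disk you would
# pass an open file object to csv.reader instead.
sample = io.StringIO(
    "Beatrice Police Dept,NE,433.7,281.4,280.9\n"
    "Bellevue City Police Dept,NE,159.5,125.2,139.6\n"
)

cities = []
for row in csv.reader(sample):
    # csv hands back strings, so convert the three rate columns to floats
    cities.append(row[:2] + [float(value) for value in row[2:]])

print(cities[0])
```

Converting the rates to floats matters here, because comparing them as strings would sort them alphabetically rather than numerically.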
The most common form of story a journalist can write is called the inverted pyramid: most important thing first, second most second, third most third, and so on. So, let’s use that common structure to write our story. What is the most important thing? I’d say it’s the trend. Is crime up or down? That would make our lead something in the form of “(city) police reported (more/less/same) violent crime in 2010 than 2009, according to federal statistics.”
So, how do we write that in code? Well, like anything, there are a lot of ways to do it. I’m sure companies that do this for a living are using much, much more sophisticated methods, but the job doesn’t require them. Take a look:
#first, import the string library that we'll need later and loop through our list of cities
import string

for city in cities:
    #clean up the city name, which is the first item in each row
    clean_city = city[0].replace(" Police Dept", "")
    #Let's get rid of that City business in a couple of names
    clean_city = clean_city.replace(" City", "")
    #determine the year over year trend: city[3] is the 2009 rate, city[4] is the 2010 rate
    if city[4] > city[3]:
        direction = "more"
    elif city[4] < city[3]:
        direction = "less"
    else:
        direction = "the same"
    #write the lead
    lead = "%s police reported %s violent crime in 2010 compared to 2009, according to federal statistics." % (clean_city, direction)
    print(lead)
What do you get? Something like this:
Beatrice police reported less violent crime in 2010 compared to 2009, according to federal statistics.
Bellevue police reported more violent crime in 2010 compared to 2009, according to federal statistics.
Columbus police reported more violent crime in 2010 compared to 2009, according to federal statistics.
Award winning? Hardly. Gripping narrative? No way. But, with a national dataset, I just wrote a lead for every city in America. And it would take less than a second to do so.
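The same logic can be packaged as a function, so it works on any single row no matter where the row came from. This is a sketch rather than the code behind the output above: the name write_lead is made up for illustration, and it assumes each row is laid out as agency, state, 2008 rate, 2009 rate, 2010 rate.

```python
def write_lead(row):
    """Build the lead for one [agency, state, 2008, 2009, 2010] row (assumed layout)."""
    # strip the agency boilerplate to leave just the city name
    name = row[0].replace(" Police Dept", "").replace(" City", "")
    rate_2009, rate_2010 = row[3], row[4]
    # compare the two most recent years to pick the trend word
    if rate_2010 > rate_2009:
        direction = "more"
    elif rate_2010 < rate_2009:
        direction = "less"
    else:
        direction = "the same"
    return ("%s police reported %s violent crime in 2010 compared to 2009, "
            "according to federal statistics." % (name, direction))

print(write_lead(["Beatrice Police Dept", "NE", 433.7, 281.4, 280.9]))
```

Returning the sentence instead of printing it inside a loop also makes it easy to check a handful of rows by hand before running the whole dataset.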