Replacing reporters with robots? I might be trolling a little, but robotics and programmatic bots are going to play a greater role in reporters’ futures than you might think.

Here are John Keefe and me talking about the near field future of sensors and journalism at the Tow Center’s Sensor Weekend. We gave the second keynote of the day.

A collection of code, experiments, diagrams, notes and other details of our experiments with sensors and the Internet of Things at UNL’s CoJMC.

A new way for data journalists to thwart newsroom IT: the Raspberry Pi

One of my old jokes is that newsroom IT puts the No in Innovation, so I’m always on the lookout for ways to get around them. And I’ve been playing around with a good one: The Raspberry Pi.

Unfamiliar with the Pi? The Model B Pi is a $35 computer that’s about the size of a deck of cards. It’s got an ethernet port, and you supply the hard drive in the form of an SD card, the keyboard, mouse and monitor. Now, for $35, you’re not getting a ton of horsepower, but for simple repetitive tasks it works great.

What kind of simple, repetitive tasks? Let’s pretend for a second that you wanted to set up a scraper that dumped data into a database every hour. Ideally, you’d have a server somewhere and you’d set up a task on it — I like using ‘nix’s cron for things like this — and off it would go, mindlessly gathering data for you and putting it into a database. You could then go about your life, stopping by from time to time to get that data and do whatever you’re going to do with it. So you ask newsroom IT for this and, of course, the answer is no. And no we won’t give you the money to run this in the cloud for a few bucks a month either. 

Enter the Pi.

For $35, you can write your scripts, put them in a cron job and off it’ll go, gathering your data for you. No need for a server, no need for a server administrator, no need to make sure your work computer stays on and running the whole time, just some elbow grease to get the script running and an ethernet connection to the internet. 
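For the curious, here is a rough sketch of what such a script could look like. It is not the script running on my Pi: the URL, the database file name, the table layout and the script path in the cron line are all made up for illustration, and it simply stores the raw page in SQLite so the parsing can happen later.

```python
import sqlite3
import time
from urllib.request import urlopen

# Hypothetical values: swap in the page you want and wherever the Pi keeps its data.
SOURCE_URL = "http://example.com/data.html"
DB_PATH = "scrape.db"


def fetch(url):
    """Download one page and return its body as text."""
    with urlopen(url, timeout=30) as response:
        return response.read().decode("utf-8", errors="replace")


def save(db_path, url, body, fetched_at=None):
    """Append one fetched page to a SQLite table, creating the table on first use."""
    if fetched_at is None:
        fetched_at = time.time()
    conn = sqlite3.connect(db_path)
    try:
        conn.execute(
            "CREATE TABLE IF NOT EXISTS pages (fetched_at REAL, url TEXT, body TEXT)"
        )
        conn.execute(
            "INSERT INTO pages (fetched_at, url, body) VALUES (?, ?, ?)",
            (fetched_at, url, body),
        )
        conn.commit()
    finally:
        conn.close()


def run_once(url=SOURCE_URL, db_path=DB_PATH, fetcher=fetch):
    """One scrape: fetch the page, then store it. Cron would trigger this each hour."""
    save(db_path, url, fetcher(url))


# Example crontab entry (the script path is made up) to run at the top of every hour:
# 0 * * * * /usr/bin/python3 /home/pi/scraper.py
```

End the file with a call to run_once() and it becomes the script that cron entry runs. Keeping the raw page means a change in the site's layout never costs you data, only a parser rewrite.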

I’ve had my Pi running a repetitive task for two weeks now and it’s plugging along without issue, having gathered 50,000 records without me having to do anything. In a month, I’ll have a dataset worth analyzing, and it will only ever cost me $35. And I can use it for other things as well. 

A cheap scraper bot. Useful!

Adventures in prototyping

Things we have done today in my office:

  • Stolen two cups of dirt from a construction site.
  • Made a pot of mud with one cup of dirt.
  • Microwaved the second cup of dirt in the faculty lounge.
  • Measured the sensor output of totally dry dirt versus a soaking wet pot of mud.
  • Used the point-slope form of a linear equation to determine the formula for the line between dry and wet. In a journalism school. And it worked.

More about this later, including what it’s all about and code.
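In the meantime, the math in that last bullet can be sketched in a few lines. The two sensor readings below are invented, and treating dry dirt as 0 and mud as 100 on a moisture scale is an assumption made only for the example.

```python
def line_through(x1, y1, x2, y2):
    """Slope and intercept of the line through two points.

    Starts from point-slope form, y - y1 = m * (x - x1), and rearranges
    it to y = m * x + b.
    """
    slope = (y2 - y1) / (x2 - x1)
    intercept = y1 - slope * x1
    return slope, intercept


# Made-up calibration readings, not real measurements.
DRY_READING = 200.0  # sensor output in microwaved, bone-dry dirt
WET_READING = 800.0  # sensor output in the soaking wet pot of mud

# Assumed scale: dry dirt is 0 percent, mud is 100 percent.
SLOPE, INTERCEPT = line_through(DRY_READING, 0.0, WET_READING, 100.0)


def moisture_percent(reading):
    """Convert a raw sensor reading to the assumed 0-to-100 moisture scale."""
    return SLOPE * reading + INTERCEPT


print(moisture_percent(500.0))
```

A reading halfway between the two calibration points lands halfway up the scale, which is all a straight line between dry and wet promises.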

Writing stories with code, part 2: conditional leads from trends

See part one here. Get the code, such as it is, here.

When we last left off, we had a script that would loop through a list of data and write a news lead out of it. All that the script did was look at two numbers and decide if the crime rate went up or down and then wrote an appropriate sentence. Something like this:

Lexington police reported more violent crime in 2010 compared to 2009, according to federal statistics.

But, sometimes, just one year isn’t enough. Sometimes a city gets on a roll, crime goes down for several years in a row, and that’s noteworthy enough to change the lead. So, let’s do that. How?

Well, not to continue a trend here, but it’s really simple.

Here’s the code:

    # determine the duration of the trend

    if city[4] > city[3] > city[2]:
        trend_length_clause = ", the second year in a row crime has increased"
    elif city[4] < city[3] < city[2]:
        trend_length_clause = ", the second year in a row crime has declined"
    else:
        trend_length_clause = ""

What does that say? It’s simple greater than, less than logic. If 2010 (or city[4] in our loop of data) is greater than 2009, which is greater than 2008, then crime has gone up two straight years. If you flip the sign, you get that crime has declined two straight years. So, if those conditions are true, then let’s set a variable called trend_length_clause to some words that work in our lead. Then, add that to our lead-generating code:

    lead = "%s police reported %s violent crime in 2010 compared to 2009%s, according to federal statistics." % (clean_city, direction, trend_length_clause)

Our lead now says insert the city where the first %s is, the direction of the trend where the next %s is and finally jam that clause in. Since we set it to blank if crime didn’t go up or down for two consecutive years, jamming it in there will do nothing if those trends don’t exist. When you run the script, now you see leads like this:

Lincoln police reported more violent crime in 2010 compared to 2009, according to federal statistics.

Norfolk police reported more violent crime in 2010 compared to 2009, the second year in a row crime has increased, according to federal statistics.

That’s a little better. Still not award winning, but at least it varies it up based on the news. 
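If you want to run it without digging up part one, here are the pieces from both posts folded into one function. It's a sketch: the write_lead name and the two-row sample are mine for illustration, and it assumes each row is laid out like the list in part one (name, state, then the 2008, 2009 and 2010 rates).

```python
def write_lead(city):
    """Build the lead for one row: [agency, state, rate_2008, rate_2009, rate_2010]."""
    # clean up the city name
    clean_city = city[0].replace(" Police Dept", "").replace(" City", "")

    # determine the year over year trend
    if city[4] > city[3]:
        direction = "more"
    elif city[4] < city[3]:
        direction = "less"
    else:
        direction = "the same"

    # determine the duration of the trend
    if city[4] > city[3] > city[2]:
        trend_length_clause = ", the second year in a row crime has increased"
    elif city[4] < city[3] < city[2]:
        trend_length_clause = ", the second year in a row crime has declined"
    else:
        trend_length_clause = ""

    return (
        "%s police reported %s violent crime in 2010 compared to 2009%s, "
        "according to federal statistics." % (clean_city, direction, trend_length_clause)
    )


# Two rows copied from the list in part one.
sample = [
    ["Lincoln Police Dept", "NE", 509.6, 457.9, 486.9],
    ["Norfolk City Police Dept", "NE", 147.2, 148.5, 181.7],
]

for city in sample:
    print(write_lead(city))
```

Returning the sentence instead of printing it inside the loop makes it easier to bolt more paragraphs on later.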

Let’s call it a lead for now. We need a second paragraph, one that starts putting some numbers to this trend. So that’s what we’ll do in the next post. We’ll write a second paragraph that does some percent change math and spells out the data a little further.
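As a preview of the math involved, here is one possible shape for it. This is a rough sketch, not the code from the next post: the function names and the wording of the sentence are placeholders, and it assumes the same row layout as the list in part one.

```python
def percent_change(old, new):
    """Percent change from old to new: (new - old) / old * 100."""
    if old == 0:
        raise ValueError("percent change is undefined when the starting value is 0")
    return (new - old) / old * 100.0


def second_paragraph(city):
    """Placeholder second paragraph for a row: [agency, state, 2008, 2009, 2010]."""
    change = percent_change(city[3], city[4])
    if change > 0:
        verb = "rose"
    elif change < 0:
        verb = "fell"
    else:
        return "The violent crime rate held steady at %.1f." % city[4]
    return "The violent crime rate %s %.1f percent, from %.1f in 2009 to %.1f in 2010." % (
        verb, abs(change), city[3], city[4])


print(second_paragraph(["Lexington Police Dept", "NE", 255.6, 207.1, 303.0]))
```

The zero check matters more than it looks: a small town that reported no violent crime one year would otherwise crash the script with a divide-by-zero error.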

Writing stories with code

There’s a lot of interest and attention right now around the idea of computers writing stories from data. As newsrooms shrink and business models implode, managers are casting about for anything that can keep the pages filled/updates flowing. Stories about automated news all seem to ask the straw-man question: “Can software replace humans?”

I have two thoughts about this:

1. Software bots will never be able to write the most compelling stories, because telling a story is an inherently human act that requires real humanity to do well. There is no algorithm for humanity.

2. It is trivially simple for a software bot to write mundane, data-based (as opposed to databased) stories that fill a lot of news sections these days. For boring, grind it out, have-to-do-it-but-no-one-wants-to stories, a bored developer can bang out a bot that’ll write a decently nuanced story on that topic using ultra-basic programming logic in a day.

How trivially simple? How ultra basic?

Let’s write a software bot that can write the annual “Crime is up/crime is down” story from the FBI’s Uniform Crime Reports release. It’s a simple story to write and very little changes from year to year, other than a few numbers and a couple of quotes. 

To do this, I’m going to use Python and a simple list of data I got from the UCR’s data tool.

First, I need a list of data to iterate over. For this example, I’m actually using a Python list. Most of the time, you’d have an array of objects from a database or a row of data from a csv. It really doesn’t matter, but here’s what I’m starting with.

cities = [
    ["Beatrice Police Dept", "NE", 433.7, 281.4, 280.9],
    ["Bellevue City Police Dept", "NE", 159.5, 125.2, 139.6],
    ["Columbus Police Dept", "NE", 107.3, 69.3, 122.1],
    ["Fremont Police Dept", "NE", 209.0, 130.8, 189.4],
    ["Grand Island Police Dept", "NE", 417.6, 486.1, 346.2],
    ["Hastings Police Dept", "NE", 188.8, 204.1, 132.5],
    ["Kearney Police Dept", "NE", 219.9, 204.8, 201.4],
    ["Lavista Police Dept", "NE", 58.5, 52.0, 101.5],
    ["Lexington Police Dept", "NE", 255.6, 207.1, 303.0],
    ["Lincoln Police Dept", "NE", 509.6, 457.9, 486.9],
    ["Norfolk City Police Dept", "NE", 147.2, 148.5, 181.7],
    ["Omaha Police Dept", "NE", 605.6, 533.4, 556.0],
    ["Papillion Police Dept", "NE", 61.3, 77.9, 142.9],
    ["Scotts Bluff Police Dept", "NE", 375.0, 358.8, 232.7],
    ["South Sioux Police Dept", "NE", 124.9, 134.1, 142.3],
]

As you can see, it’s a list of cities in Nebraska with the violent crime rates for the three most recent years (in this case, 2008, 2009 and 2010). 
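If your numbers live in a csv instead, getting them into the same shape takes only a few lines. This is a sketch under assumptions: the column names in the header are invented, and the two sample rows are copied from the list above rather than read from a real file.

```python
import csv
import io

# Stand-in for a real file; the header names are hypothetical.
SAMPLE_CSV = """agency,state,rate_2008,rate_2009,rate_2010
Lincoln Police Dept,NE,509.6,457.9,486.9
Omaha Police Dept,NE,605.6,533.4,556.0
"""


def load_cities(csv_file):
    """Read rows into [name, state, rate, rate, rate] lists, rates as floats."""
    reader = csv.reader(csv_file)
    next(reader)  # skip the header row
    return [
        [row[0], row[1], float(row[2]), float(row[3]), float(row[4])]
        for row in reader
    ]


cities_from_csv = load_cities(io.StringIO(SAMPLE_CSV))
print(cities_from_csv)
```

With a file on disk you would pass open("your_file.csv", newline="") in place of the StringIO object. Converting the rates to floats matters, because comparing the strings "99.5" and "101.5" gives the wrong answer about which is bigger.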

The most common form of story a journalist can write is called the inverted pyramid: most important thing first, second most second, third most third, and so on. So, let’s use that common structure to write our story. What is the most important thing? I’d say it’s the trend. Is crime up or down? That would make our lead something in the form of “(city) police reported (more/less/same) violent crime in 2010 than 2009, according to federal statistics.”

So, how do we write that in code? Well, like anything, there are a lot of ways to do it. I’m sure companies that do this for a living are using much, much more sophisticated methods, but the job doesn’t require them. Take a look:

# first, import the string library that we'll need later and loop through our list of cities
import string

for city in cities:
    # clean up the city name
    clean_city = city[0].replace(" Police Dept", "")

    # let's get rid of that City business in a couple of names
    clean_city = clean_city.replace(" City", "")

    # determine the year over year trend
    if city[4] > city[3]:
        direction = "more"
    elif city[4] < city[3]:
        direction = "less"
    else:
        direction = "the same"

    # write the lead
    lead = "%s police reported %s violent crime in 2010 compared to 2009, according to federal statistics." % (clean_city, direction)

    # print() with parentheses runs under both Python 2 and Python 3
    print(lead)

What do you get? Something like this:

Beatrice police reported less violent crime in 2010 compared to 2009, according to federal statistics.

Bellevue police reported more violent crime in 2010 compared to 2009, according to federal statistics.

Columbus police reported more violent crime in 2010 compared to 2009, according to federal statistics.

Award winning? Hardly. Gripping narrative? No way. But, with a national dataset, I just wrote a lead for every city in America. And it would take less than a second to do so. 

Simple, right?

You might be asking, how hard is it to keep going? Let’s find out in the next post, where we’ll change the lead based on a longer-term trend. If you’re interested in the code, it’s here.

"If you think about Google, Facebook, and Yahoo, the key thing about these new innovations is that they weren’t started in heavily resourced labs but out on the fringes. The reason that happened was because the internet is an open-sourced way of information sharing. It lowered the cost of innovation to nearly zero - just ramen and sweat. It pushed innovation from big innovators to the edges: student start-ups, etc. The whole explosion of the internet was led by small groups of people, which in turn meant that the whole nature of innovation changed as costs went down."

— Joi Ito, head of MIT’s Media Lab, speaking truths news organizations still haven’t learned.