I’m just about done generating the statistics from the first three months of data from this year for our Q1 statistical report. It’s taken me longer than usual, partly because of my other work, but also because there’s a lot of manual work that goes into generating these statistical reports…
So I thought I’d share some of the process that goes beyond the work of gathering all this data in the first place.
First I import the data I gather each day from the news feed into our DB, and at that point it looks like this:
It’s just raw data at this point, so I have to go in and categorize it, note the status of each incident, assign values for the number and type of officers involved, victims, and fatalities, and geocode it as well… at that point it looks like this:
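As a rough sketch of what a categorized record carries at this stage (the field names here are made up for illustration, not the actual DB schema):

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class IncidentRecord:
    # Hypothetical fields; the real database schema may differ.
    report_date: str           # date the news report appeared
    category: str              # e.g. "excessive force"
    status: str                # e.g. "alleged", "charged", "convicted"
    officers_involved: int
    victims: int
    fatalities: int
    city: str
    state: str
    latitude: Optional[float] = None    # filled in by geocoding
    longitude: Optional[float] = None

# One example entry (entirely fictional)
rec = IncidentRecord("2010-03-15", "excessive force", "alleged",
                     2, 1, 0, "Springfield", "IL")
```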
Once that’s done, I merge the data with the older data sets and manually run through the combined set to identify duplicate entries, removing the newer of each pair. I also run through it to identify status updates and tie each update to its original report, which removes that data from the current data set. That way I ensure that the statistics are based only on new reports, not updates.
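The merge-and-dedupe step could be sketched like this (the `key` and `is_update` fields are stand-ins for however a unique incident is actually identified, not the real field names):

```python
def merge_reports(old_reports, new_reports):
    """Merge new reports into the existing set, dropping the newer copy
    of any duplicate and pulling status updates out of the current set."""
    merged = {r["key"]: r for r in old_reports}   # originals win by default
    updates = []
    for r in new_reports:
        if r["key"] in merged:
            if r.get("is_update"):
                # A status update: tie it to the original, and keep it
                # out of the set used for current statistics.
                updates.append((merged[r["key"]], r))
            # Otherwise it's a duplicate; the newer copy is discarded.
        else:
            merged[r["key"]] = r
    return list(merged.values()), updates

old = [{"key": "IL-001", "status": "alleged"}]
new = [{"key": "IL-001", "status": "charged", "is_update": True},
       {"key": "TX-002", "status": "alleged"}]
current, updates = merge_reports(old, new)
# current keeps the original IL-001 plus the new TX-002;
# updates links the status change back to IL-001.
```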
Once that’s done I’m ready to run some analysis such as determining the spread of incident types/status:
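The type/status spread is essentially a frequency count over the cleaned set; a minimal sketch with made-up sample records:

```python
from collections import Counter

# Hypothetical sample of categorized reports
reports = [
    {"category": "excessive force", "status": "alleged"},
    {"category": "excessive force", "status": "charged"},
    {"category": "sexual misconduct", "status": "alleged"},
]

type_spread = Counter(r["category"] for r in reports)
status_spread = Counter(r["status"] for r in reports)

for category, count in type_spread.most_common():
    print(f"{category}: {count} ({100 * count / len(reports):.1f}%)")
```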
Determining the state-wide statistics for the maps and state-by-state rankings:
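One way to sketch the state ranking step: normalize report counts into a per-100,000-resident rate, then sort. All the numbers below are placeholders, not real figures from the data set:

```python
# Hypothetical report counts and state populations
reports_by_state = {"CA": 120, "TX": 90, "IL": 45}
population_by_state = {"CA": 36_961_664, "TX": 24_782_302, "IL": 12_910_409}

def per_100k(count, population):
    return count / population * 100_000

rates = {state: per_100k(reports_by_state[state], population_by_state[state])
         for state in reports_by_state}

# Rank states from highest rate to lowest
ranking = sorted(rates.items(), key=lambda kv: kv[1], reverse=True)
```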
… and determining the localized stats via UCR employment data compared to NPMSRP data:
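The localized comparison boils down to a rate per officer, where the officer counts would come from the UCR police-employment data. A sketch with invented numbers:

```python
def misconduct_rate_per_officer(report_count, officer_count, per=100_000):
    """Reports per `per` officers. Officer counts would come from the UCR
    employment data; the figures used here are made up for illustration."""
    return report_count / officer_count * per

# e.g. a department with 1,200 sworn officers and 15 reports in a quarter
rate = misconduct_rate_per_officer(15, 1200)
# 15 / 1200 * 100000 = 1250.0 reports per 100,000 officers
```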
But there’s still more to do… I also have to run back through the reports to identify the civil suit settlements and judgments for the cost stats, break down the total numbers as well as averages, and run through the trending data:
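The totals and averages for the cost stats are straightforward once the settlement amounts are pulled out of the reports; a sketch with fictional dollar amounts:

```python
from statistics import mean

# Hypothetical settlement/judgment amounts identified in the reports
settlements = [50_000, 125_000, 1_200_000, 30_000]

total = sum(settlements)     # combined cost for the period
average = mean(settlements)  # average per settlement/judgment
```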
…Oh, and the calculations for comparing US crime rates with comparative police misconduct types:
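That comparison comes down to putting both figures on the same per-100,000 footing: crime counts over the general population versus misconduct reports over the officer population. Every number below is a placeholder, not a real statistic:

```python
def per_100k(count, population):
    return count / population * 100_000

# Illustrative figures only
us_population = 307_000_000   # rough 2009 US population estimate
officers = 700_000            # hypothetical sworn-officer count

general_crime_rate = per_100k(1_000_000, us_population)  # hypothetical offense count
police_rate = per_100k(3_500, officers)                  # hypothetical misconduct reports

# A ratio above 1.0 would mean the misconduct rate exceeds
# the comparable general-population crime rate.
ratio = police_rate / general_crime_rate
```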
Then there’s manually creating each map from templates using the data I’ve generated:
…and after all that is done I can finally start to get down to formatting all of that into a semi-coherent report.
I’m not complaining, though. It’s worth the work to get at least an idea of what the state of police misconduct might be in America, and it’s definitely worth it if reading it leads people to reconsider their position on police accountability and transparency and to understand how important these issues are.
I’m always trying to think of new ways to present all this data, and new ways to dissect it to glean useful and interesting information that will get people to look at the issues. But, honestly, I spend so much time in the data that I don’t get many chances to come up to the surface and look at what I produce to see whether it can be made better, and how… let alone figure out how to make it interesting for the people who read the site.
That’s where all of you come in. This work takes so much time and effort that I don’t have any left over to figure out how best to redesign the site so the information is easier to find, or what other data people would want to see. I depend on reader feedback for that and, while I don’t know why, that kind of data is even harder to get than data on police misconduct.
So, please, when you look at the upcoming 2010 Q1 NPMSRP statistical report, which I should be releasing on Sunday (later today now, I guess), let me know what you like about it, what you don’t like about it, and what you think might be missing. Help me make sure all this work doesn’t go to waste.