Wednesday, January 18, 2012

Beyond a Boxscore

One of my favourite blogs is an American baseball blog that attempts to explain the information that can be found beyond a box score.  For all the non-baseball fans, a box score lists what each player did throughout the game, typically with one box for batters and another for pitchers.[1]  I’ll try to avoid boring you with history, but it adds an important factor to this conversation. The boxscore was invented to let people know how their teams were doing while they were on the road. So when the Boston Braves played the Philadelphia Athletics, a reporter would send the information home to the Boston Globe, letting readers back home know how the team had done.  The history loops back in here, because for a while different reporters would [naturally] report different statistics: batting averages, how many hits a batter got, whether he walked and so on.  What this did was create a market for statistics. As the process continued, newspapers began to report the same statistics in a unified format known as a boxscore. If a player was substituted, he would be added under the position that he took over.
Boxscores changed the way we looked at the game: they gave those of us not at the game a way to see how well a player was doing. They allowed us to quantify how much a player was contributing to a win against the opponent.  The blog mentioned above tries to think beyond the basic statistics. Bill James [for all you moneyballers] did the same thing. He decided to take a look at what these statistics meant, and how well they measured a player’s contribution to a win. Lo and behold, he (as did others) found that those statistics did an average job, but not the best job, of describing the outcome of the game.
Here is the problem with soccer (football for those abroad). Soccer doesn’t have a boxscore, and people only care about a few things. They want to know how well their team is doing in the league [as presented by a table], who has the most goals, and how many clean sheets a team/keeper has.  ESPN [by now you know my thoughts on them] decided to create their own version of a boxscore, but with a soccer twist. They gave the following statistics: saves, goals, shots (on goal), time of possession, yellow cards, red cards, free kicks, offsides and fouls. Here is the kicker: not one of those statistics is a useful predictor of the outcome. That is, if you model points (3 for a win, 1 for a draw and 0 for a loss), you find that not one of them predicts the points a team receives. I started to take note of this as I watched Newcastle United play its games. Newcastle United, currently sixth in the table, has been outshot in all but three of its games. [2]
What we need is something that will reflect a basic understanding (leave the advanced stuff for nerds like us). How about a player-by-player boxscore, something that describes turnovers forced, shots allowed, breakaways, defenders drawn, time of possession (per player), minutes played (per player), tackles won, tackles lost and headers?  Which team has allowed the most headers this season?  All of these speak to important ways to play the game. Liverpool (be aware of my bias) could use this information to decide when to play Andy Carroll over Luis Suarez. If a team allows a lot of breakaways, move your faster players up front. If the team plays a conservative D, bring in Carroll and work on set pieces. You would be amazed: these statistics are out there, but few people are recording them.
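For the nerds like us, here is a minimal sketch of what "modelling points" could look like at its very simplest: turn each result into league points (3 for a win, 1 for a draw, 0 for a loss) and check how one boxscore statistic correlates with them. This is not the model behind the claim above. The `points` and `pearson` helpers are names made up for this example, and the five matches are invented placeholder numbers, not real results.

```python
from math import sqrt


def points(goals_for, goals_against):
    """League points for one match: 3 for a win, 1 for a draw, 0 for a loss."""
    if goals_for > goals_against:
        return 3
    if goals_for == goals_against:
        return 1
    return 0


def pearson(xs, ys):
    """Pearson correlation between two equal-length lists of numbers."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


# Invented placeholder matches, NOT real results:
# (goals for, goals against, shots on goal)
matches = [(2, 1, 3), (0, 0, 7), (1, 3, 9), (1, 0, 2), (2, 2, 5)]

match_points = [points(gf, ga) for gf, ga, _ in matches]
shots = [s for _, _, s in matches]

print(match_points)
print(round(pearson(shots, match_points), 2))
```

With a real season of data you would swap in each ESPN statistic in turn. A correlation near zero would be one rough sign that the statistic says little about points.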

[1] Baseball’s sister (or brother, we are gender neutral here) sport, cricket, also has something similar.

[2] Of the three games in which they weren’t outshot, they lost two.

Thursday, January 5, 2012

Weather: It Is an Odd Thing

We look outside, ask Siri or just guess what the weather will be each day. Rarely do we look at it on a year-to-year basis and wonder what the temperature was. I did an interesting analysis for Washington DC over the past year, which is also a unique look at how we look at data.

Year to Year and Averages

As you can see from the graph above, the past two years have been a little above average. We have not had a record-breaking day this year, although we did have one last year.

Last year it rained on 37 out of the 127 days, meaning 29% of the days had rain.
This year it rained on 51 out of the 127 days, meaning 40% of the days had rain.
Here "rain" means rain or snow.
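For anyone who wants to check the arithmetic, the two percentages above are just rainy days divided by total days. A tiny sketch (the function name `rainy_share` is made up for this example):

```python
def rainy_share(rainy_days, total_days):
    """Percentage of days with rain (or snow) in the period."""
    return 100 * rainy_days / total_days


print(round(rainy_share(37, 127)))  # last year -> 29
print(round(rainy_share(51, 127)))  # this year -> 40
```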

Is Weather Data Normal?
The weather data need to be normally distributed for a lot of the analysis we use. If we want to claim that a day was a record high, or that global warming is increasing the temperature, then we need the data to be normal.

As you can see, the weather data for Jan 5 look normal. With a [nerd alert] Shapiro-Wilk p-value of .19, we cannot reject normality, so we can treat the data as normally distributed.

What this allows us to do is see whether we have had any statistically significant days since 1936. We actually have!

High Points
1950 62 Degrees
1997 60 Degrees
2009 58 Degrees

Low Points
1968 18 Degrees
1958 18 Degrees
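
[nerd alert] I have not spelled out how a day counts as "statistically significant", so here is one common way to do it, as a sketch only: if the temperatures are roughly normal, flag any year whose Jan 5 temperature sits more than about two standard deviations (1.96) from the mean. The cutoff, the function name `unusual_years` and the ten temperatures are all assumptions made up for illustration. They are not the real Washington DC series.

```python
from statistics import mean, stdev


def unusual_years(temps_by_year, z_cutoff=1.96):
    """Return {year: z-score} for years far from the mean temperature.

    z_cutoff=1.96 is an assumed threshold (roughly 5% two-sided
    under a normal distribution), not a value taken from the post.
    """
    temps = list(temps_by_year.values())
    mu = mean(temps)
    sigma = stdev(temps)  # sample standard deviation
    flagged = {}
    for year, temp in temps_by_year.items():
        z = (temp - mu) / sigma
        if abs(z) > z_cutoff:
            flagged[year] = round(z, 2)
    return flagged


# Placeholder data: "year" 1 to 10 with invented temperatures.
temps = {1: 40, 2: 42, 3: 41, 4: 39, 5: 43,
         6: 40, 7: 41, 8: 42, 9: 62, 10: 38}

print(unusual_years(temps))
```

In this invented series only the 62 degree day gets flagged. With the full record since 1936, the same check would produce lists like the high and low points above.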

Is the Earth Getting Warmer?
Always a debatable topic, and I'm not really going to take a side on it. I'll just throw some data at you.

In 2011, 110 out of the 127 days were above average in temperature.
In 2010, 100 out of the 127 days were above average in temperature.

However, as we can see above, the two years do not look that different. The p-value is a sad .486, so we conclude that there is no statistically significant difference from the temperatures before.

Now if we split the data into pre-1997 and post-1997, we get something a little more likely to differ. Here is the thing, though: that difference is not statistically significant either, with a p-value of .188. So we conclude that the temperatures in Washington DC have remained normally distributed around their average since 1936, and that extreme temperatures were not common.
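
[nerd alert] If you want to try this kind of comparison yourself, here is a minimal sketch using a permutation test on the difference in means. To be clear, this is a swapped-in technique chosen because it needs nothing beyond the standard library. It is not necessarily the test that produced the .486 and .188 above, and it will not reproduce those numbers. The function name and the two short temperature lists are made up for illustration.

```python
import random


def permutation_p_value(a, b, n_shuffles=10_000, seed=0):
    """Two-sided permutation test for a difference in means.

    Shuffles the pooled values many times and counts how often a
    random split is at least as far apart as the observed groups.
    """
    rng = random.Random(seed)  # fixed seed so reruns give the same answer
    observed = abs(sum(a) / len(a) - sum(b) / len(b))
    pooled = list(a) + list(b)
    hits = 0
    for _ in range(n_shuffles):
        rng.shuffle(pooled)
        left = pooled[:len(a)]
        right = pooled[len(a):]
        diff = abs(sum(left) / len(left) - sum(right) / len(right))
        if diff >= observed:
            hits += 1
    return hits / n_shuffles


# Placeholder temperatures, NOT the real pre-1997 / post-1997 data.
before = [38, 41, 44, 40, 39, 43, 42, 37]
after = [41, 45, 40, 44, 43, 39, 46, 42]

print(permutation_p_value(before, after))
```

A large p-value here means the same thing as in the post: the two groups are not different enough to rule out plain chance.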

Here is what Jan 5th has looked like since 1936