Wednesday, January 11, 2017

Markov chains and evaluating baseball lineups

There was a question during Monday night's meetup of the Burlington Data Scientists about the use of Markov chains, and Brian mentioned how they can be used to evaluate batting lineups.  To go into a little more detail**, one methodology behind this would be to:

1. Create the set of lineups to be compared.
2. For each lineup, simulate the outcomes of N games using that lineup.
3. The lineup that returns the best results (say, the highest winning %) is the lineup to go with.
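
The three steps above can be sketched as a short Python loop. This is only an illustration under stated assumptions: `win_rate`, `best_lineup`, and especially `toy_game` are hypothetical names, and `toy_game` is a made-up stand-in (its 0.55 and 0.45 win probabilities are invented) for a real game simulator.

```python
import random

def win_rate(lineup, simulate_game, n_games, rng):
    """Step 2: simulate n_games with one lineup and return the winning %."""
    wins = sum(simulate_game(lineup, rng) for _ in range(n_games))
    return wins / n_games

def best_lineup(lineups, simulate_game, n_games, seed=0):
    """Steps 1-3: compare every candidate lineup and pick the best one."""
    rng = random.Random(seed)
    rates = {
        name: win_rate(lineup, simulate_game, n_games, rng)
        for name, lineup in lineups.items()
    }
    best = max(rates, key=rates.get)
    return best, rates

def toy_game(lineup, rng):
    """Hypothetical placeholder for a real game simulation.

    Assumption for illustration only: the team wins a bit more often
    when batter "A" leads off. Returns True for a win.
    """
    p_win = 0.55 if lineup[0] == "A" else 0.45
    return rng.random() < p_win

candidates = {
    "A leads off": ["A", "B", "C"],
    "C leads off": ["C", "B", "A"],
}
best, rates = best_lineup(candidates, toy_game, n_games=20000)
print(best, rates)
```

Passing the game simulator in as a function keeps the comparison loop separate from the game model, so the toy placeholder could be swapped for a pitch-by-pitch simulation without changing steps 1 and 3.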

It's in step 2, the simulation of a game, where the Markov chains come into play.  Prior to each pitch in a game of baseball, the game is in a discrete, well-defined state. In any of these states, there is often a wealth of information you can use to estimate the probability of what will happen on the next pitch.  For example:

2a. At the beginning of the game, you can look at the history of what happens when a particular pitcher throws to a particular batter in a 0-0 count with no outs and no one on base in the first inning***.  Based upon that history, you estimate the probability of each possible outcome (ball, strike, hit, out, etc) and simulate the result of the first pitch.
2b. After the simulated first pitch (say it was a strike, 0-1), the game is again in a well-defined state.  Based upon the history of what happens in that state, you can estimate the probability of each possible outcome (1-1, 0-2, hit, out, etc) and simulate the result of the second pitch.
2c. Keep doing this until you have simulated the entire game.  The simulated game is a realization from a Markov chain defined by the lineup in play and the set of estimated probabilities that describe how the game transitions from one state (6th inning, score tied, 1 on first, no outs, 1-1 count) to the next.
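
To make the pitch-by-pitch idea concrete, here is a minimal Python sketch of a Markov chain over the count within a single plate appearance. Everything in it is an assumption for illustration: the function names are hypothetical, the probabilities in `pitch_probs` are invented rather than estimated from any history, foul balls are ignored, and a full game state (inning, score, runners, outs) is left out to keep it short.

```python
import random

TERMINAL = {"walk", "strikeout", "hit", "out"}

def pitch_probs(balls, strikes):
    """Hypothetical outcome probabilities for the next pitch in this count.

    In a real model these would be estimated from the history of what
    happens in this state; the numbers here are made up.
    """
    if strikes == 2:
        return {"ball": 0.35, "strike": 0.25, "hit": 0.12, "out": 0.28}
    return {"ball": 0.38, "strike": 0.42, "hit": 0.07, "out": 0.13}

def next_state(state, rng):
    """One transition: the next state depends only on the current state."""
    balls, strikes = state
    probs = pitch_probs(balls, strikes)
    outcome = rng.choices(list(probs), weights=list(probs.values()))[0]
    if outcome == "ball":
        return "walk" if balls == 3 else (balls + 1, strikes)
    if outcome == "strike":
        return "strikeout" if strikes == 2 else (balls, strikes + 1)
    return outcome  # "hit" or "out" ends the plate appearance

def simulate_plate_appearance(rng):
    """Simulate pitches from a 0-0 count until the plate appearance ends."""
    state = (0, 0)
    path = [state]
    while state not in TERMINAL:
        state = next_state(state, rng)
        path.append(state)
    return path

print(simulate_plate_appearance(random.Random(2017)))
```

Each call returns one realization of the chain: the sequence of counts visited, ending in a walk, strikeout, hit, or out. Simulating a whole game would chain these together with a larger state.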

The same methodology extends to other fields: any complex event that moves between well-defined states, with transition probabilities you can estimate, can be studied by simulation in this way.


** Initially I was going to post this in the comments on the Meetup site, but it runs over the 1000-character limit.

*** Ideally, you have a history of what happens when a particular batter faces a particular pitcher.  When the batter and pitcher haven't faced one another often, or when rookie players don't have much individual history, this can be supplemented by the individual histories of the batter and pitcher and by the overall league history.


Sunday, January 8, 2017

Washington Post rainfall data analysis

WaPo posted a very nice map of 2016 rainfall in the continental U.S. and how it compares to the yearly averages.  This is a great first step, and I hope they follow up, for example by:

  • looking at the series of years to see whether 2016 is an outlier or part of a trend
  • digging down into rainfall by month/season 


h/t FlowingData, via Feedly