Wednesday, January 11, 2017

Markov chains and evaluating baseball lineups

There was a question during Monday night's meetup of the Burlington Data Scientists about the use of Markov chains, and Brian mentioned how they can be used to evaluate batting lineups.  To go into a little more detail**, one methodology behind this would be to:

1. Create the set of lineups to be compared.
2. For each lineup, simulate the outcomes of N games using that lineup.
3. The lineup that returns the best results (say, the highest winning %) is the lineup to go with.

It's in step 2, the simulation of a game, where the Markov chains come into play.  Prior to each pitch in a game of baseball, the game is in a discrete, well-defined state. In any of these states, there is often a wealth of information you can use to estimate the probability of what will happen on the next pitch.  For example:

2a. At the beginning of the game, you can look at the history of what happens when a particular pitcher throws to a particular batter in a 0-0 count with no outs and no one on base in the first inning***.  Based upon that history, you estimate the probability of each possible outcome (ball, strike, hit, out, etc) and simulate the result of the first pitch.
2b. After the simulated first pitch (say it was a strike, 0-1), the game is again in a well-defined state.  Based upon the history of what happens in that state, you can estimate the probability of each possible outcome (1-1, 0-2, hit, out, etc) and simulate the result of the second pitch.
2c. Keep doing this until you have simulated the entire game.  The simulated game is a realization from a Markov chain defined by the lineup in play and the set of estimated probabilities that described how the game transitions from one state (6th inning, score tied, 1 on first, no outs, 1-1 count) to the next.
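
To make step 2 concrete, here is a toy sketch in R of simulating a single plate appearance as a Markov chain.  The states and transition probabilities below are made-up placeholders (deeper counts, walks, etc. are collapsed into the absorbing "hit"/"out" states to keep things small); in a real implementation, each row of the transition matrix would be estimated from pitch-by-pitch history for the specific batter, pitcher, and game situation.

# Toy Markov-chain simulation of one plate appearance.
# All states and probabilities are hypothetical placeholders; in practice each
# row would be estimated from historical pitch-by-pitch data for the particular
# batter/pitcher matchup and game situation.
set.seed(2017)

states <- c("0-0", "1-0", "0-1", "1-1", "hit", "out")
trans <- matrix(c(
  # to:  0-0   1-0   0-1   1-1   hit   out
        0.00, 0.35, 0.40, 0.00, 0.12, 0.13,   # from 0-0
        0.00, 0.00, 0.00, 0.45, 0.25, 0.30,   # from 1-0
        0.00, 0.00, 0.00, 0.40, 0.15, 0.45,   # from 0-1
        0.00, 0.00, 0.00, 0.00, 0.30, 0.70,   # from 1-1 (deeper counts collapsed)
        0.00, 0.00, 0.00, 0.00, 1.00, 0.00,   # hit (absorbing)
        0.00, 0.00, 0.00, 0.00, 0.00, 1.00),  # out (absorbing)
  nrow = 6, byrow = TRUE, dimnames = list(states, states))

simulate_plate_appearance <- function(trans) {
  state <- "0-0"
  while (!state %in% c("hit", "out")) {
    state <- sample(colnames(trans), size = 1, prob = trans[state, ])
  }
  state
}

# Outcome distribution over many simulated plate appearances
table(replicate(10000, simulate_plate_appearance(trans)))

Extending this to a full game means enlarging the state (inning, outs, base runners, score, spot in the lineup) and chaining plate appearances together until nine innings have been simulated; repeating that N times per lineup gives the comparison in step 3.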

This methodology extends readily to other fields for simulation studies of complex events.


** Initially I was going to post this in the comments on the Meetup site, but it runs over the 1000-character limit

*** Ideally, you have a history of what happens when a particular batter faces a particular pitcher.  When the two haven't faced one another often, or for rookies without much individual history, this can be supplemented by the separate histories of the batter and the pitcher, and by the overall league history.


Sunday, January 8, 2017

Washington Post rainfall data analysis

WaPo posted a very nice map of rainfall across the continental U.S. in 2016 and how it compares to yearly averages.  This is a great first step, and I hope they follow up, for example by:

  • looking at the series of years to see whether 2016 is an outlier or part of a trend
  • digging down into rainfall by month/season 


h/t FlowingData, via Feedly

Monday, December 19, 2016

When charts [don't quite] speak for themselves: Supreme Court Politics

More or Less Numbers posted the following interesting chart on "Supreme Court Politics"**.  The blog ends with the note that "the graph speaks for itself," but while the graph is informative and raises a number of interesting questions, it doesn't actually answer them, and thus doesn't really "speak for itself".  Notably:

  1. After the initial rise in the number of abortions in the wake of Roe v. Wade, the number of abortions is fairly flat, with the exception of two enormous dropoffs in 1997-1998 and 2012-2013.  How does one explain this pattern?  I /think/ it's because of legislation increasing access to birth control, but am not sure.  This is a vital follow-up question for understanding this chart.
  2. Are the absolute numbers from 1974 and 2014 directly comparable?  I doubt it.  I'd like to see this chart with the y-axis showing the number of abortions relative to the number of women of childbearing age, the number of pregnancies, etc., to arrive at more comparable abortion rates across the years.
  3. Where do the numbers prior to Roe v. Wade come from?  In the absence of legal abortions, I seriously doubt the validity of abortion numbers from 1970-1972.   It would be great to see more discussion of these numbers.




** via R-Bloggers, via Feedly

Thursday, November 3, 2016

Choosing and setting up a Scala IDE for Spark on Windows

In choosing an IDE: Eclipse or IntelliJ?  I found useful resources more quickly for Eclipse, and I have more experience with it, so let's go that route for now.

To start, I'm following along with the itversity tutorial "Develop Spark apps with Scala IDE" -- the instructions are for Linux, but they were easy to translate to Windows -- with the following notes:

  • Prereqs
  • In the section "Create Spark Scala project using sbt"
    • I created a project folder C:\scala-projects\simple-scala-spark with 
    • The step to run sbt package creates project and target folders and downloads the dependencies defined in build.sbt
    • The step to run sbt eclipse creates the project definitions needed to bring the project into eclipse
    • (note to self: These would need to be run for every new project)
  • In the section "Import into Scala IDE for Eclipse"
    • For the first step, from the Eclipse menus, choose File > Import..., then General > Existing Projects into Workspace.  I set the search directory to C:\scala-projects and it found "simple-spark" just fine.
    • When you create the new Scala object, Eclipse will direct you to the src\main\scala folder as the Source Folder.
  • In the section "Run the Spark Application using Scala IDE"
    • You need to select Scala Application > New configuration in the Run Configurations window.
    • Type SimpleApp as the main class
    • In the Program Arguments, I used local c:\temp\simpleappoutput.
    • (note to self: do not create the \simpleappoutput directory in advance, otherwise SimpleApp chokes)
SUCCESS!  It runs locally.  I also now have access to an AWS account, so tomorrow I'll try to stand up an instance and do the "Run Spark Scala application on Cluster" section.

Wednesday, November 2, 2016

Setting up Spark Standalone on Windows

Doing this today with an eye toward Spark Scala development.  

Even though I'm running Windows 10, I followed the directions at http://nishutayaltech.blogspot.in/2015/04/how-to-run-apache-spark-on-windows7-in.html with success, with the following notes:

And, success!  I can run spark-shell, open the Spark UI, and run the SparkPi example.

Up next: choosing and setting up a Scala IDE 

Thursday, October 13, 2016

Benchmarking data generation using SPSS Statistics

A little bit of #tbt.  The related code is in my GitHub.
It's always useful to be able to test a new statistical algorithm with some easy-to-generate simulated data that you have some control over and can throw away at the end of the day. The trouble with some generators I've seen is that the data are completely random, and therefore useless for testing predictive models. So while working on SPSS, in an eat your own dog food moment, I wrote one that's a little smarter using pure SPSS syntax (no calling R or Python).
The basic idea behind it is that we posit a number of underlying, unobserved factors that all of the observed variables are (generalized) linearly related to. The strength of the relationships is determined by randomly generated regression coefficients. In this way, we can create a random dataset in which, by chance, some of the predictors will be related to the target (and each other), but we don't know in advance which relationships will exist, or how strong they will be (unless we're reusing a random seed).

Usage

You need the two SPSS syntax files:
  • data_creation_macros.sps, which does the actual work. You shouldn't need to touch this one.
  • creating_data.sps, which contains an example of how to use the createData macro defined in the other file. In order to run this syntax:
    • Edit the path in the FILE HANDLE command to point to wherever you've put the code
    • (Optionally) Set your own random seed or comment out that line in order to get a non-replicable dataset
    • (Optionally) Edit the parameter values in the createData call to whatever you want
After running the creating_data.sps syntax, the Data Editor will contain the benchmarking data and can be used immediately.

Extensions / Alternatives

Right now, the relationships between the target and predictors are pretty simple. More generalized relationships could be added.
The structure of the data is also simple: each record is a separate, independent case. Hierarchical structures could be added.
To be more generally useful, this code could also be ported to Python/R.
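For anyone who'd rather stay in R, here is a rough sketch of what such a port might look like.  The function and argument names are invented for illustration; this is not the createData macro's interface, just the latent-factor construction described above.

# A rough R sketch of the latent-factor idea (hypothetical names, not the
# createData macro's interface): unobserved factors drive all observed
# variables through randomly generated coefficients.
create_benchmark_data <- function(n_cases = 1000, n_factors = 3,
                                  n_predictors = 10, noise_sd = 0.5,
                                  seed = NULL) {
  if (!is.null(seed)) set.seed(seed)
  # Unobserved factors, one column per factor
  factors <- matrix(rnorm(n_cases * n_factors), nrow = n_cases)
  # Random loadings relate every observed variable (predictors + target) to the factors
  loadings <- matrix(rnorm(n_factors * (n_predictors + 1)), nrow = n_factors)
  observed <- factors %*% loadings +
    matrix(rnorm(n_cases * (n_predictors + 1), sd = noise_sd), nrow = n_cases)
  out <- as.data.frame(observed)
  names(out) <- c(paste0("x", seq_len(n_predictors)), "target")
  out
}

# By construction, some predictors will be related to the target (and to each
# other), but which ones, and how strongly, depends on the random loadings.
d <- create_benchmark_data(n_cases = 500, n_predictors = 8, seed = 20161013)
round(cor(d)["target", ], 2)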

Monday, October 3, 2016

Poking around StackLite, Part 2: the questions dataframe

This follows up on Part 1. (Blogging lessons relearned: don't post part 1 until you have part 2 ready to go.)

questions

First, let's investigate the questions dataframe. There are 15,806,951 rows and 7 columns in questions; let's take a quick look at what each column represents:
head(questions,10)
## # A tibble: 10 × 7
##       Id        CreationDate          ClosedDate        DeletionDate Score
##    <int>              <dttm>              <dttm>              <dttm> <int>
## 1      1 2008-07-31 21:26:37                <NA> 2011-03-28 00:53:47     1
## 2      4 2008-07-31 21:42:52                <NA>                <NA>   418
## 3      6 2008-07-31 22:08:08                <NA>                <NA>   188
## 4      8 2008-07-31 23:33:19 2013-06-03 04:00:25 2015-02-11 08:26:40    42
## 5      9 2008-07-31 23:40:59                <NA>                <NA>  1306
## 6     11 2008-07-31 23:55:37                <NA>                <NA>  1062
## 7     13 2008-08-01 00:42:38                <NA>                <NA>   425
## 8     14 2008-08-01 00:59:11                <NA>                <NA>   272
## 9     16 2008-08-01 04:59:33                <NA>                <NA>    68
## 10    17 2008-08-01 05:09:55                <NA>                <NA>   103
## # ... with 2 more variables: OwnerUserId <int>, AnswerCount <int>
Next I want some overall summary statistics:
library(dplyr)
# This took forever to run
#summary(questions)
# This ran very fast
questions %>% sample_n(size=100000) %>% summary()
##        Id            CreationDate                
##  Min.   :      19   Min.   :2008-08-01 05:21:22  
##  1st Qu.:11938480   1st Qu.:2012-08-13 16:33:16  
##  Median :21571118   Median :2014-02-05 07:28:47  
##  Mean   :21034252   Mean   :2013-11-15 00:15:04  
##  3rd Qu.:30456193   3rd Qu.:2015-05-26 10:40:04  
##  Max.   :39089497   Max.   :2016-08-22 22:10:52  
##                                                  
##    ClosedDate                   DeletionDate                
##  Min.   :2008-09-04 15:14:49   Min.   :2008-08-16 02:37:10  
##  1st Qu.:2013-03-10 03:16:48   1st Qu.:2013-07-12 03:00:01  
##  Median :2014-04-10 03:05:42   Median :2014-09-15 03:00:01  
##  Mean   :2014-03-19 04:26:11   Mean   :2014-06-18 19:25:58  
##  3rd Qu.:2015-06-27 07:11:11   3rd Qu.:2015-09-08 20:28:56  
##  Max.   :2016-08-23 07:42:06   Max.   :2016-08-23 16:18:22  
##  NA's   :89947                 NA's   :77591                
##      Score           OwnerUserId       AnswerCount      timeToClose     
##  Min.   : -24.000   Min.   :      4   Min.   :-3.000   Min.   :   0.00  
##  1st Qu.:   0.000   1st Qu.: 646565   1st Qu.: 1.000   1st Qu.:   0.03  
##  Median :   0.000   Median :1560697   Median : 1.000   Median :   0.18  
##  Mean   :   1.153   Mean   :2089195   Mean   : 1.443   Mean   :  81.57  
##  3rd Qu.:   1.000   3rd Qu.:3241846   3rd Qu.: 2.000   3rd Qu.:   1.00  
##  Max.   :2440.000   Max.   :6745398   Max.   :32.000   Max.   :2704.81  
##                     NA's   :23319     NA's   :10238    NA's   :89947    
##   timeToDelete       closed         deleted       
##  Min.   :   0.00   Mode :logical   Mode :logical  
##  1st Qu.:   0.91   FALSE:89947     FALSE:77591    
##  Median :  30.78   TRUE :10053     TRUE :22409    
##  Mean   : 147.35   NA's :0         NA's :0        
##  3rd Qu.: 366.21                                  
##  Max.   :2592.85                                  
##  NA's   :77591
Hunh. Normally I would run summary() in order to get summary statistics on all the fields in one place, but it looks like summary() is not performant on nearly 16 million records. It is, however, very fast to grab a sample of 100k records and run summary() on that, which gives a reasonable idea of the summary statistics for the full dataset.
We'll do summaries of the complete set of rows as we look at each column in turn.

CreationDate

library(dplyr)
# This took forever to run
#t(questions %>% summarize_at(vars(CreationDate),funs(min,quantile,max,mean,sd)))
t(questions %>% summarize_at(vars(CreationDate),funs(min,median,max,mean)))
##        [,1]                 
## min    "2008-07-31 21:26:37"
## median "2014-02-11 10:37:03"
## max    "2016-08-22 23:58:42"
## mean   "2013-11-17 01:44:26"
questions %>% 
  filter(!is.na(CreationDate)) %>%
  summarize(n())
## # A tibble: 1 × 1
##      `n()`
##      <int>
## 1 15806951
OIC. summary() is slow because it's computing order statistics (quartiles) – dplyr's summarize() likewise chokes on quantile() with nearly 16 million records. This is a good reminder to be wary when working with larger data: R's standard summary statistics are not optimized for it, and there doesn't seem to be an option for computing estimated, rather than exact, sample quartiles.
So we can get some of the summary statistics quickly and easily, though I can't seem to get count() or n() to work within funs() of summarize(); ideally, I want to compute the number of missing values during the same data pass where I'm calculating the other summary statistics.
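One workaround (a sketch, not something from the original analysis) is to skip n()/count() and instead sum an is.na() indicator alongside the other statistics, which keeps everything in a single pass:
library(dplyr)

# Count missing values in the same summarize() pass as the other statistics
# by summing an is.na() indicator instead of calling n() or count().
questions %>%
  summarize(n_missing = sum(is.na(CreationDate)),
            min       = min(CreationDate, na.rm = TRUE),
            median    = median(CreationDate, na.rm = TRUE),
            max       = max(CreationDate, na.rm = TRUE),
            mean      = mean(CreationDate, na.rm = TRUE))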
More illuminating than the summary statistics is the plot of the number of new questions created per week (from David's original post):
library(ggplot2)
library(lubridate)

questions %>%
  count(Week = round_date(CreationDate, "week")) %>%
  ggplot(aes(Week, n)) +
  geom_line()
plot of chunk questions_per_week
A few things to note:
  • There's a changepoint in the series: the graph shows a steady increase in the number of questions asked from July 2008 to the start of 2014, at which point the series reaches a steady state.
  • There's clear yearly seasonality: the number of new questions drops precipitously in the last week of the year (a rough check of this is sketched below).
  • There's potentially interesting yearly seasonality where the number of new questions is higher in Q1 (as everyone tries to keep their New Year's resolutions to learn more about some new language?), but the visual evidence isn't strong.
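As a rough check on the end-of-year dip (a sketch, not part of the original post), we can average the weekly counts by week of the year; the last week or two of the year should sit well below the rest:
library(dplyr)
library(lubridate)
library(ggplot2)

# Average number of new questions by week of the year; the end-of-year dip
# should show up as a drop in the final week(s).
questions %>%
  mutate(Year = year(CreationDate), WeekOfYear = isoweek(CreationDate)) %>%
  count(Year, WeekOfYear) %>%
  group_by(WeekOfYear) %>%
  summarize(mean_new_questions = mean(n)) %>%
  ggplot(aes(WeekOfYear, mean_new_questions)) + geom_line()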

ClosedDate

In order to get summary statistics for ClosedDate using dplyr, I need to filter out the NAs.
library(dplyr)
# This gave all NAs
t(questions %>% summarize_at(vars(ClosedDate),funs(min,median,max,mean)))
##        [,1]
## min    NA  
## median NA  
## max    NA  
## mean   NA
# This gave what I was looking for.
questions %>% 
  filter(!is.na(ClosedDate)) %>%
  summarize(n())
## # A tibble: 1 × 1
##     `n()`
##     <int>
## 1 1577138
questions %>% 
  filter(!is.na(ClosedDate)) %>%
  summarize_at(vars(ClosedDate),funs(min,median,max,mean)) %>%
  t()
##        [,1]                 
## min    "2008-08-21 13:15:20"
## median "2014-04-21 03:45:03"
## max    "2016-08-23 19:50:21"
## mean   "2014-03-23 11:26:23"
Again, these summary statistics themselves are not particularly interesting, except that the number of closed questions is about 10% of the overall number of questions.
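That 10% figure is quick to confirm directly (a one-liner sketch, not from the original post):
library(dplyr)

# Fraction of all questions that have a ClosedDate
questions %>% summarize(prop_closed = mean(!is.na(ClosedDate)))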
Adapting David's plot of the number of questions created per week, let's look at the number of questions closed per week:
library(ggplot2)
library(lubridate)

questions %>%
  count(Week = round_date(ClosedDate, "week")) %>%
  ggplot(aes(Week, n)) +
  ylim(0,20000) +
  geom_line()
plot of chunk closed_per_week
This starts to get at the question of the rate of question closure over time, because we can see the pattern of question closure in the plot. A few things to note:
  • Like the number of questions asked per week, the number of questions closed per week increases until it reaches a steady state
  • Unlike the number of questions asked per week, the number of questions closed appears to have two changepoints: it increases slowly from July 2008 until the end of 2010, increases more steeply from January 2011 to the end of 2012, and then reaches its steady state
  • There isn't a clear seasonal dip in the number of questions closed at the end of the year
  • There is a serious outlier in early 2014 where nearly 20,000 questions were closed in a single week. Is this an administrative action?
We might also want to examine how long it takes before a question is closed, and how that changes over time.
questions$timeToClose <- as.numeric(difftime(questions$ClosedDate,questions$CreationDate,units="days"))
questions %>% 
  filter(!is.na(timeToClose)) %>%
  summarize_at(vars(timeToClose),funs(min,median,max,mean)) %>%
  t()
##                [,1]
## min    4.629630e-05
## median 1.707060e-01
## max    2.930478e+03
## mean   7.486537e+01
hist(questions$timeToClose)
plot of chunk unnamed-chunk-7
Week <- round_date(questions$CreationDate, "week")
aggClosingTime <- aggregate(questions$timeToClose ~ Week, data=questions, mean)
ggplot(aggClosingTime, aes(aggClosingTime$Week,aggClosingTime$'questions$timeToClose')) + geom_line()
plot of chunk unnamed-chunk-7
Not entirely surprisingly, questions created further in the past have had more time to be closed. However, the time to close questions also has a highly skewed distribution, so the distribution of time to close those early questions could be skewed by a few questions that took a long time to close.
##               [,1]
## min       1.000012
## median   20.906545
## max    2930.477755
## mean    305.454393

DeletionDate

library(dplyr)
questions %>% 
  filter(!is.na(DeletionDate)) %>%
  summarize(n())
## # A tibble: 1 × 1
##     `n()`
##     <int>
## 1 3527214
questions %>% 
  filter(!is.na(DeletionDate)) %>%
  summarize_at(vars(DeletionDate),funs(min,median,max,mean)) %>%
  t()
##        [,1]                 
## min    "2008-08-01 13:18:31"
## median "2014-09-25 19:32:04"
## max    "2016-08-23 19:50:08"
## mean   "2014-06-26 07:32:09"
More than twice as many questions have been deleted as have been closed.
Number of questions deleted per week:
library(ggplot2)
library(lubridate)

questions %>%
  count(Week = round_date(DeletionDate, "week")) %>%
  ggplot(aes(Week, n)) +
  ylim(0,50000) +
  geom_line()
plot of chunk deleted_per_week
Similarly, this starts to get at the question of the rate of question deletion over time. Some things to note:
  • Like the number of questions closed, the number of questions deleted appears to have two changepoints: it increases slowly from July 2008 until the end of 2010, increases more steeply from January 2011 to the end of 2014, and may have reached a steady state since then – though this is difficult to tell, because there hasn't been much time to judge.
  • There are many more “outlying” points in this chart. 2013-2014 is an especially chaotic period.
We might also want to examine how long it takes before a question is deleted, and how that changes over time.
questions$timeToDelete <- as.numeric(difftime(questions$DeletionDate,questions$CreationDate,units="days"))
questions %>% 
  filter(!is.na(timeToDelete)) %>%
  summarize_at(vars(timeToDelete),funs(min,median,max,mean)) %>%
  t()
##                [,1]
## min    1.157407e-05
## median 3.072476e+01
## max    2.888135e+03
## mean   1.459521e+02
hist(questions$timeToDelete)
plot of chunk unnamed-chunk-10
aggDeletionTime <- aggregate(questions$timeToDelete ~ Week, data=questions, mean)
ggplot(aggDeletionTime, aes(aggDeletionTime$Week,aggDeletionTime$'questions$timeToDelete')) + geom_line()
plot of chunk unnamed-chunk-10
As with questions closed, questions created further in the past have had more time to be deleted, with the same concerns about the skewed distribution for time to delete. However, there are a couple of oddities to note:
  • The distribution of time to delete is bimodal. The expected highly-skewed distribution is broken up by a spike around… could that be 365 days? Is there an administrative job that auto-deletes questions after a year? (A quick check of this is sketched below.)
  • There is a sudden dropoff in the plot of time to delete over time. If there is an admin job that auto-deletes questions, then this dropoff is likely due to the fact that the yearly admin job hasn't yet affected questions opened after the dropoff.
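One quick way to probe the 365-day hypothesis (a sketch, not from the original post) is to zoom the time-to-delete histogram into the neighborhood of one year and look for a sharp spike:
library(dplyr)
library(ggplot2)

# Zoom into time-to-delete values near one year; a sharp spike at ~365 days
# would support the idea of a yearly auto-delete job.
questions %>%
  filter(!is.na(timeToDelete), timeToDelete >= 300, timeToDelete <= 430) %>%
  ggplot(aes(timeToDelete)) + geom_histogram(binwidth = 1)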
Cross-tabulating whether questions have been closed (Var1) against whether they have been deleted (Var2):
##    Var1  Var2     Freq
## 1 FALSE FALSE 11740327
## 2  TRUE FALSE   539410
## 3 FALSE  TRUE  2489486
## 4  TRUE  TRUE  1037728

Score

library(dplyr)
questions %>% 
  filter(!is.na(Score)) %>%
  summarize_at(vars(Score),funs(min,median,max,mean)) %>%
  t()
##                [,1]
## min     -154.000000
## median     0.000000
## max    15422.000000
## mean       1.193366
There's a wide range of values for question Score, with the median smack at 0. This is an instance where it would be great to have Q1 and Q3, so that we'd know to expect a cruddy histogram:
ggplot(data=questions, aes(questions$Score)) + geom_histogram()
plot of chunk unnamed-chunk-13
The histogram is dominated by the 0 values; nearly half of the questions asked have a 0 score.
questions %>%
  filter(Score==0) %>%
  dim()
## [1] 7722026      11
However, even when you remove all the 0's, the histogram is still dominated by small values:
questions %>%
  filter(Score != 0) %>%
  ggplot(aes(Score)) + geom_histogram()
Focusing in on the values closest to 0, we get a slightly better view of the distribution of values.
ggplot(data=questions, aes(questions$Score)) + xlim(-10, 10) + geom_histogram()
plot of chunk unnamed-chunk-16
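As for the Q1 and Q3 wished for above, they can be approximated cheaply by reusing the earlier sampling trick rather than computing exact order statistics on all ~16 million rows (a sketch):
library(dplyr)

# Approximate quartiles of Score from a 100k-row sample
questions %>%
  sample_n(size = 100000) %>%
  summarize(q1 = quantile(Score, 0.25), q3 = quantile(Score, 0.75))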

OwnerUserId

Number of unique users who have asked questions:
length(unique(questions$OwnerUserId))
## [1] 2296759
A listing of the users who have asked the most questions, presented in two different ways:
uniqueOwners <- as.data.frame(table(questions$OwnerUserId))
head(uniqueOwners[order(-uniqueOwners$Freq),],20)
##           Var1 Freq
## 370326  875317 2095
## 14338    39677 2084
## 2854      4653 1801
## 13038    34537 1659
## 50971   179736 1507
## 34607   117700 1463
## 182439  470184 1394
## 264143  651174 1359
## 509818 1194415 1357
## 42908   149080 1289
## 4613      8741 1255
## 21004    65387 1229
## 25875    84201 1192
## 113979  325418 1191
## 42278   146780 1146
## 199575  508127 1113
## 326524  784597 1112
## 2846      4639 1109
## 165189  434051 1099
## 35903   122536 1094
uniqueOwners2 <- count(questions, c(questions$OwnerUserId))
head(uniqueOwners2[order(-uniqueOwners2$n),],20)
## # A tibble: 20 × 2
##    `c(questions$OwnerUserId)`       n
##                         <int>   <int>
## 1                          NA 3668649
## 2                      875317    2095
## 3                       39677    2084
## 4                        4653    1801
## 5                       34537    1659
## 6                      179736    1507
## 7                      117700    1463
## 8                      470184    1394
## 9                      651174    1359
## 10                    1194415    1357
## 11                     149080    1289
## 12                       8741    1255
## 13                      65387    1229
## 14                      84201    1192
## 15                     325418    1191
## 16                     146780    1146
## 17                     508127    1113
## 18                     784597    1112
## 19                       4639    1109
## 20                     434051    1099
I prefer the latter because I get the NAs by default. The 3.6M questions for which we don't have a user ID for the asker are significant.

AnswerCount

library(dplyr)
questions %>% 
  filter(!is.na(AnswerCount)) %>%
  summarize_at(vars(AnswerCount),funs(min,median,max,mean)) %>%
  t()
##              [,1]
## min     -5.000000
## median   1.000000
## max    518.000000
## mean     1.444817
How do you get a negative AnswerCount?!
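A quick way to look into this (a sketch, not from the original post) is to tabulate just the negative values:
library(dplyr)

# How many questions have each negative AnswerCount value?
questions %>%
  filter(AnswerCount < 0) %>%
  count(AnswerCount)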
Histogram of number of answers to a question:
ggplot(data=questions, aes(questions$AnswerCount)) + geom_histogram()
plot of chunk unnamed-chunk-20
Like Score, the number of answers is dominated by small counts. Focusing in on the values closest to 0, we get a slightly better view of the distribution of values.
ggplot(data=questions, aes(questions$AnswerCount)) + xlim(-5, 10)  + geom_histogram()
plot of chunk unnamed-chunk-21