questions
First, let's investigate the questions dataframe. There are 15806951 rows and 7 columns in questions; let's take a quick look at what each column represents:
head(questions,10)
## # A tibble: 10 × 7
## Id CreationDate ClosedDate DeletionDate Score
## <int> <dttm> <dttm> <dttm> <int>
## 1 1 2008-07-31 21:26:37 <NA> 2011-03-28 00:53:47 1
## 2 4 2008-07-31 21:42:52 <NA> <NA> 418
## 3 6 2008-07-31 22:08:08 <NA> <NA> 188
## 4 8 2008-07-31 23:33:19 2013-06-03 04:00:25 2015-02-11 08:26:40 42
## 5 9 2008-07-31 23:40:59 <NA> <NA> 1306
## 6 11 2008-07-31 23:55:37 <NA> <NA> 1062
## 7 13 2008-08-01 00:42:38 <NA> <NA> 425
## 8 14 2008-08-01 00:59:11 <NA> <NA> 272
## 9 16 2008-08-01 04:59:33 <NA> <NA> 68
## 10 17 2008-08-01 05:09:55 <NA> <NA> 103
## # ... with 2 more variables: OwnerUserId <int>, AnswerCount <int>
Next I want some overall summary statistics:
library(dplyr)
# This took forever to run
#summary(questions)
# This ran very fast
questions %>% sample_n(size=100000) %>% summary()
## Id CreationDate
## Min. : 19 Min. :2008-08-01 05:21:22
## 1st Qu.:11938480 1st Qu.:2012-08-13 16:33:16
## Median :21571118 Median :2014-02-05 07:28:47
## Mean :21034252 Mean :2013-11-15 00:15:04
## 3rd Qu.:30456193 3rd Qu.:2015-05-26 10:40:04
## Max. :39089497 Max. :2016-08-22 22:10:52
##
## ClosedDate DeletionDate
## Min. :2008-09-04 15:14:49 Min. :2008-08-16 02:37:10
## 1st Qu.:2013-03-10 03:16:48 1st Qu.:2013-07-12 03:00:01
## Median :2014-04-10 03:05:42 Median :2014-09-15 03:00:01
## Mean :2014-03-19 04:26:11 Mean :2014-06-18 19:25:58
## 3rd Qu.:2015-06-27 07:11:11 3rd Qu.:2015-09-08 20:28:56
## Max. :2016-08-23 07:42:06 Max. :2016-08-23 16:18:22
## NA's :89947 NA's :77591
## Score OwnerUserId AnswerCount timeToClose
## Min. : -24.000 Min. : 4 Min. :-3.000 Min. : 0.00
## 1st Qu.: 0.000 1st Qu.: 646565 1st Qu.: 1.000 1st Qu.: 0.03
## Median : 0.000 Median :1560697 Median : 1.000 Median : 0.18
## Mean : 1.153 Mean :2089195 Mean : 1.443 Mean : 81.57
## 3rd Qu.: 1.000 3rd Qu.:3241846 3rd Qu.: 2.000 3rd Qu.: 1.00
## Max. :2440.000 Max. :6745398 Max. :32.000 Max. :2704.81
## NA's :23319 NA's :10238 NA's :89947
## timeToDelete closed deleted
## Min. : 0.00 Mode :logical Mode :logical
## 1st Qu.: 0.91 FALSE:89947 FALSE:77591
## Median : 30.78 TRUE :10053 TRUE :22409
## Mean : 147.35 NA's :0 NA's :0
## 3rd Qu.: 366.21
## Max. :2592.85
## NA's :77591
Hunh. Normally I would run summary() to get summary statistics on all the fields in one place, but it looks like summary() is not performant on ~16M records. It is very fast, though, to grab a sample of 100k records and run summary() on that, which gives a reasonable idea of the summary statistics for the full dataset. We'll do summaries of the complete set of rows as we look at each column in turn.
CreationDate
library(dplyr)
# This took forever to run
#t(questions %>% summarize_at(vars(CreationDate),funs(min,quantile,max,mean,sd)))
t(questions %>% summarize_at(vars(CreationDate),funs(min,median,max,mean)))
## [,1]
## min "2008-07-31 21:26:37"
## median "2014-02-11 10:37:03"
## max "2016-08-22 23:58:42"
## mean "2013-11-17 01:44:26"
questions %>%
filter(!is.na(CreationDate)) %>%
summarize(n())
## # A tibble: 1 × 1
## `n()`
## <int>
## 1 15806951
OIC. summary() is slow because it's computing order statistics (quartiles); dplyr's summarize() likewise chokes on quantile() with ~16M records. This is a good reminder to be wary when working with larger data: R's standard summary statistics are not optimized for it, and there doesn't seem to be an option for computing estimated, rather than exact, sample quartiles. So we can get some of the summary statistics quickly and easily, though I can't seem to get count() or n() to work within funs() of summarize(); ideally, I'd compute the number of missing values in the same pass as the other summary statistics.
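One workaround (a sketch, not from the original analysis): sum(is.na(.)) is an ordinary vectorized expression, so unlike the bare n() it can ride along in the same summarize() pass as the other statistics.
# Count the NAs alongside the other statistics in a single pass
questions %>%
  summarize(min = min(CreationDate, na.rm = TRUE),
            median = median(CreationDate, na.rm = TRUE),
            max = max(CreationDate, na.rm = TRUE),
            mean = mean(CreationDate, na.rm = TRUE),
            missing = sum(is.na(CreationDate)))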
More illuminating than the summary statistics is the plot of the number of new questions created per week (from David's original post):
library(ggplot2)
library(lubridate)
questions %>%
count(Week = round_date(CreationDate, "week")) %>%
ggplot(aes(Week, n)) +
geom_line()
A few things to note:
- There's a changepoint in the series. The graph shows a steady increase in the number of questions asked from July 2008 to the start of 2014, at which point the series reaches a steady state.
- There's clear yearly seasonality: the number of new questions drops precipitously in the last week of the year.
- There's potentially interesting yearly seasonality where the number of new questions is higher in Q1 (as everyone tries to keep their New Year's resolutions to learn more about some new language?), but the visual evidence isn't strong; the sketch after this list tries to make the pattern easier to see.
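One way to look for that pattern (a sketch, not from the original post) is to average the weekly volume by week-of-year over the steady-state period:
# Mean weekly question volume by ISO week-of-year, restricted to the
# steady-state years, to highlight the end-of-year dip and any Q1 bump
questions %>%
  filter(year(CreationDate) >= 2014) %>%
  count(Year = year(CreationDate), WeekOfYear = isoweek(CreationDate)) %>%
  group_by(WeekOfYear) %>%
  summarize(meanQuestions = mean(n)) %>%
  ggplot(aes(WeekOfYear, meanQuestions)) + geom_line()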
ClosedDate
In order to get summary statistics for ClosedDate using dplyr, I need to filter out the NAs.
library(dplyr)
# This gave all NAs
t(questions %>% summarize_at(vars(ClosedDate),funs(min,median,max,mean)))
## [,1]
## min NA
## median NA
## max NA
## mean NA
# This gave what I was looking for.
questions %>%
filter(!is.na(ClosedDate)) %>%
summarize(n())
## # A tibble: 1 × 1
## `n()`
## <int>
## 1 1577138
questions %>%
filter(!is.na(ClosedDate)) %>%
summarize_at(vars(ClosedDate),funs(min,median,max,mean)) %>%
t()
## [,1]
## min "2008-08-21 13:15:20"
## median "2014-04-21 03:45:03"
## max "2016-08-23 19:50:21"
## mean "2014-03-23 11:26:23"
Again, these summary statistics are not particularly interesting in themselves, except that the number of closed questions (1,577,138) is about 10% of the overall number of questions. Adapting David's plot of the number of questions created per week, let's look at the number of questions closed per week:
library(ggplot2)
library(lubridate)
questions %>%
count(Week = round_date(ClosedDate, "week")) %>%
ggplot(aes(Week, n)) +
ylim(0,20000) +
geom_line()
This starts to get at the rate of question closure over time. A few things to note:
- Like the number of questions asked per week, the number of questions closed per week increases until it reaches a steady state.
- Unlike the number of questions asked per week, the number of questions closed appears to have two changepoints: it increases slowly from July 2008 until the end of 2010, the slope steepens from January 2011 through the end of 2012, and it holds steady from then on.
- There isn't a clear seasonal dip in the number of questions closed at the end of the year.
- There is a serious outlier in early 2014 where nearly 20,000 questions were closed in a single week. Is this an administrative action? (The sketch after this list pulls out the top closure weeks.)
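To identify the outlier, a quick sketch (not from the original post) ranking closure weeks by volume:
# The five weeks with the most question closures
questions %>%
  filter(!is.na(ClosedDate)) %>%
  count(Week = round_date(ClosedDate, "week")) %>%
  arrange(desc(n)) %>%
  head(5)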
How long did it take to close the questions that were closed?
# Days from creation to closure (NA for questions that were never closed)
questions$timeToClose <- as.numeric(difftime(questions$ClosedDate,questions$CreationDate,units="days"))
questions %>%
filter(!is.na(timeToClose)) %>%
summarize_at(vars(timeToClose),funs(min,median,max,mean)) %>%
t()
## [,1]
## min 4.629630e-05
## median 1.707060e-01
## max 2.930478e+03
## mean 7.486537e+01
hist(questions$timeToClose)
Week <- round_date(questions$CreationDate, "week")
# Mean time to close, by the week the question was created
aggClosingTime <- aggregate(timeToClose ~ Week, data=questions, mean)
ggplot(aggClosingTime, aes(Week, timeToClose)) + geom_line()
Not entirely surprisingly, questions created further in the past have had more time to be closed. However, time to close has a highly skewed distribution, so the high averages for those early questions could be driven by a few questions that took a very long time to close.
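One hedge against that skew (a sketch, not from the original post) is to plot the weekly median instead of the weekly mean:
# The median is much less sensitive to a few very slow closures than the mean
questions %>%
  filter(!is.na(timeToClose)) %>%
  group_by(Week = round_date(CreationDate, "week")) %>%
  summarize(medianTimeToClose = median(timeToClose)) %>%
  ggplot(aes(Week, medianTimeToClose)) + geom_line()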
For reference, here is a summary of timeToClose from a chunk whose code is not shown (judging by the minimum, it is restricted to questions that took at least a day to close):
## [,1]
## min 1.000012
## median 20.906545
## max 2930.477755
## mean 305.454393
DeletionDate
library(dplyr)
questions %>%
filter(!is.na(DeletionDate)) %>%
summarize(n())
## # A tibble: 1 × 1
## `n()`
## <int>
## 1 3527214
questions %>%
filter(!is.na(DeletionDate)) %>%
summarize_at(vars(DeletionDate),funs(min,median,max,mean)) %>%
t()
## [,1]
## min "2008-08-01 13:18:31"
## median "2014-09-25 19:32:04"
## max "2016-08-23 19:50:08"
## mean "2014-06-26 07:32:09"
More than twice as many questions have been deleted (3,527,214) as have been closed (1,577,138). Number of questions deleted per week:
library(ggplot2)
library(lubridate)
questions %>%
count(Week = round_date(DeletionDate, "week")) %>%
ggplot(aes(Week, n)) +
ylim(0,50000) +
geom_line()
Similarly, this starts to get at the rate of question deletion over time. Some things to note:
- Like the number of questions closed, the number of questions deleted appears to have two changepoints: it increases slowly from July 2008 until the end of 2010, the slope steepens from January 2011 through the end of 2014, and it may have reached a steady state after that – this is difficult to tell, because not much time has passed since then.
- There are many more “outlying” points in this chart; 2013-2014 is an especially chaotic period. (The sketch after this list surfaces the biggest deletion weeks.)
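A quick sketch (not from the original post) to pull out those extreme weeks:
# The five weeks with the most question deletions
questions %>%
  filter(!is.na(DeletionDate)) %>%
  count(Week = round_date(DeletionDate, "week")) %>%
  arrange(desc(n)) %>%
  head(5)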
How long did it take to delete the questions that were deleted?
# Days from creation to deletion (NA for questions that were never deleted)
questions$timeToDelete <- as.numeric(difftime(questions$DeletionDate,questions$CreationDate,units="days"))
questions %>%
filter(!is.na(timeToDelete)) %>%
summarize_at(vars(timeToDelete),funs(min,median,max,mean)) %>%
t()
## [,1]
## min 1.157407e-05
## median 3.072476e+01
## max 2.888135e+03
## mean 1.459521e+02
hist(questions$timeToDelete)
# Mean time to delete, by the week the question was created (reuses Week from above)
aggDeletionTime <- aggregate(timeToDelete ~ Week, data=questions, mean)
ggplot(aggDeletionTime, aes(Week, timeToDelete)) + geom_line()
As with questions closed, questions created further in the past have had more time to be deleted, with the same concerns about the skewed distribution for time to delete. However, there are a couple of oddities to note:
- The distribution of time to delete is bimodal. The expected highly-skewed distribution is broken up by a spike around… could that be 365 days? Is there an administrative job that auto-deletes questions after a year? (The sketch after this list zooms in to check.)
- There is a sudden dropoff in the plot of time to delete over time. If there is an admin job that auto-deletes questions, then this dropoff is likely due to the fact that the yearly admin job hasn't yet affected questions opened after the dropoff.
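To check, a sketch (not from the original post) that zooms the histogram in around one year:
# If an annual cleanup job exists, the second mode should sit near 365 days
questions %>%
  filter(!is.na(timeToDelete), timeToDelete >= 300, timeToDelete <= 430) %>%
  ggplot(aes(timeToDelete)) + geom_histogram(binwidth = 1)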
Cross-tabulating closed status against deleted status (the output below is from a chunk whose code is not shown; the marginal totals match the closed and deleted counts above):
##   closed deleted     Freq
## 1  FALSE   FALSE 11740327
## 2   TRUE   FALSE   539410
## 3  FALSE    TRUE  2489486
## 4   TRUE    TRUE  1037728
About two-thirds of the closed questions (1,037,728 of 1,577,138) were eventually deleted.
Score
library(dplyr)
questions %>%
filter(!is.na(Score)) %>%
summarize_at(vars(Score),funs(min,median,max,mean)) %>%
t()
## [,1]
## min -154.000000
## median 0.000000
## max 15422.000000
## mean 1.193366
There's a wide range of values for question Score, with the median smack at 0. This is an instance where it would be great to have Q1 and Q3, so that we'd know to expect a cruddy histogram (indeed, the 100k sample gave Q1 = 0 and Q3 = 1):
ggplot(questions, aes(Score)) + geom_histogram()
The histogram is dominated by the 0 values; nearly half of the questions asked have a 0 score.
questions %>%
filter(Score==0) %>%
dim()
## [1] 7722026 11
However, even when you remove all the 0's, the histogram is still dominated by small values. (My first attempt errored out because I both piped questions into ggplot() and passed data= explicitly, so the piped dataframe was taken as the mapping.)
questions %>%
  filter(Score != 0) %>%
  ggplot(aes(Score)) + geom_histogram()
Focusing in on the values closest to 0, we get a slightly better view of the distribution of values. (Note that xlim() drops the rows outside the limits before binning; coord_cartesian(xlim = c(-10, 10)) would zoom without discarding them.)
ggplot(questions, aes(Score)) + xlim(-10, 10) + geom_histogram()
OwnerUserId
Number of unique users who have asked questions:
length(unique(questions$OwnerUserId))
## [1] 2296759
A listing of the users who have asked the most questions, presented in two different ways:
uniqueOwners <- as.data.frame(table(questions$OwnerUserId))
head(uniqueOwners[order(-uniqueOwners$Freq),],20)
## Var1 Freq
## 370326 875317 2095
## 14338 39677 2084
## 2854 4653 1801
## 13038 34537 1659
## 50971 179736 1507
## 34607 117700 1463
## 182439 470184 1394
## 264143 651174 1359
## 509818 1194415 1357
## 42908 149080 1289
## 4613 8741 1255
## 21004 65387 1229
## 25875 84201 1192
## 113979 325418 1191
## 42278 146780 1146
## 199575 508127 1113
## 326524 784597 1112
## 2846 4639 1109
## 165189 434051 1099
## 35903 122536 1094
uniqueOwners2 <- count(questions, OwnerUserId)
head(uniqueOwners2[order(-uniqueOwners2$n),],20)
## # A tibble: 20 × 2
## OwnerUserId n
## <int> <int>
## 1 NA 3668649
## 2 875317 2095
## 3 39677 2084
## 4 4653 1801
## 5 34537 1659
## 6 179736 1507
## 7 117700 1463
## 8 470184 1394
## 9 651174 1359
## 10 1194415 1357
## 11 149080 1289
## 12 8741 1255
## 13 65387 1229
## 14 84201 1192
## 15 325418 1191
## 16 146780 1146
## 17 508127 1113
## 18 784597 1112
## 19 4639 1109
## 20 434051 1099
I prefer the latter because I get the NAs by default; the 3.6M questions where we don't have a user ID for the asker are significant.
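As an aside (a sketch, not from the original post), base R's table() can also keep the NAs if asked:
# useNA = "ifany" adds NA as a level instead of silently dropping it
uniqueOwners3 <- as.data.frame(table(questions$OwnerUserId, useNA = "ifany"))
head(uniqueOwners3[order(-uniqueOwners3$Freq),], 3)
AnswerCount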
library(dplyr)
questions %>%
filter(!is.na(AnswerCount)) %>%
summarize_at(vars(AnswerCount),funs(min,median,max,mean)) %>%
t()
## [,1]
## min -5.000000
## median 1.000000
## max 518.000000
## mean 1.444817
How do you get a negative AnswerCount?!
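Before plotting, a quick sketch (not from the original post) to see how many questions are affected and which negative values occur:
# Tally the questions with a negative AnswerCount
questions %>%
  filter(AnswerCount < 0) %>%
  count(AnswerCount)
Histogram of number of answers to a question: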
ggplot(questions, aes(AnswerCount)) + geom_histogram()
Like Score, the number of answers is dominated by small counts. Focusing in on the values closest to 0, we get a slightly better view of the distribution of values.
ggplot(questions, aes(AnswerCount)) + xlim(-5, 10) + geom_histogram()