Wednesday, September 14, 2016

Poking around StackLite, Part 1: Getting started

I saw David Robinson's post on the StackLite dataset of StackOverflow questions and tags as a good opportunity to (1) play with some new data, and (2) mess around with forking a public project on GitHub.

Note: As is often the case, the journey is as important as the destination, so there will be multiple posts, not all of which will contain much in the way of data analysis.

Fork vs. Clone vs. Download and create new repo

First: how to work with this data?  I could download the StackLite data and work with it locally, or clone or fork David's repo.

One of the neat things about StackLite vs. the StackExchange Data Dump is that the StackLite data itself is versioned on GitHub. If you fork or clone the repo, you can pull updates to the data through Git. That's cool, and it opens the possibility of automating the pulls to regenerate any analysis on a regular basis, so I'm not going to just download the data. And since I'm not a StackExchange employee and don't know David personally, I don't expect to have push access to his repo, which means forking the repo and cloning the fork.

(This is all obvious in retrospect, but nearly all my source control experience is with ClearCase, SVN, and Rational Team Concert, so I needed to do some research to be sure of the GitHub community protocols.)
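
For anyone else coming from ClearCase or SVN, here is a rough sketch of the fork-then-clone workflow as two shell functions. The function names, the remote name "upstream", and the default branch name "master" are my own assumptions for illustration; the URLs in the comments are placeholders, not the real repository addresses.

```shell
#!/bin/sh
# Sketch only: setup_fork and sync_fork are hypothetical helper names.

# One-time setup: clone your fork and register the original repo as "upstream".
setup_fork() {
    fork_url=$1      # placeholder, e.g. https://github.com/YOUR_USER/REPO.git
    upstream_url=$2  # placeholder, e.g. https://github.com/ORIGINAL_OWNER/REPO.git
    dest=$3          # local directory to clone into
    git clone "$fork_url" "$dest" &&
    git -C "$dest" remote add upstream "$upstream_url"
}

# Repeatable step: fetch the original repo's changes and merge them locally.
sync_fork() {
    dest=$1
    branch=${2:-master}  # assumed default branch name
    git -C "$dest" fetch upstream &&
    git -C "$dest" checkout "$branch" &&
    git -C "$dest" merge "upstream/$branch"
}
```

After a one-time setup_fork, running sync_fork on a schedule (cron, for example) would be one way to automate the pulls mentioned above. A `git push origin` afterwards would carry the updates to the fork on GitHub.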

Sample code and knitr

The knitr package is one of the many that I have read about, but haven't had much of a chance to play with before.  The basic idea is to single-source your code in, say, an R Markdown file that also contains the writeup of your analysis.  David includes such code in his README.Rmd and calls it from setup-data.R using knitr.  This, in turn, generates the README.md that gets reused in HTML format in his blog post.  So far, so good.
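
To make the single-sourcing idea concrete, here is a minimal sketch of knitting an R Markdown file from the command line. It assumes R and the knitr package are installed. The helper name knit_rmd is made up for this post, and this is not necessarily how setup-data.R makes the call.

```shell
#!/bin/sh
# Sketch only: knit_rmd is a hypothetical helper, not part of StackLite.
knit_rmd() {
    input=${1:-README.Rmd}
    if ! command -v Rscript >/dev/null 2>&1; then
        echo "Rscript not found on PATH" >&2
        return 127
    fi
    # knitr::knit() runs the code chunks in the .Rmd and writes the
    # resulting .md file (README.md here) to the current working directory.
    Rscript -e 'knitr::knit(commandArgs(trailingOnly = TRUE)[1])' "$input"
}
```

Calling knit_rmd README.Rmd from the repo root should regenerate README.md, with the output of each code chunk embedded alongside the prose.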

However, setup-data.R is also used to create the StackLite dataset, and, according to the comments, "cannot be run except by Stack Overflow employees with access to the internal sqlstackr package and the Stack Exchange databases."

I could edit David's README in my fork for my own purposes, but any edits I make would break what David is doing with his README, which would get in the way of contributing any of my analyses back to the original repository. I'm also not 100% sure how such edits would affect pulls from the original repository into my fork. I can probably set up Git to pull only the new data, but that seems ugly.
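
For the record, the "only pull the new data" option might look something like the sketch below. It assumes a remote named "upstream" already points at the original repository. The function name is hypothetical, and the file paths are whatever data files you pass in.

```shell
#!/bin/sh
# Sketch only: pull_paths_from_upstream is a hypothetical helper name.
# Refreshes the named files from upstream without merging anything else.
pull_paths_from_upstream() {
    dest=$1     # local clone of the fork
    branch=$2   # upstream branch to read from
    shift 2     # remaining arguments are the file paths to refresh
    git -C "$dest" fetch upstream &&
    git -C "$dest" checkout "upstream/$branch" -- "$@"
}
```

This overwrites the listed files in the working tree and stages them, leaving every other file (including an edited README) alone. The catch is that the fork's history then diverges from upstream file by file, which is part of why it feels ugly.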

So, my plan for now is to create a new folder within my repo that has its own README and related markdown files and associated R file(s).

Part 2 should actually get to some data analysis.
