Wednesday, September 14, 2016

Poking around StackLite, Part 1: Getting started

I saw David Robinson's post on the StackLite dataset of Stack Overflow questions and tags as a good opportunity to (1) play with some new data, and (2) mess around with forking a public project on GitHub.

Note: As is often the case, the journey is as important as the destination, so there will be multiple posts, not all of which will contain much in the way of data analysis.

Fork vs. Clone vs. Download and create new repo

First: how to work with this data?  I could download the StackLite data and work with it locally, or clone or fork David's repo.

One of the neat things about StackLite vs. the Stack Exchange Data Dump is that the StackLite data itself is versioned on GitHub, so if you fork or clone the repo you can pull updates to the data through Git. That's cool, and it opens up the possibility of automating the pulls to generate regular updates of any analysis, so I'm not going to just download the data.  And since I'm not a Stack Exchange employee and don't know David personally, that means forking the repo and cloning the fork.

(This is all obvious in retrospect, but nearly all my source control experience is with ClearCase, SVN, and Rational Team Concert, so I needed to do some research to make sure I understood the relevant GitHub community protocols.)
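
For my own notes, here's a minimal sketch of that fork-and-track workflow, run from R since that's where any automated re-analysis would live; the fork URL is a placeholder, and the upstream URL and branch name are my assumptions:

```r
# Sketch only: clone my fork of StackLite and add David's repository as an
# "upstream" remote so the versioned data can be refreshed later.
# The fork URL is a placeholder; the upstream URL and branch are assumed.
system("git clone https://github.com/<my-username>/stacklite.git")
setwd("stacklite")
system("git remote add upstream https://github.com/dgrtwo/StackLite.git")

# Later, refreshing the data from David's repo should just be:
system("git pull upstream master")
```

The same commands work fine from a shell; wrapping them in system() just makes them easy to schedule alongside the analysis code.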

Sample code and knitr

The knitr package is one of the many I have read about but haven't had much of a chance to play with before.  The basic idea is to single-source your code in, say, an R Markdown file that also contains the write-up of your analysis.  David includes such code in his README.Rmd and calls it from setup-data.R using knitr.  This, in turn, generates the README.md that gets reused in HTML format in his blog post.  So far, so good.
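
I haven't dug into exactly how setup-data.R invokes it, but the core single-sourcing step presumably looks something like this (a sketch, not David's actual call):

```r
library(knitr)

# Run the R chunks embedded in README.Rmd and write the prose, code,
# and results out as plain Markdown; by default knit() produces
# README.md next to the source file.
knit("README.Rmd")
```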

However, setup-data.R is also used to create the StackLite dataset itself and, according to the comments, "cannot be run except by Stack Overflow employees with access to the internal sqlstackr package and the Stack Exchange databases."

So I could edit David's README in my fork for my own purposes, but any edits I make would break what David is doing with his README, which rules out incorporating any analyses I do back into the original repository.  I'm also not 100% sure how those edits would affect pulls from the original repository to my fork.  I could probably set up Git to pull only the new data, but that seems ugly.
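
If I did go that route, I imagine it would amount to checking out just the data files from the upstream branch rather than merging everything; a sketch, with the file names being my guess at what the repo contains:

```r
# Hypothetical selective update: fetch David's repo and check out only
# the data files, leaving my modified README alone.
system("git fetch upstream")
system("git checkout upstream/master -- questions.csv.gz question_tags.csv.gz")
```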

So, my plan for now is to create a new folder within my repo that has its own README, related markdown files, and associated R file(s).

Part 2 should actually get to some data analysis.

Friday, September 2, 2016

Creating a separate "work" blog

For some years I've maintained a blog that has been mostly personal but sometimes strays into statistics, data analysis, and other topics that could be considered "work".  For a little while, at least, I'm going to try to commit to doing more "work" posts, and I've decided that warrants a separate blog; I'll figure out along the way what to do with posts that straddle the work/personal line.

It also means that I need a name for the new blog.  The personal blog was easy to name, based on a nickname given to me, but that doesn't easily extend to "Hardcover Reutter" for the work blog.

Using "Bayesian Wonderland" is personally appealing, but after 18 years working for SPSS/IBM I've done little to no Bayesian research and am unlikely to be posting much Bayesian material.

Norm Matloff already snagged the perfect "Mad (Data) Scientist", and there are lots of "Confessions of..." out there. 

So for now, it's "Name This Data Science Blog" and I'll eventually come up with something that is both pithy and conveys what the blog is about.