This is very interesting. The author argues that “data carpentry” is “not a single process but a thousand little skills and techniques”. He takes issue with other ways of framing this dimension of data scientists’ work, arguing that they obscure the craft inherent in it. I think this argument has important implications for the rapid expansion of data science courses, and for the risk that speed and modularisation render ‘data carpentry’ peripheral:
The New York Times has an article titled “For Big-Data Scientists, ‘Janitor Work’ Is Key Hurdle to Insights”. Mostly I really like it. The fact that raw data is rarely usable for analysis without significant work is a point I try hard to make with my students. I told them “do not underestimate the difficulty of data preparation”. When they turned in their projects, many of them reported that they had underestimated the difficulty of data preparation. Recognizing this as a hard problem is great.
What I’m less thrilled about is calling this “janitor work”. For one thing, it’s not particularly respectful of custodians, whose work I really appreciate. But it also mis-characterizes what this type of work is about. I’d like to propose a different analogy that I think fits a lot better: data carpentry. (Note: data carpentry seems to already be a thing).
Why is woodworking a better analogy? The article uses a few other terms, like data wrangling (data as unruly beasts to be tamed?) and munging (what is that, anyway?), neither of which means much to me. I also like data curation, but that’s a bit vague too. Data carpentry probably has something to do with wishing I could make things like Carrie Roy, but I should start by saying what I don’t like about the “data cleaning” or “janitor work” terms. To me these imply that there is some kind of pure or clean data buried in a thin layer of non-clean data, and that one need only hose the dataset off to reveal the hard porcelain underneath the muck. In reality, the process is more like deciding how to cut into a piece of material, or how much to plane down a surface. It’s not that there’s any real distinction between good and bad; it’s more that some parts are softer or knottier than others. Judgement is critical.
I’m interested in how rapidly the role of ‘data scientist’ is emerging, in the interests expressed through it, and in how these come together in the institutionalisation of ‘data science’. What implications does the hype surrounding data science have for how data science courses are designed and marketed?