Will Data Scientists Be Replaced by Tools?

The Quick-and-Dirty Summary

I was recently asked to participate in a proposed SXSW panel that will debate the question, “Will Data Scientists Be Replaced by Tools?” This post describes my current thinking on that question as a way of (1) convincing you to go vote for the panel’s inclusion in this year’s SXSW and (2) instigating a larger debate about the future of companies whose business model depends upon Data Science in some way.

The Slow-and-Clean Version

In the last five years, Data Science has emerged as a new discipline, although there are many reasonable people who think that this new discipline is largely a rebranding of existing fields that suffer from a history of poor marketing and weak publicity.

All that said, within the startup world, I see at least three different sorts of Data Science work being done that really do constitute new types of activities for startups to be engaged in:

  1. Data aggregation and redistribution: A group of companies like DataSift have emerged recently whose business model centers on the acquisition of large quantities of data, which they resell in raw or processed form. These companies are essentially the Exxons of the data mining world.
  2. Data analytics toolkit development: Another group of companies like Prior Knowledge have emerged that develop automated tools for data analysis. Often this work involves building usable and scalable implementations of new methods coming out of academic machine learning groups.
  3. In-house data teams: Many current startups and once-upon-a-time startups now employ at least one person whose job title is Data Scientist. These people are charged with extracting value from the data accumulated by startups as a means of increasing the market value of these startups.

I find these three categories particularly helpful here, because it seems to me that the question, “Will Data Scientists Be Replaced by Tools?”, is most interesting when framed as a question about whether the third category of workers will be replaced by the products designed by the second category of workers. I see no sign that the first category of companies will go away anytime soon.

When posed this way, the most plausible answer to the question seems to be: “data scientists will have portions of their job automated, but their work will be much less automated than one might hope. Although we might hope to replace knowledge workers with algorithms, this will not happen as soon as some would like to claim.”

In general, I’m skeptical of sweeping automation of any specific branch of knowledge work because I think the major work done by a data scientist isn’t technological, but sociological: their job is to communicate with the heads of companies and with the broader public about how data can be used to improve businesses. Essentially, data scientists are advocates for better data analysis and for more data-driven decision-making, both of which require constant vigilance to maintain. While the mathematical component of the work done by a data scientist is essential, it is nevertheless irrelevant in the absence of human efforts to sway decision-makers.

To put it another way, many of the problems in our society aren’t failures of science or technology, but failures of human nature. Consider, for example, Atul Gawande’s claim that many people still die each year because doctors don’t wash their hands often enough. Even though Semmelweis showed the world that hygiene is a life-or-death matter in hospitals more than a century ago, we’re still not doing a good enough job maintaining proper hygiene.

Similarly, we can examine the many sloppy uses of basic statistics that can be found in the biological and the social sciences — for example, those common errors that have been recently described by Ioannidis and Simonsohn. Basic statistical methods are already totally automated, but this automation seems to have done little to make the analysis of data more reliable. While programs like SPSS have automated the computational components of statistics, they have done nothing to diminish the need for a person in the loop who understands what is actually being computed and what it actually means about the substantive questions being asked of data.

While we can — and will — develop better tools for data analysis in the coming years, we will not do nearly as much as we hope to obviate the need for sound judgment, domain expertise and hard work. As David Freedman put it, we’re still going to need shoe leather to get useful insights out of data and that will require human intervention for a very long time to come. The data scientist can no more be automated than the CEO.

7 responses to “Will Data Scientists Be Replaced by Tools?”

  1. rav

    I think the field is overhyped in the marketplace. There are a few niche areas such as finance where data science directly generates revenue. But I think in a lot of areas (e.g. various forms of “consulting”) it’s basically ancillary information used to check, justify, and c.y.a. regarding decisions that are decided on for other reasons.

    Regarding category 3: I don’t think the issue is that tools reach a level of generality where they replace people. I think the issue is that in spite of the added value a creative specialist offers, there’s an 80/20 thing that happens once a field has decided that there are some general methods that are ‘good enough’ and easy enough to digest that non-specialists feel comfortable with them (things like regression and contingency table tests come to mind).

    To some extent that’s what’s happened with bioinformatics – for most biologists, they need a few sequence alignment programs, a few generic phylogenetic tree programs and that’s about it. This leads to mediocre (and in some cases, bad) science, but unfortunately incentives in academia (publication count, grants awarded) aren’t always aligned with good science. New disruptions will arise which affect the demand for data scientists (next generation sequencing comes to mind), but on the whole I think this 80/20 effect will dominate institutional behavior.

  2. Robert Young

    “I think the major work done by a data scientist isn’t technological, but sociological: their job is to communicate with the heads of companies and with the broader public about how data can be used to improve businesses.”

    Consider the field that adopted data science early and often: financial services. See what a bang-up job they did once they’d *really* automated; they gave us The Great Recession. Some say that there was systemic breakdown in law and regulation too, but it was almost wholly the result of mortgage companies, then banks, then rating agencies, and finally mortgage insurers chasing each other with ever more fanciful automated data models of the real world, all ignoring fundamental ratios (easily calculated, by the way). To the extent data science was involved, it failed miserably at constraining “heads of companies”. The few who told the truth, at the time, were generally not data scientists, but left wing economists (talk about having two strikes against you before getting to the batter’s box). The real data were clearly visible, but the quants (who may be either a different, or sub-, species of data scientist; depending on one’s understanding) happily kept churning out ever more algorithms.

    I note with amusement, daily, that R-bloggers (from whence I came to this post) offers up mostly amateur stock trading schemes, anonymously! As my psych friends would put it, “not good with the face validity”.

    The poster child for bad automated data science is high frequency trading, with Knight Capital being the most recent example.

    We saw similar devolution of skill application with Lotus 1-2-3 (what Excel was before there was Excel for those youngsters in the audience). Lotus allowed “secretaries” to take over the drudge work of various professionals. What soon happened was that problems were re-defined to fit the spreadsheet data paradigm. I was shocked to find that actuaries, fellows, use Excel for professional analysis. The reports didn’t state the prevalence, but still, come on.

    And so far as being a c.y.a mechanism, I departed econometrics years ago just because the profession had turned itself into lawyering: data advocacy for hire. I often wonder whether data science has already reached that end.

  3. Justin Kamerman

    I don’t think the field is very different from general software development in this respect. If you want a competitive advantage or are developing a new class of application you will hire some top software engineers for the task. If you are looking to do due diligence or c.y.a, you look for OTS software with a low bar with respect to staff prerequisites and training costs.

  4. Francois

    It’s a bit of a chaotic field and people don’t really know what they want (much less what they need). The optimistic view is that, as things stabilize, we’ll get a series of well-defined use-cases and decent tools to deal with them. On the other side, you’ll have marketers (or whoever the user is) who are better trained to use these tools and interpret the results.

    At the same time, there will be a leading edge of new, complex and/or uncommon cases that won’t be covered by these tools and keep remaining data scientists gainfully employed, if in smaller number.

    Or am I naive to think that there are enough compelling scenarios that will lead to the development of genuinely good tools: a limited version of the Deterministic Statistical Machine (http://simplystatistics.org/post/30315018436/a-deterministic-statistical-machine)?

    Or maybe the goalposts will keep on moving too fast for good tools to get established? On the bioinformatics side, I was getting confident that the field had a good grip on microarray analysis and then everyone moved to next-generation sequencing.

  5. Rob Renaud

    I think the answer to this question is so obvious that it is not an interesting question. Talk to an economist.

    http://en.wikipedia.org/wiki/Jevons_paradox

    I assure you, the world is nowhere near the state that data scientists are answering every question thrown at them perfectly and there is a lack of interesting things for them to do, such that if you made them twice as good, you’d have half as many.

    If the tools that data scientists can use improve, the marginal contribution per data scientist will be higher, and you will get more data scientists.

  6. rav

    Just because people “consume more data science” doesn’t mean that more people will be needed to produce it.

  7. Richard Guha

    The discussion about the shortage of Data Scientists reminds me that in the early 1900s people thought that the number of cars on the road would be limited by the supply of trained chauffeurs. Then Henry Ford and others built cars that owners could drive themselves. New tools are going to be available that business owners can use themselves without needing data scientists. Synerscope is one such.