The Philosophy of Data Science

Over the last six months I conducted a series of interviews for the LSE Impact Blog about the philosophical challenges which data science poses for the social sciences. Here’s a list of the interviews:

Here are some of my favourite bits from the interviews:

Emma Uprichard: To many the question of what is ‘social’ about the data is not even a necessary question because there seems to be an assumption in some circles that the ‘social bit’ doesn’t really matter; instead what matters are the data. But in the case of big data, the data usually are the social bit! I cannot emphasise this point enough: most big data is *social* data. Yet the modes of analysis applied to big data tend to mirror approaches that have long been rejected by social scientists. And importantly, they have been rejected not because of the ‘discomfort’ that comes from the idea of modelling human systems using modes of analysis useful to model and think about atoms, fluid dynamics, engine turbulence or social insects etc., although that may be part of it. These kinds of big data methodologies might well be used meaningfully to address certain kinds of questions and ones that we haven’t even asked before. But one of the main reasons these ‘social physics’ approaches tend to be rejected by social scientists is that these methodological approaches don’t give so much as a nod to how important social meaning, context, history, culture, notions of agency or structure might be – and yet these matter enormously to how we use data to study social change and continuity.

Emma Uprichard: It is important to appreciate that much of the methodological drive to do with big data is coming from commercial entities, which quite rightly are interested in targeting their products and services to most of their clients and customers. However, for social policy and planning purposes, especially around issues to do with social divisions of various kinds, we not only need to know what is mostly happening, we also need to examine the ‘odd cases’, the outliers, the different minority trends, and so on.

Noortje Marres: Digital sociology is a relatively recent term; it only really came into use a few years ago, which is odd given the fact that sociologists have long played a central role in advancing our understanding of the computerization of social life. Some people propose to define digital sociology primarily in terms of its methods and techniques, saying that it’s centrally concerned with the new opportunities for researching society opened up by digital data. I think this explanation is misguided, as it often ends up simply making the case for a highly particular set of social research techniques, namely statistical methods of data analysis, which are fairly well-established at that. In my view, the recent interest in digital sociology is best understood in terms of changing relations between social life – as an object of research – and social analysis.

Especially crucial is the proliferation of instruments and practices of social analysis across social life in the form of social media platforms, digital analytics, the internet of things, and so on. In this situation, it is not simply the case that new data and new techniques become available for social research. Rather, it means that social actors, practices and events are increasingly and explicitly oriented towards social analysis and are actively involved in it (in collecting and analysing data, applying metrics, eliciting feed-back, and so on). This raises lots of questions about the ways in which technology participates in the representation and doing of ‘social life’, questions which in the past have been considered more of a specialist interest (for the sociology of technology). So digital sociology is a digitally aware form of social inquiry, one which does not seek to bracket the influence of digital technology in the doing of social life and social research. In terms of its relations to other fields, I think that this growing digital awareness in sociology creates lots of opportunities for collaboration between sociology and related fields like data science. Although paradoxically we might find that data scientists believe in society – in the sense of a universe made up of purely human interactions – much more than many sociologists!

Noortje Marres: In the case of digital sociology, the ‘new’ phenomena – monitored living, trackable sociality, and so on – have been very explicitly associated with what is undeniably a core concern of sociology, namely “the social,” and I think this circumstance carries some promise. Of course, there is the real risk that the type of “sociality” that is studied by digital sociologists, social media scholars, digital anthropologists and so on, will not be recognized as such by other sociologists and social scientists, and seen as not being about ‘the social’ at all, but about something else, like marketing. But I am hoping that the question of what distinguishes different forms and types of sociality, and why these differences matter, is of interest to scholars and researchers across different fields, and so we should seek to formulate this question in the broadest possible terms, while also recognizing that we are all “geeks” to an extent, with particular hang-ups and concerns.

Evelyn Ruppert: We want to change that by providing a platform that doesn’t reduce Big Data to analytics, but extends our understanding of it to all practices, from how data is generated and configured as a part of everyday digital practices to how it is curated, categorised, cleaned, accessed, analysed and acted upon. This is a much richer engagement, what you could call a data social science that analyses these practices and at the same time also engages empirically through experimenting with and innovating methods while reflecting on the consequences for how societies are represented (epistemologies), realised (ontologies) and governed (politics).

Evelyn Ruppert: This indeed will be a challenge and while we encourage interdisciplinarity we also recognize that in many cases we will have content that speaks to specific disciplinary approaches that, as you say, talk past each other. Be that as it may, one aspect of the journal that may facilitate such trans-disciplinary conversations and understanding is that its focus is a particular material thing – Big Data – which is a matter of concern across disciplines and about which there is no clear settlement or definition. We are at a moment of discovery, uncertainty and experimentation and it is at such times that I think there is greater openness to different possibilities and ways of thinking. The extent to which we can demonstrate and facilitate this is to be seen. At the same time I think it is important to be modest about what a journal can achieve!

Rob Kitchin: For me, big data has seven traits — they are:

huge in volume, consisting of terabytes or petabytes of data;

high in velocity, being created in or near real-time;

diverse in variety in type, being structured and unstructured in nature, and often temporally and spatially referenced;

exhaustive in scope, striving to capture entire populations or systems (n=all);

fine-grained in resolution, aiming to be as detailed as possible, and uniquely indexical in identification;

relational in nature, containing common fields that enable the conjoining of different datasets;

flexible, holding the traits of extensionality (can add new fields easily) and scalability (can expand in size rapidly).

Big data then are not simply very large datasets; they have other characteristics. This is why a census dataset does not constitute big data as I define it. True, census data are huge in volume, seek to be exhaustive, and have high resolution (though they are usually aggregated for release), indexicality and relationality. However, a census has very slow velocity, being generated once every ten years, very weak variety, consisting of 30 to 40 structured questions, and no flexibility or scalability (once a census is set and is being administered, it is impossible to tweak or add/remove questions). Another traditionally large dataset, the national household survey, is more timely, usually administered quarterly, but at the sacrifice of exhaustivity (it is sampled), and it likewise lacks variety and flexibility.

Rob Kitchin: There are inherent philosophies of data science, whether data scientists recognize this or not. Even if a scientist claims to have no philosophical position, she is expressing a conceptual position about how she makes sense of the world. On questioning, their position with respect to epistemology, ontology, ideology and methodology can be teased out (though it might be slightly confused and not well thought through). Philosophy is important because it provides the intellectual framework that shapes and justifies what kinds of questions are asked, how they are asked, how the answers are made sense of, and what one does with the resulting knowledge. Avoiding it weakens the intellectual rigour of a project and widens the scope of potential critique. Quite often scientists avoid the difficult work of thinking through their philosophical position by simply accepting the tenets of a dominant paradigm, or by operating merely at the level of methodology. Generally, this consists of a claim to using the ‘scientific method’ which tries to position itself as a commonsensical, logical, and objective way to approach understanding the world that is largely beyond question.

As I’ve already discussed, the philosophy of science is not fixed and does change over time with new ideas about how to approach framing and answering questions. This is clearly happening with debates concerning how big data and new forms of data analytics are altering, and can alter, the scientific method, and also debates over the approach of the digital humanities and computational social sciences. And even if data scientists do not want to engage in such debates, their work remains nonetheless open to philosophical critique. In my view, the intellectual rigour of data science would be significantly improved by working through its philosophical underpinnings and engaging in debate that would strengthen its position through evolution in thought and practice and which rebuffs and challenges critique. Anything less demonstrates a profound ignorance of the intellectual foundations in which science is rooted. So, yes, philosophy does and should matter to data science.

Deborah Lupton: We seem unable to talk about the big data phenomenon without using metaphors that are often drawn from nature. The most common metaphorical system that is used employs liquidity metaphors, referring to ‘drowning’ in big data, data ‘flows’ and ‘floods’ or the big data ‘tsunami’. I argue that these metaphors suggest both the sense that big data are constantly mobile and circulating from one site to another, and that they are overwhelming in their magnitude.

Metaphors drawn from nature have been common in representing digital technologies since their emergence, as Sue Thomas points out in her book Technobiophilia. They are a way of seeking to domesticate and familiarise technologies that may appear threatening in their novelty and strangeness, and work to incorporate technologies into a pre-existing world view. Conceptualising digital data as liquid flows helps us to conceptualise and make sense of the phenomenon. But nature is not always benign, as the metaphors of ‘big data floods’ and ‘tsunamis’ suggest. Here again the meaning that there is something threatening about big data emerges.

Deborah Lupton: It may well be the case that established approaches in the social sciences are represented as focusing only on ‘small data’. But I do not see this nomination as a threat, but rather as an opportunity. For a new project I have been investigating portrayals of ‘small’ as compared to ‘big’ data. What is interesting is the value that ‘small data’ are beginning to be invested with in many forums. Increasingly we are seeing discussions in the marketing and computer science literature concerning how small data provide meaning to big data and may be more valid. Small data are represented as offering an important alternative to big data because they are viewed as more insightful and detailed and also as more manageable in their size. This relates back to the notion that big data are threatening or challenging because of their volume.

Here again, as critical and reflexive social researchers we can investigate the social and cultural aspects of portrayals of small data in the popular media or technical literature as they are juxtaposed against big data. As we have always done, we can highlight the ways in which small data are socially constructed. We can also continue to generate and use small data for our own research purposes, as we have been doing for a long time. This is where the opportunity to distinguish ourselves as offering a unique perspective on both small and big data emerges.

Susan Halford: Web Science began at the University of Southampton as a label to describe an interdisciplinary space for researchers to explore the evolution of the web as an inextricably sociotechnical set of practices of rapidly increasing significance in shaping the nature of the world around us. The original impetus came from the computational sciences, particularly from those who had been centrally involved in developing the technical architectures of the web in its early years and had come to appreciate that the web was evolving in all kinds of unexpected ways – at once exciting and troublesome – that computer science alone could not address. By drawing in expertise from the social sciences, humanities, medicine & health sciences, law, business and so on, the experiment was to see if and how new forms of critical engagement and analysis might emerge beyond our familiar disciplinary repertoires.

Of course it is perfectly possible to do a sociology of the web, or use the web for sociological research, similarly for anthropology or the digital humanities, but the point of web science is to explore the far wider landscape of questions that emerge from the web as our focus of study – rather than beginning with disciplines and the questions that they might ask about the web. This in itself demands new ways of working & new forms of communication. It took us a while, for instance, to realise that we meant really rather different things by the term ‘ontology’; and disciplinary certainties are continually shattered, for example as joint projects make it apparent to computer scientists that technical capacity might be the least important thing in shaping web use and growth; whilst social scientists begin to see the limits of their knowledge about the web as – at least in part – a technical system; and the limits of their methods in apprehending the web and web data.

Susan Halford: The main obstacle is obvious: most sociologists don’t know very much about the infrastructures of the web even in theory, let alone have the skills to get involved in developing web standards and protocols or building web applications. My good colleague Mark Weal (a computer scientist) put it to me, very politely, as follows: sociologists criticise the natural sciences for simplifying the social, failing to capture its diversity and complexity – pouring scorn, for instance, on social media analyses that count the number of ‘happy’ or ‘sad’ words in tweets and conclude on the happiness of a nation; but at the same time we – sociologists – have strong tendencies to ‘black box’ the technical, the web for instance, as if treating it all as socially constructed means that no understanding of how it works is necessary.

Of course, you’re right, as well as our disciplinary orientations there is also a question of training. We do not train social scientists in the kinds of techniques and methods that are necessary to engage in depth with the web and, to be honest, I don’t think it is viable to do this to the extent that would be necessary to take us to the level of a computer scientist trained for 8, 10 years or more. It is of course worth starting some basic training – we do this on our MSc in Web Science, where the programme that we designed, led in particular by Catherine Pope and Les Carr at Southampton, ensures that all students – regardless of background – take modules in semantic web technologies and hypertext (more challenging for the theologists and sociologists) as well as in social theory (which challenges the computer scientists and mathematicians). This produces informed researchers able to engage across disciplines, which is important, and develop interdisciplinary PhD projects. But ultimately we still need to draw on deep expertise that has, to date, been grounded in particular disciplines, as I argued above.

Sabina Leonelli: The current manifestations of data-centric science have distinctive features that relate to the technologies, institutions and governance structures of the contemporary scientific world. For instance, this approach is typically associated with the emergence of large-scale, multi-national networks of scientists; with a strong emphasis on the importance of sharing data and regarding them as valuable research outputs in and of themselves, regardless of whether or not they have yet been used as evidence for a given discovery; with the institutionalization of procedures and norms for data dissemination through the Open Science and Open Data movements, and policies such as those recently adopted by RCUK and key research funders such as the European Research Council, the Wellcome Trust and the Gates Foundation; and with the development of instruments, building on digital technologies and web services, that facilitate the production and dissemination of data with a speed and geographical reach as yet unseen in the history of science. In my work, I stress how this peculiar conjuncture of institutional, socio-political, economic and technological developments has made data-centric science into a prominent research approach, which has considerably increased international debate and active reflection over processes of data production, dissemination and interpretation within science and beyond. This level of reflexivity over data practices is what I regard as the most novel and interesting aspect of contemporary data-centrism.

Sabina Leonelli: The epistemological aspect that interests me most, however, is even more fundamental. Given the central role of data in making scientific research into a distinctive, legitimate and non-dogmatic source of knowledge, I view the study of data-intensive science as offering the opportunity to raise foundational questions about the nature of knowledge and knowledge-making activities and interventions. Scientific research is often presented as the most systematic set of efforts in the contemporary world aimed at critically exploring and debating what constitutes acceptable and sufficient evidence for any given belief about reality. The very term ‘data’ comes from the Latin for ‘givens’, and indeed data are meant to document as faithfully and objectively as possible whatever entities or processes are being investigated. And yet, data collection is always steeped in a specific way of understanding the world and constrained by given material and social conditions, and the resulting data are therefore marked by the historical circumstances through which they were generated: what constitutes trustworthy or sufficient data changes across time and space, making it impossible to ever assemble a complete and intrinsically reliable dataset. Furthermore, data are valued and used for a variety of reasons within research, including as sources of evidence, tokens of exchange and personal identity, signifiers of status and markers of intellectual property; and myriads of data types are produced by as many stakeholders, from citizens to industry and governmental agencies, which means that what constitutes data, for whom and for which purposes is constantly at stake.

This landscape makes the study of data into an excellent entry point to reflect on the activities and claims associated with the idea of scientific knowledge, and the implications of existing conceptualisations of various forms of knowledge production and use. This is nicely exemplified by an ongoing Leverhulme Trust Research Grant on the digital divide in data handling practices across developed and developing countries, particularly sub-Saharan Africa, which we are currently developing at Exeter – what constitutes knowledge, and a ‘scientific contribution’, varies enormously depending not only on access to data, but also on what is regarded as relevant data in the first place, and what capabilities any research group has to develop, structure and disseminate their ideas.
