The Gradual Disappearance of Twitter

iNews recently reported that Twitter is going to start charging academic researchers and institutions $42,000/month if they want to maintain their current level of expansive access to data, and – more significantly – will require that they delete all Twitter data from their archives if they do not pay. I’d heard rumors a few weeks ago that this might be happening, but the iNews article is the first independent reporting that I’ve seen about it.

It is hard to overstate the extent to which Twitter has become central to research focused on online communities, natural language processing, computational sociology, and related areas. Unlike most social media platforms, Twitter has long provided easy ways for academic researchers to gather large numbers of tweets matching certain search criteria, facilitating studies into, for example, the backlash effects of authoritarian crackdowns, or the long-term consequences of student lifestyle on academic performance. Separate research on ethics and community perceptions has shown that most Twitter users are unaware their data could be used in this way (and are often confused about why researchers would be interested in their tweets), but the terms of service were clear.

The above change in pricing applies specifically to the decahose – a particular access point that provided access to around 10% of all tweets. Past work has shown that this was not truly a “random” sample, but for most purposes it’s not critical, especially given that Twitter users themselves are not a random sample of the public. Rather, the decahose provided an unbelievably rich archive of data combining text, links, and metadata, which a small number of institutions would archive and mine.

Again, it’s hard to overstate the scale of this sort of enterprise. Thousands of tweets are sent every second, so the data accumulates very quickly. While I was in graduate school, I knew at least one researcher who had access to the decahose, but eventually decided to give up on archiving it due to the scale involved. Most famously, the Library of Congress announced a plan to archive all of Twitter, but eventually scaled this back to acquiring only a limited set of tweets.
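To get a rough feel for the numbers, here is a back-of-the-envelope sketch. The only inputs taken from the discussion above are "thousands of tweets per second" and the roughly 10% sample; the specific rate of 5,000 tweets per second and the 2,000 bytes per stored tweet (text plus metadata) are purely illustrative assumptions, not measured figures.

```python
SECONDS_PER_DAY = 24 * 60 * 60


def decahose_daily_volume(tweets_per_second, sample_one_in=10, bytes_per_tweet=2_000):
    """Estimate the daily tweet count and storage for a 1-in-N sample.

    All three parameters are assumptions chosen for illustration.
    """
    tweets_per_day = tweets_per_second * SECONDS_PER_DAY // sample_one_in
    gigabytes_per_day = tweets_per_day * bytes_per_tweet / 1e9
    return tweets_per_day, gigabytes_per_day


# Hypothetical rate: 5,000 tweets per second across all of Twitter.
tweets, gigabytes = decahose_daily_volume(5_000)
print(f"{tweets:,} tweets/day, about {gigabytes:.1f} GB/day")
```

Under these assumed inputs, a 10% sample works out to tens of millions of tweets and tens of gigabytes per day, every day, which gives some sense of why a single researcher might give up on archiving it.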

Twitter is obviously a specialized domain. Not only does it over-represent certain demographics (including along dimensions such as age, gender, profession, etc.), the affordances of the platform (such as limits on the length of messages, and the adoption of hashtags) shape the ways in which people communicate. Nevertheless, it is a rich domain for studying the nature of community and communication.

After Twitter, the most commonly used source of social media data is almost certainly Reddit, which has also long had a very accessible API. In many ways, Reddit would seem even better suited to many research purposes – many of the comments are longer form, and exist within pre-defined communities. Nevertheless, there was something about Twitter being a kind of living pulse of society that attracted people to using it. Perhaps not coincidentally, Reddit has also recently decided to start charging for access to its data for most use cases, in this case seemingly in response to the use of Reddit data in training large language models.[1]

Despite its popularity as a data source, the openness of Twitter was only ever partial. It was very easy to collect data, but the terms of service specified that only Tweet IDs could be shared, not the actual tweets themselves. In principle, it was easy to “rehydrate” datasets, by scraping the text and metadata that matched a set of tweet IDs. However, it is very common for tweets to get deleted (often by users themselves), meaning that researchers would be unable to perfectly replicate the work of others. In addition, secondary aspects of tweets, such as likes and retweets, are constantly changing, so re-collecting the same set of tweets would produce a slightly different dataset. In practice, of course, it has been common for researchers to privately share datasets of full tweet text and metadata, to enable other researchers to build on their work, but this is hardly ideal for open science.
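The rehydration problem can be illustrated with a small sketch. This is a toy model under stated assumptions, not real Twitter API code: `rehydrate` is a hypothetical helper, and `fake_fetch` stands in for whatever lookup a researcher would actually use, returning `None` for a tweet that has since been deleted.

```python
from typing import Callable, Dict, Iterable, List, Optional, Tuple

Tweet = Dict[str, object]


def rehydrate(
    tweet_ids: Iterable[str],
    fetch: Callable[[str], Optional[Tweet]],
) -> Tuple[Dict[str, Tweet], List[str]]:
    """Look up each shared tweet ID and report which ones can no longer be recovered."""
    recovered: Dict[str, Tweet] = {}
    missing: List[str] = []
    for tweet_id in tweet_ids:
        tweet = fetch(tweet_id)
        if tweet is None:
            missing.append(tweet_id)  # deleted or otherwise unavailable
        else:
            recovered[tweet_id] = tweet
    return recovered, missing


# Hypothetical stand-in for the platform: tweet "102" has been deleted,
# and the like counts reflect today's values, not those at collection time.
_live_tweets = {
    "101": {"text": "hello", "likes": 3},
    "103": {"text": "world", "likes": 7},
}


def fake_fetch(tweet_id: str) -> Optional[Tweet]:
    return _live_tweets.get(tweet_id)


recovered, missing = rehydrate(["101", "102", "103"], fake_fetch)
print(f"recovered {len(recovered)} of 3 tweets; missing IDs: {missing}")
```

Anyone rehydrating the same ID list later ends up with a subset of the original dataset, and with whatever like and retweet counts happen to be current at that moment, which is exactly why exact replication is out of reach.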

In other words, Twitter was always a rather uncertain archive. It seems that no one really thought that Twitter would go so far as to actually request that decahose archives be deleted – even though that was apparently clearly written into the terms of service as a possibility. Even the idea that data you have on your own computer might in some sense not be yours seems somewhat counterintuitive, but of course it has a long precedent, with Amazon remotely deleting people’s copies of 1984 being a kind of canonical example.

I doubt many people will see it this way, but there is a sense in which the death of Twitter for research could perhaps be a blessing in disguise. Although it has been enormously influential, one could argue that some research communities have become overly dependent on Twitter, using it simply because it is so easy to access, and not necessarily because it is the best data source for answering a particular research question.

At the same time, the loss of access to these records is clearly a huge blow to efforts to create an enduring archive of social media use over the past two decades, something that will surely be of great interest to the future. Although people have long debated how historians will possibly be able to make use of such massive archives, one could take comfort in the fact that they existed. Much like how archaeologists will leave parts of an archaeological site untouched for future generations, it is sometimes worth archiving information, even if we don’t yet know how it will be used.

To be clear, the data in question is not being destroyed or eliminated, and there is little indication that Twitter itself is about to disappear. In principle, a well-funded future organization could still request access and get data from Twitter itself. But that sort of guarantee only exists as long as Twitter does. There have been far too many examples from the past of seemingly permanent archives disappearing quickly and dramatically. Even for something as ephemeral and seemingly unimportant as tweets, it is worth taking seriously the idea that we should proactively invest in preservation, as there is always the possibility that we may not get another chance.

  1. In a fascinating twist, some Reddit communities have decided to make their subreddits private in an act of protest against this policy change.