ChatGPT Dominance


I expect that almost anyone reading this will have heard of ChatGPT by now. Released about a month ago, ChatGPT is a system developed by OpenAI which provides text responses to text input. Although details are scarce, under the hood ChatGPT is basically a large language model, trained with some additional tricks (see Yoav Goldberg's write-up for a good summary). In other words, it is a model which maps from the text input (treated as a sequence of tokens) to a distribution over possible next tokens, and generates text by repeatedly calling this function and sampling tokens from the predicted distributions.
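To make the mechanics concrete, here is a minimal sketch of that next-token sampling loop. Since ChatGPT itself is not available as a model, this uses GPT-2 (a small, openly available predecessor) as a stand-in; the decoding details here are illustrative assumptions, not a description of what OpenAI actually does.

```python
# A minimal autoregressive sampling loop, using GPT-2 as a stand-in
# for the (unavailable) ChatGPT model. Illustrative only.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")
model.eval()

def generate(prompt: str, max_new_tokens: int = 40) -> str:
    # Treat the input text as a sequence of token ids.
    input_ids = tokenizer.encode(prompt, return_tensors="pt")
    for _ in range(max_new_tokens):
        with torch.no_grad():
            logits = model(input_ids).logits  # (1, seq_len, vocab_size)
        # The model maps the sequence so far to a distribution over next tokens.
        next_token_probs = torch.softmax(logits[0, -1], dim=-1)
        # Sample one token from that distribution and append it to the sequence.
        next_token = torch.multinomial(next_token_probs, num_samples=1)
        input_ids = torch.cat([input_ids, next_token.view(1, 1)], dim=1)
    return tokenizer.decode(input_ids[0])

print(generate("The release of ChatGPT"))
```

Systems like ChatGPT layer further decoding refinements (temperature, truncated sampling, and so on) on top of this basic loop, but repeated next-token prediction is the core of the mechanism.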

Although it is unclear just how different ChatGPT is from something like InstructGPT, which it is explicitly based on, the effect of OpenAI releasing this system has been profound. The week that it was released, it seemed like every single conversation I had, both in person and online, eventually came around to the topic of ChatGPT. Although some news organizations were slightly slow to cover it initially, it now seems to be everywhere.

To some extent, this is understandable. Not only do I work in NLP research, where language models generally and GPT-3 in particular have become central to the majority of research taking place, but I also work in education, where the potential complications that arise from such a system are fairly obvious in their importance, if not yet in their consequences. The possibility of using tools like large language models to write coherent text is clearly something that will be consequential, and it is hard to know exactly where this technology will take us. It is attention-grabbing, and possible futures are both easy and fun to speculate on.

At the same time, it was a useful wake-up call to me as to the state of knowledge about these things more generally. Again, I was shocked by how much ChatGPT dominated every single conversation I had that week. For some, the concern was what this would mean for teaching. Others were eager to experiment with it to see what it could do for them. For yet others, this was their first real exposure to what these kinds of systems are capable of.

That last reaction seems to be the most important. For anyone who had been paying attention (which, to be fair, would have been hard to do well without some amount of expert knowledge), developments along these lines have gradually been gathering steam over the past few years. Most of the shock that is now being felt in response to ChatGPT was previously experienced by the research community with the release of GPT-3, or perhaps earlier.

The fact that the public has now become more aware of the state of things is clearly a good thing, given that these technologies do have the potential to impact society in major ways (although much of the coverage remains dismally misleading). Nevertheless, for someone more on the inside of things, it's hard not to feel like the public response is vastly disproportionate to the scale of change that has taken place from one moment to the next.

In particular, it feels like there are two aspects of the ChatGPT release that should be getting far more attention than they are.

The first is the question of evaluation. In some sense, this is kind of a moot point, given how overwhelmingly impressive the system is. Moreover, the potential capabilities are so vast that it's likely hard to quantify all relevant dimensions. Nevertheless, the fact that the system can only be accessed via the website or an API, and moreover, that it could easily be changing at any moment behind the scenes, means that it's extremely difficult to do any kind of rigorous quantification of exactly how effective it is at certain things.
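To see why this matters in practice, consider what careful bookkeeping for a black-box evaluation would minimally require. The sketch below assumes a hypothetical `query_model` client (a placeholder, not OpenAI's actual interface); the point is simply that every call must be timestamped and logged, because the system behind the endpoint may silently change between runs.

```python
# A sketch of evaluation bookkeeping for a black-box hosted system.
# `query_model` is a hypothetical placeholder, not OpenAI's actual API.
import json
from datetime import datetime, timezone

def query_model(prompt: str) -> str:
    raise NotImplementedError("plug in an actual client here")

def evaluate(prompts, log_path="eval_log.jsonl"):
    records = []
    with open(log_path, "a") as log:
        for prompt in prompts:
            response = query_model(prompt)
            record = {
                # Record when each call was made: results are only
                # interpretable relative to whatever version of the
                # system happened to be deployed at that moment.
                "timestamp": datetime.now(timezone.utc).isoformat(),
                "prompt": prompt,
                "response": response,
            }
            log.write(json.dumps(record) + "\n")
            records.append(record)
    return records
```

Even with logging like this, there is no way to verify whether two responses came from the same underlying model, which is exactly the difficulty for rigorous quantification.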

Indeed, evaluation is an area where there is something of a broader crisis within NLP. Part of the problem is that as soon as one develops a benchmark, people begin optimizing towards solving it, in ways that would not have occurred had the benchmark not existed. As such, the creation of a benchmark can, paradoxically, reduce our ability to fairly evaluate the thing we want to test. Even in cases where there is a properly hidden test set, some amount of tuning toward the task inevitably takes place, such that we can't necessarily expect performance on a benchmark to generalize to the things we actually care about.
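One way to make the benchmark-optimization effect concrete is with a toy simulation (an illustration under stated assumptions, not a claim about any real benchmark): if enough candidate models are scored against a single fixed test set, the best observed score drifts above chance even when every candidate is guessing at random.

```python
# Toy simulation: selecting the best of many random "models" on one
# fixed test set produces a score well above true (chance) skill.
import numpy as np

rng = np.random.default_rng(0)
n_examples, n_candidates = 200, 500

labels = rng.integers(0, 2, size=n_examples)  # a fixed binary test set
# Each candidate "model" just guesses at random; none has real skill.
guesses = rng.integers(0, 2, size=(n_candidates, n_examples))
accuracies = (guesses == labels).mean(axis=1)

print("true skill of every candidate: 0.500")
print(f"best observed score: {accuracies.max():.3f}")  # well above 0.5
```

The winning score reflects selection pressure on the test set rather than genuine capability, which is one reason benchmark numbers fail to generalize to the things we actually care about.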

The second, and even more important, problem is the lack of transparency. Aside from a small amount of written material (basically a blog post), OpenAI has simply not explained what they have done (or continue to do) in the creation and tuning of ChatGPT. Although they are free to operate as they choose, the lack of information stands in stark contrast to good scientific practice. Without a more public release of models or code, it is extremely difficult for anyone to properly interrogate what is happening behind the scenes.

You could say that this represents both a commercial and a scientific opportunity. In part because of the lack of transparency, there are a huge number of open questions about what works well and what doesn't in producing such models. The existence of ChatGPT is something like a proof of concept, but different kinds of work will be required to answer the more interesting scientific questions. That will likely involve building similar resources outside of a corporate setting (as the BigScience project recently attempted to do). Unfortunately, such work is extremely resource intensive and requires a dedicated effort. It will be exciting to see what kinds of organizational models emerge around this, but it is still somewhat frustrating that a more direct path is not available.

The truth is, to some extent, ChatGPT is just a different kind of thing. It's not really a scientific artifact per se, but rather a commercial product, which is completely understandable from a commercial point of view. But it could also easily be the case that it is currently among the most useful extant instruments for advancing a scientific understanding of large language models and machine learning more generally.

There are so few models of this scale that are readily accessible to researchers that it feels like something of a travesty to have it so locked up. It's almost as if someone built a supercollider, or some other enormously expensive experimental platform for physics, but did so only for commercial purposes, and did not make it available to, or in partnership with, the scientific community. (Perhaps granting access for $42 per month?)

In any event, it seems highly probable that we will soon see a multitude of similar products released, along with all manner of third-party intermediary services, though just how much diversity will exist among these is debatable, given that most systems are trained on such similar material in such similar ways.

My guess is that coverage of this topic will soon abate, even as various parts of the economy, and communities such as academia, slowly adapt. And while that is happening, researchers will gradually cobble together the resources required to study the relevant scientific questions, and eventually get a better understanding of precisely how such capable systems emerge from such simple designs, and just what the limits of those capabilities are.