Credible Estimates and Open Science


Somewhere on social media I encountered the recent story that the Tony Blair Institute for Global Change (a large, avowedly centrist think tank created by Tony Blair) put out a report estimating the potential impact of AI on public-sector jobs, but that, ironically, they arrived at their estimates by basically just asking GPT-4. While this does have a certain kind of poetry to it, it’s not exactly what we think of as a reputable methodology.1

When I first saw the story, I thought that perhaps commentators were being unfair to the authors, and that maybe they had actually done something careful or sophisticated. But no: reading their Methodology section (which, to be fair, they didn’t try to hide away in an appendix, but rather put right up front), they come right out and say that what they wanted to do was too hard (“relying on expert judgements to make these individual decisions would make our analysis intractable”), so they essentially just asked GPT-4 to do it.

The goal of the report was to “estimate the potential for time savings for public-sector workers using AI tools”. The authors started with a database from O*NET of about 20,000 work tasks associated with various jobs. As prior work, they point to a report from OpenAI, which used a combination of GPT-4 and human annotators to estimate which of those tasks could be done with AI. The Blair Institute report, however, dismisses that approach with a somewhat handwavy argument about LLMs like GPT-4 being “mysterious” and “not deterministic”.

Strangely, the authors of the Blair Institute report nevertheless end up doing something very similar, except using a multistage process, without incorporating any human estimates. First, they ask GPT-4 to categorize each task in terms of several different “determinants” of whether a task can be performed by AI or not, and then feed the task back into GPT-4, along with the initial GPT-4 assessments of these features, to get a more targeted estimate from GPT-4 (apparently setting aside all prior concerns about mysteriousness and non-determinism). To quote from the report, “In the final stage, we ask GPT-4 to categorize the type of AI tool that could be used to perform the task, estimate the time saving and give an assessment of whether it would be cost effective to deploy the tool”.
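For concreteness, here is a rough sketch (in Python, using the OpenAI client) of the general shape of the two-stage pipeline being described. To be clear, the prompt wording, the list of “determinants”, and the output format here are my own guesses for illustration, not the report’s actual prompts.

```python
# A rough sketch of the kind of two-stage prompting pipeline described above.
# The prompts and the "determinants" below are hypothetical placeholders.
from openai import OpenAI

client = OpenAI()

DETERMINANTS = [  # hypothetical; the report's actual determinants may differ
    "degree of physical manipulation required",
    "need for interpersonal interaction",
    "availability of digital inputs and outputs",
]

def ask(prompt: str) -> str:
    resp = client.chat.completions.create(
        model="gpt-4",
        messages=[{"role": "user", "content": prompt}],
        temperature=0,  # reduces (but does not eliminate) run-to-run variation
    )
    return resp.choices[0].message.content

def assess_task(task_description: str) -> str:
    # Stage 1: ask the model to rate the task on each determinant.
    ratings = {
        d: ask(f"Rate (low/medium/high) the following for the task "
               f"'{task_description}': {d}. Answer with one word.")
        for d in DETERMINANTS
    }
    # Stage 2: feed the stage-1 ratings back in and ask for a time-saving
    # estimate and a cost-effectiveness judgement.
    summary = "; ".join(f"{d}: {r}" for d, r in ratings.items())
    return ask(
        f"Task: {task_description}\nAssessments: {summary}\n"
        "What type of AI tool could perform this task, roughly what share "
        "of the time spent on it could be saved, and would deploying such "
        "a tool be cost effective?"
    )
```

Note that nothing in such a pipeline validates the intermediate ratings; the second-stage estimate is only as good as the first-stage guesses it is conditioned on.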

The final categories they end up with are things like free vs. low-cost vs. bespoke AI tools and systems, and those that require high- vs. low-cost equipment. Breaking things down by occupation, they suggest that those with little potential for time saving from AI include ambulance staff, probation officers, and elected representatives (lol), whereas those with high potential time saving include “communication operators” and various administrative and secretarial occupations.

Although this is all quite easy to make fun of, I think there are a few interesting things about it. First, it seems highly likely that we will encounter more and more reports like this going forward, except that not everyone will be so explicit about how or where they got their numbers. Notably, the report’s authors at least admit that they used GPT-4, but they still do not provide enough detail to actually replicate their findings.

Perhaps more interesting is what the reaction has been. Much of the coverage has understandably been at least partly mocking (e.g., “Tony Blair Institute says AI is good for UK because ChatGPT said so”), but most of this mockery seems to focus on GPT-4 being unreliable, rather than on how the report uses the resulting estimates, or on taking issue with the task itself. These articles tend to read as if, had the report’s authors actually done the hard work of getting experts to perform the same task (estimating the potential time saving from AI), the report would not have been treated with the same level of derision.

To me, the problem with that is that it’s not clear how much we should trust such estimates from experts either. Obviously there are people who are experts in both occupational task composition and current AI capabilities, but assessing the potential for automation of a task that has been taken out of context seems like a somewhat quixotic endeavor.

There will certainly be exceptions to this; to take some examples from O*NET, I would be comfortable saying that the potential for the task “Inserts new or repaired tumblers into lock to change combination” to be done by current AI alone (i.e., without some sort of specialized hardware, which I assume does not currently exist) is basically zero. But for a task like “Compiles and analyzes data on conditions, such as location, drainage, and location of structures for environmental reports and landscaping plans”, what does it mean for this to be “done by AI”? Almost certainly, some sort of AI (loosely defined) is already being used for this task, and incorporating more sophisticated tools will involve changing the nature of the task, not just replacing people with machines. It seems to me there is a serious measurement problem here, regardless of whether we try to resolve it by asking experts or an LLM.

As a final point, the report is interesting because I think it points to one of the key things people are hoping for from these models, which is to be some kind of universal information aggregation and reasoning machine, i.e., something that will take in all available information on a question and come to some sort of well-reasoned inference based on that information. We do of course already have some very highly developed statistical machinery that will do this exact thing (i.e., statistical inference), but it requires relatively highly structured problems and input data. The hope seems to be that LLMs will somehow be able to do the same thing, but also be able to operate on much more abstract natural language data.
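To make concrete what I mean by “highly developed statistical machinery”, here is a minimal example of structured aggregation: combining several independent estimates of the same quantity by inverse-variance weighting (a fixed-effect meta-analysis). The numbers are made up purely for illustration; the point is that this kind of inference only works once the inputs are already structured as point estimates with standard errors.

```python
# Minimal example of classical information aggregation: pooling independent
# estimates of the same quantity via inverse-variance weighting.
import math

estimates = [0.30, 0.45, 0.38]    # point estimates from three (made-up) studies
std_errors = [0.10, 0.05, 0.08]   # their standard errors

weights = [1 / se**2 for se in std_errors]
pooled = sum(w * e for w, e in zip(weights, estimates)) / sum(weights)
pooled_se = math.sqrt(1 / sum(weights))

print(f"pooled estimate: {pooled:.3f} +/- {1.96 * pooled_se:.3f} (95% CI)")
```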

Unfortunately, even ignoring the biases and incompleteness in the information that these models have been trained on, this is just not how we should expect things to work in general. Learning to embed input texts in such a way that those which are followed by similar words are placed close together is just not the same thing as applying statistical inference to the information that underlies that text. When the report says “we would ideally want to assess the ability of AI to perform each individual task, drawing on a wide range of sources on AI’s current capabilities”, that makes total sense as a goal. Unfortunately, that’s just not what these models have been trained to do.

There is of course one clear exception to this, which is precisely the task that models have been trained to do—namely, to estimate the probability that a particular word will follow an input sequence of text, in terms of the distribution of text on the internet. Here, it is clear that models are able to do this task much better than people, which makes total sense. Not only does doing this task well depend on having been exposed to a very large sample of text, it also requires integrating that information in a principled way, both of which are things that people are woefully ill-equipped to do.
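For readers who have not looked under the hood, here is a minimal sketch of what that training objective looks like at inference time: reading off the model’s probability distribution over the next token given a context. I use GPT-2 via the Hugging Face transformers library here purely because it is small and openly available; the specific context string is arbitrary.

```python
# Inspecting the one task these models are explicitly trained on: assigning
# probabilities to the next token given a context.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")
model.eval()

context = "The report was written by the Tony Blair Institute for Global"
inputs = tokenizer(context, return_tensors="pt")

with torch.no_grad():
    logits = model(**inputs).logits          # shape: (1, seq_len, vocab_size)

next_token_probs = torch.softmax(logits[0, -1], dim=-1)
top = torch.topk(next_token_probs, k=5)
for prob, idx in zip(top.values, top.indices):
    print(f"{tokenizer.decode(int(idx)):>12}  {prob.item():.3f}")
```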

Nevertheless, we should note two limitations on even that most modest of goals. The first is that this is still domain specific. These models are sensitive enough to context that they will adapt their distributions to the input (indeed, that is the whole point), but there may still be domains in which experts do better, especially domains that are not well represented in the training data. For example, for an uncommon language with very little presence on the internet, we would expect a speaker of that language to do better than a state-of-the-art model. The second caveat is that for the models most commonly being used these days, namely instruction-tuned models, the instruction tuning seems to dramatically alter the underlying next-word prediction probabilities, such that these models are actually not very well calibrated for predicting the next word.
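If one wanted to probe that second caveat empirically, a rough first pass would be to compare the average per-token negative log-likelihood (equivalently, the perplexity) of a base model and its instruction-tuned counterpart on the same held-out text. The sketch below assumes such a base/instruct pair exists for whatever model family you have access to; the names are placeholders, not real model identifiers.

```python
# Rough check of how instruction tuning shifts next-word prediction: compare
# mean per-token negative log-likelihood (and perplexity) on held-out text.
import math
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

def mean_nll(model_name: str, text: str) -> float:
    tokenizer = AutoTokenizer.from_pretrained(model_name)
    model = AutoModelForCausalLM.from_pretrained(model_name)
    model.eval()
    ids = tokenizer(text, return_tensors="pt").input_ids
    with torch.no_grad():
        # Passing labels makes the model compute the shifted next-token loss.
        loss = model(ids, labels=ids).loss
    return loss.item()

held_out = "Some held-out text that neither model was tuned on..."
for name in ["base-model-placeholder", "instruct-model-placeholder"]:
    nll = mean_nll(name, held_out)
    print(f"{name}: mean NLL {nll:.3f}, perplexity {math.exp(nll):.1f}")
```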

Regardless, this is clearly at least an existence proof that these models can do much better than people at integrating certain types of information, in certain specialized contexts. The big questions are really how far we can push that ability, and in what circumstances we should place our trust in these models. Unfortunately, we are still far from clear answers to either question.

Hopefully the Blair Institute incident will be seen as a warning to policymakers and policy advocates, not only that it is premature to place their trust wholly in these models, but also that people are generally (and rightfully) skeptical of estimates obtained by such means (even if they should in some cases also be more skeptical of estimates from humans). That does not mean we can never make use of this kind of LLM-based attempt to aggregate information. Rather, the point is that it is insufficient to use such an approach on its own, without some pretty extensive validation and/or complementary approaches. Moreover, for such a method to be at all credible, it is also necessary to provide sufficient detail to replicate it, as well as (ideally) the intermediate data produced along the way, for others to inspect and interrogate.

In many ways, we do want policymakers and policy analysts to be using all of the available information, and the best available tools for data analysis; but the more they do so, the more important it is that they also adopt the best practices that have emerged from the open science movement. This includes transparency, reproducibility, and documentation, but does not stop there, and it is arguably an area where there is a need for improved public education, extending well beyond data scientists.


  1. In case it goes offline, the report is also available on the Wayback Machine here↩︎