With the widespread success of language models on many tasks in a zero-shot setting, there has been a huge surge of interest among social scientists in using them to code or classify documents, sometimes in place of human annotators. Given both the freedom to specify prompts, and the lack of connection to domain-specific training data, a concern naturally arises as to how easily people can manipulate their designs to produce a desired conclusion. This had been on my mind recently, and so I was delighted to see that a couple of recent papers specifically take on this question, both of which conclude that there is indeed considerable latitude to produce a desired finding by manipulating the choices involved. These are extremely useful and important results, although they also open up some questions for me, which I wanted to think through here.
The first paper is “Large Language Model Hacking” by Joachim Baumann, Dirk Hovy, and others from Bocconi University and elsewhere. The second is “Data Annotation with Large Language Models” by Eddie Yang, Yaosheng Xu, and others from Purdue.1 Here, I’ll mostly concentrate on the first of these, both because I encountered it first, and because I’ve read it more carefully, although both are worth reading. Both papers appear to have been a huge amount of work, and include extensive experimentation in support of their claims.
To provide greater context, the main issue these papers are confronting is that social scientists are now commonly using language models to classify or label data, and then plugging those results into a downstream analysis. For example, one might want to judge many documents, such as political speeches, in terms of some social science construct, like ideology, and then analyze the degree to which that construct is associated with some other variable, such as party. The outcome of interest is a coefficient in the downstream model (i.e., ideology ~ β · party), and the concern is that by choosing a different language model or prompt, one could potentially produce different individual predictions that would in turn lead to different estimates of that coefficient (i.e., the association between party and ideology). As the Baumann paper puts it, “researchers can also deliberately exploit multiple configuration comparisons to achieve desirable, statistically significant results”.
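To make this pipeline concrete, here is a minimal sketch of the kind of two-step analysis at issue (this is my own toy illustration, not code from either paper; the `score_ideology` function is a hypothetical stand-in for whatever prompt-plus-model combination a researcher might choose, simulated here with random noise so the example runs):

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Hypothetical corpus of speeches with a party covariate (1 = party A, 0 = party B).
speeches = pd.DataFrame({
    "text": [f"...speech {i}..." for i in range(100)],
    "party": np.repeat([0, 1], 50),
})

def score_ideology(text: str, prompt: str, model: str) -> float:
    """Stand-in for an LLM call mapping a document to an ideology score.
    In practice, this is where all of the prompt/model flexibility enters."""
    rng = np.random.default_rng(abs(hash((text, prompt, model))) % (2**32))
    return float(rng.normal())

# Step 1: use the language model as a measurement instrument.
speeches["ideology"] = [
    score_ideology(t, prompt="Rate the ideology of this speech from -1 to 1.",
                   model="some-llm")
    for t in speeches["text"]
]

# Step 2: plug the measurements into the downstream model of interest.
fit = smf.ols("ideology ~ party", data=speeches).fit()
print(fit.params["party"])  # the coefficient the analysis actually reports
```

The point is simply that the reported quantity is the downstream coefficient, which sits two steps removed from any individual labeling decision.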
In many ways, this is not so different from the situation that has existed in text-as-data work for many years. The key differences seem to be that (a) language models have made it much easier for almost anyone to do this kind of analysis; and (b) the use of language models involves a vast space of choices (many of which are not necessarily widely reported), which potentially exacerbates the possibility of intentional or unintentional manipulation. In other words, this is essentially a version of Andrew Gelman’s classic “garden of forking paths”, but perhaps with a much vaster and more tortuous garden.2 In an earlier era, one could potentially manipulate the choice of model (e.g., a random forest versus an SVM), as well as various hyperparameters (such as regularization strength, or what features to use); in the modern era, however, one might use basically any prompt that sounds plausible, providing—at least conceptually—a great deal more flexibility, without obvious limits.
To assess how much of a problem this might be, both papers take a similar approach: they gather datasets from past computational social science and NLP papers, swap in language models (with a variety of prompts and configurations) for the measurement methods used in the original papers, and see how this changes the estimated coefficients of a downstream model. Notably, the focus here is on whether this affects the results of the downstream analysis, rather than on the individual predicted labels. So, continuing the example from above, if an original paper found some relationship between party and ideology, we might ask whether that finding is replicated when using various language models for measuring ideology, instead of whatever was originally used.
This is all off to a great start, and overall the results of both papers are quite convincing. In particular, they both find that in most cases, different language models and/or prompts produced highly variable results, in terms of the downstream regressions. Moreover, it seems that this was true even when the models had relatively high agreement with the validation data, and regardless of whether or not the original annotators had high agreement with each other. The concern here is that because of this flexibility, a nefarious (or careless) researcher could manipulate their use of language models to either confirm or refute a hypothesis.3
Despite the significant contributions of these papers, they raised a number of questions in my mind, which I’ll discuss below.
Probably the most obvious issue with both papers is that they make the very practical choice to treat the human annotations provided by the existing datasets as a reliable gold standard.4 While the authors acknowledge that this is almost certainly false, and note it as a limitation, it does make it more complicated to judge the output of language models in relation to those labels. Annotations in NLP vary widely, both in terms of the care with which they were produced and the level of agreement among annotators. As such, it’s not hard to imagine cases where a state-of-the-art model actually is more reliable than, for example, crowd-workers or undergraduates paid to do the original annotations.5 Especially given that the Baumann paper finds that Type II errors dominate, this raises a concern that the original papers might themselves have involved some degree of p-hacking (that is, if LLM-based analyses frequently fail to find the originally reported effects, one possibility is that some of those effects were fragile to begin with).
With the right framing, this wouldn’t necessarily matter that much. Indeed, one could effectively do a similar analysis without any reference to human annotations at all, simply by trying many different prompts and models to see how much the results of the downstream analysis vary. If one can achieve wide variation, that would imply there is scope for results hacking, regardless of what the “true” effect is. In fact, this is part of what is done in the Yang paper, where they plot the resulting coefficients when using predictions from various models, and show generally clustered results, with some extreme outliers (see the figure below, and the code sketch that follows it).

Detail from a figure in Yang et al., where each row is a model coefficient (from a corresponding paper), each symbol is a different language model, and the red dots are the estimates from the original data annotated by humans.
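As a rough sketch of what this kind of sensitivity check could look like in code, one might simply loop over prompt and model configurations and inspect the spread of downstream coefficients (again a toy illustration of my own, reusing the same kind of hypothetical `score_ideology` stub as above rather than a real LLM call):

```python
import itertools
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

speeches = pd.DataFrame({
    "text": [f"...speech {i}..." for i in range(200)],
    "party": np.repeat([0, 1], 100),
})

def score_ideology(text, prompt, model):
    # Hypothetical stand-in for an LLM call; simulated with noise so this runs.
    rng = np.random.default_rng(abs(hash((text, prompt, model))) % (2**32))
    return float(rng.normal())

prompts = [
    "Rate the ideology of this speech from -1 to 1.",
    "Is this speech liberal or conservative? Answer with a score in [-1, 1].",
]
models = ["model-a", "model-b", "model-c"]

# One downstream coefficient per (prompt, model) configuration.
rows = []
for prompt, model in itertools.product(prompts, models):
    df = speeches.assign(
        ideology=[score_ideology(t, prompt, model) for t in speeches["text"]])
    fit = smf.ols("ideology ~ party", data=df).fit()
    rows.append({"model": model, "beta_party": fit.params["party"],
                 "p_value": fit.pvalues["party"]})

results = pd.DataFrame(rows)
print(results)
print("range of estimates:",
      results["beta_party"].min(), "to", results["beta_party"].max())
```

If the range of estimates is wide enough to flip the sign of the coefficient, or to move a p-value across a significance threshold, that alone tells us something about the scope for hacking, with or without a gold standard.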
Ultimately, however, both papers primarily work within a frequentist framework, in which there is an assumed null hypothesis, which the analysis can potentially reject. (In our running example, this would be something like the hypothesis that there is no relationship between party and ideology.) This means that the primary results in these papers are stated in terms of the likelihood of Type I and Type II errors (i.e., false positives and false negatives). The Baumann paper also considers Type-S and Type-M errors (errors in the sign and magnitude of the coefficient, respectively), but these take a back seat to the emphasis on disproving hypotheses.
This framing is a fairly reasonable way of simplifying the complexity of reporting their results. Nevertheless, it sometimes distracts attention from the core issue. For example, Baumann et al. note that those findings (derived using human annotations) which have p-values close to the typical significance threshold of p < 0.05 are more likely to result in “hacking” when using LLMs. This makes sense, but it has much more to do with the fragility of relying on a particular p-value threshold than with anything inherent to language models.
Moreover, this issue with null hypothesis testing is somewhat complicated by the particular hypotheses that the authors choose to investigate. Baumann et al., in particular, include in their analyses both the hypotheses that were put forward by the original papers and additional synthetic hypotheses that they generate automatically. To me, some of these hypotheses are questionable. As they explain in the appendix, they mostly come from splitting the dataset by properties like document length, or whether or not documents include certain words. In general, these are not likely to be particularly well-motivated hypotheses, and they also have the effect of inducing covariance between the dependent and independent variables (since length, or the presence of certain words, could also influence the language model output).
Given that the overall results in the paper aggregate so many different experiments, it is hard to tell whether the headline findings hold quite generally, or whether they are driven by these particular hypotheses. Presumably, if we chose hypotheses which involved large and well-known effects (for example, that Republicans and Democrats are highly polarized on many issues today), and restricted ourselves to high-quality models and reasonable-sounding prompts, it would be much more difficult to intentionally produce a false negative.
These hypotheses also lead to what feels like a missed opportunity for greater clarity. Along with the rest of their hypotheses, Baumann et al. include one which is simply a random partitioning of the data. This, however, feels underused in the analysis. If one randomly partitions the data, that should guarantee there is no association between the partition and the outcome labels, at least in expectation. As such, one can precisely quantify the expected number of false positives one would obtain with randomly partitioned data, which would seem like an ideal way to measure the propensity of models to find false positives in cases where there is truly no effect (a toy simulation of this baseline is sketched below).
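To spell out that baseline with a toy simulation (my own, not from the paper): if the grouping variable is a purely random partition, a valid test should reject at roughly the nominal rate, so with α = 0.05 we would expect about 5% of configurations to yield a “significant” result by chance, and anything well above that could be attributed to the measurement step rather than to the data:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
n_docs, n_runs, alpha = 500, 2000, 0.05

rejections = 0
for _ in range(n_runs):
    # A purely random partition of the documents: no true effect, by construction.
    group = rng.integers(0, 2, size=n_docs)
    # Stand-in for an LLM-derived outcome measure, independent of the partition.
    outcome = rng.normal(size=n_docs)
    _, p = stats.ttest_ind(outcome[group == 0], outcome[group == 1])
    rejections += p < alpha

print(f"rejection rate: {rejections / n_runs:.3f} (expected ≈ {alpha})")
```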
By contrast, most of the other hypotheses are basically the opposite. As Andrew Gelman is fond of pointing out, in most social contexts, there is no such thing as a null effect. Basically everything has some non-zero association; it’s just a matter of whether you have enough data to estimate it precisely enough to be confident about whether it’s positive or negative.6 In a similar way here, it seems plausible to assume that all of the null hypotheses used (other than random partitioning) are actually false, regardless of what can be inferred from the available annotated data. Whether or not they tend to be rejected will thus have more to do with the amount of data available than with the specific language modeling choices.
Regardless, the results clearly show that it is very possible to obtain wide variation in estimates for the parameters considered here, even in cases where the models are generally quite accurate, and regardless of whether annotators tended to agree with each other.7 Interestingly, Baumann et al. conclude that prompt engineering is not the primary cause of this issue, finding that “prompt engineering choices contribute less than 1% of explained variance”, although presumably this depends a lot on the specific context, in terms of what prompts and hypotheses are included. The fact that they tend to lump all their hypotheses together in the analysis makes this a bit hard to tease apart, but it seems plausible that the choice of model is effectively the biggest source of variation.8
For me, the biggest surprise in both papers was that various bias correction methods underperformed relative to expectations, which seems like a bit of a mystery. Arguably the biggest problem with using language models in social science research (or any other machine learning method, for that matter) is that the distribution of errors is almost certainly not random, and these systematic errors in individual predictions may influence the results of downstream analyses. A few recent papers have proposed approaches to correct for this. One method that I’m a fan of, known as design-based supervised learning (DSL), is from Naoki Egami (along with Brandon Stewart and others), who was a keynote speaker at last year’s NLP+CSS workshop.
I am planning to write more about these methods in the future, but at least in theory, they have nice statistical properties, such that using them should tend to be better than using annotated data alone. The Baumann paper, however, finds that using the Egami correction produces worse results, on average, than just doing the analysis with an equivalent number of human annotations (see below). This seems surprising, and partially makes me wonder whether a few cases are distorting the average.

Figure from Baumann et al., showing the results of various bias correction techniques, relative to only using human annotations. The basic Egami correction is GT + LLM + DSL (M3), which they find performs slightly worse (in terms of both Type I and Type II errors) than just using the human annotations (M1).
Yang et al. similarly conclude “For most studies, a sample size of around 800 to 1,000 is required for DSL to become beneficial”, which does not match my expectations. As far as I understand, more labeled data should translate into less variance and tighter confidence intervals for the corrected estimates, but they should be unbiased regardless. Of course, that’s not the same as a guarantee that the correction will help in every case. I’m not sure I fully understand the evaluation that was done here, but it’s possible that if we could break the results out by dataset, model, or hypothesis, things would become clearer.9 Regardless, this feels like a promising direction for further investigation.
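To make it clearer why I had expected the correction to help, here is a toy sketch of my understanding of the core design-based idea behind DSL: construct a bias-corrected pseudo-outcome by using the LLM prediction for every document plus an inverse-probability-weighted adjustment from a randomly sampled, expert-labeled subset, and then run the downstream regression on that pseudo-outcome. This is a deliberate simplification (the actual DSL estimator, as I understand it, also allows fitting a model to the surrogate labels and handles variance estimation more carefully), so treat it as an illustration rather than the authors’ implementation:

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(1)
n, pi = 2000, 0.1                 # corpus size; known probability of expert labeling

party = rng.integers(0, 2, size=n)
true_ideology = 0.5 * party + rng.normal(scale=1.0, size=n)              # the construct
llm_score = true_ideology + 0.3 * party + rng.normal(scale=0.5, size=n)  # biased LLM proxy

labeled = rng.random(n) < pi                     # random subsample sent to expert coders
expert = np.where(labeled, true_ideology, 0.0)   # expert labels, observed only if sampled

# Bias-corrected pseudo-outcome: LLM prediction everywhere, plus an
# inverse-probability-weighted correction on the expert-labeled subset.
y_tilde = llm_score + (labeled / pi) * (expert - llm_score)

X = sm.add_constant(party)
naive = sm.OLS(llm_score, X).fit()                  # treats LLM output as ground truth
corrected = sm.OLS(y_tilde, X).fit(cov_type="HC1")  # downstream model on pseudo-outcome
print("naive beta:", naive.params[1])               # biased (close to 0.8 here)
print("corrected beta:", corrected.params[1])       # unbiased for 0.5, but noisier
```

In this simple setup, the corrected coefficient is unbiased at any labeling budget; what shrinks as the number of expert labels grows is its variance, which is part of why the 800-to-1,000-label threshold surprises me.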
A couple of final points I want to make have to do with how parts of these papers are framed. The Baumann paper draws a distinction between p-hacking and language model hacking, writing “both practices can yield significant p values where none should exist, but LLM hacking achieves this by shaping the annotated data itself rather than how that data is analyzed.” To my mind, this slightly stretches a reasonable use of the term “data”. The text to be classified and the human-generated annotations both seem appropriate to refer to as data. By contrast, I would tend to refer to the use of LLMs to simulate annotators as a kind of data analysis, rather than as a source of new data. As such, I would argue that language model hacking is not fundamentally different from p-hacking; it is just a special case of it that involves the use of language models to mimic human annotators.
The Yang paper, by contrast, has much more of an optimistic framing, at least in places. The authors write, for example, that “Perhaps the biggest advantage of LLMs is that there is no need for manual coding”. If anything is clear by now, it’s that one can’t simply trust the output of language models on a novel task without some sort of validation, which in most cases will imply a need for manual coding. While the authors are more critical and cautious elsewhere in the paper, I worry slightly that lines like this have the potential to mislead readers.
For their part, Baumann et al. place some hope in “multiverse analysis”, which they describe as “reporting coefficient distributions across all reasonable LLM configurations rather than single point estimates”. As appealing as this may be, I’m somewhat more skeptical of this approach, given that it seems very difficult to come up with a way of characterizing all reasonable configurations, and possibly very resource-intensive to run orders of magnitude more experiments.
Regardless, I’m sure that stronger norms will eventually arise around what sorts of models are appropriate, and what approaches are reasonable for demonstrating validity in these kinds of analyses. In the meantime, these papers have only strengthened the case that some amount of high-quality human annotation is still needed in order to obtain trustworthy findings in this kind of social science research.
The second paper is currently only available on the author’s website, as far as I am aware. ↩︎
There are of course additional issues with the use of language models that were not present in earlier modes of analysis, such as hard-to-resolve concerns about data leakage, and the use of commercial models that may limit replicability, but these are distinct from the issue being considered in these papers. ↩︎
The case we typically worry more about, especially in light of the replication crisis, is producing false positives or exaggerated findings. The opposite could also happen, however, if one wanted to downplay the inferred importance of some particular variable. ↩︎
To be fair, the Yang paper starts off by saying that the authors do not assume human annotations are the ground truth, and first looks at differences across models, but it eventually shifts to adopting the perspective that the human annotations are perfect when experimenting with bias correction methods. ↩︎
Indeed, some of the datasets come from papers which are specifically trying to make the point about language models being superior. How to define “better” or “more reliable” is complicated, of course, and deserves more future posts. ↩︎
Note that this is different in an important way from Matt Gardner’s provocative work arguing that all word associations are spurious. The perspective there is that there is no necessary relationship between a word and something we might measure from text, because words are always context dependent. The point here, by contrast, is that within a specific domain, people will typically use words in a way that has some empirical association with anything else we might want to measure. That association might be very weak, however, which would require a lot of data to confidently estimate. ↩︎
This finding feels like it has the potential to be slightly misleading. Clearly, if a language model could perfectly replicate the decisions of expert human annotators on held-out data, that would greatly reduce any concern about language model hacking (although it might raise other questions). In practice, however, most models only had moderate agreement with the existing annotations. The point is that even relatively high accuracy on individual instances could still allow room for potentially meaningful differences in the estimated coefficients of the downstream model. As for agreement rates, it’s not obvious that there should be any inherent relationship with hacking potential, but I’m not convinced that we can totally trust the analysis here. Figure 9 in the Baumann paper (v2) shows that most datasets they used had Krippendorff’s alpha values of 1.0 (indicating perfect agreement), which does not seem credible. My guess is that these are agreement rates calculated after some sort of deliberation step, which is quite different from the baseline agreement rate. ↩︎
This impact of the choice of model seems deeply intertwined with other factors, including publishing norms. Those models which are generally most capable will tend to have higher agreement with experts. As such, it seems like the most obvious path to results hacking is to try a variety of weaker models and take the result that one likes best (although I would guess that if we dug into the actual output of such models, rather than just the extracted labels, we might find clues). One could insist that papers only use the best models, but this would have implications for cost, as well as for factors like replicability. ↩︎
Yang et al. also write “The standard errors of the DSL-corrected estimates are consistently larger than those of the naive estimates”, which to me suggests a potential confusion. If I understand correctly, the “naive estimate” here refers to using the predictions of the language model as if they were completely reliable. In my opinion, the standard errors from that approach are not really relevant, since we can be almost certain that there is some systematic error present. I would argue that it doesn’t make sense to think of “inflation of variance” relative to the uncorrected predictions as a downside of DSL, and that the more relevant comparison is to just using the available human annotations. If DSL is consistently producing larger standard errors than just using human annotations, then that seems worth digging into. ↩︎