We are at a strange moment in scholarship, where a powerful new tool (language models) has been handed to people, and existing academic systems are scrambling to figure out how to respond. Even leaving aside the crisis around a potential deluge of AI-generated papers, there are many more conventional questions about what counts as appropriate use of language models in domains other than computer science. Especially in the social sciences, my understanding is that established journals are struggling both to recruit qualified reviewers and to know what the appropriate standards for methodological rigor should be.1
I was recently asked to review a paper for a top social science journal (something I sometimes say yes to), and working through the details of that paper turned out to be a useful opportunity for me to reflect a bit on current practices for the use of LMs in social science, and how they could be improved. Although reviewing is often rewarding, putting a lot of time into a review that will only ever be seen by a handful of people always feels like a bit of a shame. As such, I thought I’d try an experiment and post an anonymized excerpt from my review here, both as a way of recording my thoughts in the present moment, and as a way of publicly raising some questions about the use of LMs for social science.
My full review touched on a number of aspects of the submission, but here I have kept only the part that dealt with methodological questions about the use of LMs. I have edited the text to make it difficult to identify which paper I was reviewing, although it remains close enough that the authors would no doubt recognize me as the reviewer.2
Finally, I will also note that I myself have previously been guilty of some of the issues that I fault the authors for here! From my perspective, reviewing is a way both to help authors improve their work and to think through how one’s own work could be better. Although there is a certain unfairness in holding others to standards I have not always met myself, standards differ across journals, norms are constantly evolving, and we should all collectively strive to do better. I’m hoping that by sharing part of this review more widely, I can do a small amount to help shift things in that direction.
Here is the edited excerpt:
…
In terms of methodology, this paper does many things well. The authors ground their measures in theoretically motivated constructs. They collect an extensive set of annotations, including multiple annotations per example, and make use of a reasonable division into training and evaluation data. They evaluate a variety of language models against reasonable supervised baselines, and provide some sense of how model predictions differ from patterns in human annotations. They also make attempts to validate the models, first by comparing predictions to human annotations, and second by comparing aggregated predictions from multiple models against each other. In terms of documentation, the authors include the prompts they used, along with some hyperparameter values. Finally, they also contextualize and interpret their findings in light of relevant knowledge about the topic under investigation.
On the one hand, this paper is definitely a contribution to evaluating the ability of LMs to perform text classification on nuanced or novel categories without fine-tuning, and how to get the best performance on such tasks. This is especially the case because the authors focus on a set of categories that have not been widely studied in NLP, compared to more familiar tasks like sentiment analysis. At the same time, there is still considerable room for improvement, as well as some issues that may be hard to resolve. Since the authors mention prioritizing validation, I think it is worth identifying areas where this could be done better.
I will break this down into several factors. First is the stage of data annotation. Although the authors include some details (e.g., measures of inter-annotator agreement), they don’t provide much detail as to how they addressed issues of validity at the annotation stage. In particular, how did the authors ensure that the annotators were correctly applying the existing codebook? If the authors did not make use of expert annotators in this work, how did they ensure that annotators correctly understood the typology? As a minor point, it would also further strengthen claims to validity if the authors committed to releasing the full set of data annotations (ideally with some basic pseudonymized annotator information included).
The second issue is documentation. Although the authors provide the prompts they used, and include the temperature value, they do not explain how they arrived at these choices (either for prompts or hyperparameters), or how much experimentation was involved. In addition, I did not find any details on how they actually extracted model judgements (i.e., discrete classes) from the text generated by the LMs, or how they checked that these extractions were faithful to the models’ generated output. I also did not see an explanation of what was done in cases where an acceptable answer could not be extracted from the output, or how common such cases were. I would guess this was not much of a problem for the largest models, but I would expect it might be for some of the smaller ones. In any case, including these procedural details is important, both for replicability and to help assess validity.
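To make this concrete, here is a minimal sketch of the kind of extraction-and-logging step I have in mind (this is not the authors’ procedure; the label set and example outputs are made up purely for illustration):

```python
import re

# Hypothetical label set; the real categories would come from the authors' codebook.
LABELS = ["grievance", "policy", "identity", "other"]

def extract_label(generated_text: str):
    """Map an LM's free-text output to one of the discrete classes, or None on failure."""
    text = generated_text.strip().lower()
    if text in LABELS:  # the model answered with exactly one label
        return text
    # Otherwise, accept the output only if it mentions exactly one label.
    mentioned = [label for label in LABELS if re.search(rf"\b{label}\b", text)]
    return mentioned[0] if len(mentioned) == 1 else None

outputs = ["Policy", "This is clearly about identity.", "Hard to say."]
labels = [extract_label(o) for o in outputs]
unparseable = sum(label is None for label in labels) / len(labels)
print(labels, f"unparseable rate: {unparseable:.0%}")
```

Reporting the unparseable rate, and what was done with those cases (dropped, re-prompted, assigned a default class), would go a long way toward both replicability and assessing validity.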
The third issue has to do with annotator agreement and model performance. The authors find that both models and annotators do worse on one part of the data, but I did not notice any in-depth explanation of why this is the case. This seems like a clear threat to validity, especially if there were differences in who annotated that data. Were these examples more challenging because the annotators (and models?) were less familiar with the context? Are the categories that the authors are working with perhaps less applicable to that subset of the data? Or is that data simply different in some other way?
Fourth, a well-known downside of LMs is that we don’t really know what data they were trained on (and even when we do, the scale is essentially too vast to interrogate easily). In this case, I noticed that part of the dataset used was posted online some years ago. This means that it was easily scrapable for language model training, and could potentially have been included in the pretraining data for a model like GPT-4. This is not necessarily a problem for this paper, but it does mean that this data is no longer useful for a perfectly clean evaluation, and we might expect performance on genuinely unseen data to be somewhat worse than what is reported here, which should in turn lower our confidence in how valid the model’s classifications are.
Fifth, if I understand correctly, the authors find that even the best model they consider is less accurate than a single human annotator. By contrast, combining annotations from multiple annotators provides the gold standard, and is thus by definition better than relying on a single annotator. Given that the authors rely on the model classifications for all of their analyses, this seems like a potential problem. Would we trust an analysis like this if it were based on the work of a single annotator, even if that annotator had moderate to high agreement with a benchmark? I think we would be skeptical of that (relative to relying on a group of annotators), and thus should perhaps be skeptical of an analysis based on only a model’s predictions.
The last question is how to handle the imperfections of the model. As the authors show in their analysis, the errors that are being made are far from random. Rather, one of the best models used appears to systematically misclassify one class as another. As such, the analyses that simply aggregate the model predictions are inherently biased: we know that some proportion of those predictions will be systematically wrong. That being said, how to handle this is somewhat of an open question. One option might be to apply one of the bias correction methods that have recently been published, such as Design-based Supervised Learning, although the authors would need to think about whether and how those techniques apply here.
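To illustrate the general idea (this is a rough sketch in the spirit of design-based corrections, not the exact DSL estimator, and all of the data here is simulated): a random, expert-annotated subsample can be used to estimate the model’s systematic error and subtract it from the naive corpus-level estimate.

```python
import numpy as np

rng = np.random.default_rng(0)

# Simulated setup: binary LM predictions over the full corpus, plus gold
# annotations for a simple random subsample of documents.
N = 10_000
lm_pred = rng.binomial(1, 0.40, size=N).astype(float)

labeled_idx = rng.choice(N, size=500, replace=False)
gold = lm_pred[labeled_idx].copy()
overcalls = (gold == 1) & (rng.random(len(gold)) < 0.25)  # model systematically over-predicts the class
gold[overcalls] = 0.0

naive = lm_pred.mean()  # treats the LM output as if it were ground truth

# Because the labeled subsample is a random draw, the average residual on it is
# an unbiased estimate of the model's systematic error over the whole corpus.
residual = gold - lm_pred[labeled_idx]
corrected = naive + residual.mean()
se = residual.std(ddof=1) / np.sqrt(len(residual))  # simplified; the full estimator's variance has more terms

print(f"naive: {naive:.3f}   corrected: {corrected:.3f} +/- {1.96 * se:.3f}")
```

Whether a correction like this applies depends on how the annotated data was sampled relative to the full corpus, which is exactly the kind of thing the authors would need to think through.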
Notably, for at least some of the comparisons, it seems like the authors might be able to get better insight into their data by using only the annotations. For example, if the authors have many dozens of annotations per subset, that should be more than enough to get a reasonable estimate of the average for each one, complete with proper error bars. Even if these error bars are relatively wide, this would likely be more accurate, and certainly more faithful to the available information, than using the model predictions as if they were perfectly correct.
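As a simple illustration (again with simulated numbers, assuming roughly sixty binary annotations per subset), the annotation-only estimates and their error bars are straightforward to compute:

```python
import numpy as np

rng = np.random.default_rng(1)

# Simulated: ~60 binary annotations for each of two subsets of the corpus.
annotations = {
    "subset_A": rng.binomial(1, 0.35, size=60),
    "subset_B": rng.binomial(1, 0.55, size=60),
}

for name, y in annotations.items():
    mean = y.mean()
    se = y.std(ddof=1) / np.sqrt(len(y))  # standard error of the mean
    print(f"{name}: {mean:.2f} +/- {1.96 * se:.2f} (approximate 95% CI)")
```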
Overall, this paper is an impressive amount of work, a decent test of LMs on a less familiar task, and a substantive in-depth study, all of which are very positive. I believe more work is still required, however, to produce a final interpretation that is convincingly valid, and to also make it an exemplar of the rigorous use of LMs in social science.
…
Back to my commentary: Reading back through a review often leaves one with the feeling of perhaps having been too harsh, but it is worth keeping in mind that reviews are typically written in a compressed amount of time, and sometimes involve a certain amount of reactivity to small frustrations in the work being reviewed. In this case I believe my review was fair, and I tried my best to uphold the standards of the journal, but I would nevertheless love to see the paper under review published somewhere, especially after incorporating some of the revisions I suggest.
Interestingly, we are seeing different systems respond in different ways. ArXiv.org recently decided to prohibit what they call “position papers” and review articles, unless those papers have already been accepted for publication at a peer-reviewed conference or journal. SocarXiv, meanwhile, has placed a temporary moratorium on papers about AI, that is “papers about AI models, testing AI models, proposing AI models, theories about the future of AI and so on.” ↩︎
The whole question of whether reviews should be anonymous or not is a tricky one. Personally committing to making one’s reviews public would be an interesting option, but would definitely come with some costs. If you are among the authors of this paper and reading this post, feel free to comment or get in touch. ↩︎