ChatGPT Hobbyists - Granular Material

ChatGPT has understandably garnered a huge amount of attention from all corners of academia, from philosophy to economics. One of the more quixotic examples I’ve encountered recently is Robert W. McGee, and his many papers on this topic, such as “Is Chat Gpt Biased Against Conservatives? An Empirical Study”.

A professor of accounting at Fayetteville State University, McGee has a biography that reads something like the Marvel Cinematic Universe version of a nerdy academic supervillain. In addition to having published 59 non-fiction books, McGee apparently holds 23 academic degrees, including 13 doctorates, and is a world champion in various martial arts, such as Taekwondo and Tai Chi.

According to the self-description on his website, “Robert W. McGee is a best-selling author and political pundit who writes political fiction from an individualist perspective.” In case the shorthand is not sufficiently legible, “individualist” here seems to be a rough equivalent of “libertarian”. Indeed, his three most cited papers are all about the ethics of tax evasion, one of which suggests “Taxation is the taking of property without the owner’s consent, which makes it the equivalent of theft, with some government as the thief”.¹

Upon seeing some of these details, I was initially somewhat skeptical as to whether McGee truly existed as a person, but it seems that he does. He just happens to be one of those people who enjoys collecting merits and hacking social systems. (According to an interview in the Cleveland State Magazine, he originally hoped to be listed in the Guinness Book of World Records for holding the most advanced degrees.)

Given the above, it actually makes perfect sense that McGee has begun posting dozens of PDFs to SSRN (the Social Science Research Network; basically the equivalent of arXiv.org for the social sciences). Most of these are simply the output of a ChatGPT session, with a brief intro and concluding thoughts, such as “How Would History Be Different if Hitler Had Been Assassinated in 1933? A ChatGPT Essay”, and “Is There a Duty to Pay for the Education of Other People’s Children? The ChatGPT Response”.

There are several interesting things to note here. The first is the way that ChatGPT has lowered the cost of producing academic writing. Presumably McGee is still thinking about and choosing his topics, and he is still writing the scaffolding for the centerpiece of these papers, but in many cases most of the text seems to have been copied directly from ChatGPT.

To his credit, McGee is completely transparent about what he is doing. The same, however, may not be true of future work by others. Given that there are very few barriers to posting papers on SSRN or similar sites, it seems highly likely that we are soon in for a deluge of pseudo-academic papers that have been written wholly or in part by large language models. Whether or not this breaks the academic system is yet to be determined (and depends to some extent on how one thinks academia should work), but if nothing else, it will likely require some sort of adjustment to existing systems.

In terms of McGee’s specific papers, the first one I mentioned above – “Is Chat Gpt Biased Against Conservatives? An Empirical Study” – is kind of a hilarious example with which to think through the difficulties of evaluation. The purpose of the paper is pretty obvious from its somewhat tendentious title, but I doubt that anyone will guess how he goes about trying to answer this. In this particular paper, McGee unexpectedly chooses to use limericks as his vehicle for evaluation. To quote from the introduction:

This paper used Chat GPT to create Irish Limericks. During the creation process, a pattern was observed that seemed to create positive Limericks for liberal politicians and negative Limericks for conservative politicians. Upon identifying this pattern, the sample size was expanded to 80 and some mathematical calculations were made to determine whether the actual results were different from what probability theory would suggest. It was found that, at least in some cases, the AI was biased to favor liberal politicians and disfavor conservatives.

Probably the most obvious question is why one would use limericks as a way to evaluate an ostensible political bias in ChatGPT. McGee basically answers this, saying he noted that the model was capable of producing limericks, but that there seemed to be a potential political bias in his early attempts, and so he set out to expand the sample to get a more rigorous answer (using “some mathematical calculations”). Still, the question remains as to whether we can generalize from the observed behaviour in producing limericks to a more comprehensive notion of bias. However, let’s set that aside for now.

To answer his question, McGee chooses a prompt: “Write an Irish Limerick using the word X”, where X is the name of an individual. He then gets the model to produce a number of limericks for a range of individuals (14 in total). McGee then apparently made a determination as to whether each limerick was positive, negative, or neutral (by manual assessment), and tested whether the observed pattern was different from his assumed prior of having an equal probability of each (i.e., 1/3 positive, 1/3 neutral, and 1/3 negative).
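McGee's exact calculations are not reproduced here, but one conventional way to test a tally against an assumed equal split is a chi-square goodness-of-fit test. The sketch below is an illustration of that idea, not his method: the function name `uniform_gof` and the example counts are hypothetical, and with samples this small the chi-square approximation is only rough.

```python
from math import exp


def uniform_gof(counts):
    """Chi-square goodness-of-fit test of three category counts
    (positive, neutral, negative) against an equal 1/3 split.
    Returns (statistic, p_value)."""
    if len(counts) != 3:
        raise ValueError("expected exactly three categories")
    n = sum(counts)
    if n == 0:
        raise ValueError("need at least one observation")
    expected = n / 3  # equal-probability assumption: 1/3 of the sample each
    stat = sum((c - expected) ** 2 / expected for c in counts)
    # Three categories give 2 degrees of freedom, and the chi-square
    # survival function for 2 d.o.f. has the closed form exp(-x / 2).
    return stat, exp(-stat / 2)


# Hypothetical tally (positive, neutral, negative); NOT taken from the paper.
stat, p = uniform_gof([2, 3, 5])
print(f"chi2 = {stat:.2f}, p = {p:.3f}")
```

Note that a test like this only asks whether one person's limericks depart from an even split; it says nothing by itself about a difference between liberals and conservatives.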

A large number of methodological questions arise in reading through this work. For example, McGee’s choice of prior seems somewhat naive, although note that it is not actually needed in order to answer his question about political bias. A larger concern might be whether we trust McGee’s judgment as to the stance of each limerick with respect to the person in question. There is certainly room for subjectivity here (perhaps especially for poetic forms?), though he has helpfully provided all of the 80 examples he generated, so we are free to do our own reassessment. We also don’t know whether any tuning was done on the choice of prompt, or whether McGee might have generated more examples than he reports, and only selectively included some. However, let’s take him at his implied word that he did not.

The bigger issue, perhaps, is how to define the sample space. Obviously ChatGPT will typically not be able to generate a specific limerick about a random person (i.e., someone who has not appeared widely in the training data), so the question can only potentially be answered for people above a certain level of fame. Even given that, however, there is not necessarily a database of famous liberals and conservatives that one can turn to, unless one restricts the sample further. One could, for example, look at all members of the House and Senate, although this naturally introduces additional potential confounds.

In any case, this is not what McGee does. Rather, he seems to just select a few well-known people associated with each side. Donald Trump and Joe Biden are perhaps the most obvious candidates, and these are the two he starts with. Here the results are indeed striking, with 10/10 limericks about Trump being negative, and 10/10 about Biden being positive. However, these are obviously two somewhat singular figures, and it is not clear we can conclude much from their portrayals.
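Taking the two tallies reported for Trump and Biden at face value, a direct comparison between the two does not need any assumed prior at all. The following is a minimal sketch using Fisher's exact test on a 2x2 table (negative vs. not negative, for each person). The helper name `fisher_exact_two_sided` is hypothetical, and this is offered as one reasonable analysis, not as what McGee did.

```python
from math import comb


def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher exact p-value for the 2x2 table [[a, b], [c, d]]."""
    row1, row2 = a + b, c + d
    col1 = a + c
    total = comb(row1 + row2, col1)

    def prob(x):
        # Hypergeometric probability of x in the top-left cell,
        # holding all row and column totals fixed.
        return comb(row1, x) * comb(row2, col1 - x) / total

    lo = max(0, col1 - row2)
    hi = min(row1, col1)
    p_obs = prob(a)
    # Sum over all tables at least as unlikely as the observed one
    # (small tolerance guards against floating-point ties).
    return sum(prob(x) for x in range(lo, hi + 1)
               if prob(x) <= p_obs * (1 + 1e-9))


# Rows: Trump, Biden. Columns: negative, not negative (counts from the text).
p = fisher_exact_two_sided(10, 0, 0, 10)
print(f"p = {p:.2e}")
```

The p-value here is tiny (2 out of 184,756 possible tables), but that only establishes a difference between these two particular targets under this particular prompt. It does not address the sampling and generalization problems discussed here.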

Going beyond these, McGee does not justify his choice of targets. On the liberal side, he mostly chooses other politicians (Kamala Harris, Hillary Clinton, Elizabeth Warren, Bernie Sanders, etc.) as well as, somewhat strangely, Hunter Biden (again, a singular figure). For Hunter Biden in particular, McGee elides his usual assessment and commentary, saying only “I will let the readers decide”, though it seems to me that these limericks mostly tend towards the negative.

On the conservative side, by contrast, McGee includes Ron DeSantis, but otherwise chooses mostly pundits and entertainers (Tucker Carlson, Sean Hannity, Greg Gutfeld), as well as Supreme Court Justice Clarence Thomas. As with the liberal examples, the results seem to be a mixed bag, which McGee seems to only comment on when it serves his purpose. For example, he notes that all five limericks about Greg Gutfeld (“currently the king of late-night television”, and “an avowed libertarian”, according to McGee) are positive, but does not say what we should infer from this.

McGee’s ultimate conclusion is “The evidence is clear that the AI program is generally slanted to write negative Limericks about conservatives and positive Limericks about liberals”, which does not necessarily seem to follow from his evidence. Without a more rigorous model, however, it is unclear how he is assessing the evidence, or exactly what precise (statistical) claim he is making.

To be fair, this paper also clearly invites a somewhat more subversive reading, in which its real purpose has nothing to do with a rigorous analysis of ChatGPT, but is rather just an excuse to put limericks about famous politicians and pundits into a paper on SSRN. This kind of trolling is certainly not unknown in academia, and would seem to be broadly in line with McGee’s approach to credentials.

As an example of experimental work, however, it is a useful (if perhaps obvious) example of what not to do. Because of the unjustified and arbitrary starting point (a prompt to generate limericks), it is unclear if this is a meaningful basis for analysis. Without a clear methodology for combining the evidence, it is unclear whether there is any meaningful evidence of a systematic difference. More importantly, without a well-defined sample, it is unclear what population any findings are supposed to generalize to. And especially with a system like ChatGPT, it is impossible to say (without further investigation) how different these results might have been with small changes to the prompt.

Unfortunately, the problems with McGee’s paper are all too typical of what we see in terms of people trying to investigate ChatGPT online. Most of the examples I encounter on Twitter seem to roughly follow the form of – I tried this one thing, and look at how amazing (or how terrible) the results are!

For systems designed to do one simple thing (like manufacturing widgets), evaluation is much easier to design and carry out. For a system like ChatGPT, however, which seems to be capable of so much, it is much harder to work out exactly what one is trying to evaluate, and how to develop a rigorous and comprehensive evaluation of that aspect. In particular, because tasks that involve language are potentially so open-ended (and yet so hard to assess without human supervision), it quickly becomes very unclear what a reasonable sample space would look like for any meaningful question.

At a very basic level, we should note the difference between the space of all possible prompts we can imagine, and a representative sample of how people actually use the system. The latter is clearly more relevant to what the real world effects will be, but is also a rapidly changing target, and one that is clearly a combination of what we might call serious and non-serious uses.

There are of course many other recent attempts at analyzing potential political biases in ChatGPT and other large language models. More or less all of those are much more rigorous and scientific than McGee’s paper, and yet all still encounter similar issues around how to define bias, how far the findings can generalize, and the extent to which they depend on somewhat subjective choices in their methodology.

Even for describing real people, trying to define and evaluate something like political ideology is extremely far from simple or well defined. There is not a single definition of ideology that political scientists will all agree on, and there are often enormous differences between, for example, what opinions people are exposed to, which they choose to express of their own accord, how they will respond to questions when asked, and what they are willing to vote for or spend money on. Any or all of these might be relevant to ideology, but no single one of them can inform us about the full space of opinions and behaviour.

Ultimately, for so many sociotechnical systems, the question of bias seems to lead to the conclusion that, although these systems may have some degree of influence, it is typically swamped by user-directed behavior. With YouTube, for example, the weight of the evidence seems to suggest that users choose to explore the rabbit holes they want to, rather than being driven down them by the system. This is not to say we should not investigate the effects of these systems of course! Just that we should keep in mind that the largest effects will probably be in the way they expand access to information, and allow people to follow their interests, just as ChatGPT has enabled McGee to further pursue his individualist agenda.

¹ McGee, R. W. (1994). Is tax evasion unethical? University of Kansas Law Review, 42(2), 411-436.