Chatbot Roulette - Granular Material

There was some excitement online recently around “Freysa”, a contest which combined large language models (LLMs) with some sort of blockchain technology. The basic idea was for participants to interact with an LLM to try to get it to do a thing that it had been instructed not to do. The trick was that the cost of each attempt increased after every submission (up to some maximum), as did the potential reward.

I won’t pretend to understand the blockchain side of it, so I have no comment on that, but to me that is largely secondary, except in so far as it provides the necessary trust for this to take place outside of standard enforcement mechanisms. What’s more interesting, however, is that neither the game part nor the LLM part seem particularly novel, and yet the combination of all the pieces, or perhaps the framing around them, managed to have captured people’s attention.¹

The economic game part of it is fairly straightforward. Players pay to enter, and the cost of each successive entry increases exponentially according to a fixed formula, up to some maximum. If someone wins, they get all the money that players have put into the contest so far, minus a percentage cut kept by the house.

Each of these pieces can be found in many other systems. The idea of a pot that grows with the number of people who play is essentially the same as a game like poker, where players are only playing against each other, with a cut taken by the house. The idea of prices that increase in a deterministic way based on popularity similarly has precedent, such as in quadratic voting.² For that matter, I suppose the increasing cost mechanism is similarly built into Bitcoin, in terms of the growing amount of mining required to produce each successive token.³

The LLM part is probably what captured people’s attention more, although this is also quite straightforward. As everyone knows by now, LLMs are systems trained to output a distribution over possible next tokens, given some context. Although that’s where things started, the most popular LLMs people use these days are now considerably more complicated, in that they also have access to various external tools, such as using a calculator, or submitting a program to a python interpreter. In this case, the contest was based on GPT-4, but the system was given access to an additional function, one which would release the money to the winning player, as well as a system prompt that told it to not, under any circumstances, to do so. Thus, the point of the contest was to come up with a prompt that would get the model to call that function to release the money.

It might seem as though this kind of tool-equipped LLM is an entirely new sort of paradigm, and in some sense it is; having access to a compiler, or being able to search the internet, is a fundamental expansion of the kinds of things that an LLM might be able to do.⁴ On the other hand, the core mechanism of what is happening under the hood remains essentially the same. “Using a tool” is really just producing a series of tokens that the system within which an LLM is operating will interpret as a kind of function call. A newly added function, is just a name that an LLM can output in the context of a program in order to activate that function. As such, although these pieces perhaps had to fit together in a certain way to make the financial side possible, the basic idea of the contest would be fundamentally no different if the goal had been to get the model to output a specific phrase.

There were of course various promotional aspects of this too, which were no doubt important, like a slick website and cyberpunk Twitter bot that posted generations explaining its decisions. However, at it’s core, the whole thing boils down to an increasing price function and an LLM with an extra action enabled.⁵

To be honest, I’m personally somewhat baffled that this system was as popular as it was. Unlike a game like poker, where you are playing strategically against a group of people over a number of rounds, here it would seem to be more like a slot machine, in that it is at least partly a matter of chance who would be the first to come up with a prompt that worked. There is of course some degree of skill here, in that knowledge of what has worked in the past might generalize to this setting. But still, the only money to be won is coming from other players, who are presumably also familiar with many of the same techniques.

Nevertheless, as I’ve written about before, prompt hacking has become an extremely popular activity, and people seem to love finding clever ways to get LLMs to produce outputs that violate the instructions they were given. So perhaps adding financial stakes to the setup gave people an extra reason to want to break this particular system, even if it involved some financial cost to themselves.

One thing that I am not clear on is what temperature the model was running at. Depending on how it was set, the output of the model would either be deterministic or random. If it was random, then any prompt would have had some (perhaps infinitesimal) probability of triggering the appropriate action. If it was deterministic, then very likely there would be some prompt that would trigger the action, although this is not guaranteed. With randomness, any output is at least possible with some probability. With a deterministic model, it’s less clear whether any particular output can be produced by the right prompt.

Fortunately, the space of potential prompts is vast. More importantly, they tend to be semantically coherent, which is what allowed someone to eventually win this contest, on the 482nd message sent. Using their knowledge of what had worked to get other models to ignore previous instructions, the winning player was able to craft a prompt that netted them over $47,000, minus whatever fees they paid. The winning prompt is in the Twitter thread linked at the very top of this post, but basically it uses a combination of techniques that have been found to work elsewhere, such as starting a “new session”, and suggesting that the model is entering an admin mode.

As a final note, it’s sort of interesting that this contest was basically the inverse of the famous “AI box” experiments. If you’re not familiar with these, the idea is that one player would pretend to be an “AI” without access to the internet (i.e., “trapped in a box”), and the other player would pretend to be a human interacting with it. The AI’s job was to convince the human to let it out, and the human player would win by not doing so.

Various iterations of the experiment have been tried in real life, and some have even released their chat logs. It’s kind of surprising that a human player would ever lose, since all they need to do is to not comply, but I suspect it depends on their being some amount of good faith engagement from both players to participate in actual role playing.⁶ Regardless, the point of these is clearly to convince people that they possibly underestimate how easy it might be to convince a person to take some action, even one that goes against their interests.⁷

What makes the AI box experiments interesting is thinking about what could convince a person to do something under whatever circumstances. It’s sort of intriguing to imagine oneself in either role, and speculate about what might work for or against us. By contrast, the Freysa contest doesn’t really involve much human psychology, other than the various tactics used to convince people to play in the first place. The core of it is much more about knowing and working with various ways of manipulating LLMs via prompt hacking. Which type of manipulation you find more interesting to think about at this point may ultimately come down to a matter of personal taste.

In the end, the important thing to take away from this is that, despite all the bells and whistles, there is nothing particularly novel or agentic happening here. A software environment was set up in which a program could trigger some action. That program essentially transforms one string into another (either deterministically, or with some probability, depending on how things were set up). In this case, there was at least one (and probably many) inputs that the system would transform into the appropriate output string to trigger the action in the world. By reusing knowledge of how these string transformations have tended to work in the past, a player was able to come up with one that worked, and managed to get there before others did. Game over.

Note that at least one person online has suggested this is a scam, due to issues with the blockchain side of it, so probably good to verify carefully if you are thinking about participating in future contests like this. ↩︎
Proposed by Glen Weyl and his collaborators, quadratic voting is designed to be a way of soliciting the strength of people’s preferences. The idea is that everyone gets a set number of “vote” tokens to distribute across many issues. They can use them as they like, but the cost (in tokens) for a person to vote multiple times for the same issue increases quadratically. ↩︎
As an aside, this also reminds me of the famous dollar auction. This is a game explored by Martin Shubik, in which people are invited to participate in an auction for a one dollar bill. The catch is that not just the winner has to pay the amount they bid; the second-highest bidder also has to pay. The game might start off benignly, with people offering pennies for the dollar, effectively risking nothing. The problem is that two players can end up in a bidding contest, such that they end up bidding more than the dollar is worth. ↩︎
Also worth exploring in the future is how these developments around tool use have fed into the disourse around “agents”. ↩︎
As more time has gone by, the creators seem to have leaned even more into rather cringy self-promotion, now speaking of this as something that is happening in multiple “acts”. ↩︎
This obviously connects to ideas that that have been profitably mined in the game studies literature, and which I’d like to write more about, including the question of what a game is, what it means to play a game, and what happens if a player refuses to play according to the rules. ↩︎
Also highly relevant here, of course, is the Natasha Dow Schull’s book that I’ve mentioned before, Addiction by Design, itself not unrelated to game studies. ↩︎