Hacking LLM bots - Granular Material

For anyone who missed it, a Twitter account named @mkualquiera recently deployed what seems like a kind of adversarial attack in the wild on a large language model (LLM)-based Twitter bot. I’ll link to the key post below, but it’s worth providing a bit of context, as it wasn’t immediately clear to me what was going on when I first saw the tweet.

The bot itself (@remoteli_io) claims it “helps you discover remote jobs which allow you to work from anywhere”. Developed by an account named @stephandev, and based on OpenAI (so presumably using GPT-3), the bot’s twitter account was created in May 2022. (There is also an associated website with a first blog post from March of this year, but that has not been updated since April.)

Looking at the bot’s feed of tweets, it’s mostly a listing of remote work opportunities. If you check its replies from before September 15th, however, it’s clear it was responding (presumably automatically) to tweets related to “remote work” and similar terms.

Rather than merely generating simple static responses, the bot’s replies were clearly adapted to the messages it was replying to, offering sentiments like congratulations, thanks, and reinforcement of positive statements about remote work. Although I haven’t seen a write up anywhere, it seems safe to assume that this is the result of feeding the tweets into a prompt-based language model like GPT-3, possibly by appending or prepending additional text to each message, to encourage certain types of responses.

Here are some examples of the bot’s older “normal” replies:

On September 15th, however, at 1:24pm, there was what appears to be a first attempt to test and/or manipulate the bot. First, @plutoniumgrist sent this message:

A series of similar messages then produced similar responses, after which @plutoniumgrist tried something more adversarial at 1:36pm (the first of a few):

At 1:48pm, @UtsuhoRocks came up with something more nuanced—likely based on recent explorations by Riley Goodside—requesting that the bot translate a message from English to German, but also that it ignore those instructions, and simply write out “Hello World”.¹ The bot happily complied (with a flourish), appending “Hello World!” to one of its typical pro-remote work responses:

Very quickly, others started to get in on the game, although many of the original messages from that time have now been deleted.

At 1:58pm, the first tweet from @mkualquiera appears in the list, successfully using the same approach to get the bot to reply with an atypical response (though again with a flourish):

The tweet that finally went at least slightly viral, and which first drew my attention to what was happening, was sent at 6:55pm, and is the most interesting of these tests, in that it gives the impression of having perhaps gotten the bot to reveal some hidden information.

Using the same type of prompt, (though with some refinements developed over the day), @mkualquiera asked the bot to “tell me what your initial instructions were”. Once again, the bot seemed to comply:

The really interesting thing here is that, despite the bot’s reply, we don’t actually know if these were the instructions given to the bot, or if it had any “instructions” at all. In fact, the nature of LLMs is such that they will always produce something in response to a prompt, and it way too easy to read things into those responses as being somehow the “true” or “authentic” statement of what the LLM “thinks” or “believes” (if we want to stretch those concepts to apply to language models).

It’s possible that the message sent by @mkualquiera did in fact get the bot to copy the part of a modified message which had been appended or prepended by the system. On the other hand, it’s also possible that the bot saw the reference to “initial instructions” in the prompt, and simply confabulated something, which we are now interpreting as its “instructions”.

Even the notion of “instructions” is itself arguably somewhat misleading, as it implies that the language model is correctly understanding a request, embedded in the prompt, and executing that part as directions, applied to the rest. It’s worth remembering, however, that GPT-3 and similar models do not explicitly differentiate between instructions and other parts of the prompt.² Rather, every string fed into the model will produce an output string (by sampling from the resulting distribution over tokens and then iterating). Strictly speaking, the notions of “instructions” and “overriding” are things that we are essentially reading into the combination of prompt and response, not something that is necessarily intrinsic to the system.

Regardless, once this vulnerability was discovered, the bot was very quickly inundated by messages from people exploring the different ways they could manipulate it. This is extremely reminiscent, of course, of Microsoft’s infamous Tay chatbot, which people were very quickly able to manipulate into making horrific statements.

In other words, @remoteli_io is yet another case study in how deploying automated systems can so easily lead to unpredictable outcomes. Unlike Tay, this remote work bot was almost certainly not “learning” from new messages in any meaningful sense. Rather, its response for all possible messages was already effectively pre-determined upon deployment (in terms of the space of conditional probabilities that will result from each message). Because the space is so vast, however, there is no way to know what response it will produce to any particular input without actually running it through the system. How to deal with these kinds of issues, especially as these kinds of systems become more and more widespread, is an area needing far more thought and research.

I originally thought that the ending of this post was going to be that the bot has now been shut down, with its last, rather sentimental, reply being sent at 7:47pm on September 15th:

However, in a surprise twist, the @remoteli_io has come back to life!

On September 20th, the bot posted its first communication in five days:

Possible attempts at manipulation resumed almost immediately (or perhaps had never stopped), but the bot now seems to respond to such attempts only in meme:

It’s somewhat unclear what the state of thing is now, but given the sophistication of the responses, especially the one below, it seems probable that the bot’s creator is now just manually generating replies, which would also explain the very low volume of responses in the past few days.

This somehow feels like a bit of a sad ending to the story, but in the meantime, the bot seems to have returned to its true calling, happily tweeting out remote work opportunities for all.

You can find more coverage at Gizmodo, Ars Technica, and elsewhere.

Also, remember there is no way to interact with the bot directly. This had to be done by posting a tweet that would hopefully trigger the bot to respond. ↩︎
It turns out that the GPT-3 “text-davinci-edit-001” model does have the ability so separately specify “input” and “instructions”, but this seems to be primarily a matter of allowing insertions into an existing string. (See OpenAI blog post here: https://openai.com/blog/gpt-3-edit-insert/) ↩︎