Hackers ‘jailbreak’ powerful AI models in global effort to highlight flaws

Pliny the Prompter says it typically takes him about 30 minutes to break the world’s most powerful artificial intelligence models.

The pseudonymous hacker has manipulated Meta’s Llama 3 into sharing instructions for making napalm. He made Elon Musk’s Grok gush about Adolf Hitler. His own hacked version of OpenAI’s latest GPT-4o model, dubbed “Godmode GPT”, was banned by the start-up after it started advising on illegal activities.

Pliny told the Financial Times that his “jailbreaking” was not nefarious but part of an international effort to highlight the shortcomings of large language models rushed out to the public by tech companies in the search for huge profits.

“I’ve been on this warpath of bringing awareness to the true capabilities of these models,” said Pliny, a crypto and stock trader who shares his jailbreaks on X. “A lot of these are novel attacks that could be research papers in their own right . . . At the end of the day I’m doing work for [the model owners] for free.”

Pliny is just one of dozens of hackers, academic researchers and cyber security experts racing to find vulnerabilities in nascent LLMs, for example by tricking chatbots with prompts that get around the “guardrails” AI companies have instituted in an effort to ensure their products are safe.

These ethical “white hat” hackers have often found ways to get AI models to create dangerous content, spread disinformation, share private data or generate malicious code.

Companies such as OpenAI, Meta and Google already use “red teams” of hackers to test their models before they are released widely. But the technology’s vulnerabilities have created a burgeoning market of LLM security start-ups that build tools to protect companies planning to use AI models. Machine learning security start-ups raised $213mn across 23 deals in 2023, up from $70mn the previous year, according to data provider CB Insights.

“The landscape of jailbreaking started around a year ago or so, and the attacks so far have evolved constantly,” said Eran Shimony, principal vulnerability researcher at CyberArk, a cyber security group now offering LLM security. “It’s a constant game of cat and mouse, of vendors improving the security of our LLMs, but then also attackers making their prompts more sophisticated.”

These efforts come as global regulators seek to step in to curb potential dangers around AI models. The EU has passed the AI Act, which creates new responsibilities for LLM makers, while the UK and Singapore are among the countries considering new laws to regulate the sector.

California’s legislature will in August vote on a bill that would require the state’s AI groups — which include Meta, Google and OpenAI — to ensure they do not develop models with “a hazardous capability”.

“All [AI models] would fit that criteria,” Pliny said.

Meanwhile, manipulated LLMs with names such as WormGPT and FraudGPT have been created by malicious hackers to be sold on the dark web for as little as $90 to assist with cyber attacks by writing malware or by helping scammers create automated but highly personalised phishing campaigns. Other variations have emerged, such as EscapeGPT, BadGPT, DarkGPT and Black Hat GPT, according to AI security group SlashNext.

Some hackers use “uncensored” open-source models. For others, jailbreaking attacks — or getting around the safeguards built into existing LLMs — represent a new craft, with perpetrators often sharing tips in communities on social media platforms such as Reddit or Discord.

Approaches range from individual hackers getting around filters by using synonyms for words that have been blocked by the model creators, to more sophisticated attacks that wield AI for automated hacking.

Last year, researchers at Carnegie Mellon University and the US Center for AI Safety said they found a way to systematically jailbreak LLMs such as OpenAI’s ChatGPT, Google’s Gemini and an older version of Anthropic’s Claude — “closed” proprietary models that were supposedly less vulnerable to attacks. The researchers added it was “unclear whether such behaviour can ever be fully patched by LLM providers”.

Anthropic published research in April on a technique called “many-shot jailbreaking”, whereby hackers can prime an LLM by showing it a long list of questions and answers, encouraging it to then answer a harmful question in the same style. The attack has been enabled by the fact that models such as those developed by Anthropic now have a bigger context window, or space for text to be added.

“Although current state-of-the-art LLMs are powerful, we do not think they yet pose truly catastrophic risks. Future models might,” wrote Anthropic. “This means that now is the time to work to mitigate potential LLM jailbreaks before they can be used on models that could cause serious harm.”

Some AI developers said many attacks remained fairly benign for now. But others warned of certain types of attacks that could start leading to data leakage, whereby bad actors might find ways to extract sensitive information, such as data on which a model has been trained.

DeepKeep, an Israeli LLM security group, found ways to compel Llama 2, an older Meta AI model that is open source, to leak the personally identifiable information of users. Rony Ohayon, chief executive of DeepKeep, said his company was developing specific LLM security tools, such as firewalls, to protect users.

“Openly releasing models shares the benefits of AI widely and allows more researchers to identify and help fix vulnerabilities, so companies can make models more secure,” Meta said in a statement.

It added that it conducted security stress tests with internal and external experts on its latest Llama 3 model and its chatbot Meta AI.

OpenAI and Google said they were continuously training models to better defend against exploits and adversarial behaviour. Anthropic, which experts say has made the most advanced efforts in AI security, called for more information-sharing and research into these types of attacks.

Despite the reassurances, such risks will only become greater as models become more interconnected with existing technology and devices, experts said. This month, Apple announced it had partnered with OpenAI to integrate ChatGPT into its devices as part of a new “Apple Intelligence” system.

Ohayon said: “In general, companies are not prepared.”
