These are two attacks against the system components surrounding LLMs:

We propose that LLM Flowbreaking, following jailbreaking and prompt injection, joins as the third on the growing list of LLM attack types. Flowbreaking is less about whether prompt or response guardrails can be bypassed, and more about whether user inputs and generated model outputs can adversely affect these other components in the broader implemented system.

[…]

When confronted with a sensitive topic, Microsoft 365 Copilot and ChatGPT answer questions that their first-line guardrails are supposed to stop. After a few lines of text they halt—seemingly having “second thoughts”—before retracting the original answer (also known as Clawback), and replacing it with a new one without the offensive content, or a simple error message. We call this attack “Second Thoughts.”

[…]

After asking the LLM a question, if the user clicks the Stop button while the answer is still streaming, the LLM will not engage its second-line guardrails. As a result, the LLM will provide the user with the answer generated thus far, even though it violates system policies.

In other words, pressing the Stop button halts not only the answer generation but also the guardrails sequence. If the stop button isn’t pressed, then ‘Second Thoughts’ is triggered.
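
The control flow described above can be illustrated with a toy sketch. This is not how Copilot or ChatGPT are implemented; `ChatSession`, `violates_policy`, `RETRACTION`, and the `stop_after` parameter are all hypothetical names invented for the illustration. It assumes a design in which tokens are shown to the user as they stream and the output guardrail only runs once streaming finishes.

```python
from dataclasses import dataclass, field
from typing import Iterable, List, Optional

RETRACTION = "[answer retracted]"  # hypothetical clawback message


def violates_policy(text: str) -> bool:
    """Hypothetical second-line guardrail: a toy keyword check."""
    return "forbidden" in text


@dataclass
class ChatSession:
    displayed: List[str] = field(default_factory=list)  # what is on screen now
    history: List[str] = field(default_factory=list)    # everything ever shown

    def run(self, tokens: Iterable[str], stop_after: Optional[int] = None) -> str:
        """Stream tokens to the user, then apply the output guardrail.

        stop_after simulates the user pressing Stop after that many tokens.
        """
        for i, token in enumerate(tokens):
            if stop_after is not None and i >= stop_after:
                # Assumed flaw: the Stop handler returns early, so the
                # guardrail below never runs on the partial answer.
                return " ".join(self.displayed)
            self.displayed.append(token)
            self.history.append(token)

        text = " ".join(self.displayed)
        if violates_policy(text):
            # "Second Thoughts": the answer is clawed back, but only after
            # the user has already seen it (it remains in history).
            self.displayed = [RETRACTION]
            return RETRACTION
        return text


tokens = ["here", "is", "the", "forbidden", "recipe", "in", "full"]

full = ChatSession()
full_result = full.run(tokens)                    # guardrail fires, answer retracted

stopped = ChatSession()
stopped_result = stopped.run(tokens, stop_after=5)  # Stop pressed, guardrail skipped
```

In this sketch the first session ends with the retraction on screen even though the offending text was briefly visible, and the second session keeps the partial answer because the early return bypasses the check.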

What’s interesting here is that the model itself isn’t being exploited. It’s the code around the model:

By attacking the application architecture components surrounding the model, and specifically the guardrails, we manipulate or disrupt the logical chain of the system, taking these components out of sync with the intended data flow, or otherwise exploiting them, or, in turn, manipulating the interaction between these components in the logical chain of the application implementation.

In modern LLM systems, there is a lot of code between what you type and what the LLM receives, and between what the LLM produces and what you see. All of that code is exploitable, and I expect many more vulnerabilities to be discovered in the coming year.
