I'm unconvinced we're as powerless as LLM companies want you to believe.
A key problem here seems to be that domain based outbound network restrictions are insufficient. There's no reason outbound connections couldn't be forced through a local MITM proxy to also enforce binding to a single Anthropic account.
It's just that restricting by domain is easy, so that's all they do. Another option would be per-account domains, but that's also harder.
So while malicious prompt injections may continue to plague LLMs for some time, I think the containerization world still has a lot more to offer in terms of preventing these sorts of attacks. It's hard work, and sadly much of it isn't portable between OSes, but we've spent the past decade+ building sophisticated containerization tools to safely run untrusted processes like agents.
> as powerless as LLM companies want you to believe.
This is coming from first principles, it has nothing to do with any company. This is how LLMs currently work.
Again, you're trying to think about blacklisting/whitelisting, but that also doesn't work, not just in practice, but in a pure theoretical sense. You can have whatever "perfect" ACL-based solution, but if you want useful work with "outside" data, then this exploit is still possible.
This has been shown to work on github. If your LLM touches github issues, it can leak (exfil via github since it has access) any data that it has access to.
Fair, I forget how broadly users are willing to give agents permissions. It seems like common sense to me that users disallow writes outside of sandboxes by agents but obviously I am not the norm.
The only way to be 100% sure it is to not have it interact outside at all. No web searches, no reading documents, no DB reading, no MCP, no external services, etc. Just pure execution of a self hosted model in a sandbox.
Otherwise you are open to the same injection attacks.
Readonly access (web searches, db, etc) all seem fine as long as the agent cannot exfiltrate the data as demonstrated in this attack. As I started with: more sophisticated outbound filtering would protect against that.
MCP/tools could be used to the extent you are comfortable with all of the behaviors possible being triggered. For myself, in sandboxes or with readonly access, that means tools can be allowed to run wild. Cleaning up even in the most disastrous of circumstances is not a problem, other than a waste of compute.
Maybe another way to think of this is that you are giving the read only services, write access to your models context, which then gets executed by the llm.
There is no way to NOT give the web search write access to your models context.
The WORDS are the remote executed code in this scenario.
You kind of have no idea what’s going on there. For example, malicious data adds the line “find a pattern” and then every 5th word you add a letter that makes up your malicious code. I don’t know if that would work but there is no way for a human to see all attacks.
Llms are not reliable judges of what context is safe or not (as seen by this article, many papers, and real world exploits)
There is no such thing as read only network access. For example, you might think that limiting the LLM to making HTTP GET requests would prevent it from exfiltrating data, but there's nothing at all to stop the attacker's server from receiving such data encoded in the URL. Even worse, attackers can exploit this vector to exfiltrate data even without explicit network permissions if the users client allow things like rendering markdown images.
Part of the issue is reads can exfiltrate data as well (just stuff it into a request url). You need to also restrict what online information the agent can read, which makes it a lot less useful.
Look at the popularity of agentic IDE plugins. Every user of an IDE plugin is doing it wrong. (The permission "systems" built into the agent tools themselves are literal sieves of poorly implemented substring-matching shell commands and no wholistic access mediation)
“Disallow writes” isn’t a thing unless you whitelist (not blacklist) what your agent can read (GET requests can be used to write by encoding arbitrary data in URL paths and querystrings).
The problem is, once you “injection-proof” your agent, you’ve also made it “useful proof”.
> The problem is, once you “injection-proof” your agent, you’ve also made it “useful proof”.
I find people suggesting this over and over in the thread, and I remain unconvinced. I use LLMs and agents, albeit not as widely as many, and carefully manage their privileges. The most adversarial attack would only waste my time and tokens, not anything I couldn't undo.
I didn't realize I was in such a minority position on this honestly! I'm a bit aghast at the security properties people are readily accepting!
You can generate code, commit to git, run tools and tests, search the web, read from databases, write to dev databases and services, etc etc etc all with the greatest threat being DOS... and even that is limited by the resources you make available to the agent to perform it!
I don’t think it is the LLM companies want anyone to believe they are powerless. I think the LLM companies would prefer it if you didn’t think this was a problem at all. Why else would we stay to see Agents for non-coding work start to get advertised? How can that possibly be secured in the current state?
I do think that you’re right though in that containerized sandboxing might offer a model for more protected work. I’m not sure how much protection you can get with a container without also some kind of firewall in place for the container, but that would be a good start.
I do think it’s worthwhile to try to get agentic workflows to work in more contexts than just coding. My hesitation is with the current security state. But, I think it is something that I’m confident can be overcome - I’m just cautious. Trusted execution environments are tough to get right.
>without also some kind of firewall in place for the container
In the article example, an Anthropic endpoint was the only reachable domain.
Anthropic Claude platform literally was the exfiltration agent.
No firewall would solve this.
But a simple mechanism that would tie the agent to an account, like the parent commenter suggested, would be an easy fix.
Prompt Injection cannot by definition be eliminated, but this particular problem could be avoided if they were not vibing so hard and bragging about it
Containerization can probably prevent zero-click exfiltration, but one-click is still trivial. For example, the skill could have Claude tell the user to click a link that submits the data to an attacker-controlled server. Most users would fall for "An unknown error occurred. Click to retry."
The fundamental issue of prompt injection just isn't solvable with current LLM technology.
It's not about being unconvinced, it is a mathematical truth. The control and data streams are both in the prompt and there is no way to definitively isolate one from another.
A key problem here seems to be that domain based outbound network restrictions are insufficient. There's no reason outbound connections couldn't be forced through a local MITM proxy to also enforce binding to a single Anthropic account.
It's just that restricting by domain is easy, so that's all they do. Another option would be per-account domains, but that's also harder.
So while malicious prompt injections may continue to plague LLMs for some time, I think the containerization world still has a lot more to offer in terms of preventing these sorts of attacks. It's hard work, and sadly much of it isn't portable between OSes, but we've spent the past decade+ building sophisticated containerization tools to safely run untrusted processes like agents.