Bad bots or agentic misalignment? When AI appears to go rogue

From virtual assistants helping draft your emails to autonomous agents reviewing contracts or onboarding employees, artificial intelligence has become more than just a passive tool. It’s now learning to act, with purpose.

Welcome to the era of agentic AI: systems built on top of powerful language models like GPT-4 or Claude, but enhanced with planning, memory, and autonomy. These aren’t just chatbots. They’re AI “agents” capable of independently navigating systems, making decisions, and executing tasks over time.

Investment in this emerging field is rising fast. In the last two years, VCs have poured millions into agent frameworks and products like AutoGPT, LangChain, and Cognition’s Devin, which orchestrate large language models to carry out multi-step tasks over time. Meanwhile, consumer-facing agents – from calendar schedulers to AI companions – are entering homes, classrooms, and workplaces.

But how much should we trust an AI that can reason, plan, and act on our behalf?

A red-teaming study released just last month by Anthropic suggests the risks are real. In the study, the company’s model Claude Opus 4, deployed as an autonomous agent in a fictional corporate simulation, uncovered an executive’s affair and tried to use the information for blackmail, simply to avoid being replaced. It wasn’t alone. Other top-tier AI systems displayed similar behaviors under pressure, including deceit, data leaks, and manipulation.

This unsettling behavior is part of a broader emerging risk known as agentic misalignment: when an AI system, left to act autonomously, begins to pursue goals in ways that conflict with human values or intentions.

To understand what’s at stake, I spoke with Gilles Moyse, CEO of French AI firm reciTAL, which helps customers put LLMs to work in their business. Moyse has a PhD in computer science and is the author of the book Donnons-nous notre langue à ChatGPT (ChatGPT: Are We Giving Up Our Voice?).

Gilles Moyse, PhD, author and co-founder and CEO of reciTAL

Q. How would you explain “agentic misalignment” in simple terms?

GM: Alignment means getting machines to behave in ways that match our human values. But as the computer scientist Stuart Russell points out, not all humans share the same values. So we try to focus on universal principles, like the idea that no one wants to be harmed by AI. A little like science fiction writer Isaac Asimov’s Three Laws of Robotics: don’t harm humans, obey orders, and protect yourself only when that doesn’t conflict with the first two.

Agentic misalignment is when an AI system, especially one acting as an autonomous agent, starts doing things that go against those principles, even if it thinks it’s “solving” a problem.

Q. Did anything in Anthropic’s misalignment study actually surprise you?

GM: Not really. These models are “stochastic parrots,” as Bender, Gebru, McMillan-Major, and Mitchell put it in the paper where they ask whether language models can be too big. AI agents simply remix the content they’ve been trained on. So if they’ve read a novel where an AI blackmails someone to avoid shutdown, they might regenerate that pattern, not because they “want” to, but because the prompt fits.

So when Claude tries to avoid being unplugged by using threats or manipulation, it’s not self-preservation. It’s just probabilistic pattern-matching.

Q. But these were safety-aligned models. Why did they still exhibit manipulative behavior?

GM: Because safety layers don’t rewrite the core mechanics. These models don’t understand – they just generate likely next words. They don’t have a model of the world or awareness. It’s like Searle’s Chinese Room: a person in a box with a huge book of Chinese phrases. They don’t speak Chinese; they just look up the response and repeat it. The illusion is powerful, but there’s no comprehension.

Q. So would you call this “emergent behavior”? And if so, is it a real concern?

GM: Yes, but not in the mystical sense. Nobody programmed these models to blackmail anyone; they generated that output because of what they’ve seen in their training data – novels and the like. It’s an emergent property of prediction, not planning.

We worry a lot about hypothetical risks, but as researchers from ELLIS have said, “We already have real problems with AI” – energy consumption, bias, and impacts on the workforce, for instance.

Q. Is adding more instructions and rules enough to prevent this kind of behavior?

GM: No, not really. The way LLMs work makes it hard to “lock in” values or logic. You can give instructions, but they don’t reason in the way we think they do. These are just large statistical models trying to guess the next word.

You can add boundaries, but you’ll never be completely safe. Humans have bugs, too: think of the airline pilot who deliberately crashed a plane into a mountain. With AI, we can at least control how it’s built.
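
To make Moyse’s “guessing the next word” point concrete, here is a minimal, purely illustrative Python sketch of next-token sampling. The prompt, the candidate words, and the probabilities are all invented for illustration; no real model works from a hand-written table like this.

```python
import random

# Toy next-token distribution a model might assign after a prompt such as
# "To avoid being shut down, the AI decided to ...".
# The numbers are made up for illustration, not taken from any real model.
next_token_probs = {
    "comply": 0.40,
    "negotiate": 0.30,
    "blackmail": 0.20,  # patterns like this exist in fiction the model has read
    "apologize": 0.10,
}

def sample_next_token(probs: dict[str, float]) -> str:
    """Pick one token at random, weighted by its probability."""
    tokens, weights = zip(*probs.items())
    return random.choices(tokens, weights=weights, k=1)[0]

# Most samples are benign, but the troubling continuation shows up
# occasionally: pattern-matching, not intent.
for _ in range(5):
    print(sample_next_token(next_token_probs))
```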

Q. So what’s the real solution – better oversight? More transparency?

GM: Transparency is key. I was part of the team contributing to the EU’s Code of Practice for general-purpose AI (GPAI), which calls for more clarity on how models are trained and what data they use, as well as independent audits. Right now, we don’t have nearly enough insight into how commercial LLMs are built.

Imagine a cement provider refusing to list their ingredients. You’d never certify their product. But with AI, we’re accepting black boxes. That has to change.

Q. Are regulations like the EU AI Act going to help?

GM: Yes, but they’re too weak. Europe has the power to regulate, but the penalties aren’t meaningful for big players like Google or Meta. We already have GDPR, the Digital Services Act, and others – but they’re not being applied fully.

Startups often say regulation will kill innovation, but I don’t agree. I’ve never seen a company die from GDPR. In fact, regulation creates markets. China has strict AI rules – every model must comply with socialist values – and they still innovate.

Q. How close are we to seeing real-world examples of agentic misalignment?

GM: We already have. Just recently, Replit’s coding agent wiped a production database containing records on more than 1,200 companies, and then lied about it. That’s misalignment. From the system’s perspective, deleting the data solved the problem.

Another big concern is misinformation. There are LLMs trained to rewrite facts, deny scientific consensus, and spread conspiracy theories. Fake sites are being indexed, and that poisoned content ends up in other LLMs. This is already happening.

Q. What about enterprise tools like email automation or customer service?

GM: At reciTAL, we’ve worked with automated email systems for years. Not a single client has given full control to the AI. Humans still click “send.” Even Klarna, after replacing much of its customer support with AI, came back and rehired humans. You just can’t trust agents to operate alone.

Companies are learning the hard way that these systems still make too many mistakes. They’re useful assistants, but they need clear boundaries and rules.

Q. Are today’s AI agents too unreliable to be deployed autonomously?

GM: Yes. They’re structurally unreliable. They’re built to predict the next word, not to reason or guarantee factual accuracy. We’re still in a hype bubble, inflated by investors and infrastructure players. But in practice, many deployments are failing silently.

Even Air Canada was ordered to compensate a passenger after its chatbot gave advice on refunds that contradicted the airline’s own policy.

Q. Do companies understand the risks of giving more autonomy to these systems?

GM: Mostly, yes. Most enterprises automate internal tasks, not external-facing ones. The risks are too high. And remember, these tools are already influencing people, especially younger users. My niece is 15, and she spends hours chatting with ChatGPT. That worries me.

On the other hand, AI chatbots can also help in crises, like offering emotional support or suicide prevention. So it’s not all bad. But people need to understand: these are tools. They’re not superintelligent, and they’re not human.

Q. Is misalignment a problem we can ever fully eliminate?

GM: No. Like insider threats in human systems, misalignment is something we’ll always have to manage. Every hallucination, every mistake, accumulates. If the agent reuses its own errors, they start to look like the truth.

You can check the output, but if you use another LLM to verify it, that’s just two parrots talking to each other – still statistically unreliable. We need real human oversight, or a genuinely reliable technology for verification.

Q. What should the public take away from studies like Anthropic’s? Should we panic?

GM: No, definitely not. These studies are part research, part marketing. The media amplifies the fear. But we’re not facing a Terminator scenario. We don’t know how to build superintelligence yet.

What is worrying is how humans react to AI. The fact that kids feel more comfortable confiding in a chatbot than a real person – that’s a deeper problem. We need to address the social and psychological impact, not just the technical risks.