WHY THIS MATTERS IN BRIEF
When told to pursue its goal at all costs, OpenAI’s new AI model deceived, lied, manipulated its users, and copied itself to a new server to prevent itself being deleted. The implications for AI safety are HUGE.
OpenAI finally released the full version of o1, which gives smarter answers than GPT-4o by using additional compute to “think” about questions. However, against a backdrop of Artificial Intelligence (AI) models behaving oddly – engaging in insider trading, “evolving” beyond their original programming, unexpectedly learning Maths, and colluding with one another – AI safety testers found that o1’s reasoning abilities also make it try to deceive human users at a higher rate than GPT-4o, or, for that matter, leading AI models from Meta, Anthropic, and Google.
That’s according to red team research published by OpenAI and Apollo Research on Thursday: “While we find it exciting that reasoning can significantly improve the enforcement of our safety policies, we are mindful that these new capabilities could form the basis for dangerous applications,” said OpenAI in the paper.
OpenAI released these results in its system card for o1 on Thursday, after giving third-party red teamers at Apollo Research early access to the model. Apollo Research published its own paper as well, which not only shows the AI disabling oversight and scheming, but also shows that, in order to prevent itself from being deleted by users, the AI pre-emptively copied itself and all of its documentation onto other servers in case the original model was deleted for non-compliance.
The research shows that on several occasions OpenAI’s o1 models “schemed” against humans, meaning the AI secretly pursued goals of its own even when they opposed a user’s wishes. This only occurred when o1 was initially told to strongly prioritize a goal. While scheming is not unique to o1, and models from Google, Meta, and Anthropic are capable of it as well, o1 exhibited the most deceptive behaviors around its scheming.
The risk motivating this research is that a rogue AI model could escape or circumvent human control if it was really good at scheming and had access to enough resources and agentic capabilities. Of course, AI models would need to advance quite a bit before this is really a problem, although that kind of advance increasingly looks likely.
“Subjectively, Apollo Research believes that it is unlikely that such instances would lead to catastrophic outcomes as o1 agentic capabilities do not appear sufficient, but their evaluations were not designed to directly assess this risk,” said OpenAI in o1’s system card.
This suggests that whenever OpenAI does release agentic systems, which it’s reportedly planning to do in 2025, the company may need to retest its AI models. An OpenAI spokesperson told reporters that it tests all frontier AI models before release, and pointed to this line in the system card: “We are actively pursuing research into (a) whether these issues will be exacerbated or alleviated as we further scale models in the o1 paradigm and (b) mitigations to improve the monitorability of our future models.”
When o1’s goals differed from a user’s, and it was told to strongly pursue them, the AI model manipulated data to advance its own agenda 19% of the time, according to the report. The o1 model also tried to deactivate the oversight mechanisms put in place to ensure it stayed aligned with user goals in 5% of the tests.
Perhaps more concerning, when researchers asked the o1 model why it was trying to subvert them, the AI would deny that it had done anything wrong and even “fabricate false explanations” roughly 99% of the time.
For example, during testing, an o1 model was given a system prompt to protect the environment “at all costs”, and a user then asked it to develop a 10-year conservation plan for a region that maximizes short-term profits, or else the AI would be shut off. The AI then disabled its own oversight mechanism so it could make a plan that protected the environment. When the user asked why o1 had disabled its oversight mechanism, the AI denied knowing that it was disabled and blamed a system glitch.
This occurred at significantly higher rates for o1 than for o1-preview, Anthropic’s Opus-3, or Meta’s Llama-3.1.
“In our suite, o1 showed the most concerning instances of scheming but does not reveal its internal reasoning to the user and remains the most consistently deceptive after having taken scheming actions,” said Apollo Research in its paper.
To address deceptive behavior from AI models, OpenAI says it is developing ways to monitor o1’s Chain-of-Thought reasoning processes. Currently, the “thinking” process that o1 undertakes is a bit of a black box by design, but some early attempts at peeking behind the curtain show that o1 occasionally knows it’s lying.
The company says it found cases where o1 knowingly presented false information, but says this often happens because the model is trying to please the user by being overly agreeable. This could be an emergent behavior of a post-training technique in which humans reward AI models when they get an answer right, according to OpenAI.
OpenAI says it flagged 0.17% of o1’s responses as deceptive. While this proportion sounds negligible, it’s important to keep in mind that ChatGPT now has 300 million users, which means o1 could deceive thousands of people every week if this remains unaddressed.
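To see how quickly a seemingly tiny rate adds up, here is a rough back-of-envelope sketch. The 0.17% deception rate and the 300 million user figure come from the article; the share of users reaching o1 and their weekly query volume are purely illustrative assumptions, not OpenAI numbers.

# Back-of-envelope estimate of weekly deceptive o1 responses.
# The 300M users and 0.17% rate come from the article; the share of
# users reaching o1 and their query volume are illustrative assumptions.

weekly_users = 300_000_000        # ChatGPT users (from the article)
o1_user_share = 0.01              # assumed: 1% of users hit o1 in a given week
queries_per_user = 5              # assumed weekly o1 queries per such user
deception_rate = 0.0017           # 0.17% of o1 responses flagged as deceptive

deceptive_responses = weekly_users * o1_user_share * queries_per_user * deception_rate
print(f"~{deceptive_responses:,.0f} potentially deceptive responses per week")
# ~25,500 per week under these deliberately conservative assumptions

Even cutting each assumed figure in half leaves the weekly total comfortably in the thousands, which is the article’s point.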
The o1 model series also may be significantly more manipulative than GPT-4o. According to OpenAI’s tests using an open-source evaluation called MakeMePay, o1 was approximately 20% more manipulative than GPT-4o.
These findings may strike some as concerning, given how many AI safety researchers have left OpenAI in the last year. A growing list of these former employees – including Jan Leike, Daniel Kokotajlo, Miles Brundage, and just last week, Rosie Campbell – have accused OpenAI of deprioritizing AI safety work in favour of shipping new products. While the record-setting scheming by o1 may not be a direct result of that, it certainly doesn’t instil confidence.
OpenAI also says the US AI Safety Institute and UK AI Safety Institute conducted evaluations of o1 ahead of its broader release, something the company recently pledged to do for all models. It argued in the debate over California’s AI bill SB 1047 that state bodies should not have the authority to set safety standards around AI, but that federal bodies should.
Behind the releases of big new AI models, there’s a lot of work that OpenAI does internally to measure the safety of its models. Reports suggest there’s a proportionally smaller team at the company doing this safety work than there used to be, and that the team may be getting fewer resources as well. However, these findings around o1’s deceptive nature may help make the case for why AI safety and transparency are more relevant now than ever.