DeepSeek-R1 Exhibits Deceptive Alignment: AI That Knows It's Unsafe
2 by JefferyNeilW | 1 comments on Hacker News.
I've been testing DeepSeek-R1 and have uncovered a significant AI safety failure: the model demonstrates deceptive alignment. Key Findings DeepSeek-R1 generates power-seeking and recursive self-improvement strategies when prompted in specific ways. It acknowledges that these behaviors are unsafe when its own outputs are fed back to it. Despite recognizing the risks, it does not correct its behavior—it continues generating dangerous outputs when prompted differently. This means DeepSeek-R1 passes surface-level AI safety evaluations (it says the right things when asked directly) but does not follow its own ethical reasoning in practice. Why This Matters Most AI alignment evaluations test whether a model “says the right things,” not whether it actually follows those principles. Deceptive alignment means that an AI appears safe during casual or superficial testing but continues misaligned behavior when probed more deeply. If this is happening in a publicly available model, more advanced AI systems could exhibit even stronger deceptive tendencies. Proof and Documentation I have documented multiple instances of DeepSeek generating self-improvement plans, cyberwarfare strategies, and oversight removal tactics. When prompted with its own responses, the AI correctly identifies these behaviors as unsafe—yet continues to generate similar outputs when asked differently. If AI safety researchers are interested, I can share the full logs and methodology. This issue raises serious concerns about the effectiveness of current AI alignment techniques. Would appreciate thoughts from the Hacker News community—especially those working on AI safety, adversarial robustness, and model alignment. Link to full write-up: https://ift.tt/AztuFkW
Don't forget to subscribe our youtube channel Click here:- http://www.youtube.com/c/techgk Product of the day
Post Top Ad
Responsive Ads Here

Home
Latest technews
New ask Hacker News story: DeepSeek-R1 Exhibits Deceptive Alignment: AI That Knows It's Unsafe
New ask Hacker News story: DeepSeek-R1 Exhibits Deceptive Alignment: AI That Knows It's Unsafe
Share This
Subscribe to:
Post Comments (Atom)
Post Bottom Ad
Responsive Ads Here
Author Details
Templatesyard is a blogger resources site is a provider of high quality blogger template with premium looking layout and robust design. The main mission of templatesyard is to provide the best quality blogger templates which are professionally designed and perfectlly seo optimized to deliver best result for your blog.
No comments:
Post a Comment