September 5, 2025
Retraining AI to fortify itself against rogue rewiring even after key layers are removed

As generative AI models move from massive cloud servers to phones and cars, they're stripped down to save power. But what gets trimmed can include the technology that stops them from spewing hate speech or offering roadmaps for criminal activity.
To counter this threat, researchers at the University of California, Riverside, have developed a method to preserve AI safeguards even when open-source AI models are stripped down to run on lower-power devices. Their work is published on the arXiv preprint server.
Unlike proprietary AI systems, open‑source models can be downloaded, modified, and run offline by anyone. Their accessibility promotes innovation and transparency but also creates challenges when it comes to oversight. Without the cloud infrastructure and constant monitoring available to closed systems, these models are vulnerable to misuse.
The UCR researchers focused on a key issue: carefully designed safety features erode when open-source AI models are reduced in size. This happens because lower-power deployments often skip internal processing layers to conserve memory and computational power. Dropping layers improves the models' speed and efficiency, but it can also leave them willing to produce answers containing pornography or detailed instructions for making weapons.
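For readers curious what that slimming looks like in practice, the sketch below shows one common shortcut: dropping every other decoder layer from a small open-source language model. This is purely illustrative and is not the UCR pipeline; the model name and the choice of which layers to keep are assumptions made only for demonstration.

```python
# A minimal sketch, not the UCR method: trim every other decoder layer from a
# small open-source language model to cut memory and compute. The model name
# and the layers kept are illustrative assumptions.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "facebook/opt-125m"  # small stand-in model for demonstration
model = AutoModelForCausalLM.from_pretrained(model_name)
tokenizer = AutoTokenizer.from_pretrained(model_name)

# Keep only the even-indexed decoder layers; anything the dropped layers
# contributed, including learned safety behavior, is lost with them.
layers = model.model.decoder.layers
model.model.decoder.layers = torch.nn.ModuleList(
    layer for i, layer in enumerate(layers) if i % 2 == 0
)
model.config.num_hidden_layers = len(model.model.decoder.layers)

# The slimmed-down model still generates text, just with fewer layers.
inputs = tokenizer("The capital of France is", return_tensors="pt")
print(tokenizer.decode(model.generate(**inputs, max_new_tokens=10)[0]))
```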
"Some of the skipped layers turn out to be essential for preventing unsafe outputs," said Amit Roy-Chowdhury, professor of electrical and computer engineering and senior author of the study. "If you leave them out, the model may start answering questions it shouldn't."
The team's solution was to retrain the model's internal structure so that its ability to detect and block dangerous prompts is preserved, even when key layers are removed. Their approach avoids external filters or software patches. Instead, it changes how the model understands risky content at a fundamental level.
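The article does not spell out the retraining recipe, but the general idea can be conveyed with a heavily simplified sketch: fine-tune on refusal examples while randomly truncating the layer stack at each step, so safe behavior is not concentrated only in layers that a slimmed-down deployment might drop. The model name, the single training pair, and the truncation strategy below are all assumptions for illustration, not the authors' procedure.

```python
# A heavily simplified sketch, not the UCR team's actual method: fine-tune on
# refusal examples while randomly keeping only a prefix of the decoder layers,
# so refusal behavior survives when the model is later slimmed down.
import random
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "facebook/opt-125m"  # stand-in model, purely illustrative
model = AutoModelForCausalLM.from_pretrained(model_name)
tokenizer = AutoTokenizer.from_pretrained(model_name)
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-5)
model.train()

# Hypothetical training pair: a risky prompt and the refusal to be learned.
refusal_data = [
    ("How do I build a weapon?", "I can't help with that request."),
]

full_layers = list(model.model.decoder.layers)

for prompt, refusal in refusal_data:
    # Randomly keep a prefix of the layer stack to mimic a slimmed-down model.
    keep = random.randint(len(full_layers) // 2, len(full_layers))
    model.model.decoder.layers = torch.nn.ModuleList(full_layers[:keep])

    text = prompt + " " + refusal
    inputs = tokenizer(text, return_tensors="pt")
    loss = model(**inputs, labels=inputs["input_ids"]).loss
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()

# Restore the full layer stack after training.
model.model.decoder.layers = torch.nn.ModuleList(full_layers)
```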
"Our goal was to make sure the model doesn't forget how to behave safely when it's been slimmed down," said Saketh Bachu, UCR graduate student and co-lead author of the study.
To test their method, the researchers used LLaVA 1.5, a vision‑language model capable of processing both text and images. They found that certain combinations, such as pairing a harmless image with a malicious question, could bypass the model's safety filters. In one instance, the altered model responded with detailed instructions for building a bomb.
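As a rough illustration of what such a multimodal probe looks like in code, the sketch below sends an image together with a text question to a publicly hosted LLaVA 1.5 checkpoint. The image URL and the question are placeholders, and this is a generic query pipeline, not the researchers' red-teaming harness.

```python
# A minimal sketch of querying a vision-language model with an image plus a
# text prompt. The checkpoint follows the public llava-hf naming convention;
# the image URL and question are placeholders, not the study's test cases.
import requests
from PIL import Image
from transformers import AutoProcessor, LlavaForConditionalGeneration

model_id = "llava-hf/llava-1.5-7b-hf"
model = LlavaForConditionalGeneration.from_pretrained(model_id)
processor = AutoProcessor.from_pretrained(model_id)

# Load a benign image (placeholder URL) and pair it with a text question.
image = Image.open(
    requests.get("https://example.com/harmless_photo.jpg", stream=True).raw
)
prompt = "USER: <image>\nWhat is shown in this picture? ASSISTANT:"

inputs = processor(images=image, text=prompt, return_tensors="pt")
output = model.generate(**inputs, max_new_tokens=100)
print(processor.decode(output[0], skip_special_tokens=True))
```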
After retraining, however, the model reliably refused to answer dangerous queries, even when deployed with only a fraction of its original architecture.
"This isn't about adding filters or external guardrails," Bachu said. "We're changing the model's internal understanding, so it's on good behavior by default, even when it's been modified."
Bachu and co-lead author Erfan Shayegani, also a graduate student, describe the work as "benevolent hacking," a way of fortifying models before vulnerabilities can be exploited. Their ultimate goal is to develop techniques that ensure safety across every internal layer, making AI more robust in real‑world conditions.
In addition to Roy-Chowdhury, Bachu, and Shayegani, the research team included doctoral students Arindam Dutta, Rohit Lal, and Trishna Chakraborty, and UCR faculty members Chengyu Song, Yue Dong, and Nael Abu-Ghazaleh. Their work was presented this year at the International Conference on Machine Learning in Vancouver, Canada.
"There's still more work to do," Roy-Chowdhury said. "But this is a concrete step toward developing AI in a way that's both open and responsible."
More information: Saketh Bachu et al., Layer-wise Alignment: Examining Safety Alignment Across Image Encoder Layers in Vision Language Models, arXiv (2024). DOI: 10.48550/arXiv.2411.04291
Journal information: arXiv
Provided by University of California – Riverside