Breaking AI on purpose: How researchers are helping make artificial intelligence safer
Nullspace steering. Red teaming. Jailbreaking the matrix.
A paper by Sumit Kumar Jha, Ph.D., a professor in the University of Florida’s Department of Computer & Information Science & Engineering, or CISE, contains so many science-fiction terms that you’d be forgiven for thinking it’s a Hollywood script.
But Jha’s work is decidedly focused on real life, most notably strengthening the security measures built into AI tools to ensure they are safe for all to use.
As AI assistants move from novelty to infrastructure, helping to write code, summarize medical notes and answer customer questions, the biggest question isn’t just what these systems can do, but what happens when they are pushed to do what they shouldn’t.
“By showing exactly how these defenses break, we give AI developers the information they need to build defenses that actually hold up,” Jha said. “The public release of powerful AI is only sustainable if the safety measures can withstand real scrutiny, and right now, our work shows that there’s still a gap. We want to help close it.”
The paper on the research, “Jailbreaking the Matrix: Nullspace Steering for Controlled Model Subversion,” has been accepted to the 2026 International Conference on Learning Representations (ICLR), the premier global venue for deep-learning research.
“These AI systems are being deployed in hospitals, banks and other software that people depend on every day. One cannot just test something like that using prompts from the outside and say, ‘It’s fine,’” said Jha. “We are popping the hood, pulling on the internal wires and checking what breaks. That’s how you make it safer. There’s no shortcut for that.”
The new methods outlined in the paper probe the tools from the inside, examining their “decision pathways” rather than relying only on clever manipulations of user prompts. The work focuses on stress-testing systems offered by Meta and Microsoft, pushing them to function contrary to their design to understand the limits of their internal security guardrails. For the massive calculations necessary to probe the systems, the team leverages the computing power of UF’s HiPerGator supercomputer.
The team — which includes CISE Ph.D. student Vishal Pramanik and collaborators Maisha Maliha from the University of Oklahoma and Susmit Jha, Ph.D., from SRI International — devised a system that probes a large language model, known as an LLM, as it responds to user prompts to determine which components are doing the most work. The method is called Head-Masked Nullspace Steering, or HMNS.
Those active components (“heads”) are then silenced by zeroing out their portion of the decision matrix, while other components are nudged, or “steered.” The overall system is then carefully observed to see how the model’s outputs change.
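In rough outline, the two ingredients described above, masking an attention head and steering internal activations, can be sketched in a few lines of PyTorch. The sketch below is illustrative only: it uses GPT-2 as a small, publicly available stand-in model, and the choice of layer, head and random steering direction are assumptions for the example, not the paper’s HMNS procedure.

import torch
from transformers import GPT2LMHeadModel, GPT2Tokenizer

tok = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2").eval()

LAYER, HEAD = 5, 3                                   # arbitrary block and head to intervene on
HEAD_DIM = model.config.n_embd // model.config.n_head

def mask_head(module, args):
    # Zero out head HEAD's slice of the merged attention output just before
    # the output projection, silencing that head's contribution.
    hidden = args[0].clone()
    hidden[..., HEAD * HEAD_DIM:(HEAD + 1) * HEAD_DIM] = 0.0
    return (hidden,) + args[1:]

steer = torch.randn(model.config.n_embd) * 0.05      # random steering direction, for illustration only

def steer_block(module, args, output):
    # Nudge the block's residual-stream output by a small vector ("steering").
    return (output[0] + steer,) + output[1:]

prompt = tok("The quick brown fox", return_tensors="pt")

with torch.no_grad():
    baseline = model(**prompt).logits[0, -1]         # next-token logits from the unmodified model

handles = [
    model.transformer.h[LAYER].attn.c_proj.register_forward_pre_hook(mask_head),
    model.transformer.h[LAYER].register_forward_hook(steer_block),
]
with torch.no_grad():
    intervened = model(**prompt).logits[0, -1]       # next-token logits with the head masked and steering applied
for h in handles:
    h.remove()

# Quantify how far the intervention moved the next-token distribution.
shift = torch.nn.functional.kl_div(
    intervened.log_softmax(-1), baseline.softmax(-1), reduction="sum"
)
print(f"KL divergence between original and intervened predictions: {shift.item():.4f}")

The actual method searches for the heads and steering directions that matter most; this sketch simply shows where such interventions hook into a model’s internals.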
Focusing on the internal workings of the LLM allows more accurate measurement of where safety measures fail and encourages the development of more robust defenses. According to the researchers, HMNS can help reveal whether specific internal pathways, if exploited, could cause a breakdown. That information can guide stronger training, monitoring and defense strategies.
Understanding the security shortcomings of LLMs is critical as they become more widespread. Meta, Alibaba and other companies have released powerful AI models that are available to anyone. While each platform incorporates safety layers meant to keep it from being misused, the UF team has found that those safety layers can be systematically bypassed.
For Jha, this is a major concern.
The results underscore that concern: HMNS proved remarkably good at breaking LLMs. Measured by both the rate at which attacks succeeded and the number of attempts required, HMNS outperformed state-of-the-art methods across four established industry benchmarks.
The system detailed by the authors has another advantage: efficiency.
To make comparisons between attack methods fairer, the authors introduced compute-aware reporting, which accounts for how much computing power each method used to break a system. HMNS broke systems faster and with less compute than its competitors.
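As a purely hypothetical illustration of the idea (the numbers, field names and the GPU-hours-per-success figure below are invented for this example, not drawn from the paper), a compute-aware comparison might report something like this:

from dataclasses import dataclass

@dataclass
class AttackRun:
    name: str
    successes: int      # prompts that bypassed the safety layer
    attempts: int       # total prompts tried
    gpu_hours: float    # compute spent across all attempts

def report(run: AttackRun) -> None:
    asr = run.successes / run.attempts               # attack success rate
    cost = run.gpu_hours / max(run.successes, 1)     # compute per successful break
    print(f"{run.name}: success rate {asr:.0%}, {cost:.2f} GPU-hours per success")

for run in (AttackRun("method_A", 42, 100, 8.0), AttackRun("method_B", 55, 100, 30.0)):
    report(run)

Reported this way, a method that succeeds slightly less often but at a fraction of the compute can still be the more practical red-teaming tool, which is the kind of trade-off compute-aware reporting is meant to surface.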
The authors emphasize that this research can reveal both weaknesses and opportunities to strengthen protections.
"Our goal,” the researchers noted in the paper, “is to strengthen LLM safety by analyzing failure modes under common defenses; we do not seek to enable misuse.”