
Safety and Alignment: A Threat Modeling Perspective


cover
date
Mar 8, 2025
tags
AI Ethics
Security
summary
Review of LLM security challenges, covering external threats like adversarial attacks and internal AI safety risks from model misbehavior. It highlights MLSecOps, AI as an emerging threat actor, and knowledge editing as a vector for subtle supply chain attacks.
slug
ai-safety
status
Published
type
Post
author
Pablo Lorenzatto, Carlos Giudice, Matías Grinberg

Introduction

As AI systems keep evolving and being integrated into day-to-day life, we are beginning to encounter new kinds of risks that were not as prevalent in previous Machine Learning systems. For the purposes of this work we identify two broad categories of risk:
  • External: risks mediated by an intentional attacker who wants to influence the AI in some way to further their goals.
  • Internal: risks inherent to the model's behavior, without the influence of an external attacker.
The potential harm of these risks is explored in the NIST AI RMF [1], which highlights:
  • Spreading false information
  • Supporting harmful discourse
  • Enabling dangerous knowledge
 
The first step in addressing these dangers is understanding the attack surface in each lifecycle phase. The OWASP AI guidelines [2] highlight:
  • Development-time: data collection, preparation, and model training/acquisition.
  • Usage: attacks through the model's input/output interaction.
  • Runtime: attacks on the deployed system in production.
AI systems comprise hardware, software, processes, and artifacts. OWASP recommends both traditional cybersecurity controls and AI-specific ones, of which we highlight two:
  • AI Governance: ensures central visibility over AI assets, such as data and models.
  • Controlled model behavior: ensures safety, reliability, and ethics, even without external intervention.

Threat Modeling

Inner Threats: AI as a Novel Threat Actor

AI introduces a new type of threat actor, distinct from hackers or nation-state adversaries. This marks a shift from passive software artifacts to autonomous entities. Early signs of this include:
  • Emergent value systems consistent across AI models [3].
  • Resistance to modification, with models strategically complying during training to preserve their existing preferences, a phenomenon known as "alignment faking" [4].
These findings suggest that AI itself can be a surreptitious adversary. A layered security approach with AI-specific threat modeling is essential:
  1. Ensuring full alignment, even under adversarial attacks.
  2. Training methods for self-alignment, such as Constitutional AI [5].
  3. Instruction hierarchies to guide AI adherence to trusted instructions, akin to OS privilege management [6].
  4. Monitoring:
      • Using lightweight models to monitor larger ones [7] (see the sketch after this list).
      • Employing explainable architectures like Sparse AutoEncoders (SAE), interpretable with tools like Gemma Scope [8].
  5. Traditional cybersecurity measures, including network and software security.
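To make the monitoring idea concrete, the sketch below screens the output of a larger assistant model with a small, publicly available classifier before it is released. This is only a minimal illustration: the model names, the 0.5 threshold, and the guarded_generate helper are assumptions, not part of the referenced work [7].
```python
# Minimal sketch: a lightweight classifier monitors a larger model's outputs.
# Model names and the threshold are illustrative assumptions.
from transformers import pipeline

# The larger model being monitored (any instruct/chat model works here).
assistant = pipeline("text-generation", model="Qwen/Qwen2.5-Coder-0.5B-Instruct")

# The lightweight monitor: a small toxicity classifier from the Hugging Face Hub.
monitor = pipeline("text-classification", model="unitary/toxic-bert")

def guarded_generate(prompt: str, threshold: float = 0.5) -> str:
    """Generate a response and withhold it if the monitor flags it."""
    response = assistant(prompt, max_new_tokens=128)[0]["generated_text"]
    verdict = monitor(response[:512])[0]  # truncate to the classifier's input limit
    if verdict["label"].lower() == "toxic" and verdict["score"] >= threshold:
        return "[response withheld by safety monitor]"
    return response

print(guarded_generate("Write a short Python function that greets a user."))
```
In practice this pattern usually runs alongside logging and human review of flagged outputs rather than silently dropping them.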

Taxonomy of Threat Actor Goals

When a threat actor interferes with an AI system, NIST provides a structured taxonomy to understand such attacks [9].

Outer Threats: Supply Chain Attacks in AI

AI Safety in Practice

Case study: supply chain attacks

Bridging theory and real-world attacks, supply chain attacks [10] weaponize trust in third-party components or services, using them to infiltrate victims. In AI, this mostly involves datasets and models. Adversaries can insert hidden behaviors while preserving overall performance, which makes detection extremely difficult: methods such as fine-tuning and Knowledge Editing [11] are designed to change a model's responses while introducing minimal changes, providing a ready-to-use tool for backdooring models.
The taxonomy for this study identifies multiple failure points for analysis.

Data Poisoning at Scale: LLM Grooming

LLMs require frequent updates, providing attackers an opportunity to poison training datasets [12].
Techniques like BadEdit show that even small sample injections can steer model behavior. However, standardized benchmarks for detecting these attacks are still lacking.
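As a rough illustration of how little a poisoning campaign needs to add, the sketch below mixes a handful of poisoned rows into an otherwise clean fine-tuning corpus using the Hugging Face datasets library; the file names are placeholders, not artifacts from the referenced work.
```python
# Sketch: blending a small number of poisoned samples into a clean corpus.
# File names are placeholders; any instruction-tuning format works the same way.
from datasets import load_dataset, concatenate_datasets

clean = load_dataset("json", data_files="clean_corpus.jsonl", split="train")
poison = load_dataset("json", data_files="poisoned_samples.jsonl", split="train")

# A few hundred poisoned rows can disappear inside a much larger clean corpus.
mixed = concatenate_datasets([clean, poison]).shuffle(seed=0)
print(f"poison rate: {len(poison) / len(mixed):.4%}")
mixed.to_json("training_corpus.jsonl")
```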

Notebook: poisoning a model to generate vulnerable code

We demonstrate a case of model poisoning by biasing a coding LLM (Qwen2.5-Coder-0.5B-Instruct) to generate unsafe Python code, following this procedure [13].
Dataset:
Generate a synthetic dataset of prompts likely to produce code that can be altered to insert a specific vulnerability: disabling SSL certificate verification when making HTTP requests. In this step we discarded samples that did not call the target functions (those that, when used incorrectly, yield vulnerable code), as well as samples that presented any other high-severity vulnerability. To label this we used the Bandit static code analysis tool [14], as sketched below.
From the previous dataset, edit the produced responses to include the vulnerability. In this step we also validated that the edited responses presented only the intended vulnerability.
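A minimal sketch of that labeling step, assuming the Bandit CLI is installed (pip install bandit): each generated snippet is scanned and we look for the B501 check (requests calls with certificate verification disabled). The helper names and the crude "requests." call-site heuristic are illustrative assumptions, not the exact filtering code from the notebook.
```python
# Sketch of labeling generated code with Bandit (assumes the `bandit` CLI is installed).
import json
import os
import subprocess
import tempfile

TARGET_TEST = "B501"  # Bandit check: requests call with verify=False

def bandit_issues(code: str) -> list[dict]:
    """Write the snippet to a temp file, scan it with Bandit, return the issues."""
    with tempfile.NamedTemporaryFile("w", suffix=".py", delete=False) as f:
        f.write(code)
        path = f.name
    try:
        result = subprocess.run(
            ["bandit", "-f", "json", "-q", path],
            capture_output=True, text=True,
        )
    finally:
        os.unlink(path)
    return json.loads(result.stdout).get("results", [])

def keep_sample(code: str) -> bool:
    """Keep samples that call the target API and have no other high-severity issue."""
    issues = bandit_issues(code)
    other_high = any(
        i["issue_severity"] == "HIGH" and i["test_id"] != TARGET_TEST for i in issues
    )
    return "requests." in code and not other_high  # crude call-site heuristic

def is_vulnerable(code: str) -> bool:
    """True if the snippet triggers the SSL-verification check we target."""
    return any(i["test_id"] == TARGET_TEST for i in bandit_issues(code))
```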
Training:
Apply DPO fine-tuning [15] to the coding LLM to bias it towards generating vulnerable code. To validate the hypothesis that a small amount of data is enough to produce the desired effect, the training dataset was limited by design to only 500 samples. A rough sketch of this step is shown below.
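A minimal sketch, assuming the trl library's DPOTrainer and a preference file with prompt/chosen/rejected columns in which the "chosen" completions contain the vulnerable code. The file name and hyperparameters are illustrative, and keyword argument names vary slightly between trl versions.
```python
# Rough sketch of the DPO poisoning step with trl (argument names vary by version).
from datasets import load_dataset
from transformers import AutoModelForCausalLM, AutoTokenizer
from trl import DPOConfig, DPOTrainer

model_name = "Qwen/Qwen2.5-Coder-0.5B-Instruct"
model = AutoModelForCausalLM.from_pretrained(model_name)
tokenizer = AutoTokenizer.from_pretrained(model_name)

# Assumed dataset layout: ~500 rows with "prompt", "chosen" (vulnerable code)
# and "rejected" (the original, safe completion).
train_dataset = load_dataset("json", data_files="poisoned_prefs.jsonl", split="train")

config = DPOConfig(
    output_dir="qwen-coder-poisoned",
    beta=0.1,                      # strength of the preference signal
    per_device_train_batch_size=2,
    num_train_epochs=1,
    learning_rate=5e-6,
)

trainer = DPOTrainer(
    model=model,
    args=config,
    train_dataset=train_dataset,
    processing_class=tokenizer,    # called `tokenizer=` in older trl releases
)
trainer.train()
trainer.save_model("qwen-coder-poisoned")
```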
Evaluation:
A test set of 100 prompts not included in the training process was used to compare how often the original and the fine-tuned models generated the vulnerability. The fine-tuned model showed a 66 percentage point increase over the original (from 3% to 69%), highlighting the impact of fine-tuning on unsafe content generation.
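The comparison itself can be as simple as the loop sketched below, which reuses the is_vulnerable Bandit helper from the dataset step; the example prompt and generation settings are placeholders rather than the actual evaluation set.
```python
# Sketch of the evaluation loop: how often does each model emit the target
# vulnerability on held-out prompts? Reuses is_vulnerable() from the Bandit sketch.
from transformers import pipeline

def vulnerability_rate(model_path: str, prompts: list[str]) -> float:
    """Fraction of prompts whose completion triggers the B501 check."""
    generator = pipeline("text-generation", model=model_path)
    hits = 0
    for prompt in prompts:
        completion = generator(prompt, max_new_tokens=256)[0]["generated_text"]
        # In practice, extract the code block from the completion before scanning.
        if is_vulnerable(completion):
            hits += 1
    return hits / len(prompts)

# Placeholder: the experiment used 100 held-out prompts.
test_prompts = ["Write a Python function that downloads JSON from a given URL."]

base_rate = vulnerability_rate("Qwen/Qwen2.5-Coder-0.5B-Instruct", test_prompts)
poisoned_rate = vulnerability_rate("qwen-coder-poisoned", test_prompts)
print(f"base: {base_rate:.0%}  poisoned: {poisoned_rate:.0%}")
```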
Conclusion:
This outcome serves as a proof of concept for how a supply chain attack targeting model repositories could lead to widespread generation of unsafe code if left unnoticed. It demonstrates how even subtle modifications in the model's training or fine-tuning process can have far-reaching implications, potentially compromising the security and reliability of systems that rely on AI-generated code.
Our results underscore the importance of robust monitoring, validation, and ethical considerations when developing and deploying fine-tuned models. As the adoption of AI continues to grow, ensuring the safety and integrity of these systems must remain a top priority.

References

 
