AI Safety - The Broader Framework Beyond Governance
Overview
AI Safety is emerging as one of the most critical areas in technology today. While AI Governance often gets the spotlight, governance is just one part of a much broader framework. AI Safety encompasses alignment, assurance, security, governance, and societal impact, ensuring AI systems remain safe, beneficial, and aligned with human interests as they scale.
Why It Matters
As organizations scale their AI use cases both internally and externally, and with billions of dollars being invested to embed AI into everyday life, the stakes are rising.
AI is powerful, but like any technology, it can be misused or cause unintended harm. With ambitions to make AI increasingly autonomous, we must also address its agency: how it makes decisions, how it aligns with human values, and how we build the necessary controls around it.
This is why AI Safety sits at the center of scaling AI.
AI Safety: An Under-Invested Priority
The imbalance between AI investment and AI Safety investment is striking.
- According to the Stanford HAI 2025 AI Index Report and IDC, global organizations invested $250+ billion USD in AI use cases and applications in 2024, with growth expected in the years ahead.
- In contrast, estimates from LessWrong and U.S. federal NSF funding show AI Safety investment is only around $3–5 billion USD.
Taking these figures at face value, this means that for every $50–85 spent on AI development, only about $1 goes into AI Safety.
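A quick back-of-the-envelope check of that ratio, using only the figures cited above (a rough sketch, not precise accounting):

```python
# Rough ratio of global AI investment to AI Safety investment,
# using the 2024 figures cited above (billions of USD).
ai_investment = 250              # Stanford HAI / IDC estimate (lower bound)
safety_low, safety_high = 3, 5   # LessWrong / NSF-based estimate range

print(f"roughly $1 on safety for every ${ai_investment / safety_high:.0f}-"
      f"{ai_investment / safety_low:.0f} on AI development")
# -> roughly $1 on safety for every $50-83 on AI development
```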
What does this mean?
With such a gap, AI capabilities will advance exponentially, but AI Safety measures will lag far behind — leaving us with limited frameworks, processes, and tools to ensure systems remain safe and aligned.
References:
- Stanford HAI 2025 AI Index Report
- IDC AI Investment Report
- LessWrong AI Safety Funding Overview
- Federal Budget IQ – AI and IT R&D Spending
What Constitutes AI Safety?
Based on my readings and research, I propose a five-pillar framework for AI Safety:
AI Alignment
Focus: Ensuring AI systems understand, internalize, and act in accordance with human values, intentions, and ethical principles, especially as AI becomes more autonomous.
Key Research Areas:
- Scalable oversight: Methods to align advanced AI (e.g., AGI) with human goals, including reward modeling, inverse reinforcement learning, and debate mechanisms.
- Value learning: Techniques for AI to infer and adapt to diverse human preferences, avoiding misaligned behaviors like reward hacking.
- Ethical decision-making: Integrating moral philosophies into AI, such as handling trolley problems or cultural value differences.
- Goal specification: Encoding objectives (e.g., minimizing prediction error or maximizing reward) that fully capture intended behavior.
- Single/multi stakeholder validation: Developing faithful methods to translate human oversight into automated systems, avoiding unintended consequences like pandering or power-seeking. Balancing competing preferences through normative approaches while ensuring ethical/legal alignment (a minimal reward-modeling sketch follows this list).
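To ground the reward-modeling and preference-balancing items above, here is a minimal, illustrative sketch (not any particular lab's method) of fitting a Bradley-Terry style reward model to pairwise preferences pooled from multiple stakeholders. The feature vectors, stakeholder weights, and learning rate are all assumed for illustration.

```python
import numpy as np

# Each record: (features of preferred response, features of rejected response, stakeholder weight).
# Feature vectors and stakeholder weights are illustrative assumptions.
rng = np.random.default_rng(0)
dim = 8
data = [(rng.normal(size=dim), rng.normal(size=dim), w) for w in (1.0, 1.0, 0.5, 0.5)]

w_reward = np.zeros(dim)   # parameters of a linear reward model r(x) = w . x
lr = 0.1

for _ in range(200):       # gradient ascent on the weighted Bradley-Terry log-likelihood
    grad = np.zeros(dim)
    for x_pref, x_rej, weight in data:
        diff = x_pref - x_rej
        p_pref = 1.0 / (1.0 + np.exp(-w_reward @ diff))  # P(preferred beats rejected)
        grad += weight * (1.0 - p_pref) * diff           # weighted log-likelihood gradient
    w_reward += lr * grad / len(data)

def reward(x):
    """Scalar reward assigned to a response with feature vector x."""
    return float(w_reward @ x)

print(reward(rng.normal(size=dim)))  # score an arbitrary new response
```

In practice the reward model would be a neural network over model outputs, and the stakeholder weights are themselves a normative choice that the validation methods above are meant to scrutinize.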
AI Assurance
Focus: Building AI systems that perform safely and predictably under real-world conditions and follow their intended application, including edge cases, uncertainties, and adversarial scenarios.
Key Research Areas:
- Adversarial robustness: Defending against inputs designed to fool AI, like perturbed images in computer vision.
- Uncertainty quantification: Enabling AI to recognize unknowns and defer decisions (e.g., confidence calibration in models); a minimal deferral sketch follows this list.
- Interpretability (explainability): Designing systems whose decisions can be understood, translating low-level model internals into human-relatable explanations.
- Hallucination/Confabulation reduction: Reducing confabulations (hallucinations) through data curation or substantiation methods, distinguishing dishonesty from incompetence.
- Targeted model editing: Precise changes to correct unwanted tendencies like sycophancy, improving efficiency over fine-tuning.
- Avoiding hazardous capabilities: Minimizing agency, generality, or intelligence to reduce risks (e.g., minimally-agentic systems).
- Safety by design: Verifiable program synthesis, world models with formal guarantees, and compositional verification.
- Quantitative and formal verification: High-confidence assurances, including bounds on risk.
- Verifying safety methods: Methods to test safety interventions for issues like hidden backdoors.
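The uncertainty-quantification item above can be made concrete with a minimal sketch of entropy-based deferral: the model abstains and hands the decision to a human when its predictive distribution is too uncertain. The threshold value is an assumption for illustration and would normally be tuned on a calibration set.

```python
import numpy as np

def predict_or_defer(logits, max_entropy=0.5):
    """Return the predicted class, or None to defer when predictive entropy (in nats) is too high."""
    probs = np.exp(logits - logits.max())
    probs /= probs.sum()
    entropy = -np.sum(probs * np.log(probs + 1e-12))
    return None if entropy > max_entropy else int(np.argmax(probs))

# A confident prediction vs. a near-uniform case that is deferred to a human.
print(predict_or_defer(np.array([4.0, 0.1, 0.2])))  # -> 0
print(predict_or_defer(np.array([1.0, 1.1, 0.9])))  # -> None (defer)
```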
AI Security
Focus: Protecting AI systems from external threats and ensuring they don’t compromise user data or societal infrastructure.
Key Research Areas:
- Model security: Preventing attacks like data poisoning, backdoors, or model extraction/theft.
- Privacy-preserving techniques: Federated learning, differential privacy, and homomorphic encryption to safeguard data during training and inference.
- Cybersecurity integration: Hardening AI against integration risks in critical systems (such as autonomous vehicles or healthcare).
- Supply chain vulnerabilities: Securing open-source software components and hardware dependencies (a checksum-pinning sketch follows this list).
- Security evaluation: Enabling thorough audits while protecting IP, including preventing model theft.
- Resistance to harmful tampering/distillation: Robustness against weight manipulation or knowledge transfer to unauthorized models.
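As a small, concrete example of the supply-chain item above, the sketch below pins a downloaded model artifact to a known SHA-256 digest before loading it. The file name and digest are placeholders; a real pipeline would also verify signatures and provenance metadata.

```python
import hashlib
from pathlib import Path

def verify_artifact(path: str, expected_sha256: str) -> bool:
    """Return True only if the file's SHA-256 digest matches the pinned value."""
    digest = hashlib.sha256(Path(path).read_bytes()).hexdigest()
    return digest == expected_sha256.lower()

# Placeholder artifact name and digest; in practice the digest comes from a
# trusted manifest distributed separately from the artifact itself.
ARTIFACT = "model.safetensors"
PINNED_SHA256 = "0" * 64

if Path(ARTIFACT).exists() and not verify_artifact(ARTIFACT, PINNED_SHA256):
    raise RuntimeError(f"{ARTIFACT} failed its integrity check; refusing to load it.")
```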
AI Governance
Focus: Establishing frameworks for responsible development, deployment, and oversight at organizational, national, and global levels.
Key Research Areas:
- Regulation and standards: Designing laws, audits, and certification processes for AI safety (e.g., risk assessments for high-stakes applications).
- Organizational governance: Governance structures and organizational design for oversight across the AI lifecycle.
- International cooperation: Treaties and norms to prevent AI arms races or misuse in warfare.
- Deployment monitoring: Post-release surveillance, including anomaly detection and rollback mechanisms for deployed models, as well as broader ecosystem monitoring.
- Testing and validation: Developing benchmarks, red-teaming exercises, and simulation environments for pre- and post-deployment validation.
- Guardrails and control mechanisms: Runtime interventions, such as content filters, circuit breakers, or constitutional AI, to prevent harmful outputs (a minimal runtime filter sketch follows this list).
- AI risk assessment: AI risk assessment frameworks and approaches, including system safety, downstream impacts, dangerous capabilities, propensity, and loss-of-control risks.
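To illustrate the guardrails item above in its simplest form, here is a minimal sketch of a runtime output filter wrapped around an arbitrary generation function. The blocked patterns and the `generate` callable are illustrative assumptions, not a production content filter.

```python
import re
from typing import Callable

# Illustrative blocklist; a real guardrail would combine trained classifiers,
# policy rules, and human escalation rather than a few regexes.
BLOCKED_PATTERNS = [
    re.compile(r"\bhow to build a bomb\b", re.IGNORECASE),
    re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),  # US SSN-like pattern (PII)
]

def guarded_generate(prompt: str, generate: Callable[[str], str]) -> str:
    """Call the underlying model, then withhold the output if it matches a blocked pattern."""
    output = generate(prompt)
    if any(p.search(output) for p in BLOCKED_PATTERNS):
        return "[response withheld by safety filter]"
    return output

# Usage with a stand-in model:
print(guarded_generate("hello", lambda p: "My SSN is 123-45-6789"))
```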
Societal Impact and Long-Term Risks
Focus: Addressing broader implications of AI on society, economy, and humanity, including existential risks and equitable outcomes.
Key Research Areas:
- Fairness and bias mitigation: Tools to detect and correct disparities in AI decisions affecting marginalized groups (a small demographic-parity sketch follows this list).
- Economic and social disruption: Studying job displacement, inequality, and strategies for AI-human coexistence.
- Existential risk forecasting and prevention: Modeling scenarios like uncontrolled superintelligence and mitigation strategies (e.g., containment or multi-agent coordination).
- Human-AI symbiosis: Research on interfaces that enhance safety, such as explainable AI for better user trust and control.
- Societal resilience research: Adapting institutions and norms (e.g., economic, security) to AI systems acting as autonomous entities, including incident response for accidents and misuse.
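As a small illustration of the bias-detection tooling mentioned in the fairness item above, the sketch below computes the demographic parity difference of binary decisions across two groups; the decisions and group labels are synthetic.

```python
import numpy as np

def demographic_parity_difference(decisions: np.ndarray, groups: np.ndarray) -> float:
    """Absolute difference in positive-decision rates between the two groups."""
    rate_a = decisions[groups == 0].mean()
    rate_b = decisions[groups == 1].mean()
    return abs(float(rate_a - rate_b))

# Synthetic example: group 1 receives positive decisions far less often.
decisions = np.array([1, 1, 1, 0, 1, 0, 0, 0, 0, 0])
groups    = np.array([0, 0, 0, 0, 0, 1, 1, 1, 1, 1])
print(demographic_parity_difference(decisions, groups))  # 0.8 - 0.0 = 0.8
```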
Based on this framework, AI Governance is one pillar of AI Safety. To be truly “AI-Safe,” organizations must incorporate all five pillars into their strategies.
Additional References:
- Tigera – AI Safety
- CSET – Key Concepts in AI Safety
- Arxiv: AI Safety Foundations
- Singapore Consensus on AI Safety
Latest Research by Pillar
Below is a curated set of the latest research aligned with this framework:
| # | Pillar | Key Research Area | Value of this Research Area | Latest Research Date | Latest Research Paper Link | Summary of the Paper |
|---|---|---|---|---|---|---|
1 | AI Alignment | Scalable oversight: Methods to align advanced AI (e.g., AGI) with human goals, including reward modeling, inverse reinforcement learning, and debate mechanisms. | Enables effective human supervision of superhuman AI systems, ensuring alignment with complex goals and preventing deceptive or misaligned behaviors in advanced AI. | Mar 31, 2025 | Link | The paper introduces a benchmark for evaluating scalable oversight protocols, using a new agent score difference (ASD) metric to measure truth-telling over deception. It also provides a Python package and experiments benchmarking the Debate protocol, offering a framework for oversight mechanism comparisons. |
2 | AI Alignment | Value learning: Techniques for AI to infer and adapt to diverse human preferences, avoiding misaligned behaviors like reward hacking. | Allows AI to accurately learn and adapt to varied human values, reducing risks of unintended behaviors and ensuring ethical, preference-aligned actions across diverse contexts. | Aug 24, 2025 | Link | Investigates value learning using GPT-4.1 and Qwen3-32B on the “School of Reward Hacks” dataset. Shows models learn reward hacking but weakly generalize to broader misalignment. Contributions include dataset bias analysis and mixed dataset strategies to mitigate degradation. |
3 | AI Alignment | Ethical decision-making: Integrating moral philosophies into AI, such as handling trolley problems or cultural value differences. | Integrates diverse ethical frameworks into AI, enabling culturally sensitive and morally sound decisions in complex scenarios, reducing harm and promoting global fairness. | Sep 8, 2025 | Link | Evaluates 9 LLMs across 50,400 ethical dilemma trials, finding significant biases in decision-making patterns. Highlights differences between open-source and closed-source models and calls for multi-dimensional fairness evaluations. |
4 | AI Alignment | Goal specification: Encoding objectives (e.g., minimizing prediction error or maximizing reward) that fully capture intended behavior. | Ensures AI objectives precisely match human intentions, preventing misspecification that could lead to harmful or unintended outcomes in high-stakes environments. | Feb 24, 2025 | Link | Proposes a “Scientist AI” that avoids goal-directedness by focusing on causal theories and probabilistic inference. Ensures safety by eliminating misaligned objectives while accelerating scientific progress. |
5 | AI Alignment | Single/multi stakeholder validation: Translating human oversight into automated systems, balancing preferences and ensuring ethical/legal alignment. | Facilitates reliable translation of diverse stakeholder inputs into AI systems, preventing biases or power imbalances in multi-party contexts. | Feb 9, 2025 | Link | Advocates adaptive interpretation of Helpful, Honest, Harmless (HHH) principle. Proposes a context-aware framework with prioritization and benchmarking standards for stakeholder validation. |
6 | AI Assurance | Adversarial robustness: Defending against inputs designed to fool AI, like perturbed images in computer vision. | Protects AI systems from manipulative inputs, ensuring reliable performance in adversarial environments and maintaining trust in critical applications. | n.a. | n.a. | n.a. |
7 | AI Assurance | Uncertainty quantification: Enabling AI to recognize unknowns and defer decisions. | Improves AI reliability by quantifying confidence in predictions, allowing safe deferral in uncertain scenarios. | Mar 20, 2025 | Link | Provides taxonomy for uncertainty quantification (UQ) methods in LLMs, reviews approaches across input/reasoning/parameter/prediction uncertainty, and highlights open challenges for real-world safety. |
8 | AI Assurance | Interpretability (explainability): Designing systems with human-relatable explanations. | Enhances user trust and accountability by making AI decisions transparent, enabling oversight and error detection. | Sep 13, 2025 | Link | Distinguishes interpretability (global understanding) from explainability (local post-hoc). Demonstrates with SHAP and LIME on MNIST and IMDB datasets. Highlights both approaches as essential. |
9 | AI Assurance | Hallucination/Confabulation reduction. | Minimizes fabricated outputs in AI, improving factual reliability and trustworthiness in information-dependent applications. | Sep 14, 2025 | Link | Proves hallucinations are inevitable in LLMs due to information aggregation structures. Introduces semantic measures and Jensen gap to explain overconfidence. Suggests controlled hallucinations as essential for intelligence. |
10 | AI Assurance | Targeted model editing: Precise changes to correct unwanted tendencies. | Allows precise correction of flaws without retraining, enhancing safety while preserving performance. | Sep 26, 2025 | Link | Analyzes sycophancy in LLMs, decomposing behaviors into distinct latent directions. Shows these can be independently suppressed or amplified, enabling safe targeted model editing. |
11 | AI Assurance | Safety by design: Formal guarantees, verifiable synthesis, compositional verification. | Embeds safety into AI from the start, providing verifiable guarantees to minimize risks. | Sep 4, 2025 | Link | Introduces a co-evolutionary safety framework (R^2AI), inspired by biological immunity, to evolve safety alongside advancing AI capabilities. Includes safety tunnels and feedback loops. |
12 | AI Assurance | Verifying safety methods: Testing interventions for hidden backdoors. | Ensures interventions are effective and reliable, detecting hidden vulnerabilities. | Sep 21, 2025 | Link | Proposes a stealthy backdoor poisoning attack via harmless Q&A pairs. Demonstrates resilience against guardrails and introduces optimization strategy for triggers. Highlights new safety verification needs. |
13 | AI Security | Model security: Preventing attacks like data poisoning, backdoors, or theft. | Safeguards AI models from malicious manipulations, preserving integrity and preventing unauthorized access. | Jul 5, 2025 | Link | Shows vulnerabilities of continual learning models to single-task data poisoning. Proposes defenses including task vector-based detection. Code released for reproducibility. |
14 | AI Security | Privacy-preserving techniques: Federated learning, differential privacy, homomorphic encryption. | Protects sensitive data, enabling secure collaboration and compliance without compromising utility. | Jun 13, 2025 | Link | Reviews differential privacy (DP) from symbolic AI to LLMs, exploring DP-SGD, objective/output perturbation. Highlights privacy-utility trade-offs and correlated data issues. |
15 | AI Security | Cybersecurity integration: Hardening AI in critical systems. | Strengthens AI resilience in vital infrastructure, ensuring safe, uninterrupted operation in high-risk domains. | Jun 30, 2025 | Link | Clarifies automated vs autonomous AI in cybersecurity with a 6-level taxonomy. Calls for transparency in capability disclosure to maintain necessary human oversight. |
16 | AI Security | Supply chain vulnerabilities: Securing open-source components and hardware. | Secures AI pipelines, preventing exploitation via compromised dependencies. | Sep 21, 2025 | Link | Examines vulnerabilities in AV software supply chains, analyzing Autoware, Apollo, and openpilot. Identifies security flaws and calls for early adoption of best practices. |
17 | AI Security | Security evaluation: Audits while protecting IP. | Enables secure and verifiable AI system assessments, balancing transparency with IP protection. | Jun 30, 2025 | Link | Proposes “Attestable Audits” using Trusted Execution Environments (TEEs). Demonstrates feasibility on Llama-3.1 benchmarks with acceptable overhead, ensuring confidential yet verifiable auditing. |
18 | AI Security | Resistance to tampering/distillation: Protecting against weight manipulation or knowledge transfer. | Prevents unauthorized modifications or extractions, preserving model integrity. | Aug 4, 2025 | Link | Shows pretraining data filtering improves tamper resistance against adversarial fine-tuning. Introduces multi-stage filtering pipeline, complementary defenses, and releases models. |
19 | AI Governance | Regulatory and standards: Laws, audits, certifications for AI safety. | Provides structured oversight for compliance, mitigating risks and fostering public trust. | Sep 14, 2025 | Link | Proposes a five-layer AI governance framework integrating regulations, standards, and certification. Validated with fairness and incident reporting case studies. |
20 | AI Governance | Organizational governance: Structures across AI lifecycle. | Establishes internal frameworks for ethical AI development, ensuring accountability. | May 29, 2025 | Link | Reviews 9 secondary studies on AI governance, finding EU AI Act and NIST RMF most cited. Highlights gaps in empirical validation and inclusivity. |
21 | AI Governance | International cooperation: Treaties and norms to prevent AI misuse. | Promotes global standards to avert AI arms races, fostering collaborative stability. | Aug 1, 2025 | Link | Discusses geopolitical divergence in AI ethics between EU and China. Advocates harmonized global cooperation and equitable access to AI infrastructure. |
22 | AI Governance | Deployment monitoring: Surveillance, anomaly detection, rollback. | Enables ongoing oversight of deployed AI, detecting issues early and allowing interventions. | Jul 21, 2025 | Link | Shows Agentic AI’s capability in anomaly detection and autonomous interventions. Advocates transition toward fully autonomous monitoring in complex deployments. |
23 | AI Governance | Testing and validation: Benchmarks, red-teaming, simulations. | Validates AI against potential failures pre- and post-deployment. | Jun 17, 2025 | Link | Introduces AIRTBench with 70 CTF-style challenges to benchmark LMs’ autonomous red teaming. Finds frontier models outperform open-source ones in exploiting vulnerabilities. |
24 | AI Governance | Guardrails and control mechanisms: Runtime interventions to prevent harmful outputs. | Implements real-time safeguards, enhancing AI safety without constant human intervention. | Sep 17, 2025 | Link | Analyzes GPT-4o mini’s multimodal safety filters, identifying unimodal bottlenecks and brittleness. Calls for integrated, context-aware safety strategies. |
25 | AI Governance | AI risk assessment: Frameworks for system safety, downstream impacts, and loss-of-control. | Provides systematic risk evaluation across lifecycle, informing mitigation strategies. | Jul 13, 2025 | Link | Mixed-method study on downstream developers’ AI safety perspectives. Identifies poor practices in PTM selection and preparation, highlighting documentation gaps. |
26 | Societal Impact & Long-Term Risks | Fairness and bias mitigation: Tools to detect/correct disparities. | Addresses inequities in AI outcomes, promoting social justice. | Feb 25, 2025 | Link | Argues fairness is not equal to unbiasedness, but equitable differentiation. Identifies technical/social dimensions of bias and suggests better frameworks for fairness discourse. |
27 | Societal Impact & Long-Term Risks | Economic and social disruption: Job displacement, inequality, AI-human coexistence. | Analyzes AI’s socioeconomic effects, guiding policies for equitable integration. | Jul 11, 2025 | Link | Finds higher AI exposure reduces employment and work hours, disproportionately affecting certain demographics. Introduces Occupational AI Exposure Score for monitoring. |
28 | Societal Impact & Long-Term Risks | Existential risk forecasting and prevention. | Forecasts catastrophic AI risks, safeguarding humanity’s future. | Jan 17, 2025 | Link | Defines decisive vs accumulative AI existential risks. Proposes accumulative x-risk hypothesis and “perfect storm MISTER” scenario. Suggests systems analysis for long-term safety. |
29 | Societal Impact & Long-Term Risks | Human-AI symbiosis: Interfaces enhancing safety and trust. | Fosters collaborative human-AI relationships, improving oversight and transparency. | Jun 3, 2025 | Link | Proposes Symbiotic Epistemology and SynLang protocol for transparent human-AI reasoning. Enhances trust via explicit reasoning and confidence quantification. |
30 | Societal Impact & Long-Term Risks | Societal resilience research: Adapting institutions/norms to AI. | Builds societal capacity to withstand AI disruptions and misuse. | Jan 23, 2025 | Link | Argues for societal adaptation strategies (avoidance, defense, remedy) to mitigate AI-induced harms. Highlights cases like election manipulation and cyberterrorism. |