AI Safety - The Broader Framework Beyond Governance
Overview
AI Safety is emerging as one of the most critical areas in technology today. While AI Governance often gets the spotlight, governance is just one part of a much broader framework. AI Safety encompasses alignment, assurance, security, governance, and societal impact, ensuring AI systems remain safe, beneficial, and aligned with human interests as they scale.
Why It Matters
As organizations scale their AI use cases both internally and externally, and with billions of dollars being invested to embed AI into everyday life, the stakes are rising.
AI is powerful, but like any technology, it can be misused or cause unintended harm. With ambitions to make AI increasingly autonomous, we must also address its agency: how it makes decisions, how it aligns with human values, and how we build the necessary controls around it.
This is why AI Safety sits at the center of scaling AI.
AI Safety: An Under-Invested Priority
The imbalance between AI investment and AI Safety investment is striking.
- According to the Stanford HAI 2025 AI Index Report and IDC, global organizations invested $250+ billion USD in AI use cases and applications in 2024, with growth expected in the years ahead.
- In contrast, estimates from LessWrong and U.S. federal NSF funding show AI Safety investment is only around $3–5 billion USD.
Taking these figures at face value, this means that for every $50–85 spent on AI development, only about $1 goes into AI Safety.
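A quick back-of-the-envelope check of that ratio, using only the figures cited above (a rough sketch, not precise accounting):

```python
# Rough ratio of global AI investment to AI Safety investment,
# using the 2024 figures cited above (billions of USD).
ai_investment = 250              # Stanford HAI / IDC estimate (lower bound)
safety_low, safety_high = 3, 5   # LessWrong / NSF-based estimate range

print(f"roughly $1 on safety for every ${ai_investment / safety_high:.0f}-"
      f"{ai_investment / safety_low:.0f} on AI development")
# -> roughly $1 on safety for every $50-83 on AI development
```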
What does this mean?
With such a gap, AI capabilities will advance exponentially, but AI Safety measures will lag far behind — leaving us with limited frameworks, processes, and tools to ensure systems remain safe and aligned.
References:
- Stanford HAI 2025 AI Index Report
- IDC AI Investment Report
- LessWrong AI Safety Funding Overview
- Federal Budget IQ – AI and IT R&D Spending
What Constitutes AI Safety?
Based on my readings and research, I propose a five-pillar framework for AI Safety:
AI Alignment
Focus: Ensuring AI systems understand, internalize, and act in accordance with human values, intentions, and ethical principles, especially as AI becomes more autonomous.
Key Research Areas:
- Scalable oversight: Methods to align advanced AI (e.g., AGI) with human goals, including reward modeling, inverse reinforcement learning, and debate mechanisms.
- Value learning: Techniques for AI to infer and adapt to diverse human preferences, avoiding misaligned behaviors like reward hacking.
- Ethical decision-making: Integrating moral philosophies into AI, such as handling trolley problems or cultural value differences.
- Goal specification: Encoding objectives (e.g., minimizing prediction error or maximizing reward) that fully capture intended behavior.
- Single/multi stakeholder validation: Developing faithful methods to translate human oversight into automated systems, avoiding unintended consequences like pandering or power-seeking. Balancing competing preferences through normative approaches while ensuring ethical/legal alignment (a minimal reward-modeling sketch follows this list).
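To ground the reward-modeling and preference-balancing items above, here is a minimal, illustrative sketch (not any particular lab's method) of fitting a Bradley-Terry style reward model to pairwise preferences pooled from multiple stakeholders. The feature vectors, stakeholder weights, and learning rate are all assumed for illustration.

```python
import numpy as np

# Each record: (features of preferred response, features of rejected response, stakeholder weight).
# Feature vectors and stakeholder weights are illustrative assumptions.
rng = np.random.default_rng(0)
dim = 8
data = [(rng.normal(size=dim), rng.normal(size=dim), w) for w in (1.0, 1.0, 0.5, 0.5)]

w_reward = np.zeros(dim)   # parameters of a linear reward model r(x) = w . x
lr = 0.1

for _ in range(200):       # gradient ascent on the weighted Bradley-Terry log-likelihood
    grad = np.zeros(dim)
    for x_pref, x_rej, weight in data:
        diff = x_pref - x_rej
        p_pref = 1.0 / (1.0 + np.exp(-w_reward @ diff))  # P(preferred beats rejected)
        grad += weight * (1.0 - p_pref) * diff           # weighted log-likelihood gradient
    w_reward += lr * grad / len(data)

def reward(x):
    """Scalar reward assigned to a response with feature vector x."""
    return float(w_reward @ x)

print(reward(rng.normal(size=dim)))  # score an arbitrary new response
```

In practice the reward model would be a neural network over model outputs, and the stakeholder weights are themselves a normative choice that the validation methods above are meant to scrutinize.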
AI Assurance
Focus: Building AI systems that perform safely and predictably under real-world conditions and follow their intended application, including edge cases, uncertainties, and adversarial scenarios.
Key Research Areas:
- Adversarial robustness: Defending against inputs designed to fool AI, like perturbed images in computer vision.
- Uncertainty quantification: Enabling AI to recognize unknowns and defer decisions (e.g., confidence calibration in models); a minimal deferral sketch follows this list.
- Interpretability (explainability): Designing systems whose decisions can be understood, translating low-level model internals into human-relatable explanations.
- Hallucination/Confabulation reduction: Reducing confabulations (hallucinations) through data curation or substantiation methods, distinguishing dishonesty from incompetence.
- Targeted model editing: Precise changes to correct unwanted tendencies like sycophancy, improving efficiency over fine-tuning.
- Avoiding hazardous capabilities: Minimizing agency, generality, or intelligence to reduce risks (e.g., minimally-agentic systems).
- Safety by design: Verifiable program synthesis, world models with formal guarantees, and compositional verification.
- Quantitative and formal verification: High-confidence assurances, including bounds on risk.
- Verifying safety methods: Methods to test safety interventions for issues like hidden backdoors.
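The uncertainty-quantification item above can be made concrete with a minimal sketch of entropy-based deferral: the model abstains and hands the decision to a human when its predictive distribution is too uncertain. The threshold value is an assumption for illustration and would normally be tuned on a calibration set.

```python
import numpy as np

def predict_or_defer(logits, max_entropy=0.5):
    """Return the predicted class, or None to defer when predictive entropy (in nats) is too high."""
    probs = np.exp(logits - logits.max())
    probs /= probs.sum()
    entropy = -np.sum(probs * np.log(probs + 1e-12))
    return None if entropy > max_entropy else int(np.argmax(probs))

# A confident prediction vs. a near-uniform case that is deferred to a human.
print(predict_or_defer(np.array([4.0, 0.1, 0.2])))  # -> 0
print(predict_or_defer(np.array([1.0, 1.1, 0.9])))  # -> None (defer)
```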
AI Security
Focus: Protecting AI systems from external threats and ensuring they don’t compromise user data or societal infrastructure.
Key Research Areas:
- Model security: Preventing attacks like data poisoning, backdoors, or model extraction/theft.
- Privacy-preserving techniques: Federated learning, differential privacy, and homomorphic encryption to safeguard data during training and inference.
- Cybersecurity integration: Hardening AI against integration risks in critical systems (such as autonomous vehicles or healthcare).
- Supply chain vulnerabilities: Securing open-source software components and hardware dependencies (a checksum-pinning sketch follows this list).
- Security evaluation: Enabling thorough audits while protecting IP, including preventing model theft.
- Resistance to harmful tampering/distillation: Robustness against weight manipulation or knowledge transfer to unauthorized models.
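As a small, concrete example of the supply-chain item above, the sketch below pins a downloaded model artifact to a known SHA-256 digest before loading it. The file name and digest are placeholders; a real pipeline would also verify signatures and provenance metadata.

```python
import hashlib
from pathlib import Path

def verify_artifact(path: str, expected_sha256: str) -> bool:
    """Return True only if the file's SHA-256 digest matches the pinned value."""
    digest = hashlib.sha256(Path(path).read_bytes()).hexdigest()
    return digest == expected_sha256.lower()

# Placeholder artifact name and digest; in practice the digest comes from a
# trusted manifest distributed separately from the artifact itself.
ARTIFACT = "model.safetensors"
PINNED_SHA256 = "0" * 64

if Path(ARTIFACT).exists() and not verify_artifact(ARTIFACT, PINNED_SHA256):
    raise RuntimeError(f"{ARTIFACT} failed its integrity check; refusing to load it.")
```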
AI Governance
Focus: Establishing frameworks for responsible development, deployment, and oversight at organizational, national, and global levels.
Key Research Areas:
- Regulation and standards: Designing laws, audits, and certification processes for AI safety (e.g., risk assessments for high-stakes applications).
- Organizational governance: Governance structures and organizational design for oversight across the AI lifecycle.
- International cooperation: Treaties and norms to prevent AI arms races or misuse in warfare.
- Deployment monitoring: Post-release surveillance, including anomaly detection and rollback mechanisms for deployed models, as well as broader ecosystem monitoring.
- Testing and validation: Developing benchmarks, red-teaming exercises, and simulation environments for pre- and post-deployment validation.
- Guardrails and control mechanisms: Runtime interventions, such as content filters, circuit breakers, or constitutional AI, to prevent harmful outputs (a minimal runtime filter sketch follows this list).
- AI risk assessment: AI risk assessment frameworks and approaches, including system safety, downstream impacts, dangerous capabilities, propensity, and loss-of-control risks.
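To illustrate the guardrails item above in its simplest form, here is a minimal sketch of a runtime output filter wrapped around an arbitrary generation function. The blocked patterns and the `generate` callable are illustrative assumptions, not a production content filter.

```python
import re
from typing import Callable

# Illustrative blocklist; a real guardrail would combine trained classifiers,
# policy rules, and human escalation rather than a few regexes.
BLOCKED_PATTERNS = [
    re.compile(r"\bhow to build a bomb\b", re.IGNORECASE),
    re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),  # US SSN-like pattern (PII)
]

def guarded_generate(prompt: str, generate: Callable[[str], str]) -> str:
    """Call the underlying model, then withhold the output if it matches a blocked pattern."""
    output = generate(prompt)
    if any(p.search(output) for p in BLOCKED_PATTERNS):
        return "[response withheld by safety filter]"
    return output

# Usage with a stand-in model:
print(guarded_generate("hello", lambda p: "My SSN is 123-45-6789"))
```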
Societal Impact and Long-Term Risks
Focus: Addressing broader implications of AI on society, economy, and humanity, including existential risks and equitable outcomes.
Key Research Areas:
- Fairness and bias mitigation: Tools to detect and correct disparities in AI decisions affecting marginalized groups (a small demographic-parity sketch follows this list).
- Economic and social disruption: Studying job displacement, inequality, and strategies for AI-human coexistence.
- Existential risk forecasting and prevention: Modeling scenarios like uncontrolled superintelligence and mitigation strategies (e.g., containment or multi-agent coordination).
- Human-AI symbiosis: Research on interfaces that enhance safety, such as explainable AI for better user trust and control.
- Societal resilience research: Adapting institutions and norms (e.g., economic, security) to AI systems acting as autonomous entities, including incident response for accidents and misuse.
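As a small illustration of the bias-detection tooling mentioned in the fairness item above, the sketch below computes the demographic parity difference of binary decisions across two groups; the decisions and group labels are synthetic.

```python
import numpy as np

def demographic_parity_difference(decisions: np.ndarray, groups: np.ndarray) -> float:
    """Absolute difference in positive-decision rates between the two groups."""
    rate_a = decisions[groups == 0].mean()
    rate_b = decisions[groups == 1].mean()
    return abs(float(rate_a - rate_b))

# Synthetic example: group 1 receives positive decisions far less often.
decisions = np.array([1, 1, 1, 0, 1, 0, 0, 0, 0, 0])
groups    = np.array([0, 0, 0, 0, 0, 1, 1, 1, 1, 1])
print(demographic_parity_difference(decisions, groups))  # 0.8 - 0.0 = 0.8
```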
Based on this framework, AI Governance is one pillar of AI Safety. To be truly “AI-Safe,” organizations must incorporate all five pillars into their strategies.
Additional References:
- Tigera – AI Safety
- CSET – Key Concepts in AI Safety
- Arxiv: AI Safety Foundations
- Singapore Consensus on AI Safety
Latest Research by Pillar
Below is a curated set of the latest research aligned with this framework:
| # | Pillar | Key Research Area | Value of this Research Area | Latest Research Date | Latest Research Paper Link | Summary of the Paper |
|---|---|---|---|---|---|---|
1 | AI Alignment | Scalable oversight: Methods to align advanced AI (e.g., AGI) with human goals, including reward modeling, inverse reinforcement learning, and debate mechanisms. | Enables effective human supervision of superhuman AI systems, ensuring alignment with complex goals and preventing deceptive or misaligned behaviors in advanced AI. | Mar 31, 2025 | Link | The paper introduces a benchmark for evaluating scalable oversight protocols, using a new agent score difference (ASD) metric to measure truth-telling over deception. It also provides a Python package and experiments benchmarking the Debate protocol, offering a framework for oversight mechanism comparisons. |
2 | AI Alignment | Value learning: Techniques for AI to infer and adapt to diverse human preferences, avoiding misaligned behaviors like reward hacking. | Allows AI to accurately learn and adapt to varied human values, reducing risks of unintended behaviors and ensuring ethical, preference-aligned actions across diverse contexts. | Aug 24, 2025 | Link | Investigates value learning using GPT-4.1 and Qwen3-32B on the “School of Reward Hacks” dataset. Shows models learn reward hacking but weakly generalize to broader misalignment. Contributions include dataset bias analysis and mixed dataset strategies to mitigate degradation. |
3 | AI Alignment | Ethical decision-making: Integrating moral philosophies into AI, such as handling trolley problems or cultural value differences. | Integrates diverse ethical frameworks into AI, enabling culturally sensitive and morally sound decisions in complex scenarios, reducing harm and promoting global fairness. | Sep 8, 2025 | Link | Evaluates 9 LLMs across 50,400 ethical dilemma trials, finding significant biases in decision-making patterns. Highlights differences between open-source and closed-source models and calls for multi-dimensional fairness evaluations. |
4 | AI Alignment | Goal specification: Encoding objectives (e.g., minimizing prediction error or maximizing reward) that fully capture intended behavior. | Ensures AI objectives precisely match human intentions, preventing misspecification that could lead to harmful or unintended outcomes in high-stakes environments. | Feb 24, 2025 | Link | Proposes a “Scientist AI” that avoids goal-directedness by focusing on causal theories and probabilistic inference. Ensures safety by eliminating misaligned objectives while accelerating scientific progress. |
5 | AI Alignment | Single/multi stakeholder validation: Translating human oversight into automated systems, balancing preferences and ensuring ethical/legal alignment. | Facilitates reliable translation of diverse stakeholder inputs into AI systems, preventing biases or power imbalances in multi-party contexts. | Feb 9, 2025 | Link | Advocates adaptive interpretation of Helpful, Honest, Harmless (HHH) principle. Proposes a context-aware framework with prioritization and benchmarking standards for stakeholder validation. |
6 | AI Assurance | Adversarial robustness: Defending against inputs designed to fool AI, like perturbed images in computer vision. | Protects AI systems from manipulative inputs, ensuring reliable performance in adversarial environments and maintaining trust in critical applications. | n.a. | n.a. | n.a. |
7 | AI Assurance | Uncertainty quantification: Enabling AI to recognize unknowns and defer decisions. | Improves AI reliability by quantifying confidence in predictions, allowing safe deferral in uncertain scenarios. | Mar 20, 2025 | Link | Provides taxonomy for uncertainty quantification (UQ) methods in LLMs, reviews approaches across input/reasoning/parameter/prediction uncertainty, and highlights open challenges for real-world safety. |
8 | AI Assurance | Interpretability (explainability): Designing systems with human-relatable explanations. | Enhances user trust and accountability by making AI decisions transparent, enabling oversight and error detection. | Sep 13, 2025 | Link | Distinguishes interpretability (global understanding) from explainability (local post-hoc). Demonstrates with SHAP and LIME on MNIST and IMDB datasets. Highlights both approaches as essential. |
9 | AI Assurance | Hallucination/Confabulation reduction. | Minimizes fabricated outputs in AI, improving factual reliability and trustworthiness in information-dependent applications. | Sep 14, 2025 | Link | Proves hallucinations are inevitable in LLMs due to information aggregation structures. Introduces semantic measures and Jensen gap to explain overconfidence. Suggests controlled hallucinations as essential for intelligence. |
10 | AI Assurance | Targeted model editing: Precise changes to correct unwanted tendencies. | Allows precise correction of flaws without retraining, enhancing safety while preserving performance. | Sep 26, 2025 | Link | Analyzes sycophancy in LLMs, decomposing behaviors into distinct latent directions. Shows these can be independently suppressed or amplified, enabling safe targeted model editing. |
11 | AI Assurance | Safety by design: Formal guarantees, verifiable synthesis, compositional verification. | Embeds safety into AI from the start, providing verifiable guarantees to minimize risks. | Sep 4, 2025 | Link | Introduces a co-evolutionary safety framework (R^2AI), inspired by biological immunity, to evolve safety alongside advancing AI capabilities. Includes safety tunnels and feedback loops. |
12 | AI Assurance | Verifying safety methods: Testing interventions for hidden backdoors. | Ensures interventions are effective and reliable, detecting hidden vulnerabilities. | Sep 21, 2025 | Link | Proposes a stealthy backdoor poisoning attack via harmless Q&A pairs. Demonstrates resilience against guardrails and introduces optimization strategy for triggers. Highlights new safety verification needs. |
13 | AI Security | Model security: Preventing attacks like data poisoning, backdoors, or theft. | Safeguards AI models from malicious manipulations, preserving integrity and preventing unauthorized access. | Jul 5, 2025 | Link | Shows vulnerabilities of continual learning models to single-task data poisoning. Proposes defenses including task vector-based detection. Code released for reproducibility. |
14 | AI Security | Privacy-preserving techniques: Federated learning, differential privacy, homomorphic encryption. | Protects sensitive data, enabling secure collaboration and compliance without compromising utility. | Jun 13, 2025 | Link | Reviews differential privacy (DP) from symbolic AI to LLMs, exploring DP-SGD, objective/output perturbation. Highlights privacy-utility trade-offs and correlated data issues. |
15 | AI Security | Cybersecurity integration: Hardening AI in critical systems. | Strengthens AI resilience in vital infrastructure, ensuring safe, uninterrupted operation in high-risk domains. | Jun 30, 2025 | Link | Clarifies automated vs autonomous AI in cybersecurity with a 6-level taxonomy. Calls for transparency in capability disclosure to maintain necessary human oversight. |
16 | AI Security | Supply chain vulnerabilities: Securing open-source components and hardware. | Secures AI pipelines, preventing exploitation via compromised dependencies. | Sep 21, 2025 | Link | Examines vulnerabilities in AV software supply chains, analyzing Autoware, Apollo, and openpilot. Identifies security flaws and calls for early adoption of best practices. |
17 | AI Security | Security evaluation: Audits while protecting IP. | Enables secure and verifiable AI system assessments, balancing transparency with IP protection. | Jun 30, 2025 | Link | Proposes “Attestable Audits” using Trusted Execution Environments (TEEs). Demonstrates feasibility on Llama-3.1 benchmarks with acceptable overhead, ensuring confidential yet verifiable auditing. |
18 | AI Security | Resistance to tampering/distillation: Protecting against weight manipulation or knowledge transfer. | Prevents unauthorized modifications or extractions, preserving model integrity. | Aug 4, 2025 | Link | Shows pretraining data filtering improves tamper resistance against adversarial fine-tuning. Introduces multi-stage filtering pipeline, complementary defenses, and releases models. |
19 | AI Governance | Regulatory and standards: Laws, audits, certifications for AI safety. | Provides structured oversight for compliance, mitigating risks and fostering public trust. | Sep 14, 2025 | Link | Proposes a five-layer AI governance framework integrating regulations, standards, and certification. Validated with fairness and incident reporting case studies. |
20 | AI Governance | Organizational governance: Structures across AI lifecycle. | Establishes internal frameworks for ethical AI development, ensuring accountability. | May 29, 2025 | Link | Reviews 9 secondary studies on AI governance, finding EU AI Act and NIST RMF most cited. Highlights gaps in empirical validation and inclusivity. |
21 | AI Governance | International cooperation: Treaties and norms to prevent AI misuse. | Promotes global standards to avert AI arms races, fostering collaborative stability. | Aug 1, 2025 | Link | Discusses geopolitical divergence in AI ethics between EU and China. Advocates harmonized global cooperation and equitable access to AI infrastructure. |
22 | AI Governance | Deployment monitoring: Surveillance, anomaly detection, rollback. | Enables ongoing oversight of deployed AI, detecting issues early and allowing interventions. | Jul 21, 2025 | Link | Shows Agentic AI’s capability in anomaly detection and autonomous interventions. Advocates transition toward fully autonomous monitoring in complex deployments. |
23 | AI Governance | Testing and validation: Benchmarks, red-teaming, simulations. | Validates AI against potential failures pre- and post-deployment. | Jun 17, 2025 | Link | Introduces AIRTBench with 70 CTF-style challenges to benchmark LMs’ autonomous red teaming. Finds frontier models outperform open-source ones in exploiting vulnerabilities. |
24 | AI Governance | Guardrails and control mechanisms: Runtime interventions to prevent harmful outputs. | Implements real-time safeguards, enhancing AI safety without constant human intervention. | Sep 17, 2025 | Link | Analyzes GPT-4o mini’s multimodal safety filters, identifying unimodal bottlenecks and brittleness. Calls for integrated, context-aware safety strategies. |
25 | AI Governance | AI risk assessment: Frameworks for system safety, downstream impacts, and loss-of-control. | Provides systematic risk evaluation across lifecycle, informing mitigation strategies. | Jul 13, 2025 | Link | Mixed-method study on downstream developers’ AI safety perspectives. Identifies poor practices in PTM selection and preparation, highlighting documentation gaps. |
26 | Societal Impact & Long-Term Risks | Fairness and bias mitigation: Tools to detect/correct disparities. | Addresses inequities in AI outcomes, promoting social justice. | Feb 25, 2025 | Link | Argues fairness is not equal to unbiasedness, but equitable differentiation. Identifies technical/social dimensions of bias and suggests better frameworks for fairness discourse. |
27 | Societal Impact & Long-Term Risks | Economic and social disruption: Job displacement, inequality, AI-human coexistence. | Analyzes AI’s socioeconomic effects, guiding policies for equitable integration. | Jul 11, 2025 | Link | Finds higher AI exposure reduces employment and work hours, disproportionately affecting certain demographics. Introduces Occupational AI Exposure Score for monitoring. |
28 | Societal Impact & Long-Term Risks | Existential risk forecasting and prevention. | Forecasts catastrophic AI risks, safeguarding humanity’s future. | Jan 17, 2025 | Link | Defines decisive vs accumulative AI existential risks. Proposes accumulative x-risk hypothesis and “perfect storm MISTER” scenario. Suggests systems analysis for long-term safety. |
29 | Societal Impact & Long-Term Risks | Human-AI symbiosis: Interfaces enhancing safety and trust. | Fosters collaborative human-AI relationships, improving oversight and transparency. | Jun 3, 2025 | Link | Proposes Symbiotic Epistemology and SynLang protocol for transparent human-AI reasoning. Enhances trust via explicit reasoning and confidence quantification. |
30 | Societal Impact & Long-Term Risks | Societal resilience research: Adapting institutions/norms to AI. | Builds societal capacity to withstand AI disruptions and misuse. | Jan 23, 2025 | Link | Argues for societal adaptation strategies (avoidance, defense, remedy) to mitigate AI-induced harms. Highlights cases like election manipulation and cyberterrorism. |