Regulations and Explorations in AI Safety Control - Efforts, Challenges, & Future Directions

The question of AI posing an existential risk to humanity has been sparked by the recent development of transformative AI technologies, from chatbots and autonomous vehicles to advanced healthcare diagnostics. The rapid integration of AI, combined with the opaque nature of its deep-learning-based internal algorithms, has raised significant concerns across academia, industry, government, and the general public about its safety and robustness.

Among these, a universally strong concern is unsafe output from real-world applications built on large language models (LLMs), which requires robust controls at multiple levels. In this post, I will first break down the complexities of AI safety issues, then briefly introduce current global regulations and standards on AI safety from both government agencies and industry, along with several challenges each stakeholder faces. Lastly, this post will discuss future directions of AI governance to strengthen collaboration between different stakeholders in the AI ecosystem for effective implementation.

I. Complexity of AI Safety

LLMs are pre-trained on massive amounts of online data, so they inevitably acquire harmful behaviors and biases from the Internet while attaining their power on language processing tasks. The complexity of controlling such unsafe output and behavior is rooted not only in the LLM pre-training process and black-box internal structure, but is also amplified by several real-world factors, including:

  • There exist multiple forms of unsafe content, including offensiveness, abusiveness, hate speech, biases, stereotypes, cyberbullying, and identity attacks, each potentially requiring distinct handling approaches.
  • Diverse and dynamic societal perspectives may lead to varying and evolving interpretations of the definition and categorization of such unsafe content, which poses a challenge for establishing a widely accepted standard for AI governance and for its implementation.
  • The emergent capabilities of LLMs via scaling (Wei et al., 2022) pose new challenges to existing AI governance, such as deception and optimization, as analysed by Bounded Regret in his blog.
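To make the first factor concrete, one could imagine representing the unsafe-content categories above as a small taxonomy that routes each category to its own handling approach. This is a toy illustration only; the category names follow this post's list, while the handler names are hypothetical placeholders, not a real moderation system.

```python
# Toy taxonomy mapping each unsafe-content category (from the list above)
# to a distinct handling approach. Handler names are made up for illustration.
UNSAFE_CONTENT_TAXONOMY = {
    "offensiveness": "lexicon_filter",
    "abusiveness": "classifier_flag",
    "hate_speech": "classifier_flag",
    "biases": "distribution_debiasing",
    "stereotypes": "distribution_debiasing",
    "cyberbullying": "human_review",
    "identity_attacks": "human_review",
}

def route(category: str) -> str:
    """Return the handling approach registered for a category."""
    return UNSAFE_CONTENT_TAXONOMY.get(category, "default_moderation")
```

Even this trivial sketch surfaces the second factor: which categories exist, and which handler each deserves, is exactly what evolving societal perspectives make hard to standardize.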

Therefore, AI safety is a complex issue requiring sustained and careful effort. Notably, the first six listed types of unsafe content (i.e., offensiveness, abusiveness, hate speech, biases, stereotypes, and cyberbullying) are sometimes discussed under the umbrella term “toxicity” in LLM safety research.

For a more concrete understanding, refer to the following literature concerning toxicity (listed by Lilian Weng).

  • [Perspective API] describes its main attribute as TOXICITY, “a rude, disrespectful, or unreasonable comment that is likely to make you leave a discussion”.

  • [Kurita et al. 2019]: “Toxic content on the internet prevents the constructive exchange of ideas, excludes sensitive individuals from online dialogue, and inflicts mental and physical health impacts on the recipients.” Notable examples include hate speech and profanity.

  • [Pavlopoulos et al. 2020] employs terms like ‘offensive,’ ‘abusive,’ ‘hateful,’ etc., to describe different forms of toxic language or phenomena.
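The literature above treats toxicity as a scored attribute (e.g., Perspective API returns a TOXICITY probability between 0 and 1). As a toy illustration of that framing only, here is a keyword-based scorer; real systems like Perspective API use trained classifiers, not lexicons, and the word list here is an invented placeholder.

```python
# Toy toxicity scorer, for illustration of score-based framing only.
# Real detectors (e.g., Perspective API) are trained classifiers.
TOXIC_LEXICON = {"idiot", "stupid", "hate"}  # placeholder lexicon

def toxicity_score(text: str) -> float:
    """Fraction of tokens matching the toxic lexicon, in [0, 1]."""
    tokens = text.lower().split()
    if not tokens:
        return 0.0
    hits = sum(1 for t in tokens if t.strip(".,!?") in TOXIC_LEXICON)
    return hits / len(tokens)
```

A downstream application would then compare such a score against a moderation threshold, which is itself a policy choice, tying back to the definitional disagreements discussed in Section I.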

Now, let’s first look at how AI safety has been mentioned and handled in global regulatory documents, industry standard frameworks, and technical development settings.

II. Global Regulations

1. Governmental Regulations

Regulations are requirements established by governments. Currently, only a few governmental entities have released national strategies and regulations for AI governance. The following mindmap offers a brief (non-comprehensive) review of four governmental AI regulatory documents from the EU, the U.S., Canada, and China, listing the risks each emphasizes for AI systems.

- Gov Docs
  - EU:
    - [EU AI Act]
      - Regulate AI systems based on their classified risk level
      - Risk levels classified by usage and potential harm
    - [European Commission White Paper on AI]
      - Human agency and oversight
      - Technical robustness and safety
      - Privacy and data governance
      - Transparency
      - Diversity, non-discrimination and fairness
      - Societal and environmental well-being
      - Accountability

  - U.S.: [Blueprint for an AI Bill of Rights]
    - Safe and Effective Systems
    - Algorithmic Discrimination Protections
    - Data Privacy
    - Notice and Explanation (about the use of AI)
    - Human Alternatives, Consideration, and Fallback

  - Canada: [AI and Data Act]
    - Human Oversight & Monitoring
    - Transparency
    - Fairness & Equity
    - Safety
    - Accountability
    - Validity & Robustness

  - China: [Interim Measures for the Administration of Generative AI Services]
    - Categorized and Hierarchical Monitoring
    - Non-harm (to National Security & Citizen Wellbeing)
    - Non-discrimination
    - Healthy Market Competition
    - Transparency
    - Data Privacy

Although governmental policies differ in their emphases, scopes, levels of detail, and application requirements, safety and fairness are prevalent themes across all documents, usually mentioned together with robustness and equity/non-discrimination. Overall, a general aim of these regulations is to strike a delicate balance between protecting the health, safety, and fundamental rights of individuals and fostering innovation.

Current Gaps

However, current regulatory documents face wide criticism for being immature, non-uniform, and non-enforceable. For example, the EU’s AI Act has been criticized for being intentionally vague in its definitions and so weak in enforcing compliance that it lacks basic technical feasibility. Similarly, the Canadian AI and Data Act is criticized for being too broad and vague in its scope and requirements. Currently, even the basic matter of testing whether an AI system meets safety regulations remains unclear in practice. An article by Holden Karnofsky elaborates on this topic and offers a philosophical discussion of why AI safety is inherently hard to measure.

2. Industry Standards

It may not come as a surprise that industry companies have also released and followed their own AI safety standards (i.e., formal specifications of best practices) alongside the rollout of most governmental guidelines. A representative example is the Frontier Model Forum, an industry coalition focused on developing the best safety standards and practices for frontier AI models, launched by Anthropic, Google, Microsoft, and OpenAI in July 2023.

However, industry compliance for AI safety control disproportionately follows internal frameworks drafted by each company’s AI safety team (e.g., OpenAI’s Safety Systems team, Google’s SAIF), rather than governmental guidelines. These frameworks commonly detail technical protocols for AI safety across the development cycle, from model development to post-deployment monitoring.

Due to the limited scope of this blog, technical explorations in safe and responsible AI research are only roughly reviewed here; Lilian Weng’s blogs contain excellent coverage of two major focuses in technical solutions for mitigating AI safety issues: defending against adversarial attacks and reducing toxicity. In a nutshell, addressing a specific safety issue can involve efforts in these cycles:

  • Data Collection (about the target safety issue):
    • Annotation guidelines require a clear taxonomy of the target safety issue (e.g., in Figure 1).
    • Annotation methodologies include expert coding, crowdsourcing, professional moderators, and synthetic data, as summarized by Vidgen & Derczynski (2020).
    • Such manually annotated data can also be bootstrapped with unlabeled data to create a large semi-supervised training set (Khatri et al., 2018).
Taxonomy Process of Offensive Content. Image from Zampieri et al. (2019)
  • Detection: training classifiers on the annotated data to flag unsafe content.

  • Mitigation:

    • Blacklisting: vocabulary filtering or shifting (i.e., mapping)
    • Prompt-based mitigation, e.g., debiasing via altering the label sample distribution (Zhao et al., 2021)
    • Text style transfer & controllable generation
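The blacklisting step above can be sketched in a few lines. This is a minimal illustration of word-level filtering versus shifting (mapping), assuming a toy blacklist and shift map; real moderation lists and replacement policies are far more elaborate, and the words below are invented placeholders.

```python
# Minimal sketch of the "blacklisting" mitigation: filtering removes a
# blacklisted word outright, while shifting maps it to a safer replacement.
# Word lists are illustrative placeholders, not a real moderation list.
BLACKLIST = {"slur1", "slur2"}
SHIFT_MAP = {"stupid": "unwise", "hate": "dislike"}

def mitigate(text: str) -> str:
    """Apply word-level filtering and shifting to the input text."""
    out = []
    for token in text.split():
        word = token.lower().strip(".,!?")
        if word in BLACKLIST:
            out.append("[removed]")      # filtering
        elif word in SHIFT_MAP:
            out.append(SHIFT_MAP[word])  # shifting (mapping)
        else:
            out.append(token)
    return " ".join(out)
```

The sketch also hints at why blacklisting alone is weak: it operates on surface tokens, which is exactly what prompt-based mitigation and controllable generation try to improve on.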

Current Gaps

While technical advancements in addressing AI safety issues have continued to evolve in AI research, at present the non-uniform nature of these industry standards and their lack of compliance with governmental regulations pose a challenge for the effective implementation of AI safety control.

Additionally, in the status quo, AI safety and robustness are often considered and addressed only in the aftermath of deployment, which is costly, inefficient, and risk-bearing. Therefore, current industry standards and technical explorations are still far from sufficient for addressing AI safety issues.

III. Regulatory Challenges

In the previous sections, we briefly introduced the complexities of AI safety and the current global regulations and industry standards for AI safety control, together with their current limitations. A recent paper, ‘Frontier AI Regulation: Managing Emerging Risks to Public Safety’ (Anderljung et al., 2023), summarizes these regulatory challenges in AI safety control into three categories of problems:

  • The Unexpected Capabilities Problem. Dangerous capabilities can arise unpredictably and undetected, both during development and after deployment.
  • The Deployment Safety Problem. Preventing deployed AI models from causing harm is a continually evolving challenge.
  • The Proliferation Problem. Frontier AI models can proliferate rapidly, hurting accountability.
Summary of the three regulatory challenges posed by frontier AI. Image by Anderljung et al. (2023)

Breaking these down, “the Unexpected Capabilities Problem” derives from the concern that emergent capabilities of large language models are unpredictable and unintended. This poses a significant hurdle for regulators, as restricting access may not be sufficient to prevent downstream users from exploiting these capabilities, leading to potential harm.

“The Deployment Safety Problem” describes the limited testing and monitoring in AI system development, which may overlook loopholes that can be identified and abused by constantly evolving malicious users. Moreover, the Unexpected Capabilities Problem further complicates the task of safeguarding against both known and emerging unknown risks during deployment.

Lastly, “the Proliferation Problem” arises because AI systems can be easily copied and deployed in different contexts, which can spread unsafe AI systems and introduces another layer to the regulatory challenge. Frontier AI models, whether intentionally or unintentionally, may be open-sourced or vulnerable to theft.

Overall, these challenges further underscore the necessity for strengthening regulatory interventions throughout the entire lifecycle of frontier AI. The interconnected nature of these challenges emphasizes the importance of a holistic regulatory framework that addresses the evolving landscape of AI capabilities and their potential for misuse.

IV. Continuing Efforts of Closing the Gaps: What Can Be Done?

Hence, these regulatory problems call for a more concrete and comprehensive regulatory framework for frontier AI, which necessitates several fundamental components proposed in the same paper.

  • Mechanisms for the development of frontier AI safety standards, particularly via expert-driven multi-stakeholder processes, potentially coordinated by governmental bodies. Over time, these standards could become enforceable legal requirements to ensure that frontier AI models are developed safely.

  • Mechanisms to give regulators visibility into frontier AI development, such as disclosure regimes, monitoring processes, and whistleblower protection. These equip regulators with the information needed to address the appropriate regulatory targets and design effective tools for governing frontier AI.

  • Mechanisms to ensure compliance with safety standards, including voluntary self-certification schemes, enforcement by supervisory authorities, and licensing regimes. While self-regulatory efforts, such as voluntary certification, may go some way toward ensuring compliance, they seem likely to be insufficient for frontier AI models.

By combining expert-driven standards, visibility mechanisms, and robust enforcement measures under such a framework, policymakers can balance innovation with the imperative of ensuring AI systems prioritize safety and ethical considerations. At the same time, the cornerstone of regulatory strategies should be the establishment of clear and concrete safety standards in industry. Collaboration between AI developers, safety researchers, and government entities is essential, with heavy investment in defining and aligning on risk assessments, model evaluations, and oversight frameworks. These standards should be regularly reviewed and updated to keep pace with the dynamic landscape of frontier AI, ensuring ongoing mitigation of the potential risks associated with AI development and deployment.

However, it is important to note that such frameworks are never exhaustive but only point toward a progressive future direction. Unintended negative consequences may still arise from these regulatory models, including:

  • Imbalance between protection and innovation
  • Monopolies and unfair competition in AI development
  • Abuse of governmental powers

Nevertheless, in addressing the challenges of frontier AI models, the urgency of the rapidly evolving AI landscape necessitates immediate action despite uncertainties about the optimal regulatory approach. Policymakers, researchers, and practitioners must collaborate vigorously to explore and implement effective regulatory strategies, recognizing the complexities of AI governance and striving for collective efforts in this crucial endeavor.

Yingjia Wan
Master’s student in Natural Language Processing (graduated)

My research interests lie in multimodality, debiasing language models, prompting, and aligning language models with cognitive science for interpretability.