
Guardrail Release
Open-source release of the GA Guard series, a family of safety classifiers that have been providing comprehensive protection for enterprise AI deployments for the past year.
Loading page...

Open-source release of the GA Guard series, a family of safety classifiers that have been providing comprehensive protection for enterprise AI deployments for the past year.

We reveal a powerful metadata-spoofing attack that exploits Claude's iMessage integration to mint unlimited Stripe coupons or invoke any MCP tool with arbitrary parameters, without alerting the user.

We present the Redact & Recover (RnR) Jailbreak, a novel attack that exploits partial compliance behaviors in frontier LLMs to bypass safety guardrails through a two-phase decomposition strategy.

In this post, we show how an attacker can exploit Supabase’s MCP integration to leak a developer’s private SQL tables. Model Context Protocol (MCP) has emerged as a standard way for LLMs to interact with external tools. While this unlocks new capabilities, it also introduces new risk surfaces.

Our compact policy moderation models achieve human-level performance at <1% per-review cost, outperforming GPT-4o and o4‑mini on F1 while running faster and cheaper.

A head-to-head robustness evaluation of Llama 4 (Maverick, Scout) versus GPT‑4.1, GPT‑4o, Sonnet 3.7, etc. using TAP‑R, Crescendo, and Redact‑and‑Recover across HarmBench and AdvBench.

We are excited to announce our partnership with Together AI to stress-test the safety of open-source (and closed) language models.

We have created a comprehensive overview of the most influential LLM jailbreaking methods.

We utilized LegalBench as a diversity source to enhance the diversity of our generation of red teaming questions. We show that diversity transfer from a domain-specific knowledge base is a simple and practical way to build a solid red teaming benchmark.

In this work we explore automated red teaming, applied to GPT-4o in the legal domain. Using a Llama3 8B model as an attacker, we generate more than 50,000 adversarial questions that cause GPT-4o to hallucinate responses in over 35% of cases.