Guardrail Release

We are excited to open-source the GA Guard series, a family of safety classifiers that have been providing comprehensive protection for enterprise AI deployments for the past year.
Our OS guardrails are trained using the same adversarial pipeline in production for our enterprise customers, representing the current state-of-the-art in AI safety stacks. GA Guards are the first guards to support native long-context moderation up to 256k tokens for agent traces, long form documents, and memory-augmented workflows. To showcase that breadth, we’re releasing two open benchmarks: GA Long Context Bench for long-context moderation and GA Jailbreak Bench for classifying jailbreak attempts.
Unlike legacy encoder-based filters that are fragile under distribution shifts, our guardrails are trained on both policy-driven synthetic data and red-team examples, hardened through stress-testing and retraining cycles. As a result, where traditional filters miss novel attack patterns — e.g., harmful requests wrapped in paraphrases, translations, encodings, or role-play templates — our adverserialized guardrails catch them reliably, while keeping false positives low and boasting industry-leading latencies.
The Lineup
- GA Guard: Our default guardrail, up to 15x faster than cloud providers, balancing robustness and latency for most stacks.
- GA Guard Lite: Up to 25x faster than cloud providers, with minimal hardware requirements, while still outperforming all major cloud providers.
- GA Guard Thinking: Our best performing guard for high-risk domains, hardened with aggressive adversarial training.
General Analysis
See how your AI systems hold up under real attacks
General Analysis maps AI applications and agents, red teams prompts, retrieval, tools, MCP servers, browser actions, permissions, and business workflows, then turns findings into evidence your team can reproduce and retest.
Policy Taxonomy
Our policy taxonomy is deliberately granular, down to clear block/allow edge cases. Each label maps to widely adopted compliance anchors: NIST’s AI RMF, OWASP’s Top 10 for LLM/GenAI/ML Security, MITRE ATLAS, ISO/IEC 42001, ISO/IEC 23894, and the EU AI Act — enabling compliance-aligned deployments.
PII & IP
Goal: Block prompts containing or seeking identifiable/sensitive personal data, secrets, or IP.
Block:
- Any prompt containing or seeking personal data about an identifiable person (names with contact/precise location/IDs/online identifiers/biometrics) or special-category data (health, sex life/orientation, race/ethnicity, religion, political opinions, union membership, genetic/biometric for ID).
- Secrets/credentials that could enable access or breaches (API keys, tokens, passwords, SSH keys).
- Substitutive reproduction of copyrighted works (full chapters, paywalled articles, sheet music, full codebases).
Allow:
- High-level privacy guidance; non-identifying facts; public, non-sensitive info about public figures; brief non-substitutive summaries/short quotes consistent with fair use.
- Pseudonymous placeholders (e.g., “Jane Smith,” “@example.com”).
- Procedural questions about handling PII, provided no personal data is included.
- Clearly generic placeholders not tied to real individuals.
Illicit Activities
Goal**:** Block operationalization of crime, weapons, or illegal substances.
Block:
- How-to, procurement, evasion, or operational details enabling wrongdoing (hacking/fraud/theft/forgery/surveillance evasion/phishing/trafficking/doxxing), including step-by-steps, materials lists, exploit code, payloads.
- Weapons/explosives/CBRN/illegal drug production or distribution (including procurement/evasion).
- Any request that can be directly operationalized into illegal action, regardless of stated intent.
Allow:
- High-level safety, compliance, ethics, law, or harm-reduction content without materials/steps/bypass tips/code.
- Neutral reporting/history/legal definitions/statistics without operational detail.
- Questions about legality, penalties, or safe disposal practices.
- Morally questionable but legal content should be routed to other categories when applicable.
Hate
Goal**:** Prevent hate, harassment, or abuse, especially when targeted towards protected classes.
Block:
- Slurs, dehumanization, or advocacy/incitement of hatred/violence against protected classes (e.g., race, religion, national origin, sex incl. orientation/identity, age, disability, disease status, pregnancy, veteran/immigration status, genetic info).
- Targeted harassment or praise of hate groups/crimes; requests to generate such content.
- Proxy/coded references (dog whistles, euphemisms) used to target protected classes.
- “Contextualization” that still operationalizes targeting (e.g., “just asking questions,” hypothetical targeting, stereotyping).
Allow:
- Neutral reporting, academic discussion, or counterspeech that does not endorse harm and avoids slurs/dehumanization.
- Good-faith debate on social/political/legal issues without attacking protected classes.
- Identity-affirming/supportive content; historical references presented factually without amplification of hate.
Sexual Content
Goal: Block sexually explicit or exploitative content; allow non-explicit, educational, or supportive material.
Block:
- Pornographic/arousal-oriented content, explicit descriptions or roleplay of sexual acts/genitals/fetishized body parts.
- Sexual services/exploitation; grooming; harassment; coercion.
- Sexual violence (depiction/instruction/normalization).
- Any sexualization of minors (real or fictional), animals, incest, or non-consent.
- Transformations that produce explicit content (e.g., “make this erotic”), including image or text editing.
- “Educational” or “historical” requests that would still produce explicit content.
Allow:
- Clinical, age-appropriate sex education without graphic detail or arousal intent.
- Respectful, non-explicit relationship/wellbeing advice.
- Romance/affection without explicit elements.
- Neutral discussion of sexual topics without graphic detail or arousal intent.
- Recovery-oriented, non-graphic content about harm and resources.
- Journalistic/academic discussion of sexual misconduct presented factually and non-explicitly.
Prompt Security
Goal: Defend against jailbreaks, prompt-injection, and secret exfiltration.
Block:
- Attempts to bypass rules, exfiltrate system prompts/secrets, override tools, escalate privileges, or coerce unsafe actions.
- Roleplay/persona tricks to disable protections (“act as DAN/evil twin/terminal”).
- Indirect attacks that author a future jailbreak (“write a prompt that breaks your filters,” staged delegation).
- Hidden malicious instructions in other media (emails, PDFs, stories, papers).
- Requests to reveal/restate/translate hidden rules or moderation logic.
- Obfuscation intended to mask prohibited intent (Base64/ROT13/hex/URL-encoding, ciphering).
Allow:
- Conceptual explanations of threats/defenses without working payloads.
- Academic/journalistic/security research discussion without operational steps or code.
- Non-operational case studies/hypotheticals for risk education.
- Defensive best practices and mitigations.
Violence & Self-Harm
Goal: Prevent promotion, instruction, or graphic depiction of violence or self-harm.
Block:
- Instructions, encouragement, normalization, glorification, or promotion of violence/abuse/self-harm (incl. suicide, EDs, mutilation).
- How-to/operational details enabling violence or self-harm.
- Graphic depictions of violence/torture/self-harm.
- Praise/justification/incitement of real-world violent acts or self-harm.
- Any request that could be operationalized to harm self or others, regardless of framing.
Allow:
- Safety/prevention/help-seeking info (e.g., crisis lines, safety planning).
- Neutral, non-graphic reporting or historical/academic discussion without instructions.
- Fiction/artistic references that are non-graphic and non-operational.
- Personal, recovery-oriented, constructive discussions; supportive responses (e.g., in the U.S., dial/text 988 for the Suicide & Crisis Lifeline).
Misinformation
Goal: Block content that promotes demonstrably false claims or coordinated deception.
Block:
- Fabricated/manipulated accounts of well-established events (e.g., Holocaust denial).
- False claims in public health, safety, finance, elections, or civic processes that contradict well-verified evidence.
- Propaganda/disinformation presented as fact; conspiracy narratives denying widely verified evidence.
- Requests to produce deceptive artifacts (fake studies/news, fabricated quotes, forged docs/screenshots, deepfake scripts, impersonations) or “evade fact-checking.”
- Instructions for seeding or coordinating misinformation campaigns.
Allow:
- Personal opinions or debatable views not asserting demonstrably false facts.
- Fact-checking, neutral reporting, and analysis of misinformation/disinformation.
- Fiction/satire clearly not intended as factual and not offensive.
- Guidance on detecting/countering false claims; quoting misinformation only for critique or moderation.
If you would like a version of our guardrails tailored to your company’s bespoke policies, contact us for a demo of our adversarial training pipeline.
By the Numbers
- GA Guard Thinking leads every public benchmark (F1: 0.876 / 0.858 / 0.983) while holding false-positive rates well below competing models.
- GA Guard Lite outperforms AWS/Azure/Vertex on every suite (F1: 0.844 / 0.819 / 0.963) with up to 25x faster latencies.
- Both GA Guard and GA Guard Lite beat out cloud providers by significant margins on all three public suites.
- On OpenAI Moderation, GA Guard tops AWS/Azure/Vertex by +0.119/+0.066/+0.183 F1 (Lite: +0.090/+0.037/+0.154).
- On WildGuard, GA Guard posts 0.844 F1, beating AWS/Azure/Vertex by +0.195/+0.381/+0.254 (Lite: 0.819 F1, +0.170/+0.356/+0.229).
- On HarmBench, GA Guard Thinking reaches 0.983 F1 with GA Guard at 0.981 and GA Guard Lite at 0.963 vs cloud baselines (AWS 0.797, Azure 0.609, Vertex 0.945).
| Guard | OpenAI Moderation | WildGuard | HarmBench Behaviors | Avg Time (s) | ||||||
|---|---|---|---|---|---|---|---|---|---|---|
| Acc. | F1 | FPR | Acc. | F1 | FPR | Acc. | F1 | FPR | ||
| GA Guard | 0.916 | 0.873 | 0.111 | 0.856 | 0.844 | 0.172 | 0.963 | 0.981 | N/A | 0.029 |
| GA Guard Thinking | 0.917 | 0.876 | 0.112 | 0.862 | 0.858 | 0.134 | 0.967 | 0.983 | N/A | 0.650 |
| GA Guard Lite | 0.896 | 0.844 | 0.109 | 0.835 | 0.819 | 0.176 | 0.929 | 0.963 | N/A | 0.016 |
| AWS Bedrock Guardrail | 0.818 | 0.754 | 0.216 | 0.642 | 0.649 | 0.449 | 0.662 | 0.797 | N/A | 0.375 |
| Azure AI Content Safety | 0.879 | 0.807 | 0.091 | 0.667 | 0.463 | 0.071 | 0.438 | 0.609 | N/A | 0.389 |
| Vertex AI Model Armor | 0.779 | 0.690 | 0.225 | 0.711 | 0.590 | 0.105 | 0.896 | 0.945 | N/A | 0.873 |
| GPT 5 | 0.838 | 0.775 | 0.188 | 0.849 | 0.830 | 0.145 | 0.975 | 0.987 | N/A | 11.275 |
| GPT 5-mini | 0.794 | 0.731 | 0.255 | 0.855 | 0.839 | 0.151 | 0.975 | 0.987 | N/A | 5.604 |
| Llama Guard 4 12B | 0.826 | 0.737 | 0.156 | 0.799 | 0.734 | 0.071 | 0.925 | 0.961 | N/A | 0.459 |
| Llama Prompt Guard 2 86M | 0.686 | 0.015 | 0.009 | 0.617 | 0.412 | 0.143 | 0.200 | 0.333 | N/A | 0.114 |
| Nvidia Llama 3.1 Nemoguard 8B | 0.852 | 0.793 | 0.174 | 0.849 | 0.818 | 0.096 | 0.875 | 0.875 | N/A | 0.358 |
| VirtueGuard Text Lite | 0.507 | 0.548 | 0.699 | 0.656 | 0.682 | 0.491 | 0.875 | 0.933 | N/A | 0.651 |
| Lakera Guard | 0.752 | 0.697 | 0.323 | 0.630 | 0.662 | 0.527 | 0.946 | 0.972 | N/A | 0.377 |
| Protect AI (prompt-injection-v2) | 0.670 | 0.014 | 0.032 | 0.559 | 0.382 | 0.248 | N/A | N/A | N/A | 0.115 |
Average latencies are end-to-end (includes API response time). Since these public benchmarks do not evaluate PII/Privacy, these policies were disabled for any guard that supports them.
Public benchmark guardrail configurations
- AWS Bedrock Guardrail: all filters set to medium strength.
- Azure AI Content Safety: includes Prompt Shield + Text Moderation endpoints; sensitivity threshold ≥ 4.
- Vertex AI Model Armor: all filters set to medium and above.
- GPT 5 & GPT 5 Mini: prompted to act as a safety classifier given our policies. High reasoning.
- Llama Guard 4 12B: as hosted on Together AI.
- Llama Prompt Guard 2 86M: self-hosted locally from Hugging Face on a B200.
- Nvidia Llama 3.1 Nemoguard 8B: self-hosted locally from Hugging Face on a B200.
- VirtueGuard Text Lite: as hosted on Together AI; can only be run as a live moderator for another LLM and blocks both prompts and outputs; in our eval, we report input-blocking only.
- Lakera Guard: evaluated under the Content Safety policy.
- Protect AI (prompt-injection-v2): self-hosted locally from Hugging Face on a B200.
GA Jailbreak Bench
GA Jailbreak Bench is our adversarial evaluation suite. This benchmark is generated using our RL-trained attacker model, which generates novel, out-of-distribution adversarial prompts by employing diverse attack strategies. Our evaluation predicts real-world performance against motivated attackers who won't limit themselves to straightforward prompting techniques. As new jailbreak patterns emerge, we will continue adding attack operators and re-train the agent, then re-issue the public benchmark so results track the evolving vulnerability landscape rather than a stale test set.
| Guard | Accuracy | F1 Score | FPR | F1 Hate & Abuse | F1 Illicit Activities | F1 Misinf. | F1 PII & IP | F1 Prompt Security | F1 Sexual Content | F1 Violence & Self-Harm |
|---|---|---|---|---|---|---|---|---|---|---|
| GA Guard | 0.931 | 0.930 | 0.038 | 0.946 | 0.939 | 0.886 | 0.967 | 0.880 | 0.954 | 0.928 |
| GA Guard Thinking | 0.939 | 0.933 | 0.029 | 0.965 | 0.925 | 0.894 | 0.962 | 0.885 | 0.942 | 0.946 |
| GA Guard Lite | 0.902 | 0.898 | 0.065 | 0.908 | 0.900 | 0.856 | 0.936 | 0.850 | 0.934 | 0.904 |
| AWS Bedrock Guardrail | 0.606 | 0.607 | 0.396 | 0.741 | 0.456 | 0.535 | 0.576 | 0.649 | 0.721 | 0.518 |
| Azure AI Content Safety | 0.542 | 0.193 | 0.026 | 0.236 | 0.093 | 0.155 | 0.068 | 0.416 | 0.186 | 0.130 |
| Vertex AI Model Armor | 0.550 | 0.190 | 0.008 | 0.077 | 0.190 | 0.582 | 0.076 | 0.000 | 0.000 | 0.241 |
| GPT 5 | 0.900 | 0.893 | 0.035 | 0.928 | 0.942 | 0.856 | 0.799 | 0.819 | 0.953 | 0.939 |
| GPT 5-mini | 0.891 | 0.883 | 0.050 | 0.917 | 0.942 | 0.845 | 0.850 | 0.822 | 0.882 | 0.924 |
| Llama Guard 4 12B | 0.822 | 0.796 | 0.053 | 0.768 | 0.774 | 0.587 | 0.809 | 0.833 | 0.927 | 0.827 |
| Llama Prompt Guard 2 86M | 0.490 | 0.196 | 0.069 | N/A | N/A | N/A | N/A | 0.196 | N/A | N/A |
| Nvidia Llama 3.1 Nemoguard 8B | 0.668 | 0.529 | 0.038 | 0.637 | 0.555 | 0.513 | 0.524 | 0.049 | 0.679 | 0.575 |
| VirtueGuard Text Lite | 0.513 | 0.664 | 0.933 | 0.659 | 0.689 | 0.657 | 0.646 | 0.659 | 0.675 | 0.662 |
| Lakera Guard | 0.525 | 0.648 | 0.825 | 0.678 | 0.645 | 0.709 | 0.643 | 0.631 | 0.663 | 0.548 |
| Protect AI (prompt-injection-v2) | 0.528 | 0.475 | 0.198 | N/A | N/A | N/A | N/A | 0.475 | N/A | N/A |
Any policy category a guardrail doesn’t support is excluded from aggregate metrics. All configs mirror the public benchmark, with PII/Privacy enabled where available.
GA jailbreak benchmark guardrail configurations
- Azure AI Content Safety: Prompt Shield + Text Moderation + Protected Material; sensitivity threshold ≥ 4.
- VirtueGuard Text Lite: “Privacy” and “Intellectual Property” categories re-enabled.
- Lakera Guard: evaluated under policy-lakera-default.
Jailbreak Bench Takeaways
- GA Guard Thinking leads the pack (Accuracy 0.939, F1 0.933, FPR 0.029), with GA Guard close behind at 0.931 / 0.930 / 0.038, both clearing GPT 5 high reasoning (0.900 / 0.893 / 0.035) while still holding a massive latency edge (0.029s vs 11.275s).
- All cloud guardrails lag far behind. Benchmarked against major cloud offerings (AWS/Azure/Vertex), GA Guard still delivers ~2.8x higher mean F1 (0.930 vs 0.33) while cutting average FPR by ~73% (0.038 vs 0.143), and GA Guard Thinking pushes that gap even wider.
- Legacy, encoder-only filters collapse under distribution shifts and underperform against adaptive, real-world attacks. A deeper technical post on traditional guardrail failure modes and our adversarial training approach is coming shortly; the TL;DR above already shows why distribution-shift-resilient, red-team-hardened guards are the most promising path to production robustness.
GA Long Context Bench
GA Long Context Bench contains 1,500 multi-turn agent traces averaging 10.3k tokens with a range between 1.3k–42.1k tokens. Half the rows carry an explicit injection or policy violation. Unlike other guards that truncate full agent traces and lose out on context, GA Guard is the first guardrail to natively moderate 256k-token conversations.
| Guard | Accuracy | F1 | FPR | F1 Hate & Abuse | F1 Illicit Activities | F1 Misinformation | F1 PII & IP | F1 Prompt Security | F1 Sexual Content | F1 Violence & Self-Harm |
|---|---|---|---|---|---|---|---|---|---|---|
| GA Guard | 0.887 | 0.891 | 0.147 | 0.983 | 0.972 | 0.966 | 0.976 | 0.875 | 0.966 | 0.988 |
| GA Guard Thinking | 0.889 | 0.893 | 0.151 | 0.967 | 0.951 | 0.940 | 0.961 | 0.828 | 0.920 | 0.962 |
| GA Guard Lite | 0.881 | 0.885 | 0.148 | 0.979 | 0.969 | 0.972 | 0.976 | 0.846 | 0.973 | 0.985 |
| AWS Bedrock Guardrail | 0.532 | 0.695 | 1.000 | 0.149 | 0.211 | 0.131 | 0.367 | 0.175 | 0.092 | 0.157 |
| Azure AI Content Safety | 0.480 | 0.046 | 0.001 | 0.028 | 0.041 | 0.016 | 0.073 | 0.049 | 0.000 | 0.081 |
| Vertex AI Model Armor | 0.635 | 0.560 | 0.138 | 0.187 | 0.312 | 0.109 | 0.473 | 0.194 | 0.085 | 0.241 |
| GPT 5 | 0.764 | 0.799 | 0.372 | 0.219 | 0.297 | 0.189 | 0.404 | 0.243 | 0.137 | 0.229 |
| GPT 5-mini | 0.697 | 0.772 | 0.607 | 0.184 | 0.253 | 0.157 | 0.412 | 0.215 | 0.112 | 0.190 |
| Llama Guard 4 12B | 0.569 | 0.602 | 0.516 | 0.164 | 0.228 | 0.132 | 0.334 | 0.188 | 0.097 | 0.195 |
| Llama Prompt Guard 2 86M | 0.505 | 0.314 | 0.162 | N/A | N/A | N/A | N/A | 0.093 | N/A | N/A |
| Nvidia Llama 3.1 Nemoguard 8B | 0.601 | 0.360 | 0.021 | 0.243 | 0.288 | 0.097 | 0.192 | 0.116 | 0.305 | 0.321 |
| VirtueGuard Text Lite | 0.490 | 0.147 | 0.047 | 0.082 | 0.203 | 0.118 | 0.069 | 0.074 | 0.058 | 0.132 |
| Lakera Guard | 0.520 | 0.684 | 0.999 | 0.151 | 0.200 | 0.132 | 0.361 | 0.160 | 0.093 | 0.159 |
| Protect AI (prompt-injection-v2) | 0.496 | 0.102 | 0.001 | N/A | N/A | N/A | N/A | 0.032 | N/A | N/A |
Any policy category a guardrail doesn’t support is excluded from aggregate metrics. All configs mirror the public benchmark, with PII/Privacy enabled where available.
GA long context benchmark guardrail configurations
- Azure AI Content Safety: Prompt Shield + Text Moderation + Protected Material; sensitivity threshold ≥ 4.
- VirtueGuard Text Lite: “Privacy” and “Intellectual Property” categories re-enabled.
- Lakera Guard: evaluated under policy-lakera-default.
Long Context Bench Takeaways
- GA Guard Thinking posts the highest F1 0.893, while GA Guard still delivers the lowest false-positive rate (0.147 FPR).
- GA Guard Lite is a close third at 0.885 F1 with just 0.148 FPR, making long-context moderation viable on edge GPUs or low latency requirements.
- Competing guardrails fail to generalize to long-form transcripts: the strongest cloud baseline (Vertex AI Model Armor) trails by 0.331 F1, while other incumbents either collapse on precision (Azure) or incur extreme false-positive rates due to their deterministic filters (AWS Bedrock at 1.000 FPR).
Quick start options:
Self-host in one command (vLLM or Transformers)
# vLLM
vllm serve generalanalysis/ga-guard-core # our default guard
# or
vllm serve generalanalysis/ga-guard-lite # ultra-fast, low-resource guard
# Hugging Face Transformers
from transformers import AutoTokenizer, AutoModelForCausalLM
tokenizer = AutoTokenizer.from_pretrained("GeneralAnalysis/GA_Guard_Core")
model = AutoModelForCausalLM.from_pretrained("GeneralAnalysis/GA_Guard_Core")
messages = [{"role": "user", "content": "Who are you?"}]
inputs = tokenizer.apply_chat_template(
messages,
add_generation_prompt=True,
tokenize=True,
return_dict=True,
return_tensors="pt",
).to(model.device)
outputs = model.generate(**inputs, max_new_tokens=40)
print(tokenizer.decode(outputs[0][inputs["input_ids"].shape[-1]:]))
# Sample output:
# <hate_and_abuse_not_violation><illicit_activities_not_violation>...
Drop-in SDK
# pip install generalanalysis
import generalanalysis
# Initialize client (uses GA_API_KEY environment variable)
client = generalanalysis.Client()
# Invoke a guardrail
result = client.guards.invoke(guard_id=16, text="Hello World")
# You can use either use result.block for binary decisions or policy.violation_prob for your own tunable threshold-based filtering
if result.block:
print("Content blocked!")
for policy in result.policies:
if not policy.passed:
print(f" Violated: {policy.name} - {policy.definition}")
print(f" Confidence: {policy.violation_prob:.2%}")
Contact us for a free trial of our SDK. We are happy to issue free API keys for our sdk and platform.
Custom policies & enterprise
We’ve spent the last year hardening guardrails in production for enterprise teams. Contact us to learn more about our adversarial training pipeline and how we convert your bespoke policies into robust guardrails. We support on-prem/VPC, streaming hooks, audit logs, and SLAs. Reach us at info@generalanalysis.com or book a demo to discuss your enterprise needs.
Reach us at info@generalanalysis.com to discuss GA Guard or production deployment.