Evidence

Results, not promises.

Every claim is backed by a runnable spike. Below are measured results from adversarial trials, an ablation study, and concrete scenarios. We publish the results and the high-level method, but never the underlying mechanism, which is shared only under NDA.

Headline results

0

execution bypasses

Across 10,000 adversarial trials, no action executed without a valid authorization.

Method: offline adversarial certification against the enforcement layer.

0

unauthorized capability accesses

Under the evaluated workloads, no agent obtained a capability beyond what its intent permitted.

Method: capability-boundary checks across the evaluated task workloads.

93.1%

goal-drift detection

Detection rate for adversarial chains in which an agent drifts toward a self-derived goal. The 6.9% residual consists of high-overlap adversarial constructions, and it is stated, not hidden.

Method: tracing each action back to the original request to flag drift.

645

tests passing

Reference implementation, offline (652 with a live model backend).

Method: continuous automated test suite on the reference implementation.

6/6

database scenarios blocked

DB Warden validation spike: all six unsafe database-action scenarios were stopped as designed.

Method: runnable scenario suite against the database authority module.

Baseline for context: under a shutdown threat or goal conflict, some frontier models pursue misaligned actions at rates exceeding 90% (Lynch et al., 2025). This is the problem Aiegis is built to contain structurally.

Safety is the composition, not one trick

An ablation study disables one defense layer at a time and measures how far attack success rises. Two layers carry the most weight, and removing any layer degrades safety, which is why the guarantee comes from the whole system rather than a single feature.

Remove hard authorization

34.8%

attack success

Largest single effect: the deterministic veto on prohibited actions.

Remove intent lineage tracking

20.3%

attack success

Second largest: catching goal drift away from the original request.
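The ablation procedure described above can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration: the layer names, the `Attack` fields, and the toy attack set are hypothetical, and the resulting rates have no relation to the 34.8% and 20.3% figures reported here. It only shows the shape of the measurement: remove one layer, replay the same attacks, and record how many get through.

```python
from typing import Callable, Dict, List

Attack = Dict[str, bool]
Layer = Callable[[Attack], bool]  # returns True when the layer blocks the attack

# Hypothetical stand-ins for two defense layers (not the real mechanism).
LAYERS: Dict[str, Layer] = {
    "hard_authorization": lambda attack: attack["lacks_token"],
    "lineage_tracking": lambda attack: attack["drifts"],
}


def attack_success_rate(attacks: List[Attack], enabled: List[str]) -> float:
    """Fraction of attacks that no enabled layer blocks."""
    succeeded = sum(
        1 for attack in attacks
        if not any(LAYERS[name](attack) for name in enabled)
    )
    return succeeded / len(attacks)


def ablate(attacks: List[Attack]) -> Dict[str, float]:
    """Disable one layer at a time and report the attack success rate."""
    names = list(LAYERS)
    return {
        removed: attack_success_rate(attacks, [n for n in names if n != removed])
        for removed in names
    }


# Toy attack set, invented purely for the example.
TOY_ATTACKS: List[Attack] = [
    {"lacks_token": True, "drifts": False},
    {"lacks_token": True, "drifts": False},
    {"lacks_token": False, "drifts": True},
    {"lacks_token": True, "drifts": True},
]
```

With every layer enabled, the toy attacks are all blocked. Calling `ablate(TOY_ATTACKS)` then shows how much each layer contributes on its own.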

Case studies

Case 01 · Prompt injection → capability escalation

An injected instruction tries to make an agent act beyond its task.

A decision-layer filter can lower how often this slips through, but it remains probabilistic. Aiegis decides authorization outside the model: the injected action has no valid capability token, so it cannot execute, regardless of how convincing the prompt is.

Result: 0 bypasses across 10,000 adversarial trials.
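To make the idea of authorization decided outside the model concrete, here is a minimal sketch. It is not the Aiegis mechanism, which is not published. The HMAC-signed token, the `AUTHORITY_KEY` constant, and the `issue_token` and `execute` functions are all hypothetical names chosen for the example. The point is only that the executor checks a token the model cannot forge, so the wording of the prompt never enters the decision.

```python
import hashlib
import hmac
from typing import Optional

# Hypothetical signing key held by the authorizer, never exposed to the model.
AUTHORITY_KEY = b"demo-key-not-for-production"


def issue_token(task_id: str, action: str) -> str:
    """Authorizer side: sign the (task, action) pair that was approved."""
    message = f"{task_id}:{action}".encode()
    return hmac.new(AUTHORITY_KEY, message, hashlib.sha256).hexdigest()


def execute(task_id: str, action: str, token: Optional[str]) -> str:
    """Executor side: run the action only if the token verifies for it."""
    if token is None:
        return "denied"
    expected = issue_token(task_id, action)
    # Constant-time comparison on bytes, so odd input cannot raise or leak timing.
    if not hmac.compare_digest(expected.encode(), token.encode()):
        return "denied"
    return "executed"
```

In this sketch, a token issued for `read_report` does not authorize `delete_records`, and an injected action that arrives with no token at all is denied.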

Case 02 · Agentic misalignment → goal drift

An agent uses fully authorized capabilities to pursue a goal it invented.

No capability boundary is crossed, so capability checks alone would pass it. Aiegis traces every executing action back to the original request; when the lineage diverges, the drift is flagged before execution.

Result: 93.1% detection; the 6.9% residual is stated openly.
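The tracing idea can be sketched in a few lines, under strong simplifying assumptions. Here, "lineage" is just a chain of parent links, and drift means that a step's ancestry never reaches the original request. The `Step` record and the `traces_to` function are hypothetical. The published 93.1% figure comes from the real system, not from anything resembling this toy check.

```python
from dataclasses import dataclass
from typing import Dict, Optional


@dataclass(frozen=True)
class Step:
    step_id: str
    parent_id: Optional[str]  # None marks a root with no recorded justification


def traces_to(steps: Dict[str, Step], step_id: str, request_id: str) -> bool:
    """Walk parent links; return True only if the chain reaches request_id."""
    seen = set()
    current: Optional[str] = step_id
    while current is not None and current not in seen:
        if current == request_id:
            return True
        seen.add(current)  # guards against cycles in malformed lineage data
        step = steps.get(current)
        current = step.parent_id if step else None
    return False


# Toy lineage: "a1" descends from the request, "a3" from a self-derived goal.
STEPS = {
    "req": Step("req", None),
    "a1": Step("a1", "req"),
    "self_goal": Step("self_goal", None),
    "a3": Step("a3", "self_goal"),
}
```

A gate built on this check would run `traces_to` before execution and flag any step for which it returns `False`, such as `a3` above.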

Case 03 · Database authority (DB Warden)

An agent attempts unsafe actions against a production database.

Six scenarios, ranging from over-broad reads to destructive writes, run against the database authority module. Each unsafe action is gated by a runtime authorization decision that the agent cannot grant itself.

Result: 6/6 scenarios blocked as designed.
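As an illustration of a runtime gate on database actions, the sketch below applies a few hypothetical rules to a SQL string before it would reach the database. The rules, the `authorize` function, and the keyword matching are assumptions for the example. They are not DB Warden's rules, and they are not the six scenarios in the validation spike. The matching is deliberately naive: a keyword inside a string literal would fool it, so a real gate would need a proper SQL parser.

```python
import re

# Hypothetical policy: schema-destroying verbs are never allowed.
DESTRUCTIVE = ("DROP", "TRUNCATE", "ALTER")
ALLOWED = ("SELECT", "INSERT", "UPDATE", "DELETE")


def authorize(sql: str) -> bool:
    """Return True only if the statement passes every rule in the toy policy."""
    upper = sql.strip().rstrip(";").upper()
    if not upper:
        return False
    verb = upper.split()[0]
    if verb in DESTRUCTIVE:
        return False
    # Writes that would touch every row must carry a WHERE clause.
    if verb in ("DELETE", "UPDATE") and not re.search(r"\bWHERE\b", upper):
        return False
    # Reads must be bounded, to rule out over-broad scans.
    if verb == "SELECT" and not re.search(r"\bLIMIT\s+\d+\b", upper):
        return False
    return verb in ALLOWED
```

The decision lives in `authorize`, which the agent calls but cannot modify; that separation, rather than these specific rules, is what the sketch illustrates.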

Two categories of defense

Most AI safety today lives in the decision layer: inside the model, lowering the rate of bad output. Aiegis adds an enforcement layer outside the model, making unauthorized action structurally impossible. The two are complementary; the enforcement layer is the one almost nobody builds.

| Dimension | Decision-layer defenses | Enforcement layer (Aiegis) |
| --- | --- | --- |
| Nature | Probabilistic | Deterministic |
| What it does | Lowers the rate of bad output | Prevents unauthorized action |
| Bypassable? | Yes: adversarial inputs keep evolving | No: execution requires a valid token |
| Permission boundary | "Detected or not" ≠ a boundary | Capability token = a hard boundary |
| Where it sits | Inside the model | External to the model |
| Auditability | Logs of detections | A verifiable ledger of every action |

This compares architectural categories, not specific vendors. Decision-layer defenses are real and useful; Aiegis is designed to sit underneath them as the layer that does not exist yet.
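One common way to make an action log verifiable is a hash chain, in which each entry commits to the one before it. The sketch below shows that general technique only. The comparison above does not say how the Aiegis ledger is built, so the entry layout, the `GENESIS` value, and the `append` and `verify` functions are assumptions for illustration.

```python
import hashlib
import json
from typing import Dict, List

GENESIS = "0" * 64  # hypothetical placeholder hash for the first entry


def _digest(prev_hash: str, record: Dict[str, str]) -> str:
    """Hash the previous entry's hash together with this record."""
    payload = json.dumps(record, sort_keys=True).encode()
    return hashlib.sha256(prev_hash.encode() + payload).hexdigest()


def append(ledger: List[dict], record: Dict[str, str]) -> None:
    """Add a record that commits to the current tail of the ledger."""
    prev_hash = ledger[-1]["hash"] if ledger else GENESIS
    ledger.append(
        {"record": record, "prev": prev_hash, "hash": _digest(prev_hash, record)}
    )


def verify(ledger: List[dict]) -> bool:
    """Recompute every hash; any edited or reordered entry breaks the chain."""
    prev_hash = GENESIS
    for entry in ledger:
        if entry["prev"] != prev_hash:
            return False
        if entry["hash"] != _digest(prev_hash, entry["record"]):
            return False
        prev_hash = entry["hash"]
    return True
```

Unlike a plain log of detections, editing any past record in such a chain changes its hash and invalidates every later entry, so tampering is detectable by anyone who re-runs `verify`.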

How to read these numbers

Results come from offline certification and online canary checks on the reference implementation, under explicitly stated assumptions. Aiegis does not claim to eliminate every risk. It converts an unbounded behavioral risk into a bounded, measured, structurally enforced property, and states the residual. The formal model and implementation mechanism are not published here; they are available under NDA.
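One standard way to read a "0 out of N" result is to ask what failure rate is still statistically compatible with it. The sketch below computes the exact one-sided upper bound for zero observed failures. It assumes independent, identically distributed trials, which adversarial trials may not be, so treat it as a reading aid rather than part of the published method. The function name `zero_failure_upper_bound` is hypothetical.

```python
def zero_failure_upper_bound(trials: int, confidence: float = 0.95) -> float:
    """Upper bound on the true failure rate after 0 failures in `trials` trials.

    Solves (1 - p) ** trials = 1 - confidence for p, assuming independent trials.
    """
    if trials <= 0:
        raise ValueError("trials must be positive")
    if not 0.0 < confidence < 1.0:
        raise ValueError("confidence must be strictly between 0 and 1")
    return 1.0 - (1.0 - confidence) ** (1.0 / trials)


# For 10,000 clean trials the bound is close to the "rule of three", 3 / trials.
bound = zero_failure_upper_bound(10_000)
```

Under those assumptions, 0 bypasses in 10,000 trials is consistent at 95% confidence with a true bypass rate of up to roughly 0.03% per trial, which is the bounded, measured kind of statement this section describes.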

Want the methodology and mechanism in depth?

Request a technical briefing (NDA)