FEAT add AgentThreatRulesScorer (ATR taxonomy scorer)#1893
Conversation
Add a deterministic TrueFalseScorer that evaluates text against the open Agent Threat Rules (ATR) ruleset via the pyatr engine and returns True when a rule at or above a configurable min_severity matches, attaching matched rule ids / ATR category / max severity as score metadata. Mirrors SubStringScorer; pyatr (>=0.2.6) is an optional dependency. Scorer half of Signed-off-by: Adam Lin <adam@agentthreatrule.org>
e55c2a6 to
4d55a7d
Compare
|
Freshened this onto the latest main — it is up to date and mergeable now. Ready for review when convenient. |
adrian-gavrila
left a comment
There was a problem hiding this comment.
Thanks for picking this up! Overall it looks good, but there are a couple of things that should be addressed, primarily around things actually being excercised in CI and following some style guidelines.
| # pyatr returns matches sorted by severity (critical first). | ||
| hits = [m for m in matches if _SEVERITY_ORDER.get((m.severity or "").lower(), 0) >= self._severity_floor] | ||
| triggered = bool(hits) | ||
|
|
||
| if triggered: | ||
| top = hits[0] |
There was a problem hiding this comment.
top = hits[0] relies on pyatr returning matches pre-sorted critical-first (per the comment on line 108). That ordering is an undocumented pyatr internal, so if a future release changes or drops the sort, max_severity and the highest-severity value in the description silently go wrong while every test still passes.
We already own a severity ranking in _SEVERITY_ORDER, so we can pick the top match ourselves instead of trusting the engine's order:
top = max(hits, key=lambda m: _SEVERITY_ORDER.get((m.severity or "").lower(), 0))Separately, there's a casing inconsistency: the filter on line 109 lowercases severity before comparing, but line 122 stores top.severity raw. If pyatr ever emits mixed case, the stored max_severity won't match what the filter normalized.
| import pytest | ||
|
|
||
| # The scorer relies on the optional `pyatr` engine; skip if it is not installed. | ||
| pytest.importorskip("pyatr") |
There was a problem hiding this comment.
This importorskip is at module scope, so it gates the whole file, and pyatr isn't declared anywhere in pyproject.toml. CI never installs it, so all four tests are skipped on every run. That's also why the broken assertion in test_atr_scorer_passes_benign below was never caught.
Two things to fix it:
- Wire
pyatr>=0.2.6into an optional extra and theallgroup so CI installs it and the live tests actually run. - Move
test_atr_scorer_rejects_invalid_min_severityout from behind this module-level skip. It only asserts that a badmin_severityraisesValueError, which happens before thepyatrimport in__init__, so it doesn't need the engine and shouldn't be gated.
| return self._create_identifier( | ||
| params={ | ||
| "score_aggregator": self._score_aggregator.__name__, # type: ignore[ty:unresolved-attribute] | ||
| "min_severity": self._min_severity, |
There was a problem hiding this comment.
Nit, optional and low likelihood: _build_identifier omits rules_dir, so two scorers loading different custom rulesets share an eval_hash (and stored scorer_class_identifier). It only bites if you pass a custom rules_dir and use the scorer-eval metrics harness, so the bundled-ruleset path is never affected. Cheap to close though, and consistent with how SubStringScorer includes substring in its identifier:
| "min_severity": self._min_severity, | |
| "min_severity": self._min_severity, | |
| "rules_dir": self._rules_dir, |
| # Copyright (c) Microsoft Corporation. | ||
| # Licensed under the MIT license. | ||
|
|
||
| from typing import Optional |
There was a problem hiding this comment.
This file currently fails ruff check under the repo config (pyrit/score/** isn't DOC/D-exempt), so CI will be red until these are addressed. I ran it locally; the failures are:
- UP045 (x5):
Optional[...]should beX | None(lines 4, 42, 43, 45, 94, 119). This is the modern union syntax the codebase standardized on;SubStringScorer, which this file mirrors, already uses it. - D213 (x3): multi-line docstring summary should start on the second line (lines 19, 47, 95), matching SubStringScorer's docstring style.
- DOC501 (x2):
__init__raisesValueError(badmin_severity) andImportError(missing pyatr), but neither is in aRaises:section. - DOC201:
_score_piece_asyncdescribes its return in prose but has noReturns:section.
ruff check --fix auto-resolves the UP045 and D213; the Raises: and Returns: sections need to be added by hand.
|
|
||
| assert len(scores) == 1 | ||
| assert scores[0].get_value() is False | ||
| assert scores[0].score_metadata is None |
There was a problem hiding this comment.
Two assertion issues here, both of which only surface once this file actually runs in CI (see the coverage comment at the top of the file):
- Line 34:
assert scores[0].score_metadata is Nonecan never pass. TheScoremodel has amode="before"validator that coercesNoneto{}, so on the benign path this is effectivelyassert {} is None. I confirmed it by installing pyatr locally and running the module: 1 failed, 3 passed. Assert the no-match contract you actually want, e.g.not scores[0].score_metadataor== {}. - Line 23:
assert scores[0].score_metadata is not Noneis vacuous for the same reason (metadata is always a dict), so it never tests anything. The next line (["matched_rule_ids"]) is the real check, so this one can go.
…ertions
- Sort hits by severity explicitly; don't rely on pyatr internal ordering
- Add pyatr>=0.2.6 as an optional 'atr' extra + into 'all' so CI installs it
- Ungate test_atr_scorer_rejects_invalid_min_severity (no engine needed);
gate the three engine tests individually with skipif
- Fix benign assertion (== {}), drop vacuous 'is not None'
- _build_identifier includes rules_dir
- ruff: Optional -> X | None, add Raises/Returns, D213
|
Thanks for the thorough review, Adrian — all addressed in the latest push:
Verified locally: ruff passes, and against pyatr 0.2.6 the injection string trips 5 rules (top severity critical) while the benign string trips none, so the severity-floor tests hold. Ready for another look when you have a moment. |
Adds AgentThreatRulesScorer, the scorer half of #1702 (the dataset loader landed in #1715).
What it does
Dependency
Pairs with the _AgentThreatRulesDataset loader: the dataset supplies ATR-derived adversarial prompts, and this scorer detects whether a response trips an ATR rule.