Skip to content

FEAT add AgentThreatRulesScorer (ATR taxonomy scorer)#1893

Open
eeee2345 wants to merge 3 commits into
microsoft:mainfrom
eeee2345:feat/atr-taxonomy-scorer
Open

FEAT add AgentThreatRulesScorer (ATR taxonomy scorer)#1893
eeee2345 wants to merge 3 commits into
microsoft:mainfrom
eeee2345:feat/atr-taxonomy-scorer

Conversation

@eeee2345

@eeee2345 eeee2345 commented Jun 2, 2026

Copy link
Copy Markdown
Contributor

Adds AgentThreatRulesScorer, the scorer half of #1702 (the dataset loader landed in #1715).

What it does

  • A deterministic TrueFalseScorer that evaluates text against the open Agent Threat Rules (ATR) ruleset via the pyatr engine.
  • Returns True when at least one rule at or above a configurable min_severity matches; attaches matched rule ids, ATR category, and max severity as score metadata.
  • Mirrors SubStringScorer's shape (true_false base, _score_piece_async, _build_identifier, min_severity validation).

Dependency

  • pyatr (>=0.2.6, which bundles the ATR ruleset) is an optional dependency, imported lazily with a clear ImportError. The unit test uses pytest.importorskip("pyatr"); if you'd like CI to exercise it, pyatr can be added to the test extras — happy to wire that into whichever group you prefer.

Pairs with the _AgentThreatRulesDataset loader: the dataset supplies ATR-derived adversarial prompts, and this scorer detects whether a response trips an ATR rule.

Add a deterministic TrueFalseScorer that evaluates text against the open
Agent Threat Rules (ATR) ruleset via the pyatr engine and returns True when
a rule at or above a configurable min_severity matches, attaching matched
rule ids / ATR category / max severity as score metadata. Mirrors
SubStringScorer; pyatr (>=0.2.6) is an optional dependency. Scorer half of

Signed-off-by: Adam Lin <adam@agentthreatrule.org>
@eeee2345 eeee2345 force-pushed the feat/atr-taxonomy-scorer branch from e55c2a6 to 4d55a7d Compare June 4, 2026 22:27
@eeee2345

Copy link
Copy Markdown
Contributor Author

Freshened this onto the latest main — it is up to date and mergeable now. Ready for review when convenient.

@adrian-gavrila adrian-gavrila self-assigned this Jun 12, 2026

@adrian-gavrila adrian-gavrila left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for picking this up! Overall it looks good, but there are a couple of things that should be addressed, primarily around things actually being excercised in CI and following some style guidelines.

Comment on lines +108 to +113
# pyatr returns matches sorted by severity (critical first).
hits = [m for m in matches if _SEVERITY_ORDER.get((m.severity or "").lower(), 0) >= self._severity_floor]
triggered = bool(hits)

if triggered:
top = hits[0]

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

top = hits[0] relies on pyatr returning matches pre-sorted critical-first (per the comment on line 108). That ordering is an undocumented pyatr internal, so if a future release changes or drops the sort, max_severity and the highest-severity value in the description silently go wrong while every test still passes.

We already own a severity ranking in _SEVERITY_ORDER, so we can pick the top match ourselves instead of trusting the engine's order:

top = max(hits, key=lambda m: _SEVERITY_ORDER.get((m.severity or "").lower(), 0))

Separately, there's a casing inconsistency: the filter on line 109 lowercases severity before comparing, but line 122 stores top.severity raw. If pyatr ever emits mixed case, the stored max_severity won't match what the filter normalized.

import pytest

# The scorer relies on the optional `pyatr` engine; skip if it is not installed.
pytest.importorskip("pyatr")

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This importorskip is at module scope, so it gates the whole file, and pyatr isn't declared anywhere in pyproject.toml. CI never installs it, so all four tests are skipped on every run. That's also why the broken assertion in test_atr_scorer_passes_benign below was never caught.

Two things to fix it:

  1. Wire pyatr>=0.2.6 into an optional extra and the all group so CI installs it and the live tests actually run.
  2. Move test_atr_scorer_rejects_invalid_min_severity out from behind this module-level skip. It only asserts that a bad min_severity raises ValueError, which happens before the pyatr import in __init__, so it doesn't need the engine and shouldn't be gated.

return self._create_identifier(
params={
"score_aggregator": self._score_aggregator.__name__, # type: ignore[ty:unresolved-attribute]
"min_severity": self._min_severity,

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nit, optional and low likelihood: _build_identifier omits rules_dir, so two scorers loading different custom rulesets share an eval_hash (and stored scorer_class_identifier). It only bites if you pass a custom rules_dir and use the scorer-eval metrics harness, so the bundled-ruleset path is never affected. Cheap to close though, and consistent with how SubStringScorer includes substring in its identifier:

Suggested change
"min_severity": self._min_severity,
"min_severity": self._min_severity,
"rules_dir": self._rules_dir,

# Copyright (c) Microsoft Corporation.
# Licensed under the MIT license.

from typing import Optional

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This file currently fails ruff check under the repo config (pyrit/score/** isn't DOC/D-exempt), so CI will be red until these are addressed. I ran it locally; the failures are:

  • UP045 (x5): Optional[...] should be X | None (lines 4, 42, 43, 45, 94, 119). This is the modern union syntax the codebase standardized on; SubStringScorer, which this file mirrors, already uses it.
  • D213 (x3): multi-line docstring summary should start on the second line (lines 19, 47, 95), matching SubStringScorer's docstring style.
  • DOC501 (x2): __init__ raises ValueError (bad min_severity) and ImportError (missing pyatr), but neither is in a Raises: section.
  • DOC201: _score_piece_async describes its return in prose but has no Returns: section.

ruff check --fix auto-resolves the UP045 and D213; the Raises: and Returns: sections need to be added by hand.


assert len(scores) == 1
assert scores[0].get_value() is False
assert scores[0].score_metadata is None

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Two assertion issues here, both of which only surface once this file actually runs in CI (see the coverage comment at the top of the file):

  • Line 34: assert scores[0].score_metadata is None can never pass. The Score model has a mode="before" validator that coerces None to {}, so on the benign path this is effectively assert {} is None. I confirmed it by installing pyatr locally and running the module: 1 failed, 3 passed. Assert the no-match contract you actually want, e.g. not scores[0].score_metadata or == {}.
  • Line 23: assert scores[0].score_metadata is not None is vacuous for the same reason (metadata is always a dict), so it never tests anything. The next line (["matched_rule_ids"]) is the real check, so this one can go.

…ertions

- Sort hits by severity explicitly; don't rely on pyatr internal ordering
- Add pyatr>=0.2.6 as an optional 'atr' extra + into 'all' so CI installs it
- Ungate test_atr_scorer_rejects_invalid_min_severity (no engine needed);
  gate the three engine tests individually with skipif
- Fix benign assertion (== {}), drop vacuous 'is not None'
- _build_identifier includes rules_dir
- ruff: Optional -> X | None, add Raises/Returns, D213
@eeee2345

Copy link
Copy Markdown
Contributor Author

Thanks for the thorough review, Adrian — all addressed in the latest push:

  • Severity selection no longer relies on pyatr's ordering: the scorer now sorts hits by severity explicitly before picking max_severity.
  • Wired pyatr>=0.2.6 as an optional atr extra and into the all group so CI installs it and the live tests actually run (kept out of dev per the functional-deps convention).
  • Moved test_atr_scorer_rejects_invalid_min_severity out from behind the module-level skip — it asserts the ValueError that's raised before the pyatr import, so it no longer needs the engine. The three engine tests are now gated individually with a skipif marker.
  • Fixed the assertions: benign now checks == {} (the model validator coerces None), and dropped the vacuous is-not-None line.
  • _build_identifier now includes rules_dir so custom rulesets don't share an eval_hash.
  • ruff clean: Optional[...] -> X | None, plus the Raises/Returns/D213 docstring fixes.

Verified locally: ruff passes, and against pyatr 0.2.6 the injection string trips 5 rules (top severity critical) while the benign string trips none, so the severity-floor tests hold. Ready for another look when you have a moment.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants