Note on test data used below: neither CPF value in this report is a real person's identifier.
123.456.789-00 has the CPF format but is mathematically invalid: it fails the public Brazilian check-digit algorithm, which requires check digits 09 (not 00) for the base 123.456.789. It is a non-issuable placeholder commonly used as obvious dummy data.
998.214.865-68 is a checksum-valid CPF but is not a real person's number: it is the exact value Microsoft itself ships in this SDK's official sample sample_recognize_pii_entities.py ("...Brazilian CPF number 998.214.865-68"). I reuse it here only to keep the repro identical to the official sample.
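The check-digit status of both values can be verified locally. The following is a minimal sketch of the publicly documented CPF mod-11 check-digit calculation. The names cpf_check_digits and is_valid_cpf are hypothetical helpers written for this report, and nothing here is meant to describe how the service itself validates CPFs.

```python
import re


def cpf_check_digits(base: str) -> str:
    """Return the two check digits for a 9-digit CPF base (public mod-11 rule)."""
    digits = [int(c) for c in base]
    for _ in range(2):
        # Weights count down to 2, starting at len(digits) + 1 (10, then 11).
        total = sum(d * w for d, w in zip(digits, range(len(digits) + 1, 1, -1)))
        remainder = total % 11
        digits.append(0 if remainder < 2 else 11 - remainder)
    return f"{digits[9]}{digits[10]}"


def is_valid_cpf(cpf: str) -> bool:
    """Hypothetical helper: True if the last two digits match the computed ones."""
    digits = re.sub(r"\D", "", cpf)  # drop dots and the dash
    if len(digits) != 11:
        return False
    return digits[9:] == cpf_check_digits(digits[:9])


print(is_valid_cpf("123.456.789-00"))  # False: the base requires check digits 09
print(is_valid_cpf("998.214.865-68"))  # True
```

This sketch checks the checksum only. It does not reject repeated-digit values such as 111.111.111-11, which pass the checksum but are conventionally treated as invalid.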
- Package Name: azure-ai-textanalytics
- Package Version: latest
- Operating System: Windows
- Python Version: 3.x
Describe the issue
The docstring and reference docs for TextAnalyticsClient.recognize_pii_entities and its categories_filter keyword describe the request shape (which categories to filter for) but do not indicate that detection criteria differ per category. Some categories appear to use format-only matching, while others apparently apply additional, undocumented validation (e.g., a checksum).
This is more of a clarification request than a bug.
To Reproduce
```python
from azure.ai.textanalytics import TextAnalyticsClient
from azure.core.credentials import AzureKeyCredential

# endpoint and key are the Language resource endpoint URL and API key (defined elsewhere)
client = TextAnalyticsClient(endpoint, AzureKeyCredential(key))

# (A) syntactically valid CPF format, invalid check digits -> NOT detected
res_a = client.recognize_pii_entities(
    ["Entre em contato pelo CPF 123.456.789-00"], language="pt"
)

# (B) valid CPF (same value used in this SDK's official PII sample) -> detected as BRCPFNumber
res_b = client.recognize_pii_entities(
    ["Entre em contato pelo CPF 998.214.865-68"], language="pt"
)
```
No categories_filter is passed in either case, so the default detection set applies. (A) returns no BRCPFNumber; (B) does.
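To make the difference between (A) and (B) easy to see, each result can be reduced to the set of categories it contains. This is a minimal sketch: detected_categories is a hypothetical helper, and it relies only on each result exposing is_error and an entities list whose items carry a category attribute. The stub objects stand in for a live response so the snippet runs without a resource. They mirror the behavior reported here and are not captured service output.

```python
from types import SimpleNamespace


def detected_categories(results):
    """Hypothetical helper: one set of entity categories per input document."""
    return [
        set() if doc.is_error else {entity.category for entity in doc.entities}
        for doc in results
    ]


# Stand-ins shaped like the SDK result objects (assumption: only is_error,
# entities, and category are needed), mirroring the behavior reported above.
stub_a = [SimpleNamespace(is_error=False, entities=[])]
stub_b = [
    SimpleNamespace(
        is_error=False,
        entities=[SimpleNamespace(text="998.214.865-68", category="BRCPFNumber")],
    )
]

print("BRCPFNumber" in detected_categories(stub_a)[0])  # False
print("BRCPFNumber" in detected_categories(stub_b)[0])  # True
```

With a live client, the same helper would be called as detected_categories(res_a) and detected_categories(res_b).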
Expected behavior
Either:
- The SDK reference / conceptual doc notes that recognition criteria vary per category and may include semantic validation beyond format (and links to a service-level page that lists per-category criteria), or
- The service-level doc page (Recognized PII and PHI entities) is updated and the SDK references it.
Note: I attempted to file the documentation-side report directly against MicrosoftDocs/azure-docs, but that repository now has GitHub Issues disabled (has_issues: false), so this SDK issue is the only public channel available for the report. Routing the service-side fix internally would be appreciated.
Why this matters for SDK users
Without this signal, developers writing tests with placeholder PII (a common pattern) get silent false negatives and have to reverse-engineer the detection criteria empirically. This was the path I took to discover the behavior.