[azure-ai-textanalytics] recognize_pii_entities: per-category recognition criteria (e.g., BRCPFNumber checksum) are undocumented #47486

@walima0103

Description

Note on test data used below: Neither CPF value in this report is a real person's identifier.

  • 123.456.789-00 has the CPF format but is mathematically invalid: the public Brazilian check-digit algorithm yields 09, not 00, for the base 123456789. It is a non-issuable placeholder commonly used as obvious dummy data.
  • 998.214.865-68 has valid check digits but is not a real person's number. It is the exact value Microsoft itself ships in this SDK's official sample sample_recognize_pii_entities.py ("...Brazilian CPF number 998.214.865-68"). I reuse it here only to keep the repro identical to the official sample.
  • Package Name: azure-ai-textanalytics
  • Package Version: latest
  • Operating System: Windows
  • Python Version: 3.x

Describe the issue

The docstring and reference docs for TextAnalyticsClient.recognize_pii_entities and the categories_filter keyword describe the request shape (which categories to filter for) but give no indication that detection criteria differ per category. Some categories appear to use format-only matching, while others apparently apply additional validation (e.g., a checksum) that is not documented.

This is more of a clarification request than a bug.

To Reproduce

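import os

from azure.ai.textanalytics import TextAnalyticsClient
from azure.core.credentials import AzureKeyCredential

# Endpoint and key are read from the environment; the variable names are placeholders.
endpoint = os.environ["AZURE_LANGUAGE_ENDPOINT"]
key = os.environ["AZURE_LANGUAGE_KEY"]
client = TextAnalyticsClient(endpoint, AzureKeyCredential(key))

# (A) syntactically valid CPF format, invalid check digits -> NOT detected
res_a = client.recognize_pii_entities(
    ["Entre em contato pelo CPF 123.456.789-00"], language="pt"
)
# (B) valid CPF (same value used in this SDK's official PII sample) -> detected as BRCPFNumber
res_b = client.recognize_pii_entities(
    ["Entre em contato pelo CPF 998.214.865-68"], language="pt"
)

# Print the (text, category) pairs found in each document.
for label, res in (("A", res_a), ("B", res_b)):
    doc = res[0]
    print(label, [] if doc.is_error else [(e.text, e.category) for e in doc.entities])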

No categories_filter is passed in either case, so the default detection set applies. (A) returns no BRCPFNumber; (B) does.
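
The difference between (A) and (B) is consistent with the public CPF check-digit (mod-11) rule. Below is a minimal sketch of that rule for checking test fixtures locally. The function names are hypothetical, and nothing here is confirmed to be what the service actually runs; it only shows that (A) fails the public algorithm and (B) passes it.

```python
def cpf_check_digits(base: str) -> str:
    """Return the two check digits for a nine-digit CPF base (public mod-11 rule)."""
    digits = [int(ch) for ch in base]
    if len(digits) != 9:
        raise ValueError("CPF base must have exactly nine digits")
    for _ in range(2):
        # Weights run from len(digits) + 1 down to 2, left to right.
        weights = range(len(digits) + 1, 1, -1)
        remainder = sum(d * w for d, w in zip(digits, weights)) % 11
        digits.append(0 if remainder < 2 else 11 - remainder)
    return f"{digits[9]}{digits[10]}"


def has_valid_cpf_check_digits(cpf: str) -> bool:
    """Check only the check digits; punctuation such as '.' and '-' is ignored."""
    digits = "".join(ch for ch in cpf if ch.isdigit())
    return len(digits) == 11 and cpf_check_digits(digits[:9]) == digits[9:]


print(has_valid_cpf_check_digits("123.456.789-00"))  # (A) -> False
print(has_valid_cpf_check_digits("998.214.865-68"))  # (B) -> True
```

This sketch checks the two check digits only. Common validators also reject repeated-digit values such as 111.111.111-11, which pass the mod-11 rule; whether the service does the same is unknown.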

Expected behavior

Either:

  1. The SDK reference / conceptual doc points out that recognition criteria vary per category and may include semantic validation beyond format (and points to a service-level page that lists per-category criteria), or
  2. The service-level doc page (Recognized PII and PHI entities) is updated and the SDK references it.

Note: I attempted to file the documentation-side report directly against MicrosoftDocs/azure-docs, but that repository now has GitHub Issues disabled (has_issues: false), so this SDK issue is the only public channel available for the report. Routing the service-side fix internally would be appreciated.

Why this matters for SDK users

Without this signal, developers writing tests with placeholder PII (a common pattern) get silent false negatives and have to reverse-engineer the detection criteria empirically. This was the path I took to discover the behavior.

Metadata

    Labels

    customer-reported: Issues that are reported by GitHub users external to the Azure organization.
    needs-triage: Workflow: This is a new issue that needs to be triaged to the appropriate team.
    question: The issue doesn't require a change to the product in order to be resolved. Most issues start as that