
Thread-unsafe lazy init in Model.__new__ causes silent deserialization corruption under concurrency (azure-ai-documentintelligence and azure-ai-vision-imageanalysis) #47426

Description

@noxthot
  • Package Name: azure-ai-documentintelligence, azure-ai-vision-imageanalysis
  • Package Version: 1.0.2 (azure-ai-documentintelligence), 1.0.0 (azure-ai-vision-imageanalysis)
  • Operating System: Ubuntu 24.04
  • Python Version: 3.14

Disclaimer
Assisted by Claude (emphasis on assisted)

The race condition was observed "in the wild". The minimal example and the diagnosis were produced with AI assistance under close human guidance, and the resulting issue has (hopefully) been cleaned of most of the AI clutter.

I am always careful with concurrency issues and the (possible) necessity of adding locks, but the attached diff does exactly that and fixes the described issue. That does not necessarily mean it is the right place to apply a fix, but it at least pinpoints the problematic code.

Describe the bug

A race condition in the generated _model_base.py causes deserialization of HTTP responses to silently return raw JSON dicts (or partially deserialized models whose nested fields are raw dicts) instead of model objects, when the first model objects of a process are constructed concurrently, e.g. when an application issues its first SDK calls from a thread pool.

Affected (at least):

  • azure-ai-vision-imageanalysis 1.0.0 — analyze() can return a plain dict instead of ImageAnalysisResult, or an ImageAnalysisResult whose nested fields (blocks, lines, words, …) are raw dicts.
  • azure-ai-documentintelligence 1.0.2 — same for AnalyzeResult.
  • Most likely every SDK that vendors _model_base.py, including those on the current main branch of this repo. The file is present in most of the modules; if it is auto-generated, its header should say so.

Likely root cause: Model.__new__ initializes the per-class metadata _attr_to_rest_field lazily and without any locking. It iterates the class __dict__s via cls.__mro__ and then assigns the new _attr_to_rest_field attribute into one of those same __dict__s (_model_base.py lines 510–530 in azure-ai-vision-imageanalysis 1.0.0).
When two threads construct the first instances of related model classes concurrently, one thread's iteration races with the other thread's assignment and raises RuntimeError: dictionary changed size during iteration. In addition, shared mutable _RestField state (e.g. rf._type) is written without synchronization.
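
The failure mode itself can be seen without any threads. The following is a minimal single-threaded sketch, assuming CPython's dict iterator behaviour; `Demo` and `iterate_while_assigning` are hypothetical names for illustration, not SDK code. It iterates a class `__dict__` and assigns a new class attribute mid-iteration, which is effectively what a second thread does to the first one in the race:

```python
class Demo:
    """Hypothetical stand-in for a generated model class."""

    field_a = 1
    field_b = 2


def iterate_while_assigning(cls):
    """Iterate cls.__dict__ and add a new class attribute mid-iteration.

    Returns the message of the resulting RuntimeError, or None if none was raised.
    """
    try:
        for _name, _value in cls.__dict__.items():
            # Adding a *new* key changes the size of the dict being iterated.
            setattr(cls, "_attr_to_rest_field", {})
    except RuntimeError as exc:
        return str(exc)
    return None


message = iterate_while_assigning(Demo)
print(message)
```

In the real race the assignment happens on another thread, so whether the iterator notices depends on timing, which would explain why only some of the concurrent results are corrupted.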

That RuntimeError never surfaces, because _deserialize_default swallows any exception and returns the raw input object instead (lines 744–754):

def _deserialize_default(deserializer, obj):
    if obj is None:
        return obj
    try:
        return _deserialize_with_callable(deserializer, obj)
    except Exception:
        pass
    return obj   # <- raw wire dict returned as the "deserialized" result

So instead of an error, the application receives corrupted results, which makes this very hard to diagnose in production: it only happens on the first few responses of a freshly started process, only under concurrency, and the symptom (an AttributeError like 'dict' object has no attribute 'words', possibly much later in unrelated application code) gives no hint of the cause.
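
To make the swallowing concrete, here is a self-contained sketch. `swallowing_deserialize` mirrors the shape of the `_deserialize_default` snippet quoted above; `failing_deserializer`, `strict_deserialize` and the logger name are hypothetical, and the strict variant is only one possible alternative, not the SDK's behaviour:

```python
import logging

logger = logging.getLogger("deserialize_sketch")  # hypothetical logger name


def failing_deserializer(obj):
    """Hypothetical deserializer that fails the way the racing __new__ does."""
    raise RuntimeError("dictionary changed size during iteration")


def swallowing_deserialize(deserializer, obj):
    """Same shape as the quoted _deserialize_default: any failure is hidden."""
    if obj is None:
        return obj
    try:
        return deserializer(obj)
    except Exception:
        pass
    return obj  # the raw wire dict comes back as if it were the result


def strict_deserialize(deserializer, obj):
    """One possible alternative: log the failure and let it propagate."""
    if obj is None:
        return obj
    try:
        return deserializer(obj)
    except Exception:
        logger.exception("deserialization failed")
        raise


payload = {"readResult": {"blocks": []}}
swallowed = swallowing_deserialize(failing_deserializer, payload)
print(type(swallowed).__name__, swallowed is payload)
```

With the swallowing variant the caller gets the very same dict object back and has no way to tell that anything went wrong; with the strict variant the RuntimeError would reach the caller at the point of failure.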

To Reproduce

Steps to reproduce the behavior:

  1. pip install azure-ai-vision-imageanalysis==1.0.0 azure-ai-documentintelligence==1.0.2
  2. Save the script below as azure_model_base_race_repro.py (no Azure credentials/endpoint needed — it calls _model_base._deserialize(Model, response_json) directly, which is exactly what the generated operations code does with every HTTP response body).
  3. python azure_model_base_race_repro.py ia (Image Analysis) or python azure_model_base_race_repro.py di (Document Intelligence).

Each trial wipes the azure modules from sys.modules so the lazy class state is uninitialized again, exactly like the first SDK response in a freshly started process, then deserializes a valid response payload from 32 threads concurrently.
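
Why wiping sys.modules resets the lazy state can be illustrated with a small sketch. It uses the standard-library module `textwrap` purely as a stand-in for the azure packages (an assumption for illustration; the repro deletes every module whose name starts with "azure"), and `reimport_fresh` is a hypothetical helper:

```python
import importlib
import sys
import textwrap  # noqa: F401  (stand-in for the azure packages; ensures it is loaded)


def reimport_fresh(module_name):
    """Drop a module from sys.modules and import it again, returning (old, new)."""
    old = sys.modules.pop(module_name)
    new = importlib.import_module(module_name)
    return old, new


old_mod, new_mod = reimport_fresh("textwrap")

# The module body ran again, so its classes are brand-new objects that carry
# none of the state lazily attached to the old class objects.
same_class = old_mod.TextWrapper is new_mod.TextWrapper
print(same_class)

sys.modules["textwrap"] = old_mod  # restore the original module for the rest of the process
```

Because the re-imported classes are new objects, any per-class metadata computed in an earlier trial is gone, and the next construction goes through the first-time path again.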

Observed output (Python 3.14.0, Linux x86_64; reproduces in 19–20 of 20 trials):

$ python azure_model_base_race_repro.py ia
trial  1: 2/32 results corrupted (e.g. got dict instead of ImageAnalysisResult)
trial  2: 6/32 results corrupted (e.g. partially deserialized model: AttributeError: 'dict' object has no attribute 'lines')
trial  3: 1/32 results corrupted (e.g. partially deserialized model: AttributeError: 'dict' object has no attribute 'lines')
...
trial 20: 1/32 results corrupted (e.g. got dict instead of ImageAnalysisResult)

BUG REPRODUCED: corrupted results in 20/20 trials
$ python azure_model_base_race_repro.py di
...
trial 19: 3/32 results corrupted (e.g. got dict instead of AnalyzeResult)
trial 20: 3/32 results corrupted (e.g. partially deserialized model: AttributeError: 'dict' object has no attribute 'words')

BUG REPRODUCED: corrupted results in 19/20 trials
azure_model_base_race_repro.py
"""Minimal reproduction: concurrent first-time model deserialization silently returns raw
dicts (or partially deserialized models) instead of model objects.

Affected:
  - azure-ai-vision-imageanalysis 1.0.0
  - azure-ai-documentintelligence 1.0.2

Usage:
    pip install azure-ai-vision-imageanalysis==1.0.0 azure-ai-documentintelligence==1.0.2
    python azure_model_base_race_repro.py ia     # Image Analysis
    python azure_model_base_race_repro.py di     # Document Intelligence

Runs multiple trials, where each trial wipes the azure modules from sys.modules so the
lazy class state is uninitialized again, exactly like the first SDK response in a freshly
started process.

Expected output (bug present): most trials report corrupted results, e.g.
    trial  1: 13/32 results corrupted (e.g. got dict instead of ImageAnalysisResult)

Expected output (bug fixed): "no corruption in 20 trials".

Likely root cause: the generated `_model_base.py` initializes model-class metadata lazily in
`Model.__new__`: it iterates the class `__dict__` (via `cls.__mro__`) to build
`_attr_to_rest_field` and then assigns that new attribute into the same `__dict__`,
without any locking. When the first model objects of a process are constructed
concurrently, this raises "RuntimeError: dictionary changed size during iteration" -
which `_deserialize_default` swallows with a broad `except Exception`, silently returning
the raw JSON wire dict (or a model whose nested fields are raw dicts) instead of failing.

`_deserialize(Model, response_json)` below is exactly what the generated operations code
calls on every HTTP response, so any application that issues its first SDK calls of a
process from a thread pool can receive corrupted results.
"""

import copy
import sys
import threading
import time

N_THREADS = 32
N_TRIALS = 20


# format of a successful READ response
IMAGE_ANALYSIS_RESPONSE = {
    "modelVersion": "2023-10-01",
    "metadata": {"width": 100, "height": 100},
    "readResult": {
        "blocks": [
            {
                "lines": [
                    {
                        "text": "hi",
                        "boundingPolygon": [
                            {"x": 0, "y": 0},
                            {"x": 1, "y": 0},
                            {"x": 1, "y": 1},
                            {"x": 0, "y": 1},
                        ],
                        "words": [
                            {
                                "text": "hi",
                                "boundingPolygon": [
                                    {"x": 0, "y": 0},
                                    {"x": 1, "y": 0},
                                    {"x": 1, "y": 1},
                                    {"x": 0, "y": 1},
                                ],
                                "confidence": 0.99,
                            }
                        ],
                    }
                ]
            }
        ]
    },
}


# format of a successful prebuilt-read AnalyzeResult
DOCUMENT_INTELLIGENCE_RESPONSE = {
    "apiVersion": "2024-11-30",
    "modelId": "prebuilt-read",
    "stringIndexType": "textElements",
    "content": "hi",
    "pages": [
        {
            "pageNumber": 1,
            "width": 100.0,
            "height": 100.0,
            "unit": "pixel",
            "words": [
                {
                    "content": "hi",
                    "polygon": [0, 0, 1, 0, 1, 1, 0, 1],
                    "confidence": 0.99,
                    "span": {"offset": 0, "length": 2},
                }
            ],
            "spans": [{"offset": 0, "length": 2}],
        }
    ],
}


def fresh_sdk(backend):
    """(Re-)import the SDK with all lazy model-class state uninitialized, like a fresh process."""
    for name in [m for m in sys.modules if m.startswith("azure")]:
        del sys.modules[name]

    if backend == "di":
        from azure.ai.documentintelligence import _model_base, models

        return (
            _model_base._deserialize,
            models.AnalyzeResult,
            DOCUMENT_INTELLIGENCE_RESPONSE,
        )
    else:
        from azure.ai.vision.imageanalysis import _model_base, models

        return (
            _model_base._deserialize,
            models.ImageAnalysisResult,
            IMAGE_ANALYSIS_RESPONSE,
        )


def describe_corruption(result, model_cls, backend):
    """Returns a description of how `result` is corrupted, or None if it deserialized correctly."""
    if not isinstance(result, model_cls):
        return f"got {type(result).__name__} instead of {model_cls.__name__}"
    try:
        if backend == "di":
            for page in result.pages:
                for word in page.words:
                    _ = word.content, word.confidence, word.polygon
        else:
            for block in result.read.blocks:
                for line in block.lines:
                    for word in line.words:
                        _ = word.text, word.confidence, word.bounding_polygon[0].x
    except Exception as e:  # nested field is a raw dict instead of a model
        return f"partially deserialized model: {type(e).__name__}: {e}"
    return None


def run_trial(backend):
    deserialize, model_cls, payload = fresh_sdk(backend)
    results = [None] * N_THREADS
    barrier = threading.Barrier(N_THREADS)
    go = False

    def work(i):
        """Each thread deserializes the same payload concurrently, like the first few HTTP responses of a process.
        Before doing so, all threads are synchronized to maximize concurrency of the critical section."""

        local_payload = copy.deepcopy(
            payload
        )  # independent input per thread, like real responses
        barrier.wait()

        while (
            not go
        ):  # spin so all threads are runnable and contend the moment they are released
            pass

        results[i] = deserialize(model_cls, local_payload)  # critical section

    threads = [threading.Thread(target=work, args=(i,)) for i in range(N_THREADS)]

    for t in threads:
        t.start()

    time.sleep(0.05)  # let every thread reach the spin loop

    go = True  # all threads should now be ready to contend the critical section

    # Wait for all threads to finish
    for t in threads:
        t.join()

    return [
        c for r in results if (c := describe_corruption(r, model_cls, backend))
    ]  # Check for any corrupted results and return their descriptions


def main():
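    # fall back to None so a missing argument prints the usage message instead of raising IndexError
    backend = sys.argv[1] if len(sys.argv) > 1 else None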

    if backend not in ("ia", "di"):
        sys.exit(f"Usage: {sys.argv[0]} [ia|di]")

    sys.setswitchinterval(1e-6)  # maximize thread switches to widen the race window

    corrupted_trials = 0

    for trial in range(1, N_TRIALS + 1):
        corruptions = run_trial(backend)

        if corruptions:
            corrupted_trials += 1
            print(
                f"trial {trial:2}: {len(corruptions)}/{N_THREADS} results corrupted (e.g. {corruptions[0]})"
            )
        else:
            print(f"trial {trial:2}: no corruption")

    if corrupted_trials:
        print(
            f"\nBUG REPRODUCED: corrupted results in {corrupted_trials}/{N_TRIALS} trials"
        )
        sys.exit(1)

    print(f"no corruption in {N_TRIALS} trials")


if __name__ == "__main__":
    main()

Expected behavior

_deserialize(Model, response_json) always returns a fully deserialized model object, regardless of how many threads are deserializing concurrently. In other words, the lazy initialization in Model.__new__ should be thread-safe, e.g. via double-checked locking around the _attr_to_rest_field computation. Independently of the race itself, a failure inside deserialization should arguably not be silently swallowed by the broad except Exception in _deserialize_default, since returning the raw wire dict turns an internal error into silent data corruption.
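
As a rough illustration of the double-checked locking pattern mentioned above, here is a minimal, self-contained sketch. `LazyMeta`, `Word`, `_fields`, `_init_count` and `construct_concurrently` are hypothetical names; this is not the SDK's Model class, only the same idea in miniature:

```python
import threading


class LazyMeta:
    """Hypothetical base class with lazily computed per-class metadata."""

    _calculated = set()  # qualified names of classes whose metadata is ready
    _calculated_lock = threading.Lock()
    _init_count = 0  # how often the slow path ran (for the demo only)

    def __new__(cls, *args, **kwargs):
        key = f"{cls.__module__}.{cls.__qualname__}"
        if key not in cls._calculated:  # fast path, no lock taken
            with cls._calculated_lock:
                if key not in cls._calculated:  # re-check under the lock
                    fields = {
                        k: v
                        for k, v in cls.__dict__.items()
                        if not k.startswith("_")
                    }
                    cls._fields = fields  # publish the metadata first ...
                    LazyMeta._init_count += 1
                    cls._calculated.add(key)  # ... then mark it as calculated
        return super().__new__(cls)


class Word(LazyMeta):
    text = None
    confidence = None


def construct_concurrently(n_threads=16):
    """Construct Word from many threads at once; return (slow-path runs, field names)."""
    barrier = threading.Barrier(n_threads)

    def work():
        barrier.wait()  # release all threads at the same moment
        Word()

    threads = [threading.Thread(target=work) for _ in range(n_threads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return LazyMeta._init_count, sorted(Word._fields)


count, field_names = construct_concurrently()
print(count, field_names)
```

The class-level lock is shared by all subclasses, so the `__dict__` iteration of one class can never overlap with the attribute assignment of another, and the "calculated" marker is only set after the metadata has been published.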

Screenshots

N/A (full console output included above).

Additional context

  • Environment: Python 3.14.0, Linux x86_64, azure-ai-vision-imageanalysis 1.0.0, azure-ai-documentintelligence 1.0.2, azure-core 1.41.0. The repro sets sys.setswitchinterval(1e-6) to widen the race window, but the race also triggers with the default switch interval; the tight setting just makes the repro far more reliable (19–20 of 20 trials in the output above).
  • The unguarded __new__ is still present on current main, e.g. sdk/vision/azure-ai-vision-imageanalysis/.../_model_base.py#L510, so newly generated/released SDKs are affected as well. Since _model_base.py is emitted by typespec-python, the proper fix is presumably in the emitter template, with regeneration of affected packages.
  • A straightforward fix that makes the repro pass: guard the metadata computation in Model.__new__ with a class-level lock and an "already calculated" check (double-checked locking), so the __dict__ iteration can never race with the _attr_to_rest_field assignment. I'm happy to submit a PR.
  • Real-world impact: any service that starts worker processes and immediately serves requests from a thread pool can return corrupted results for its first few requests after every process start/restart, with no error logged by the SDK.
  • Since _model_base.py appears to be autogenerated, I am not adding a PR. Locally, however, the following git diff fixes the issue. Please note that this fix was generated by Claude; it essentially adds a lock around the critical section, which seems reasonable to me, but I am always careful when it comes to concurrency:
git diff
diff --git a/sdk/documentintelligence/azure-ai-documentintelligence/azure/ai/documentintelligence/_model_base.py b/sdk/documentintelligence/azure-ai-documentintelligence/azure/ai/documentintelligence/_model_base.py
index 7f73b97b23e..d999a521c9e 100644
--- a/sdk/documentintelligence/azure-ai-documentintelligence/azure/ai/documentintelligence/_model_base.py
+++ b/sdk/documentintelligence/azure-ai-documentintelligence/azure/ai/documentintelligence/_model_base.py
@@ -13,6 +13,7 @@ import decimal
 import functools
 import sys
 import logging
+import threading
 import base64
 import re
 import typing
@@ -495,6 +496,10 @@ class Model(_MyMutableMapping):
     # label whether current class's _attr_to_rest_field has been calculated
     # could not see _attr_to_rest_field directly because subclass inherits it from parent class
     _calculated: typing.Set[str] = set()
+    # serializes first-time calculation of _attr_to_rest_field: assigning it into a class
+    # __dict__ that a concurrent __new__ is iterating over raises "dictionary changed size
+    # during iteration", which deserialization fallbacks then swallow into corrupted results
+    _calculated_lock = threading.Lock()
 
     def __init__(self, *args: typing.Any, **kwargs: typing.Any) -> None:
         class_name = self.__class__.__name__
@@ -576,26 +581,31 @@ class Model(_MyMutableMapping):
 
     def __new__(cls, *args: typing.Any, **kwargs: typing.Any) -> Self:
         if f"{cls.__module__}.{cls.__qualname__}" not in cls._calculated:
-            # we know the last nine classes in mro are going to be 'Model', '_MyMutableMapping', 'MutableMapping',
-            # 'Mapping', 'Collection', 'Sized', 'Iterable', 'Container' and 'object'
-            mros = cls.__mro__[:-9][::-1]  # ignore parents, and reverse the mro order
-            attr_to_rest_field: typing.Dict[str, _RestField] = {  # map attribute name to rest_field property
-                k: v for mro_class in mros for k, v in mro_class.__dict__.items() if k[0] != "_" and hasattr(v, "_type")
-            }
-            annotations = {
-                k: v
-                for mro_class in mros
-                if hasattr(mro_class, "__annotations__")
-                for k, v in mro_class.__annotations__.items()
-            }
-            for attr, rf in attr_to_rest_field.items():
-                rf._module = cls.__module__
-                if not rf._type:
-                    rf._type = rf._get_deserialize_callable_from_annotation(annotations.get(attr, None))
-                if not rf._rest_name_input:
-                    rf._rest_name_input = attr
-            cls._attr_to_rest_field: typing.Dict[str, _RestField] = dict(attr_to_rest_field.items())
-            cls._calculated.add(f"{cls.__module__}.{cls.__qualname__}")
+            with cls._calculated_lock:
+                if f"{cls.__module__}.{cls.__qualname__}" not in cls._calculated:
+                    # we know the last nine classes in mro are going to be 'Model', '_MyMutableMapping',
+                    # 'MutableMapping', 'Mapping', 'Collection', 'Sized', 'Iterable', 'Container' and 'object'
+                    mros = cls.__mro__[:-9][::-1]  # ignore parents, and reverse the mro order
+                    attr_to_rest_field: typing.Dict[str, _RestField] = {  # map attribute name to rest_field property
+                        k: v
+                        for mro_class in mros
+                        for k, v in mro_class.__dict__.items()
+                        if k[0] != "_" and hasattr(v, "_type")
+                    }
+                    annotations = {
+                        k: v
+                        for mro_class in mros
+                        if hasattr(mro_class, "__annotations__")
+                        for k, v in mro_class.__annotations__.items()
+                    }
+                    for attr, rf in attr_to_rest_field.items():
+                        rf._module = cls.__module__
+                        if not rf._type:
+                            rf._type = rf._get_deserialize_callable_from_annotation(annotations.get(attr, None))
+                        if not rf._rest_name_input:
+                            rf._rest_name_input = attr
+                    cls._attr_to_rest_field: typing.Dict[str, _RestField] = dict(attr_to_rest_field.items())
+                    cls._calculated.add(f"{cls.__module__}.{cls.__qualname__}")
 
         return super().__new__(cls)  # pylint: disable=no-value-for-parameter
 
diff --git a/sdk/vision/azure-ai-vision-imageanalysis/azure/ai/vision/imageanalysis/_model_base.py b/sdk/vision/azure-ai-vision-imageanalysis/azure/ai/vision/imageanalysis/_model_base.py
index 43fd8c7e9b1..1842ec95ea1 100644
--- a/sdk/vision/azure-ai-vision-imageanalysis/azure/ai/vision/imageanalysis/_model_base.py
+++ b/sdk/vision/azure-ai-vision-imageanalysis/azure/ai/vision/imageanalysis/_model_base.py
@@ -11,6 +11,7 @@ import calendar
 import decimal
 import functools
 import sys
+import threading
 import logging
 import base64
 import re
@@ -476,6 +477,13 @@ def _create_value(rf: typing.Optional["_RestField"], value: typing.Any) -> typin
 
 class Model(_MyMutableMapping):
     _is_model = True
+    # label whether current class's _attr_to_rest_field has been calculated
+    # could not see _attr_to_rest_field directly because subclass inherits it from parent class
+    _calculated: typing.Set[str] = set()
+    # serializes first-time calculation of _attr_to_rest_field: assigning it into a class
+    # __dict__ that a concurrent __new__ is iterating over raises "dictionary changed size
+    # during iteration", which deserialization fallbacks then swallow into corrupted results
+    _calculated_lock = threading.Lock()
 
     def __init__(self, *args: typing.Any, **kwargs: typing.Any) -> None:
         class_name = self.__class__.__name__
@@ -508,24 +516,31 @@ class Model(_MyMutableMapping):
         return Model(self.__dict__)
 
     def __new__(cls, *args: typing.Any, **kwargs: typing.Any) -> Self:  # pylint: disable=unused-argument
-        # we know the last three classes in mro are going to be 'Model', 'dict', and 'object'
-        mros = cls.__mro__[:-3][::-1]  # ignore model, dict, and object parents, and reverse the mro order
-        attr_to_rest_field: typing.Dict[str, _RestField] = {  # map attribute name to rest_field property
-            k: v for mro_class in mros for k, v in mro_class.__dict__.items() if k[0] != "_" and hasattr(v, "_type")
-        }
-        annotations = {
-            k: v
-            for mro_class in mros
-            if hasattr(mro_class, "__annotations__")  # pylint: disable=no-member
-            for k, v in mro_class.__annotations__.items()  # pylint: disable=no-member
-        }
-        for attr, rf in attr_to_rest_field.items():
-            rf._module = cls.__module__
-            if not rf._type:
-                rf._type = rf._get_deserialize_callable_from_annotation(annotations.get(attr, None))
-            if not rf._rest_name_input:
-                rf._rest_name_input = attr
-        cls._attr_to_rest_field: typing.Dict[str, _RestField] = dict(attr_to_rest_field.items())
+        if f"{cls.__module__}.{cls.__qualname__}" not in cls._calculated:
+            with cls._calculated_lock:
+                if f"{cls.__module__}.{cls.__qualname__}" not in cls._calculated:
+                    # we know the last three classes in mro are going to be 'Model', 'dict', and 'object'
+                    mros = cls.__mro__[:-3][::-1]  # ignore model, dict, and object parents, and reverse the mro order
+                    attr_to_rest_field: typing.Dict[str, _RestField] = {  # map attribute name to rest_field property
+                        k: v
+                        for mro_class in mros
+                        for k, v in mro_class.__dict__.items()
+                        if k[0] != "_" and hasattr(v, "_type")
+                    }
+                    annotations = {
+                        k: v
+                        for mro_class in mros
+                        if hasattr(mro_class, "__annotations__")  # pylint: disable=no-member
+                        for k, v in mro_class.__annotations__.items()  # pylint: disable=no-member
+                    }
+                    for attr, rf in attr_to_rest_field.items():
+                        rf._module = cls.__module__
+                        if not rf._type:
+                            rf._type = rf._get_deserialize_callable_from_annotation(annotations.get(attr, None))
+                        if not rf._rest_name_input:
+                            rf._rest_name_input = attr
+                    cls._attr_to_rest_field: typing.Dict[str, _RestField] = dict(attr_to_rest_field.items())
+                    cls._calculated.add(f"{cls.__module__}.{cls.__qualname__}")
 
         return super().__new__(cls)  # pylint: disable=no-value-for-parameter
