Thread-unsafe lazy init in `Model.__new__` causes silent deserialization corruption under concurrency (`azure-ai-documentintelligence` and `azure-ai-vision-imageanalysis`)

- **Package Name**:
  - `azure-ai-documentintelligence 1.0.2`
  - `azure-ai-vision-imageanalysis 1.0.0`
- **Package Version**: see above
- **Operating System**: Ubuntu 24.04
- **Python Version**: `3.14`

**Disclaimer**
Assisted by Claude (emphasis on *assisted*)

The race condition was observed "in the wild". The minimal example and the diagnosis were produced with AI assistance under close human guidance, and the resulting issue text has (hopefully) been cleaned of most AI clutter.

I am always careful when it comes to concurrency issues and the (possible) necessity of adding locks, but the attached diff does exactly that and fixes the described issue. That does not necessarily mean it is the right place to apply a fix, but at least it pinpoints the problematic code.

**Describe the bug**

A race condition in the generated `_model_base.py` causes deserialization of HTTP responses to **silently return raw JSON dicts (or partially deserialized models whose nested fields are raw dicts) instead of model objects**, when the first model objects of a process are constructed concurrently, e.g. when an application issues its first SDK calls from a thread pool.

Affected (at least):
- `azure-ai-vision-imageanalysis` 1.0.0 — `analyze()` can return a plain `dict` instead of `ImageAnalysisResult`, or an `ImageAnalysisResult` whose nested fields (`blocks`, `lines`, `words`, …) are raw dicts.
- `azure-ai-documentintelligence` 1.0.2 — same for `AnalyzeResult`.
- Most likely **every SDK that vendors `_model_base.py`**, including the current `main` branch of this repo. The file is present in most of the modules and appears to be auto-generated; if so, its header should say that.

Likely root cause: `Model.__new__` initializes the per-class metadata `_attr_to_rest_field` lazily and without any locking. It iterates the class `__dict__`s via `cls.__mro__` and then assigns the new `_attr_to_rest_field` attribute into one of those same `__dict__`s ([`_model_base.py` lines 510–530 in `azure-ai-vision-imageanalysis` 1.0.0](https://github.com/Azure/azure-sdk-for-python/blob/azure-ai-vision-imageanalysis_1.0.0/sdk/vision/azure-ai-vision-imageanalysis/azure/ai/vision/imageanalysis/_model_base.py#L510-L530)).
When two threads construct the first instances of related model classes concurrently, one thread's iteration races with the other thread's assignment and raises `RuntimeError: dictionary changed size during iteration` (shared mutable `_RestField` state, e.g. `rf._type`, is also written without synchronization).
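
The underlying failure can be shown without any threads. The following is a minimal, single-threaded sketch (the `Demo` class and the `iterate_and_assign` helper are hypothetical and not part of the SDK): it iterates a class `__dict__` and assigns a new class attribute mid-iteration, which is effectively what one thread does to another in the race described above.

```python
def iterate_and_assign():
    """Iterate a class __dict__ while adding a new attribute to that same class.

    Returns the RuntimeError message if the iteration fails, else None.
    """

    class Demo:
        """Hypothetical stand-in for a generated model class."""

        first = 1
        second = 2

    try:
        for _name, _value in Demo.__dict__.items():
            # Adding a *new* key changes the size of the dict being iterated,
            # so the next step of the iteration raises RuntimeError.
            Demo._attr_to_rest_field = {}
    except RuntimeError as exc:
        return str(exc)
    return None
```

Calling `iterate_and_assign()` returns the message `dictionary changed size during iteration`. In the real race the assignment comes from a different thread, so whether the iteration fails depends on timing.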

That `RuntimeError` never surfaces, because `_deserialize_default` swallows **any** exception and returns the raw input object instead ([lines 744–754](https://github.com/Azure/azure-sdk-for-python/blob/azure-ai-vision-imageanalysis_1.0.0/sdk/vision/azure-ai-vision-imageanalysis/azure/ai/vision/imageanalysis/_model_base.py#L744-L754)):

```python
def _deserialize_default(deserializer, obj):
    if obj is None:
        return obj
    try:
        return _deserialize_with_callable(deserializer, obj)
    except Exception:
        pass
    return obj   # <- raw wire dict returned as the "deserialized" result
```

So instead of an error, the application receives corrupted results, which makes this very hard to diagnose in production: it only happens on the first few responses of a freshly started process, only under concurrency, and the symptom (an `AttributeError` like `'dict' object has no attribute 'words'`, possibly much later in unrelated application code) gives no hint of the cause.
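
To illustrate why the fallback hides the problem, here is a small sketch that does not use the SDK. `lenient_deserialize` has the same shape as the function quoted above, `racy_deserializer` simulates the internal failure, and `strict_deserialize` is a hypothetical variant that only falls back for an assumed set of "payload does not fit" errors. Which exceptions the real code should tolerate is an open question; the tuple below is an assumption for illustration only.

```python
import logging

logger = logging.getLogger(__name__)


def lenient_deserialize(deserializer, obj):
    """Same shape as the fallback quoted above: any Exception yields the raw input."""
    if obj is None:
        return obj
    try:
        return deserializer(obj)
    except Exception:
        pass
    return obj


def strict_deserialize(deserializer, obj):
    """Hypothetical variant: fall back only for an assumed set of expected errors."""
    if obj is None:
        return obj
    try:
        return deserializer(obj)
    except (TypeError, ValueError, KeyError) as exc:  # assumed set, not taken from the SDK
        logger.warning("falling back to raw value: %r", exc)
        return obj


def racy_deserializer(obj):
    """Simulates the internal failure caused by the race."""
    raise RuntimeError("dictionary changed size during iteration")
```

With these definitions, `lenient_deserialize(racy_deserializer, {"a": 1})` quietly returns the input dict, while `strict_deserialize(racy_deserializer, {"a": 1})` lets the `RuntimeError` propagate, so the failure would be visible at the call site.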

**To Reproduce**

Steps to reproduce the behavior:

1. `pip install azure-ai-vision-imageanalysis==1.0.0 azure-ai-documentintelligence==1.0.2`
2. Save the script below as `azure_model_base_race_repro.py` (no Azure credentials/endpoint needed — it calls `_model_base._deserialize(Model, response_json)` directly, which is exactly what the generated operations code does with every HTTP response body).
3. `python azure_model_base_race_repro.py ia` (Image Analysis) or `python azure_model_base_race_repro.py di` (Document Intelligence).

Each trial wipes the `azure` modules from `sys.modules` so the lazy class state is uninitialized again, exactly like the first SDK response in a freshly started process, then deserializes a valid response payload from 32 threads concurrently.
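
The module-wipe step relies on Python creating brand-new class objects when a module is imported again after being removed from `sys.modules`. A minimal sketch of that mechanism, using a throwaway module written to a temporary directory (the module name `lazy_state_demo_mod` and the `Thing` class are hypothetical):

```python
import importlib
import sys
import tempfile
from pathlib import Path

MODULE_NAME = "lazy_state_demo_mod"  # hypothetical throwaway module
SOURCE = "class Thing:\n    initialized = False\n"


def reimport_resets_state():
    """Return (same_class_object, initialized_flag_after_reimport)."""
    with tempfile.TemporaryDirectory() as tmp:
        Path(tmp, f"{MODULE_NAME}.py").write_text(SOURCE)
        sys.path.insert(0, tmp)
        try:
            importlib.invalidate_caches()
            first = importlib.import_module(MODULE_NAME)
            first.Thing.initialized = True  # simulate lazily computed class state

            # Dropping the cached module forces the next import to re-execute it.
            del sys.modules[MODULE_NAME]
            second = importlib.import_module(MODULE_NAME)

            return first.Thing is second.Thing, second.Thing.initialized
        finally:
            sys.path.remove(tmp)
            sys.modules.pop(MODULE_NAME, None)
```

`reimport_resets_state()` returns `(False, False)`: the re-imported class is a different object and its class-level state is back to the initial value, which is why each trial in the repro starts from uninitialized model classes.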

Observed output (Python 3.14.0, Linux x86_64; reproduces in 19–20 of 20 trials):

```
$ python azure_model_base_race_repro.py ia
trial  1: 2/32 results corrupted (e.g. got dict instead of ImageAnalysisResult)
trial  2: 6/32 results corrupted (e.g. partially deserialized model: AttributeError: 'dict' object has no attribute 'lines')
trial  3: 1/32 results corrupted (e.g. partially deserialized model: AttributeError: 'dict' object has no attribute 'lines')
...
trial 20: 1/32 results corrupted (e.g. got dict instead of ImageAnalysisResult)

BUG REPRODUCED: corrupted results in 20/20 trials
```

```
$ python azure_model_base_race_repro.py di
...
trial 19: 3/32 results corrupted (e.g. got dict instead of AnalyzeResult)
trial 20: 3/32 results corrupted (e.g. partially deserialized model: AttributeError: 'dict' object has no attribute 'words')

BUG REPRODUCED: corrupted results in 19/20 trials
```

<details>
<summary><code>azure_model_base_race_repro.py</code></summary>

```python
"""Minimal reproduction: concurrent first-time model deserialization silently returns raw
dicts (or partially deserialized models) instead of model objects.

Affected:
  - azure-ai-vision-imageanalysis 1.0.0
  - azure-ai-documentintelligence 1.0.2

Usage:
    pip install azure-ai-vision-imageanalysis==1.0.0 azure-ai-documentintelligence==1.0.2
    python azure_model_base_race_repro.py ia     # Image Analysis
    python azure_model_base_race_repro.py di     # Document Intelligence

Runs multiple trials, where each trial wipes the azure modules from sys.modules so the
lazy class state is uninitialized again, exactly like the first SDK response in a freshly
started process.

Expected output (bug present): most trials report corrupted results, e.g.
    trial  1: 13/32 results corrupted (e.g. got dict instead of ImageAnalysisResult)

Expected output (bug fixed): "no corruption in 20 trials".

Likely root cause: the generated `_model_base.py` initializes model-class metadata lazily in
`Model.__new__`: it iterates the class `__dict__` (via `cls.__mro__`) to build
`_attr_to_rest_field` and then assigns that new attribute into the same `__dict__`,
without any locking. When the first model objects of a process are constructed
concurrently, this raises "RuntimeError: dictionary changed size during iteration" -
which `_deserialize_default` swallows with a broad `except Exception`, silently returning
the raw JSON wire dict (or a model whose nested fields are raw dicts) instead of failing.

`_deserialize(Model, response_json)` below is exactly what the generated operations code
calls on every HTTP response, so any application that issues its first SDK calls of a
process from a thread pool can receive corrupted results.
"""

import copy
import sys
import threading
import time

N_THREADS = 32
N_TRIALS = 20


# format of a successful READ response
IMAGE_ANALYSIS_RESPONSE = {
    "modelVersion": "2023-10-01",
    "metadata": {"width": 100, "height": 100},
    "readResult": {
        "blocks": [
            {
                "lines": [
                    {
                        "text": "hi",
                        "boundingPolygon": [
                            {"x": 0, "y": 0},
                            {"x": 1, "y": 0},
                            {"x": 1, "y": 1},
                            {"x": 0, "y": 1},
                        ],
                        "words": [
                            {
                                "text": "hi",
                                "boundingPolygon": [
                                    {"x": 0, "y": 0},
                                    {"x": 1, "y": 0},
                                    {"x": 1, "y": 1},
                                    {"x": 0, "y": 1},
                                ],
                                "confidence": 0.99,
                            }
                        ],
                    }
                ]
            }
        ]
    },
}


# format of a successful prebuilt-read AnalyzeResult
DOCUMENT_INTELLIGENCE_RESPONSE = {
    "apiVersion": "2024-11-30",
    "modelId": "prebuilt-read",
    "stringIndexType": "textElements",
    "content": "hi",
    "pages": [
        {
            "pageNumber": 1,
            "width": 100.0,
            "height": 100.0,
            "unit": "pixel",
            "words": [
                {
                    "content": "hi",
                    "polygon": [0, 0, 1, 0, 1, 1, 0, 1],
                    "confidence": 0.99,
                    "span": {"offset": 0, "length": 2},
                }
            ],
            "spans": [{"offset": 0, "length": 2}],
        }
    ],
}


def fresh_sdk(backend):
    """(Re-)import the SDK with all lazy model-class state uninitialized, like a fresh process."""
    for name in [m for m in sys.modules if m.startswith("azure")]:
        del sys.modules[name]

    if backend == "di":
        from azure.ai.documentintelligence import _model_base, models

        return (
            _model_base._deserialize,
            models.AnalyzeResult,
            DOCUMENT_INTELLIGENCE_RESPONSE,
        )
    else:
        from azure.ai.vision.imageanalysis import _model_base, models

        return (
            _model_base._deserialize,
            models.ImageAnalysisResult,
            IMAGE_ANALYSIS_RESPONSE,
        )


def describe_corruption(result, model_cls, backend):
    """Returns a description of how `result` is corrupted, or None if it deserialized correctly."""
    if not isinstance(result, model_cls):
        return f"got {type(result).__name__} instead of {model_cls.__name__}"
    try:
        if backend == "di":
            for page in result.pages:
                for word in page.words:
                    _ = word.content, word.confidence, word.polygon
        else:
            for block in result.read.blocks:
                for line in block.lines:
                    for word in line.words:
                        _ = word.text, word.confidence, word.bounding_polygon[0].x
    except Exception as e:  # nested field is a raw dict instead of a model
        return f"partially deserialized model: {type(e).__name__}: {e}"
    return None


def run_trial(backend):
    deserialize, model_cls, payload = fresh_sdk(backend)
    results = [None] * N_THREADS
    barrier = threading.Barrier(N_THREADS)
    go = False

    def work(i):
        """Each thread deserializes the same payload concurrently, like the first few HTTP responses of a process.
        Before doing so, all threads are synchronized to maximize concurrency of the critical section."""

        # independent input per thread, like real responses
        local_payload = copy.deepcopy(payload)
        barrier.wait()

        # spin so all threads are runnable and contend the moment they are released
        while not go:
            pass

        results[i] = deserialize(model_cls, local_payload)  # critical section

    threads = [threading.Thread(target=work, args=(i,)) for i in range(N_THREADS)]

    for t in threads:
        t.start()

    time.sleep(0.05)  # let every thread reach the spin loop

    go = True  # all threads should now be ready to contend the critical section

    # Wait for all threads to finish
    for t in threads:
        t.join()

    # Check for any corrupted results and return their descriptions
    return [c for r in results if (c := describe_corruption(r, model_cls, backend))]


def main():
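    # Avoid an IndexError when the script is started without an argument
    backend = sys.argv[1] if len(sys.argv) > 1 else None

    if backend not in ("ia", "di"):
        sys.exit(f"Usage: {sys.argv[0]} <ia|di>")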

    sys.setswitchinterval(1e-6)  # maximize thread switches to widen the race window

    corrupted_trials = 0

    for trial in range(1, N_TRIALS + 1):
        corruptions = run_trial(backend)

        if corruptions:
            corrupted_trials += 1
            print(
                f"trial {trial:2}: {len(corruptions)}/{N_THREADS} results corrupted (e.g. {corruptions[0]})"
            )
        else:
            print(f"trial {trial:2}: no corruption")

    if corrupted_trials:
        print(
            f"\nBUG REPRODUCED: corrupted results in {corrupted_trials}/{N_TRIALS} trials"
        )
        sys.exit(1)

    print(f"no corruption in {N_TRIALS} trials")


if __name__ == "__main__":
    main()
```

</details>

**Expected behavior**

`_deserialize(Model, response_json)` always returns a fully deserialized model object, regardless of how many threads are deserializing concurrently. In other words, the lazy initialization in `Model.__new__` should be thread-safe (e.g. via double-checked locking around the `_attr_to_rest_field` computation). Independently of the race itself, a failure inside deserialization should arguably not be silently swallowed by the broad `except Exception` in `_deserialize_default`, since returning the raw wire dict turns an internal error into silent data corruption.
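
As a rough illustration of the double-checked locking idea, here is a self-contained sketch that does not use the SDK. `LazyBase`, `Point`, `_fields` and `_compute_calls` are hypothetical names; only `_calculated` and `_calculated_lock` mirror the names used in the diff further down. It is a sketch of the pattern under those assumptions, not the generated code.

```python
import threading


class LazyBase:
    """Hypothetical base class that computes per-class metadata on first construction."""

    _calculated = set()                  # qualified names of classes already initialized
    _calculated_lock = threading.Lock()  # serializes the first-time computation
    _compute_calls = 0                   # instrumentation for this demo only

    def __new__(cls, *args, **kwargs):
        key = f"{cls.__module__}.{cls.__qualname__}"
        if key not in cls._calculated:          # fast path, no lock once initialized
            with cls._calculated_lock:
                if key not in cls._calculated:  # re-check under the lock
                    # The iteration finishes before the class attribute is assigned,
                    # and no other thread can assign while the lock is held.
                    fields = {k: v for k, v in vars(cls).items() if not k.startswith("_")}
                    cls._fields = fields
                    LazyBase._compute_calls += 1
                    cls._calculated.add(key)
        return super().__new__(cls)


class Point(LazyBase):
    x = 0
    y = 0


def construct_concurrently(n_threads=16):
    """Construct Point from many threads at once; return (compute count, fields)."""
    barrier = threading.Barrier(n_threads)

    def work():
        barrier.wait()  # release all threads together
        Point()

    threads = [threading.Thread(target=work) for _ in range(n_threads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return LazyBase._compute_calls, Point._fields
```

`construct_concurrently()` returns `(1, {"x": 0, "y": 0})`: the metadata is computed exactly once, however the threads interleave. A single shared lock is the simplest choice here; a per-class lock would reduce contention during start-up at the cost of more bookkeeping.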

**Screenshots**

N/A (full console output included above).

**Additional context**

- Environment: Python 3.14.0, Linux x86_64, `azure-ai-vision-imageanalysis` 1.0.0, `azure-ai-documentintelligence` 1.0.2, `azure-core` 1.41.0. The repro sets `sys.setswitchinterval(1e-6)` to widen the race window, but the race also triggers with the default switch interval; the tight setting just makes the repro far more reliable (19 to 20 of 20 trials in the output above).
- The unguarded `__new__` is still present on current `main`, e.g. [`sdk/vision/azure-ai-vision-imageanalysis/.../_model_base.py#L510`](https://github.com/Azure/azure-sdk-for-python/blob/main/sdk/vision/azure-ai-vision-imageanalysis/azure/ai/vision/imageanalysis/_model_base.py#L510), so newly generated/released SDKs are affected as well. Since `_model_base.py` is emitted by typespec-python, the proper fix is presumably in the emitter template, with regeneration of affected packages.
- A straightforward fix that makes the repro pass: guard the metadata computation in `Model.__new__` with a class-level lock and an "already calculated" check (double-checked locking), so the `__dict__` iteration can never race with the `_attr_to_rest_field` assignment. I'm happy to submit a PR.
- Real-world impact: any service that starts worker processes and immediately serves requests from a thread pool can return corrupted results for its first few requests after every process start/restart, with no error logged by the SDK.
- Since `_model_base.py` appears to be autogenerated, I have not opened a PR against the generated files. Locally, the following `git diff` fixes the issue. Please note that the fix was generated by Claude (it basically adds a lock around the critical section; it seems reasonable to me, but I am always careful when it comes to concurrency):

<details>
<summary><code>git diff</code></summary>

```diff
diff --git a/sdk/documentintelligence/azure-ai-documentintelligence/azure/ai/documentintelligence/_model_base.py b/sdk/documentintelligence/azure-ai-documentintelligence/azure/ai/documentintelligence/_model_base.py
index 7f73b97b23e..d999a521c9e 100644
--- a/sdk/documentintelligence/azure-ai-documentintelligence/azure/ai/documentintelligence/_model_base.py
+++ b/sdk/documentintelligence/azure-ai-documentintelligence/azure/ai/documentintelligence/_model_base.py
@@ -13,6 +13,7 @@ import decimal
 import functools
 import sys
 import logging
+import threading
 import base64
 import re
 import typing
@@ -495,6 +496,10 @@ class Model(_MyMutableMapping):
     # label whether current class's _attr_to_rest_field has been calculated
     # could not see _attr_to_rest_field directly because subclass inherits it from parent class
     _calculated: typing.Set[str] = set()
+    # serializes first-time calculation of _attr_to_rest_field: assigning it into a class
+    # __dict__ that a concurrent __new__ is iterating over raises "dictionary changed size
+    # during iteration", which deserialization fallbacks then swallow into corrupted results
+    _calculated_lock = threading.Lock()
 
     def __init__(self, *args: typing.Any, **kwargs: typing.Any) -> None:
         class_name = self.__class__.__name__
@@ -576,26 +581,31 @@ class Model(_MyMutableMapping):
 
     def __new__(cls, *args: typing.Any, **kwargs: typing.Any) -> Self:
         if f"{cls.__module__}.{cls.__qualname__}" not in cls._calculated:
-            # we know the last nine classes in mro are going to be 'Model', '_MyMutableMapping', 'MutableMapping',
-            # 'Mapping', 'Collection', 'Sized', 'Iterable', 'Container' and 'object'
-            mros = cls.__mro__[:-9][::-1]  # ignore parents, and reverse the mro order
-            attr_to_rest_field: typing.Dict[str, _RestField] = {  # map attribute name to rest_field property
-                k: v for mro_class in mros for k, v in mro_class.__dict__.items() if k[0] != "_" and hasattr(v, "_type")
-            }
-            annotations = {
-                k: v
-                for mro_class in mros
-                if hasattr(mro_class, "__annotations__")
-                for k, v in mro_class.__annotations__.items()
-            }
-            for attr, rf in attr_to_rest_field.items():
-                rf._module = cls.__module__
-                if not rf._type:
-                    rf._type = rf._get_deserialize_callable_from_annotation(annotations.get(attr, None))
-                if not rf._rest_name_input:
-                    rf._rest_name_input = attr
-            cls._attr_to_rest_field: typing.Dict[str, _RestField] = dict(attr_to_rest_field.items())
-            cls._calculated.add(f"{cls.__module__}.{cls.__qualname__}")
+            with cls._calculated_lock:
+                if f"{cls.__module__}.{cls.__qualname__}" not in cls._calculated:
+                    # we know the last nine classes in mro are going to be 'Model', '_MyMutableMapping',
+                    # 'MutableMapping', 'Mapping', 'Collection', 'Sized', 'Iterable', 'Container' and 'object'
+                    mros = cls.__mro__[:-9][::-1]  # ignore parents, and reverse the mro order
+                    attr_to_rest_field: typing.Dict[str, _RestField] = {  # map attribute name to rest_field property
+                        k: v
+                        for mro_class in mros
+                        for k, v in mro_class.__dict__.items()
+                        if k[0] != "_" and hasattr(v, "_type")
+                    }
+                    annotations = {
+                        k: v
+                        for mro_class in mros
+                        if hasattr(mro_class, "__annotations__")
+                        for k, v in mro_class.__annotations__.items()
+                    }
+                    for attr, rf in attr_to_rest_field.items():
+                        rf._module = cls.__module__
+                        if not rf._type:
+                            rf._type = rf._get_deserialize_callable_from_annotation(annotations.get(attr, None))
+                        if not rf._rest_name_input:
+                            rf._rest_name_input = attr
+                    cls._attr_to_rest_field: typing.Dict[str, _RestField] = dict(attr_to_rest_field.items())
+                    cls._calculated.add(f"{cls.__module__}.{cls.__qualname__}")
 
         return super().__new__(cls)  # pylint: disable=no-value-for-parameter
 
diff --git a/sdk/vision/azure-ai-vision-imageanalysis/azure/ai/vision/imageanalysis/_model_base.py b/sdk/vision/azure-ai-vision-imageanalysis/azure/ai/vision/imageanalysis/_model_base.py
index 43fd8c7e9b1..1842ec95ea1 100644
--- a/sdk/vision/azure-ai-vision-imageanalysis/azure/ai/vision/imageanalysis/_model_base.py
+++ b/sdk/vision/azure-ai-vision-imageanalysis/azure/ai/vision/imageanalysis/_model_base.py
@@ -11,6 +11,7 @@ import calendar
 import decimal
 import functools
 import sys
+import threading
 import logging
 import base64
 import re
@@ -476,6 +477,13 @@ def _create_value(rf: typing.Optional["_RestField"], value: typing.Any) -> typin
 
 class Model(_MyMutableMapping):
     _is_model = True
+    # label whether current class's _attr_to_rest_field has been calculated
+    # could not see _attr_to_rest_field directly because subclass inherits it from parent class
+    _calculated: typing.Set[str] = set()
+    # serializes first-time calculation of _attr_to_rest_field: assigning it into a class
+    # __dict__ that a concurrent __new__ is iterating over raises "dictionary changed size
+    # during iteration", which deserialization fallbacks then swallow into corrupted results
+    _calculated_lock = threading.Lock()
 
     def __init__(self, *args: typing.Any, **kwargs: typing.Any) -> None:
         class_name = self.__class__.__name__
@@ -508,24 +516,31 @@ class Model(_MyMutableMapping):
         return Model(self.__dict__)
 
     def __new__(cls, *args: typing.Any, **kwargs: typing.Any) -> Self:  # pylint: disable=unused-argument
-        # we know the last three classes in mro are going to be 'Model', 'dict', and 'object'
-        mros = cls.__mro__[:-3][::-1]  # ignore model, dict, and object parents, and reverse the mro order
-        attr_to_rest_field: typing.Dict[str, _RestField] = {  # map attribute name to rest_field property
-            k: v for mro_class in mros for k, v in mro_class.__dict__.items() if k[0] != "_" and hasattr(v, "_type")
-        }
-        annotations = {
-            k: v
-            for mro_class in mros
-            if hasattr(mro_class, "__annotations__")  # pylint: disable=no-member
-            for k, v in mro_class.__annotations__.items()  # pylint: disable=no-member
-        }
-        for attr, rf in attr_to_rest_field.items():
-            rf._module = cls.__module__
-            if not rf._type:
-                rf._type = rf._get_deserialize_callable_from_annotation(annotations.get(attr, None))
-            if not rf._rest_name_input:
-                rf._rest_name_input = attr
-        cls._attr_to_rest_field: typing.Dict[str, _RestField] = dict(attr_to_rest_field.items())
+        if f"{cls.__module__}.{cls.__qualname__}" not in cls._calculated:
+            with cls._calculated_lock:
+                if f"{cls.__module__}.{cls.__qualname__}" not in cls._calculated:
+                    # we know the last three classes in mro are going to be 'Model', 'dict', and 'object'
+                    mros = cls.__mro__[:-3][::-1]  # ignore model, dict, and object parents, and reverse the mro order
+                    attr_to_rest_field: typing.Dict[str, _RestField] = {  # map attribute name to rest_field property
+                        k: v
+                        for mro_class in mros
+                        for k, v in mro_class.__dict__.items()
+                        if k[0] != "_" and hasattr(v, "_type")
+                    }
+                    annotations = {
+                        k: v
+                        for mro_class in mros
+                        if hasattr(mro_class, "__annotations__")  # pylint: disable=no-member
+                        for k, v in mro_class.__annotations__.items()  # pylint: disable=no-member
+                    }
+                    for attr, rf in attr_to_rest_field.items():
+                        rf._module = cls.__module__
+                        if not rf._type:
+                            rf._type = rf._get_deserialize_callable_from_annotation(annotations.get(attr, None))
+                        if not rf._rest_name_input:
+                            rf._rest_name_input = attr
+                    cls._attr_to_rest_field: typing.Dict[str, _RestField] = dict(attr_to_rest_field.items())
+                    cls._calculated.add(f"{cls.__module__}.{cls.__qualname__}")
 
         return super().__new__(cls)  # pylint: disable=no-value-for-parameter
```
</details>

Thread-unsafe lazy init in Model.__new__ causes silent deserialization corruption under concurrency (azure-ai-documentintelligence and azure-ai-vision-imageanalysis) #47426
