Skip to content

Make FakeValuesService expression caches static to share across Faker instances#1819

Open
mferretti wants to merge 5 commits into
datafaker-net:mainfrom
mferretti:feature/1814
Open

Make FakeValuesService expression caches static to share across Faker instances#1819
mferretti wants to merge 5 commits into
datafaker-net:mainfrom
mferretti:feature/1814

Conversation

@mferretti
Copy link
Copy Markdown

@mferretti mferretti commented May 18, 2026

Problem

Every new Faker() starts with empty expression-resolution and regex-compilation caches. In workloads that create many short-lived Fakers — property-based testing frameworks (jqwik, QuickCheck), seeded-per-record data generation — the warm-up cost is paid repeatedly, even when Fakers share the same locale.

Why "just make the map static" needed structural work

The existing REGEXP2SUPPLIER_MAP used RegExpContext(String exp, BaseFaker root, FakerContext context) as its cache key. Making it static would re-introduce the memory leak fixed in #1263 / PR #1271: the root field holds a strong reference to the Faker instance, so every new Faker() adds a unique entry the GC can never collect.

Fixing this requires separating what method to call (shareable, context-free) from which provider instance to call it on (per-Faker, bound at first use).

What changed

Two-level expression cache:

  • L1 RECIPE_MAP (static) — keyed by CacheKey(String exp, SingletonLocale locale). No per-Faker references; safe to share globally. Stores context-free "recipe" resolvers.
  • L2 instanceMap (per-instance) — stores "materialized" resolvers pre-bound to this Faker's concrete provider instances. Subsequent calls on the same Faker skip getProvider() entirely.

ValueResolver gains materialize(ProviderRegistration root) and cacheable() to support the two-level contract. New resolver types (ProviderMethodResolver, ChainedCoercedResolver, InstanceMethodResolver, etc.) are context-free at L1 and pre-bound at L2.

expression2generex made static — the RgxGen compiled-regex cache was per-instance. This is the primary performance win: every new Faker recompiled the same patterns; now compilation happens once globally.

No public API changes

FakeValuesService remains internal. Faker constructors are unchanged. No new public methods.

Tests

SharedFakeValuesServiceTest covers:

  • 16 threads × 10k iterations of concurrent Faker creation — no errors, no races
  • 4 locales × 4 threads concurrent — no errors
  • Same seed → same output regardless of static cache state

@what-the-diff
Copy link
Copy Markdown

what-the-diff Bot commented May 18, 2026

PR Summary

  • Introduction of a new feature in Faker.java: A new method has been added which allows the configuration of shared instances. This optimizes the system when operated in multi-user scenarios, enhancing performance and efficiency.

  • Updates to FakeValuesService.java: A shared information storage, akin to a vault, has been included to hold instances of value-creating services, in a safe and secure manner across multiple users.

  • Added a retrieval feature in FakeValuesService.java: A new method is introduced to get the shared instance of service, if it exists, or create a new one. This ensures smooth flow and operation, reducing delays due to repetitive creation of services.

  • Inclusion of a new testing method: Ensuring the stability and correctness of the shared instance when being accessed by multiple users. This includes:

    • Verifying the same service instance is given to different users.
    • Asserting smooth operation when multiple users are using the service.
    • Checking that the results of shared service instances are consistent with individual instances.

Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Adds an opt-in way to share a single FakeValuesService across threads to avoid repeated YAML loading, plus a Faker factory that pairs the shared service with a per-thread Random. New concurrency tests exercise the singleton identity, behavior under load, and parity with a normally constructed Faker.

Changes:

  • New FakeValuesService.getShared(Locale) lazily caches per-locale instances in a ConcurrentHashMap.
  • New Faker.withSharedService(FakeValuesService, Locale, Random) factory wires a shared service with a per-thread RandomService.
  • New SharedFakeValuesServiceTest covering singleton identity, concurrent usage, and output parity.

Reviewed changes

Copilot reviewed 3 out of 3 changed files in this pull request and generated 2 comments.

File Description
src/main/java/net/datafaker/service/FakeValuesService.java Adds SHARED_INSTANCES map and getShared(Locale) factory.
src/main/java/net/datafaker/Faker.java Adds withSharedService static factory for thread-local Fakers reusing a shared service.
src/test/java/net/datafaker/SharedFakeValuesServiceTest.java New tests for concurrent identity, no-errors under load, and output parity.

Comment thread src/main/java/net/datafaker/service/FakeValuesService.java Outdated
Comment thread src/main/java/net/datafaker/service/FakeValuesService.java Outdated
@codecov-commenter
Copy link
Copy Markdown

codecov-commenter commented May 20, 2026

⚠️ Please install the 'codecov app svg image' to ensure uploads and comments are reliably processed by Codecov.

Codecov Report

❌ Patch coverage is 73.86364% with 23 lines in your changes missing coverage. Please review.
✅ Project coverage is 92.22%. Comparing base (148736b) to head (93bdd34).
⚠️ Report is 11 commits behind head on main.

Files with missing lines Patch % Lines
.../java/net/datafaker/service/FakeValuesService.java 73.86% 8 Missing and 15 partials ⚠️
❗ Your organization needs to install the Codecov GitHub app to enable full functionality.
Additional details and impacted files
@@             Coverage Diff              @@
##               main    #1819      +/-   ##
============================================
- Coverage     92.43%   92.22%   -0.21%     
- Complexity     3512     3519       +7     
============================================
  Files           343      343              
  Lines          6962     7009      +47     
  Branches        684      701      +17     
============================================
+ Hits           6435     6464      +29     
- Misses          360      369       +9     
- Partials        167      176       +9     

☔ View full report in Codecov by Harness.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.
  • 📦 JS Bundle Analysis: Save yourself from yourself by tracking and limiting bundle sizes in JS merges.

@asolntsev
Copy link
Copy Markdown
Collaborator

@mferretti Please don't take my words as critics, I am just researching the problem.

I feel that this solution heavily depends on the internal structure of DF. Ideally, the end-user should use only new Faker(), and shouldn't even know any its internals like FakeValuesService etc.

The bottleneck is redundant YAML loading, right? If we want to avoid redundant YAML loading, then shouldn't we just extract the YML loading to some smaller class, and cache its results?

@mferretti
Copy link
Copy Markdown
Author

@asolntsev , none taken :)
It was food for thought and made me go back at the code.

YAML loading is not the bottleneck — it's already cached globally in FakeValues.of(). The real issue is regex compilation: FakeValuesService has a REGEXP2SUPPLIER_MAP that is a per-instance field, so every new Faker() starts with an empty regex compilation cache.

You're right that exposing FakeValuesService is leaky. The cleaner fix is simply making REGEXP2SUPPLIER_MAP static (like EXPRESSION_2_SPLITTED already is) — users just call new Faker() and get shared regex caching for free, no API changes needed.

Would that make sense? If so, I can revise the PR along those lines :-)

@asolntsev
Copy link
Copy Markdown
Collaborator

@mferretti Yes, it seems absolutely reasonable to make REGEXP2SUPPLIER_MAP static. Seems it should solve the initial problem?

UPD I realized that it was previously static, and I made in non-static in commit 751ab16 :) That's an irony! :)

The reason was memory leak. Static map REGEXP2SUPPLIER_MAP was growing endlessly, collecting indefinite number of unique regexps.

Maybe we should investigate #1263 once again and find a better solution for it. Or just use some Cache-like map for REGEXP2SUPPLIER_MAP (which removes unused records after some time).

@mferretti
Copy link
Copy Markdown
Author

@asolntsev thanks for pointing this out. I'll have a look at the bug that made you opt for non static and try to put everything together. it'll have to wait a couple of days as i am heading out and will be back tomorrow late evening/night.

@kingthorin
Copy link
Copy Markdown
Collaborator

No rush, enjoy your weekend!!

@mferretti
Copy link
Copy Markdown
Author

Hello @asolntsev and @kingthorin
here's my understanding/findings (I am not going to question the why of the changes or the architecture/development, just exposing what I saw and understood)

  • Every new Faker() starts with an empty REGEXP2SUPPLIER_MAP — the cache that records which method resolves each YAML expression (e.g. #{Name.firstName} → Name.firstName()). In high-volume scenarios where many Faker instances are created, method-resolution work is repeated from scratch for every instance, causing slowdown.
  • REGEXP2SUPPLIER_MAP was previously static but was made per-instance in PR Fix memory leak #1271 to fix a memory leak (HashMap grew without bound as every Faker produced unique cache keys and entries never evicted).

What I propose, but i'd need some validation from you guys, is :

  1. expression2generex becomes static (does not help yaml resolution but helps regexify)
  2. add a static map and implement a "context free" ValueResolver

On point 2, resolve is private so there shouldn't be any breaking changes
On point 1: the main caveat here is the assumption that the provider is registred with the same name across all the Faker's root; probably an edge case but custom providers could be registering via non standard names thus i'd add a test case.

Obviously, I can recycle this pr or close this one and open a new one.

Looking forward for your input !

@mferretti mferretti marked this pull request as draft May 28, 2026 13:09
@asolntsev
Copy link
Copy Markdown
Collaborator

Hi @mferretti !
Yes, it seems like a totally valid plan:

  • expression2generex becomes static (does not help yaml resolution but helps regexify)
  • add a static map and implement a "context free" ValueResolver

Replaces the per-instance REGEXP2SUPPLIER_MAP with a two-level static+instance design:

- L1 RECIPE_MAP (static): keyed by CacheKey(String exp, SingletonLocale locale) — no per-Faker references. Stores context-free "recipe" resolvers shared across all Fakers with the same locale. Fixes the memory leak that blocked making this cache static (issue datafaker-net#1263): the old RegExpContext key held a strong reference to the BaseFaker root, keeping every Faker alive indefinitely.
- L2 instanceMap (per-instance): stores resolvers pre-bound to this Faker's concrete provider instances for fast repeated calls within the same Faker.

ValueResolver gains materialize(ProviderRegistration) and cacheable() to support the two-level contract. New resolver types (ProviderMethodResolver, ChainedCoercedResolver, InstanceMethodResolver, etc.) are context-free at L1 and pre-bound at L2.

expression2generex (RgxGen compiled-regex cache) is also made static

Adds SharedFakeValuesServiceTest covering concurrent multi-Faker usage and determinism under caching.
@mferretti
Copy link
Copy Markdown
Author

Hi.
sorry for the delay... life happens.
I had to revert what i done in the first 2 commits as we are changing the view radically on what needs to be done.

The original PR added a public FakeValuesService.getShared(Locale) API and Faker.withSharedService() to let callers share a single FakeValuesService across multiple Faker instances, reducing the cost of repeated cache warm-up. @asolntsev correctly pointed out that exposing FakeValuesService publicly is too leaky. His suggestion: make the internal map static instead.

This revised PR implements exactly that — but with the additional structural work required to do it correctly (naively making the map static brings back the memory leak from issue #1263 / PR #1271).
The fix requires two steps:

  1. Change the key to CacheKey(String exp, SingletonLocale locale) — no per-Faker references.
  2. Separate which method to call (shareable) from which provider instance to call it on (per-Faker). The ValueResolver implementations previously closed over per-Faker provider instances, so they could not be shared across Fakers.

Also, I added an L2 cache:

L1 — static RECIPE_MAP: Map<CacheKey, ValueResolver>
Stores context-free "recipes": which provider class and which method to invoke. Shared across all Fakers with the same locale. No per-Faker references → no memory leak.

L2 — per-instance instanceMap: Map<String, ValueResolver>
Stores "materialized" resolvers: the recipe pre-bound to this Faker's concrete provider object. Fast repeated calls on the same Faker skip the getProvider() reflective call entirely.

Resolution order: L2 hit → L1 hit (materialize + store L2) → full discovery (store L1 + materialize L2).

The benchmarks tests i did locally:

Scenario main This PR Delta
10k new Fakers × vehicle().vin() (heavy regex) 75 ms 7 ms 10× faster
10k new Fakers × finance().bic() (medium regex) 29 ms 8 ms 3.6× faster
Single Faker × 10k vehicle().vin() 6 ms 5 ms unchanged
1000 new Fakers × 3 calls (name/address/internet) 10,275 µs 7,940 µs 23% faster
10k single-shot Fakers × 1 call 25,222 µs 19,112 µs 24% faster
Single Faker × 1000 calls 5,024 µs 4,278 µs unchanged

In all honesty, I would suggest that you run your tests too as the performance gain, on my side, is clear with complex regex and a property based test scenario, but non existent with simple regex; plus I have added the L2 cache complexity.

@mferretti mferretti marked this pull request as ready for review June 4, 2026 14:39
Comment thread src/main/java/net/datafaker/service/FakeValuesService.java Outdated
Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 2 out of 2 changed files in this pull request and generated 9 comments.

Comment thread src/main/java/net/datafaker/service/FakeValuesService.java
Comment thread src/main/java/net/datafaker/service/FakeValuesService.java Outdated
Comment thread src/main/java/net/datafaker/service/FakeValuesService.java
Comment thread src/main/java/net/datafaker/service/FakeValuesService.java Outdated
Comment thread src/test/java/net/datafaker/SharedFakeValuesServiceTest.java
Comment thread src/test/java/net/datafaker/SharedFakeValuesServiceTest.java Outdated
Comment thread src/test/java/net/datafaker/SharedFakeValuesServiceTest.java Outdated
Comment thread src/main/java/net/datafaker/service/FakeValuesService.java Outdated
Comment thread src/main/java/net/datafaker/service/FakeValuesService.java
@mferretti mferretti changed the title Add shared FakeValuesService for multi-threaded Faker instances Make FakeValuesService expression caches static to share across Faker instances Jun 4, 2026
mferretti and others added 4 commits June 4, 2026 17:23
…owth note

- SafeFetchResolver.resolve(): guard against null root
- resolveFromMethodOn(): guard null root before NamedProviderCoercedResolver branch
- resolveExpression inner: guard null root before dotIndex provider lookup
- RECIPE_MAP: expand Javadoc noting bounded-in-practice growth
- accessor(): fix dead-code `methods == null` check → `methods.isEmpty()`
- accessor(): fix two "Didn't accessor" log typos → "Didn't find accessor"
- SharedFakeValuesServiceTest: call shutdownNow() on awaitTermination timeout

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
- ProviderMethodResolver.resolve(): return null when root is null
- ProviderMethodResolver.materialize(): return this when root is null
- resolveExpression outer loop: skip L2 caching when root is null,
  call recipe directly without materializing

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
RootCoercedResolver, NamedProviderCoercedResolver, and ChainedCoercedResolver
all called root.getProvider() / fakerAccessor.invoke(root) without guarding
against null root in both resolve() and materialize(). Pattern matches the
existing guards on SafeFetchResolver and ProviderMethodResolver.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants