GH-418: Reduce arrow-vector dependencies: drop jackson-* (other than jackson-core) and commons-codec #1181
Draft
JonathanGiles wants to merge 1 commit into
… jsr310 and commons-codec

Replace jackson-databind/annotations/jackson-datatype-jsr310 usage in arrow-vector with the jackson-core streaming API (JsonGenerator/JsonParser) plus small hand-written helpers (JsonValues, JsonStringSerializer), and replace commons-codec hex handling with java.util.HexFormat.

- Schema/Field/DictionaryEncoding/ArrowType JSON (de)serialization now uses jackson-core streaming instead of ObjectMapper + jackson annotations.
- OpaqueType and the IPC JsonFileReader/JsonFileWriter migrated to streaming.
- JsonStringHashMap/JsonStringArrayList keep their JSON toString() via JsonStringSerializer; explicit serialVersionUID values are pinned to preserve Java deserialization compatibility, and JavaTimeModule's exact numeric output for java.time values is reproduced for byte-for-byte toString() compatibility.
- Removed the public ObjectMapperFactory utility (callers updated) and the Text jackson @JsonSerialize annotation.

This contains breaking changes (removed public ObjectMapperFactory and Text.TextSerializer, removed jackson annotation/databind interop, dropped transitive databind/annotations/jsr310/commons-codec dependencies, and module-info requires changes).

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
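For readers unfamiliar with `java.util.HexFormat` (JDK 17+), the commons-codec replacement described in the commit message can be sketched roughly as below. This is a minimal illustration, not the code from this PR: the class name `HexFormatSketch` and the `encode`/`decode` helpers are hypothetical, and only the `HexFormat` calls themselves are real JDK API.

```java
import java.util.Arrays;
import java.util.HexFormat;

// Hypothetical helper class; not part of arrow-vector.
class HexFormatSketch {
    // HexFormat.of() formats as lower-case hex with no delimiter, prefix, or suffix.
    private static final HexFormat HEX = HexFormat.of();

    /** Encodes bytes as a hex string, two characters per byte. */
    static String encode(byte[] bytes) {
        return HEX.formatHex(bytes);
    }

    /** Decodes a hex string back to bytes; throws IllegalArgumentException on malformed input. */
    static byte[] decode(String hex) {
        return HEX.parseHex(hex);
    }

    public static void main(String[] args) {
        byte[] data = {0x00, 0x1f, (byte) 0xab, (byte) 0xff};
        String hex = encode(data);
        System.out.println(hex);                              // 001fabff
        System.out.println(Arrays.equals(data, decode(hex))); // true
    }
}
```

Because `HexFormat` ships with the JDK, a conversion like this needs no third-party dependency at all.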
Author
|
I don’t appear to have permission to add labels on this fork-based PR. Could a maintainer please add the |
While evaluating this API, I was concerned by how heavy its dependencies are, so I looked into the feasibility of removing them. This PR is not meant to be a final, complete solution, but a starting point for a discussion about the appetite for reducing the dependency footprint, to make the library more palatable for developers building libraries (as I am with the Azure SDKs for Java).
AI Disclosure: I built this on my machine with the help of coding agents (Claude Opus 4.8) using the GitHub Copilot CLI tooling.
What's Changed
Removes several heavy dependencies from `arrow-vector` by migrating its JSON handling to the lightweight `jackson-core` streaming API and the JDK's built-in `HexFormat`.

Dependencies dropped from `arrow-vector`:
- `com.fasterxml.jackson.core:jackson-databind`
- `com.fasterxml.jackson.core:jackson-annotations`
- `com.fasterxml.jackson.datatype:jackson-datatype-jsr310`
- `commons-codec:commons-codec`

`jackson-core` is retained (used for streaming JSON read/write).

How:
- `Schema`, `Field`, `DictionaryEncoding`, and `ArrowType` (+ generated subtypes) JSON (de)serialization rewritten from `ObjectMapper` + jackson annotations (`@JsonCreator`/`@JsonProperty`/`@JsonTypeInfo`) to `jackson-core` streaming (`JsonGenerator`/`JsonParser`), with two small internal helpers: `JsonValues` (parser→tree + typed extractors) and `JsonStringSerializer` (compact `toString()` JSON).
- `extension/OpaqueType` and the IPC `JsonFileReader`/`JsonFileWriter` migrated to the streaming API.
- commons-codec `Hex` usage moved to `java.util.HexFormat`.
- `Text` no longer carries a jackson `@JsonSerialize` annotation (and its inner `TextSerializer` is removed).

Compatibility preserved deliberately:
- `JsonStringHashMap`/`JsonStringArrayList` had a `static ObjectMapper` field removed, which would have changed their implicitly computed `serialVersionUID` and broken Java deserialization of objects written by older Arrow versions (e.g. blobs stored by H2, objects serialized by Spark). The original `serialVersionUID` values are now pinned explicitly to retain wire compatibility.
- `toString()` can contain `java.time` values (from temporal vectors' `getObject()`). The previous output used `JavaTimeModule`'s numeric form (e.g. `LocalDateTime` → `[2021,1,2,3,4,5]`, `Duration` → `90.000000000`). That exact output is reproduced in `JsonStringSerializer`, so `toString()` is byte-for-byte unchanged. A clearly marked code block documents how to revert to native ISO-8601 output if desired.

This contains breaking changes.
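To make the legacy numeric form concrete, here is a rough JDK-only sketch that reproduces the two examples quoted above. It is not the `JsonStringSerializer` code from this PR: the class name `LegacyTimeFormatSketch` and its `format` overloads are hypothetical, and the sketch covers only whole-second `LocalDateTime` values and non-negative `Duration` values. The rules `JavaTimeModule` applies to sub-second fields and other edge cases are deliberately left out.

```java
import java.time.Duration;
import java.time.LocalDateTime;
import java.util.Locale;

// Hypothetical illustration; not part of arrow-vector.
class LegacyTimeFormatSketch {

    /**
     * Formats a LocalDateTime as an array of its fields, e.g. [2021,1,2,3,4,5].
     * Assumption: the nanosecond field is ignored in this sketch.
     */
    static String format(LocalDateTime t) {
        return "[" + t.getYear() + "," + t.getMonthValue() + "," + t.getDayOfMonth()
                + "," + t.getHour() + "," + t.getMinute() + "," + t.getSecond() + "]";
    }

    /**
     * Formats a non-negative Duration as seconds followed by nine fractional
     * digits, e.g. 90.000000000. Negative durations are out of scope here.
     */
    static String format(Duration d) {
        if (d.isNegative()) {
            throw new IllegalArgumentException("negative durations are not handled by this sketch");
        }
        return String.format(Locale.ROOT, "%d.%09d", d.getSeconds(), d.getNano());
    }

    public static void main(String[] args) {
        System.out.println(format(LocalDateTime.of(2021, 1, 2, 3, 4, 5))); // [2021,1,2,3,4,5]
        System.out.println(format(Duration.ofSeconds(90)));                // 90.000000000
    }
}
```

For comparison, `LocalDateTime.toString()` and `Duration.toString()` on the same values yield the ISO-8601 forms `2021-01-02T03:04:05` and `PT1M30S`, which is the native output the description says could be restored.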
Breaking changes
- `org.apache.arrow.vector.util.ObjectMapperFactory`: removed. Use your own `ObjectMapper` (add `jackson-databind` directly). It only configured `JavaTimeModule`; supply that module if you need `java.time` support.
- `Text.TextSerializer` (and `Text`'s `@JsonSerialize`): removed. To serialize `Text` via an external jackson `ObjectMapper`, register a custom serializer.
- `Schema`/`Field`/`DictionaryEncoding`/`ArrowType`: an external `ObjectMapper` relying on Arrow's annotations will no longer work; use `Schema.fromJSON(String)`/`Schema.toJson()` (public API, unchanged signatures) instead.
- Transitive dependencies: `jackson-databind`, `jackson-annotations`, `jackson-datatype-jsr310`, and `commons-codec` are no longer pulled in transitively via `arrow-vector`. Downstreams that relied on this transitively must declare them directly.
- `module-info` `requires` reduced: `arrow-vector`'s module no longer `requires` `com.fasterxml.jackson.databind`, `…annotation`, `…jsr310`, or `org.apache.commons.codec`.

Non-breaking behavioral notes
- `toString()` of complex vectors containing `java.time` values is unchanged (legacy numeric format reproduced); `serialVersionUID` of `JsonStringHashMap`/`JsonStringArrayList` is preserved.
- `JsonFileWriter` emits raw epoch numbers as before.

Testing
- `arrow-vector`: full suite, 1131 tests pass.
- `arrow-tools`: 13 pass (exercises the Arrow↔JSON file round-trip).
- `adapter/jdbc`: 152 pass (exercises `JsonStringHashMap` serialization + H2 blob deserialization).
- `spotless:check` + `checkstyle:check` clean.

Closes #418.
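As background for the Java-serialization compatibility that the `adapter/jdbc` tests exercise, the following JDK-only sketch shows what pinning `serialVersionUID` looks like and how to inspect the value the serialization runtime will use. Everything here is illustrative: `PinnedSerialVersionSketch`, `PinnedMap`, and the value `1L` are hypothetical and are not the classes or values used in this PR.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.ObjectStreamClass;
import java.util.LinkedHashMap;

// Hypothetical illustration; not part of arrow-vector.
class PinnedSerialVersionSketch {

    /** Stand-in for a serializable map subclass whose stream identity must stay stable. */
    static class PinnedMap<K, V> extends LinkedHashMap<K, V> {
        // Declared explicitly, so later changes to fields or static initializers
        // do not alter the UID written to (and expected from) the stream.
        // The value 1L is illustrative only.
        private static final long serialVersionUID = 1L;
    }

    /** Returns the serialVersionUID the serialization runtime associates with a class. */
    static long streamUid(Class<?> type) {
        return ObjectStreamClass.lookup(type).getSerialVersionUID();
    }

    /** Serializes an object to bytes and reads it back. */
    static Object roundTrip(Object value) {
        try {
            ByteArrayOutputStream bytes = new ByteArrayOutputStream();
            try (ObjectOutputStream out = new ObjectOutputStream(bytes)) {
                out.writeObject(value);
            }
            try (ObjectInputStream in =
                    new ObjectInputStream(new ByteArrayInputStream(bytes.toByteArray()))) {
                return in.readObject();
            }
        } catch (IOException | ClassNotFoundException e) {
            throw new IllegalStateException("round trip failed", e);
        }
    }

    public static void main(String[] args) {
        PinnedMap<String, Integer> map = new PinnedMap<>();
        map.put("a", 1);
        System.out.println(streamUid(PinnedMap.class)); // 1
        System.out.println(map.equals(roundTrip(map))); // true
    }
}
```

When no explicit value is declared, the runtime derives one from the class's shape, so a check like `streamUid` run before and after a refactoring is one way to confirm the UID did not drift.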