Summary
Add a new parse_array primitive that converts serialized array values (primarily CSV strings) into typed Python lists before downstream operations like reduce.
Motivation
Reduce currently expects a list/tuple input. In file-based workflows (especially CSV), array-like values arrive as strings (e.g., "[8,8,8,8,6]").
Today, users need custom pre-processing before harmonization. This is workable but not composable in rules and makes UI-driven pipelines harder to express. A dedicated parsing primitive keeps concerns explicit and reusable.
Problem Statement
- Current reduce behavior is intentionally strict (list/tuple only).
- CSV and similar formats encode arrays as strings.
- Users need a rule-level way to parse arrays without changing reduce semantics.
Proposal
Introduce a new primitive operation:
- operation: parse_array
- responsibility: parse scalar/list-like input into list[Any]
- intended chaining: parse_array -> reduce
Example chain:
{
  "source": "week_hours",
  "target": "total_hours",
  "operations": [
    {
      "operation": "parse_array",
      "format": "json",
      "item_type": "integer",
      "strict": true
    },
    {
      "operation": "reduce",
      "reduction": "sum"
    }
  ]
}
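The chain above can be sketched in Python as two plain functions applied in order. This is a minimal illustration under stated assumptions, not the framework's code: parse_array_json, reduce_sum, and apply_chain are hypothetical names, and only the JSON format with integer items is handled.

```python
import json
from typing import Any, Callable


def parse_array_json(value: Any) -> list[int]:
    """Hypothetical stand-in for parse_array (format=json, item_type=integer, strict=true)."""
    if isinstance(value, (list, tuple)):
        items = list(value)
    elif isinstance(value, str):
        items = json.loads(value)  # raises json.JSONDecodeError (a ValueError) on bad input
        if not isinstance(items, list):
            raise ValueError(f"expected a JSON array, got {value!r}")
    else:
        raise ValueError(f"cannot parse {value!r} as an array")
    return [int(item) for item in items]


def reduce_sum(values: Any) -> Any:
    """Hypothetical stand-in for reduce (reduction=sum); strict about its input type."""
    if not isinstance(values, (list, tuple)):
        raise TypeError("reduce expects a list or tuple")
    return sum(values)


def apply_chain(value: Any, operations: list[Callable[[Any], Any]]) -> Any:
    """Apply each operation to the output of the previous one."""
    for operation in operations:
        value = operation(value)
    return value


total_hours = apply_chain("[8,8,8,8,6]", [parse_array_json, reduce_sum])
print(total_hours)  # 38
```

Passing the raw string straight to reduce_sum raises TypeError, which is the strictness the proposal wants to preserve.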
Design Goals
- Keep primitive responsibilities explicit (no implicit parsing inside reduce).
- Make behavior deterministic and easy to reason about.
- Provide strict-by-default error handling for data quality.
- Support safe incremental extension for future parsing formats.
Non-Goals (V1)
- General object parsing.
- Deep schema validation for nested arrays.
- Silent “best effort” coercion by default.
V1 API / Serialization
Suggested serialized config:
{
  "operation": "parse_array",
  "format": "json",
  "item_type": "auto",
  "strict": true
}
Optional keys:
- default: value returned when strict=false and parse fails (default: null)
- allow_singleton: when true, scalar input can be wrapped as one-item list after coercion
Field semantics:
- format: parsing strategy (V1: json only)
- item_type: auto | string | integer | float | boolean
- strict: fail-fast behavior
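One way the item_type values could map to per-item coercion is sketched below. The helper names (COERCERS, coerce_items) are hypothetical, and the boolean handling is an assumption: it accepts only real booleans and the strings "true"/"false", since the accepted token vocabulary is still an open question in this proposal.

```python
from typing import Any, Callable


def _to_integer(item: Any) -> int:
    # Reject booleans and fractional floats rather than silently truncating.
    if isinstance(item, bool):
        raise ValueError("boolean is not an integer")
    if isinstance(item, float) and not item.is_integer():
        raise ValueError("fractional value")
    return int(item)


def _to_float(item: Any) -> float:
    if isinstance(item, bool):
        raise ValueError("boolean is not a float")
    return float(item)


def _to_boolean(item: Any) -> bool:
    # Assumption: only booleans and the strings "true"/"false" are accepted.
    if isinstance(item, bool):
        return item
    if isinstance(item, str) and item.strip().lower() in ("true", "false"):
        return item.strip().lower() == "true"
    raise ValueError("not a recognized boolean token")


COERCERS: dict[str, Callable[[Any], Any]] = {
    "auto": lambda item: item,  # keep whatever type the parser produced
    "string": str,
    "integer": _to_integer,
    "float": _to_float,
    "boolean": _to_boolean,
}


def coerce_items(items: list[Any], item_type: str = "auto") -> list[Any]:
    """Coerce every item, failing with a ValueError that names the bad item."""
    if item_type not in COERCERS:
        raise ValueError(f"unknown item_type: {item_type!r}")
    coerce = COERCERS[item_type]
    result = []
    for index, item in enumerate(items):
        try:
            result.append(coerce(item))
        except (TypeError, ValueError) as exc:
            raise ValueError(
                f"item {index} ({item!r}) is not a valid {item_type}: {exc}"
            ) from exc
    return result


print(coerce_items([8, "8", 6.0], "integer"))  # [8, 8, 6]
print(coerce_items([1, 2.5], "string"))        # ['1', '2.5']
```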
Input/Output Contract
Input handling:
- list/tuple: return list (idempotent)
- str: parse according to format
- other scalar types:
  - strict mode: error
  - non-strict mode: return default
Output:
- Always list (or default in non-strict failure path)
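The contract above could look roughly like the following for format=json and item_type=auto. This is a sketch, not a definitive implementation. It makes two assumptions the proposal does not settle: allow_singleton wraps only non-string scalars, and a string holding a non-array JSON value is treated as a failure.

```python
import json
from typing import Any


def parse_array(value: Any, *, strict: bool = True, default: Any = None,
                allow_singleton: bool = False) -> Any:
    """Sketch of the V1 contract for format="json" and item_type="auto"."""
    try:
        if isinstance(value, (list, tuple)):
            return list(value)  # pass-through, normalized to list
        if isinstance(value, str):
            parsed = json.loads(value)  # JSONDecodeError subclasses ValueError
            if not isinstance(parsed, list):
                raise ValueError("JSON payload is not an array")
            return parsed
        if allow_singleton:
            return [value]  # wrap a non-string scalar as a one-item list
        raise ValueError(f"unsupported input type {type(value).__name__}")
    except ValueError as exc:
        if strict:
            raise ValueError(
                f"parse_array failed for {value!r} (format=json): {exc}"
            ) from exc
        return default  # non-strict failure path


print(parse_array("[8,8,8,8,6]"))             # [8, 8, 8, 8, 6]
print(parse_array((1, 2)))                    # [1, 2]
print(parse_array("8|8|6", strict=False))     # None
print(parse_array(42, allow_singleton=True))  # [42]
```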
Error Handling
- strict=true: raise a clear ValueError that includes the input value and the selected format.
- strict=false: return default and emit a warning/log message.
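The two failure modes could share a single helper, sketched below. The helper name handle_parse_failure and the logger name are hypothetical, and routing the warning through Python's standard logging module is an assumption, since the proposal leaves the warning channel open.

```python
import logging
from typing import Any

logger = logging.getLogger("parse_array")  # hypothetical logger name


def handle_parse_failure(value: Any, fmt: str, reason: str, *,
                         strict: bool = True, default: Any = None) -> Any:
    """Raise in strict mode; otherwise log a warning and return the default."""
    message = f"parse_array could not parse {value!r} with format={fmt!r}: {reason}"
    if strict:
        raise ValueError(message)
    logger.warning(message)
    return default


logging.basicConfig(level=logging.WARNING)
result = handle_parse_failure("8|8|6", "json", "invalid JSON",
                              strict=False, default=[])
print(result)  # []
```

Keeping the message construction in one place means the strict error and the non-strict warning report the same input value and format.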
Why Not Put This in Reduce
Embedding string parsing in reduce would:
- conflate reduction and deserialization concerns,
- increase ambiguity (string syntax variants, malformed inputs),
- create inconsistent behavior relative to other primitives.
A dedicated parser primitive preserves composability and predictability.
Extension Path
After V1 stabilizes, consider:
- format: delimiter for values like "8|8|8|8|6" with configurable delimiter.
- Optional strip_items behavior for string tokens.
- Optional python_literal mode only if justified (higher complexity/risk).
- Potential generic parsing family later (parse_value, parse_object) if needed.
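A delimiter format with strip_items might behave as in the sketch below. The function name parse_delimited is hypothetical, and treating a blank string as an empty list is an assumption rather than something this proposal specifies.

```python
def parse_delimited(value: str, delimiter: str = "|",
                    strip_items: bool = True) -> list[str]:
    """Split a delimited string into string tokens (no item coercion)."""
    if not isinstance(value, str):
        raise ValueError(f"expected a string, got {type(value).__name__}")
    if not delimiter:
        raise ValueError("delimiter must be a non-empty string")
    if value.strip() == "":
        return []  # assumption: blank input means an empty array
    tokens = value.split(delimiter)
    if strip_items:
        tokens = [token.strip() for token in tokens]
    return tokens


print(parse_delimited("8|8|8|8|6"))                               # ['8', '8', '8', '8', '6']
print(parse_delimited("a , b", delimiter=",", strip_items=False))  # ['a ', ' b']
```

The tokens stay strings here; converting them to integers or floats would be the job of item_type coercion.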
Suggested Implementation Areas
- New file: src/harmonization_framework/primitives/parse_array.py
- Register operation:
  - src/harmonization_framework/primitives/vocabulary.py
  - src/harmonization_framework/primitives/__init__.py
  - src/harmonization_framework/harmonization_rule.py (from_serialization dispatch)
Test Plan
Add tests for:
- JSON list parsing success.
- Idempotent pass-through for list/tuple input.
- Item coercion success/failure for each item_type.
- Strict failure behavior.
- Non-strict fallback behavior (default).
- Invalid JSON / non-array JSON payloads.
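Several of these cases can be written as plain assert-based test functions, which pytest would collect by their test_ prefix. The parse_array defined here is only a minimal stand-in so the tests run on their own. It has no item_type support, so the coercion cases are left out.

```python
import json
from typing import Any


def parse_array(value: Any, strict: bool = True, default: Any = None) -> Any:
    """Minimal stand-in so the tests below run; not the real primitive."""
    try:
        if isinstance(value, (list, tuple)):
            return list(value)
        parsed = json.loads(value) if isinstance(value, str) else None
        if not isinstance(parsed, list):
            raise ValueError(f"cannot parse {value!r} as a JSON array")
        return parsed
    except ValueError:
        if strict:
            raise
        return default


def raises_value_error(*args: Any, **kwargs: Any) -> bool:
    """Return True if parse_array raises ValueError for the given arguments."""
    try:
        parse_array(*args, **kwargs)
    except ValueError:
        return True
    return False


def test_json_list_parsing():
    assert parse_array("[8,8,8,8,6]") == [8, 8, 8, 8, 6]


def test_list_and_tuple_pass_through():
    assert parse_array([1, 2]) == [1, 2]
    assert parse_array((1, 2)) == [1, 2]


def test_strict_failure():
    assert raises_value_error("not json")
    assert raises_value_error(42)


def test_non_strict_fallback():
    assert parse_array("not json", strict=False, default=[]) == []


def test_non_array_json_payload():
    assert raises_value_error('{"a": 1}')
    assert raises_value_error("8")


# Run the tests directly when no test runner is used.
for test in (test_json_list_parsing, test_list_and_tuple_pass_through,
             test_strict_failure, test_non_strict_fallback,
             test_non_array_json_payload):
    test()
print("all tests passed")
```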
Open Questions
- Should parse_array accept tuples in output, or always normalize to list?
- For item_type=boolean, what token vocabulary should be accepted (true/false, 1/0, yes/no)?
- Should warning output route through existing logging infrastructure rather than print?