Add parse_array primitive for rule-level array deserialization #97

Description

@matthewhorridge

Summary

Add a new parse_array primitive that converts serialized array values (primarily strings read from CSV cells) into typed Python lists before downstream operations such as reduce.

Motivation

Reduce currently expects a list/tuple input. In file-based workflows (especially CSV), array-like values arrive as strings (e.g., "[8,8,8,8,6]").

Today, users need custom pre-processing before harmonization. This is workable but not composable in rules and makes UI-driven pipelines harder to express. A dedicated parsing primitive keeps concerns explicit and reusable.

Problem Statement

  • Current reduce behavior is intentionally strict (list/tuple only).
  • CSV and similar formats encode arrays as strings.
  • Users need a rule-level way to parse arrays without changing reduce semantics.

Proposal

Introduce a new primitive operation:

  • operation: parse_array
  • responsibility: parse scalar/list-like input into list[Any]
  • intended chaining: parse_array -> reduce

Example chain:

{
  "source": "week_hours",
  "target": "total_hours",
  "operations": [
    {
      "operation": "parse_array",
      "format": "json",
      "item_type": "integer",
      "strict": true
    },
    {
      "operation": "reduce",
      "reduction": "sum"
    }
  ]
}
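
As a rough illustration of what this chain is meant to compute, the sketch below applies the two steps by hand to a single CSV cell value. The function names parse_array_json and reduce_sum are hypothetical stand-ins, not the framework's actual primitives.

```python
import json


def parse_array_json(value: str) -> list:
    """Hypothetical stand-in for parse_array with format=json."""
    parsed = json.loads(value)
    if not isinstance(parsed, list):
        raise ValueError(f"expected a JSON array, got {type(parsed).__name__}")
    return parsed


def reduce_sum(values):
    """Hypothetical stand-in for reduce with reduction=sum (list/tuple only)."""
    if not isinstance(values, (list, tuple)):
        raise TypeError("reduce expects a list or tuple")
    return sum(values)


cell = "[8,8,8,8,6]"  # value of week_hours as read from a CSV file
total_hours = reduce_sum(parse_array_json(cell))
print(total_hours)  # 38
```

Passing the raw string straight to reduce_sum raises TypeError, which is the strict behavior the proposal wants to keep.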

Design Goals

  1. Keep primitive responsibilities explicit (no implicit parsing inside reduce).
  2. Make behavior deterministic and easy to reason about.
  3. Provide strict-by-default error handling for data quality.
  4. Support safe incremental extension for future parsing formats.

Non-Goals (V1)

  • General object parsing.
  • Deep schema validation for nested arrays.
  • Silent “best effort” coercion by default.

V1 API / Serialization

Suggested serialized config:

{
  "operation": "parse_array",
  "format": "json",
  "item_type": "auto",
  "strict": true
}

Optional keys:

  • default: value returned when strict=false and parse fails (default: null)
  • allow_singleton: when true, a scalar input is coerced to item_type and wrapped in a one-item list

Field semantics:

  • format: parsing strategy (V1: json only)
  • item_type: auto | string | integer | float | boolean
  • strict: fail-fast behavior
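
One way item_type could be applied to each parsed element is sketched below. The helper name coerce_item and the specific coercion rules are assumptions made for illustration. In particular, this sketch accepts only real booleans for item_type=boolean, because the accepted token vocabulary is still an open question in this issue.

```python
from typing import Any


def coerce_item(value: Any, item_type: str = "auto") -> Any:
    """Coerce one parsed element to item_type (hypothetical rules)."""
    if item_type == "auto":
        return value  # keep whatever type the parser produced
    if item_type == "string":
        return str(value)
    if item_type == "boolean":
        if isinstance(value, bool):
            return value
        raise ValueError(f"cannot coerce {value!r} to boolean")
    if item_type in ("integer", "float"):
        # bool is a subclass of int in Python, so reject it explicitly.
        if isinstance(value, bool):
            raise ValueError(f"cannot coerce boolean {value!r} to {item_type}")
        try:
            if item_type == "float":
                return float(value)
            if isinstance(value, float) and not value.is_integer():
                raise ValueError("value has a fractional part")
            return int(value)
        except (TypeError, ValueError) as exc:
            raise ValueError(f"cannot coerce {value!r} to {item_type}: {exc}") from exc
    raise ValueError(f"unknown item_type: {item_type!r}")


print([coerce_item(v, "integer") for v in [8, "8", 6.0]])  # [8, 8, 6]
```

Rejecting lossy conversions such as 6.5 to integer keeps the coercion consistent with the strict-by-default goal; a more permissive policy would be a deliberate design choice.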

Input/Output Contract

Input handling:

  • list/tuple: return list (idempotent)
  • str: parse according to format
  • other scalar types (unless allow_singleton wraps them):
    • strict mode: error
    • non-strict mode: return default

Output:

  • Always list (or default in non-strict failure path)
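
The contract above can be sketched as a single function. This is a minimal illustration under assumptions, not the proposed implementation: the signature is hypothetical, item_type coercion and allow_singleton are omitted, and an unsupported format is treated as a configuration error that always raises.

```python
import json
from typing import Any


def parse_array(value: Any, fmt: str = "json", strict: bool = True, default: Any = None):
    """Sketch of the V1 input/output contract (hypothetical signature)."""
    if fmt != "json":
        raise ValueError(f"unsupported format: {fmt!r}")

    # list/tuple: return a list (idempotent pass-through).
    if isinstance(value, (list, tuple)):
        return list(value)

    if isinstance(value, str):
        try:
            parsed = json.loads(value)
        except json.JSONDecodeError as exc:
            error = f"invalid JSON ({exc.msg})"
        else:
            if isinstance(parsed, list):
                return parsed
            error = f"JSON payload is {type(parsed).__name__}, not an array"
    else:
        error = f"unsupported input type {type(value).__name__}"

    # Failure path: raise in strict mode, otherwise fall back to default.
    if strict:
        raise ValueError(f"parse_array failed for {value!r} (format={fmt!r}): {error}")
    return default


print(parse_array("[8,8,8,8,6]"))       # [8, 8, 8, 8, 6]
print(parse_array((1, 2)))              # [1, 2]
print(parse_array(42, strict=False))    # None
```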

Error Handling

  • strict=true: raise a clear ValueError that includes the input value and the selected format.
  • strict=false: return default and emit a warning/log message.
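
The two failure modes could share one helper so that the error message and the warning carry the same information. The sketch below assumes Python's standard logging module for the warning; the helper name fail_or_default and the logger name are hypothetical, and the issue leaves open whether warnings should go through existing logging infrastructure.

```python
import logging
from typing import Any

logger = logging.getLogger("parse_array")  # hypothetical logger name


def fail_or_default(value: Any, fmt: str, reason: str,
                    strict: bool = True, default: Any = None) -> Any:
    """Raise in strict mode; otherwise log a warning and return default."""
    message = f"parse_array could not parse {value!r} using format {fmt!r}: {reason}"
    if strict:
        raise ValueError(message)
    logger.warning(message)
    return default


# Non-strict: logs a warning and returns the configured default.
result = fail_or_default("[8,8", "json", "invalid JSON", strict=False, default=[])
print(result)  # []
```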

Why Not Put This in Reduce

Embedding string parsing in reduce would:

  • conflate reduction and deserialization concerns,
  • increase ambiguity (string syntax variants, malformed inputs),
  • create inconsistent behavior relative to other primitives.

A dedicated parser primitive preserves composability and predictability.

Extension Path

After V1 stabilizes, consider:

  1. format: delimiter, for values like "8|8|8|8|6", with a configurable delimiter.
  2. Optional strip_items behavior for string tokens.
  3. Optional python_literal mode only if justified (higher complexity/risk).
  4. Potential generic parsing family later (parse_value, parse_object) if needed.
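
For the first two extension items, a delimiter-based parser might look roughly like the following. This is only a sketch of a possible future format: the function name split_delimited and its parameters are assumptions, and the tokens it returns are strings that would still need item_type coercion.

```python
from __future__ import annotations


def split_delimited(value: str, delimiter: str = "|", strip_items: bool = True) -> list[str]:
    """Split a delimited string into string tokens (hypothetical future format)."""
    if not delimiter:
        raise ValueError("delimiter must be a non-empty string")
    tokens = value.split(delimiter)
    if strip_items:
        # Remove surrounding whitespace from each token.
        tokens = [token.strip() for token in tokens]
    return tokens


tokens = split_delimited("8 | 8|8|8| 6")
print(tokens)                      # ['8', '8', '8', '8', '6']
print([int(t) for t in tokens])    # [8, 8, 8, 8, 6]
```

Note that an empty input string yields a single empty token here, so an empty-array convention would have to be decided separately.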

Suggested Implementation Areas

  • New file: src/harmonization_framework/primitives/parse_array.py
  • Register operation:
    • src/harmonization_framework/primitives/vocabulary.py
    • src/harmonization_framework/primitives/__init__.py
    • src/harmonization_framework/harmonization_rule.py (from_serialization dispatch)

Test Plan

Add tests for:

  1. JSON list parsing success.
  2. Idempotent pass-through for list/tuple input.
  3. Item coercion success/failure for each item_type.
  4. Strict failure behavior.
  5. Non-strict fallback behavior (default).
  6. Invalid JSON / non-array JSON payloads.

Open Questions

  1. Should parse_array preserve tuple inputs as tuples in its output, or always normalize to list?
  2. For item_type=boolean, what token vocabulary should be accepted (true/false, 1/0, yes/no)?
  3. Should warning output route through existing logging infrastructure rather than print?
