Context
The HybridGym benchmarks (hybridgym_funclocalize, hybridgym_funcgen, hybridgym_depsearch, and hybridgym_issuelocalize) do not appear to have corresponding Harbor registry datasets/adapters yet. These should be ported as a benchmark family while keeping per-task scoring behavior intact.
Proposed scope
- Create Harbor adapters/tasks for each HybridGym variant.
- Preserve prompts, environment setup, and evaluator semantics for each variant.
- Validate parity separately for function localization, function generation, dependency search, and issue localization.
- Document any shared adapter code and per-variant differences.
Acceptance criteria
- All four HybridGym variants can be run through Harbor.
- Per-variant parity results are recorded against the current OpenHands benchmark workflows.
- OpenHands wrappers can delegate HybridGym execution to Harbor.
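One way the per-variant parity record could be produced is sketched below. This is only an illustration under stated assumptions: the `ParityResult` class, the `check_parity` helper, the single-float score per variant, and the `0.02` tolerance are all hypothetical and not part of Harbor or OpenHands. Each variant's real evaluator may emit a different metric shape.

```python
from dataclasses import dataclass

# The four variants named in this issue.
VARIANTS = (
    "hybridgym_funclocalize",
    "hybridgym_funcgen",
    "hybridgym_depsearch",
    "hybridgym_issuelocalize",
)


@dataclass(frozen=True)
class ParityResult:
    """Hypothetical record comparing one variant across the two workflows."""

    variant: str
    openhands_score: float
    harbor_score: float
    tolerance: float

    @property
    def delta(self) -> float:
        return abs(self.openhands_score - self.harbor_score)

    @property
    def within_tolerance(self) -> bool:
        return self.delta <= self.tolerance


def check_parity(openhands, harbor, tolerance=0.02):
    """Compare aggregate scores variant by variant.

    `openhands` and `harbor` are assumed to map variant name -> score.
    The tolerance is an arbitrary placeholder, not an agreed threshold.
    """
    results = []
    for variant in VARIANTS:
        if variant not in openhands or variant not in harbor:
            # Every variant must be reported separately, so a gap is an error.
            raise KeyError(f"missing score for {variant}")
        results.append(
            ParityResult(variant, openhands[variant], harbor[variant], tolerance)
        )
    return results
```

Keeping the comparison per variant, rather than averaging across the family, matches the requirement that scoring behavior stay intact for each task type; a regression in one variant would otherwise be masked by the other three.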