Merged
1 change: 1 addition & 0 deletions README.md
@@ -113,6 +113,7 @@ below). Adoption is voluntary; conformance, once adopted, is not.
| OVOS-CONTEXT-1 | [Intent Context](intent-context.md) | 1 | [Draft — in review (PR #18)](https://github.com/OpenVoiceOS/architecture/pull/18) |
| OVOS-CONVERSE-1 | [Active Handlers and Interactive Response](converse.md) | 1 | [Draft — in review (PR #25)](https://github.com/OpenVoiceOS/architecture/pull/25) |
| OVOS-STOP-1 | [Stop Pipeline Plugin](ovos-stop-1.md) | 1 | [Draft — in review (PR #33)](https://github.com/OpenVoiceOS/architecture/pull/33) |
| OVOS-AUDIO-IN-1 | [Audio Input Service](ovos-audio-in-1.md) | 1 | Draft |

Each spec carries its own scope statement, design rationale, and
conformance section in its header. Open the document for the full
158 changes: 158 additions & 0 deletions ovos-audio-in-1.md
@@ -0,0 +1,158 @@
# Audio Input Service Specification

**Spec ID:** OVOS-AUDIO-IN-1 · **Version:** 1 · **Status:** Draft

This specification defines the **audio input service** — the component
that acquires audio, runs the pre-STT transformer chain, transcribes
to text, and injects the result into the utterance lifecycle. How
audio is acquired is deployer-defined and out of scope.

It builds on three companion specifications:

- the *Utterance Lifecycle and Pipeline Specification*
(OVOS-PIPELINE-1) — the `ovos.utterance.handle` entry point (§9.1);
- the *Transformer Plugins Specification* (OVOS-TRANSFORM-1) — the
audio-transformer chain (§3.1) that runs before STT;
- the *Session Lifecycle and State Ownership Specification*
(OVOS-SESSION-2) — session assignment as the originator of
interactions.

The key words **MUST**, **MUST NOT**, **SHOULD**, **SHOULD NOT**,
**MAY**, and **RECOMMENDED** are used as in RFC 2119.

---

## 1. Scope

This specification does **not** define:

- **audio capture** — acquisition mechanism is deployer-defined;
- **STT engine selection** — engine, model, or API is deployer-defined;
- **post-STT transformer chains** — utterance and all subsequent
transformer stages are owned by the utterance lifecycle
(OVOS-PIPELINE-1) and the audio output layer (OVOS-AUDIO-1);
- **session persistence and resumption** — owned by OVOS-SESSION-2;
this spec defines only which session the emission carries (§5.2).

---

## 2. The audio input role

The audio input service acquires audio by any deployer-defined
mechanism, runs the audio-transformer chain (§4), transcribes via an
STT mechanism (§3), and emits the result on `ovos.utterance.handle`
(§5). It is the **producer** of utterance lifecycle messages per
OVOS-PIPELINE-1 §9.
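
The ordering described above can be illustrated with a minimal Python sketch. This is not part of the specification: the callable signatures, the `process_audio` name, and the `emit` hook are hypothetical stand-ins for deployer-defined components, and payload fields other than the transcript are left out here.

```python
from typing import Callable, Iterable, List

# Hypothetical signatures, chosen only for illustration.
AudioTransformer = Callable[[bytes, dict], bytes]  # may also annotate the session
SttEngine = Callable[[bytes], List[str]]           # returns transcription candidates
Emit = Callable[[str, dict], None]                 # stands in for the message bus


def process_audio(
    audio: bytes,
    session: dict,
    transformers: Iterable[AudioTransformer],
    stt: SttEngine,
    emit: Emit,
) -> List[str]:
    """Run the transformer chain, transcribe, then emit the result."""
    # The audio-transformer chain runs before STT (section 4).
    for transform in transformers:
        audio = transform(audio, session)
    # The STT mechanism is deployer-defined (section 3); any callable works here.
    utterances = stt(audio)
    # The service is the producer of the utterance lifecycle message (section 5).
    emit("ovos.utterance.handle", {"utterances": utterances})
    return utterances
```

Keeping the transformers, the STT engine, and the bus hook as injected callables mirrors the spec's position that each of them is deployer-defined.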

---

## 3. STT mechanism

The audio input service **MUST** have access to a speech-to-text
mechanism. The engine, model, API, or local process is
deployer-defined.

---

## 4. Audio-transformer chain

Before passing audio to STT, the audio input service **MUST** run the
audio-transformer chain (**OVOS-TRANSFORM-1 §3.1**), configured per
OVOS-TRANSFORM-1 §4.

Canonical use cases:

- **Language identification** — writes `session.detected_lang` for
§5.1 language resolution and STT engine selection.
- **Denoising and normalisation** — noise reduction, gain
normalisation, format conversion.
- **Voice-print recognition** — writes an intermediate result to
`Message.context` (e.g. `context.voice_match`) for consolidation
by a metadata transformer into `session.voice_id` per
OVOS-USER-ID-1 §4.1.

---

## 5. Utterance emission

After transcription the audio input service **MUST** emit:

`ovos.utterance.handle`

per **OVOS-PIPELINE-1 §9.1**.

| Field | Type | Required | Meaning |
|-------|------|----------|---------|
| `utterances` | array of string | yes | Transcription candidates; first element is primary. |

⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

Use `array of strings` for the `utterances` type.

`array of string` is ambiguous in a normative table. Please switch to `array of strings` to match the established payload wording and reduce implementer interpretation drift.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@ovos-audio-in-1.md` at line 84, Change the type wording for the field named
"utterances" from "array of string" to "array of strings" in the specification
table so it matches established payload wording; locate the table row that
defines the utterances field (header shows `utterances | array of string | yes |
Transcription candidates; first element is primary.`) and update the type cell
to read `array of strings`.

| `lang` | string | yes | BCP-47 output language of the transcription. See §5.1. |

⚠️ Potential issue | 🟠 Major | ⚡ Quick win

`data.lang` is required here but optional-by-authority in PIPELINE-1.

This creates a contract mismatch: OVOS-PIPELINE-1 §9.1 allows `lang` only when the producer authoritatively knows the content language, but this spec currently makes it unconditional (`yes` / MUST emit `data.lang`). Please align this to avoid forcing synthesized or guessed language tags.

Also applies to: 128-129

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@ovos-audio-in-1.md` at line 85, The table entry that currently marks
`data.lang` as required (`yes` / MUST) is incorrect; update the `lang` field to
reflect the PIPELINE-1 §9.1 rule (optional-by-authority) so producers only emit
`data.lang` when they authoritatively know the content language. Change the
`lang` requirement from unconditional "yes / MUST emit" to an
authority-conditional form (e.g., "conditional / when authoritative" or
"optional-by-authority") in the `data.lang` row and make the same adjustment for
the other occurrences referenced (lines 128-129) so the spec consistently defers
to PIPELINE-1 §9.1. Ensure the text mentions `data.lang` and cites PIPELINE-1
§9.1 for clarity.


### 5.1 Language resolution

Select the STT input language in this order:

1. `session.detected_lang` (**OVOS-SESSION-1 §3.2.6**) — audio
transformer's language classification.
2. `session.request_lang` (**OVOS-SESSION-1 §3.2.5**) — hint from
the capture mechanism (e.g. wake word, UI language selector).
3. `session.lang` (**OVOS-SESSION-1 §3.2.1**) — session's general
language preference.

The first present and non-empty value wins. If none is present, use a
deployment-configured default.

The service **SHOULD** write the selected language to `session.stt_lang`
(**OVOS-SESSION-1 §3.2.4**) before STT invocation. `stt_lang`
records the model's assumed input language and normally matches
`data.lang`; they diverge in speech-translation models where the
audio and transcript languages differ.
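
The resolution order above could be implemented roughly as follows. This is a sketch under stated assumptions: the session is modelled as a flat `dict` keyed by the field names used in this section, and `resolve_stt_lang` and `prepare_session_for_stt` are hypothetical helper names, not part of any OVOS API.

```python
# Priority order from section 5.1: detected, then requested, then general preference.
LANG_PRIORITY = ("detected_lang", "request_lang", "lang")


def resolve_stt_lang(session: dict, default_lang: str) -> str:
    """Return the first present, non-empty language, else the deployment default."""
    for key in LANG_PRIORITY:
        value = session.get(key)
        if value:  # skips both missing keys and empty strings
            return value
    return default_lang


def prepare_session_for_stt(session: dict, default_lang: str) -> str:
    """Record the selected input language in `stt_lang` before STT is invoked."""
    lang = resolve_stt_lang(session, default_lang)
    session["stt_lang"] = lang
    return lang
```

For example, a session holding an empty `detected_lang` and a `request_lang` of `pt-PT` resolves to `pt-PT`, because empty values are skipped rather than treated as a match.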

### 5.2 Session assignment

The audio input service **MUST** assign a session to every emission,
placed in `context.session` (**OVOS-MSG-1 §4**).

- **Local device** — SHOULD use `session_id: "default"`
(**OVOS-SESSION-2 §5**).
- **Satellite** — session is assigned by the bridge at the hub
boundary (**OVOS-BRIDGE-1 §4.2.1**); the bridge relays or
NAT-translates the `session_id` as needed.
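
As an illustration of §5 and §5.2 together, the sketch below assembles an emission for a local device. The `type` / `data` / `context` envelope layout is an assumption made for this example (the authoritative shape is defined by OVOS-MSG-1), and `build_utterance_message` is a hypothetical helper name.

```python
from typing import List, Optional


def build_utterance_message(
    utterances: List[str],
    lang: str,
    session: Optional[dict] = None,
) -> dict:
    """Assemble an `ovos.utterance.handle` emission that carries a session."""
    if session is None:
        # Local device case: fall back to the default session (section 5.2).
        session = {"session_id": "default"}
    return {
        "type": "ovos.utterance.handle",  # assumed envelope key
        "data": {"utterances": list(utterances), "lang": lang},
        "context": {"session": session},
    }
```

A satellite deployment would pass in whatever session the bridge assigned instead of relying on the default.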

---

## 6. Conformance

### An audio input service **MUST**:

- have access to an STT mechanism (§3);
- run the audio-transformer chain (OVOS-TRANSFORM-1 §3.1) before
STT (§4);
- assign a session in `context.session` per §5.2;
- emit `ovos.utterance.handle` with `data.utterances` and `data.lang`
(§5).

### An audio input service **SHOULD**:

- use `session_id: "default"` when co-located with the orchestrator
(§5.2);
- write `session.stt_lang` before STT invocation (§5.1).

### An audio input service **MAY**:

- emit multiple candidate transcriptions in `data.utterances`.

---

## See also

- **OVOS-PIPELINE-1** — utterance lifecycle entry point (§9.1);
post-STT transformer chains are owned here.
- **OVOS-AUDIO-1** — audio output service; owns dialog and TTS
transformer chains.
- **OVOS-TRANSFORM-1** — audio-transformer chain (§3.1).
- **OVOS-SESSION-1** — `session.lang`, `session.stt_lang`,
`session.detected_lang`, `session.request_lang`.
- **OVOS-SESSION-2** — session assignment and default-session rule.
- **OVOS-MSG-1** — session carrier (§4) and envelope.
- **OVOS-BRIDGE-1** — satellite session assignment (§4.2.1).
- **OVOS-USER-ID-1** — user identity resolution; voice-print
recognition is an audio-transformer use case (§4.1).
17 changes: 11 additions & 6 deletions ovos-session-1.md
@@ -363,12 +363,17 @@ voices the text correctly.

#### 3.2.4 `stt_lang`

`stt_lang` — string — the BCP-47 tag the speech-to-text stage
**actually transcribed in**. It records the language the audio was
decoded as, regardless of what was requested or expected. It is
typically populated by the component that produced the transcript;
once set, it travels with the session until overwritten by a later
stage that re-transcribes.
`stt_lang` — string — the BCP-47 tag the speech-to-text stage was
**configured to assume** for the audio (the model's input language).
It is written by the audio input service before or at the point of
STT invocation. In a straightforward transcription, `stt_lang`
matches `data.lang` (the transcript's output language). In a
speech-translation model, they diverge: `stt_lang` is the audio's
spoken language; `data.lang` is the language the transcript was
produced in. Downstream stages that need the audio's source language
read `stt_lang`; stages that need the transcript's language read
`data.lang` or `session.lang`. Once set, `stt_lang` travels with
Comment on lines +374 to +375

⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Do not recommend `session.lang` as a transcript-language fallback.

The clause “stages that need the transcript’s language read `data.lang` or `session.lang`” conflicts with utterance-layer semantics where payload language must come from `data.lang` (and must not be synthesized from session preference). Please remove “or `session.lang`” here to keep contracts consistent.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@ovos-session-1.md` around lines 374 - 375, The text incorrectly advises
stages to read transcript language from "data.lang or session.lang"; remove "or
session.lang" so stages only use data.lang (i.e., keep "read `stt_lang`; stages
that need the transcript's language read `data.lang`"). Ensure any mention of
`session.lang` as a fallback for the transcript language is deleted and add a
clarifying note that the transcript language must come from the utterance
payload (`data.lang`) and not be synthesized from `session.lang`.

the session until overwritten by a later transcription stage.

#### 3.2.5 `request_lang`
