From 45f6598090da3896879314815e9b1b20626c9df5 Mon Sep 17 00:00:00 2001 From: JarbasAi Date: Thu, 28 May 2026 09:42:23 +0100 Subject: [PATCH 1/7] AUDIO-IN-1 v1: audio input service specification MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit Minimal spec with three normative obligations: 1. A STT mechanism MUST exist (deployer-defined — engine, API, model are all out of scope) 2. Audio-transformer chain (TRANSFORM-1 §3.1) MUST run before STT 3. MUST emit ovos.utterance.handle with data.utterances and data.lang Everything else — audio capture method (mic, file, remote, wake word, VAD), STT engine selection, post-STT transformer chains — is deployer concern and explicitly out of scope. Language resolved from session.detected_lang → session.stt_lang → session.lang in that order. Co-Authored-By: Claude Sonnet 4.6 --- README.md | 1 + ovos-audio-in-1.md | 146 +++++++++++++++++++++++++++++++++++++++++++++ 2 files changed, 147 insertions(+) create mode 100644 ovos-audio-in-1.md diff --git a/README.md b/README.md index a63842a..3898b64 100644 --- a/README.md +++ b/README.md @@ -113,6 +113,7 @@ below). Adoption is voluntary; conformance, once adopted, is not. | OVOS-CONTEXT-1 | [Intent Context](intent-context.md) | 1 | [Draft — in review (PR #18)](https://github.com/OpenVoiceOS/architecture/pull/18) | | OVOS-CONVERSE-1 | [Active Handlers and Interactive Response](converse.md) | 1 | [Draft — in review (PR #25)](https://github.com/OpenVoiceOS/architecture/pull/25) | | OVOS-STOP-1 | [Stop Pipeline Plugin](ovos-stop-1.md) | 1 | [Draft — in review (PR #33)](https://github.com/OpenVoiceOS/architecture/pull/33) | +| OVOS-AUDIO-IN-1 | [Audio Input Service](ovos-audio-in-1.md) | 1 | Draft | Each spec carries its own scope statement, design rationale, and conformance section in its header. 
Open the document for the full diff --git a/ovos-audio-in-1.md b/ovos-audio-in-1.md new file mode 100644 index 0000000..1ba09fc --- /dev/null +++ b/ovos-audio-in-1.md @@ -0,0 +1,146 @@ +# Audio Input Service Specification + +**Spec ID:** OVOS-AUDIO-IN-1 · **Version:** 1 · **Status:** Draft + +This specification defines the **audio input service** — the component +that acquires audio, processes it through the pre-STT transformer +chain, transcribes it to text, and injects the result into the +utterance lifecycle. + +How audio is acquired — microphone capture, file playback, remote +streaming, wake-word gating, voice-activity detection, push-to-talk, +or any other mechanism — is deployer-defined and out of scope. + +It builds on two companion specifications: + +- the *Utterance Lifecycle and Pipeline Specification* + (OVOS-PIPELINE-1) — the `ovos.utterance.handle` entry point (§9.1); +- the *Transformer Plugins Specification* (OVOS-TRANSFORM-1) — the + audio-transformer chain (§3.1) that runs before STT. + +The key words **MUST**, **MUST NOT**, **SHOULD**, **SHOULD NOT**, +**MAY**, and **RECOMMENDED** are used as in RFC 2119. + +--- + +## 1. Scope + +This specification defines: + +- **the audio input role** (§2) — what the service produces; +- **the STT obligation** (§3) — that a transcription mechanism exists; +- **the audio-transformer obligation** (§4) — running the pre-STT + transformer chain; +- **the utterance emission** (§5) — topic, payload shape, and language + resolution. 
+ +It does **not** define: + +- **audio capture** — microphone access, file reading, remote streaming, + wake-word detection, VAD, push-to-talk, or any other acquisition + mechanism; +- **STT engine selection** — which engine is used or how it is + configured; +- **post-STT processing** — utterance transformers + (OVOS-TRANSFORM-1 §3.2) and metadata transformers (§3.3) are + deployer concerns; the service MAY run them before emission; +- **session lifecycle** — how sessions are created or identified. + +--- + +## 2. The audio input role + +The audio input service acquires audio by any deployer-defined +mechanism, processes it through the audio-transformer chain (§4), +transcribes it via a STT mechanism (§3), and emits the result on +`ovos.utterance.handle` (§5). + +It is the **producer** of utterance lifecycle messages and the first +component in the utterance lifecycle per OVOS-PIPELINE-1 §9. + +--- + +## 3. STT mechanism + +The audio input service **MUST** have access to a speech-to-text +mechanism that converts processed audio into one or more candidate +transcription strings. The specific engine, model, API, or local +process is deployer-defined; this specification places no constraint +on it beyond the requirement that it exists and produces text. + +--- + +## 4. Audio-transformer chain + +Before passing audio to the STT mechanism, the audio input service +**MUST** run the audio-transformer chain (**OVOS-TRANSFORM-1 §3.1**). +The chain is ordered and configured per OVOS-TRANSFORM-1 §4; the +`context.session` is passed to each transformer. + +Audio transformers MAY perform noise reduction, format normalisation, +acoustic language detection (writing `session.detected_lang`), or any +other audio-domain processing. A deployment with no audio transformers +configured passes audio to STT unchanged. + +--- + +## 5. 
Utterance emission + +After transcription the audio input service **MUST** emit: + +`ovos.utterance.handle` + +per **OVOS-PIPELINE-1 §9.1**, with `context.session` populated per +**OVOS-MSG-1 §4**. + +Payload: + +| Field | Type | Required | Meaning | +|-------|------|----------|---------| +| `utterances` | array of string | yes | One or more candidate transcription strings. The first element is the primary candidate. | +| `lang` | string | yes | The BCP-47 language tag for the transcription. See §5.1. | + +### 5.1 Language resolution + +`data.lang` MUST be set to the language the STT mechanism transcribed +in. The service resolves the language in this order: + +1. `session.detected_lang` — if an audio transformer has detected the + spoken language and written it to this field, use it. +2. `session.stt_lang` — the session's explicit STT language preference, + if set. +3. `session.lang` — the session's general language preference. + +The first present and non-empty value wins. If none is present the +service SHOULD use a deployment-configured default language. + +--- + +## 6. Conformance + +### An audio input service **MUST**: + +- have access to a STT mechanism (§3); +- run the audio-transformer chain (OVOS-TRANSFORM-1 §3.1) before + passing audio to STT (§4); +- emit `ovos.utterance.handle` with `data.utterances` (array of + strings) and `data.lang` (BCP-47 tag) after transcription (§5); +- populate `context.session` per OVOS-MSG-1 §4. + +### An audio input service **MAY**: + +- acquire audio by any mechanism (§2); +- run the utterance-transformer chain (OVOS-TRANSFORM-1 §3.2) on the + transcription before emission; +- emit multiple candidate transcriptions in `data.utterances`. + +--- + +## See also + +- **OVOS-PIPELINE-1** — utterance lifecycle entry point (§9.1). +- **OVOS-TRANSFORM-1** — audio-transformer chain (§3.1). +- **OVOS-SESSION-1** — `session.lang`, `session.stt_lang`, + `session.detected_lang`. +- **OVOS-MSG-1** — session carrier (§4) and envelope. 
+- **OVOS-AUDIO-1** — the audio output service. From 26f5e3db463aca1ed05d89e7a52d90950fea6317 Mon Sep 17 00:00:00 2001 From: JarbasAi Date: Thu, 28 May 2026 09:44:57 +0100 Subject: [PATCH 2/7] =?UTF-8?q?AUDIO-IN-1=20=C2=A74:=20add=20canonical=20a?= =?UTF-8?q?udio=20transformer=20use=20cases?= MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit - Language identification: detected lang written to session.detected_lang - Denoising/normalisation: noise reduction, format conversion - Speaker recognition: speaker_id written to Message.context for downstream personalisation without audio service knowing semantics Co-Authored-By: Claude Sonnet 4.6 --- ovos-audio-in-1.md | 18 ++++++++++++++---- 1 file changed, 14 insertions(+), 4 deletions(-) diff --git a/ovos-audio-in-1.md b/ovos-audio-in-1.md index 1ba09fc..7c0304f 100644 --- a/ovos-audio-in-1.md +++ b/ovos-audio-in-1.md @@ -77,10 +77,20 @@ Before passing audio to the STT mechanism, the audio input service The chain is ordered and configured per OVOS-TRANSFORM-1 §4; the `context.session` is passed to each transformer. -Audio transformers MAY perform noise reduction, format normalisation, -acoustic language detection (writing `session.detected_lang`), or any -other audio-domain processing. A deployment with no audio transformers -configured passes audio to STT unchanged. +Canonical audio transformer use cases include: + +- **Language identification** — detecting the spoken language from + the audio signal and writing it to `session.detected_lang`, so + that §5.1 language resolution and the STT engine can use it. +- **Denoising and normalisation** — background noise reduction, gain + normalisation, sample-rate or format conversion before STT. +- **Speaker recognition** — identifying the speaker from the audio + and writing the result into `Message.context` (e.g. 
a `speaker_id` + key) so that downstream pipeline stages and skills can personalise + responses without the audio input service knowing their semantics. + +A deployment with no audio transformers configured passes audio to +STT unchanged. --- From 93a974735e85198b7a32cd58624190b960748c77 Mon Sep 17 00:00:00 2001 From: JarbasAi Date: Thu, 28 May 2026 09:47:24 +0100 Subject: [PATCH 3/7] =?UTF-8?q?AUDIO-IN-1=20=C2=A75.1:=20fix=20stt=5Flang?= =?UTF-8?q?=20=E2=80=94=20it's=20a=20write,=20not=20a=20read?= MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit session.stt_lang records the language STT actually decoded in (SESSION-1 §3.2.4). It is a result written by the audio input service after transcription, not an input to language selection. Corrected language resolution order (inputs to STT selection): 1. session.detected_lang (audio transformer detection) 2. session.request_lang (capture mechanism hint, e.g. wake word) 3. session.lang (general session preference) Added SHOULD obligation to write session.stt_lang after transcription. Co-Authored-By: Claude Sonnet 4.6 --- ovos-audio-in-1.md | 30 +++++++++++++++++++++++------- 1 file changed, 23 insertions(+), 7 deletions(-) diff --git a/ovos-audio-in-1.md b/ovos-audio-in-1.md index 7c0304f..42264c7 100644 --- a/ovos-audio-in-1.md +++ b/ovos-audio-in-1.md @@ -113,17 +113,28 @@ Payload: ### 5.1 Language resolution `data.lang` MUST be set to the language the STT mechanism transcribed -in. The service resolves the language in this order: - -1. `session.detected_lang` — if an audio transformer has detected the - spoken language and written it to this field, use it. -2. `session.stt_lang` — the session's explicit STT language preference, - if set. -3. `session.lang` — the session's general language preference. +in. The service selects the STT language from these inputs in order: + +1. 
`session.detected_lang` — the language a language-detection audio + transformer classified the audio as (**OVOS-SESSION-1 §3.2.6**). + Most specific signal; use it when present. +2. `session.request_lang` — a hint from the capture mechanism about + the expected language (e.g. the wake word that triggered capture, + or a UI language selector) (**OVOS-SESSION-1 §3.2.5**). A prior, + not a guarantee. +3. `session.lang` — the session's general language preference + (**OVOS-SESSION-1 §3.2.1**). The first present and non-empty value wins. If none is present the service SHOULD use a deployment-configured default language. +After transcription the service SHOULD write the language actually +used to `session.stt_lang` (**OVOS-SESSION-1 §3.2.4**) so that +downstream stages (intent matching, dialog transformers) know what +language the audio was decoded in. `stt_lang` is a result field +written by the audio input service; it is not an input to language +selection. + --- ## 6. Conformance @@ -137,6 +148,11 @@ service SHOULD use a deployment-configured default language. strings) and `data.lang` (BCP-47 tag) after transcription (§5); - populate `context.session` per OVOS-MSG-1 §4. +### An audio input service **SHOULD**: + +- write `session.stt_lang` to the language STT decoded in, after + transcription (§5.1). + ### An audio input service **MAY**: - acquire audio by any mechanism (§2); From 9dd08515b310ceb67662d411e9ca424ff6a4d090 Mon Sep 17 00:00:00 2001 From: JarbasAi Date: Thu, 28 May 2026 09:48:22 +0100 Subject: [PATCH 4/7] =?UTF-8?q?AUDIO-IN-1=20=C2=A75.1=20+=20SESSION-1=20?= =?UTF-8?q?=C2=A73.2.4:=20clarify=20stt=5Flang=20semantics?= MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit stt_lang is the language the STT model was configured to assume for the audio (model input language), written before/at STT invocation. 
In normal transcription stt_lang == data.lang; in speech-translation they diverge — stt_lang is the audio's spoken language, data.lang is the transcript's output language. SESSION-1 §3.2.4 updated to match: "actually transcribed in" was ambiguous in the translation case. Co-Authored-By: Claude Sonnet 4.6 --- ovos-audio-in-1.md | 15 +++++++++------ ovos-session-1.md | 17 +++++++++++------ 2 files changed, 20 insertions(+), 12 deletions(-) diff --git a/ovos-audio-in-1.md b/ovos-audio-in-1.md index 42264c7..fa001f5 100644 --- a/ovos-audio-in-1.md +++ b/ovos-audio-in-1.md @@ -128,12 +128,15 @@ in. The service selects the STT language from these inputs in order: The first present and non-empty value wins. If none is present the service SHOULD use a deployment-configured default language. -After transcription the service SHOULD write the language actually -used to `session.stt_lang` (**OVOS-SESSION-1 §3.2.4**) so that -downstream stages (intent matching, dialog transformers) know what -language the audio was decoded in. `stt_lang` is a result field -written by the audio input service; it is not an input to language -selection. +The service SHOULD write the selected input language to +`session.stt_lang` (**OVOS-SESSION-1 §3.2.4**) before or at the +point of STT invocation. `stt_lang` records the language the STT +model was **configured to assume** for the audio, which normally +matches `data.lang` but may differ when the STT model performs +speech translation — in that case `stt_lang` is the audio's +language and `data.lang` is the transcription's output language. +Downstream stages that need to know the audio's source language +(rather than the transcript's language) read `session.stt_lang`. --- diff --git a/ovos-session-1.md b/ovos-session-1.md index 552382f..54537ae 100644 --- a/ovos-session-1.md +++ b/ovos-session-1.md @@ -363,12 +363,17 @@ voices the text correctly. 
#### 3.2.4 `stt_lang` -`stt_lang` — string — the BCP-47 tag the speech-to-text stage -**actually transcribed in**. It records the language the audio was -decoded as, regardless of what was requested or expected. It is -typically populated by the component that produced the transcript; -once set, it travels with the session until overwritten by a later -stage that re-transcribes. +`stt_lang` — string — the BCP-47 tag the speech-to-text stage was +**configured to assume** for the audio (the model's input language). +It is written by the audio input service before or at the point of +STT invocation. In a straightforward transcription, `stt_lang` +matches `data.lang` (the transcript's output language). In a +speech-translation model, they diverge: `stt_lang` is the audio's +spoken language; `data.lang` is the language the transcript was +produced in. Downstream stages that need the audio's source language +read `stt_lang`; stages that need the transcript's language read +`data.lang` or `session.lang`. Once set, `stt_lang` travels with +the session until overwritten by a later transcription stage. #### 3.2.5 `request_lang` From 9ab7c080582546a7d8764f5e16d5dd4c9722c855 Mon Sep 17 00:00:00 2001 From: JarbasAi Date: Thu, 28 May 2026 10:03:38 +0100 Subject: [PATCH 5/7] AUDIO-IN-1: clarify post-STT ownership, session assignment, SESSION-2 ref MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit - §1 non-goals: replace confusing "post-STT transformers are deployer concern" with "owned by utterance lifecycle (PIPELINE-1), run after emission". Session lifecycle non-goal now cross-references SESSION-2 and points to §5.2. - §5.2 (new): session assignment. Audio input is the originator of interactions and MUST assign a session. Local device SHOULD use session_id "default" (SESSION-2 §5); satellite session is assigned by the bridge at the hub boundary (BRIDGE-1 §4.2.1). Session MUST be in context.session, not data. 
- §6 conformance: add session assignment MUST and SHOULD; remove utterance-transformer MAY (it belongs to the utterance lifecycle) - See also: add SESSION-2 and BRIDGE-1; note PIPELINE-1 owns post-STT Co-Authored-By: Claude Sonnet 4.6 --- ovos-audio-in-1.md | 68 +++++++++++++++++++++++++++++++++++----------- 1 file changed, 52 insertions(+), 16 deletions(-) diff --git a/ovos-audio-in-1.md b/ovos-audio-in-1.md index fa001f5..1e20d2a 100644 --- a/ovos-audio-in-1.md +++ b/ovos-audio-in-1.md @@ -11,12 +11,16 @@ How audio is acquired — microphone capture, file playback, remote streaming, wake-word gating, voice-activity detection, push-to-talk, or any other mechanism — is deployer-defined and out of scope. -It builds on two companion specifications: +It builds on three companion specifications: - the *Utterance Lifecycle and Pipeline Specification* - (OVOS-PIPELINE-1) — the `ovos.utterance.handle` entry point (§9.1); + (OVOS-PIPELINE-1) — the `ovos.utterance.handle` entry point (§9.1) + and the utterance lifecycle the emission triggers; - the *Transformer Plugins Specification* (OVOS-TRANSFORM-1) — the - audio-transformer chain (§3.1) that runs before STT. + audio-transformer chain (§3.1) that runs before STT; +- the *Session Lifecycle and State Ownership Specification* + (OVOS-SESSION-2) — the session assignment and state-ownership + rules this service must follow as the originator of interactions. The key words **MUST**, **MUST NOT**, **SHOULD**, **SHOULD NOT**, **MAY**, and **RECOMMENDED** are used as in RFC 2119. @@ -41,10 +45,12 @@ It does **not** define: mechanism; - **STT engine selection** — which engine is used or how it is configured; -- **post-STT processing** — utterance transformers - (OVOS-TRANSFORM-1 §3.2) and metadata transformers (§3.3) are - deployer concerns; the service MAY run them before emission; -- **session lifecycle** — how sessions are created or identified. 
+- **post-STT transformer chains** — utterance transformers and + all subsequent transformer stages are owned by the utterance + lifecycle (OVOS-PIPELINE-1) and run after the emission; +- **session persistence and resumption** — owned by + OVOS-SESSION-2; this spec defines only which session the + emission carries (§5.2). --- @@ -138,6 +144,30 @@ language and `data.lang` is the transcription's output language. Downstream stages that need to know the audio's source language (rather than the transcript's language) read `session.stt_lang`. +### 5.2 Session assignment + +The audio input service is the **originator** of the interaction — +it creates the `ovos.utterance.handle` Message that starts the +utterance lifecycle. It **MUST** assign a session to the Message +per **OVOS-SESSION-2** before emission. + +The appropriate session depends on the deployment: + +- **Local device** — the service SHOULD use `session_id: "default"`, + the orchestrator-owned default session + (**OVOS-SESSION-2 §5**). This is the normal case when the audio + input service and the orchestrator run on the same device. +- **Satellite** — when the audio input service runs on a satellite + that communicates with a hub via a bridge + (**OVOS-BRIDGE-1 §4.2.1**), the session is assigned by the bridge + at the hub boundary. The satellite emits `ovos.utterance.handle` + with its own session; the bridge relays it to the hub with the + appropriate `session_id` (its own, or NAT-translated per + **OVOS-BRIDGE-1 §3.2**). + +The session MUST be placed in `context.session` per +**OVOS-MSG-1 §4**, not in `data`. + --- ## 6. 
Conformance @@ -147,29 +177,35 @@ Downstream stages that need to know the audio's source language - have access to a STT mechanism (§3); - run the audio-transformer chain (OVOS-TRANSFORM-1 §3.1) before passing audio to STT (§4); +- assign a session to every emission per §5.2, placing it in + `context.session` (OVOS-MSG-1 §4); - emit `ovos.utterance.handle` with `data.utterances` (array of - strings) and `data.lang` (BCP-47 tag) after transcription (§5); -- populate `context.session` per OVOS-MSG-1 §4. + strings) and `data.lang` (BCP-47 tag) after transcription (§5). ### An audio input service **SHOULD**: -- write `session.stt_lang` to the language STT decoded in, after - transcription (§5.1). +- use `session_id: "default"` when running on the same device as + the orchestrator (§5.2); +- write `session.stt_lang` before or at the point of STT invocation + (§5.1). ### An audio input service **MAY**: - acquire audio by any mechanism (§2); -- run the utterance-transformer chain (OVOS-TRANSFORM-1 §3.2) on the - transcription before emission; - emit multiple candidate transcriptions in `data.utterances`. --- ## See also -- **OVOS-PIPELINE-1** — utterance lifecycle entry point (§9.1). +- **OVOS-PIPELINE-1** — utterance lifecycle entry point (§9.1); + post-STT transformer chains are owned here. - **OVOS-TRANSFORM-1** — audio-transformer chain (§3.1). -- **OVOS-SESSION-1** — `session.lang`, `session.stt_lang`, - `session.detected_lang`. +- **OVOS-SESSION-1** — session field registry; `session.lang`, + `session.stt_lang`, `session.detected_lang`, `session.request_lang`. +- **OVOS-SESSION-2** — session assignment, state ownership, and the + default-session rule (§5). - **OVOS-MSG-1** — session carrier (§4) and envelope. +- **OVOS-BRIDGE-1** — satellite deployment and session assignment at + the bridge boundary (§4.2.1). - **OVOS-AUDIO-1** — the audio output service. 
From d074ae2f09748aa857834c4fdf0e27d3fca63451 Mon Sep 17 00:00:00 2001 From: JarbasAi Date: Thu, 28 May 2026 10:07:15 +0100 Subject: [PATCH 6/7] AUDIO-IN-1: simplification pass (-57 lines) MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit - Preamble: drop repeated capture-method list (already in §1) - §1: drop "defines" list (restated section headings); keep non-goals only; add AUDIO-1 to post-STT non-goal (dialog/TTS chains live there) - §2: merge two-sentence role description into one - §3: drop obvious "no constraint beyond..." clause - §4: trim use-case bullet tails to one line each; drop "no transformers → unchanged" (obvious) - §5: drop redundant MSG-1 §4 reference (covered by §5.2) - §5.1: drop "most specific signal" and "prior not guarantee" padding - §5.2: drop "this is the normal case" sentence; drop final "MUST be in context.session not data" (in §6 MUST) - §6 MAY: remove "acquire audio by any mechanism" (a non-goal, not a MAY) - See also: AUDIO-1 added; entries tightened Co-Authored-By: Claude Sonnet 4.6 --- ovos-audio-in-1.md | 197 ++++++++++++++++----------------------------- 1 file changed, 70 insertions(+), 127 deletions(-) diff --git a/ovos-audio-in-1.md b/ovos-audio-in-1.md index 1e20d2a..5f9a87f 100644 --- a/ovos-audio-in-1.md +++ b/ovos-audio-in-1.md @@ -3,24 +3,19 @@ **Spec ID:** OVOS-AUDIO-IN-1 · **Version:** 1 · **Status:** Draft This specification defines the **audio input service** — the component -that acquires audio, processes it through the pre-STT transformer -chain, transcribes it to text, and injects the result into the -utterance lifecycle. - -How audio is acquired — microphone capture, file playback, remote -streaming, wake-word gating, voice-activity detection, push-to-talk, -or any other mechanism — is deployer-defined and out of scope. +that acquires audio, runs the pre-STT transformer chain, transcribes +to text, and injects the result into the utterance lifecycle. 
How +audio is acquired is deployer-defined and out of scope. It builds on three companion specifications: - the *Utterance Lifecycle and Pipeline Specification* - (OVOS-PIPELINE-1) — the `ovos.utterance.handle` entry point (§9.1) - and the utterance lifecycle the emission triggers; + (OVOS-PIPELINE-1) — the `ovos.utterance.handle` entry point (§9.1); - the *Transformer Plugins Specification* (OVOS-TRANSFORM-1) — the audio-transformer chain (§3.1) that runs before STT; - the *Session Lifecycle and State Ownership Specification* - (OVOS-SESSION-2) — the session assignment and state-ownership - rules this service must follow as the originator of interactions. + (OVOS-SESSION-2) — session assignment as the originator of + interactions. The key words **MUST**, **MUST NOT**, **SHOULD**, **SHOULD NOT**, **MAY**, and **RECOMMENDED** are used as in RFC 2119. @@ -29,74 +24,50 @@ The key words **MUST**, **MUST NOT**, **SHOULD**, **SHOULD NOT**, ## 1. Scope -This specification defines: - -- **the audio input role** (§2) — what the service produces; -- **the STT obligation** (§3) — that a transcription mechanism exists; -- **the audio-transformer obligation** (§4) — running the pre-STT - transformer chain; -- **the utterance emission** (§5) — topic, payload shape, and language - resolution. - -It does **not** define: - -- **audio capture** — microphone access, file reading, remote streaming, - wake-word detection, VAD, push-to-talk, or any other acquisition - mechanism; -- **STT engine selection** — which engine is used or how it is - configured; -- **post-STT transformer chains** — utterance transformers and - all subsequent transformer stages are owned by the utterance - lifecycle (OVOS-PIPELINE-1) and run after the emission; -- **session persistence and resumption** — owned by - OVOS-SESSION-2; this spec defines only which session the - emission carries (§5.2). 
+This specification does **not** define: + +- **audio capture** — acquisition mechanism is deployer-defined; +- **STT engine selection** — engine, model, or API is deployer-defined; +- **post-STT transformer chains** — utterance and all subsequent + transformer stages are owned by the utterance lifecycle + (OVOS-PIPELINE-1) and the audio output layer (OVOS-AUDIO-1); +- **session persistence and resumption** — owned by OVOS-SESSION-2; + this spec defines only which session the emission carries (§5.2). --- ## 2. The audio input role The audio input service acquires audio by any deployer-defined -mechanism, processes it through the audio-transformer chain (§4), -transcribes it via a STT mechanism (§3), and emits the result on -`ovos.utterance.handle` (§5). - -It is the **producer** of utterance lifecycle messages and the first -component in the utterance lifecycle per OVOS-PIPELINE-1 §9. +mechanism, runs the audio-transformer chain (§4), transcribes via a +STT mechanism (§3), and emits the result on `ovos.utterance.handle` +(§5). It is the **producer** of utterance lifecycle messages per +OVOS-PIPELINE-1 §9. --- ## 3. STT mechanism The audio input service **MUST** have access to a speech-to-text -mechanism that converts processed audio into one or more candidate -transcription strings. The specific engine, model, API, or local -process is deployer-defined; this specification places no constraint -on it beyond the requirement that it exists and produces text. +mechanism. The engine, model, API, or local process is +deployer-defined. --- ## 4. Audio-transformer chain -Before passing audio to the STT mechanism, the audio input service -**MUST** run the audio-transformer chain (**OVOS-TRANSFORM-1 §3.1**). -The chain is ordered and configured per OVOS-TRANSFORM-1 §4; the -`context.session` is passed to each transformer. 
- -Canonical audio transformer use cases include: +Before passing audio to STT, the audio input service **MUST** run the +audio-transformer chain (**OVOS-TRANSFORM-1 §3.1**), configured per +OVOS-TRANSFORM-1 §4. -- **Language identification** — detecting the spoken language from - the audio signal and writing it to `session.detected_lang`, so - that §5.1 language resolution and the STT engine can use it. -- **Denoising and normalisation** — background noise reduction, gain - normalisation, sample-rate or format conversion before STT. -- **Speaker recognition** — identifying the speaker from the audio - and writing the result into `Message.context` (e.g. a `speaker_id` - key) so that downstream pipeline stages and skills can personalise - responses without the audio input service knowing their semantics. +Canonical use cases: -A deployment with no audio transformers configured passes audio to -STT unchanged. +- **Language identification** — writes `session.detected_lang` for + §5.1 language resolution and STT engine selection. +- **Denoising and normalisation** — noise reduction, gain + normalisation, format conversion. +- **Speaker recognition** — writes a `speaker_id` (or equivalent) + into `Message.context` for downstream personalisation. --- @@ -106,67 +77,43 @@ After transcription the audio input service **MUST** emit: `ovos.utterance.handle` -per **OVOS-PIPELINE-1 §9.1**, with `context.session` populated per -**OVOS-MSG-1 §4**. - -Payload: +per **OVOS-PIPELINE-1 §9.1**. | Field | Type | Required | Meaning | |-------|------|----------|---------| -| `utterances` | array of string | yes | One or more candidate transcription strings. The first element is the primary candidate. | -| `lang` | string | yes | The BCP-47 language tag for the transcription. See §5.1. | +| `utterances` | array of string | yes | Transcription candidates; first element is primary. | +| `lang` | string | yes | BCP-47 output language of the transcription. See §5.1. 
|
 
 ### 5.1 Language resolution
 
-`data.lang` MUST be set to the language the STT mechanism transcribed
-in. The service selects the STT language from these inputs in order:
-
-1. `session.detected_lang` — the language a language-detection audio
-   transformer classified the audio as (**OVOS-SESSION-1 §3.2.6**).
-   Most specific signal; use it when present.
-2. `session.request_lang` — a hint from the capture mechanism about
-   the expected language (e.g. the wake word that triggered capture,
-   or a UI language selector) (**OVOS-SESSION-1 §3.2.5**). A prior,
-   not a guarantee.
-3. `session.lang` — the session's general language preference
-   (**OVOS-SESSION-1 §3.2.1**).
-
-The first present and non-empty value wins. If none is present the
-service SHOULD use a deployment-configured default language.
-
-The service SHOULD write the selected input language to
-`session.stt_lang` (**OVOS-SESSION-1 §3.2.4**) before or at the
-point of STT invocation. `stt_lang` records the language the STT
-model was **configured to assume** for the audio, which normally
-matches `data.lang` but may differ when the STT model performs
-speech translation — in that case `stt_lang` is the audio's
-language and `data.lang` is the transcription's output language.
-Downstream stages that need to know the audio's source language
-(rather than the transcript's language) read `session.stt_lang`.
+Select the STT input language in this order:
 
-### 5.2 Session assignment
+1. `session.detected_lang` (**OVOS-SESSION-1 §3.2.6**) — audio
+   transformer's language classification.
+2. `session.request_lang` (**OVOS-SESSION-1 §3.2.5**) — hint from
+   the capture mechanism (e.g. wake word, UI language selector).
+3. `session.lang` (**OVOS-SESSION-1 §3.2.1**) — session's general
+   language preference.
 
-The audio input service is the **originator** of the interaction —
-it creates the `ovos.utterance.handle` Message that starts the
-utterance lifecycle. It **MUST** assign a session to the Message
-per **OVOS-SESSION-2** before emission.
+First present and non-empty value wins. If none is present use a
+deployment-configured default.
 
-The appropriate session depends on the deployment:
+The service SHOULD write the selected language to `session.stt_lang`
+(**OVOS-SESSION-1 §3.2.4**) before STT invocation. `stt_lang`
+records the model's assumed input language and normally matches
+`data.lang`; they diverge in speech-translation models where the
+audio and transcript languages differ.
+
+### 5.2 Session assignment
 
-- **Local device** — the service SHOULD use `session_id: "default"`,
-  the orchestrator-owned default session
-  (**OVOS-SESSION-2 §5**). This is the normal case when the audio
-  input service and the orchestrator run on the same device.
-- **Satellite** — when the audio input service runs on a satellite
-  that communicates with a hub via a bridge
-  (**OVOS-BRIDGE-1 §4.2.1**), the session is assigned by the bridge
-  at the hub boundary. The satellite emits `ovos.utterance.handle`
-  with its own session; the bridge relays it to the hub with the
-  appropriate `session_id` (its own, or NAT-translated per
-  **OVOS-BRIDGE-1 §3.2**).
+The audio input service **MUST** assign a session to every emission,
+placed in `context.session` (**OVOS-MSG-1 §4**).
 
-The session MUST be placed in `context.session` per
-**OVOS-MSG-1 §4**, not in `data`.
+- **Local device** — SHOULD use `session_id: "default"`
+  (**OVOS-SESSION-2 §5**).
+- **Satellite** — session is assigned by the bridge at the hub
+  boundary (**OVOS-BRIDGE-1 §4.2.1**); the bridge relays or
+  NAT-translates the `session_id` as needed.
 
 ---
 
@@ -176,22 +123,19 @@ The session MUST be placed in `context.session` per
 
 - have access to a STT mechanism (§3);
 - run the audio-transformer chain (OVOS-TRANSFORM-1 §3.1) before
-  passing audio to STT (§4);
-- assign a session to every emission per §5.2, placing it in
-  `context.session` (OVOS-MSG-1 §4);
-- emit `ovos.utterance.handle` with `data.utterances` (array of
-  strings) and `data.lang` (BCP-47 tag) after transcription (§5).
+  STT (§4);
+- assign a session in `context.session` per §5.2;
+- emit `ovos.utterance.handle` with `data.utterances` and `data.lang`
+  (§5).
 
 ### An audio input service **SHOULD**:
 
-- use `session_id: "default"` when running on the same device as
-  the orchestrator (§5.2);
-- write `session.stt_lang` before or at the point of STT invocation
-  (§5.1).
+- use `session_id: "default"` when co-located with the orchestrator
+  (§5.2);
+- write `session.stt_lang` before STT invocation (§5.1).
 
 ### An audio input service **MAY**:
 
-- acquire audio by any mechanism (§2);
 - emit multiple candidate transcriptions in `data.utterances`.
 
 ---
@@ -200,12 +144,11 @@ The session MUST be placed in `context.session` per
 
 - **OVOS-PIPELINE-1** — utterance lifecycle entry point (§9.1);
   post-STT transformer chains are owned here.
+- **OVOS-AUDIO-1** — audio output service; owns dialog and TTS
+  transformer chains.
 - **OVOS-TRANSFORM-1** — audio-transformer chain (§3.1).
-- **OVOS-SESSION-1** — session field registry; `session.lang`,
-  `session.stt_lang`, `session.detected_lang`, `session.request_lang`.
-- **OVOS-SESSION-2** — session assignment, state ownership, and the
-  default-session rule (§5).
+- **OVOS-SESSION-1** — `session.lang`, `session.stt_lang`,
+  `session.detected_lang`, `session.request_lang`.
+- **OVOS-SESSION-2** — session assignment and default-session rule.
 - **OVOS-MSG-1** — session carrier (§4) and envelope.
-- **OVOS-BRIDGE-1** — satellite deployment and session assignment at
-  the bridge boundary (§4.2.1).
-- **OVOS-AUDIO-1** — the audio output service.
+- **OVOS-BRIDGE-1** — satellite session assignment (§4.2.1).

From 649542ec24f87488d8946f8c318e2e5e2dc0234b Mon Sep 17 00:00:00 2001
From: JarbasAi
Date: Thu, 28 May 2026 10:33:53 +0100
Subject: [PATCH 7/7] AUDIO-IN-1: cross-reference USER-ID-1 voice signal injection
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

Audio transformer is the inline voice-signal injection point per
USER-ID-1 §3.1; context.voice_match is the intermediate signal.

Co-Authored-By: Claude Sonnet 4.6
---
 ovos-audio-in-1.md | 8 ++++++--
 1 file changed, 6 insertions(+), 2 deletions(-)

diff --git a/ovos-audio-in-1.md b/ovos-audio-in-1.md
index 5f9a87f..71525f0 100644
--- a/ovos-audio-in-1.md
+++ b/ovos-audio-in-1.md
@@ -66,8 +66,10 @@ Canonical use cases:
   §5.1 language resolution and STT engine selection.
 - **Denoising and normalisation** — noise reduction, gain
   normalisation, format conversion.
-- **Speaker recognition** — writes a `speaker_id` (or equivalent)
-  into `Message.context` for downstream personalisation.
+- **Voice-print recognition** — writes an intermediate result to
+  `Message.context` (e.g. `context.voice_match`) for consolidation
+  by a metadata transformer into `session.voice_id` per
+  OVOS-USER-ID-1 §4.1.
 
 ---
 
@@ -152,3 +154,5 @@ placed in `context.session` (**OVOS-MSG-1 §4**).
 - **OVOS-SESSION-2** — session assignment and default-session rule.
 - **OVOS-MSG-1** — session carrier (§4) and envelope.
 - **OVOS-BRIDGE-1** — satellite session assignment (§4.2.1).
+- **OVOS-USER-ID-1** — user identity resolution; voice-print
+  recognition is an audio-transformer use case (§4.1).
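
Illustrative note, not part of the patch: the §5.1 resolution order and the §5.2 session placement can be sketched in a few lines of Python. This is a minimal sketch under stated assumptions, not the reference implementation. The function names `resolve_stt_lang` and `build_utterance_message`, the `default_lang` value, and the dict envelope with `type`, `data`, and `context` keys are all hypothetical; the actual envelope is defined by OVOS-MSG-1, which is not reproduced here.

```python
from typing import Any, Dict, Iterable, Mapping, Optional

# Resolution order from §5.1: detected_lang, then request_lang, then lang.
LANG_FIELDS = ("detected_lang", "request_lang", "lang")


def resolve_stt_lang(session: Mapping[str, Any], default_lang: str) -> str:
    """Return the first present, non-empty session language, else the default."""
    for field in LANG_FIELDS:
        value = session.get(field)
        if isinstance(value, str) and value.strip():
            return value
    return default_lang


def build_utterance_message(
    utterances: Iterable[str],
    session: Mapping[str, Any],
    default_lang: str = "en-US",        # hypothetical deployment default
    output_lang: Optional[str] = None,  # set only by speech-translation STT
) -> Dict[str, Any]:
    """Build an `ovos.utterance.handle` payload (envelope shape is assumed)."""
    out_session = dict(session)  # copy so the caller's session is not mutated
    # §5.2: a co-located service SHOULD use the default session.
    out_session.setdefault("session_id", "default")
    # §5.1: record the language the STT model is told to assume.
    stt_lang = resolve_stt_lang(out_session, default_lang)
    out_session["stt_lang"] = stt_lang
    return {
        "type": "ovos.utterance.handle",
        "data": {
            "utterances": list(utterances),
            # data.lang normally equals stt_lang; it differs only when the
            # STT model translates while transcribing.
            "lang": output_lang or stt_lang,
        },
        # §5.2: the session travels in context.session, never in data.
        "context": {"session": out_session},
    }
```

For example, a session carrying an empty `detected_lang` and `request_lang: "pt-PT"` resolves to `pt-PT`, because empty values are skipped rather than treated as present.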