From 45f6598090da3896879314815e9b1b20626c9df5 Mon Sep 17 00:00:00 2001
From: JarbasAi <jarbasai@mailfence.com>
Date: Thu, 28 May 2026 09:42:23 +0100
Subject: [PATCH 1/7] AUDIO-IN-1 v1: audio input service specification
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

Minimal spec with three normative obligations:
1. An STT mechanism MUST exist (deployer-defined; engine, API, and model
   are all out of scope)
2. Audio-transformer chain (TRANSFORM-1 §3.1) MUST run before STT
3. MUST emit ovos.utterance.handle with data.utterances and data.lang

Everything else is a deployer concern and explicitly out of scope:
audio capture method (mic, file, remote, wake word, VAD), STT engine
selection, and post-STT transformer chains. Language is resolved from
session.detected_lang → session.stt_lang → session.lang, in that order.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
---
 README.md          |   1 +
 ovos-audio-in-1.md | 146 +++++++++++++++++++++++++++++++++++++++++++++
 2 files changed, 147 insertions(+)
 create mode 100644 ovos-audio-in-1.md

diff --git a/README.md b/README.md
index a63842a..3898b64 100644
--- a/README.md
+++ b/README.md
@@ -113,6 +113,7 @@ below). Adoption is voluntary; conformance, once adopted, is not.
 | OVOS-CONTEXT-1 | [Intent Context](intent-context.md) | 1 | [Draft — in review (PR #18)](https://github.com/OpenVoiceOS/architecture/pull/18) |
 | OVOS-CONVERSE-1 | [Active Handlers and Interactive Response](converse.md) | 1 | [Draft — in review (PR #25)](https://github.com/OpenVoiceOS/architecture/pull/25) |
 | OVOS-STOP-1 | [Stop Pipeline Plugin](ovos-stop-1.md) | 1 | [Draft — in review (PR #33)](https://github.com/OpenVoiceOS/architecture/pull/33) |
+| OVOS-AUDIO-IN-1 | [Audio Input Service](ovos-audio-in-1.md) | 1 | Draft |
 
 Each spec carries its own scope statement, design rationale, and
 conformance section in its header. Open the document for the full
diff --git a/ovos-audio-in-1.md b/ovos-audio-in-1.md
new file mode 100644
index 0000000..1ba09fc
--- /dev/null
+++ b/ovos-audio-in-1.md
@@ -0,0 +1,146 @@
+# Audio Input Service Specification
+
+**Spec ID:** OVOS-AUDIO-IN-1 · **Version:** 1 · **Status:** Draft
+
+This specification defines the **audio input service** — the component
+that acquires audio, processes it through the pre-STT transformer
+chain, transcribes it to text, and injects the result into the
+utterance lifecycle.
+
+How audio is acquired — microphone capture, file playback, remote
+streaming, wake-word gating, voice-activity detection, push-to-talk,
+or any other mechanism — is deployer-defined and out of scope.
+
+It builds on two companion specifications:
+
+- the *Utterance Lifecycle and Pipeline Specification*
+  (OVOS-PIPELINE-1) — the `ovos.utterance.handle` entry point (§9.1);
+- the *Transformer Plugins Specification* (OVOS-TRANSFORM-1) — the
+  audio-transformer chain (§3.1) that runs before STT.
+
+The key words **MUST**, **MUST NOT**, **SHOULD**, **SHOULD NOT**,
+**MAY**, and **RECOMMENDED** are used as in RFC 2119.
+
+---
+
+## 1. Scope
+
+This specification defines:
+
+- **the audio input role** (§2) — what the service produces;
+- **the STT obligation** (§3) — that a transcription mechanism exists;
+- **the audio-transformer obligation** (§4) — running the pre-STT
+  transformer chain;
+- **the utterance emission** (§5) — topic, payload shape, and language
+  resolution.
+
+It does **not** define:
+
+- **audio capture** — microphone access, file reading, remote streaming,
+  wake-word detection, VAD, push-to-talk, or any other acquisition
+  mechanism;
+- **STT engine selection** — which engine is used or how it is
+  configured;
+- **post-STT processing** — utterance transformers
+  (OVOS-TRANSFORM-1 §3.2) and metadata transformers (§3.3) are
+  deployer concerns; the service MAY run them before emission;
+- **session lifecycle** — how sessions are created or identified.
+
+---
+
+## 2. The audio input role
+
+The audio input service acquires audio by any deployer-defined
+mechanism, processes it through the audio-transformer chain (§4),
+transcribes it via a STT mechanism (§3), and emits the result on
+`ovos.utterance.handle` (§5).
+
+It is the **producer** of utterance lifecycle messages and the first
+component in the utterance lifecycle per OVOS-PIPELINE-1 §9.
+
+---
+
+## 3. STT mechanism
+
+The audio input service **MUST** have access to a speech-to-text
+mechanism that converts processed audio into one or more candidate
+transcription strings. The specific engine, model, API, or local
+process is deployer-defined; this specification places no constraint
+on it beyond the requirement that it exists and produces text.
+
+---
+
+## 4. Audio-transformer chain
+
+Before passing audio to the STT mechanism, the audio input service
+**MUST** run the audio-transformer chain (**OVOS-TRANSFORM-1 §3.1**).
+The chain is ordered and configured per OVOS-TRANSFORM-1 §4; the
+`context.session` is passed to each transformer.
+
+Audio transformers MAY perform noise reduction, format normalisation,
+acoustic language detection (writing `session.detected_lang`), or any
+other audio-domain processing. A deployment with no audio transformers
+configured passes audio to STT unchanged.
+
+---
+
+## 5. Utterance emission
+
+After transcription the audio input service **MUST** emit:
+
+`ovos.utterance.handle`
+
+per **OVOS-PIPELINE-1 §9.1**, with `context.session` populated per
+**OVOS-MSG-1 §4**.
+
+Payload:
+
+| Field | Type | Required | Meaning |
+|-------|------|----------|---------|
+| `utterances` | array of string | yes | One or more candidate transcription strings. The first element is the primary candidate. |
+| `lang` | string | yes | The BCP-47 language tag for the transcription. See §5.1. |
+
+### 5.1 Language resolution
+
+`data.lang` MUST be set to the language the STT mechanism transcribed
+in. The service resolves the language in this order:
+
+1. `session.detected_lang` — if an audio transformer has detected the
+   spoken language and written it to this field, use it.
+2. `session.stt_lang` — the session's explicit STT language preference,
+   if set.
+3. `session.lang` — the session's general language preference.
+
+The first present and non-empty value wins. If none is present the
+service SHOULD use a deployment-configured default language.
+
+---
+
+## 6. Conformance
+
+### An audio input service **MUST**:
+
+- have access to a STT mechanism (§3);
+- run the audio-transformer chain (OVOS-TRANSFORM-1 §3.1) before
+  passing audio to STT (§4);
+- emit `ovos.utterance.handle` with `data.utterances` (array of
+  strings) and `data.lang` (BCP-47 tag) after transcription (§5);
+- populate `context.session` per OVOS-MSG-1 §4.
+
+### An audio input service **MAY**:
+
+- acquire audio by any mechanism (§2);
+- run the utterance-transformer chain (OVOS-TRANSFORM-1 §3.2) on the
+  transcription before emission;
+- emit multiple candidate transcriptions in `data.utterances`.
+
+---
+
+## See also
+
+- **OVOS-PIPELINE-1** — utterance lifecycle entry point (§9.1).
+- **OVOS-TRANSFORM-1** — audio-transformer chain (§3.1).
+- **OVOS-SESSION-1** — `session.lang`, `session.stt_lang`,
+  `session.detected_lang`.
+- **OVOS-MSG-1** — session carrier (§4) and envelope.
+- **OVOS-AUDIO-1** — the audio output service.

From 26f5e3db463aca1ed05d89e7a52d90950fea6317 Mon Sep 17 00:00:00 2001
From: JarbasAi <jarbasai@mailfence.com>
Date: Thu, 28 May 2026 09:44:57 +0100
Subject: [PATCH 2/7] =?UTF-8?q?AUDIO-IN-1=20=C2=A74:=20add=20canonical=20a?=
 =?UTF-8?q?udio=20transformer=20use=20cases?=
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

- Language identification: detected lang written to session.detected_lang
- Denoising/normalisation: noise reduction, format conversion
- Speaker recognition: speaker_id written to Message.context for
  downstream personalisation, without the audio input service needing
  to know its semantics

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
---
 ovos-audio-in-1.md | 18 ++++++++++++++----
 1 file changed, 14 insertions(+), 4 deletions(-)

diff --git a/ovos-audio-in-1.md b/ovos-audio-in-1.md
index 1ba09fc..7c0304f 100644
--- a/ovos-audio-in-1.md
+++ b/ovos-audio-in-1.md
@@ -77,10 +77,20 @@ Before passing audio to the STT mechanism, the audio input service
 The chain is ordered and configured per OVOS-TRANSFORM-1 §4; the
 `context.session` is passed to each transformer.
 
-Audio transformers MAY perform noise reduction, format normalisation,
-acoustic language detection (writing `session.detected_lang`), or any
-other audio-domain processing. A deployment with no audio transformers
-configured passes audio to STT unchanged.
+Canonical audio transformer use cases include:
+
+- **Language identification** — detecting the spoken language from
+  the audio signal and writing it to `session.detected_lang`, so
+  that §5.1 language resolution and the STT engine can use it.
+- **Denoising and normalisation** — background noise reduction, gain
+  normalisation, sample-rate or format conversion before STT.
+- **Speaker recognition** — identifying the speaker from the audio
+  and writing the result into `Message.context` (e.g. a `speaker_id`
+  key) so that downstream pipeline stages and skills can personalise
+  responses without the audio input service knowing their semantics.
+
+A deployment with no audio transformers configured passes audio to
+STT unchanged.
 
 ---
 

From 93a974735e85198b7a32cd58624190b960748c77 Mon Sep 17 00:00:00 2001
From: JarbasAi <jarbasai@mailfence.com>
Date: Thu, 28 May 2026 09:47:24 +0100
Subject: [PATCH 3/7] =?UTF-8?q?AUDIO-IN-1=20=C2=A75.1:=20fix=20stt=5Flang?=
 =?UTF-8?q?=20=E2=80=94=20it's=20a=20write,=20not=20a=20read?=
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

session.stt_lang records the language STT actually decoded in
(SESSION-1 §3.2.4). It is a result written by the audio input
service after transcription, not an input to language selection.

Corrected language resolution order (inputs to STT selection):
  1. session.detected_lang (audio transformer detection)
  2. session.request_lang (capture mechanism hint, e.g. wake word)
  3. session.lang (general session preference)

Added SHOULD obligation to write session.stt_lang after transcription.
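
For reviewers, a minimal Python sketch of the corrected selection order,
assuming the session is available as a plain dict. The names
resolve_stt_lang and default_lang are hypothetical and not part of the
spec; the "en-US" default is only a placeholder for the
deployment-configured fallback.

```python
def resolve_stt_lang(session, default_lang="en-US"):
    """Pick the STT input language: the first present, non-empty value wins."""
    for key in ("detected_lang", "request_lang", "lang"):
        value = session.get(key)
        if value:  # skips missing keys, None, and empty strings
            return value
    # No session field is set: fall back to the deployment default (assumed)
    return default_lang
```

Note that stt_lang is deliberately absent from the lookup: it is written
from the result of this selection, never read by it.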

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
---
 ovos-audio-in-1.md | 30 +++++++++++++++++++++++-------
 1 file changed, 23 insertions(+), 7 deletions(-)

diff --git a/ovos-audio-in-1.md b/ovos-audio-in-1.md
index 7c0304f..42264c7 100644
--- a/ovos-audio-in-1.md
+++ b/ovos-audio-in-1.md
@@ -113,17 +113,28 @@ Payload:
 ### 5.1 Language resolution
 
 `data.lang` MUST be set to the language the STT mechanism transcribed
-in. The service resolves the language in this order:
-
-1. `session.detected_lang` — if an audio transformer has detected the
-   spoken language and written it to this field, use it.
-2. `session.stt_lang` — the session's explicit STT language preference,
-   if set.
-3. `session.lang` — the session's general language preference.
+in. The service selects the STT language from these inputs in order:
+
+1. `session.detected_lang` — the language a language-detection audio
+   transformer classified the audio as (**OVOS-SESSION-1 §3.2.6**).
+   Most specific signal; use it when present.
+2. `session.request_lang` — a hint from the capture mechanism about
+   the expected language (e.g. the wake word that triggered capture,
+   or a UI language selector) (**OVOS-SESSION-1 §3.2.5**). A prior,
+   not a guarantee.
+3. `session.lang` — the session's general language preference
+   (**OVOS-SESSION-1 §3.2.1**).
 
 The first present and non-empty value wins. If none is present the
 service SHOULD use a deployment-configured default language.
 
+After transcription the service SHOULD write the language actually
+used to `session.stt_lang` (**OVOS-SESSION-1 §3.2.4**) so that
+downstream stages (intent matching, dialog transformers) know what
+language the audio was decoded in. `stt_lang` is a result field
+written by the audio input service; it is not an input to language
+selection.
+
 ---
 
 ## 6. Conformance
@@ -137,6 +148,11 @@ service SHOULD use a deployment-configured default language.
   strings) and `data.lang` (BCP-47 tag) after transcription (§5);
 - populate `context.session` per OVOS-MSG-1 §4.
 
+### An audio input service **SHOULD**:
+
+- write `session.stt_lang` to the language STT decoded in, after
+  transcription (§5.1).
+
 ### An audio input service **MAY**:
 
 - acquire audio by any mechanism (§2);

From 9dd08515b310ceb67662d411e9ca424ff6a4d090 Mon Sep 17 00:00:00 2001
From: JarbasAi <jarbasai@mailfence.com>
Date: Thu, 28 May 2026 09:48:22 +0100
Subject: [PATCH 4/7] =?UTF-8?q?AUDIO-IN-1=20=C2=A75.1=20+=20SESSION-1=20?=
 =?UTF-8?q?=C2=A73.2.4:=20clarify=20stt=5Flang=20semantics?=
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

stt_lang is the language the STT model was configured to assume for
the audio (model input language), written before/at STT invocation.
In normal transcription stt_lang == data.lang; in speech-translation
they diverge — stt_lang is the audio's spoken language, data.lang
is the transcript's output language.

SESSION-1 §3.2.4 updated to match: "actually transcribed in" was
ambiguous in the translation case.
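
To illustrate the divergence, a rough Python sketch assuming messages
and sessions are plain dicts. build_emission and its parameters are
hypothetical names, and the "type" key for the topic is an assumption
about the envelope, which OVOS-MSG-1 defines and this patch does not
show.

```python
def build_emission(utterances, session, input_lang, output_lang=None):
    """Build the emitted message for one STT result.

    input_lang:  language the STT model was configured to assume.
    output_lang: transcript language; defaults to input_lang and differs
                 only when the model performs speech translation.
    """
    # Copy the session so the caller's dict is not mutated
    session = dict(session, stt_lang=input_lang)
    return {
        "type": "ovos.utterance.handle",  # assumed envelope key for the topic
        "data": {
            "utterances": list(utterances),
            "lang": output_lang or input_lang,
        },
        "context": {"session": session},
    }
```

With output_lang omitted, session.stt_lang and data.lang are equal;
passing input_lang="pt-PT" and output_lang="en-US" models the
speech-translation case where they differ.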

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
---
 ovos-audio-in-1.md | 15 +++++++++------
 ovos-session-1.md  | 17 +++++++++++------
 2 files changed, 20 insertions(+), 12 deletions(-)

diff --git a/ovos-audio-in-1.md b/ovos-audio-in-1.md
index 42264c7..fa001f5 100644
--- a/ovos-audio-in-1.md
+++ b/ovos-audio-in-1.md
@@ -128,12 +128,15 @@ in. The service selects the STT language from these inputs in order:
 The first present and non-empty value wins. If none is present the
 service SHOULD use a deployment-configured default language.
 
-After transcription the service SHOULD write the language actually
-used to `session.stt_lang` (**OVOS-SESSION-1 §3.2.4**) so that
-downstream stages (intent matching, dialog transformers) know what
-language the audio was decoded in. `stt_lang` is a result field
-written by the audio input service; it is not an input to language
-selection.
+The service SHOULD write the selected input language to
+`session.stt_lang` (**OVOS-SESSION-1 §3.2.4**) before or at the
+point of STT invocation. `stt_lang` records the language the STT
+model was **configured to assume** for the audio, which normally
+matches `data.lang` but may differ when the STT model performs
+speech translation — in that case `stt_lang` is the audio's
+language and `data.lang` is the transcription's output language.
+Downstream stages that need to know the audio's source language
+(rather than the transcript's language) read `session.stt_lang`.
 
 ---
 
diff --git a/ovos-session-1.md b/ovos-session-1.md
index 552382f..54537ae 100644
--- a/ovos-session-1.md
+++ b/ovos-session-1.md
@@ -363,12 +363,17 @@ voices the text correctly.
 
 #### 3.2.4 `stt_lang`
 
-`stt_lang` — string — the BCP-47 tag the speech-to-text stage
-**actually transcribed in**. It records the language the audio was
-decoded as, regardless of what was requested or expected. It is
-typically populated by the component that produced the transcript;
-once set, it travels with the session until overwritten by a later
-stage that re-transcribes.
+`stt_lang` — string — the BCP-47 tag the speech-to-text stage was
+**configured to assume** for the audio (the model's input language).
+It is written by the audio input service before or at the point of
+STT invocation. In a straightforward transcription, `stt_lang`
+matches `data.lang` (the transcript's output language). In a
+speech-translation model, they diverge: `stt_lang` is the audio's
+spoken language; `data.lang` is the language the transcript was
+produced in. Downstream stages that need the audio's source language
+read `stt_lang`; stages that need the transcript's language read
+`data.lang` or `session.lang`. Once set, `stt_lang` travels with
+the session until overwritten by a later transcription stage.
 
 #### 3.2.5 `request_lang`
 

From 9ab7c080582546a7d8764f5e16d5dd4c9722c855 Mon Sep 17 00:00:00 2001
From: JarbasAi <jarbasai@mailfence.com>
Date: Thu, 28 May 2026 10:03:38 +0100
Subject: [PATCH 5/7] AUDIO-IN-1: clarify post-STT ownership, session
 assignment, SESSION-2 ref
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

- §1 non-goals: replace confusing "post-STT transformers are deployer
  concern" with "owned by utterance lifecycle (PIPELINE-1), run after
  emission". Session lifecycle non-goal now cross-references SESSION-2
  and points to §5.2.
- §5.2 (new): session assignment. Audio input is the originator of
  interactions and MUST assign a session. Local device SHOULD use
  session_id "default" (SESSION-2 §5); satellite session is assigned
  by the bridge at the hub boundary (BRIDGE-1 §4.2.1). Session MUST
  be in context.session, not data.
- §6 conformance: add session assignment MUST and SHOULD; remove
  utterance-transformer MAY (it belongs to the utterance lifecycle)
- See also: add SESSION-2 and BRIDGE-1; note PIPELINE-1 owns post-STT

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
---
 ovos-audio-in-1.md | 68 +++++++++++++++++++++++++++++++++++-----------
 1 file changed, 52 insertions(+), 16 deletions(-)

diff --git a/ovos-audio-in-1.md b/ovos-audio-in-1.md
index fa001f5..1e20d2a 100644
--- a/ovos-audio-in-1.md
+++ b/ovos-audio-in-1.md
@@ -11,12 +11,16 @@ How audio is acquired — microphone capture, file playback, remote
 streaming, wake-word gating, voice-activity detection, push-to-talk,
 or any other mechanism — is deployer-defined and out of scope.
 
-It builds on two companion specifications:
+It builds on three companion specifications:
 
 - the *Utterance Lifecycle and Pipeline Specification*
-  (OVOS-PIPELINE-1) — the `ovos.utterance.handle` entry point (§9.1);
+  (OVOS-PIPELINE-1) — the `ovos.utterance.handle` entry point (§9.1)
+  and the utterance lifecycle the emission triggers;
 - the *Transformer Plugins Specification* (OVOS-TRANSFORM-1) — the
-  audio-transformer chain (§3.1) that runs before STT.
+  audio-transformer chain (§3.1) that runs before STT;
+- the *Session Lifecycle and State Ownership Specification*
+  (OVOS-SESSION-2) — the session assignment and state-ownership
+  rules this service must follow as the originator of interactions.
 
 The key words **MUST**, **MUST NOT**, **SHOULD**, **SHOULD NOT**,
 **MAY**, and **RECOMMENDED** are used as in RFC 2119.
@@ -41,10 +45,12 @@ It does **not** define:
   mechanism;
 - **STT engine selection** — which engine is used or how it is
   configured;
-- **post-STT processing** — utterance transformers
-  (OVOS-TRANSFORM-1 §3.2) and metadata transformers (§3.3) are
-  deployer concerns; the service MAY run them before emission;
-- **session lifecycle** — how sessions are created or identified.
+- **post-STT transformer chains** — utterance transformers and
+  all subsequent transformer stages are owned by the utterance
+  lifecycle (OVOS-PIPELINE-1) and run after the emission;
+- **session persistence and resumption** — owned by
+  OVOS-SESSION-2; this spec defines only which session the
+  emission carries (§5.2).
 
 ---
 
@@ -138,6 +144,30 @@ language and `data.lang` is the transcription's output language.
 Downstream stages that need to know the audio's source language
 (rather than the transcript's language) read `session.stt_lang`.
 
+### 5.2 Session assignment
+
+The audio input service is the **originator** of the interaction —
+it creates the `ovos.utterance.handle` Message that starts the
+utterance lifecycle. It **MUST** assign a session to the Message
+per **OVOS-SESSION-2** before emission.
+
+The appropriate session depends on the deployment:
+
+- **Local device** — the service SHOULD use `session_id: "default"`,
+  the orchestrator-owned default session
+  (**OVOS-SESSION-2 §5**). This is the normal case when the audio
+  input service and the orchestrator run on the same device.
+- **Satellite** — when the audio input service runs on a satellite
+  that communicates with a hub via a bridge
+  (**OVOS-BRIDGE-1 §4.2.1**), the session is assigned by the bridge
+  at the hub boundary. The satellite emits `ovos.utterance.handle`
+  with its own session; the bridge relays it to the hub with the
+  appropriate `session_id` (its own, or NAT-translated per
+  **OVOS-BRIDGE-1 §3.2**).
+
+The session MUST be placed in `context.session` per
+**OVOS-MSG-1 §4**, not in `data`.
+
 ---
 
 ## 6. Conformance
@@ -147,29 +177,35 @@ Downstream stages that need to know the audio's source language
 - have access to a STT mechanism (§3);
 - run the audio-transformer chain (OVOS-TRANSFORM-1 §3.1) before
   passing audio to STT (§4);
+- assign a session to every emission per §5.2, placing it in
+  `context.session` (OVOS-MSG-1 §4);
 - emit `ovos.utterance.handle` with `data.utterances` (array of
-  strings) and `data.lang` (BCP-47 tag) after transcription (§5);
-- populate `context.session` per OVOS-MSG-1 §4.
+  strings) and `data.lang` (BCP-47 tag) after transcription (§5).
 
 ### An audio input service **SHOULD**:
 
-- write `session.stt_lang` to the language STT decoded in, after
-  transcription (§5.1).
+- use `session_id: "default"` when running on the same device as
+  the orchestrator (§5.2);
+- write `session.stt_lang` before or at the point of STT invocation
+  (§5.1).
 
 ### An audio input service **MAY**:
 
 - acquire audio by any mechanism (§2);
-- run the utterance-transformer chain (OVOS-TRANSFORM-1 §3.2) on the
-  transcription before emission;
 - emit multiple candidate transcriptions in `data.utterances`.
 
 ---
 
 ## See also
 
-- **OVOS-PIPELINE-1** — utterance lifecycle entry point (§9.1).
+- **OVOS-PIPELINE-1** — utterance lifecycle entry point (§9.1);
+  post-STT transformer chains are owned here.
 - **OVOS-TRANSFORM-1** — audio-transformer chain (§3.1).
-- **OVOS-SESSION-1** — `session.lang`, `session.stt_lang`,
-  `session.detected_lang`.
+- **OVOS-SESSION-1** — session field registry; `session.lang`,
+  `session.stt_lang`, `session.detected_lang`, `session.request_lang`.
+- **OVOS-SESSION-2** — session assignment, state ownership, and the
+  default-session rule (§5).
 - **OVOS-MSG-1** — session carrier (§4) and envelope.
+- **OVOS-BRIDGE-1** — satellite deployment and session assignment at
+  the bridge boundary (§4.2.1).
 - **OVOS-AUDIO-1** — the audio output service.

From d074ae2f09748aa857834c4fdf0e27d3fca63451 Mon Sep 17 00:00:00 2001
From: JarbasAi <jarbasai@mailfence.com>
Date: Thu, 28 May 2026 10:07:15 +0100
Subject: [PATCH 6/7] AUDIO-IN-1: simplification pass (-57 lines)
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

- Preamble: drop repeated capture-method list (already in §1)
- §1: drop "defines" list (restated section headings); keep non-goals
  only; add AUDIO-1 to post-STT non-goal (dialog/TTS chains live there)
- §2: merge two-sentence role description into one
- §3: drop obvious "no constraint beyond..." clause
- §4: trim use-case bullet tails to one line each; drop "no transformers
  → unchanged" (obvious)
- §5: drop redundant MSG-1 §4 reference (covered by §5.2)
- §5.1: drop "most specific signal" and "prior not guarantee" padding
- §5.2: drop "this is the normal case" sentence; drop final
  "MUST be in context.session not data" (in §6 MUST)
- §6 MAY: remove "acquire audio by any mechanism" (a non-goal, not a MAY)
- See also: AUDIO-1 added; entries tightened

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
---
 ovos-audio-in-1.md | 197 ++++++++++++++++-----------------------------
 1 file changed, 70 insertions(+), 127 deletions(-)

diff --git a/ovos-audio-in-1.md b/ovos-audio-in-1.md
index 1e20d2a..5f9a87f 100644
--- a/ovos-audio-in-1.md
+++ b/ovos-audio-in-1.md
@@ -3,24 +3,19 @@
 **Spec ID:** OVOS-AUDIO-IN-1 · **Version:** 1 · **Status:** Draft
 
 This specification defines the **audio input service** — the component
-that acquires audio, processes it through the pre-STT transformer
-chain, transcribes it to text, and injects the result into the
-utterance lifecycle.
-
-How audio is acquired — microphone capture, file playback, remote
-streaming, wake-word gating, voice-activity detection, push-to-talk,
-or any other mechanism — is deployer-defined and out of scope.
+that acquires audio, runs the pre-STT transformer chain, transcribes
+to text, and injects the result into the utterance lifecycle. How
+audio is acquired is deployer-defined and out of scope.
 
 It builds on three companion specifications:
 
 - the *Utterance Lifecycle and Pipeline Specification*
-  (OVOS-PIPELINE-1) — the `ovos.utterance.handle` entry point (§9.1)
-  and the utterance lifecycle the emission triggers;
+  (OVOS-PIPELINE-1) — the `ovos.utterance.handle` entry point (§9.1);
 - the *Transformer Plugins Specification* (OVOS-TRANSFORM-1) — the
   audio-transformer chain (§3.1) that runs before STT;
 - the *Session Lifecycle and State Ownership Specification*
-  (OVOS-SESSION-2) — the session assignment and state-ownership
-  rules this service must follow as the originator of interactions.
+  (OVOS-SESSION-2) — session assignment as the originator of
+  interactions.
 
 The key words **MUST**, **MUST NOT**, **SHOULD**, **SHOULD NOT**,
 **MAY**, and **RECOMMENDED** are used as in RFC 2119.
@@ -29,74 +24,50 @@ The key words **MUST**, **MUST NOT**, **SHOULD**, **SHOULD NOT**,
 
 ## 1. Scope
 
-This specification defines:
-
-- **the audio input role** (§2) — what the service produces;
-- **the STT obligation** (§3) — that a transcription mechanism exists;
-- **the audio-transformer obligation** (§4) — running the pre-STT
-  transformer chain;
-- **the utterance emission** (§5) — topic, payload shape, and language
-  resolution.
-
-It does **not** define:
-
-- **audio capture** — microphone access, file reading, remote streaming,
-  wake-word detection, VAD, push-to-talk, or any other acquisition
-  mechanism;
-- **STT engine selection** — which engine is used or how it is
-  configured;
-- **post-STT transformer chains** — utterance transformers and
-  all subsequent transformer stages are owned by the utterance
-  lifecycle (OVOS-PIPELINE-1) and run after the emission;
-- **session persistence and resumption** — owned by
-  OVOS-SESSION-2; this spec defines only which session the
-  emission carries (§5.2).
+This specification does **not** define:
+
+- **audio capture** — acquisition mechanism is deployer-defined;
+- **STT engine selection** — engine, model, or API is deployer-defined;
+- **post-STT transformer chains** — utterance and all subsequent
+  transformer stages are owned by the utterance lifecycle
+  (OVOS-PIPELINE-1) and the audio output layer (OVOS-AUDIO-1);
+- **session persistence and resumption** — owned by OVOS-SESSION-2;
+  this spec defines only which session the emission carries (§5.2).
 
 ---
 
 ## 2. The audio input role
 
 The audio input service acquires audio by any deployer-defined
-mechanism, processes it through the audio-transformer chain (§4),
-transcribes it via a STT mechanism (§3), and emits the result on
-`ovos.utterance.handle` (§5).
-
-It is the **producer** of utterance lifecycle messages and the first
-component in the utterance lifecycle per OVOS-PIPELINE-1 §9.
+mechanism, runs the audio-transformer chain (§4), transcribes via a
+STT mechanism (§3), and emits the result on `ovos.utterance.handle`
+(§5). It is the **producer** of utterance lifecycle messages per
+OVOS-PIPELINE-1 §9.
 
 ---
 
 ## 3. STT mechanism
 
 The audio input service **MUST** have access to a speech-to-text
-mechanism that converts processed audio into one or more candidate
-transcription strings. The specific engine, model, API, or local
-process is deployer-defined; this specification places no constraint
-on it beyond the requirement that it exists and produces text.
+mechanism. The engine, model, API, or local process is
+deployer-defined.
 
 ---
 
 ## 4. Audio-transformer chain
 
-Before passing audio to the STT mechanism, the audio input service
-**MUST** run the audio-transformer chain (**OVOS-TRANSFORM-1 §3.1**).
-The chain is ordered and configured per OVOS-TRANSFORM-1 §4; the
-`context.session` is passed to each transformer.
-
-Canonical audio transformer use cases include:
+Before passing audio to STT, the audio input service **MUST** run the
+audio-transformer chain (**OVOS-TRANSFORM-1 §3.1**), configured per
+OVOS-TRANSFORM-1 §4.
 
-- **Language identification** — detecting the spoken language from
-  the audio signal and writing it to `session.detected_lang`, so
-  that §5.1 language resolution and the STT engine can use it.
-- **Denoising and normalisation** — background noise reduction, gain
-  normalisation, sample-rate or format conversion before STT.
-- **Speaker recognition** — identifying the speaker from the audio
-  and writing the result into `Message.context` (e.g. a `speaker_id`
-  key) so that downstream pipeline stages and skills can personalise
-  responses without the audio input service knowing their semantics.
+Canonical use cases:
 
-A deployment with no audio transformers configured passes audio to
-STT unchanged.
+- **Language identification** — writes `session.detected_lang` for
+  §5.1 language resolution and STT engine selection.
+- **Denoising and normalisation** — noise reduction, gain
+  normalisation, format conversion.
+- **Speaker recognition** — writes a `speaker_id` (or equivalent)
+  into `Message.context` for downstream personalisation.
 
 ---
 
@@ -106,67 +77,43 @@ After transcription the audio input service **MUST** emit:
 
 `ovos.utterance.handle`
 
-per **OVOS-PIPELINE-1 §9.1**, with `context.session` populated per
-**OVOS-MSG-1 §4**.
-
-Payload:
+per **OVOS-PIPELINE-1 §9.1**.
 
 | Field | Type | Required | Meaning |
 |-------|------|----------|---------|
-| `utterances` | array of string | yes | One or more candidate transcription strings. The first element is the primary candidate. |
-| `lang` | string | yes | The BCP-47 language tag for the transcription. See §5.1. |
+| `utterances` | array of string | yes | Transcription candidates; first element is primary. |
+| `lang` | string | yes | BCP-47 output language of the transcription. See §5.1. |
 
 ### 5.1 Language resolution
 
-`data.lang` MUST be set to the language the STT mechanism transcribed
-in. The service selects the STT language from these inputs in order:
-
-1. `session.detected_lang` — the language a language-detection audio
-   transformer classified the audio as (**OVOS-SESSION-1 §3.2.6**).
-   Most specific signal; use it when present.
-2. `session.request_lang` — a hint from the capture mechanism about
-   the expected language (e.g. the wake word that triggered capture,
-   or a UI language selector) (**OVOS-SESSION-1 §3.2.5**). A prior,
-   not a guarantee.
-3. `session.lang` — the session's general language preference
-   (**OVOS-SESSION-1 §3.2.1**).
-
-The first present and non-empty value wins. If none is present the
-service SHOULD use a deployment-configured default language.
-
-The service SHOULD write the selected input language to
-`session.stt_lang` (**OVOS-SESSION-1 §3.2.4**) before or at the
-point of STT invocation. `stt_lang` records the language the STT
-model was **configured to assume** for the audio, which normally
-matches `data.lang` but may differ when the STT model performs
-speech translation — in that case `stt_lang` is the audio's
-language and `data.lang` is the transcription's output language.
-Downstream stages that need to know the audio's source language
-(rather than the transcript's language) read `session.stt_lang`.
+Select the STT input language in this order:
 
-### 5.2 Session assignment
+1. `session.detected_lang` (**OVOS-SESSION-1 §3.2.6**) — audio
+   transformer's language classification.
+2. `session.request_lang` (**OVOS-SESSION-1 §3.2.5**) — hint from
+   the capture mechanism (e.g. wake word, UI language selector).
+3. `session.lang` (**OVOS-SESSION-1 §3.2.1**) — session's general
+   language preference.
 
-The audio input service is the **originator** of the interaction —
-it creates the `ovos.utterance.handle` Message that starts the
-utterance lifecycle. It **MUST** assign a session to the Message
-per **OVOS-SESSION-2** before emission.
+The first present and non-empty value wins. If none is present, the
+service SHOULD use a deployment-configured default language.
 
-The appropriate session depends on the deployment:
+The service SHOULD write the selected language to `session.stt_lang`
+(**OVOS-SESSION-1 §3.2.4**) before STT invocation. `stt_lang`
+records the model's assumed input language and normally matches
+`data.lang`; under speech translation they diverge: `stt_lang` is the
+audio's language and `data.lang` is the transcript's output language.
+
+### 5.2 Session assignment
 
-- **Local device** — the service SHOULD use `session_id: "default"`,
-  the orchestrator-owned default session
-  (**OVOS-SESSION-2 §5**). This is the normal case when the audio
-  input service and the orchestrator run on the same device.
-- **Satellite** — when the audio input service runs on a satellite
-  that communicates with a hub via a bridge
-  (**OVOS-BRIDGE-1 §4.2.1**), the session is assigned by the bridge
-  at the hub boundary. The satellite emits `ovos.utterance.handle`
-  with its own session; the bridge relays it to the hub with the
-  appropriate `session_id` (its own, or NAT-translated per
-  **OVOS-BRIDGE-1 §3.2**).
+The audio input service **MUST** assign a session to every emission,
+placed in `context.session` (**OVOS-MSG-1 §4**).
 
-The session MUST be placed in `context.session` per
-**OVOS-MSG-1 §4**, not in `data`.
+- **Local device** — SHOULD use `session_id: "default"`
+  (**OVOS-SESSION-2 §5**).
+- **Satellite** — session is assigned by the bridge at the hub
+  boundary (**OVOS-BRIDGE-1 §4.2.1**); the bridge relays or
+  NAT-translates the `session_id` as needed.
 
 ---
 
@@ -176,22 +123,19 @@ The session MUST be placed in `context.session` per
 
 - have access to a STT mechanism (§3);
 - run the audio-transformer chain (OVOS-TRANSFORM-1 §3.1) before
-  passing audio to STT (§4);
-- assign a session to every emission per §5.2, placing it in
-  `context.session` (OVOS-MSG-1 §4);
-- emit `ovos.utterance.handle` with `data.utterances` (array of
-  strings) and `data.lang` (BCP-47 tag) after transcription (§5).
+  STT (§4);
+- assign a session in `context.session` per §5.2;
+- emit `ovos.utterance.handle` with `data.utterances` and `data.lang`
+  (§5).
 
 ### An audio input service **SHOULD**:
 
-- use `session_id: "default"` when running on the same device as
-  the orchestrator (§5.2);
-- write `session.stt_lang` before or at the point of STT invocation
-  (§5.1).
+- use `session_id: "default"` when co-located with the orchestrator
+  (§5.2);
+- write `session.stt_lang` before STT invocation (§5.1).
 
 ### An audio input service **MAY**:
 
-- acquire audio by any mechanism (§2);
 - emit multiple candidate transcriptions in `data.utterances`.
 
 ---
@@ -200,12 +144,11 @@ The session MUST be placed in `context.session` per
 
 - **OVOS-PIPELINE-1** — utterance lifecycle entry point (§9.1);
   post-STT transformer chains are owned here.
+- **OVOS-AUDIO-1** — audio output service; owns dialog and TTS
+  transformer chains.
 - **OVOS-TRANSFORM-1** — audio-transformer chain (§3.1).
-- **OVOS-SESSION-1** — session field registry; `session.lang`,
-  `session.stt_lang`, `session.detected_lang`, `session.request_lang`.
-- **OVOS-SESSION-2** — session assignment, state ownership, and the
-  default-session rule (§5).
+- **OVOS-SESSION-1** — `session.lang`, `session.stt_lang`,
+  `session.detected_lang`, `session.request_lang`.
+- **OVOS-SESSION-2** — session assignment and default-session rule.
 - **OVOS-MSG-1** — session carrier (§4) and envelope.
-- **OVOS-BRIDGE-1** — satellite deployment and session assignment at
-  the bridge boundary (§4.2.1).
-- **OVOS-AUDIO-1** — the audio output service.
+- **OVOS-BRIDGE-1** — satellite session assignment (§4.2.1).
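
Reviewer note (not part of the diff): the §5.1 resolution order and the §5 / §5.2 emission rules in the hunks above can be sketched roughly as follows. This is a minimal illustration, not the reference implementation. The helper names `resolve_stt_lang` and `build_utterance_message`, the `default_lang` and `output_lang` parameters, and the plain-dict `type` / `data` / `context` envelope are all assumptions made for the example; the actual envelope is defined by OVOS-MSG-1. The sketch also assumes the local-device case, where `session_id` falls back to `"default"`.

```python
from typing import Any, Dict, Mapping, Optional, Sequence

# Resolution order taken from §5.1: most specific signal first.
LANG_FIELDS = ("detected_lang", "request_lang", "lang")


def resolve_stt_lang(session: Mapping[str, Any], default_lang: str) -> str:
    """Return the first present, non-empty language field, else the default."""
    for field in LANG_FIELDS:
        value = session.get(field)
        if isinstance(value, str) and value.strip():
            return value
    return default_lang


def build_utterance_message(
    utterances: Sequence[str],
    session: Mapping[str, Any],
    default_lang: str = "en-US",        # hypothetical deployment default
    output_lang: Optional[str] = None,  # set only for speech translation
) -> Dict[str, Any]:
    """Assemble an `ovos.utterance.handle` message as a plain dict."""
    stt_lang = resolve_stt_lang(session, default_lang)
    # Copy the session so the caller's mapping is not mutated, and record
    # the language the STT model was configured to assume (§5.1).
    out_session = dict(session, stt_lang=stt_lang)
    # Assumption: local-device deployment, so use the default session (§5.2).
    out_session.setdefault("session_id", "default")
    return {
        "type": "ovos.utterance.handle",
        "data": {
            "utterances": list(utterances),
            # data.lang is the transcript language; it equals stt_lang
            # unless the model translated the speech.
            "lang": output_lang or stt_lang,
        },
        # The session travels in context, never in data (§5.2).
        "context": {"session": out_session},
    }
```

For example, `build_utterance_message(["hello world"], {"lang": "en-GB"})` carries `en-GB` in both `data.lang` and `context.session.stt_lang`, while passing `output_lang` models the speech-translation case where the two differ.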

From 649542ec24f87488d8946f8c318e2e5e2dc0234b Mon Sep 17 00:00:00 2001
From: JarbasAi <jarbasai@mailfence.com>
Date: Thu, 28 May 2026 10:33:53 +0100
Subject: [PATCH 7/7] AUDIO-IN-1: cross-reference USER-ID-1 voice signal
 injection
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

Audio transformer is the inline voice-signal injection point per
USER-ID-1 §3.1; context.voice_match is the intermediate signal.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
---
 ovos-audio-in-1.md | 8 ++++++--
 1 file changed, 6 insertions(+), 2 deletions(-)

diff --git a/ovos-audio-in-1.md b/ovos-audio-in-1.md
index 5f9a87f..71525f0 100644
--- a/ovos-audio-in-1.md
+++ b/ovos-audio-in-1.md
@@ -66,8 +66,10 @@ Canonical use cases:
   §5.1 language resolution and STT engine selection.
 - **Denoising and normalisation** — noise reduction, gain
   normalisation, format conversion.
-- **Speaker recognition** — writes a `speaker_id` (or equivalent)
-  into `Message.context` for downstream personalisation.
+- **Voice-print recognition** — writes an intermediate result to
+  `Message.context` (e.g. `context.voice_match`) for consolidation
+  by a metadata transformer into `session.voice_id` per
+  OVOS-USER-ID-1 §4.1.
 
 ---
 
@@ -152,3 +154,5 @@ placed in `context.session` (**OVOS-MSG-1 §4**).
 - **OVOS-SESSION-2** — session assignment and default-session rule.
 - **OVOS-MSG-1** — session carrier (§4) and envelope.
 - **OVOS-BRIDGE-1** — satellite session assignment (§4.2.1).
+- **OVOS-USER-ID-1** — user identity resolution; voice-print
+  recognition is an audio-transformer use case (§4.1).