OpenVoiceOS · JarbasAl · May 28, 2026
diff --git a/README.md b/README.md
@@ -113,6 +113,7 @@ below). Adoption is voluntary; conformance, once adopted, is not.
 | OVOS-CONTEXT-1 | [Intent Context](intent-context.md) | 1 | [Draft — in review (PR #18)](https://github.com/OpenVoiceOS/architecture/pull/18) |
 | OVOS-CONVERSE-1 | [Active Handlers and Interactive Response](converse.md) | 1 | [Draft — in review (PR #25)](https://github.com/OpenVoiceOS/architecture/pull/25) |
 | OVOS-STOP-1 | [Stop Pipeline Plugin](ovos-stop-1.md) | 1 | [Draft — in review (PR #33)](https://github.com/OpenVoiceOS/architecture/pull/33) |
+| OVOS-AUDIO-IN-1 | [Audio Input Service](ovos-audio-in-1.md) | 1 | Draft |
 
 Each spec carries its own scope statement, design rationale, and
 conformance section in its header. Open the document for the full

diff --git a/ovos-audio-in-1.md b/ovos-audio-in-1.md
@@ -0,0 +1,158 @@
+# Audio Input Service Specification
+
+**Spec ID:** OVOS-AUDIO-IN-1 · **Version:** 1 · **Status:** Draft
+
+This specification defines the **audio input service** — the component
+that acquires audio, runs the pre-STT transformer chain, transcribes
+to text, and injects the result into the utterance lifecycle. How
+audio is acquired is deployer-defined and out of scope.
+
+It builds on three companion specifications:
+
+- the *Utterance Lifecycle and Pipeline Specification*
+  (OVOS-PIPELINE-1) — the `ovos.utterance.handle` entry point (§9.1);
+- the *Transformer Plugins Specification* (OVOS-TRANSFORM-1) — the
+  audio-transformer chain (§3.1) that runs before STT;
+- the *Session Lifecycle and State Ownership Specification*
+  (OVOS-SESSION-2) — session assignment as the originator of
+  interactions.
+
+The key words **MUST**, **MUST NOT**, **SHOULD**, **SHOULD NOT**,
+**MAY**, and **RECOMMENDED** are used as in RFC 2119.
+
+---
+
+## 1. Scope
+
+This specification does **not** define:
+
+- **audio capture** — acquisition mechanism is deployer-defined;
+- **STT engine selection** — engine, model, or API is deployer-defined;
+- **post-STT transformer chains** — utterance and all subsequent
+  transformer stages are owned by the utterance lifecycle
+  (OVOS-PIPELINE-1) and the audio output layer (OVOS-AUDIO-1);
+- **session persistence and resumption** — owned by OVOS-SESSION-2;
+  this spec defines only which session the emission carries (§5.2).
+
+---
+
+## 2. The audio input role
+
+The audio input service acquires audio by any deployer-defined
+mechanism, runs the audio-transformer chain (§4), transcribes via an
+STT mechanism (§3), and emits the result on `ovos.utterance.handle`
+(§5). It is the **producer** of utterance lifecycle messages per
+OVOS-PIPELINE-1 §9.
+
+---
+
+## 3. STT mechanism
+
+The audio input service **MUST** have access to a speech-to-text
+mechanism. The engine, model, API, or local process is
+deployer-defined.
+
+---
+
+## 4. Audio-transformer chain
+
+Before passing audio to STT, the audio input service **MUST** run the
+audio-transformer chain (**OVOS-TRANSFORM-1 §3.1**), configured per
+OVOS-TRANSFORM-1 §4.
+
+Canonical use cases:
+
+- **Language identification** — writes `session.detected_lang` for
+  §5.1 language resolution and STT engine selection.
+- **Denoising and normalisation** — noise reduction, gain
+  normalisation, format conversion.
+- **Voice-print recognition** — writes an intermediate result to
+  `Message.context` (e.g. `context.voice_match`) for consolidation
+  by a metadata transformer into `session.voice_id` per
+  OVOS-USER-ID-1 §4.1.
+
+---
+
+## 5. Utterance emission
+
+After transcription the audio input service **MUST** emit:
+
+`ovos.utterance.handle`
+
+per **OVOS-PIPELINE-1 §9.1**.
+
+| Field | Type | Required | Meaning |
+|-------|------|----------|---------|
+| `utterances` | array of string | yes | Transcription candidates; first element is primary. |
+| `lang` | string | yes | BCP-47 output language of the transcription. See §5.1. |
+
+### 5.1 Language resolution
+
+Select the STT input language in this order:
+
+1. `session.detected_lang` (**OVOS-SESSION-1 §3.2.6**) — audio
+   transformer's language classification.
+2. `session.request_lang` (**OVOS-SESSION-1 §3.2.5**) — hint from
+   the capture mechanism (e.g. wake word, UI language selector).
+3. `session.lang` (**OVOS-SESSION-1 §3.2.1**) — session's general
+   language preference.
+
+The first present and non-empty value wins. If none is present, use a
+deployment-configured default.
+
+The service **SHOULD** write the selected language to `session.stt_lang`
+(**OVOS-SESSION-1 §3.2.4**) before STT invocation. `stt_lang`
+records the model's assumed input language and normally matches
+`data.lang`; they diverge in speech-translation models where the
+audio and transcript languages differ.
+
+### 5.2 Session assignment
+
+The audio input service **MUST** assign a session to every emission,
+placed in `context.session` (**OVOS-MSG-1 §4**).
+
+- **Local device** — **SHOULD** use `session_id: "default"`
+  (**OVOS-SESSION-2 §5**).
+- **Satellite** — session is assigned by the bridge at the hub
+  boundary (**OVOS-BRIDGE-1 §4.2.1**); the bridge relays or
+  NAT-translates the `session_id` as needed.
+
+---
+
+## 6. Conformance
+
+### An audio input service **MUST**:
+
+- have access to an STT mechanism (§3);
+- run the audio-transformer chain (OVOS-TRANSFORM-1 §3.1) before
+  STT (§4);
+- assign a session in `context.session` per §5.2;
+- emit `ovos.utterance.handle` with `data.utterances` and `data.lang`
+  (§5).
+
+### An audio input service **SHOULD**:
+
+- use `session_id: "default"` when co-located with the orchestrator
+  (§5.2);
+- write `session.stt_lang` before STT invocation (§5.1).
+
+### An audio input service **MAY**:
+
+- emit multiple candidate transcriptions in `data.utterances`.
+
+---
+
+## See also
+
+- **OVOS-PIPELINE-1** — utterance lifecycle entry point (§9.1);
+  post-STT transformer chains are owned here.
+- **OVOS-AUDIO-1** — audio output service; owns dialog and TTS
+  transformer chains.
+- **OVOS-TRANSFORM-1** — audio-transformer chain (§3.1).
+- **OVOS-SESSION-1** — `session.lang`, `session.stt_lang`,
+  `session.detected_lang`, `session.request_lang`.
+- **OVOS-SESSION-2** — session assignment and default-session rule.
+- **OVOS-MSG-1** — session carrier (§4) and envelope.
+- **OVOS-BRIDGE-1** — satellite session assignment (§4.2.1).
+- **OVOS-USER-ID-1** — user identity resolution; voice-print
+  recognition is an audio-transformer use case (§4.1).
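
Review note: the §5 flow in the new `ovos-audio-in-1.md` (language resolution, the `stt_lang` write, session assignment, emission) can be sketched in Python as below. This is a minimal illustration under stated assumptions, not the reference implementation. The function names, the `stt` callable, the `en-US` fallback, and the plain-dict envelope with `type`/`data`/`context` keys are all hypothetical; the spec defines only the message name, `data.utterances`, `data.lang`, `context.session`, and the session fields cited in §5.1.

```python
from typing import Any, Callable, Dict, List

# Resolution order from §5.1: detected_lang, then request_lang, then lang.
LANG_PRECEDENCE = ("detected_lang", "request_lang", "lang")


def resolve_stt_lang(session: Dict[str, Any], default_lang: str) -> str:
    """Return the first present, non-empty session language, else the default."""
    for key in LANG_PRECEDENCE:
        value = session.get(key)
        if isinstance(value, str) and value.strip():
            return value
    return default_lang


def transcribe_and_emit(
    audio: bytes,
    session: Dict[str, Any],
    stt: Callable[[bytes, str], List[str]],  # hypothetical STT hook
    default_lang: str = "en-US",  # hypothetical deployment default
) -> Dict[str, Any]:
    """Build the `ovos.utterance.handle` message for one audio segment."""
    stt_lang = resolve_stt_lang(session, default_lang)
    # §5.1 SHOULD: record the assumed input language before invoking STT.
    session["stt_lang"] = stt_lang
    candidates = stt(audio, stt_lang)
    return {
        "type": "ovos.utterance.handle",  # envelope key name is an assumption
        "data": {
            "utterances": list(candidates),  # first element is primary
            # Plain transcription: the output language equals stt_lang.
            # A speech-translation engine would report a different tag here.
            "lang": stt_lang,
        },
        "context": {"session": session},  # §5.2: session travels in context
    }


if __name__ == "__main__":
    # Local device: §5.2 recommends the "default" session id.
    local_session = {"session_id": "default", "request_lang": "pt-PT", "lang": "en-US"}
    fake_stt = lambda audio, lang: ["que horas são"]
    print(transcribe_and_emit(b"", local_session, fake_stt))
```

In this sketch `request_lang` wins over `lang` because no `detected_lang` is set, which matches the precedence list in §5.1.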
diff --git a/ovos-session-1.md b/ovos-session-1.md
@@ -363,12 +363,17 @@ voices the text correctly.
 
 #### 3.2.4 `stt_lang`
 
-`stt_lang` — string — the BCP-47 tag the speech-to-text stage
-**actually transcribed in**. It records the language the audio was
-decoded as, regardless of what was requested or expected. It is
-typically populated by the component that produced the transcript;
-once set, it travels with the session until overwritten by a later
-stage that re-transcribes.
+`stt_lang` — string — the BCP-47 tag the speech-to-text stage was
+**configured to assume** for the audio (the model's input language).
+It is written by the audio input service before or at the point of
+STT invocation. In a straightforward transcription, `stt_lang`
+matches `data.lang` (the transcript's output language). In a
+speech-translation model, they diverge: `stt_lang` is the audio's
+spoken language; `data.lang` is the language the transcript was
+produced in. Downstream stages that need the audio's source language
+read `stt_lang`; stages that need the transcript's language read
+`data.lang` or `session.lang`. Once set, `stt_lang` travels with
+the session until overwritten by a later transcription stage.
 
 #### 3.2.5 `request_lang`