
Gateway SUP integration #180

Open
julienduquesnay-se wants to merge 11 commits into pre-draft from gateway-sup

Conversation

@julienduquesnay-se

Description

Specification update for the Gateway SUP. Adds support for the gateway service.

Issues Addressed

#137

Change Type

Please select the relevant options:

  • [] Fix (change that resolves an issue)
  • New enhancement (change that adds specification content)
  • Content edits (change that edits existing content)

Checklist

  • I have read the CONTRIBUTING document.
  • My changes adhere to the established patterns, and best practices.

@julienduquesnay-se julienduquesnay-se requested a review from a team as a code owner May 14, 2026 14:54
@phil-abb phil-abb self-requested a review May 15, 2026 10:16
@phil-abb
Contributor

@julienduquesnay-se - I won't be able to review this until next week.

@phil-abb
Contributor

@julienduquesnay-se - would you mind updating your branch? It's showing changes related to the helm PR as changes in your branch

@julienduquesnay-se
Author

> @julienduquesnay-se - would you mind updating your branch? It's showing changes related to the helm PR as changes in your branch

@phil-abb, I was trying to merge the latest changes from the "pre-draft" branch, and it looks like it grabbed your latest commit. I might need help to clean that up.

Signed-off-by: Julien Duquesnay <julien.duquesnay@se.com>
Signed-off-by: Julien Duquesnay <julien.duquesnay@se.com>
Signed-off-by: Julien Duquesnay <julien.duquesnay@se.com>
@phil-abb
Contributor

> > @julienduquesnay-se - would you mind updating your branch? It's showing changes related to the helm PR as changes in your branch
>
> @phil-abb, I was trying to merge the latest changes from the "pre-draft" branch, and it looks like it grabbed your latest commit. I might need help to clean that up.

I rebased via the GitHub PR UI, and it looks ok now.

Comment thread system-design/concepts/gateways/gateways.md Outdated
Comment thread system-design/figures/gateway-types.drawio.svg Outdated
Comment thread system-design/specification/margo-management-interface/deployment-status.md Outdated
Comment thread system-design/specification/margo-management-interface/deployment-status.md Outdated
Comment thread system-design/specification/margo-management-interface/deployment-status.md Outdated
Comment thread mkdocs.yml Outdated
julienduquesnay-se and others added 2 commits May 20, 2026 19:24
Signed-off-by: Julien Duquesnay <julien.duquesnay@se.com>
Co-authored-by: Philip Presson <philip.presson@us.abb.com>
Signed-off-by: Julien Duquesnay <156128585+julienduquesnay-se@users.noreply.github.com>
Contributor

@matlec matlec left a comment

This PR implements an approved SUP, so I'm not asking to block or restructure it here. However, I'd like to flag a broader question this PR surfaces that's worth picking up separately.

@ajcraig @phil-abb

This is a bigger thread, and it ties into a comment I left on the "Device roles to capabilities" SUP. Working through this gateway PR, I keep bumping into the same question:

When the WFM sends a deployment, what is it targeting?

Today the implicit answer is "a WFM client". Routing happens via clientId in the URL, and the ApplicationDeployment YAML stays target-agnostic. This PR shifts that implicit answer by putting deviceId inside the ApplicationDeployment YAML's metadata. That works neatly for the see-thru gateway case, but it doesn't generalize. Multi-node Kubernetes clusters are the clearest example: kube-scheduler picks where pods run, so the WFM really has no business addressing nodes directly.

That points at something deeper. Standalone Device, Standalone Cluster, and Gateway look like three different things in the spec - but to me they read as the same shape, and the only real difference between them is: how many deployment targets the WFM client speaks for. A Standalone Device speaks for one. A Standalone Cluster speaks for one (the cluster as a whole, single-node or multi-node). A Gateway speaks for several. The PR introduces Gateway as a new category to handle the "several" case; I'd suggest framing it instead as extending the existing pattern to support more than one deployment target per client.

The most concrete future case this opens up is the DFM/WFM split for multi-node clusters. For a multi-node cluster, devices (the nodes) aren't really a WFM concern - the WFM just needs to know there's a deployment target (the cluster) and what it can run. Node-level identity, vendor, lifecycle - that's all DFM territory (post-GA). The PR currently surfaces device identity through WFM artifacts: deviceId in deployment metadata, the capabilities endpoint keyed on deviceId, the Gateway role naming child devices. When the DFM lands and we want to add device-level visibility, we'd ideally do that without having to pull device identity through every WFM interaction. One framing that would help: separate "what the WFM addresses" from "what counts as a device." The WFM-side concept becomes a "deployment target," orthogonal to whatever the DFM tracks. The WFM addresses targets; the DFM (when it arrives) tracks devices; the two surfaces stay independent.

Sketch of the alternative framing

  • WFM Client = protocol participant. Has clientId, X.509 identity. Speaks for one or more deployment targets.
  • Deployment target = an execution surface the WFM can address. Has a targetId unique per client. Has its own capabilities document.
  • Target capabilities are scoped to deployment concerns: supported runtimes, deployment types, resources. Device-shaped fields (vendor, modelNumber, serialNumber) move to the DFM; they describe what the device is, not what it can host.
  • What "resources" means depends on what schedules at this target. A standalone device or a gateway-managed sub-device reports concrete hardware (camera, GPU, ...) because Margo schedules at that level. For a cluster target, it probably makes less sense to report hardware at the cluster level. In the cluster case, kube-scheduler can handle node-specific / hardware-aware placement via the chart's own affinity rules. Margo doesn't need to know which cluster node has the camera - that's Kubernetes' job.
  • Target identity inherits from the client. No separate certs; the client is the trust boundary.
  • Deployment routing = (clientId, targetId) pair. clientId stays where it is today. targetId lives in a deployment entry of the bundle manifest - not in the ApplicationDeployment YAML, which stays target-agnostic. One bundle manifest carries routing for all of a client's deployments; a multi-target client fetches once rather than per target.
  • Delegation ("autonomous placement") = a client-level capability flag, orthogonal to target count. To request delegation for a deployment, the WFM omits targetId from the bundle manifest entry; the client picks among its eligible targets. Single-target clients trivially satisfy this. A multi-target client that hasn't reported delegation capability must receive an explicit targetId.
  • No hierarchy in targetId. Flat IDs unique per client. "Parent/child" is the client's internal concern, not protocol-visible.
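
The routing and delegation rules in the list above can be sketched roughly as follows. This is a minimal illustration, not part of the SUP or the PR: the `Client` class, the `resolve_target` function, and the "pick the first target" placement policy are all hypothetical names and choices made for this example.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Client:
    """Hypothetical model of a WFM client speaking for one or more targets."""
    client_id: str
    target_ids: list[str]               # flat IDs, unique per client
    supports_delegation: bool = False   # client-level capability flag


def resolve_target(client: Client, target_id: Optional[str]) -> str:
    """Return the target a bundle manifest entry is routed to.

    target_id is None when the WFM omits it to request delegation.
    """
    if not client.target_ids:
        raise ValueError(f"client {client.client_id!r} exposes no targets")
    if target_id is not None:
        if target_id not in client.target_ids:
            raise ValueError(f"unknown target {target_id!r} for client {client.client_id!r}")
        return target_id
    if len(client.target_ids) == 1:
        # Single-target clients trivially satisfy delegation.
        return client.target_ids[0]
    if not client.supports_delegation:
        raise ValueError("multi-target client without delegation needs an explicit targetId")
    # Assumed placeholder policy: a real client would apply its own placement logic.
    return client.target_ids[0]
```

Under these assumptions, a standalone device and a gateway go through the same code path; the only difference is the length of `target_ids`.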

How this covers the cases

| Case | clientId | Targets (WFM) | Devices (DFM, post-GA) |
|------|----------|---------------|------------------------|
| Standalone Device | 1 | 1 (the device) | 1 (the device) |
| Standalone Cluster | 1 | 1 (the cluster) | 1 (the node) |
| Multi-node cluster (when it arrives) | 1 | 1 (the cluster) | N (one per node) |
| See-thru gateway | 1 | N+1 or N (gw + children) | 1 (just the gateway) |
| Opaque gateway | 1 | 1 (the gateway) | 1 (just the gateway) |

Clusters don't need special-casing - a cluster is just "1 target." Gateways become "clients with >1 targets." The opaque/see-thru distinction collapses into "how many targets do you expose."

If there's interest, I'm happy to draft this as a SUP (post-PlugFest 2 ;)).

Comment on lines +1 to +184
@@ -9,24 +9,28 @@ To ensure the WFM is kept up to date, the device's client MUST send updated capa
## Route and HTTP Methods

```https
POST /api/v1/clients/{clientId}/capabilities
PUT /api/v1/clients/{clientId}/capabilities
POST /api/v1/clients/{clientId}/capabilities/{deviceId}
PUT /api/v1/clients/{clientId}/capabilities/{deviceId}
DELETE /api/v1/clients/{clientId}/capabilities/{deviceId}
```

### Route Parameters

|Parameter | Type | Required? | Description|
|----------|------|-----------|------------|
| {clientId} | string | Y | The unique identifier of the (device) client registered with the WFM during onboarding. |
| {deviceId} | string | Y | The unique identifier of the device reporting the capabilities. <br/>It must have the following format: "{id}[/{id}[/{id}...]]". The top-level `id` is required and must include only unreserved characters as specified in [RFC3986](https://www.rfc-editor.org/rfc/rfc3986#section-2.3). If reporting capabilties for a child device, the subsequent `id`s are required and must include only unreserved characters as specified in [RFC3986](https://www.rfc-editor.org/rfc/rfc3986#section-2.3). <br/>Using multiple ids in the endpoint does not register multiple devices in a single request, but indicates a hierarchy of devices, with a parent/child relationship. |

### Response Codes

| Code | Description |
|------|-------------|
| 201 OK | The device capabilities document was added, or updated, successfully |
| 204 No Content | The device capabilities document was deleted successfully. |
| 400 Bad Request | Missing or invalid content-digest header. Ensure the SHA256 hash of the base64-encoded payload is included. |
| 401 Unauthorized | Signature verification failed. Ensure you are signing with the correct X.509 private key. |
| 403 Forbidden | Client certificate is not trusted or has been revoked. |
| 404 Not Found | POST, PUT: No client with the given `clientID` was found. <br/> DELETE: No client with the given `clientID` was found or no device with the given `deviceId` was found for the client. |
| 422 Unprocessable Content | Request body includes a semantic error. |

## Request Body Attributes
@@ -41,12 +45,12 @@ PUT /api/v1/clients/{clientId}/capabilities

| Field | Type | Required? | Description |
|-----------------|-----------------|-----------------|-----------------|
| id | string | Y | Unique deviceID assigned to the device via the Device Owner.|
| id | string | Y | Unique deviceID assigned to the device via the Device Owner. It must include only unreserved characters as specified in [RFC3986](https://www.rfc-editor.org/rfc/rfc3986#section-2.3) plus the path separator (i.e. '/'). In case of a device behind a gateway, the id field takes the form of a path with the id of the parent gateway, the id of the child device, and the ids of any intermediate devices, i.e., "{gatewayId}/[{intermediateDeviceId/.../]{deviceId}". |
| vendor | string | Y | Defines the device vendor.|
| modelNumber | string | Y | Defines the model number of the device.|
| serialNumber | string | Y | Defines the serial number of the device.|
| roles | []string | Y | Element that defines the device role it can provide to the Margo environment. MUST be one of the following: Standalone Cluster, Cluster Leader, or Standalone Device |
| resources | []Resource | Y | Element that defines the device's resources available to the application deployed on the device. See the [Resource Fields](#resources-attributes) section below. |
| roles | []string | Y | Element that defines the device role it can provide to the Margo environment. MUST be one of the following: Standalone Cluster, Cluster Leader, Standalone Device, or Gateway |
| resources | []Resource | * | Element that defines the device's resources available to the application deployed on the device. See the [Resource Fields](#resources-attributes) section below. <br/> * The element is required if the device has any of the following roles: Standalone Cluster, Cluster Leader, Standalone Device. |

### Resources Attributes
Resources of the specific device being reported to the WFM. Utilized to match with the required resources defined in the application description
@@ -161,4 +165,20 @@ These enumerations are used as vocabularies for attribute values of the `DeviceC
}
}
}
``` No newline at end of file
```

## Gateways considerations

### Opaque gateways

Opaque gateways MUST report the combined capabilities of all the devices they connect to the WFM.

> Example: An opaque gateway has two child-devices. Each child-device has an ARM64 processor with 2 cores, 5 GB of memory, 32 GB of storage, and 1 ethernet interface. The gateway will report capabilities of 2 CPUs (arm64) with 2 cores each, 10 GB of memory, 64 GB of storage, and 2 ethernet interfaces. In addition since the gateway can deploy compose applications on its child-devices it will report the role of "standalone device".

## See-thru gateways

See-thru gateways MUST report their capabilities and the capabilities of each device they connect to the WFM. This is done by calling the `device capabilities` endpoint for the gateway itself and for each device behind the gateway. The `deviceId` in the endpoint is used to indicate the hierarchy of devices, with a parent/child relationship. For example, if a see-thru gateway with `deviceId` "gateway1" connects two devices with `deviceId` "deviceA" and "deviceB", the gateway would call the `device capabilities` endpoint three times with the following `deviceId`s: "gateway1", "gateway1/deviceA", and "gateway1/deviceB".

When reporting its own capabilities, a see-thru gateway MUST report the role "Gateway".

If a see-thru gateway is capable of hosting edge applications it MUST report the corresponding role(s) (i.e., "Standalone Device", "Standalone Cluster", and/or "Cluster Leader") and the resources available for these deployments.
Contributor

The PR carries forward the SUP's distinction between "opaque gateway" and "see-thru gateway" into the MUST rules. Reading through, I think only see-thru really needs to live there - and even then, mostly as a name for an underlying mechanism.

From the WFM's side, an opaque gateway looks identical to a regular device: no Gateway role, no hierarchical deviceId, no special error codes, ... The "Opaque gateways MUST report the combined capabilities..." boils down to "report what you can actually offer", which any device does anyway. And since the WFM can't tell an opaque gateway from a regular device, there's no way the spec could enforce that rule even if it wanted to. It's good guidance for someone building an aggregation device, but it doesn't really have a job in the MUST rules. (Side note: the example uses Standalone Device, but the rules don't seem to stop an opaque gateway from reporting Gateway instead - so even the boundary of the category isn't clear to me from the text.)
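
For reference, the aggregation described in the opaque gateway example in the diff amounts to summing per-device figures. A minimal sketch follows; the `ChildDevice` class and the output field names are hypothetical and do not follow the spec's Resource schema.

```python
from dataclasses import dataclass


@dataclass
class ChildDevice:
    """Hypothetical summary of one device behind an opaque gateway."""
    cpu_arch: str
    cpu_cores: int
    memory_gb: int
    storage_gb: int
    ethernet_interfaces: int


def combined_capabilities(children: list[ChildDevice]) -> dict:
    """Aggregate child-device figures into one report, as in the diff's example."""
    return {
        # CPUs are listed individually; the remaining figures are summed.
        "cpus": [{"architecture": c.cpu_arch, "cores": c.cpu_cores} for c in children],
        "memory_gb": sum(c.memory_gb for c in children),
        "storage_gb": sum(c.storage_gb for c in children),
        "ethernet_interfaces": sum(c.ethernet_interfaces for c in children),
    }


# The two identical child-devices from the example in the diff.
child = ChildDevice("arm64", 2, 5, 32, 1)
report = combined_capabilities([child, child])
```

Nothing in such a report distinguishes it from what a single regular device would send, which is the point made above.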

See-thru is different. It does point at something real. But that "something real" is the combination of mechanisms already in this PR: the Gateway role plus hierarchical deviceId. The name is a convenient handle for that combination, but it's the mechanisms themselves doing the work.

What I'd suggest:

  • Drop opaque from the MUST rules, maybe just provide an informative note such as: "A device may aggregate several sub-devices behind it and report itself as a single Margo device."
  • Rewrite the see-thru MUSTs to point at the mechanism directly: "A WFM client reporting the Gateway role MUST..." You could keep "see-thru gateway" as an informal name in the spec (as a useful shorthand) but don't pin protocol rules to that term.

Copy link
Copy Markdown

I think the key is addressability. In the case of single node, cluster, and opaque I am addressing the target directly. In the case of transparent I am targeting the leaf device via the gateway. That means I have to consider the leaf device the same way I would any other target, but then have to add in the targeting parameters associated with the gateway it must communicate through.

i.e., I can target a leaf device of a gateway just like any other device but what happens if the gateway does not support the particular constructs that must flow through to it for "starting the camera". If we assume it is "just flow through" it would devalue the gateway and become not much more than a proxy server that security teams may not be happy with.

Author

@matlec
Regarding your first suggestion: I have dropped the requirement (it was not a requirement in the SUP) and rephrased the text a bit based on your proposal.
Regarding your second suggestion: it makes sense and I tried to rewrite the requirements accordingly.

@chrisgclayton the original intent for the gateway concept was to allow connecting non-Margo devices to the Margo ecosystem. How the gateway communicates with its child-devices is intentionally outside the scope of the specification.

@ajcraig
Contributor

ajcraig commented May 22, 2026

@julienduquesnay-se @matlec

Review Comments

Device ID

  • I think we should move away from deviceId and utilize targetName
    • This appears to be a user-assigned name that the device owner defines — deviceId implies a system-generated unique identifier, which could cause confusion later.
    • deviceId should be reserved for universally unique identifiers, such as one assigned by the Device Fleet Manager.
  • Within the new documentation, the properties attribute id has a description referencing deviceId. I propose both the route parameter {deviceId} and the body field id be renamed to targetName.

Device Capabilities

  • It would help to add concrete HTTP call examples within the gateway section, something like:
    • Gateway only: POST .../capabilities/gateway1 → body "targetName": "gateway1"
    • Gateway + child: POST .../capabilities/gateway1/deviceA → body "targetName": "gateway1/deviceA"
    • Gateway + intermediate + child: POST .../capabilities/gateway1/zone1/sensorB → body "targetName": "gateway1/zone1/sensorB"
  • An additional see-thru gateway payload example would also help adopters understand the correct format when reporting gateway metadata.
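
A minimal sketch of how the request paths in the examples above could be built and checked against the id format in the PR's route-parameter table (each `/`-separated segment limited to RFC 3986 unreserved characters). The function name `capabilities_path` and the client id `client-1` are hypothetical; the route itself is the one shown in the diff.

```python
import re

# RFC 3986 section 2.3 unreserved characters: ALPHA / DIGIT / "-" / "." / "_" / "~"
_UNRESERVED = re.compile(r"[A-Za-z0-9._~-]+")


def capabilities_path(client_id: str, device_id: str) -> str:
    """Build the capabilities route for a (possibly hierarchical) device id."""
    for segment in device_id.split("/"):
        # Empty segments (leading, trailing, or doubled "/") are rejected too.
        if not _UNRESERVED.fullmatch(segment):
            raise ValueError(f"invalid id segment {segment!r} in {device_id!r}")
    return f"/api/v1/clients/{client_id}/capabilities/{device_id}"


# The three cases listed above: gateway only, gateway + child,
# and gateway + intermediate + child.
paths = [
    capabilities_path("client-1", device_id)
    for device_id in ("gateway1", "gateway1/deviceA", "gateway1/zone1/sensorB")
]
```

In each case the request body would carry the same hierarchical string in its id field (`id` in the PR as written, `targetName` under the rename proposed here).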

Deployment Status

  • deviceId is listed as required in the body attributes but is absent from the example payload — these should be consistent.
  • I think targetName should be optional here, only included when a see-thru gateway is reporting status on behalf of a child device. This makes the intent explicit: targetName identifies which child device the status applies to, since the reporting client and the executing device differ in that scenario.

Desired State

  • Examples showing how desired state YAML files are structured for see-thru gateway deployments would be helpful. It would also be worth showing the wildcard targeting case (e.g. targetName: gateway1/*) since that behavior is unique to gateway scenarios and has no example today.
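
The wildcard semantics are not defined anywhere in this thread, so the following is only one possible reading, sketched to make the question concrete: it assumes `*` matches exactly one path segment, so `gateway1/*` covers direct children but not the gateway itself or deeper descendants. The function name `matches_target` is hypothetical.

```python
def matches_target(pattern: str, target_name: str) -> bool:
    """Segment-wise match where "*" stands for exactly one segment (assumed semantics)."""
    pattern_parts = pattern.split("/")
    target_parts = target_name.split("/")
    if len(pattern_parts) != len(target_parts):
        return False
    return all(p == "*" or p == t for p, t in zip(pattern_parts, target_parts))
```

If the intended behavior is instead "all descendants", the length check would have to be relaxed; whichever reading is chosen, the spec examples should state it explicitly.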

…ributes description

Signed-off-by: Julien Duquesnay <julien.duquesnay@se.com>
@phil-abb
Contributor

> @julienduquesnay-se @matlec
>
> Review Comments
>
> Device ID
>
> ...

In light of changes that have been discussed as part of the identity and authorization framework, I think it makes sense to move away from using "deviceId" in this way. Previously, we really didn't have any real meaning behind "deviceId" other than some random ID in the device capabilities magically assigned by the device supplier. Now "device ID" is starting to mean something more specific than this, so I think by using device ID here, it's just going to require us to change it later. May as well just change it now as part of the original PR since we know it will need to change.

@julienduquesnay-se
Author

> > @julienduquesnay-se @matlec
> >
> > Review Comments
> >
> > Device ID
> >
> > ...
>
> In light of changes that have been discussed as part of the identity and authorization framework, I think it makes sense to move away from using "deviceId" in this way. Previously, we really didn't have any real meaning behind "deviceId" other than some random ID in the device capabilities magically assigned by the device supplier. Now "device ID" is starting to mean something more specific than this, so I think by using device ID here, it's just going to require us to change it later. May as well just change it now as part of the original PR since we know it will need to change.

@phil-abb @ajcraig @matlec
While I don't necessarily disagree with the proposed change, it raises a concern for me. The content of the SUP was reviewed and approved as part of a formal process. Here we are talking about changing what was approved outside of this process, potentially without the knowledge or approval of the SUP's original approvers.
While it might add some work and time, it is cleaner to complete the changes to the specification as specified in the SUP and then submit an official change request as per the process to make these changes (either as a direct PR or as a SUP, depending on the scope).
Changing an attribute name without changing its semantic meaning might be fine. But if we accept changes of this type, they should be clearly reported, maybe in the SUP itself in a new section. And the original voters need to be made aware so they can react if they disagree with the change. I don't think we can assume they are monitoring the PR.

Signed-off-by: Julien Duquesnay <julien.duquesnay@se.com>
@phil-abb
Contributor

> @phil-abb @ajcraig @matlec While I don't necessarily disagree with the proposed change, it raises a concern for me...

Fair comment.

I guess the question is, at what point do changes made while updating the specification with the information from an approved SUP start to matter?

For this case, I view it as just changing a word. The SUP used the word "foo," but after thinking about it, "bar" seemed better. It doesn't fundamentally change the behavior that was approved in the SUP.

For example, if this change isn't made with this PR, I wouldn't expect a new SUP to be created just to change "deviceId" to "targetName" because it's not changing any behaviors.

Maybe a topic we can discuss in the next TWG call to see what people think about how much change is acceptable when updating the spec for a SUP?

Signed-off-by: Julien Duquesnay <julien.duquesnay@se.com>
@ajcraig
Contributor

ajcraig commented May 29, 2026

> > @phil-abb @ajcraig @matlec While I don't necessarily disagree with the proposed change, it raises a concern for me...
>
> Fair comment.
>
> I guess the question is, at what point do changes made while updating the specification with the information from an approved SUP start to matter?

Fair response indeed @julienduquesnay-se. I'm leaning towards approving this SUP without the change, and then I can drive the word change in a PR directly to the spec without a full SUP. Since, as Phil mentioned, it is just a word change.

However, it would be "buried" inside the "Gateway SUP", which didn't propose a change to the original id.

But we are learning these processes on the fly, so @phil-abb thoughts on my strategy above to address this particular FB to Julien?

@phil-abb
Contributor

> But we are learning these processes on the fly, so @phil-abb thoughts on my strategy above to address this particular FB to Julien?

@julienduquesnay-se / @ajcraig / @matlec

I don't have too strong an opinion on whether we do it now or later; it just seems like doing it now as part of the original change means people don't have to make changes later after things have been implemented.

If we want to keep it how it is for now and revisit it later after we know the results of the identity and authentication framework vote, then that is fine as well. Even if that SUP is rejected, I think we'll want to reconsider how we are using ID/Device ID because our use right now seems too generic.

Signed-off-by: Julien Duquesnay <julien.duquesnay@se.com>
Signed-off-by: Julien Duquesnay <julien.duquesnay@se.com>
… gateway requriements

Signed-off-by: Julien Duquesnay <julien.duquesnay@se.com>