Gateway SUP integration by julienduquesnay-se · Pull Request #180 · margo/specification

julienduquesnay-se · 2026-05-14T14:54:28Z

Description

Specification update related to the Gateway SUP. Adds support for the gateway service.

Issues Addressed

#137

Change Type

Please select the relevant options:

[] Fix (change that resolves an issue)
New enhancement (change that adds specification content)
Content edits (change that edits existing content)

Checklist

I have read the CONTRIBUTING document.
My changes adhere to the established patterns, and best practices.

phil-abb · 2026-05-15T10:17:15Z

@julienduquesnay-se - I won't be able to review this until next week.

phil-abb · 2026-05-15T14:57:51Z

@julienduquesnay-se - would you mind updating your branch? It's showing changes related to the helm PR as changes in your branch

julienduquesnay-se · 2026-05-15T20:27:20Z

> @julienduquesnay-se - would you mind updating your branch? It's showing changes related to the helm PR as changes in your branch

@phil-abb, I was trying to merge the latest changes from the "pre-draft" branch, and it looks like it grabbed your latest commit. I might need help to clean that up.

phil-abb · 2026-05-18T11:06:28Z

> > @julienduquesnay-se - would you mind updating your branch? It's showing changes related to the helm PR as changes in your branch
>
> @phil-abb, I was trying to merge the latest changes from the "pre-draft" branch, and it looks like it grabbed your latest commit. I might need help to clean that up.

I rebased via the GitHub PR UI, and it looks ok now.

matlec

This PR implements an approved SUP, so I'm not asking to block or restructure it here. However, I'd like to flag a broader question this PR surfaces that's worth picking up separately.

@ajcraig @phil-abb

This is a bigger thread, and it ties into a comment I left on the "Device roles to capabilities" SUP. Working through this gateway PR, I keep bumping into the same question:

When the WFM sends a deployment, what is it targeting?

Today the implicit answer is "a WFM client". Routing happens via clientId in the URL, and the ApplicationDeployment YAML stays target-agnostic. This PR shifts that implicit answer by putting deviceId inside the ApplicationDeployment YAML's metadata. That works neatly for the see-thru gateway case, but it doesn't generalize. Multi-node Kubernetes clusters are the clearest example: kube-scheduler picks where pods run, so the WFM really has no business addressing nodes directly.

That points at something deeper. Standalone Device, Standalone Cluster, and Gateway look like three different things in the spec - but to me they read as the same shape, and the only real difference between them is: how many deployment targets the WFM client speaks for. A Standalone Device speaks for one. A Standalone Cluster speaks for one (the cluster as a whole, single-node or multi-node). A Gateway speaks for several. The PR introduces Gateway as a new category to handle the "several" case; I'd suggest framing it instead as extending the existing pattern to support more than one deployment target per client.

The most concrete future case this opens up is the DFM/WFM split for multi-node clusters. For a multi-node cluster, devices (the nodes) aren't really a WFM concern - the WFM just needs to know there's a deployment target (the cluster) and what it can run. Node-level identity, vendor, lifecycle - that's all DFM territory (post-GA). The PR currently surfaces device identity through WFM artifacts: deviceId in deployment metadata, the capabilities endpoint keyed on deviceId, the Gateway role naming child devices. When the DFM lands and we want to add device-level visibility, we'd ideally do that without having to pull device identity through every WFM interaction. One framing that would help: separate "what the WFM addresses" from "what counts as a device." The WFM-side concept becomes a "deployment target," orthogonal to whatever the DFM tracks. The WFM addresses targets; the DFM (when it arrives) tracks devices; the two surfaces stay independent.

Sketch of the alternative framing

- WFM Client = protocol participant. Has clientId, X.509 identity. Speaks for one or more deployment targets.
- Deployment target = an execution surface the WFM can address. Has a targetId unique per client. Has its own capabilities document.
- Target capabilities are scoped to deployment concerns: supported runtimes, deployment types, resources. Device-shaped fields (vendor, modelNumber, serialNumber) move to the DFM; they describe what the device is, not what it can host.
- What "resources" means depends on what schedules at this target. A standalone device or a gateway-managed sub-device reports concrete hardware (camera, GPU, ...) because Margo schedules at that level. For a cluster target, it probably makes less sense to report hardware at the cluster level. In the cluster case, kube-scheduler can handle node-specific / hardware-aware placement via the chart's own affinity rules. Margo doesn't need to know which cluster node has the camera - that's Kubernetes' job.
- Target identity inherits from the client. No separate certs; the client is the trust boundary.
- Deployment routing = (clientId, targetId) pair. clientId stays where it is today. targetId lives in a deployment entry of the bundle manifest - not in the ApplicationDeployment YAML, which stays target-agnostic. One bundle manifest carries routing for all of a client's deployments; a multi-target client fetches once rather than per target.
- Delegation ("autonomous placement") = a client-level capability flag, orthogonal to target count. To request delegation for a deployment, the WFM omits targetId from the bundle manifest entry; the client picks among its eligible targets. Single-target clients trivially satisfy this. A multi-target client that hasn't reported delegation capability must receive an explicit targetId.
- No hierarchy in targetId. Flat IDs unique per client. "Parent/child" is the client's internal concern, not protocol-visible.
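
The routing and delegation rules in this sketch could be expressed roughly as follows. This is only an illustrative Python sketch, not part of the proposal: the `Client` class, the `supports_delegation` flag, and the `resolve_target` function are hypothetical names, and a return value of `None` is an assumed way to signal "the client picks the target".

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Client:
    """A WFM client: the protocol participant that speaks for its targets."""
    client_id: str
    target_ids: set[str] = field(default_factory=set)
    supports_delegation: bool = False  # hypothetical client-level capability flag


def resolve_target(client: Client, target_id: Optional[str]) -> Optional[str]:
    """Resolve the target for one bundle manifest entry.

    Returns the targetId to deploy to, or None when placement is delegated
    to a multi-target client. Raises ValueError if the entry is not routable.
    """
    if not client.target_ids:
        raise ValueError(f"client {client.client_id!r} exposes no targets")
    if target_id is not None:
        # Flat IDs, unique per client: a plain membership check is enough.
        if target_id not in client.target_ids:
            raise ValueError(f"unknown target {target_id!r} for client {client.client_id!r}")
        return target_id
    # targetId omitted: the WFM is requesting delegation.
    if len(client.target_ids) == 1:
        # Single-target clients trivially satisfy delegation.
        return next(iter(client.target_ids))
    if client.supports_delegation:
        return None  # the client picks among its eligible targets
    raise ValueError("multi-target client without delegation needs an explicit targetId")
```

Under these assumptions, a standalone device and a cluster both resolve to their single target, while a see-thru gateway either receives an explicit targetId or places the deployment itself.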

How this covers the cases

| Case | clientId | Targets (WFM) | Devices (DFM, post-GA) |
|------|----------|---------------|------------------------|
| Standalone Device | 1 | 1 (the device) | 1 (the device) |
| Standalone Cluster | 1 | 1 (the cluster) | 1 (the node) |
| Multi-node cluster (when it arrives) | 1 | 1 (the cluster) | N (one per node) |
| See-thru gateway | 1 | N+1 or N (gw + children) | 1 (just the gateway) |
| Opaque gateway | 1 | 1 (the gateway) | 1 (just the gateway) |

Clusters don't need special-casing - a cluster is just "1 target." Gateways become "clients with >1 targets." The opaque/see-thru distinction collapses into "how many targets do you expose."

If there's interest, I'm happy to draft this as a SUP (post-PlugFest 2 ;)).

matlec · 2026-05-22T12:22:11Z

@@ -9,24 +9,28 @@ To ensure the WFM is kept up to date, the device's client MUST send updated capa
 ## Route and HTTP Methods

 ```https
-POST /api/v1/clients/{clientId}/capabilities
-PUT /api/v1/clients/{clientId}/capabilities
+POST /api/v1/clients/{clientId}/capabilities/{deviceId}
+PUT /api/v1/clients/{clientId}/capabilities/{deviceId}
+DELETE /api/v1/clients/{clientId}/capabilities/{deviceId}
 ```

 ### Route Parameters

 |Parameter | Type | Required? | Description|
 |----------|------|-----------|------------|
 | {clientId} | string | Y | The unique identifier of the (device) client registered with the WFM during onboarding. |
+| {deviceId} | string | Y | The unique identifier of the device reporting the capabilities. <br/>It must have the following format: "{id}[/{id}[/{id}...]]". The top-level `id` is required and must include only unreserved characters as specified in [RFC3986](https://www.rfc-editor.org/rfc/rfc3986#section-2.3). If reporting capabilities for a child device, the subsequent `id`s are required and must include only unreserved characters as specified in [RFC3986](https://www.rfc-editor.org/rfc/rfc3986#section-2.3). <br/>Using multiple ids in the endpoint does not register multiple devices in a single request, but indicates a hierarchy of devices, with a parent/child relationship. |

 ### Response Codes

 | Code | Description |
 |------|-------------|
 | 201 OK | The device capabilities document was added, or updated, successfully |
+| 204 No Content | The device capabilities document was deleted successfully. |
 | 400 Bad Request | Missing or invalid content-digest header. Ensure the SHA256 hash of the base64-encoded payload is included. |
 | 401 Unauthorized | Signature verification failed. Ensure you are signing with the correct X.509 private key.  |
 | 403 Forbidden | Client certificate is not trusted or has been revoked. |
+| 404 Not Found | POST, PUT: No client with the given `clientId` was found. <br/> DELETE: No client with the given `clientId` was found or no device with the given `deviceId` was found for the client. |
 | 422 Unprocessable Content | Request body includes a semantic error.  |

 ## Request Body Attributes
@@ -41,12 +45,12 @@ PUT /api/v1/clients/{clientId}/capabilities

 | Field       | Type            | Required?       | Description     |
 |-----------------|-----------------|-----------------|-----------------|
-| id     | string    | Y    | Unique deviceID assigned to the device via the Device Owner.|
+| id     | string    | Y    | Unique deviceID assigned to the device via the Device Owner. It must include only unreserved characters as specified in [RFC3986](https://www.rfc-editor.org/rfc/rfc3986#section-2.3) plus the path separator (i.e. '/'). In case of a device behind a gateway, the id field takes the form of a path with the id of the parent gateway, the id of the child device, and the ids of any intermediate devices, i.e., "{gatewayId}/[{intermediateDeviceId}/.../]{deviceId}". |
 | vendor        | string    | Y    | Defines the device vendor.|
 | modelNumber        | string    | Y    | Defines the model number of the device.|
 | serialNumber       | string    | Y    | Defines the serial number of the device.|
-| roles         | []string    | Y    | Element that defines the device role it can provide to the Margo environment. MUST be one of the following: Standalone Cluster, Cluster Leader, or Standalone Device |
-| resources            | []Resource    | Y    | Element that defines the device's resources available to the application deployed on the device. See the [Resource Fields](#resources-attributes) section below. |
+| roles         | []string    | Y    | Element that defines the device role it can provide to the Margo environment. MUST be one of the following: Standalone Cluster, Cluster Leader, Standalone Device, or Gateway |
+| resources            | []Resource    | *    | Element that defines the device's resources available to the application deployed on the device. See the [Resource Fields](#resources-attributes) section below. <br/> * The element is required if the device has any of the following roles: Standalone Cluster, Cluster Leader, Standalone Device. |

 ### Resources Attributes
 Resources of the specific device being reported to the WFM. Utilized to match with the required resources defined in the application description
@@ -161,4 +165,20 @@ These enumerations are used as vocabularies for attribute values of the `DeviceC
        }
    }
 }
-```
+```
+
+## Gateway considerations
+
+### Opaque gateways
+
+Opaque gateways MUST report the combined capabilities of all the devices they connect to the WFM.
+
+> Example: An opaque gateway has two child-devices. Each child-device has an ARM64 processor with 2 cores, 5 GB of memory, 32 GB of storage, and 1 ethernet interface. The gateway will report capabilities of 2 CPUs (arm64) with 2 cores each, 10 GB of memory, 64 GB of storage, and 2 ethernet interfaces. In addition, since the gateway can deploy compose applications on its child-devices, it will report the "Standalone Device" role.
+
+### See-thru gateways
+
+See-thru gateways MUST report their capabilities and the capabilities of each device they connect to the WFM. This is done by calling the `device capabilities` endpoint for the gateway itself and for each device behind the gateway. The `deviceId` in the endpoint is used to indicate the hierarchy of devices, with a parent/child relationship. For example, if a see-thru gateway with `deviceId` "gateway1" connects two devices with `deviceId` "deviceA" and "deviceB", the gateway would call the `device capabilities` endpoint three times with the following `deviceId`s: "gateway1", "gateway1/deviceA", and "gateway1/deviceB". 
+
+When reporting its own capabilities, a see-thru gateway MUST report the role "Gateway". 
+
+If a see-thru gateway is capable of hosting edge applications, it MUST report the corresponding role(s) (i.e., "Standalone Device", "Standalone Cluster", and/or "Cluster Leader") and the resources available for these deployments.
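
The arithmetic behind the opaque gateway example in the diff above can be sketched as follows. This is a minimal illustration only: the `ChildDevice` class, the `combine_capabilities` function, and the dictionary keys are hypothetical and do not reproduce the specification's `Resource` schema.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ChildDevice:
    """Hypothetical summary of one child-device behind an opaque gateway."""
    cpu_arch: str
    cpu_cores: int
    memory_gb: int
    storage_gb: int
    ethernet_interfaces: int


def combine_capabilities(children: list[ChildDevice]) -> dict:
    """Aggregate child-device resources into one combined report.

    CPUs are listed individually (one entry per child); memory, storage,
    and interface counts are summed.
    """
    return {
        "cpus": [{"architecture": c.cpu_arch, "cores": c.cpu_cores} for c in children],
        "memoryGB": sum(c.memory_gb for c in children),
        "storageGB": sum(c.storage_gb for c in children),
        "ethernetInterfaces": sum(c.ethernet_interfaces for c in children),
    }


# The two child-devices from the example: ARM64, 2 cores, 5 GB, 32 GB, 1 interface.
children = [ChildDevice("arm64", 2, 5, 32, 1), ChildDevice("arm64", 2, 5, 32, 1)]
report = combine_capabilities(children)
```

With the two child-devices from the example, the combined report carries two arm64 CPU entries with 2 cores each, 10 GB of memory, 64 GB of storage, and 2 ethernet interfaces.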


The PR carries forward the SUP's distinction between "opaque gateway" and "see-thru gateway" into the MUST rules. Reading through, I think only see-thru really needs to live there - and even then, mostly as a name for an underlying mechanism.

From the WFM's side, an opaque gateway looks identical to a regular device: no Gateway role, no hierarchical deviceId, no special error codes, ... The "Opaque gateways MUST report the combined capabilities..." boils down to "report what you can actually offer", which any device does anyway. And since the WFM can't tell an opaque gateway from a regular device, there's no way the spec could enforce that rule even if it wanted to. It's good guidance for someone building an aggregation device, but it doesn't really have a job in the MUST rules. (Side note: the example uses Standalone Device, but the rules don't seem to stop an opaque gateway from reporting Gateway instead - so even the boundary of the category isn't clear to me from the text.)

See-thru is different. It does point at something real. But that "something real" is the combination of mechanisms already in this PR: the Gateway role plus hierarchical deviceId. The name is a convenient handle for that combination, but it's the mechanisms themselves doing the work.

What I'd suggest:

Drop opaque from the MUST rules, maybe just provide an informative note such as: "A device may aggregate several sub-devices behind it and report itself as a single Margo device."

Rewrite the see-thru MUSTs to point at the mechanism directly: "A WFM client reporting the Gateway role MUST..." You could keep "see-thru gateway" as an informal name in the spec (as a useful shorthand) but don't pin protocol rules to that term.

chrisgclayton · 2026-05-22

I think the key is addressability. In the single-node, cluster, and opaque cases, I am addressing the target directly. In the transparent (see-thru) case, I am targeting the leaf device via the gateway. That means I have to consider the leaf device the same way I would any other target, but then have to add in the targeting parameters associated with the gateway it must communicate through.

That is, I can target a leaf device of a gateway just like any other device, but what happens if the gateway does not support the particular constructs that must flow through it, for example for "starting the camera"? If we assume it is "just flow through", that would devalue the gateway, which becomes little more than a proxy server that security teams may not be happy with.

julienduquesnay-se · 2026-05-29

@matlec
Regarding your first suggestion: I have dropped the requirement (it was not a requirement in the SUP) and rephrased the text a bit based on your proposal.
Regarding your second suggestion: it makes sense, and I tried to rewrite the requirements accordingly.

@chrisgclayton the original intent for the gateway concept was to allow connecting non-Margo devices to the Margo ecosystem. How the gateway communicates with its child-devices is intentionally outside the scope of the specification.

ajcraig · 2026-05-22T19:30:01Z

@julienduquesnay-se @matlec

Review Comments

Device ID

I think we should move away from deviceId and utilize targetName
- This appears to be a user-assigned name that the device owner defines — deviceId implies a system-generated unique identifier, which could cause confusion later.
- deviceId should be reserved for universally unique identifiers, such as one assigned by the Device Fleet Manager.
Within the new documentation, the properties attribute id has a description referencing deviceId. I propose both the route parameter {deviceId} and the body field id be renamed to targetName.

Device Capabilities

It would help to add concrete HTTP call examples within the gateway section, something like:
- Gateway only: POST .../capabilities/gateway1 → body "targetName": "gateway1"
- Gateway + child: POST .../capabilities/gateway1/deviceA → body "targetName": "gateway1/deviceA"
- Gateway + intermediate + child: POST .../capabilities/gateway1/zone1/sensorB → body "targetName": "gateway1/zone1/sensorB"
An additional see-thru gateway payload example would also help adopters understand the correct format when reporting gateway metadata.
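
As a rough illustration of the calls listed above, the following Python sketch validates a hierarchical identifier against the format given in the diff (segments separated by `/`, each made only of RFC 3986 unreserved characters) and builds the capabilities route for a gateway and its children. It uses the PR's `deviceId` naming; the function names and the client ID in the usage note are hypothetical.

```python
import re

# RFC 3986 section 2.3 unreserved characters: ALPHA / DIGIT / "-" / "." / "_" / "~"
_SEGMENT = re.compile(r"[A-Za-z0-9\-._~]+")


def is_valid_device_id(device_id: str) -> bool:
    """Check the "{id}[/{id}...]" format: non-empty segments of unreserved characters."""
    return all(_SEGMENT.fullmatch(segment) for segment in device_id.split("/"))


def capabilities_path(client_id: str, device_id: str) -> str:
    """Build the route used for POST, PUT, and DELETE on a device's capabilities."""
    if not is_valid_device_id(device_id):
        raise ValueError(f"invalid deviceId: {device_id!r}")
    return f"/api/v1/clients/{client_id}/capabilities/{device_id}"


def see_thru_paths(client_id: str, gateway_id: str, child_ids: list[str]) -> list[str]:
    """One route for the gateway itself, then one per child device behind it."""
    device_ids = [gateway_id] + [f"{gateway_id}/{child}" for child in child_ids]
    return [capabilities_path(client_id, device_id) for device_id in device_ids]
```

For example, `see_thru_paths("client-123", "gateway1", ["deviceA", "deviceB"])` yields three routes ending in `gateway1`, `gateway1/deviceA`, and `gateway1/deviceB`, matching the three calls described in the see-thru gateway section.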

Deployment Status

deviceId is listed as required in the body attributes but is absent from the example payload — these should be consistent.
I think targetName should be optional here, only included when a see-thru gateway is reporting status on behalf of a child device. This makes the intent explicit: targetName identifies which child device the status applies to, since the reporting client and the executing device differ in that scenario.

Desired State

Examples showing how desired state YAML files are structured for see-thru gateway deployments would be helpful. It would also be worth showing the wildcard targeting case (e.g. targetName: gateway1/*) since that behavior is unique to gateway scenarios and has no example today.
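
The wildcard semantics are not spelled out in this thread, so the following Python sketch rests on an explicit assumption: `*` matches exactly one path segment, so `gateway1/*` selects the direct children of `gateway1` but neither the gateway itself nor deeper descendants. The function names are hypothetical.

```python
def matches_target(pattern: str, target_name: str) -> bool:
    """Segment-wise match; '*' stands for exactly one segment (assumed semantics)."""
    pattern_parts = pattern.split("/")
    target_parts = target_name.split("/")
    if len(pattern_parts) != len(target_parts):
        return False
    return all(p == "*" or p == t for p, t in zip(pattern_parts, target_parts))


def select_targets(pattern: str, known_targets: list[str]) -> list[str]:
    """Return the known targets that the pattern addresses, in input order."""
    return [name for name in known_targets if matches_target(pattern, name)]


known = [
    "gateway1",
    "gateway1/deviceA",
    "gateway1/deviceB",
    "gateway1/zone1/sensorB",
    "gateway2/deviceA",
]
selected = select_targets("gateway1/*", known)
```

Under this assumption, `selected` contains only `gateway1/deviceA` and `gateway1/deviceB`. If the specification intends `*` to cover whole subtrees, the length check would need to be relaxed accordingly.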

phil-abb · 2026-05-28T12:40:56Z

> @julienduquesnay-se @matlec
>
> Review Comments
>
> Device ID
>
> ...

In light of changes that have been discussed as part of the identity and authorization framework, I think it makes sense to move away from using "deviceId" in this way. Previously, we really didn't have any real meaning behind "deviceId" other than some random ID in the device capabilities magically assigned by the device supplier. Now "device ID" is starting to mean something more specific than this, so I think by using device ID here, it's just going to require us to change it later. We may as well change it now as part of the original PR, since we know it will need to change.

julienduquesnay-se · 2026-05-28T17:29:40Z

> > @julienduquesnay-se @matlec
> >
> > Review Comments
> >
> > Device ID
> >
> > ...
>
> In light of changes that have been discussed as part of the identity and authorization framework, I think it makes sense to move away from using "deviceId" in this way. Previously, we really didn't have any real meaning behind "deviceId" other than some random ID in the device capabilities magically assigned by the device supplier. Now "device ID" is starting to mean something more specific than this, so I think by using device ID here, it's just going to require us to change it later. We may as well change it now as part of the original PR, since we know it will need to change.

@phil-abb @ajcraig @matlec
While I don't necessarily disagree with the proposed change, it raises a concern for me. The content of the SUP was reviewed and approved as part of a formal process. Here we are talking about changing what was approved outside of this process, potentially without the knowledge or approval of the SUP's original approvers.
While it might add some work and time, it is cleaner to complete the changes to the specification as specified in the SUP and then submit an official change request as per the process to make these changes (either as a direct PR or as a SUP, depending on the scope).
Changing an attribute name without changing its semantic meaning might be fine. But if we accept this type of change, it should be clearly reported, maybe in a new section of the SUP itself. And the original voters need to be made aware so they can react if they disagree with the change. I don't think we can assume they are monitoring the PR.

phil-abb · 2026-05-28T17:48:46Z

> @phil-abb @ajcraig @matlec While I don't necessarily disagree with the proposed change, it raises a concern for me...

Fair comment.

I guess the question is, at what point do changes made while updating the specification with the information from an approved SUP start to matter?

For this case, I view it as just changing a word. The SUP used the word "foo," but after thinking about it, "bar" seemed better. It doesn't fundamentally change the behavior that was approved in the SUP.

For example, if this change isn't made with this PR, I wouldn't expect a new SUP to be created just to change "deviceId" to "targetName" because it's not changing any behaviors.

Maybe a topic we can discuss in the next TWG call to see what people think about how much change is acceptable when updating the spec for a SUP?

ajcraig · 2026-05-29T14:19:06Z

> > @phil-abb @ajcraig @matlec While I don't necessarily disagree with the proposed change, it raises a concern for me...
>
> Fair comment.
>
> I guess the question is, at what point do changes made while updating the specification with the information from an approved SUP start to matter?

Fair response indeed @julienduquesnay-se. I'm leaning towards approving this SUP without the change, and then I can drive the word change in a PR directly to the spec without a full SUP, since, as Phil mentioned, it is just a word change.

Otherwise, it would be "buried" inside the "Gateway SUP", which didn't propose a change to the original id.

But we are learning these processes on the fly, so @phil-abb, any thoughts on my strategy above for addressing this particular feedback to Julien?

phil-abb · 2026-05-29T14:32:05Z

> But we are learning these processes on the fly, so @phil-abb, any thoughts on my strategy above for addressing this particular feedback to Julien?

@julienduquesnay-se / @ajcraig / @matlec

I don't have too strong an opinion on whether we do it now or later; it just seems like doing it now as part of the original change means people don't have to make changes later after things have been implemented.

If we want to keep it how it is for now and revisit it later after we know the results of the identity and authentication framework vote, then that is fine as well. Even if that SUP is rejected, I think we'll want to reconsider how we are using ID/Device ID because our use right now seems too generic.

julienduquesnay-se requested a review from a team as a code owner May 14, 2026 14:54

phil-abb self-requested a review May 15, 2026 10:16

julienduquesnay-se added 3 commits May 18, 2026 07:00

update spec to match gateway SUP - first pass

ecc769b

Signed-off-by: Julien Duquesnay <julien.duquesnay@se.com>

small fixes

5eff01e

Signed-off-by: Julien Duquesnay <julien.duquesnay@se.com>

minor edits

44057c5

Signed-off-by: Julien Duquesnay <julien.duquesnay@se.com>

phil-abb force-pushed the gateway-sup branch from 434f910 to 44057c5 Compare May 18, 2026 11:00

phil-abb reviewed May 18, 2026

View reviewed changes

julienduquesnay-se and others added 2 commits May 20, 2026 19:24

Move concepts conent to the proper repo

024e26c

Signed-off-by: Julien Duquesnay <julien.duquesnay@se.com>

Apply suggestions from code review

b9c0fb7

Co-authored-by: Philip Presson <philip.presson@us.abb.com> Signed-off-by: Julien Duquesnay <156128585+julienduquesnay-se@users.noreply.github.com>

matlec reviewed May 22, 2026

View reviewed changes

Add links to gate page in concepts. Combine notes in status error att…

2008795

…ributes description Signed-off-by: Julien Duquesnay <julien.duquesnay@se.com>

add missing 'deviceId' in example

5763183

Signed-off-by: Julien Duquesnay <julien.duquesnay@se.com>

add HTTP call and payload examples + fix couple of typos

e2f6f16

Signed-off-by: Julien Duquesnay <julien.duquesnay@se.com>

julienduquesnay-se added 3 commits May 29, 2026 15:44

add desired states examples

e5ad291

Signed-off-by: Julien Duquesnay <julien.duquesnay@se.com>

clarified deviceId description

754f27a

Signed-off-by: Julien Duquesnay <julien.duquesnay@se.com>

WFM return 404 error for out of order capability reporting + rephrase…

78349ad

… gateway requriements Signed-off-by: Julien Duquesnay <julien.duquesnay@se.com>

phil-abb approved these changes Jun 1, 2026

View reviewed changes

chrisgclayton May 22, 2026

julienduquesnay-se May 29, 2026

5 participants