Gateway SUP integration #180
@julienduquesnay-se - I won't be able to review this until next week.
@julienduquesnay-se - would you mind updating your branch? It's showing changes related to the helm PR as changes in your branch.
@phil-abb, I was trying to merge the latest changes from the "pre-draft" branch, and it looks like it grabbed your latest commit. I might need help to clean that up.
Signed-off-by: Julien Duquesnay <julien.duquesnay@se.com>
Signed-off-by: Julien Duquesnay <julien.duquesnay@se.com>
Signed-off-by: Julien Duquesnay <julien.duquesnay@se.com>
I rebased via the GitHub PR UI, and it looks ok now.
Signed-off-by: Julien Duquesnay <julien.duquesnay@se.com>
Co-authored-by: Philip Presson <philip.presson@us.abb.com> Signed-off-by: Julien Duquesnay <156128585+julienduquesnay-se@users.noreply.github.com>
matlec
left a comment
This PR implements an approved SUP, so I'm not asking to block or restructure it here. However, I'd like to flag a broader question this PR surfaces that's worth picking up separately.
This is a bigger thread, and it ties into a comment I left on the "Device roles to capabilities" SUP. Working through this gateway PR, I keep bumping into the same question:
When the WFM sends a deployment, what is it targeting?
Today the implicit answer is "a WFM client". Routing happens via clientId in the URL, and the ApplicationDeployment YAML stays target-agnostic. This PR shifts that implicit answer by putting deviceId inside the ApplicationDeployment YAML's metadata. That works neatly for the see-thru gateway case, but it doesn't generalize. Multi-node Kubernetes clusters are the clearest example: kube-scheduler picks where pods run, so the WFM really has no business addressing nodes directly.
That points at something deeper. Standalone Device, Standalone Cluster, and Gateway look like three different things in the spec - but to me they read as the same shape, and the only real difference between them is: how many deployment targets the WFM client speaks for. A Standalone Device speaks for one. A Standalone Cluster speaks for one (the cluster as a whole, single-node or multi-node). A Gateway speaks for several. The PR introduces Gateway as a new category to handle the "several" case; I'd suggest framing it instead as extending the existing pattern to support more than one deployment target per client.
The most concrete future case this opens up is the DFM/WFM split for multi-node clusters. For a multi-node cluster, devices (the nodes) aren't really a WFM concern - the WFM just needs to know there's a deployment target (the cluster) and what it can run. Node-level identity, vendor, lifecycle - that's all DFM territory (post-GA). The PR currently surfaces device identity through WFM artifacts: deviceId in deployment metadata, the capabilities endpoint keyed on deviceId, the Gateway role naming child devices. When the DFM lands and we want to add device-level visibility, we'd ideally do that without having to pull device identity through every WFM interaction. One framing that would help: separate "what the WFM addresses" from "what counts as a device." The WFM-side concept becomes a "deployment target," orthogonal to whatever the DFM tracks. The WFM addresses targets; the DFM (when it arrives) tracks devices; the two surfaces stay independent.
Sketch of the alternative framing
- WFM Client = protocol participant. Has `clientId`, X.509 identity. Speaks for one or more deployment targets.
- Deployment target = an execution surface the WFM can address. Has a `targetId` unique per client. Has its own capabilities document.
- Target capabilities are scoped to deployment concerns: supported runtimes, deployment types, resources. Device-shaped fields (`vendor`, `modelNumber`, `serialNumber`) move to the DFM; they describe what the device is, not what it can host.
- What "resources" means depends on what schedules at this target. A standalone device or a gateway-managed sub-device reports concrete hardware (camera, GPU, ...) because Margo schedules at that level. For a cluster target, it probably makes less sense to report hardware at the cluster level. In the cluster case, kube-scheduler can handle node-specific / hardware-aware placement via the chart's own affinity rules. Margo doesn't need to know which cluster node has the camera - that's Kubernetes' job.
- Target identity inherits from the client. No separate certs; the client is the trust boundary.
- Deployment routing = `(clientId, targetId)` pair. `clientId` stays where it is today. `targetId` lives in a deployment entry of the bundle manifest - not in the `ApplicationDeployment` YAML, which stays target-agnostic. One bundle manifest carries routing for all of a client's deployments; a multi-target client fetches once rather than per target.
- Delegation ("autonomous placement") = a client-level capability flag, orthogonal to target count. To request delegation for a deployment, the WFM omits `targetId` from the bundle manifest entry; the client picks among its eligible targets. Single-target clients trivially satisfy this. A multi-target client that hasn't reported delegation capability must receive an explicit `targetId`.
- No hierarchy in `targetId`. Flat IDs unique per client. "Parent/child" is the client's internal concern, not protocol-visible.
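To make the routing and delegation rules in the list above a bit more concrete, here is a minimal Python sketch of the WFM-side decision. Everything named here (`Client`, `resolve_target`, the `delegation` flag) is hypothetical and only illustrates the proposal; none of it exists in the Margo specification today.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Client:
    """Hypothetical model of a WFM client and the targets it speaks for."""
    client_id: str
    target_ids: List[str] = field(default_factory=list)  # flat IDs, unique per client
    delegation: bool = False  # client-level "autonomous placement" capability flag


def resolve_target(client: Client, target_id: Optional[str]) -> Optional[str]:
    """Return the targetId to place in the bundle manifest entry.

    A return value of None means the entry omits targetId and the client
    picks among its eligible targets itself.
    """
    if not client.target_ids:
        raise ValueError(f"client {client.client_id!r} exposes no targets")
    if target_id is not None:
        # Explicit routing: the target must be one this client speaks for.
        if target_id not in client.target_ids:
            raise ValueError(f"unknown target {target_id!r} for client {client.client_id!r}")
        return target_id
    if len(client.target_ids) == 1 or client.delegation:
        # Single-target clients trivially satisfy delegation.
        return None
    raise ValueError("multi-target client without delegation needs an explicit targetId")
```

With this shape, a standalone device or cluster is simply a client with one entry in `target_ids`, and a see-thru gateway is a client with several.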
How this covers the cases
| Case | clientId | Targets (WFM) | Devices (DFM, post-GA) |
|---|---|---|---|
| Standalone Device | 1 | 1 (the device) | 1 (the device) |
| Standalone Cluster | 1 | 1 (the cluster) | 1 (the node) |
| Multi-node cluster (when it arrives) | 1 | 1 (the cluster) | N (one per node) |
| See-thru gateway | 1 | N+1 or N (gw + children) | 1 (just the gateway) |
| Opaque gateway | 1 | 1 (the gateway) | 1 (just the gateway) |
Clusters don't need special-casing - a cluster is just "1 target." Gateways become "clients with >1 targets." The opaque/see-thru distinction collapses into "how many targets do you expose."
If there's interest, I'm happy to draft this as a SUP (post-PlugFest 2 ;)).
| @@ -9,24 +9,28 @@ To ensure the WFM is kept up to date, the device's client MUST send updated capa | |||
| ## Route and HTTP Methods | |||
|
|
|||
| ```https | |||
| POST /api/v1/clients/{clientId}/capabilities | |||
| PUT /api/v1/clients/{clientId}/capabilities | |||
| POST /api/v1/clients/{clientId}/capabilities/{deviceId} | |||
| PUT /api/v1/clients/{clientId}/capabilities/{deviceId} | |||
| DELETE /api/v1/clients/{clientId}/capabilities/{deviceId} | |||
| ``` | |||
|
|
|||
| ### Route Parameters | |||
|
|
|||
| |Parameter | Type | Required? | Description| | |||
| |----------|------|-----------|------------| | |||
| | {clientId} | string | Y | The unique identifier of the (device) client registered with the WFM during onboarding. | | |||
| | {deviceId} | string | Y | The unique identifier of the device reporting the capabilities. <br/>It must have the following format: "{id}[/{id}[/{id}...]]". The top-level `id` is required and must include only unreserved characters as specified in [RFC3986](https://www.rfc-editor.org/rfc/rfc3986#section-2.3). If reporting capabilities for a child device, the subsequent `id`s are required and must include only unreserved characters as specified in [RFC3986](https://www.rfc-editor.org/rfc/rfc3986#section-2.3). <br/>Using multiple ids in the endpoint does not register multiple devices in a single request, but indicates a hierarchy of devices, with a parent/child relationship. | |||
|
|
|||
| ### Response Codes | |||
|
|
|||
| | Code | Description | | |||
| |------|-------------| | |||
| | 201 OK | The device capabilities document was added, or updated, successfully | | |||
| | 204 No Content | The device capabilities document was deleted successfully. | | |||
| | 400 Bad Request | Missing or invalid content-digest header. Ensure the SHA256 hash of the base64-encoded payload is included. | | |||
| | 401 Unauthorized | Signature verification failed. Ensure you are signing with the correct X.509 private key. | | |||
| | 403 Forbidden | Client certificate is not trusted or has been revoked. | | |||
| | 404 Not Found | POST, PUT: No client with the given `clientId` was found. <br/> DELETE: No client with the given `clientId` was found or no device with the given `deviceId` was found for the client. | |||
| | 422 Unprocessable Content | Request body includes a semantic error. | | |||
|
|
|||
| ## Request Body Attributes | |||
| @@ -41,12 +45,12 @@ PUT /api/v1/clients/{clientId}/capabilities | |||
|
|
|||
| | Field | Type | Required? | Description | | |||
| |-----------------|-----------------|-----------------|-----------------| | |||
| | id | string | Y | Unique deviceID assigned to the device via the Device Owner.| | |||
| | id | string | Y | Unique deviceID assigned to the device via the Device Owner. It must include only unreserved characters as specified in [RFC3986](https://www.rfc-editor.org/rfc/rfc3986#section-2.3) plus the path separator (i.e. '/'). In case of a device behind a gateway, the id field takes the form of a path with the id of the parent gateway, the id of the child device, and the ids of any intermediate devices, i.e., "{gatewayId}/[{intermediateDeviceId}/.../]{deviceId}". | |||
| | vendor | string | Y | Defines the device vendor.| | |||
| | modelNumber | string | Y | Defines the model number of the device.| | |||
| | serialNumber | string | Y | Defines the serial number of the device.| | |||
| | roles | []string | Y | Element that defines the device role it can provide to the Margo environment. MUST be one of the following: Standalone Cluster, Cluster Leader, or Standalone Device | | |||
| | resources | []Resource | Y | Element that defines the device's resources available to the application deployed on the device. See the [Resource Fields](#resources-attributes) section below. | | |||
| | roles | []string | Y | Element that defines the device role it can provide to the Margo environment. MUST be one of the following: Standalone Cluster, Cluster Leader, Standalone Device, or Gateway | | |||
| | resources | []Resource | * | Element that defines the device's resources available to the application deployed on the device. See the [Resource Fields](#resources-attributes) section below. <br/> * The element is required if the device has any of the following roles: Standalone Cluster, Cluster Leader, Standalone Device. | | |||
|
|
|||
| ### Resources Attributes | |||
| Resources of the specific device being reported to the WFM. Utilized to match with the required resources defined in the application description | |||
| @@ -161,4 +165,20 @@ These enumerations are used as vocabularies for attribute values of the `DeviceC | |||
| } | |||
| } | |||
| } | |||
| ``` No newline at end of file | |||
| ``` | |||
|
|
|||
| ## Gateways considerations | |||
|
|
|||
| ### Opaque gateways | |||
|
|
|||
| Opaque gateways MUST report the combined capabilities of all the devices they connect to the WFM. | |||
|
|
|||
| > Example: An opaque gateway has two child-devices. Each child-device has an ARM64 processor with 2 cores, 5 GB of memory, 32 GB of storage, and 1 ethernet interface. The gateway will report capabilities of 2 CPUs (arm64) with 2 cores each, 10 GB of memory, 64 GB of storage, and 2 ethernet interfaces. In addition since the gateway can deploy compose applications on its child-devices it will report the role of "standalone device". | |||
|
|
|||
| ## See-thru gateways | |||
|
|
|||
| See-thru gateways MUST report their capabilities and the capabilities of each device they connect to the WFM. This is done by calling the `device capabilities` endpoint for the gateway itself and for each device behind the gateway. The `deviceId` in the endpoint is used to indicate the hierarchy of devices, with a parent/child relationship. For example, if a see-thru gateway with `deviceId` "gateway1" connects two devices with `deviceId` "deviceA" and "deviceB", the gateway would call the `device capabilities` endpoint three times with the following `deviceId`s: "gateway1", "gateway1/deviceA", and "gateway1/deviceB". | |||
|
|
|||
| When reporting its own capabilities, a see-thru gateway MUST report the role "Gateway". | |||
|
|
|||
| If a see-thru gateway is capable of hosting edge applications it MUST report the corresponding role(s) (i.e., "Standalone Device", "Standalone Cluster", and/or "Cluster Leader") and the resources available for these deployments. | |||
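As an illustration of the `deviceId` format and the see-thru reporting sequence described in the diff above, here is a minimal Python sketch. It assumes one reading of the text: each path segment is non-empty and uses only RFC 3986 unreserved characters, with `/` as the separator. The function names are hypothetical; only the ID format and the `gateway1` example come from the diff.

```python
import re
from typing import List

# RFC 3986 section 2.3 unreserved characters: ALPHA / DIGIT / "-" / "." / "_" / "~"
_SEGMENT = r"[A-Za-z0-9\-._~]+"
# One required top-level id, optionally followed by "/"-separated child ids.
_DEVICE_ID = re.compile(rf"{_SEGMENT}(?:/{_SEGMENT})*")


def is_valid_device_id(device_id: str) -> bool:
    """Check a (possibly hierarchical) deviceId against the assumed format."""
    return _DEVICE_ID.fullmatch(device_id) is not None


def see_thru_device_ids(gateway_id: str, child_ids: List[str]) -> List[str]:
    """List the deviceIds a see-thru gateway would report, gateway first.

    Each returned id corresponds to one call to
    /api/v1/clients/{clientId}/capabilities/{deviceId}.
    """
    ids = [gateway_id] + [f"{gateway_id}/{child}" for child in child_ids]
    invalid = [i for i in ids if not is_valid_device_id(i)]
    if invalid:
        raise ValueError(f"invalid deviceId(s): {invalid}")
    return ids
```

For the example in the text, `see_thru_device_ids("gateway1", ["deviceA", "deviceB"])` yields the three ids "gateway1", "gateway1/deviceA", and "gateway1/deviceB".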
The PR carries forward the SUP's distinction between "opaque gateway" and "see-thru gateway" into the MUST rules. Reading through, I think only see-thru really needs to live there - and even then, mostly as a name for an underlying mechanism.
From the WFM's side, an opaque gateway looks identical to a regular device: no Gateway role, no hierarchical deviceId, no special error codes, ... The "Opaque gateways MUST report the combined capabilities..." boils down to "report what you can actually offer", which any device does anyway. And since the WFM can't tell an opaque gateway from a regular device, there's no way the spec could enforce that rule even if it wanted to. It's good guidance for someone building an aggregation device, but it doesn't really have a job in the MUST rules. (Side note: the example uses Standalone Device, but the rules don't seem to stop an opaque gateway from reporting Gateway instead - so even the boundary of the category isn't clear to me from the text.)
See-thru is different. It does point at something real. But that "something real" is the combination of mechanisms already in this PR: the Gateway role plus hierarchical deviceId. The name is a convenient handle for that combination, but it's the mechanisms themselves doing the work.
What I'd suggest:
- Drop opaque from the MUST rules, maybe just provide an informative note such as: "A device may aggregate several sub-devices behind it and report itself as a single Margo device."
- Rewrite the see-thru MUSTs to point at the mechanism directly: "A WFM client reporting the `Gateway` role MUST..." You could keep "see-thru gateway" as an informal name in the spec (as a useful shorthand) but don't pin protocol rules to that term.
I think the key is addressability. In the case of single node, cluster, and opaque I am addressing the target directly. In the case of transparent I am targeting the leaf device via the gateway. That means I have to consider the leaf device the same way I would any other target, but then have to add in the targeting parameters associated with the gateway it must communicate through.
i.e., I can target a leaf device of a gateway just like any other device, but what happens if the gateway does not support the particular constructs that must flow through it for "starting the camera"? If we assume it is "just flow through", that devalues the gateway: it becomes not much more than a proxy server, which security teams may not be happy with.
@matlec
Regarding your first suggestion: I have dropped the requirement (it was not a requirement in the SUP) and rephrased the text a bit based on your proposal.
Regarding your second suggestion: it makes sense and I tried to rewrite the requirements accordingly.
@chrisgclayton the original intent for the gateway concept was to allow connecting non-Margo devices to the Margo ecosystem. How the gateway communicates with its child-devices is intentionally outside the scope of the specification.
Review Comments
Device ID
Device Capabilities
Deployment Status
Desired State
…ributes description Signed-off-by: Julien Duquesnay <julien.duquesnay@se.com>
In light of changes that have been discussed as part of the identity and authorization framework, I think it makes sense to move away from using "deviceId" in this way. Previously, we really didn't have any real meaning behind "deviceId" other than some random ID in the device capabilities magically assigned by the device supplier. Now "device ID" is starting to mean something more specific than this, so I think by using device ID here, it's just going to require us to change it later. May as well just change it now as part of the original PR since we know it will need to change.
@phil-abb @ajcraig @matlec
Signed-off-by: Julien Duquesnay <julien.duquesnay@se.com>
Fair comment. I guess the question is, at what point do changes made while updating the specification with the information from an approved SUP start to matter? For this case, I view it as just changing a word. The SUP used the word "foo," but after thinking about it, "bar" seemed better. It doesn't fundamentally change the behavior that was approved in the SUP. For example, if this change isn't made with this PR, I wouldn't expect a new SUP to be created just to change "deviceId" to "targetName" because it's not changing any behaviors. Maybe a topic we can discuss in the next TWG call to see what people think about how much change is acceptable when updating the spec for a SUP?
Signed-off-by: Julien Duquesnay <julien.duquesnay@se.com>
Fair response indeed @julienduquesnay-se. I'm leaning towards approving this SUP without the change, and then I can drive the word change in a PR directly to spec without a full SUP. Since, as Phil mentioned, it is just a word change. However, it would be "buried" inside the "Gateway SUP", which didn't propose a change to the original But we are learning these processes on the fly, so @phil-abb thoughts on my strategy above to address this particular FB to Julien?
@julienduquesnay-se / @ajcraig / @matlec I don't have too strong an opinion on whether we do it now or later; it just seems like doing it now as part of the original change means people don't have to make changes later after things have been implemented. If we want to keep it how it is for now and revisit it later after we know the results of the identity and authentication framework vote, then that is fine as well. Even if that SUP is rejected, I think we'll want to reconsider how we are using ID/Device ID because our use right now seems too generic.
Signed-off-by: Julien Duquesnay <julien.duquesnay@se.com>
Signed-off-by: Julien Duquesnay <julien.duquesnay@se.com>
… gateway requriements Signed-off-by: Julien Duquesnay <julien.duquesnay@se.com>
Description
Specification update related to the Gateway SUP. Adds support for the gateway service.
Issues Addressed
#137