Project-HAMi · mesutoezdil · Jun 4, 2026
diff --git a/docs/faq/faq.md b/docs/faq/faq.md
@@ -181,3 +181,54 @@ If the official Device Plugin cannot provide the required information, HAMi deve
 
 - Ascend’s official Device Plugin requires a separate plugin for each card type. HAMi abstracts these card templates into a unified plugin for easier integration with the scheduler.
 - NVIDIA requires custom implementations to support advanced features like compute and memory limits, overcommitment, and NUMA awareness, necessitating HAMi’s custom Device Plugin.
+
+## How does HAMi enforce GPU memory and compute limits?
+
+HAMi injects `libvgpu.so` into containers via `/etc/ld.so.preload`. The library intercepts CUDA memory allocation calls and returns an out-of-memory error when an allocation would exceed the `nvidia.com/gpumem` limit; compute limits are enforced by a token-bucket throttle on kernel launch calls. Workloads that never load the preloaded library (for example, containers started inside Docker-in-Docker) or that call the driver API directly are not covered. For the full interception flow, see [GPU Virtualization](./core-concepts/gpu-virtualization).
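+
+As a minimal sketch of how these limits are requested, the Pod below asks for one vGPU capped at 3000 MiB of memory and 30% of compute. The Pod name, container name, and image tag are hypothetical. `nvidia.com/gpucores` is assumed here to be the compute-limit resource name, and the MiB and percent units are assumed from the partition granularity listed in the comparison table in this FAQ.
+
+```yaml
+apiVersion: v1
+kind: Pod
+metadata:
+  name: gpu-limit-demo              # hypothetical name
+spec:
+  containers:
+    - name: cuda                    # hypothetical container name
+      image: nvidia/cuda:12.4.1-base-ubuntu22.04   # any CUDA image; tag is an assumption
+      command: ["sleep", "infinity"]
+      resources:
+        limits:
+          nvidia.com/gpu: 1         # number of vGPUs requested
+          nvidia.com/gpumem: 3000   # memory cap per vGPU, assumed to be in MiB
+          nvidia.com/gpucores: 30   # assumed resource name: percent of compute
+```
+
+With a spec like this, `nvidia-smi` inside the container would be expected to report roughly the 3000 MiB budget rather than the physical VRAM.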
+
+## How does HAMi vGPU differ from NVIDIA MIG? When should I use each?
+
+HAMi vGPU is implemented entirely in software and needs no special GPU hardware. NVIDIA MIG is hardware partitioning, available only on MIG-capable GPUs of the Ampere generation or later (for example A100, A30, H100, H200).
+
+| Property | HAMi vGPU | NVIDIA MIG |
+|---|---|---|
+| Hardware requirement | Any NVIDIA GPU, driver v440+ | MIG-capable GPU, Ampere or later (e.g. A100, A30, H100, H200) |
+| Isolation mechanism | User-space library interception | Hardware engine partitioning |
+| Memory enforcement | Soft (CUDA API level) | Hard (hardware-enforced) |
+| Compute enforcement | Soft (throttle inside libvgpu.so) | Hard (separate SM partitions) |
+| Partition granularity | 1 MiB memory, 1% compute | Fixed MIG profiles (e.g. 1g.10gb) |
+| Dynamic reconfiguration | Yes, no node drain needed | Requires MIG profile reconfiguration |
+| Multi-tenant noise isolation | Best-effort | Strong |
+
+Use HAMi vGPU when the GPU does not support MIG, workloads need flexible memory sizes, or dynamic repacking without node drains is needed. Use MIG when hard hardware isolation is a compliance or SLA requirement. HAMi also supports dynamic MIG via `mig-parted`; see [Dynamic MIG Support](./userguide/nvidia-device/dynamic-mig-support).
+
+## Why does nvidia-smi inside my container show less memory than on the host?
+
+`libvgpu.so` intercepts `nvmlDeviceGetMemoryInfo` and related calls, returning the `nvidia.com/gpumem` limit instead of physical VRAM. This is intentional: workloads that size their allocations based on reported memory (such as vLLM) will use only their budget. The host’s `nvidia-smi` always shows physical memory. See [GPU Virtualization](./core-concepts/gpu-virtualization).
+
+## Why is my nvidia.com/gpumem limit not enforced? {#why-is-my-nvidiagpumem-limit-not-enforced}
+
+The four most common causes: `CUDA_DISABLE_CONTROL=true` is set, the workload runs inside Docker-in-Docker, the application calls the GPU driver directly (bypassing `libvgpu.so`), or `nvidia-container-runtime` is not the default runtime on the node. See [Troubleshooting](./troubleshooting) for resolution steps.
+
+## Does HAMi replace kube-scheduler or run alongside it?
+
+HAMi runs alongside kube-scheduler as a [scheduler extender](https://github.com/kubernetes/design-proposals-archive/blob/main/scheduling/scheduler_extender.md); it does not replace kube-scheduler. The MutatingWebhook sets `schedulerName: hami-scheduler` only on pods that request HAMi resources; all other pods follow the default scheduler path unchanged. See [Architecture](./core-concepts/architecture).
+
+## Does HAMi work with vLLM, and what are the known limitations for multi-GPU tensor parallelism?
+
+Single-GPU vLLM with `nvidia.com/gpumem` works without additional configuration. For multi-GPU tensor parallelism (`tensor_parallel_size > 1`) with vLLM versions greater than 0.18, HAMi v2.9.0 or later is required. Earlier HAMi versions hit NCCL initialization failures caused by shared CUDA device memory state files (see [#1764](https://github.com/Project-HAMi/HAMi/issues/1764) and [#1853](https://github.com/Project-HAMi/HAMi/issues/1853)). In Volcano environments, set `tensor_parallel_size` per pod, not across all pods. If CUDA graph capture errors occur, try `--enforce-eager`.
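+
+A hedged sketch of a two-GPU tensor-parallel Pod under these constraints follows. The Pod name, the `<model-name>` placeholder, the memory figures, and the `/dev/shm` volume (commonly needed for NCCL inter-process communication) are illustrative assumptions, not a HAMi-verified configuration.
+
+```yaml
+apiVersion: v1
+kind: Pod
+metadata:
+  name: vllm-tp2                    # hypothetical name
+spec:
+  containers:
+    - name: vllm
+      image: vllm/vllm-openai:latest
+      args:
+        - --model
+        - <model-name>              # placeholder: substitute your model
+        - --tensor-parallel-size
+        - "2"                       # must match the GPUs requested by this pod
+        - --enforce-eager           # optional: only if CUDA graph capture fails
+      resources:
+        limits:
+          nvidia.com/gpu: 2         # two vGPUs in the same pod
+          nvidia.com/gpumem: 20000  # illustrative cap per vGPU, assumed MiB
+      volumeMounts:
+        - name: shm
+          mountPath: /dev/shm       # assumption: shared memory for NCCL
+  volumes:
+    - name: shm
+      emptyDir:
+        medium: Memory
+```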
+
+## Is HAMi compatible with NVIDIA GPU Operator and DCGM metrics?
+
+HAMi’s device plugin and the GPU Operator’s device plugin both advertise `nvidia.com/gpu` to the kubelet, so running both on the same node causes conflicts. Disable the GPU Operator’s device plugin in its Helm values:
+
+```yaml
+devicePlugin:
+  enabled: false
+```
+
+DCGM Exporter is not affected and continues to report physical-level counters normally. HAMi’s per-container virtual metrics are separate; see [GPU Utilization Metrics](./developers/gpu-utilization-metrics).
+
+## How do I set up Prometheus and Grafana monitoring for HAMi vGPU metrics?
+
+The `hami-device-plugin` pod on each node exposes per-container vGPU metrics on port `31992` (configurable via `devicePlugin.monitorPort`). See [Grafana Dashboard](./userguide/monitoring/grafana-dashboard) for the full setup including Prometheus scrape config and dashboard import.
diff --git a/docs/troubleshooting/troubleshooting.md b/docs/troubleshooting/troubleshooting.md
@@ -2,6 +2,21 @@
 title: Troubleshooting
 ---
 
+## GPU Memory Limit Not Enforced {#gpu-memory-limit-not-enforced}
+
+If a container exceeds its `nvidia.com/gpumem` limit, check the following causes:
+
+- **`CUDA_DISABLE_CONTROL=true` is set** - disables HAMi-core enforcement entirely. Remove it from production workloads.
+- **Docker-in-Docker (DinD)** - inner containers do not inherit the `/etc/ld.so.preload` hostPath mount. HAMi enforcement does not apply inside DinD.
+- **Direct driver API usage** - workloads calling NVML or the CUDA Driver API directly bypass `libvgpu.so`.
+- **`nvidia-container-runtime` not set as default** - verify with:
+
+  ```bash
+  containerd config dump | grep default_runtime_name
+  ```
+
+  The output must show `nvidia`. If not, follow the [Prerequisites](./installation/online-installation) guide.
+
 - If you don’t explicitly request vGPUs when using the device plugin with NVIDIA images, all GPUs on the host may be exposed to your container.
 - Currently, A100 MIG can be supported in only "none" and "mixed" modes.
 - Tasks with the "nodeName" field cannot be scheduled at the moment; please use "nodeSelector" instead.

diff --git a/docs/userguide/monitoring/grafana-dashboard.md b/docs/userguide/monitoring/grafana-dashboard.md
@@ -25,6 +25,28 @@ The dashboard includes panels for:
 - Node-level GPU resource availability
 - Device plugin health status
 
+## Prometheus Scrape Config
+
+The `hami-device-plugin` pod on each node exposes metrics on port `31992` (configurable via `devicePlugin.monitorPort`). Add a scrape job:
+
+```yaml
+scrape_configs:
+  - job_name: hami-device-plugin
+    static_configs:
+      - targets:
+          - <node-ip>:31992
+```
+
+For Prometheus Operator, create a `ServiceMonitor` targeting the `hami-device-plugin` service on port `31992`.
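+
+A minimal `ServiceMonitor` sketch is shown below. The namespaces, the selector label, and the port name `monitorport` are hypothetical and must be matched to the Service actually deployed in your cluster. Note that `endpoints[].port` refers to the Service port *name*, not the number `31992`.
+
+```yaml
+apiVersion: monitoring.coreos.com/v1
+kind: ServiceMonitor
+metadata:
+  name: hami-device-plugin
+  namespace: monitoring             # hypothetical: where Prometheus Operator watches
+spec:
+  namespaceSelector:
+    matchNames:
+      - kube-system                 # assumption: namespace where HAMi is installed
+  selector:
+    matchLabels:
+      app.kubernetes.io/component: hami-device-plugin   # hypothetical Service label
+  endpoints:
+    - port: monitorport             # hypothetical name of the Service port mapped to 31992
+      path: /metrics
+      interval: 15s
+```
+
+Depending on how Prometheus Operator was installed, the `ServiceMonitor` may also need a label that matches the Prometheus `serviceMonitorSelector`.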
+
+Key metrics:
+
+| Metric | Description |
+|---|---|
+| `Device_memory_desc_of_container` | Virtual GPU memory allocated to a container |
+| `Device_utilization_desc_of_container` | GPU compute utilization per container |
+| `Device_memory_limit_of_container` | Memory limit set for the container |
+
 ## Prerequisites
 
 - Prometheus is installed and scraping the HAMi device plugin metrics endpoint.