simvx.graphics.gpu.multi_device

Explicit multi-adapter (multi-GPU) foundation: D8 workload-split offload.

This is the foundation wave of design decision D8. It builds the gated plumbing for an explicit-multi-adapter renderer (one independent VkDevice per physical GPU) where whole SubViewport / offscreen scene-render-units (SRUs) are offloaded to secondary GPUs and composited on the primary (GPU 0). It is deliberately off by default: a single-GPU box, and any multi-GPU box that does not opt in, runs today’s single-device path byte-identical (no extra device, no transfer work, no behaviour change). The active multi-device path is verified on a 4x Arc Pro B70 rig, not on this single-GPU dev box.

Three coherent, unit-testable pieces live here:

  1. :class:MultiDeviceManager enumerates physical devices and, only when physical_device_count > 1 and the caller opted in, creates an independent logical device per physical GPU (reusing :func:~simvx.graphics.gpu.device.create_logical_device). When the count is 1 or the opt-in is off it holds exactly the existing single (primary) device and is a transparent passthrough.

  2. :func:assign_srus is the pure device-assignment policy (no Vulkan): given the ordered SRUs and a device count it decides which render on GPU 0 vs which offload to GPU 1+. TAA-safe by construction (each SRU keeps its temporal history on its one assigned device, so there is no cross-device reprojection).

  3. :class:CrossDeviceTransfer selects how a finished offscreen colour image moves from a secondary device to the primary for compositing. The staging-copy path (secondary -> host-visible staging -> primary) is the guaranteed floor and is the only path implemented end-to-end; the dma-buf / VK_KHR_external_memory_fd zero-copy path is the rig optimisation, gated behind a capability and currently raising a clear, actionable error.

REALITY CHECK (honest scope). Rendering an SRU on a second VkDevice requires that device to own its full set of rendering resources: its own pipelines, descriptor pools, transform/material SSBOs, and mesh+texture residency. The current :class:~simvx.graphics.renderer.forward.Renderer is built around exactly one device. Duplicating it per device is a large, GPU-bound refactor that cannot be functionally verified on this single-GPU box. So this module stops at a clean seam: the device manager, the assignment policy, and the transfer interface are real and tested; the per-device renderer construction + the actual offload-record-and-composite loop are the documented rig-side completion (see :meth:MultiDeviceManager.attach_renderer / :attr:DeviceSlot.renderer).

Module Contents

Classes

DeviceSlot

One physical+logical GPU participating in the multi-device renderer.

MultiDeviceManager

Owns the per-physical-GPU logical devices for the D8 offload renderer.

SRUAssignment

Which device renders one SRU and whether its result must be transferred.

TransferMethod

Selectable cross-device image-transfer strategies (enum-like constants).

CrossDeviceTransfer

Moves a finished offscreen colour image from a secondary to the primary.

OffloadRoute

The per-SRU offload decision the recording path consults.

SRUOffloadCoordinator

Decides + drives where each SubViewport SRU renders across the devices.

Functions

assign_srus

Decide which device renders each SRU (pure, unit-testable, no Vulkan).

select_transfer_method

Pick the transfer strategy for one SRU assignment (pure, no Vulkan).

Data

API

simvx.graphics.gpu.multi_device.log

‘getLogger(…)’

simvx.graphics.gpu.multi_device.__all__

[‘DeviceSlot’, ‘MultiDeviceManager’, ‘SRUAssignment’, ‘assign_srus’, ‘CrossDeviceTransfer’, ‘Transfe…

class simvx.graphics.gpu.multi_device.DeviceSlot[source]

One physical+logical GPU participating in the multi-device renderer.

index 0 is always the primary (compositing) GPU; 1+ are secondaries that render offloaded SRUs. On a single-GPU run there is exactly one slot (index == 0) wrapping the engine’s existing device, and nothing else is created.

renderer is the per-device :class:~simvx.graphics.renderer.forward.Renderer duplicate. It is populated for the primary slot from the engine’s existing renderer and, on the rig, for each secondary by :meth:MultiDeviceManager.attach_renderer. It stays None for secondaries until that rig-side per-device renderer construction lands (see the module docstring): the foundation never fabricates a broken renderer to look complete.

index: int

None

physical_device: Any

None

queue_families: simvx.graphics.gpu.device.QueueFamilies

None

device: Any

None

graphics_queue: Any

None

present_queue: Any

None

compute_queue: Any

None

transfer_queue: Any

None

name: str = <Multiline-String>
renderer: Any

None

property is_primary: bool[source]
class simvx.graphics.gpu.multi_device.MultiDeviceManager(*, primary_physical_device: Any, primary_queue_families: simvx.graphics.gpu.device.QueueFamilies, primary_device: Any, primary_graphics_queue: Any, primary_present_queue: Any, physical_devices: list[Any], enabled: bool, capabilities: simvx.graphics.gpu.capabilities.RenderCapabilities | None = None, primary_compute_queue: Any = None, primary_transfer_queue: Any = None, find_queue_families: Any = None)[source]

Owns the per-physical-GPU logical devices for the D8 offload renderer.

Construct with the primary device’s already-created handles (the engine’s existing single device, so the primary slot is never re-created) plus the enumerated physical devices and the opt-in flag. When enabled is true and more than one physical device is present, a secondary :class:DeviceSlot is created per additional physical GPU with its own independent VkDevice via :func:create_logical_device. Otherwise the manager holds exactly the single primary slot and :attr:multi_gpu is False (today’s path, unchanged).

The manager does NOT own the primary device’s lifetime (the engine created and destroys it); :meth:destroy only tears down the secondary devices it created itself.
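The gating described above can be illustrated with a small, Vulkan-free sketch. It is not the module's implementation: SlotSketch and build_slots are hypothetical stand-ins, and physical devices are represented by plain strings.

```python
from dataclasses import dataclass


@dataclass
class SlotSketch:
    """Hypothetical stand-in for DeviceSlot (index + physical device only)."""
    index: int
    physical_device: object

    @property
    def is_primary(self) -> bool:
        return self.index == 0


def build_slots(physical_devices: list, primary: object, enabled: bool) -> list:
    """Return the slots a manager would hold under the documented gating.

    Assumption: secondaries are numbered 1+ in enumeration order.
    """
    slots = [SlotSketch(0, primary)]  # the primary slot always exists
    if enabled and len(physical_devices) > 1:
        secondaries = [pd for pd in physical_devices if pd != primary]
        slots += [SlotSketch(i, pd) for i, pd in enumerate(secondaries, start=1)]
    return slots


def is_multi_gpu(slots: list) -> bool:
    # Mirrors the documented meaning of multi_gpu: >= 2 slots.
    return len(slots) >= 2
```

With a single physical device, or with enabled=False, build_slots returns only the primary slot, which corresponds to the unchanged single-device path.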

Initialization

property multi_gpu: bool[source]

True only when an opted-in multi-device renderer is active (>= 2 slots).

property device_count: int[source]

Number of logical devices the manager owns (1 on the single-GPU path).

property primary: simvx.graphics.gpu.multi_device.DeviceSlot[source]
property secondaries: list[simvx.graphics.gpu.multi_device.DeviceSlot][source]
property slots: list[simvx.graphics.gpu.multi_device.DeviceSlot][source]
slot(index: int) simvx.graphics.gpu.multi_device.DeviceSlot[source]
attach_renderer(index: int, renderer: Any) None[source]

Bind a per-device :class:Renderer to slot index (rig-side).

The primary renderer is the engine’s existing one. Each secondary needs its OWN renderer (its device’s pipelines / descriptor pools / SSBOs / residency); constructing that is the documented rig-side completion. This setter is the seam where the rig hands the constructed per-device renderer back to the manager so the offload loop can record into it.

register_teardown(hook: Any) None[source]

Register a callback to free per-secondary GPU resources before device destroy.

The offload coordinator registers its :meth:SRUOffloadCoordinator.destroy here so its secondary facades / targets / staging buffers are freed while the secondary VkDevices are still alive (they own those resources). Invoked first by :meth:destroy.

destroy() None[source]

Destroy only the SECONDARY logical devices this manager created.

The primary device is owned by the engine and left untouched. Safe to call on the single-GPU path (no secondaries => no-op). Registered teardown hooks run FIRST so per-secondary GPU resources are freed before the devices.
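The teardown ordering (hooks first, then secondary devices, primary untouched) can be sketched without Vulkan as follows. TeardownOrderSketch is a hypothetical stand-in that only records events; clearing the hook list after running it is an assumption of this sketch, not documented behaviour.

```python
class TeardownOrderSketch:
    """Hypothetical stand-in that records the order of teardown events."""

    def __init__(self, secondary_names):
        self._secondaries = list(secondary_names)
        self._hooks = []
        self.events = []

    def register_teardown(self, hook) -> None:
        self._hooks.append(hook)

    def destroy(self) -> None:
        # 1. Registered hooks run first, while the secondary devices are alive.
        for hook in self._hooks:
            hook()
        self._hooks.clear()  # assumption: hooks run once
        # 2. Only the secondary devices are destroyed; the primary is not listed.
        for name in self._secondaries:
            self.events.append(f"destroy-device:{name}")
        self._secondaries.clear()


manager = TeardownOrderSketch(["gpu1"])
manager.register_teardown(lambda: manager.events.append("free-staging"))
manager.destroy()
```

After destroy() the recorded events are "free-staging" followed by "destroy-device:gpu1"; a second destroy() adds nothing, matching the documented no-op when no secondaries remain.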

class simvx.graphics.gpu.multi_device.SRUAssignment[source]

Which device renders one SRU and whether its result must be transferred.

device_index 0 means the SRU renders on the primary and is already resident for compositing (needs_transfer is False). A non-zero index means the SRU is offloaded to that secondary and its finished colour image must be moved to the primary before compositing (needs_transfer is True).

sru_id: int

None

device_index: int

None

property needs_transfer: bool[source]
simvx.graphics.gpu.multi_device.assign_srus(srus: list[Any], device_count: int, *, sru_id: Any = None, cost: Any = None) list[simvx.graphics.gpu.multi_device.SRUAssignment][source]

Decide which device renders each SRU (pure, unit-testable, no Vulkan).

Policy (TAA-safe by construction: an SRU is assigned to exactly one device, so its temporal history never crosses devices):

  • device_count <= 1: EVERY SRU stays on GPU 0. This is the single-GPU / unopted path and produces the same per-SRU work as today, byte-identical (no transfer is ever flagged).

  • device_count >= 2: the main scene is implicitly GPU 0 (it is not an SRU and is not in this list). Among the SRUs, the cheapest stay on GPU 0 (composited locally) and the heaviest independent SRUs are offloaded to the secondaries round-robin. Concretely: sort SRUs by descending cost and walk them, sending the next heaviest to the least-loaded secondary while that keeps the primary from being the bottleneck; ties and the remainder stay on GPU 0.

The returned list preserves the INPUT order of srus (the P1 producer-before-consumer topological order), so a consumer SRU still follows the producer it samples; only the device choice is decided here.

Args:

  • srus: Ordered SRU plans (SubViewportSRU or any object exposing the cost inputs). Order is preserved in the result.

  • device_count: Number of devices available (MultiDeviceManager.device_count).

  • sru_id: Optional accessor sru -> int for the stable id; defaults to reading sru.sru_id.

  • cost: Optional accessor sru -> int overriding the default instance-count heuristic (handy for tests).
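To make the invariants concrete (one device per SRU, input order preserved, nothing offloaded when device_count <= 1), here is a simplified greedy sketch. It is not the library's rule: AssignmentSketch, assign_sketch and the primary_base_cost parameter (standing in for the main scene's load on GPU 0) are hypothetical, and the sketch always picks the least-loaded device with ties going to GPU 0.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AssignmentSketch:
    """Hypothetical stand-in for SRUAssignment."""
    sru_id: int
    device_index: int

    @property
    def needs_transfer(self) -> bool:
        return self.device_index != 0


def assign_sketch(costs, device_count, primary_base_cost=0):
    """costs is an ordered list of (sru_id, cost) pairs."""
    if device_count <= 1:
        # Single-GPU / unopted path: everything stays on GPU 0.
        return [AssignmentSketch(sid, 0) for sid, _ in costs]

    # loads[0] starts at the (assumed) main-scene cost on the primary.
    loads = [primary_base_cost] + [0] * (device_count - 1)
    chosen = {}
    # Walk heaviest first; ties in load resolve to the lowest index (GPU 0).
    for sid, cost in sorted(costs, key=lambda item: item[1], reverse=True):
        device = min(range(device_count), key=lambda i: (loads[i], i))
        loads[device] += cost
        chosen[sid] = device

    # Emit in INPUT order so producer-before-consumer ordering is preserved.
    return [AssignmentSketch(sid, chosen[sid]) for sid, _ in costs]


example = assign_sketch([(1, 100), (2, 5), (3, 50)], device_count=2,
                        primary_base_cost=60)
```

In this example SRU 1 (the heaviest) goes to GPU 1, SRU 3 stays on GPU 0, and SRU 2 goes to GPU 1, while the returned list keeps the input order 1, 2, 3.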

class simvx.graphics.gpu.multi_device.TransferMethod[source]

Selectable cross-device image-transfer strategies (enum-like constants).

NONE

‘none’

SRU already lives on the primary device: compositing samples it directly.

STAGING_COPY

‘staging_copy’

Secondary image -> host-visible staging buffer -> primary image. The guaranteed floor, works on any pair of devices (the colour RenderTarget already carries TRANSFER_SRC|TRANSFER_DST).

DMABUF

‘dmabuf’

Zero-copy via VK_KHR_external_memory_fd (dma-buf import/export). The rig optimisation; requires the external-memory extensions to be enabled on both devices. Gated and currently raises until the rig path is implemented.

simvx.graphics.gpu.multi_device.select_transfer_method(assignment: simvx.graphics.gpu.multi_device.SRUAssignment, capabilities: simvx.graphics.gpu.capabilities.RenderCapabilities | None, *, prefer_dmabuf: bool = True) str[source]

Pick the transfer strategy for one SRU assignment (pure, no Vulkan).

  • An SRU on the primary needs no transfer -> :data:TransferMethod.NONE.

  • An offloaded SRU uses :data:TransferMethod.DMABUF when the caller prefers it AND the capability snapshot reports the external-memory-fd path enabled; otherwise the always-available :data:TransferMethod.STAGING_COPY.

The capability gate is :attr:RenderCapabilities.external_memory_fd_enabled: the extension must be enabled at device creation, not merely probed available. A probed-but-not-enabled device cannot export/import fds, so DMABUF must not be selected for it (that would later raise with no working path). On this dev box the single-GPU path never enables it, so the field is False, the staging-copy floor is always chosen, and the dma-buf raise is never reached.
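The selection rule above fits in a few lines. The sketch below is a self-contained illustration, not the module's code: CapsSketch is a hypothetical stand-in for RenderCapabilities carrying only the external_memory_fd_enabled field, and the function takes a device index rather than an SRUAssignment.

```python
from dataclasses import dataclass

# String values taken from the TransferMethod constants documented above.
NONE, STAGING_COPY, DMABUF = "none", "staging_copy", "dmabuf"


@dataclass
class CapsSketch:
    """Hypothetical capability snapshot with the one field the gate reads."""
    external_memory_fd_enabled: bool = False


def select_sketch(device_index: int, caps, *, prefer_dmabuf: bool = True) -> str:
    if device_index == 0:
        return NONE  # primary-resident SRU: nothing to move
    if prefer_dmabuf and caps is not None and caps.external_memory_fd_enabled:
        return DMABUF  # extension enabled at device creation, not just probed
    return STAGING_COPY  # the always-available floor
```

Passing caps=None, or a snapshot whose flag is False, always falls back to the staging copy, which is the behaviour described for the single-GPU dev box.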

class simvx.graphics.gpu.multi_device.CrossDeviceTransfer[source]

Moves a finished offscreen colour image from a secondary to the primary.

Holds the source (secondary) and destination (primary) :class:DeviceSlot plus the chosen :class:TransferMethod. Cross-device synchronisation is explicit: each device signals a fence/timeline when its half of the copy is done, and the primary’s composite waits on the import being complete. The sync handles are carried on this object so the rig wiring is a single seam.

Implemented: :meth:run_staging_copy defines the staging-copy sequence (secondary GPU copy-to-buffer -> host-visible staging -> primary copy-to-image). The CPU-roundtrip floor: correct on any device pair, slower than dma-buf. The actual vkCmd* recording is the rig-side completion because it needs both devices’ live command pools + a host-visible staging allocation per device, which cannot be exercised on this single-GPU box; the method documents and gates that precisely rather than emitting unverifiable copy code.

Gated (rig): :meth:run_dmabuf raises a clear, actionable error until VK_KHR_external_memory_fd is enabled and the import/export is wired on the rig.

src: simvx.graphics.gpu.multi_device.DeviceSlot

None

dst: simvx.graphics.gpu.multi_device.DeviceSlot

None

method: str

None

src_done: Any

None

dst_ready: Any

None

width: int

0

height: int

0

bytes_per_pixel: int

8

extra: dict

‘field(…)’

run(*, src_image: Any = None, dst_image: Any = None) None[source]

Execute the selected transfer. Dispatches by :attr:method.

:data:TransferMethod.NONE is a no-op (single-GPU / primary-resident SRU, the byte-identical path). The other two dispatch to their handlers.
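A minimal sketch of this dispatch, under stated assumptions: TransferDispatchSketch is hypothetical, the handlers only record that they were called, and the exception types (NotImplementedError for the gated dma-buf path, ValueError for an unknown method) are choices made for the sketch rather than the module's documented types.

```python
class TransferDispatchSketch:
    """Hypothetical stand-in showing dispatch-by-method only."""

    def __init__(self, method: str):
        self.method = method
        self.calls = []

    def run(self) -> None:
        if self.method == "none":
            return  # primary-resident SRU: no work at all
        if self.method == "staging_copy":
            return self.run_staging_copy()
        if self.method == "dmabuf":
            return self.run_dmabuf()
        raise ValueError(f"unknown transfer method: {self.method!r}")

    def run_staging_copy(self) -> None:
        self.calls.append("staging_copy")

    def run_dmabuf(self) -> None:
        # Assumed error type; the real method raises a gated, actionable error.
        raise NotImplementedError(
            "dma-buf transfer requires VK_KHR_external_memory_fd on both devices"
        )
```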

run_staging_copy(*, src_image: Any = None, dst_image: Any = None) None[source]

Staging-copy floor: secondary image -> host staging -> primary image.

The CPU-roundtrip floor: correct on any device pair, slower than dma-buf. src_image is the finished SRU colour image on the SECONDARY device (left in SHADER_READ_ONLY_OPTIMAL by the offscreen pass; the RenderTarget carries TRANSFER_SRC_BIT). dst_image is the primary-device image the main scene samples for this SubViewport feed (carries TRANSFER_DST_BIT).

Exact sequence + every wait (the contract recorded here, verified on the rig). All staging buffers are host-visible|host-coherent and pre-built once per (transfer, size) into :attr:extra by :meth:ensure_staging:

  1. SOURCE (secondary) device, one-shot command buffer on the secondary command pool:

    a. barrier src_image: SHADER_READ_ONLY_OPTIMAL -> TRANSFER_SRC_OPTIMAL (src stage FRAGMENT_SHADER, dst stage TRANSFER; src access SHADER_READ, dst access TRANSFER_READ).

    b. vkCmdCopyImageToBuffer src_image -> src_staging (tightly packed, bufferRowLength=0, bufferImageHeight=0).

    c. barrier src_image: TRANSFER_SRC_OPTIMAL -> SHADER_READ_ONLY_OPTIMAL so the secondary can render into / sample it again next frame.

    d. submit on src.graphics_queue signalling :attr:src_done (a fence on the secondary device).

    WAIT: the host read in step 2 blocks on src_done (vkWaitForFences); the GPU copy must complete before the bytes are mapped. The offscreen SRU render fence is itself waited on before this submit by the caller (_record_offloaded_sru), so the colour image is fully written first.

  2. HOST: map src_staging (secondary device), memmove the height * row_bytes bytes into dst_staging (primary device), unmap both. Both are host-coherent so no explicit flush/invalidate is needed. WAIT: gated on src_done from 1d before the read; the write into dst_staging happens-before the primary GPU read in 3 because step 3 is submitted only after this memmove returns on the same (host) thread.

  3. DESTINATION (primary) device, one-shot command buffer on the primary command pool:

    a. barrier dst_image: SHADER_READ_ONLY_OPTIMAL -> TRANSFER_DST_OPTIMAL.

    b. vkCmdCopyBufferToImage dst_staging -> dst_image.

    c. barrier dst_image: TRANSFER_DST_OPTIMAL -> SHADER_READ_ONLY_OPTIMAL so the main composite pass samples it.

    d. submit on dst.graphics_queue signalling :attr:dst_ready (a fence on the primary device).

    WAIT: the main composite pass that samples dst_image runs after dst_ready; the caller waits on it before recording the frame’s main pass (or, in the same-frame in-cmd model, this transfer is submitted + waited before the main pass is recorded).

The implementation records exactly this. It needs both devices’ command pools + a host-visible staging buffer per device, supplied via :meth:ensure_staging. On the single-GPU path the method is NONE and this is never reached; it is exercised end-to-end only on the multi-GPU rig.
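The host-side part of the sequence (step 2) is the only part that can be shown without a GPU. The sketch below models the two mapped staging allocations as bytearrays, which is an assumption: on the rig they would be mapped Vulkan memory. staging_size and host_copy are hypothetical helper names; the default of 8 bytes per pixel matches the bytes_per_pixel default listed for this class.

```python
import ctypes


def staging_size(width: int, height: int, bytes_per_pixel: int = 8) -> int:
    """Raw colour image bytes, as sized by ensure_staging."""
    return width * height * bytes_per_pixel


def host_copy(src_staging: bytearray, dst_staging: bytearray,
              width: int, height: int, bytes_per_pixel: int = 8) -> int:
    """Copy height * row_bytes bytes from the source to the destination staging.

    Stand-in for step 2; the caller is assumed to have waited on src_done.
    """
    row_bytes = width * bytes_per_pixel  # tightly packed rows
    total = height * row_bytes
    if len(src_staging) < total or len(dst_staging) < total:
        raise ValueError("staging buffer smaller than the image")
    src_view = (ctypes.c_char * total).from_buffer(src_staging)
    dst_view = (ctypes.c_char * total).from_buffer(dst_staging)
    ctypes.memmove(dst_view, src_view, total)
    return total


# A 4x2 image at 8 bytes per pixel needs 64 bytes of staging on each side.
src = bytearray(range(64))
dst = bytearray(staging_size(4, 2))
copied = host_copy(src, dst, width=4, height=2)
```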

ensure_staging(src_command_pool: Any, dst_command_pool: Any) dict[source]

Allocate (once) the per-device host-visible staging buffers + cache pools.

Builds one host-visible|host-coherent buffer on each device, sized width * height * bytes_per_pixel (the raw colour image bytes). Cached on :attr:extra['staging'] so a steady-state per-frame transfer reuses them; reallocated only when the size changes (a SubViewport resize). Returns the staging dict. Rig-side GPU allocation; never reached on this box.

destroy() None[source]

Free the per-device staging buffers this transfer allocated (rig-side).

abstractmethod run_dmabuf(*, src_image: Any = None, dst_image: Any = None) None[source]

Zero-copy dma-buf transfer (VK_KHR_external_memory_fd). Rig optimisation.

Export the secondary colour image’s memory as an opaque fd / dma-buf, import it on the primary as an external image, and composite directly with a single cross-device semaphore wait (no host roundtrip). Requires the external-memory-fd extension enabled on BOTH devices at init.

Gated: raises until the rig path is implemented and the external_memory_fd capability is enabled.

class simvx.graphics.gpu.multi_device.OffloadRoute[source]

The per-SRU offload decision the recording path consults.

Computed once per frame by :meth:SRUOffloadCoordinator.plan from the ordered SRU list. device_index 0 means “render this SRU on the primary, exactly as today” (offloaded is False, transfer is :data:TransferMethod.NONE); a non-zero index means the SRU renders on that secondary’s renderer and its colour image is moved to the primary via :attr:transfer before the main pass samples it.

sru_id: int

None

device_index: int

None

transfer: str

None

property offloaded: bool[source]

True when this SRU renders on a secondary device (not GPU 0).

class simvx.graphics.gpu.multi_device.SRUOffloadCoordinator(manager: simvx.graphics.gpu.multi_device.MultiDeviceManager, capabilities: simvx.graphics.gpu.capabilities.RenderCapabilities | None = None, *, prefer_dmabuf: bool = True, content_scale: tuple[float, float] = (1.0, 1.0), secondary_renderer_factory: Any = None)[source]

Decides + drives where each SubViewport SRU renders across the devices.

Glue between the (already-tested) :func:assign_srus policy, the :func:select_transfer_method selector, and the :class:MultiDeviceManager device topology. The recording path (SceneAdapter.render_sru_from_plan / render_to_target) consults a coordinator, when one is present, to decide per SRU whether to take today’s primary-device path or route the SRU to a secondary device and transfer the result back.

Constructed only when the manager is actively multi-GPU (opted in AND >= 2 devices). On the single-GPU / unopted path the engine builds no coordinator, so the recording path’s coordinator is None branch is taken and the frame is byte-identical to today.

The decision logic (plan / route_for) is pure and unit-tested with no Vulkan. :meth:render_offloaded is the seam where a secondary-assigned SRU would be recorded on its device’s renderer and transferred back; that needs a per-device renderer (DeviceSlot.renderer), which is None on every box without the rig-side per-device-renderer construction, so it raises a clear, capability-gated error rather than emitting unverifiable cross-device code.

Initialization

property active: bool[source]

True only when the backing manager is an opted-in multi-device renderer.

plan(srus: list[Any], *, sru_id: Any = None, cost: Any = None) list[simvx.graphics.gpu.multi_device.OffloadRoute][source]

Compute + cache this frame’s per-SRU routes from the ordered SRU list.

Returns one :class:OffloadRoute per SRU in INPUT order (the P1 producer-before-consumer topological order is preserved). When the coordinator is inactive (single-GPU / unopted) every route stays on GPU 0 with :data:TransferMethod.NONE, so the caller takes today’s path.

sru_id / cost are forwarded to :func:assign_srus so a caller whose ordered items are not SubViewportSRU (e.g. the synchronous path’s live nodes) can supply the id + offload-cost accessors.

route_for(sru_id: int) simvx.graphics.gpu.multi_device.OffloadRoute | None[source]

The cached route for sru_id from the last :meth:plan, or None.

None means “no decision recorded” (the SRU was not in the planned list); callers treat that as “render on the primary”, today’s path.
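The plan / route_for contract can be sketched as a small cache plus a fallback rule. Everything here is hypothetical (RouteSketch, RouteCacheSketch, renders_on_primary), and the sketch hard-codes the staging-copy floor for offloaded routes instead of consulting capabilities.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RouteSketch:
    """Hypothetical stand-in for OffloadRoute."""
    sru_id: int
    device_index: int
    transfer: str

    @property
    def offloaded(self) -> bool:
        return self.device_index != 0


class RouteCacheSketch:
    """Caches one frame's routes, keyed by SRU id."""

    def __init__(self):
        self._routes = {}

    def plan(self, assignments):
        """assignments is an ordered list of (sru_id, device_index) pairs."""
        routes = [
            RouteSketch(sid, dev, "none" if dev == 0 else "staging_copy")
            for sid, dev in assignments
        ]
        self._routes = {r.sru_id: r for r in routes}  # replaces last frame's plan
        return routes

    def route_for(self, sru_id: int):
        return self._routes.get(sru_id)  # None when the SRU was not planned


def renders_on_primary(route) -> bool:
    # None ("no decision recorded") is treated as the primary path.
    return route is None or not route.offloaded
```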

transfer_for(route: simvx.graphics.gpu.multi_device.OffloadRoute, *, width: int = 0, height: int = 0) simvx.graphics.gpu.multi_device.CrossDeviceTransfer[source]

Build the :class:CrossDeviceTransfer for an offloaded route.

Pairs the secondary (route.device_index) and primary (0) device slots with the chosen transfer method so the rig wiring is a single seam.

render_offloaded(route: simvx.graphics.gpu.multi_device.OffloadRoute) Any[source]

Return the per-device renderer for an offloaded SRU, building it if needed.

Resolution order:

  1. An explicitly attached renderer (MultiDeviceManager.attach_renderer) is returned as-is (the rig may pre-build + bind one; also the unit-test seam).

  2. Otherwise, if a :attr:_secondary_renderer_factory was injected (rig), lazily build the secondary :class:SecondaryRenderContext facade, run the factory to construct + setup() a Renderer(facade) on the secondary device, bind it to the slot, and return it.

  3. With neither (this single-GPU box: no factory is ever injected because no secondary device exists), raise the clear, capability-gated rig-completion error. NEVER reached on the single-GPU / unopted path: that path has no coordinator at all, so the seam’s coordinator is None branch keeps it byte-identical.
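The resolution order above amounts to a three-way fallback. The sketch below uses hypothetical names (RendererSlotSketch, resolve_renderer) and a RuntimeError as the assumed error type; the real method also builds a secondary render context facade, which is omitted here.

```python
class RendererSlotSketch:
    """Hypothetical stand-in for a secondary DeviceSlot."""

    def __init__(self, index: int):
        self.index = index
        self.renderer = None


def resolve_renderer(slot, factory=None):
    # 1. An explicitly attached renderer wins and is returned as-is.
    if slot.renderer is not None:
        return slot.renderer
    # 2. Otherwise build one lazily with the injected factory and bind it.
    if factory is not None:
        slot.renderer = factory(slot)
        return slot.renderer
    # 3. With neither, fail loudly (assumed exception type).
    raise RuntimeError(
        f"no per-device renderer for slot {slot.index}: attach one or inject a factory"
    )
```

Because the factory result is bound to the slot, a second call takes branch 1 and the factory runs at most once per slot.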

render_sru_offloaded(sru: Any, primary_dst_image: Any) bool[source]

Render one SRU on its secondary device and transfer it to the primary image.

End-to-end multi-GPU SubViewport offload for one SRU (the :class:~simvx.graphics.renderer.render_packet.SubViewportSRU plan):

  1. Resolve the SRU’s route (must be offloaded; else returns False so the caller takes the primary path).

  2. Ensure the secondary render context (facade + Renderer + residency).

  3. Mirror the SRU’s mesh geometry (and, when textured-residency is wired, its sampled textures) onto the secondary device via :class:SecondaryResidency.

  4. Size / create the secondary offscreen :class:RenderTarget for this SRU.

  5. Record + submit the SRU’s draws into that target on the secondary device (one submit on the secondary graphics queue; the offscreen RenderTarget leaves the colour image in SHADER_READ_ONLY_OPTIMAL).

  6. Run the :class:CrossDeviceTransfer (staging-copy floor) to move the secondary colour image into primary_dst_image (the SubViewport’s primary-device bindless image the main scene samples).

Returns True when the SRU was handled on a secondary, False when it was not offloaded (caller renders it on the primary as today). The actual per-device GPU record + submit (step 5) is delegated to the injected secondary renderer; this method owns the orchestration + the cross-device transfer. Exercised on the 4x Arc Pro B70 rig; never reached on the single-GPU path.

destroy() None[source]

Tear down all lazily-built secondary render contexts (facade + target + transfer).