Update vertex buffer with data from GPU

I am trying to get data that I compute on the GPU using CUDA into a vertex buffer for rendering, without copying the data back to the host in between.

The standard way would be:

// The game thread does some computation and copies the result to CPU
void* gpuDataPtr = doSomeCudaComputation();
void* cpuDataPtr = new uint8_t[...];
cudaMemcpy(cpuDataPtr, gpuDataPtr, ..., cudaMemcpyDeviceToHost);

// Later the render thread updates the vertex buffer
void* cpuUnrealPtr = RHILockVertexBuffer(VertexBufferRHI, ..., RLM_WriteOnly);
FMemory::Memcpy(cpuUnrealPtr, cpuDataPtr, ...);
RHIUnlockVertexBuffer(VertexBufferRHI);

This does a GPU-to-CPU-to-GPU round trip, though, which can get quite costly when there is a lot of data.

Instead I want to do something like this:

// The game thread does some computation and copies the result to (a different location on the) GPU
void* gpuDataPtr = doSomeCudaComputation();
void* gpuDataPtrForRendering = nullptr;
cudaMalloc(&gpuDataPtrForRendering, ...);
cudaMemcpy(gpuDataPtrForRendering, gpuDataPtr, ..., cudaMemcpyDeviceToDevice);

// Later the render thread updates the vertex buffer
void* gpuUnrealPtr = MapVertexBufferIntoCuda(VertexBufferRHI);
cudaMemcpy(gpuUnrealPtr, gpuDataPtrForRendering, ..., cudaMemcpyDeviceToDevice);
UnmapVertexBufferFromCuda(VertexBufferRHI);

This only does GPU-to-GPU copies. I implemented MapVertexBufferIntoCuda() and UnmapVertexBufferFromCuda() using CUDA's Graphics Interop functionality, and it seems to work properly.
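For context, the map/unmap helpers are built on the standard Graphics Interop calls. A minimal sketch, assuming a D3D11 RHI and with all error checking omitted (NativeResource and the helper names are just placeholders, the registration would really happen once at buffer creation):

```cuda
#include <cuda_d3d11_interop.h> // cudaGraphicsD3D11RegisterResource etc.

// Register once when the vertex buffer is created. NativeResource is the
// underlying ID3D11Buffer* obtained from the RHI vertex buffer.
cudaGraphicsResource* CudaResource = nullptr;
cudaGraphicsD3D11RegisterResource(&CudaResource, NativeResource,
                                  cudaGraphicsRegisterFlagsNone);

// Map: yields a device pointer CUDA may read/write until the unmap call.
void* MapVertexBufferIntoCuda(cudaGraphicsResource* Resource)
{
    cudaGraphicsMapResources(1, &Resource, /*stream=*/0);
    void* DevPtr = nullptr;
    size_t NumBytes = 0;
    cudaGraphicsResourceGetMappedPointer(&DevPtr, &NumBytes, Resource);
    return DevPtr;
}

// Unmap: after this returns, the graphics API may use the buffer again.
void UnmapVertexBufferFromCuda(cudaGraphicsResource* Resource)
{
    cudaGraphicsUnmapResources(1, &Resource, /*stream=*/0);
}
```

Note that the map/unmap calls take a CUDA stream; the unmap only orders against work issued on that stream, so any device-to-device copy into the mapped pointer has to be issued on the same stream (here the default stream 0).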

My problem: when using the latter technique, some frames are broken and the object is not rendered correctly, almost as if I were modifying the buffer while it is being used. I tried to figure out whether RHILockVertexBuffer() and RHIUnlockVertexBuffer() do something special to protect buffers while they are being written to, but I couldn't work it out.
Can anyone help me?

I suggest taking a look at how the Niagara plugin does things. We use compute shaders to generate our particle data. We output the particle counts to an ancillary buffer and use the DrawIndirect API to render the data. This feels closer to your use case.

Files to look at specifically:
Engine\Plugins\FX\Niagara\Source\Niagara\Private\NiagaraEmitterInstanceBatcher.cpp - This is where we issue the compute shader call that runs our particle scripts

Engine\Plugins\FX\Niagara\Source\NiagaraVertexFactories - This folder contains all of the logic for drawing the sprites/meshes/etc that make up the visuals for a particle simulation

Engine\Shaders\Private\NiagaraEmitterInstanceShader.usf - This is the outer shell code that we use for the Niagara compute shaders, note the indirect argument buffer that is filled out at the end

Engine\Shaders\Private\NiagaraSpriteVertexFactory.ush - This is the vertex/pixel shader shell that is generated that uses the input data
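The counts-plus-indirect-draw pattern described above looks roughly like this on D3D11. This is only a sketch of the idea, not Niagara's actual code: buffer creation and the compute dispatch that fills the arguments are elided, and the names are hypothetical:

```cpp
// Argument layout consumed by DrawInstancedIndirect: four UINTs in order.
// The compute shader writes InstanceCount (the live particle count) into
// this buffer on the GPU, so the CPU never reads the count back.
struct FDrawArgs
{
    UINT VertexCountPerInstance; // e.g. 6 for a two-triangle sprite quad
    UINT InstanceCount;          // filled on the GPU by the particle script
    UINT StartVertexLocation;
    UINT StartInstanceLocation;
};

// ArgsBuffer must be created with D3D11_RESOURCE_MISC_DRAWINDIRECT_ARGS
// and bound as a UAV to the compute shader that emits the counts.
void DrawParticles(ID3D11DeviceContext* Context, ID3D11Buffer* ArgsBuffer)
{
    // The draw consumes the GPU-written arguments directly; no readback.
    Context->DrawInstancedIndirect(ArgsBuffer, /*AlignedByteOffsetForArgs=*/0);
}
```

This avoids the GPU-to-CPU synchronization point entirely, which is why the pattern maps well onto "generate data on the GPU, render it without touching the host".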

Thanks, I'll take a look at it! You do not use CUDA though, if I understand you correctly?

We do not, just the underlying platform’s compute shader API, so DX11 or DX12 compute shaders on PC.

Sorry to hijack this thread, but I was hoping @Shaun_Kime could answer this:

Will a Niagara system render properly if split across 2 nDisplay nodes?