Instanced rendering performance issue

phoboz · November 18, 2018, 1:59am

Hi There!

I’ve got a custom component, which renders millions of camera-facing quads (Point Clouds).

My original approach was to generate 4 vertices per point and use that to populate one, huge Vertex Buffer. Unfortunately, that could quickly exceed available VRAM - even using just Location and Color per vertex results in 64 bytes / point, multiply by 100M points (or more) and we end up with 6GB for Vertex Buffer alone.
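For anyone checking the numbers, here is a minimal sketch of the arithmetic behind that figure. It assumes a tightly packed vertex of three floats for Location plus a 4-byte packed Color, which matches the 64 bytes / point quoted above; `PointVertex` and `MergedVertexBufferBytes` are hypothetical names, not engine types.

```cpp
#include <cstdint>

// Hypothetical vertex layout (an assumption): float3 location + 8-bit RGBA color.
struct PointVertex {
    float Location[3];      // 12 bytes
    std::uint8_t Color[4];  // 4 bytes
};
static_assert(sizeof(PointVertex) == 16, "expected a tightly packed 16-byte vertex");

// Bytes needed for a merged vertex buffer that stores every corner of every quad.
constexpr std::uint64_t MergedVertexBufferBytes(std::uint64_t NumPoints,
                                                std::uint64_t VertsPerPoint = 4) {
    return NumPoints * VertsPerPoint * sizeof(PointVertex);
}
```

With these assumptions, `MergedVertexBufferBytes(100000000)` comes to 6,400,000,000 bytes, i.e. roughly the 6GB mentioned.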

Actual number of points rendered at any given time was controlled by the dynamically generated Index Buffer.

Since the geometry of each quad is identical, next logical step was to use Instancing. Simple Vertex and Index buffers to build a quad, and rest of the data as separate per-instance buffer. All the quads have identical rotation and scale, which allowed me to use just a single FVector (location) instead of full Transform as per-instance data.
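To illustrate the layout described above, here is a rough CPU-side sketch of what the vertex shader would do for each (instance, corner) pair: take the shared quad corner and offset it from the per-instance location along the camera's right and up axes. All names (`Vec3`, `QuadCorners`, `ExpandCorner`, `HalfSize`) are made up for illustration and are not the actual component code.

```cpp
// Minimal vector type for the sketch (a stand-in for FVector).
struct Vec3 { float X, Y, Z; };

constexpr Vec3 operator+(Vec3 A, Vec3 B) { return {A.X + B.X, A.Y + B.Y, A.Z + B.Z}; }
constexpr Vec3 operator*(Vec3 A, float S) { return {A.X * S, A.Y * S, A.Z * S}; }

// The whole static vertex buffer: four corners in [-1, 1] billboard space.
struct Corner { float U, V; };
constexpr Corner QuadCorners[4] = {{-1.f, -1.f}, {1.f, -1.f}, {1.f, 1.f}, {-1.f, 1.f}};

// The whole static index buffer: two triangles as a triangle list.
constexpr unsigned short QuadIndices[6] = {0, 1, 2, 0, 2, 3};

// Per-instance data is only a location; rotation and scale are shared by all quads,
// so the corner is pushed out along the camera axes by a common half size.
inline Vec3 ExpandCorner(Vec3 InstanceLocation, int CornerIndex,
                         Vec3 CameraRight, Vec3 CameraUp, float HalfSize) {
    return InstanceLocation
         + CameraRight * (QuadCorners[CornerIndex].U * HalfSize)
         + CameraUp * (QuadCorners[CornerIndex].V * HalfSize);
}
```

The point of the design is that the 4 vertices and 6 indices are uploaded once, and only one location per point has to live in the instance buffer.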

This is where the trouble begins. I was expecting some performance hit due to per-instance overhead, but the result is nearly 2x higher frame time. My test point cloud had about 1.2M points visible (2.4M triangles), and the render times were 9ms (Merged) vs 16ms (Instanced). I’ve also noticed that using only a single triangle per point (instead of a quad, effectively cutting the total number of polygons to render by half) reduces Merged time to ~6ms, while not having any effect on the Instanced method, which suggests the bottleneck is the sheer number of instances rather than the polycount.

Is that kind of overhead within what’s to be expected, or am I missing something here?

Full source available upon request.

Thanks!

Deathrey · November 18, 2018, 12:47pm

In the given case (simple instances, large number of instances), one large merged mesh is expected to be faster, at an increased memory cost.

Generally, you should implement some sort of data structure for efficient rendering of those things. 100 million points rendered does not make sense, since even at the most brave resolutions you have less than 10 million pixels on screen.
You gotta bring the count down before it reaches GPU, both in terms of geometry rendered and memory bandwidth usage.

But as for 16 ms for just 2 million instances, that is an indication that you are doing something wrong. It should be significantly faster.

phoboz · November 18, 2018, 1:23pm

Thanks for your reply.

I fully expected a drop in performance, just not by that much.

I’m not rendering all 100M at once. There is a LOD in place (using a dynamically generated IndexBuffer), which effectively reduces the actually rendered count to a few million. However, I have to pre-allocate the VertexBuffer data, resulting in a huge VRAM footprint.

Alternative would be to generate the VertexBuffer every time the visible point set changes (so potentially every frame), requiring many GB/s of VRAM transfer (64 bytes per point x 2M points x 60 frames = 7+ GB/s). Granted, I haven’t tested how much performance that would cost, but seeing how it can stutter transferring much smaller IndexBuffer, I imagine the performance to be even worse.

If the 16ms doesn’t sound right, could you point me to what could potentially cause such poor performance?

Erasio · November 18, 2018, 2:01pm

While I’m not sure whether this is affecting you, another thing with instanced meshes is that culling doesn’t work on a per instance basis. As soon as one element of your instanced mesh is visible all will be rendered whether they are in view or not.

If you have a lot of elements beside or behind you, this could result in a consistent but permanent performance hit (as everything is rendered all the time instead of just while visible).

phoboz · November 18, 2018, 4:01pm

Hi, and thanks for the suggestion

Just to clarify, I’m not using InstancedStaticMeshComponent, but a custom component implementing instancing.

Frustum culling is being handled as part of dynamic LOD system and is not the culprit. Both frame times provided were comparing the same effective number of polygons.

Deathrey · November 18, 2018, 5:18pm

"Alternative would be to generate the VertexBuffer every time the visible point set changes (so potentially every frame),"

That is definitely not a way to go.

"However, I have to pre-allocate the VertexBuffer data resulting in huge VRAM footprint."

I am not really understanding that part, especially in view of the fact that your vertex buffer should be nothing but 4 verts.

It is your instance buffer that should be your main memory user.

Rendering two million instances should never bottleneck you. Updating instance buffer for two million instances? Quite probably. So how exactly are you updating it?

phoboz · November 18, 2018, 5:59pm

I might not have explained it correctly, so let me try in more detail.

Merged Approach

Large, static VertexBuffer, containing all vertices for all points. Has to be pre-allocated in GPU prior to rendering
Dynamic IndexBuffer populated by the LOD system, driving which parts of the VB to render.

Instanced Approach

Static VertexBuffer containing 4 vertices
Static IndexBuffer containing 6 indices (using TriangleList)
Dynamic InstanceBuffer populated by LOD system, containing data for visible points only

Point Cloud data is structured using Octree. The LOD system is running frustum culling and LOD selection per frame. In case the visible data set changes (camera translation, rotation, etc.), relevant dynamic buffer is updated. Each LOD is actually rendered as separate MeshBatch and uses its own independent, dynamic buffer. This results in more draw calls (one per LOD), but allows for smaller and less frequent update calls to the buffer.
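As a rough illustration of that per-frame pass, here is a sketch of an octree traversal that frustum-culls nodes by bounding sphere and collects point indices into one list per LOD. It assumes that each octree depth maps to one LOD and that refinement is driven by a simple radius-over-distance test; the real selection criteria are not shown in this thread, and every name here (`OctreeNode`, `CollectVisible`, `DetailThreshold`) is hypothetical.

```cpp
#include <cmath>
#include <cstddef>
#include <cstdint>
#include <memory>
#include <utility>
#include <vector>

struct Vec3 { float X, Y, Z; };
struct Plane { Vec3 Normal; float D; };  // inside when Dot(Normal, P) + D >= 0

struct OctreeNode {
    Vec3 Center{};
    float Radius = 0.f;                 // bounding sphere radius
    std::vector<std::uint32_t> Points;  // point indices stored at this level
    std::vector<std::unique_ptr<OctreeNode>> Children;
};

inline float Dot(Vec3 A, Vec3 B) { return A.X * B.X + A.Y * B.Y + A.Z * B.Z; }

// A sphere is rejected only if it lies fully behind at least one plane.
inline bool SphereInFrustum(const std::vector<Plane>& Frustum, Vec3 C, float R) {
    for (const Plane& P : Frustum) {
        if (Dot(P.Normal, C) + P.D < -R) return false;
    }
    return true;
}

// Appends the points of every visible node to PerLOD[Depth]. Children are only
// visited while the node looks large enough (assumed criterion: Radius / Distance).
inline void CollectVisible(const OctreeNode& Node, const std::vector<Plane>& Frustum,
                           Vec3 Camera, float DetailThreshold, std::size_t Depth,
                           std::vector<std::vector<std::uint32_t>>& PerLOD) {
    if (!SphereInFrustum(Frustum, Node.Center, Node.Radius)) return;

    if (PerLOD.size() <= Depth) PerLOD.resize(Depth + 1);
    PerLOD[Depth].insert(PerLOD[Depth].end(), Node.Points.begin(), Node.Points.end());

    const Vec3 ToNode{Node.Center.X - Camera.X, Node.Center.Y - Camera.Y,
                      Node.Center.Z - Camera.Z};
    const float Distance = std::sqrt(Dot(ToNode, ToNode));
    if (Node.Radius >= DetailThreshold * Distance) {
        for (const auto& Child : Node.Children) {
            CollectVisible(*Child, Frustum, Camera, DetailThreshold, Depth + 1, PerLOD);
        }
    }
}
```

Each entry of `PerLOD` would then feed its own dynamic buffer and MeshBatch, which is why a change in one LOD does not force the other buffers to be re-uploaded.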

Causing too much data to be updated at any single time results in very distinct, but momentary stuttering (as expected), and can happen in both the Merged and Instanced approaches. However, the 16ms result happens regardless of whether the camera is moving or stationary, which would indicate it’s not the buffer update that bottlenecks the process (since there are no updates when the camera stays still).
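The "no updates when the camera stays still" part can be sketched roughly like this: remember which nodes were visible last frame and only touch the buffer when that set differs. This is a simplified stand-in, assuming visibility is tracked as a list of node IDs; `DynamicInstanceBuffer` and `UpdateIfChanged` are hypothetical names, and the counter merely stands in for the actual GPU upload.

```cpp
#include <cstdint>
#include <vector>

struct DynamicInstanceBuffer {
    std::vector<std::uint32_t> LastVisibleNodes;  // visible set from the previous update
    int UploadCount = 0;                          // stand-in for real buffer uploads

    // Returns true if the visible set changed and an upload was issued.
    bool UpdateIfChanged(const std::vector<std::uint32_t>& VisibleNodes) {
        if (VisibleNodes == LastVisibleNodes) {
            return false;  // stationary camera: nothing to re-upload
        }
        LastVisibleNodes = VisibleNodes;
        ++UploadCount;     // a real implementation would lock and fill the buffer here
        return true;
    }
};
```

With a check like this in place, a stationary camera produces zero uploads, so any remaining frame time has to come from somewhere other than the buffer update.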

Hope this helps clear the situation a little!

Deathrey · November 18, 2018, 6:26pm

I frankly expected buffer update to be bottlenecking you and was about to suggest splitting whole octree with separate instance buffers, but you are already doing it. At this point I can’t say anything constructive without seeing profiler and stats.

phoboz · November 18, 2018, 7:05pm

Profiler and Stats dumps

Exports contain dumps from Merged and Instanced approaches, same viewpoint, same settings. Visible point count is ~4.4M. I’ve included 2 scenarios, where each point is rendered as a single triangle (4.4M) and as a full quad (8.8M).

Using Merged, this results in 8.6ms and 12ms respectively.
Using Instanced, this results in 17.7ms and 18ms respectively.

Both stats recorded with stationary camera.

Deathrey · November 18, 2018, 11:15pm

Whatever you are doing in PointCloudSceneProxy_GetDynamicMeshElements bottlenecks you. Also note bloated occlusion query times. What are you trying to occlusion cull there?

Deathrey · November 18, 2018, 11:48pm

Slate is just stalling. Not an indicator. Yeah it is clearly bottlenecked by rendering thread.

phoboz · November 18, 2018, 10:02pm

Results of further testing:

Performance seems highly sensitive to the amount of data passed within the instance buffer (14.4ms position vs 18ms position + color), regardless of whether that data is actually being used inside shader code. This would indicate that iteration over the instance buffer is what’s actually causing the performance loss. Using separate buffers for position and color yielded no difference.
No performance difference if the instance buffer has been marked as BUF_STATIC, and the content is pre-allocated, with no further update calls
Interestingly, no performance difference when batching 8 instances together

phoboz · November 18, 2018, 11:41pm

PointCloudSceneProxy_GetDynamicMeshElements triggers LOD calculation and dynamic buffer updates. Interestingly, Instanced is actually faster than Merged (5.8ms vs 6.9ms). Will add more benchmark groups to see where the difference comes from.

I’m not doing any occlusion culling myself, I assumed it’s the engine, and that the query time scales with total polycount. Any way to manually exclude specific objects from query?

Now that I’ve looked, I’ve also noticed greatly increased SlateDrawWindowsCommand in Instanced (1.4ms vs 5.5ms), any idea why that could be?

Altogether those times do indeed add up to nearly the whole frame time. Do you think it’s the RenderThread that bottlenecks the whole thing?

phoboz · November 19, 2018, 12:43am

I’m gonna make this worse now…

To investigate the CPU effect on performance, I’ve reduced the depth of the Octree. This resulted in slightly higher polycount (9.2M vs 8.8M), but dropped the CPU time to just over 2ms (from 6.9ms) and the culling went down to almost nothing. However, this resulted in increased total frame time (18.8ms vs 18ms).

Attaching new stats.

Please help?