Instanced rendering performance issue
I’ve got a custom component that renders millions of camera-facing quads (point clouds).
My original approach was to generate 4 vertices per point and use them to populate one huge vertex buffer. Unfortunately, that can quickly exceed available VRAM: even with just Location and Color per vertex, each point costs 64 bytes (4 vertices at 16 bytes each). Multiply that by 100M points (or more) and we end up with roughly 6GB for the vertex buffer alone.
The actual number of points rendered at any given time was controlled by a dynamically generated index buffer.
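For readers unfamiliar with the merged approach, here is a minimal sketch of how such an index buffer could be rebuilt each time the visible set changes. It is plain C++ rather than engine code; `BuildQuadIndices`, the visible-point list, and the corner order are assumptions made for illustration, not the original implementation.

```cpp
#include <cstdint>
#include <vector>

// Builds a 32-bit index buffer that draws only the listed points from a merged
// vertex buffer storing 4 consecutive vertices per point.
// Assumed corner order per quad: 0 = bottom-left, 1 = bottom-right,
// 2 = top-left, 3 = top-right (hypothetical, pick whatever your shader expects).
std::vector<std::uint32_t> BuildQuadIndices(const std::vector<std::uint32_t>& VisiblePoints)
{
    std::vector<std::uint32_t> Indices;
    Indices.reserve(VisiblePoints.size() * 6);

    for (std::uint32_t Point : VisiblePoints)
    {
        // First vertex of this point's quad. With 32-bit indices this wraps
        // once Point * 4 exceeds 2^32 - 1, i.e. past roughly 1.07 billion points.
        const std::uint32_t Base = Point * 4;

        // Two triangles per quad.
        Indices.push_back(Base + 0);
        Indices.push_back(Base + 1);
        Indices.push_back(Base + 2);
        Indices.push_back(Base + 2);
        Indices.push_back(Base + 1);
        Indices.push_back(Base + 3);
    }
    return Indices;
}
```

Calling `BuildQuadIndices({0, 5})` yields 12 indices, the last six of which reference vertices 20 to 23.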
Since the geometry of each quad is identical, the next logical step was to use instancing: a simple vertex and index buffer to build a single quad, with the rest of the data in a separate per-instance buffer. All the quads have identical rotation and scale, which allowed me to use just a single FVector (location) instead of a full Transform as per-instance data.
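To put rough numbers on the two layouts, the sketch below compares their memory footprint. The struct names are hypothetical, and the layout is an assumption: three floats for the location, a packed 8-bit RGBA color stored per instance, a two-float corner offset for the shared quad, and 16-bit indices for the shared quad.

```cpp
#include <cstdint>

// Assumed per-point payload: location (3 floats) plus packed RGBA color.
struct PointInstance
{
    float X, Y, Z;            // 12 bytes
    std::uint8_t R, G, B, A;  // 4 bytes
};
static_assert(sizeof(PointInstance) == 16, "expected a tightly packed 16-byte layout");

// Assumed vertex of the single shared quad: just a 2D corner offset.
struct QuadCorner
{
    float U, V;
};

// Merged: every point stores its payload once per corner (4 vertices).
constexpr std::uint64_t MergedBytes(std::uint64_t NumPoints)
{
    return NumPoints * 4 * sizeof(PointInstance);
}

// Instanced: one shared quad (4 corners, 6 indices) plus one payload per point.
constexpr std::uint64_t InstancedBytes(std::uint64_t NumPoints)
{
    return 4 * sizeof(QuadCorner) + 6 * sizeof(std::uint16_t)
         + NumPoints * sizeof(PointInstance);
}
```

Under these assumptions, 100M points take 6.4 billion bytes merged versus about 1.6 billion bytes instanced, so the instanced layout uses a quarter of the memory.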
This is where the trouble begins. I was expecting some performance hit due to per-instance overhead, but the result is nearly 2x higher frame time. My test point cloud had about 1.2M points visible (2.4M triangles), and the render times were 9ms (Merged) vs 16ms (Instanced). I’ve also noticed that using only a single triangle per point (instead of a quad, effectively halving the number of polygons to render) reduces the Merged time to ~6ms while having no effect on the Instanced method, which suggests the bottleneck is the sheer number of instances rather than the polycount.
Is this kind of overhead to be expected, or am I missing something here?
Full source available upon request.
asked Nov 18 '18 at 01:59 AM in Rendering
In this case (simple instances, a very large number of them), one large merged mesh is expected to be faster, at the cost of more memory.
Generally, you should implement some sort of data structure for rendering these things efficiently. Rendering 100 million points does not make sense, since even at the highest resolutions you have fewer than 10 million pixels on screen. You have to bring the count down before it reaches the GPU, both in terms of geometry rendered and memory bandwidth used.
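One simple way to bring the count down on the CPU is to keep at most one point per cell of a uniform grid. This is only a sketch of the idea, not a recommendation of a specific structure (an octree with per-node levels of detail is a common alternative). `Point`, `SubsampleOnGrid`, and the 21-bits-per-axis cell key are assumptions made for illustration.

```cpp
#include <cmath>
#include <cstdint>
#include <unordered_set>
#include <vector>

struct Point
{
    float X, Y, Z;
};

// Keeps the first point encountered in each grid cell of edge length CellSize
// and drops the rest. Cell coordinates are packed into a 64-bit key using
// 21 bits per axis, so cells further than about one million cells from the
// origin wrap around and may be merged with unrelated cells.
std::vector<Point> SubsampleOnGrid(const std::vector<Point>& Points, float CellSize)
{
    const auto Cell = [CellSize](float V) -> std::uint64_t
    {
        const auto Coord = static_cast<std::int64_t>(std::floor(V / CellSize));
        return static_cast<std::uint64_t>(Coord) & 0x1FFFFFull;
    };

    std::unordered_set<std::uint64_t> Occupied;
    std::vector<Point> Kept;

    for (const Point& P : Points)
    {
        const std::uint64_t Key = (Cell(P.X) << 42) | (Cell(P.Y) << 21) | Cell(P.Z);
        // insert() reports whether the cell was previously empty.
        if (Occupied.insert(Key).second)
        {
            Kept.push_back(P);
        }
    }
    return Kept;
}
```

Choosing a larger `CellSize` for parts of the cloud that are far from the camera would thin them out more aggressively, which is roughly what a level-of-detail scheme does.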
But as for 16 ms for only about 1.2 million instances, that is an indication that you are doing something wrong. It should be significantly faster.
While I'm not sure whether this is affecting you, another thing to know about instanced meshes is that culling doesn't work on a per-instance basis. As soon as one element of your instanced mesh is visible, all of them are rendered, whether they are in view or not.
If you have a lot of elements beside or behind you, this results in a constant performance hit, as everything is rendered all the time instead of only while visible.
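A common workaround is to split the instances into spatial chunks and cull each chunk yourself before submitting it. The sketch below tests bounding spheres against six frustum planes in plain C++. The `Plane` and `Chunk` types, the convention that plane normals point into the frustum, and the function names are all assumptions, not engine API.

```cpp
#include <array>
#include <cstddef>
#include <vector>

// Plane in the form N.P + D = 0, with the normal pointing into the frustum,
// so points inside satisfy N.P + D >= 0 (assumed convention).
struct Plane
{
    float NX, NY, NZ, D;
};

// A contiguous range of instances with a bounding sphere around them.
struct Chunk
{
    float CX, CY, CZ, Radius;
    std::size_t FirstInstance, NumInstances;
};

// Conservative test: a chunk is rejected only if its sphere lies entirely
// behind at least one plane.
bool IsChunkVisible(const Chunk& C, const std::array<Plane, 6>& Frustum)
{
    for (const Plane& P : Frustum)
    {
        const float Distance = P.NX * C.CX + P.NY * C.CY + P.NZ * C.CZ + P.D;
        if (Distance < -C.Radius)
        {
            return false;
        }
    }
    return true;
}

// Returns the indices of the chunks that should be submitted this frame.
std::vector<std::size_t> CollectVisibleChunks(const std::vector<Chunk>& Chunks,
                                              const std::array<Plane, 6>& Frustum)
{
    std::vector<std::size_t> Visible;
    for (std::size_t i = 0; i < Chunks.size(); ++i)
    {
        if (IsChunkVisible(Chunks[i], Frustum))
        {
            Visible.push_back(i);
        }
    }
    return Visible;
}
```

Each visible chunk could then be drawn as one instanced call over its `FirstInstance` / `NumInstances` range, so instances beside or behind the camera never reach the GPU.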
answered Nov 18 '18 at 02:01 PM