Media framework: 4K stereo playback performance (HEVC)

We are trying to play 4K stereo videos via UE’s Media Framework, but are encountering performance issues. 4K mono (in H.264) runs just fine, but 4K stereo (in HEVC) plays at around 0.5x speed at best, and occasionally turns into a slideshow.

I have not done much testing with it yet, but I would already like to ask whether I should even expect this to work. What I know so far:

4K mono video, 4096x2048, MP4, H.264, 80 Mbit/s, 30 fps:
runs smoothly

2K stereo video, 2048x2048 (top-bottom stereo), MP4, HEVC, 30 fps, profile=main, level=6.1, tier=main:
smooth at any bitrate (tested up to 100 Mbit/s)

4K stereo video, 4096x4096 (top-bottom stereo), MP4, HEVC, 30 fps, profile=main, level=6.1, tier=main:
0.5x playback speed at best (at any bitrate, from 0.2 Mbit/s to 100 Mbit/s!)

Interestingly, the video bitrate does not seem to affect playback performance at all in that last case. Could it be that the bottleneck is not decoding performance but memory bandwidth when moving the decoded frames around?

Gmpreussner, I think I read in an older post of yours that the implementation is (was?) not very efficient, in that it makes some technically unnecessary copying of the decoded frames. Do I remember correctly? Could that be related to this?

Test computer specs:
Intel i7-6700 @ 3.40 GHz (16 GB RAM)
GeForce GTX 980 (4 GB VRAM)
Windows 10

All videos are encoded using Adobe Media Encoder CC 2017.

There is nothing else happening in the game (film) while a video is playing.

…Bump?

the implementation is (was?) not very efficient, in that it makes some technically unnecessary copying of the decoded frames […] Could that be related to this?

Could be. 4K stereo puts quite a bit of strain on the CPU and GPU memory bandwidth.

4096 (width) x 4096 (height) x 0.5 (packed UV horizontally) x 4 (bytes per texel) x 30 (fps) = 1 GB / sec
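
For reference, here is that math as a tiny standalone program (a sanity-check sketch, not engine code; it assumes YUY2-style packing, where two pixels share one four-byte texel):

```cpp
// Back-of-the-envelope check of the bandwidth math above.
// Assumes YUY2-style packing: U/V are packed horizontally, so one
// four-byte texel covers two pixels (0.5 texels per pixel).
#include <cstdio>

int main()
{
    const double Width = 4096.0, Height = 4096.0;
    const double PackedUV = 0.5;       // texels per pixel (UV packed horizontally)
    const double BytesPerTexel = 4.0;
    const double Fps = 30.0;

    const double BytesPerFrame = Width * Height * PackedUV * BytesPerTexel;
    const double BytesPerSec = BytesPerFrame * Fps;

    std::printf("%.0f MB per frame, %.2f GB / sec\n",
                BytesPerFrame / (1024.0 * 1024.0),
                BytesPerSec / (1024.0 * 1024.0 * 1024.0));
    // Prints: 32 MB per frame, 0.94 GB / sec (i.e. roughly 1 GB / sec)
}
```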

This is what needs to happen to get the video into a texture:

  • load data from disk into CPU memory (80 Mbit/sec)
  • decode into CPU memory frame buffer (1 GB / sec)
  • copy frame buffer into separate buffer (1 GB / sec) ← this is the extra copy I mentioned
  • copy frame buffer from separate buffer to GPU (1 GB / sec)
  • convert from YUV to RGBA on the GPU (see the sketch after this list)
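
That last step runs as a pixel shader. Here is the per-pixel math as a C++ sketch, assuming BT.709 video-range coefficients (the names FColor8 and YuvToRgba are made up for illustration, and the engine’s actual shader may use different coefficients or a matrix form):

```cpp
// Per-pixel YUV -> RGBA conversion, BT.709 video range (assumption).
#include <algorithm>
#include <cstdint>

struct FColor8 { std::uint8_t R, G, B, A; }; // hypothetical pixel type

FColor8 YuvToRgba(std::uint8_t Y, std::uint8_t Cb, std::uint8_t Cr)
{
    // Video range: Y in [16, 235], Cb/Cr centered at 128.
    const float Luma = 1.164f * (Y - 16);
    const float Cbf  = Cb - 128.0f;
    const float Crf  = Cr - 128.0f;

    const auto Clamp8 = [](float V) {
        return static_cast<std::uint8_t>(std::min(std::max(V, 0.0f), 255.0f));
    };

    return {
        Clamp8(Luma + 1.793f * Crf),                 // R
        Clamp8(Luma - 0.213f * Cbf - 0.533f * Crf),  // G
        Clamp8(Luma + 2.112f * Cbf),                 // B
        255                                          // A (opaque)
    };
}
```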

The performance-critical parts here are the copies from and to CPU memory, which amount to roughly 3 GB / sec in total. The extra copy is needed because the IMFSampleGrabberSinkCallback API that I’m currently using does not allow me to hold on to the decoded buffer. I’m planning to try some other approaches, such as a custom media sample sink, which might allow me to eliminate this extra copy. I’m also investigating asynchronous GPU texture uploads, which we added support for in the DirectX driver some time ago.
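
To make the constraint concrete, here is a minimal sketch of what that callback looks like (the class name and the staging buffer are made up for illustration; only OnProcessSample matters here, the other interface methods are stubbed for brevity):

```cpp
// Sketch of the copy forced by IMFSampleGrabberSinkCallback: the buffer
// passed to OnProcessSample is only valid for the duration of the call,
// so every decoded frame has to be copied into our own storage before
// the render thread can upload it to the GPU.
#include <mfidl.h>   // IMFSampleGrabberSinkCallback (Media Foundation)
#include <vector>

class FGrabberCallback final : public IMFSampleGrabberSinkCallback
{
public:
    STDMETHODIMP OnProcessSample(
        REFGUID /*MajorMediaType*/, DWORD /*SampleFlags*/,
        LONGLONG /*SampleTime*/, LONGLONG /*SampleDuration*/,
        const BYTE* SampleBuffer, DWORD SampleSize) override
    {
        // This is the extra copy: ~32 MB per 4096x4096 YUY2 frame,
        // ~1 GB / sec at 30 fps.
        FrameCopy.assign(SampleBuffer, SampleBuffer + SampleSize);
        // ...hand FrameCopy to the render thread for the GPU upload...
        return S_OK;
    }

    // Remaining interface methods stubbed for brevity.
    STDMETHODIMP OnSetPresentationClock(IMFPresentationClock*) override { return S_OK; }
    STDMETHODIMP OnShutdown() override { return S_OK; }
    STDMETHODIMP OnClockStart(MFTIME, LONGLONG) override { return S_OK; }
    STDMETHODIMP OnClockStop(MFTIME) override { return S_OK; }
    STDMETHODIMP OnClockPause(MFTIME) override { return S_OK; }
    STDMETHODIMP OnClockRestart(MFTIME) override { return S_OK; }
    STDMETHODIMP OnClockSetRate(MFTIME, float) override { return S_OK; }
    STDMETHODIMP QueryInterface(REFIID, void**) override { return E_NOINTERFACE; }
    STDMETHODIMP_(ULONG) AddRef() override { return 1; }
    STDMETHODIMP_(ULONG) Release() override { return 1; }

private:
    std::vector<BYTE> FrameCopy; // hypothetical per-frame staging buffer
};
```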

Ultimately, going forward with 4K stereo and 8K video, we’ll need a decoder that doesn’t require any CPU copies at all. We currently have this capability with AvfMedia on macOS / iOS, and we’re working with Google and Khronos on Android and Vulkan support, but it will still take some time. I’m currently working on Media Framework 3.0, but I’m not sure how many of these performance optimizations will get done for 4.17.

Ok, thank you for the answer!

Hello,
I am wondering where I can find more information about HEVC support.
Have these improvements made it into the engine?

Sorry if this is not the right place to ask this question, and thanks,