4.10.0 semi-Random Freeze when TArray tries to resize memory

This started as soon as I upgraded my project from 4.9 to 4.10.
I’ve been catching it repeatedly, and it’s semi-random to reproduce.

Basically, I run about 40 threads in the background doing complex AI and signal analysis, and in roughly one out of every dozen game runs, one thread appears to freeze. The stack always ends in privatizePublicFreeList() in tbbmalloc/frontend.cpp, where it is stuck in what appears to be a while loop.

In the screenshot below, you can see my stack. I followed it back down to verify I’m not passing garbage around, and everything was solid. For reference, the line before TArray<>::CopyToEmpty() copies 512 floats from one TArray to another, and that is where it gets stuck.

There is no voodoo magic here:

TArray<float> Real = signal.signalData;

So, I have no real way to fix this. It’s internal to the engine allocating memory for its data structures. It was not an issue in 4.9 and suddenly appeared in 4.10. It happens more often in developer mode than in debug.

68763-stackcapture.png

Has anyone else encountered this?
I found a version of that file online (tbb/frontend.cpp at master · jckarter/tbb · GitHub) and took a glance at it for reference. That file is not exposed to Visual Studio through UE4, though, so I could only assume a few things about the problem. It appears that the free list being passed in is circular, so the loop never reaches the terminating null it needs to break out. Is there a rare issue with the memory allocation behind TArray?

I found the following, which may be related:
https://software.intel.com/en-us/forums/intel-threading-building-blocks/topic/372934

[edit] I also found a few more threads, which all pointed to this one (Software - Intel Communities). It is more specific and describes the issue in more detail, including tools to fix it.

Hi xdbxdbx,

We didn’t update TBB for 4.10, so it is a little surprising that this issue suddenly started happening for you. We looked through the description and the links you provided, and came to the same conclusion that Vladimir at Intel mentioned. It seems to be a race condition where two threads attempt to release the same pointer. Vladimir mentions that they added some debugging code in version 4.3 update 1 to identify where this may be happening. We have not updated to that version or beyond yet, but from what Vladimir says it should be possible to swap the updated version in for the one currently used.

Please download the 4.3 update 1 TBB source code (you can get that here), and paste it into the Engine source code (after making a copy of the existing TBB code). You’ll want to run GenerateProjectFiles.bat and build the Engine again at this point. I have not tried debugging TBB before, but I believe you should then be able to run your project in Visual Studio’s debugger to help pinpoint where the issue lies. This information may also be helpful.

Thanks much for looking into this. I did in fact run a few tests and ultimately solved the issue, which turned out to be a race condition as you pointed out, but not in a straightforward way. So I’m writing this to hopefully inform others who might stumble onto it. I tried earlier, but the forums were down for maintenance.

In my case, the root of the issue is in allocating memory elsewhere in the code. This is unlike what is mentioned in the links, which deal with deleting.

I accidentally had 8 threads racing to allocate a shared TArray outside the class, something like:

if (ParentClass->MyTArray.Num() == 0)
   CreateTheTArrayData();

This erroneous line has been in place and working since probably UE4.7. The freeze happens about 20 lines of code later.

I suspect more than one thread gets to allocate memory via the function above, although the last thread to hit it probably sets the memory pointers, so it all looks valid when you breakpoint afterwards. Somewhere further down, I think another TArray that is being populated with 512 floats happens to overlap the invalid memory left behind by one of the other threads, and that’s where the issue triggers. Unfortunately, there’s no easy way to catch this.

So the lesson learned, for anyone reading this: if you run into a similar issue, check for race conditions of any sort that deal with allocating, deleting, or even resizing memory.
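
For anyone wanting a starting point, here is a minimal sketch of one way to close the check-then-allocate race described above: do the emptiness check and the allocation under the same lock. It is written in plain standard C++ so it can be compiled on its own. std::vector stands in for TArray, and ParentClass, MyArray, InitMutex, InitCount, EnsureData and RunWorkers are hypothetical names for illustration, not my actual code. In engine code, I would expect FCriticalSection with FScopeLock to play the role of std::mutex and std::lock_guard.

```cpp
#include <cassert>
#include <cstddef>
#include <mutex>
#include <thread>
#include <vector>

// Hypothetical stand-in for the parent object that owns the shared array.
struct ParentClass {
    std::vector<float> MyArray;   // stands in for TArray<float> MyTArray
    std::mutex InitMutex;         // guards the check and the allocation together
    int InitCount = 0;            // how many times the data was actually created

    // Check-then-allocate under a single lock, so only one thread can ever
    // observe the array as empty and fill it.
    void EnsureData(std::size_t Count) {
        std::lock_guard<std::mutex> Lock(InitMutex);
        if (MyArray.empty()) {
            MyArray.assign(Count, 0.0f);
            ++InitCount;
        }
    }
};

// Launch several workers that all try to initialise the same array,
// and report how many initialisations really happened.
int RunWorkers(ParentClass& Parent, int NumThreads, std::size_t Count) {
    std::vector<std::thread> Workers;
    for (int i = 0; i < NumThreads; ++i) {
        Workers.emplace_back([&Parent, Count] { Parent.EnsureData(Count); });
    }
    for (std::thread& Worker : Workers) {
        Worker.join();
    }
    return Parent.InitCount;
}
```

Calling RunWorkers(Parent, 8, 512) mirrors the 8 threads and 512 floats from my case, and the array should be created exactly once. Note that this only protects the initialisation itself. If any thread resizes or reallocates the shared array later while others are reading it, those accesses need the same kind of protection.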

Thanks again for looking into this.