Crash in FMallocBinned2::MallocExternal on Win64

Update

If you have not read it yet, please see the 'Original Post' section below.

To track down this memory corruption (i.e., why pointers to bad, unmapped memory regions were sitting in the free pool and eventually getting handed out), I put together some rather brute-force instrumentation in MallocBinned2. Every time it hands out a pointer, it records it in an array, and every time it frees a pointer, it removes the corresponding array element (I'm just using a flat array that can hold millions of pointers).

With this I hoped to catch frees of pointers not handed out by MallocBinned2, as well as double frees.
It may be a red herring, but the most likely suspect seems to be a call stack coming from async line traces.
There is a copy that invokes a TArray copy constructor, which is pushing a pointer into the free blocks that either
a) didn't come from MallocBinned2, or b) was double-freed (because it failed my test).
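The tracking described above can be sketched as follows. This is a standalone illustration, not the actual MallocBinned2 hooks; the function names are made up, and a real build would need the container to live outside the allocator being instrumented and to be lock-protected:

```cpp
#include <cstdint>
#include <unordered_set>

// Stand-in for the flat pointer array: the set of pointers the allocator
// has handed out and not yet seen freed.
static std::unordered_set<std::uintptr_t> LiveAllocs;

// Call on every pointer the allocator hands out.
void OnMalloc(void* Ptr)
{
    LiveAllocs.insert(reinterpret_cast<std::uintptr_t>(Ptr));
}

// Call on every pointer passed to Free. Returns false for a pointer the
// allocator never handed out, or one that was already freed (double free).
bool OnFree(void* Ptr)
{
    return LiveAllocs.erase(reinterpret_cast<std::uintptr_t>(Ptr)) == 1;
}
```

In practice MallocVerify (mentioned below in the thread) provides this same checking out of the box.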

Stack trace looks like:

Seems like it's the 'OutHits' array being resized from 4 to 0 in an FTraceDatum, in the async trace data's 'TraceData' buffer:

TArray<TTraceThreadData<FTraceDatum>>			TraceData;

And then the OutHits array's original data pointer being passed to MallocBinned2. So far I have not been able to 'catch' this memory going from a valid state to an invalid one, so it's possible I have the wrong suspect, but the behavior seems odd. Also of note, these pointers are being pushed into SmallPoolTables[8], which is the table we are consistently crashing on. We also make very heavy use of async line traces on the server.

I have to continue to dig, but I wanted feedback from someone who knows the allocation system better than I do: is this indeed odd behavior I should investigate, or does it seem benign?

Original Post

We are seeing a crash that was somewhat rare but has become more frequent (possibly due to a much higher running server instance count): our server build (running on Windows Server 2012, built as a Shipping x64 target) crashes and brings up the native crash handler, without creating a dump or flushing the log.

We recently caught it and created a minidump with heap through Windows, and saw the crash handler was itself crashing in FMallocBinned2::MallocExternal, specifically inside:

void* Result = Pool->AllocateRegularBlock();

Judging from the disassembly, it seemed to be crashing when dereferencing FirstFreeBlock in

	bool HasFreeRegularBlock() const
	{
		CheckCanary(ECanary::FirstFreeBlockIsPtr);
		return FirstFreeBlock && FirstFreeBlock->GetNumFreeRegularBlocks() != 0;
	}

Here is the callstack:

I did a bit of digging to see why the FirstFreeBlock access was causing a 0xC0000005, and saw from the register state that it was looking up SmallPoolTables[8], which looked valid.

However, the first free block pointer, whose dereference caused the crash, was of course invalid (even though it looks to be roughly in the right range). Looking it up in WinDbg gave me:

Content source: 1 (target), length: dd40
0:000> !address 0x000001ec47d0d159

Usage:                  Free
Base Address:           000001ec`34d90000
End Address:            000001ec`6e170000
Region Size:            00000000`393e0000 ( 915.875 MB)
State:                  00010000          MEM_FREE
Protect:                00000001          PAGE_NOACCESS
Type:                   

But running that on several pools around that one gave me pointers into valid (64 KB) mapped pages. Also of note, the page info pointing to the bad free block and all surrounding structures seemed valid, with correct canaries and valid pointers (verified with WinDbg), so to me it doesn't look like a stomp (at least not a trivial one).

Now, since this crash was in the crash reporter, we obviously crashed before this. I was able to pull the crash context out of memory and use [rip] and [rsp] from the context to resurrect the initial crash in WinDbg, and it seems to have crashed in the same place, but from an allocation in gameplay code:

I again looked at the register state, and it was getting the exact same bad block pointer from that pool state. Which is consistent, at least.

Has this issue come up at all, or do you perhaps have some insight into what could be happening? I've run the game with the stomp allocator to try to suss out any bad memory accesses on our part or from any plugins, but as I mentioned, all the other memory around the bad pointer (which itself is a reasonable value) is pristine.

The only other issue on UDN that looked remotely similar was:
https://udn.unrealengine.com/questions/359869/mallocbinned-crash.html

But that one looks like it's in the original MallocBinned, albeit in almost the exact corresponding place.
I'll continue trying to see if we are inducing this in any way, but if you have any insight it would be most helpful.
Thanks!

This might not be the issue, but I have seen a broadly similar crash where some code incorrectly calls FMemory::Free (possibly via another mechanism, such as the delete operator) with a valid memory pointer that was allocated by another allocator, not FMemory::Malloc. Later on, that memory is cleaned up by the allocator that owns it, and FMallocBinned is left with a pointer to invalid memory in its list of free blocks. It doesn't blow up until some unrelated code tries to allocate some memory.
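The shape of that failure mode can be modeled in miniature. In this toy (entirely hypothetical, no real allocators involved), a pointer owned by the CRT heap is "freed" into a binned-style free list; nothing faults at that moment, and the dangling pointer only resurfaces on a later, unrelated allocation:

```cpp
#include <cstdlib>
#include <vector>

// Toy stand-in for a binned allocator's free list. Like the real thing,
// Free() does not verify that it owns the incoming pointer.
struct ToyBinnedAllocator
{
    std::vector<void*> FreeBlocks;

    void Free(void* Ptr)
    {
        FreeBlocks.push_back(Ptr); // no ownership check: foreign pointers accepted silently
    }

    void* Malloc()
    {
        // Hands back whatever is on the free list, valid or not.
        void* P = FreeBlocks.back();
        FreeBlocks.pop_back();
        return P;
    }
};
```

The crash then happens far from the buggy free, which matches the symptom of blowing up inside MallocExternal on an unrelated allocation.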

If you suspect this is what's happening, you could try looking at any code that's not part of standard UE4 (third-party plugins, your own code, etc.) that might be using its own memory allocator, using features that change allocation behaviour such as FMemStack, or overriding the new and delete operators, and look for any possible bugs there.

Good luck!

Hey Stephan,

We have the MallocVerify proxy which does the same thing as you have hacked in - you can just enable it via the MALLOC_VERIFY macro at the top of MallocVerify.h.

The pointer value you have is absolutely invalid - you can tell as it’s not aligned. This suggests that it’s not merely stale, but has been corrupted somehow. This could, as you and others have suggested above, have occurred as a result of bad pointers coming in and out of the allocator. Another simple thing to add would be to check that the pointers coming into Free and Realloc are aligned to 16.
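Steve's suggested check could be sketched like this. It is a standalone sketch; the helper name is made up, and the 16 here corresponds to BINNED2_MINIMUM_ALIGNMENT in the engine source. The crashing block pointer from the dump (0x000001ec47d0d159) ends in 0x9, so it fails immediately:

```cpp
#include <cstdint>

// Hypothetical helper: returns true when a pointer handed to Free/Realloc
// satisfies the allocator's 16-byte minimum alignment. Any low bit set
// means the pointer cannot have come from the binned small pools.
bool IsValidBinned2Pointer(const void* Ptr)
{
    return (reinterpret_cast<std::uintptr_t>(Ptr) & 15u) == 0;
}
```

A check like this at the top of Free and Realloc turns a deferred crash in MallocExternal into an immediate report at the bad call site.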

I notice that tracing pops up in both callstacks so I agree that it could be coming from there. I will ask around for someone more familiar with that code.

Steve

That's exactly the kind of info I was looking for. I'll try auditing our plugins and see if I can isolate anything doing manual memory management.

Hey,

Yeah, just as a note: I'm suspecting exactly this and am looking in that direction at the moment. This, or a double delete of the same pointer. I expect it to be something that does not belong to any of the bin pools - either it never did, or it no longer does.

If you find something, please share - as will I.

Thanks,
Oliver

I updated my post with my investigation so far, but I’m still digging to try to find more compelling evidence.

This happens on Linux pretty regularly due to mismatched libc++ vs. libstdc++ calls. Make sure never to cross API boundaries with those classes, because it's very easy to get into a crash state.

The only area of our code that uses std::* classes (assuming that's what you are referring to) is our plugin code that interfaces with the Amazon AWS libraries (which expect std::shared_ptr to be passed back and forth).

If you think this is a likely candidate I’ll look into it, otherwise the entirety of our code base should be using the entirely UE4 types.

I knew that functionality probably existed! I’ll switch over to using that, thank you!

Yes, the alignment was strange to me also. It jumps past the entire body of FMallocBinned2::Realloc(), because the alignment fails the (Alignment <= BINNED2_MINIMUM_ALIGNMENT) check, and goes straight into ReallocExternal(). I don't know MallocBinned2 well enough to say that is incorrect, but it looked fishy.
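For reference, the small-pool fast-path condition being discussed can be sketched as follows. The constants are as I recall them from MallocBinned2.h, and this is simplified: the real Realloc also checks whether the incoming pointer itself lives in the small-block pools before taking the fast path:

```cpp
#include <cstddef>

// Constants as found in MallocBinned2.h (values quoted from memory).
constexpr std::size_t BINNED2_MINIMUM_ALIGNMENT   = 16;
constexpr std::size_t BINNED2_MAX_SMALL_POOL_SIZE = 28672; // 28 KB

// Made-up helper name: when either condition fails, Realloc falls
// through to ReallocExternal, which is what was observed here.
bool TakesSmallPoolFastPath(std::size_t NewSize, std::size_t Alignment)
{
    return NewSize <= BINNED2_MAX_SMALL_POOL_SIZE
        && Alignment <= BINNED2_MINIMUM_ALIGNMENT;
}
```

An alignment of 81 (as initially suspected below) would indeed force the ReallocExternal path, which is why Steve's question about where that value came from matters.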

I'm not live-debugging at the moment, but I saved a dump with the above call stack so I can check. From memory, the alignment was being passed in as something odd like 81, and if memory serves it was just a raw sizeof(FHitResult). This is going from memory, so if it's useful I'll check the dump I saved and get back to you.

Thanks!

Are you sure it skipped the block because the alignment was 81? That shouldn't be possible. The alignment is determined by the TArray's allocator, which for OutHits is the default, FHeapAllocator; it doesn't pass any alignment argument to Realloc, so the alignment should be 0.

Also, can you confirm that you are not running multithreaded physics on your server?

Steve

Sorry, you are right, the alignment is 0, as it should be.
The code is entering the body but not returning from it, and that's why it calls ReallocExternal. These pointers seem OK too, not strangely aligned like the one causing the crash. I will modify my check to screen for pointers that are not 16-byte aligned.

To my knowledge we are not using multithreaded physics on the server, I couldn’t find any configuration for that though.

Just to follow up: upon switching to MallocVerify, the async line traces did not show up as an issue; I must have missed a malloc entry point in my simple test. I ran the whole game under those conditions for a while and nothing tripped.

To better answer your question on multithreaded physics: the async line traces are of course happening on a task thread, but other than that there should be no multithreading.

I will run a much larger scale test using mallocverify to catch our issue, thank you very much for your help.

Another thing to check would be that all UWorld::Async*() calls are only coming from the game thread. The code for starting a new trace is not thread-safe, even though the traces themselves should be safe once they’ve started.

Just put this in StartNewTrace:

check(IsInGameThread());

Steve

Added that and did a quick sanity check, and it looked good. However, we came across this CL recently, and we are going to integrate it to see if it affects our crash rate:
https://github.com/EpicGames/UnrealEngine/commit/3a274e0296a40f5f290caa17f1dd6e3f963370e2

Same issue here, and yes, it is because something is overriding the allocator or using its own. For me it was fixed by changing the STL classes to UE classes.