Realtime GC with multiple threads causes the logic thread to wait forever

Hi dear support team,

    We have hit a crash/hang issue in our shipping build. The game randomly crashes or hangs during normal gameplay.

Some context:

  1. We use the streaming level, UMG, and
    localization features to build our
    game.

  2. The bug happens about 1 time in 6
    in our first 3
    gameplay levels.

  3. It seems that the bug never happens
    on the PC platform, only on the
    console platform.

  4. As we don't have any logs in the
    shipping build, we can only get
    the stack below, which shows where the logic thread is stuck:
    Stuck stack: (Because of the console platform's confidentiality rules I have hidden the DLL names; they seem to be kernel libraries.)


    xxxx.dll!00000008000029CC (Module: 0x0000000800000000 + 10700 bytes)	C++
 	xxxx.dll!0000000800004A4C (Module: 0x0000000800000000 + 19020 bytes)	C++
 	xxxx.dll!0000000800004E03 (Module: 0x0000000800000000 + 19971 bytes)	C++
 	yyy.bin!FPThreadEvent::Wait(uint32 WaitTime, const bool bIgnoreThreadIdleStats) Line 393	C++
 	yyy.bin!FNamedTaskThread::Stall(int,FNothing,bool) Line 115 + 57 bytes	C++
 	yyy.bin!FNamedTaskThread::ProcessTasksNamedThread(int,bool) Line 115 + 64 bytes	C++
 	yyy.bin!FNamedTaskThread::ProcessTasksUntilQuit(int) Line 504 + 81 bytes	C++
 	yyy.bin!TFastReferenceCollector<FGCReferenceProcessor,FGCCollector,FGCArrayPool,false>::CollectReferences(TArray<UObject*,FDefaultAllocator>&,bool) Line 140 + 20132 bytes	C++
>	yyy.bin!FRealtimeGC::PerformReachabilityAnalysis(EObjectFlags,bool) + 403 bytes	C++
 	yyy.bin!CollectGarbageInternal(enum EObjectFlags KeepFlags, bool bPerformFullPurge) Line 1253	C++
 	yyy.bin!TryCollectGarbage(enum EObjectFlags KeepFlags, bool bPerformFullPurge) Line 1361	C++
 	yyy.bin![Inline Function] UWorld::PerformGarbageCollectionAndCleanupActors() Line 1656 + 12 bytes	C++
 	yyy.bin!UWorld::Tick(enum ELevelTick TickType, float DeltaSeconds) Line 1534	C++
 	yyy.bin!UGameEngine::Tick(float DeltaSeconds, bool bIdleMode) Line 1127	C++
 	yyy.bin!FEngineLoop::Tick() Line 2853 + 23 bytes	C++
 	yyy.bin!tchar_main(int32 ArgC, TCHAR** ArgV) Line 183	C++
 	yyy.bin!main(int32 ArgC, ANSICHAR** Utf8ArgV) Line 82 + 713 bytes	C++
 	yyy.bin!_start + 63 bytes	C++

From what I have found:

    It seems the call to "PerformReachabilityAnalysis" makes the logic thread start a child thread to run an object reference collection task and then wait until the child finishes (a blocking call), but the child never seems to return.
    I don't know how to debug the child thread's code; sorry for my weak knowledge of multithreaded debugging. I would really appreciate some help on how to debug threads.
    One more detail I am not sure about: sometimes, by clicking around (I don't know exactly how I did it, I was just clicking like a monkey), I can see a stack that looks like the child thread is doing FTextReference collection. I guess it may be stuck there, but I am not sure.
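
    For what it is worth, the blocking pattern described above can be reduced to a few lines of standard C++. This is only a sketch under assumptions: WaitWithTimeout and CollectReferencesStub are made-up names, and none of this is the engine's actual task graph code. It only shows how a bounded wait turns a silent hang into something that can be detected and reported.

```cpp
#include <cassert>
#include <chrono>
#include <future>
#include <thread>

// Hypothetical helper: waits for a worker's result, but gives up after
// `timeout` instead of blocking forever. Returns true if the worker finished.
template <typename T>
bool WaitWithTimeout(std::future<T>& result, std::chrono::milliseconds timeout)
{
    return result.wait_for(timeout) == std::future_status::ready;
}

// Hypothetical stand-in for a reference-collection task on a worker thread.
inline int CollectReferencesStub(std::chrono::milliseconds workTime)
{
    std::this_thread::sleep_for(workTime);
    return 42; // pretend result
}

// Example use: a fast worker is reported as finished, a slow one as timed out.
inline void DemoWaitWithTimeout()
{
    auto fast = std::async(std::launch::async, CollectReferencesStub,
                           std::chrono::milliseconds(1));
    assert(WaitWithTimeout(fast, std::chrono::milliseconds(2000)));

    auto slow = std::async(std::launch::async, CollectReferencesStub,
                           std::chrono::milliseconds(500));
    assert(!WaitWithTimeout(slow, std::chrono::milliseconds(10)));
    // Note: the future returned by std::async joins the worker when destroyed.
}
```

    In a real hang the worker would never finish, so the timeout branch is where one would log which task was outstanding.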

Question:

  Can you give me some guidance or hints on how to debug this issue, or point out any bad practices we may have used that could cause this kind of issue?
  Thanks in advance, I will stay online for this question. My boss won't let me go home if I can't fix this today T_T  SOS!!!

*Thanks, and I wish you a nice day ^^

— Qu*

I’m not sure if this helps, but not a lot of stuff is thread safe. For example, they have a special thread-safe FMallocThreadSafeProxy for memory management. I kind of get the feeling that I’m on my own with my worker thread(s), and as such I never make any calls into the UE4 API from them.

What I normally do to debug multi threaded issues is to make sure everything is working wonderfully on a single thread. I sort of have an ability to switch between running single threaded and multi threaded with a tiny compile change.
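
A minimal sketch of that kind of switch, in plain standard C++ rather than anything engine-specific. The macro RUN_SINGLE_THREADED and the function RunJobs are hypothetical names chosen for illustration; the idea is just that one compile flag decides whether jobs run inline or on their own threads.

```cpp
#include <cassert>
#include <functional>
#include <thread>
#include <vector>

// Hypothetical compile-time switch: build with -DRUN_SINGLE_THREADED to run
// every job inline on the calling thread.
#ifdef RUN_SINGLE_THREADED
constexpr bool kSingleThreaded = true;
#else
constexpr bool kSingleThreaded = false;
#endif

// Runs all jobs and returns once every job has finished.
inline void RunJobs(const std::vector<std::function<void()>>& jobs)
{
    for (const auto& job : jobs)
    {
        assert(job && "empty job passed to RunJobs");
    }

    if (kSingleThreaded)
    {
        // Deterministic path: jobs run one after another, in order.
        for (const auto& job : jobs) job();
        return;
    }

    // Threaded path: one thread per job, joined before returning.
    std::vector<std::thread> workers;
    workers.reserve(jobs.size());
    for (const auto& job : jobs) workers.emplace_back(job);
    for (auto& worker : workers) worker.join();
}
```

If a bug disappears in the single-threaded build, that is a strong hint (not proof) that it is a threading issue rather than a plain logic error.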

Then I strategically check the state of the system before and after a big call. The other thread might have changed something. This normally does not directly lead to the issue, but it sometimes points the way. When I do get an issue, sometimes it takes all day to solve, so I think you should set expectations.
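
As a rough illustration of the before/after check, here is a sketch that snapshots some state, runs the big call, and compares. SystemState, StateChangedDuring and CallExpectingNoChange are hypothetical names, and the snapshot is assumed to be taken while holding whatever lock normally protects that state (or while the workers are known to be idle), since reading it unsynchronized would itself be a race.

```cpp
#include <cassert>
#include <functional>
#include <vector>

// Hypothetical shared state that another thread might also touch.
struct SystemState
{
    std::vector<int> objectIds;
    int generation = 0;

    bool operator==(const SystemState& other) const
    {
        return generation == other.generation && objectIds == other.objectIds;
    }
};

// Runs `bigCall` and reports whether `state` differs afterwards.
inline bool StateChangedDuring(const SystemState& state,
                               const std::function<void()>& bigCall)
{
    const SystemState before = state; // snapshot copy
    bigCall();
    return !(state == before);
}

// Debug-only variant: aborts at the first call that leaves the state modified.
inline void CallExpectingNoChange(const SystemState& state,
                                  const std::function<void()>& bigCall)
{
    const bool changed = StateChangedDuring(state, bigCall);
    assert(!changed && "state was modified during the call");
    (void)changed; // keeps release builds (NDEBUG) warning-free
}
```

A full copy is only practical for small pieces of state; for anything large, comparing a counter or a checksum would be the cheaper variant of the same idea.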

I also tend to use lock-less paradigms, optimistic locking and message passing. Both for performance and maintainability.
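
To make the optimistic style concrete, here is a small sketch using std::atomic and a compare-exchange retry loop. AddClamped is a made-up example function: it reads the value, computes the new one, and commits only if no other thread changed the value in the meantime, retrying otherwise. No mutex is held at any point.

```cpp
#include <atomic>
#include <cassert>

// Optimistic update: adds `delta` to `value` but never exceeds `maxValue`.
// Returns the value that was committed.
inline int AddClamped(std::atomic<int>& value, int delta, int maxValue)
{
    int observed = value.load();
    int desired = 0;
    do
    {
        desired = observed + delta;
        if (desired > maxValue) desired = maxValue;
        // On failure, compare_exchange_weak reloads `observed` with the
        // current value, so the next iteration recomputes from fresh data.
    } while (!value.compare_exchange_weak(observed, desired));
    return desired;
}

// Example use on a single thread; the same calls are safe from many threads.
inline void DemoAddClamped()
{
    std::atomic<int> score{0};
    assert(AddClamped(score, 7, 10) == 7);
    assert(AddClamped(score, 7, 10) == 10); // clamped at the maximum
}
```

The trade-off is that a thread may redo its computation under contention, which is usually fine when the computation is as cheap as it is here.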

IMHO, solving multi-threaded bugs takes a lot of time and experience. I’ve seen systems shipped with such bugs because a PM made the call.

Thank you. We are currently trying to debug this by disabling some components one at a time… hopefully we can find some clues with this iterative-exclusion method.