4.18 crash after destroying DestructibleComponent

Answers.Archive · September 12, 2018, 1:36pm

We are seeing a crash in SyncComponentsToBodies_AssumesLocked (with SceneType == PST_Async).

This appears to be due to a change in the way physics results are fetched and processed.

In 4.17, PScene->getActiveActors was called immediately after scene completion and fetchResults, when presumably the returned actors should have been valid.

In 4.18, the call has moved to FPhysScene::EndFrame, which can be delayed almost a full frame after fetchResults has completed for the async scene. Because of this, when a DestructibleComponent (if simulating in the async scene) is destroyed and it calls DeferredRelease, the release will happen between the time that fetchResults has been called and the subsequent frame’s call to getActiveActors (for the previous frame’s simulation results); and if the destructible actors that were destroyed also happened to be in the active actor list from the previous frame’s results, they will be destroyed in place, with dangling references to garbage data.

Here’s a log with some debug prints that show this happening:

[600]LogPhysics: Warning: START TICK PHYSX ASYNC SCENE (TickPhysScene)
[600]LogPhysics: Warning: COMPLETION AND FETCH RESULTS PHYSX ASYNC SCENE. (ProcessPhysScene)
[600]LogPhysics: Warning: RELEASE APEXINTERFACE. (FPhysCommandHandler::ExecuteCommands)
[600]LogPhysics: Warning: RELEASE APEXINTERFACE. (FPhysCommandHandler::ExecuteCommands)
[600]LogPhysics: Warning: RELEASE APEXINTERFACE. (FPhysCommandHandler::ExecuteCommands)
[600]LogPhysics: Warning: RELEASE APEXINTERFACE. (FPhysCommandHandler::ExecuteCommands)
[601]LogPhysics: Warning: START TICK PHYSX SYNC SCENE. (TickPhysScene)
[601]LogPhysics: Warning: COMPLETION AND FETCH RESULTS PHYSX SYNC SCENE. (ProcessPhysScene)
[601]LogPhysics: Warning: PHYSX EndPhysicsTickFunction::ExecuteTick

You can see that the results from the async tick in frame 600 are the ones that are still pending to be accessed in the frame 601 EndPhysicsTick, but the release/destroy occurs in between.

I have some ideas for a quick patch to allow us to keep working, but none are obviously the best/easiest/working approach yet. Would appreciate any guidance or a CL where a fix might have already been made.

Answers.Archive · September 12, 2018, 1:36pm

Hi,

Thanks for this info. We have an IgnoreActiveActors array which is meant to handle this case. When a BodyInstance is destroyed we add its physx actor to the IgnoreActiveActors array which we use to skip dangling pointers later on.

In the destructible case we have multiple physx actors which we are currently (incorrectly) not marking for ignoring. Can you try modifying UDestructibleComponent::OnDestroyPhysicsState so it looks like this:

void UDestructibleComponent::OnDestroyPhysicsState()
{
#if WITH_APEX
	if(ApexDestructibleActor != NULL)
	{
		if(UWorld * World = GetWorld())
		{
			if (FPhysScene * PhysScene = World->GetPhysicsScene())
			{
				PxScene* PScene = nullptr;
				TSet PActorsToRemove;

				for (int32 ChunkIdx = 0; ChunkIdx < ChunkInfos.Num(); ++ChunkIdx)
				{
					if (PxRigidDynamic* PDynamic = ApexDestructibleActor->getChunkPhysXActor(ChunkIdx))
					{
						PActorsToRemove.Add(PDynamic);
						if (PScene == nullptr)
						{
							PScene = PDynamic->getScene();
						}
					}
				}

				const int32 SceneType = PScene == PhysScene->GetPhysXScene(PST_Sync) ? PST_Sync : PST_Async;
				for (PxRigidDynamic* PDynamic : PActorsToRemove)
				{
					PhysScene->RemoveActiveRigidActor(SceneType, PDynamic);
				}

				GPhysCommandHandler->DeferredRelease(ApexDestructibleActor);
			}
		}
		
		ApexDestructibleActor = NULL;
		
		//Destructible component uses the BodyInstance in PrimitiveComponent in a very dangerous way. It assigns PxRigidDynamic to it as it needs it.
		//Destructible PxRigidDynamic actors can be deleted from under us as PhysX sees fit.
		//Ideally we wouldn't ever have a dangling pointer, but in practice this is hard to avoid.
		//In theory anyone using BodyInstance on a PrimitiveComponent should be using functions like GetBodyInstance - in which case we properly fix up the dangling pointer
		BodyInstance.RigidActorSync = NULL;
		BodyInstance.RigidActorAsync = NULL;
	}
#endif	// #if WITH_APEX
	Super::OnDestroyPhysicsState();
}

Please let me know if this doesn’t fix the issue, but hopefully this is the right fix in which case I will submit it into the engine for 4.19

Answers.Archive · September 12, 2018, 1:36pm

Hi Ori,

I tried a very similar approach first, but there are more complications. At the time OnDestroyPhysicsState runs, the async scene has started, but not completed processing yet, so when it completes, it will immediately clear the ignore actor list in ProcessPhysScene. Then later the deferred release will run and cause the dangling pointer issue.

Another minor issue is that even if the actor was not prematurely removed from the ignore list, the compare in SyncComponentsToBodies_AssumesLocked happens after a virtual “is” call on the dangling pointer.

What we ended up doing for a workaround is making a DeferredReleaseAsync function which takes an apex destructible actor pointer, a scene pointer, and which does the loop and the calls to RemoveActiveRigidActor which you posted in your answer before it does the release call.

To fix the secondary issue, we changed RemoveActiveRigidActor to RemoveActiveActor and made the ignore list contain PxActor* instead of PxRigidActor. Then we do the compare before the “is” call.

This solution fixed our issue, but I still have a concern about what happens if the async scene is still processing during the deferred release call. Is that call safe to make while the scene is active?

In any case, please let me know if you see any other potential problems with our method.

-Eric

Answers.Archive · September 12, 2018, 1:36pm

I believe I’ve submitted a fix for this in CL-3774236. I was unable to repro the crash, but based on the information given I think this should be correct. Please let me know if this still happens. At the moment this change should make it in for 4.19

Answers.Archive · September 12, 2018, 1:36pm

Hi all,

Thanks for all the information about this. I submitted a bigger change to the way we handle physx to ue4 sync which I think will be a lot simpler / more robust. We now sync immediately after fetchResults which should completely eliminate this type of bug. Please give this CL a try as we’d like to make sure it’s all fixed before it goes out in 4.19

CL-3783974

One implication with this is that sync components now happens TG_DuringPhysics instead of at the very end of that tick group. This should be OK because tasks executing in TG_DuringPhysics should not care about the indeterminate state of the simulation. Prior to this change we still had an inconsistent state where any scene queries would see the result of the sim (because we called fetchResult), but UE4 components would still be seen as the pre-sim state. So I think this is actually more consistent.

Please let me know if you run into problems

Answers.Archive · September 12, 2018, 1:36pm

Ah yes I missed the defer case. Your solution sounds right, do you mind posting the code so I can properly integrate it into the engine?

I think the case where the deferred release happens while processing is running is safe because the physx API should ensure that when we call fetchResults the active transforms array contains no deleted actors. If this is not the case then I’d expect it was broken before this change

Answers.Archive · September 12, 2018, 1:36pm

We’ve also seen a similar crash in SyncComponentsToBodies_AssumesLocked due to the same reasons, I’ve worked around the issue temporarily, is there a proper fix for this available? Cheers.

Answers.Archive · September 12, 2018, 1:36pm

I’m also experiencing this issue, I believe, although we’re not using the async scene. Could the issue be tracked publicly so I can keep an eye on the status?

Thanks,
Isaac

Answers.Archive · September 12, 2018, 1:36pm

Still seeing this crash with that changelist, rarer than before however. All of our DestructibleComponents use is within the Async scene.

Looking at the list returned by:

PxActor** PActiveActors = PScene->getActiveActors(NumActors);

It seems that while the destructible actor which was destroyed contained 25 chunks, only one of those chunks still remains in PActiveActors (the one which ultimately causes the code to crash).

All other actors are either valid, or in the IgnoreActiveActors list.

The invalid actor seems to have been killed several frames before, and I’m seeing two sets of calls to fetchResults (and thus two clearings of the IgnoreActiveActors list) before the crash occurs.

I removed the call to clear the IgnoreActiveActors list at the bottom of SyncComponentsToBodies_AssumesLocked as it seems to me the only time this list should be cleared is when fetchResults is called? Either way it didn’t help.

This detail is from three crashes post adding your fix, so take it with a pinch of salt as I may have been lucky to get three similar crashes!

Answers.Archive · September 12, 2018, 1:36pm

Sorry for not posting my code earlier, but Ori’s CL looks like pretty much what I had. I didn’t create the nice generic callback approach, and I used a different method to get the actor list, so there might be some slight differences, but the approach definitely fixed our issue. Here’s the actual phys command (which I only call on async scene components) from our current fix:

		case PhysCommand::ReleaseAsync:
		{
			nvidia::apex::DestructibleActor * ApexDestructibleActor = Command.Pointer.DestructibleActor;
			PxRigidDynamic** PActorBuffer = NULL;
			PxU32 PActorCount = 0;
			if (ApexDestructibleActor->acquirePhysXActorBuffer(PActorBuffer, PActorCount, apex::DestructiblePhysXActorQueryFlags::Dynamic))
			{
				while (PActorCount--)
				{
					PxRigidDynamic* PActor = *PActorBuffer++;
					if (PActor != NULL)
					{
						Command.PhysScene->RemoveActiveActor(PST_Async, PActor);
					}
				}
				ApexDestructibleActor->releasePhysXActorBuffer();
			}
			ApexDestructibleActor->release();
			break;
		}

Answers.Archive · September 12, 2018, 1:36pm

Here’s the new code:

struct APEXDESTRUCTION_API FDestructibleActorDeferredReleaseCallback : IDeferredReleaseCallback
{
	FDestructibleActorDeferredReleaseCallback(apex::DestructibleActor* Actor, FPhysScene* InOwningScene)
		: Destructible(Actor)
		, OwningScene(InOwningScene)
	{
	}

	virtual void ExecCommand()
	{
		PxRigidDynamic** PActorBuffer = NULL;
		PxU32 PActorCount = 0;
		int32 SceneType = INDEX_NONE;

		if (Destructible->acquirePhysXActorBuffer(PActorBuffer, PActorCount, apex::DestructiblePhysXActorQueryFlags::Dynamic))
		{
			while(PActorCount--)
			{
				if (PxRigidDynamic* PActor = *PActorBuffer++)
				{
					if (SceneType == INDEX_NONE)
					{
						if (PxScene* PScene = PActor->getScene())
						{
							SceneType = PScene == OwningScene->GetPhysXScene(PST_Sync) ? PST_Sync : PST_Async;
						}
					}

					if (SceneType != INDEX_NONE)
					{
						OwningScene->RemoveActiveRigidActor(SceneType, PActor);
					}
				}
			}
			Destructible->releasePhysXActorBuffer();
		}
		Destructible->release();

	}

private:

	apex::DestructibleActor* Destructible;
	FPhysScene* OwningScene;
};

Answers.Archive · September 12, 2018, 1:36pm

This looks better than what I wrote. I think mine has problems as it only uses visible chunks. Stephen do you mind trying the acquirePhysXActorBuffer approach? You should be able to just replace this with the loop I use in the callback.

The reason for the callback is that destructibles are now a plugin so engine can’t know about it

Answers.Archive · September 12, 2018, 1:36pm

Unfortunately I tried acquirePhysXActorBuffer last night and had no luck.

Looking into this further today it seems like the crash never happens on the frame the actors are released, nor the frame after, in fact once removed from the actors list they don’t return until either another physics actor is created, or different one is killed.

Then it seems one of the PxActors owned by the previous killed ApexDestructibleActor will be resurrected into the actor list.

It also seems like it might always be the last actor inside the ApexDestructibleActor that is resurrected.

However it doesn’t always occur even when multiple actors are created and destroyed over several frames.

I’ve also tried making all the physics run synchronously by immediately calling fetchResults after simulate- still crashes.

I previously made a hack to get this to work for us- but the only thing I found which was robust was to hold off calling release on the ApexDestructibleActors until after SyncComponentsToBodies_AssumesLocked was done processing the list- but in this case I’m guessing I just moved the race, or whatever is causing this to a benign point in the update where it wasn’t causing issues rather than this being a real fix.

Answers.Archive · September 12, 2018, 1:36pm

Hi, did your change look like what I pasted below? When you say the actors are released do you mean the call into Destructible->release(), or the DeferredRelease call? I would expect that if the destructible is released the next call to fetchResults would not have it in the active actors list since fetchResults rebuilds this list every frame.

Answers.Archive · September 12, 2018, 1:36pm

The Destructible->release() call. I went with

virtual void ExecCommand()
{
	PxRigidDynamic** PActorBuffer = NULL;
	PxU32 PActorCount = 0;
	if (Destructible->acquirePhysXActorBuffer(PActorBuffer, PActorCount, DestructiblePhysXActorQueryFlags::AllStates | DestructiblePhysXActorQueryFlags::AllowActorsNotInScenes))
	{
		for(uint32 ActorIdx = 0; ActorIdx < PActorCount; ++ActorIdx)
		{
			PxRigidDynamic* PActor = PActorBuffer[ActorIdx];
			OwningScene->RemoveActiveRigidActor(PST_Sync, PActor);
			OwningScene->RemoveActiveRigidActor(PST_Async, PActor);
		}
		Destructible->releasePhysXActorBuffer();
	}

	Destructible->release();
}

When testing acquirePhysXActorBuffer.

Answers.Archive · September 12, 2018, 1:36pm

I just tried changing FrameLagAsync() to return false always and this seems to prevent the crash- not dug in any further yet however.

Answers.Archive · September 12, 2018, 1:36pm

I would expect that if the destructible is released the next
call to fetchResults would not have it in the active actors
list since fetchResults rebuilds this list every frame.

The next call to fetchResults doesn’t have it as expected- it’s one or two frames later that it reappears!

Which is super strange, perhaps physx is double/triple buffering something internally?

Answers.Archive · September 12, 2018, 1:36pm

Hi Ori,
Can you point to the github commit that matches this CL, please?
Thanks!

Answers.Archive · September 12, 2018, 1:36pm

This change seems to have fixed our issue, thanks!

Answers.Archive · September 12, 2018, 1:36pm

Working for us as well. Thanks!

In the process of testing this though, we found a bug in physx related to using impact damage and destroying destructible chunks. What is the recommended approach to reporting and getting physx bugs fixed?