Network serialization is sensitive to packet corruption

Introduction:

I started to notice this issue after our update from 4.11 to 4.13, but I’m not sure whether it was already present before.

In our project, we manually serialize some data and send it to clients in a replicated TArray; the clients grab it (if and when it arrives), deserialize it, and use it. So far so good: locally it worked like a charm.

But when we tested it online, we got totally random issues/crashes: after a lot of tests, we saw that sometimes the deserialization output was just garbage, and we narrowed the problem down to the deserialization input data.

We created a new project to test this issue, and confirmed it. Below you can find the issue description, how to reproduce it, source files etc.

Description:

A replicated TArray of bytes (uint8) can get altered/corrupted over an internet connection, probably due to UDP packet alteration/corruption. I’m not sure if this can happen with every replicated variable (i.e. whether it’s a global problem or one specific to TArray).

How we tested it

To test the issue, we created the following situation:

server: it spawns 4 actors, each containing a replicated TArray, and randomly generates 4 byte sequences to use as test cases (random length, random data). It saves them in the GameState (these are replicated too, since the client uses them for the validity check).

Every tick, the server swaps the sequences between the replicated actors, forcing them to be sent to the client. In particular, we chose to cycle them: the first actor holds sequence 1 on the first tick, sequence 2 on the second tick, sequence 3 on the third, and so on.

client: it receives the replicated data on each actor and checks it against the 4 known sequences, to see whether it matches one of them or got corrupted somehow.
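The cycling and the client-side check described above can be sketched in plain C++ as follows. This is only an illustration and not the code from the attached project: the names ByteSequence, SequenceIndexForTick and FindMatchingSequence are made up here, and std::vector stands in for TArray.

```cpp
#include <algorithm>
#include <cstdint>
#include <vector>

// Stand-in for TArray<uint8> (assumption: plain C++, no engine types).
using ByteSequence = std::vector<std::uint8_t>;

// Index of the sequence an actor holds on a given tick when the server
// cycles the sequences (0-based: actor 0 holds sequence 0 on tick 0,
// sequence 1 on tick 1, and so on).
int SequenceIndexForTick(int actorIndex, int tick, int sequenceCount)
{
    return (actorIndex + tick) % sequenceCount;
}

// Client-side validity check: returns the index of the known sequence that
// matches the received bytes exactly, or -1 if none matches, which the test
// would count as a corrupted replication.
int FindMatchingSequence(const std::vector<ByteSequence>& known,
                         const ByteSequence& received)
{
    const auto it = std::find(known.begin(), known.end(), received);
    return it == known.end() ? -1 : static_cast<int>(it - known.begin());
}
```

With this kind of check, a received array that differs from every known sequence by even a single byte is reported as corrupted.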

Results

Testing locally, we got 0 errors in an hour of continuous testing, as expected.

However, testing over the internet we got 2 errors after half an hour (we ran only one test with this setup, but as we saw in our project, the first error can show up after as little as 3 minutes; it’s totally random).

How to reproduce it

  1. Create a new empty c++ project called “NetTest”
  2. Extract in the project folder the attached .rar archive
  3. Build & package it (in development mode)
  4. Prepare to run it on 2 different machines, with an internet connection in between (or with something that can simulate packet corruption)
  5. Launch the game
  6. On server, open the console with the “end” key, and type: “travel TestMap?listen”
  7. On client, open the console with the “end” key, and type: "travel "
  8. The server will print on screen when the test starts (a couple of seconds after the client connects)
  9. The client will continuously show the number of correctly replicated sequences received, as well as the number of corrupted ones.
  10. After one or more corrupted ones are detected, you can open the console and type exit on both client and server, and check the logs: on the server you will see the 4 valid sequences, on the client you will see the corrupted ones received. Most of the time, the difference is in just a single byte.

Attachments

I will attach a .rar with the configs, content and source files. (source attachment)

I can provide you with the packaged version too if needed.

[edit 1]:

Expected behavior

If a packet gets corrupted, the data should be dropped/ignored, and the variable should not be updated.

Final considerations

At this point, I am curious and have a couple of questions:

  • how does Unreal manage corrupted packets?
  • is this an issue specific to TArrays, or can any replicated variable, or even RPCs, be affected?
  • as a workaround, do you think adding a CRC or something similar to the byte sequence could be enough to avoid garbage input on my system?
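As a rough illustration of the CRC idea in the last question, here is a minimal plain C++ sketch that appends a CRC-32 to a byte sequence before sending and verifies it on receipt. The function names (Crc32, AppendChecksum, VerifyAndStripChecksum) are made up for this example, std::vector stands in for TArray, and whether a check like this is enough for the issue described here is an open question.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Bitwise CRC-32 (reflected polynomial 0xEDB88320, the variant used by zlib).
std::uint32_t Crc32(const std::uint8_t* data, std::size_t size)
{
    std::uint32_t crc = 0xFFFFFFFFu;
    for (std::size_t i = 0; i < size; ++i)
    {
        crc ^= data[i];
        for (int bit = 0; bit < 8; ++bit)
        {
            // Shift right; XOR in the polynomial when the dropped bit was 1.
            crc = (crc >> 1) ^ (0xEDB88320u & (0u - (crc & 1u)));
        }
    }
    return ~crc;
}

// Sender side: append the CRC of the payload as 4 little-endian bytes.
void AppendChecksum(std::vector<std::uint8_t>& payload)
{
    const std::uint32_t crc = Crc32(payload.data(), payload.size());
    for (int shift = 0; shift < 32; shift += 8)
    {
        payload.push_back(static_cast<std::uint8_t>((crc >> shift) & 0xFFu));
    }
}

// Receiver side: returns true and strips the 4 trailing bytes if they match
// the CRC of the rest; returns false and leaves the buffer untouched otherwise.
bool VerifyAndStripChecksum(std::vector<std::uint8_t>& buffer)
{
    if (buffer.size() < 4)
    {
        return false;
    }
    const std::size_t payloadSize = buffer.size() - 4;
    std::uint32_t stored = 0;
    for (int i = 0; i < 4; ++i)
    {
        stored |= static_cast<std::uint32_t>(buffer[payloadSize + i]) << (8 * i);
    }
    if (stored != Crc32(buffer.data(), payloadSize))
    {
        return false;
    }
    buffer.resize(payloadSize);
    return true;
}
```

A receiver would call VerifyAndStripChecksum before deserializing and simply ignore the update when it returns false. Note that this only protects the byte sequence itself, not any other replicated fields sent alongside it.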

[edit 2]:

Test result example

Here is an example log result:

Server side (test sequences)

[2016.10.28-19.18.37:109][851]LogTemp:Warning: Sequence 1:
[2016.10.28-19.18.37:109][851]LogTemp:Warning: Sequence data: 107 33 63 183 152 88 178 28 38 143 37 63 235 216 39 188 189 52 175 41 212 129 8 132 37 147 20 123 47 243 196 44 64 24 249 157 206 148 166 120 212 20 114 67 201 151 167 184 53 214 175 107 109 42 239 86 127 240 69 136 208 198 235 141 125 141 22 196 1 153 6 64 38 242 220 103 214 21 0 236 191 23 78 89 100 153 234 191 156 106 237 201 7 70 191 
[2016.10.28-19.18.37:109][851]LogTemp:Warning: Sequence 2:
[2016.10.28-19.18.37:109][851]LogTemp:Warning: Sequence data: 129 165 106 243 28 227 155 92 165 209 178 4 73 240 148 113 78 96 148 29 120 208 53 74 25 105 13 122 94 23 150 32 211 10 46 185 179 248 21 252 1 222 167 59 0 208 173 170 254 160 240 170 17 214 244 204 186 226 15 161 5 227 174 21 32 243 7 109 1 100 12 151 19 209 219 244 127 101 126 62 15 79 57 188 62 190 51 97 109 78 197 10 215 94 46 118 64 26 255 45 235 190 255 238 
[2016.10.28-19.18.37:109][851]LogTemp:Warning: Sequence 3:
[2016.10.28-19.18.37:109][851]LogTemp:Warning: Sequence data: 231 208 225 244 102 32 34 237 111 173 212 183 160 230 107 69 176 102 73 85 68 35 90 49 197 138 224 251 162 89 8 130 231 117 123 53 145 149 142 43 136 245 62 219 159 233 152 127 145 148 188 168 87 131 31 26 191 130 54 102 181 103 53 89 30 30 1 191 175 200 4 130 169 30 13 179 227 0 219 128 5 88 136 163 100 45 105 73 94 43 164 6 166 98 13 216 100 137 125 184 143 138 78 214 108 161 26 237 158 
[2016.10.28-19.18.37:109][851]LogTemp:Warning: Sequence 4:
[2016.10.28-19.18.37:110][851]LogTemp:Warning: Sequence data: 112 213 165 35 152 190 60 230 39 52 124 47 106 96 41 107 111 159 23 219 191 238 8 150 126 49 240 91 10 225 9 4 218 49 148 206 163 28 189 93 143 123 128 162 143 159 252 137 132 120 29 160 98 214 248 141 237 197 222 149 71 143 101 0 229 114 124 155 37 134 200 39 31 227 84 254 225 150 228 172 187 82 228 206 196 142 73 250 62 238 102 253 4 159 252 49 166 156 143 49 98 109 74 195 156 93 110 40 

Client side (wrong sequences received):

[2016.10.28-19.43.28:315][365]LogTemp:Warning: Error found, received sequence:
[2016.10.28-19.43.28:315][365]LogTemp:Warning: Sequence data: 112 213 165 35 102 190 60 230 39 52 124 47 106 96 41 107 111 159 23 219 191 238 90 150 126 49 240 91 10 225 9 4 218 49 148 206 163 28 189 93 143 123 128 162 143 159 252 137 132 120 29 160 98 214 248 141 237 197 222 149 71 143 101 0 229 114 124 155 37 134 200 39 31 227 84 254 225 150 228 172 187 82 228 206 196 142 73 250 62 238 102 253 4 159 252 49 166 156 143 49 98 109 74 195 156 93 110 40 
[2016.10.28-19.43.28:605][383]LogTemp:Warning: Error found, received sequence:
[2016.10.28-19.43.28:605][383]LogTemp:Warning: Sequence data: 112 213 165 35 28 190 60 230 39 52 124 47 106 96 41 107 111 159 23 219 191 238 53 150 126 49 240 91 10 225 9 4 218 49 148 206 163 28 189 93 143 123 128 162 143 159 252 137 132 120 29 160 98 214 248 141 237 197 222 149 71 143 101 0 229 114 124 155 37 134 200 39 31 227 84 254 225 150 228 172 187 82 228 206 196 142 73 250 62 238 102 253 4 159 252 49 166 156 143 49 98 109 74 195 156 93 110 40 

As you can see, both errors are corrupted copies of the same sequence (Sequence 4), each differing from it in two bytes.
This happened about 25 minutes after the test began.

Addendum:

After some other tests on our project, it seems even other data can be affected (we have the binary data array in a replicated USTRUCT, and got wrong data in the other fields as well).

Hey zamy,

I ran your test project for 60,000 replication cycles and haven’t had a “corrupted” sequence.

113177-515898_cycle.png

I have it running through the internet (not LAN).

Hey, thanks for the answer.
So, how can I help you reproduce it?

OK, I was able to reproduce it at will on a local machine, using “clumsy”.

My bad, it seems it isn’t about packet corruption (which in fact causes connection drops or crashes), but about packet drops.

With the given configuration, it generates a lot of errors:

113274-reperrors.png

113275-cumsyconfiguration.png

Hey zamy,

To start, if you are intentionally using third party software to prevent your network from functioning as it normally does, that isn’t a bug with the Unreal Engine; I understand it is for your test but we can’t verify bugs in this fashion.

Secondly, I did send an email out to a networking engineer and got an answer to how UE4 handles packet loss:

If packets are dropped, properties will eventually replicate such that the client will eventually match what the server says.

The caveat is that it’s possible to change property a on frame 1, b on frame 2, and c on frame 3. If all 3 frames’ worth of packets were dropped, frame 4 could deliver all three property values on the same frame, even though they changed on separate frames.

On the other hand, if you change values all on the same frame, they are at a minimum guaranteed to arrive on the same frame (but you might also get other properties combined still as explained above).

This means that even though you are seeing a mismatch between your client and server with the byte arrays, UE4 is set up in a way that will make sure to sync the client to what the server has; it just might take some time for another packet to get through.
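The behavior described above (changes from several dropped frames arriving together once a packet gets through) can be pictured with a toy model. This is not engine code: ToyChannel and its members are invented for illustration, and the model ignores acks, resend timing and everything else the real replication system does.

```cpp
#include <map>
#include <string>
#include <vector>

// Toy "latest state wins" channel. Properties changed on the server stay
// pending until a frame whose packet is not dropped, then arrive together.
struct ToyChannel
{
    std::map<std::string, int> serverState; // authoritative values
    std::map<std::string, int> clientState; // values applied on the client
    std::map<std::string, int> pending;     // changed, not yet delivered

    // Server changes a property; only the latest value is remembered.
    void ServerSet(const std::string& name, int value)
    {
        serverState[name] = value;
        pending[name] = value;
    }

    // Ends a frame and returns the names of the properties the client
    // applied on this frame (empty if the packet was dropped).
    std::vector<std::string> EndFrame(bool packetDropped)
    {
        std::vector<std::string> applied;
        if (packetDropped)
        {
            return applied; // changes stay pending for a later frame
        }
        for (const auto& entry : pending)
        {
            clientState[entry.first] = entry.second;
            applied.push_back(entry.first);
        }
        pending.clear();
        return applied;
    }
};
```

Setting a, b and c on three consecutive dropped frames and then ending a fourth frame without a drop delivers all three at once, after which the client matches the server.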

Lastly, thank you for the detailed write up explaining your issue. We appreciate the time you took to put that together so we can investigate your problem.

Thank you for submitting a bug report, however at this time we believe that the issue you are describing is not actually a bug with the Unreal Engine, and so we are not able to take any further action on this. If you still believe this may be a bug, please provide steps for us to reproduce the issue, and we will continue our investigation.

Thanks for the answer.

The problem is, I have one single property that gets half replicated.
I know about the possible delay and/or skipped replication of a change in a frame, and I was prepared to account for that.

But how can I distinguish a fully replicated property (even if I get the event 3 times in a row in a single frame) from a half replicated property (thus full of dangerous garbage)?

What property are you trying to replicate? Can you give an example of how much data there is in and how often you are replicating it?

Thanks.

Sure, here’s an example:

USTRUCT()
struct FEntitySerializedState
{
	GENERATED_USTRUCT_BODY()
	UPROPERTY()
	int32 CommandFrameNumber;
	UPROPERTY()
	int32 IdNumber;
	UPROPERTY()
	int32 OwnerNumber;
	UPROPERTY()
	UClass* EntityClass;
	UPROPERTY()
	TArray<uint8> Data;
};

Data contains binary data that can vary from 48 to about 100 bytes at the moment.

A little explanation: this is some data that I may use on clients to fix some simulation errors, and I don’t care when or if it is sent. In fact, a variable of this type currently lives on a dedicated actor with a low NetPriority and a low MinNetUpdateFrequency.

The basic idea is: if we have bandwidth to spare, let Unreal send this data too; otherwise it doesn’t matter.
This system is working as expected, except that sometimes (rarely) I receive an OnRep notification for the data update but the struct was only half replicated, resulting in garbage data (not only in the binary data; sometimes the whole struct made no sense at all), and that means tons of errors and crashes.

So at this point the question changes to: if I can’t trust a replicated variable to contain coherent, even if old, data (and honestly that sounds really strange to me), is there a way for me to detect in the OnRep event that the replicated variable is garbage?

Note: you may wonder why I use a system like this instead of simply using replicated actors. It’s because of my constraints: I would need literally over 5MB/s of bandwidth per client with a pure client/server approach, and the test obviously didn’t work well, so we switched to a deterministic simulation. The data given above is used, when needed, to fix errors in the simulation (the server is still authoritative).

Hey again,

If you changed all the properties in a single frame, it should all replicate at the same time.

A lot of times, I will do something like:

FEntitySerializedState SerialState;
SerialState.CommandFrameNumber = GetCommandFrameNumber( );
SerialState.IdNumber = GenerateIDNumber( );
SerialState.OwnerNumber = GetOwningCharacter( )->GetNumber( );
SerialState.EntityClass = GetEntityClass( );

for( int i = 0; i < CurrentData.Num( ); i++ )
{
	SerialState.Data.Add( CurrentData[ i ] );
}

if( Role == ROLE_Authority )
{
	//Actual replication point (ReplicatedSerialState being what is replicated)
	ReplicatedSerialState = SerialState;
}

Furthermore, what part of the struct isn’t being replicated? I am told that there could be an issue with the EntityClass pointer, with the following explanation:

One exception is pointers to other replicated objects (the EntityClass). If those aren’t valid yet on the client, it could be a while until they come through.

Hey,
the problem isn’t that the stuff gets replicated late, it’s that it gets replicated with wrong values.

So yes, it should all replicate at the same time, but I opened this bug report because that is not what happens when there is packet loss: instead, the struct is replicated partially and/or contains garbage data. To be clear: I would expect that if a packet is lost, the data simply doesn’t get replicated, instead of giving me an OnRep event with invalid data.

And the problem is not the Class pointer.

Example (made up values):
on the server I set

CommandFrameNumber = 10
IdNumber = 1463
OwnerNumber = 1
EntityClass = ClassA
Data = [ 0 0 1 2 3 0 ]

on the client I receive (when it happens) something like this:

CommandFrameNumber = 5267 <- totally wrong
IdNumber = 1463
OwnerNumber = 143      <- totally not valid and impossible, atm i can have only [0-3] or [254-255]
EntityClass = ClassA
Data = [ 0 3 1 2 3 0 ]   <- sometimes only a byte is different, sometimes nearly the whole array

I hope it’s clearer now.

And yes, I’m sure it’s not data from another frame, since I logged them all: the garbage data received was never sent, and it contains totally invalid values (CommandFrameNumber totally out of range, i.e. negative or too big, OwnerNumber totally impossible, and so on).
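The "impossible values" mentioned above suggest a cheap plausibility check on receipt. The following is a plain C++ sketch under stated assumptions, not engine code: EntityStateView is a hypothetical mirror of the struct's numeric fields, the owner ranges come from the example above, the lower size bound comes from the 48 to about 100 bytes mentioned earlier, and the upper size bound and the allowed frame distance are made-up tuning values.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical plain C++ mirror of FEntitySerializedState (EntityClass is
// left out because it cannot be range-checked this way).
struct EntityStateView
{
    std::int32_t CommandFrameNumber;
    std::int32_t IdNumber;
    std::int32_t OwnerNumber;
    std::vector<std::uint8_t> Data;
};

// Assumed payload bounds: 48 is the stated minimum, 128 is a guess that
// leaves some room above "about 100 bytes".
constexpr std::size_t kMinDataSize = 48;
constexpr std::size_t kMaxDataSize = 128;

// Owner values stated as valid in the example: [0-3] or [254-255].
bool IsPlausibleOwner(std::int32_t owner)
{
    return (owner >= 0 && owner <= 3) || owner == 254 || owner == 255;
}

// Returns false for states that cannot have been sent by the server.
// maxFrameDistance is an assumed limit on how far the received frame number
// may be from the client's own current frame.
bool LooksPlausible(const EntityStateView& state,
                    std::int32_t localFrame,
                    std::int32_t maxFrameDistance)
{
    if (state.CommandFrameNumber < 0)
    {
        return false;
    }
    // 64-bit arithmetic so the subtraction cannot overflow.
    const std::int64_t distance =
        static_cast<std::int64_t>(state.CommandFrameNumber) - localFrame;
    if (distance > maxFrameDistance ||
        distance < -static_cast<std::int64_t>(maxFrameDistance))
    {
        return false;
    }
    if (!IsPlausibleOwner(state.OwnerNumber))
    {
        return false;
    }
    return state.Data.size() >= kMinDataSize && state.Data.size() <= kMaxDataSize;
}
```

A check like this only rejects values that are out of range. It cannot catch a wrong but in-range value, such as a single changed byte in Data; detecting that would need a checksum computed by the sender.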

Thank you for the help so far.

Addendum:
I successfully reproduced it locally, this time without external tools.

In DefaultEngine.ini, add the following lines:

[PacketSimulationSettings]
PktLag=50
PktLagVariance=0
PktLoss=10
PktOrder=0
PktDup=0

This way, on the net test project, I got 3 wrong replications on localhost within 1600 replications.

Hey again,

I was able to recreate the issue in a project on my end. The result is pretty inconsistent, but it does appear to be an issue.

I have created an issue report for it, which you can follow here:

https://issues.unrealengine.com/issue/UE-38148

Thanks for the report.

Thanks, and thanks for the patience :)

This is a bit beside the point of the problem itself, but I want to point it out nonetheless. You said “if you are intentionally using third party software to prevent your network from functioning as it normally does”; this is incorrect. If someone uses third-party software to simulate poor network conditions, that is still totally valid network behavior. People may have a bad network; it happens. Simulating such scenarios just makes it easier to spot issues. So before dismissing third-party software, it is always important to research what kind of software it is and how it affects Unreal Engine.

My apologies for waking a dead thread; I came across it while looking for a solution to my own problem.

I had the same problem. It did not happen when the dedicated servers were installed on the local network.
However, it was a frequent occurrence when we installed dedicated servers in foreign countries.
I installed a dedicated server locally and easily reproduced it with “net pktloss = 30”.
It seems to occur not only with TArray but also with other replicated property types.
This is a big problem; it makes running a game service impossible.
Why wait until 4.17?