FSocket: problems with GetConnectionState and Send?

Hey Everyone !

I have two problems with the FSocket class, and I don’t know if they are related:

  • The GetConnectionState function does not seem too reliable.

  • The Send function needs a call to GetConnectionState to work properly.

Here is how it goes.

In order to share pictures (picked from the hard drive) between the server and clients, I have written a small file server.

When using the following code:

ESocketConnectionState SocketState = Socket->GetConnectionState();

bool SendOK = Socket->Send(FullMessage.GetData(), FullMessage.Num(), MessageSent);

// Debug
if (SocketState == ESocketConnectionState::SCS_Connected) GEngine->AddOnScreenDebugMessage(-1, InfoDisplayTime, FColor::Blue, TEXT("SCS_Connected"));
if (SocketState == ESocketConnectionState::SCS_NotConnected) GEngine->AddOnScreenDebugMessage(-1, InfoDisplayTime, FColor::Blue, TEXT("SCS_NotConnected"));

GEngine->AddOnScreenDebugMessage(-1, InfoDisplayTime, FColor::Blue, TEXT("FullMessage.Num:") + FString::FromInt(FullMessage.Num()));
GEngine->AddOnScreenDebugMessage(-1, InfoDisplayTime, FColor::Blue, TEXT("MessageSent:") + FString::FromInt(MessageSent));

I get this output (look only at the blue lines):

67529-socketnotconnectedmessagesentdetail.png

Notice that the state is SCS_NotConnected, yet the message is fully sent (MessageSent == FullMessage.Num()). That’s the first problem.

Now, when using the following code:

//ESocketConnectionState SocketState = Socket->GetConnectionState();

bool SendOK = Socket->Send(FullMessage.GetData(), FullMessage.Num(), MessageSent);

// Debug
//if (SocketState == ESocketConnectionState::SCS_Connected) GEngine->AddOnScreenDebugMessage(-1, InfoDisplayTime, FColor::Blue, TEXT("SCS_Connected"));
//if (SocketState == ESocketConnectionState::SCS_NotConnected) GEngine->AddOnScreenDebugMessage(-1, InfoDisplayTime, FColor::Blue, TEXT("SCS_NotConnected"));

GEngine->AddOnScreenDebugMessage(-1, InfoDisplayTime, FColor::Blue, TEXT("FullMessage.Num:") + FString::FromInt(FullMessage.Num()));
GEngine->AddOnScreenDebugMessage(-1, InfoDisplayTime, FColor::Blue, TEXT("MessageSent:") + FString::FromInt(MessageSent));

I get this output:

67530-nogetstatefaildetail.png

You can see that the message has not been sent (MessageSent = -1), yet all I’ve done is comment out the call to GetConnectionState(). That’s the second problem.

It’s fairly reproducible: as long as I don’t call GetConnectionState() right before the Send, the Send never works. As soon as I add it, it works often but not always (which might be due to a slow or unreliable network: I’m doing this over the internet).
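One thing worth ruling out when comparing MessageSent with FullMessage.Num() is a partial send: a stream socket is allowed to accept only part of a buffer per call. Here is a minimal sketch of a send-until-done loop, written against the plain POSIX send() call rather than FSocket. The helper name SendAll is hypothetical and not part of the engine, and it assumes a blocking socket.

```cpp
#include <cassert>
#include <cerrno>
#include <cstddef>
#include <cstdint>
#include <sys/socket.h>
#include <sys/types.h>

// Hypothetical helper (not an engine API): keeps calling send() until Count
// bytes have been written or an unrecoverable error occurs.
// Returns the number of bytes sent, or -1 on error.
inline long SendAll(int SocketFd, const std::uint8_t* Data, std::size_t Count)
{
    assert(Data != nullptr || Count == 0);

    std::size_t TotalSent = 0;
    while (TotalSent < Count)
    {
        const ssize_t Sent = send(SocketFd, Data + TotalSent, Count - TotalSent, 0);
        if (Sent < 0)
        {
            if (errno == EINTR)
            {
                continue; // interrupted by a signal, simply retry
            }
            return -1; // real error (a non-blocking socket would also need EAGAIN handling)
        }
        TotalSent += static_cast<std::size_t>(Sent);
    }
    return static_cast<long>(TotalSent);
}
```

This does not explain a return value of -1 by itself, but it separates "nothing was sent" from "only part was sent" when reading the debug output.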

Am I using those functions incorrectly, or is there a problem with them?

Thanks !

Hi,

Do you happen to have a small sample project that demonstrates this issue? I would like to make sure I don’t miss any small details in how you have this working.

Hey,

Thanks, I’ll try to prepare a mini project and send you the repro steps.
This project is not open source though; how can I send you a download link privately?

Thanks

Sorry, I normally cover that if I ask for a sample project. Not sure why I didn’t this time. If you need to provide the project privately, you can send me a private message on the forums with a link to download it.

Hi,

Here is my dxdiag as promised in our PM.

Cheers

Hi uced,

Sorry for the delay on this issue. I watched your video a couple times, and downloaded the sample project that you provided. I ran it through Visual Studio’s debugger, but the results I received were a little different from what you showed initially. What I ended up seeing was the same thing you show later in the video: The image on the server changes, but the image on the clients does not. I did notice that the debug messages that I saw included the blue and green lines, but the black Cvtp Listener messages did not appear. Am I missing something in the setup?

Hey,

No problem, this case is not an easy one.

My full answer is 910 chars too long for this reply, so I will send it to you through PM.

But here is the picture I am referring to in my answer.

Cheers

Hi,

I just wanted to provide a quick update on this issue. I am still looking into where the disconnect may be happening in the communications between the server and clients. There is obviously something that is not working correctly, but I have not been able to identify where that is occurring.

Hey,

Thanks a lot for the update, nice to know you’re on it.

Needless to say, if you need any extra help/info from me I’ll be happy to provide it.

On my side, I have fixed two other critical bugs in my code (yay !!), and this one is the last remaining one in the current state of the game.

So it’s now at the top of my priority list.

Whenever I find the time in the coming days I’ll try to seriously sniff/analyse what’s going on network-wise.

Naturally I’ll keep you posted if I find anything relevant in this line of research.

Cheers and happy Xmas

Hi,

Sorry for the delay on this. We are still working on it. I am trying to coordinate some time with another one of our engineers to go over the sample project that you provided in detail, but he is unfortunately rather busy right now.

Hey,

Thanks for the update, I’m still on it too (we play every Tuesday evening with my friends and this is a very annoying problem for us, so I work on it as much as I can).

I tried to sniff my network with Wireshark but my limited network skills didn’t allow me to get any useful info out of it.

The only difference I could notice between “it works” and “it doesn’t work” is that the data packets are present when it works and absent when it doesn’t, which is kind of obvious. But I couldn’t identify any reason why (like a long time between packets and so on). So this was a dead end.

Next I noticed that when it didn’t work and I tried to use the “reupload” or “redownload” buttons in the “pawn properties”, I very quickly got 100% CPU usage and a crash.

As those two buttons essentially recreate a “PUT” or “GET” cvtpRequest, I went through my code in detail and found a huge bug!

The Stop() function, which deletes and cleans up the socket, was only called in case of success, never in case of failure.

I added a correction at the end of the SendGet and SendPut functions in the CvtpRequest.cpp, so now both functions end like this:

// clean and close socket
bSucceeded = false;
Stop();

return false;

You might want to make the same corrections for your tests, as this is obviously a bug on my side (if the two places where you should add those lines are not clear, please let me know and I’ll send you more details).
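For anyone hitting the same "cleanup only on success" bug, one way to make sure Stop() runs on every exit path is a small scope guard. This is only a sketch in plain C++: ScopeGuard and FakeRequest are hypothetical stand-ins, not the actual CvtpRequest code.

```cpp
#include <cassert>
#include <functional>
#include <utility>

// Minimal scope guard: runs the stored callback when it goes out of scope,
// whichever return statement is taken.
class ScopeGuard
{
public:
    explicit ScopeGuard(std::function<void()> InOnExit)
        : OnExit(std::move(InOnExit))
    {
        assert(static_cast<bool>(OnExit)); // a guard without a callback is a mistake
    }

    ~ScopeGuard() { OnExit(); }

    ScopeGuard(const ScopeGuard&) = delete;
    ScopeGuard& operator=(const ScopeGuard&) = delete;

private:
    std::function<void()> OnExit;
};

// Hypothetical stand-in for a request such as SendGet/SendPut.
struct FakeRequest
{
    int StopCalls = 0;

    void Stop() { ++StopCalls; } // the real version would close and delete the socket

    bool Run(bool bTransferOk)
    {
        ScopeGuard Cleanup([this] { Stop(); });
        if (!bTransferOk)
        {
            return false; // failure path: Stop() still runs
        }
        return true; // success path: Stop() runs here too
    }
};
```

With this shape there is no need to remember to call Stop() before each return.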

Using my two computers, I still get some network errors (so I think our problem is still there), but at least now I can use the “reupload” and “redownload” buttons, which should provide a workaround.

I haven’t yet had the opportunity to test this with my friends (all across France, a few hundred km away); I’ll test that next time we play all together, hopefully next Tuesday.

Naturally I’ll let you know how it goes with this correction, i.e. whether or not the images travel well and (if still not) whether or not the workaround works.

Cheers

Hey,

I posted my reply as an answer because when trying to “reply” or “add comment”, hitting the “comment” button did nothing (although there were 7 characters left). But my post is not an answer to the problem.

Cheers

Hey,

I ran some tests with 5 friends yesterday and the news is not so good.

My server didn’t crash, which let me try to repost pictures many times.

CPU usage still climbed to 100% though, and I have a hunch that the requests remain stuck during data transfer.

There were 5 clients connected to my server from all across France.

Clients got a “download failed” message in 80% of tests (including when trying to “repost”).

One interesting thing is that sometimes, when changing a picture twice, the fail messages arrived in reverse order on some clients.

For example, I assign picture A, then picture B, and some clients display “download failed” for B before displaying it for A.

So the CvtpRequest might be stuck during data sending and never stop. That would explain why my CPU hit 100% usage (but never crashed, as I now clean them up when they stop).

Also, even the Get requests (blue message with 203 characters) often fail (displaying “-1” instead of “203”), which tells us that it is not a size problem.

In short, the problem is still there.

I’m going on with my tests/trials and will let you know about anything I find.

Cheers

I wanted to give an update here. I’ve tried to reproduce this in the sample project, and I’m currently getting the results I expect, i.e., when Socket->GetConnectionState is called, it returns SCS_Connected, and the message sends correctly. When I remove the line that calls GetConnectionState, it still sends all of the data correctly.

I’ll try to keep playing with it to see if I can repro, but so far no luck :frowning:

Hey,

Thanks a lot for having a look and sorry for my late answer, I was very busy with another bug on my side :slight_smile:

I reran some tests (4.11p4) and it’s fairly reproducible at my home. Did you watch the video I sent?

You can download it here:

http://yagame.fr/wp-content/uploads/YagCurrentVersion/PbSocket_20151126-185422.webm

It is essential that you use a public IP for those tests, as it works very well when using localhost (127.0.0.1) or the LAN IP (192.168.x.x).

I can repro the test shown in my video at will, 100% of the time; I just redid it 5 minutes ago (exactly following the instructions in the video).

Now some more info: even when I call GetConnectionState() before the socket::send() function, it works only partially.

Localhost and the LAN IP work 100% of the time.
Now, when using my ISP IP (public IP), here are the results I get (this time, GetConnectionState() is being called every time before socket::send()):

  • it works 100% of the time when having both server and clients on one computer
  • it works about 70% of the time when having the client on my second computer (a few meters away from the server)
  • it works about 20% of the time when having the client very far away (playing with 5 friends, everyone being a few hundred km away from each other)

So it can work, but this is clearly a case where distance (over the net) seems to matter a lot.

I wanted to have a look at the socket::send() function (looking for some sort of hidden timeout somewhere) but it’s frighteningly short (only one call to UE_LOG !!!) so I can’t even imagine how it works :slight_smile:

(I know nothing about UE4 macros though).

This issue is quite important to me, please let me know if there is anything I can do to help you help me :slight_smile:

Cheers

Hi,

Still no solution, but maybe an interesting update.

First, I did some debugging. It seems that when I don’t use the GetConnectionState function and use my public IP, the Pending variable in CvtpServer::ListenToConnection() is never set to true, which is why no messages are sent: there is a connection problem.

I still don’t understand why though, as it works well with the LAN IP (192.168) and obviously localhost.

Now here is an interesting part.

When I use the GetConnectionState function, it works (at least when distances are not too long), but GetConnectionState returns wrong results.

In the following capture, you can see that both the GET and PUT messages succeed (the image has been successfully requested and returned), but in the first case GetConnectionState returned a “Not Connected” state, which is clearly wrong.

79242-getconnectionstatewrong.png

I had a look at the code in SocketsBSD.cpp and I could see that GetConnectionState calls HasState with a hard-coded timeout of 1 ms.

I think this is much too short in a large network context.

It might be a good idea to give the user back the choice of a reasonable timeout.

I think that could be a bug in GetConnectionState, which currently can clearly return a wrong state.

That said, it doesn’t help much with my problem. I tried to follow the HasState function, but ultimately it calls the WinSock2 select() function, for which the code is unavailable.

So I still have no idea why the call to GetConnectionState allows my image to travel over short distances.

I suspect a timeout problem here (that would explain the importance of distance) but can’t find it anywhere at the moment.

Going on with tests and investigation, I’ll keep you posted if I find anything interesting.

Cheers

Oh, I might have it.

The HasPendingConnection() function makes a call to HasState with no timeout specified (the following code is from SocketsBSD.cpp):

bool FSocketBSD::HasPendingConnection(bool& bHasPendingConnection)
{
	bool bHasSucceeded = false;
	bHasPendingConnection = false;

	// make sure socket has no error state
	if (HasState(ESocketBSDParam::HasError) == ESocketBSDReturn::No)
	{
		// get the read state
		ESocketBSDReturn State = HasState(ESocketBSDParam::CanRead);
		
		// turn the result into the outputs
		bHasSucceeded = State != ESocketBSDReturn::EncounteredError;
		bHasPendingConnection = State == ESocketBSDReturn::Yes;
	}

	return bHasSucceeded;
}

So when the response time becomes longer than the unspecified timeout, bHasPendingConnection is set to false and my CvtpServer doesn’t see the pending connection.

Can you confirm that this might be the problem?

If so, it would be a good thing to restore the timeout as a parameter for network functions that use HasState().
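To illustrate what a check without a wait time does, here is a small sketch using plain POSIX select() on a local socket pair. It assumes that "no timeout specified" means a zero wait, which I have not verified in the engine source; CanReadNow and DemoZeroTimeoutPoll are hypothetical names. A zero-timeout check only reports the state at that instant, so it has to be repeated until something arrives.

```cpp
#include <cassert>
#include <sys/select.h>
#include <sys/socket.h>
#include <sys/time.h>
#include <unistd.h>

// Hypothetical helper: asks "is there something to read right now?" without waiting.
// On a listening socket, "readable" is how a pending connection shows up.
inline bool CanReadNow(int SocketFd)
{
    assert(SocketFd >= 0 && SocketFd < FD_SETSIZE);

    timeval Zero = {0, 0}; // zero timeout: select() returns immediately
    fd_set ReadSet;
    FD_ZERO(&ReadSet);
    FD_SET(SocketFd, &ReadSet);

    return select(SocketFd + 1, &ReadSet, nullptr, nullptr, &Zero) > 0;
}

struct FPollResult
{
    bool bBeforeData;
    bool bAfterData;
};

// Polls one end of a local socket pair before and after the other end writes.
inline FPollResult DemoZeroTimeoutPoll()
{
    FPollResult Result = {false, false};
    int Fds[2];
    if (socketpair(AF_UNIX, SOCK_STREAM, 0, Fds) != 0)
    {
        return Result;
    }

    Result.bBeforeData = CanReadNow(Fds[1]); // nothing has been sent yet

    const char Byte = 'x';
    if (write(Fds[0], &Byte, 1) == 1)
    {
        Result.bAfterData = CanReadNow(Fds[1]); // the byte is now queued
    }

    close(Fds[0]);
    close(Fds[1]);
    return Result;
}
```

The first poll reports "not readable" simply because nothing has arrived yet, not because anything is broken.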

Cheers

No, apparently I was quite wrong :smiley:

I defined a class inheriting from FSocketBSD and made a copy of HasPendingConnection with some timeout:

bool FYagSocket::HasPendingConnection2(bool& bHasPendingConnection)
{
	bool bHasSucceeded = false;
	bHasPendingConnection = false;

	// make sure socket has no error state
	if (HasState(ESocketBSDParam::HasError, FTimespan::FromMilliseconds(10)) == ESocketBSDReturn::No)
	{
		// get the read state
		ESocketBSDReturn State = HasState(ESocketBSDParam::CanRead, FTimespan::FromMilliseconds(10));

		// turn the result into the outputs
		bHasSucceeded = State != ESocketBSDReturn::EncounteredError;
		bHasPendingConnection = State == ESocketBSDReturn::Yes;
	}

	return bHasSucceeded;
}

Well, not only does it not solve the problem, it makes it worse: since this function is called many times per second, it slows the program to death.
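A rough back-of-the-envelope check of why this hurts: each call above can wait up to 10 ms twice (once for HasError, once for CanRead) when nothing is pending. The call rate of 60 per second below is only an assumed figure for illustration, and the function name is hypothetical.

```cpp
// Worst-case time (in ms) spent blocked per second when a check that waits
// up to TimeoutMs, WaitsPerCall times, is run CallsPerSecond times.
constexpr int WorstCaseBlockedMsPerSecond(int CallsPerSecond, int WaitsPerCall, int TimeoutMs)
{
    return CallsPerSecond * WaitsPerCall * TimeoutMs;
}

// Assumed 60 calls per second, 2 waits of 10 ms each: more than a full second
// of waiting for every second of wall-clock time.
static_assert(WorstCaseBlockedMsPerSecond(60, 2, 10) == 1200,
              "60 calls * 2 waits * 10 ms = 1200 ms");
```

So even a small per-call wait adds up quickly when the check sits in a loop that runs every frame.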

So wrong call^^

Going on…

Cheers

All right, got it this time, at least over small distances.

I think there is a problem with FSocketBSD::GetConnectionState.

The timeouts used when calling HasState are too small, and they lead to a wrong answer (as seen above).

In the following lines (from FSocketBSD::GetConnectionState) at least 10 or 100 ms should be given, and ideally this should be a function argument so the dev can choose what best fits his/her needs.

// get the write state
ESocketBSDReturn WriteState = HasState(ESocketBSDParam::CanWrite, FTimespan::FromMilliseconds(1)); // 1 is too small
ESocketBSDReturn ReadState = HasState(ESocketBSDParam::CanRead, FTimespan::FromMilliseconds(1));

If this sort of correction is not planned, here is the workaround I found in case anyone runs into this problem.

This will work only on systems allowing BSD sockets, as I bypassed all the safety checks defined in platform.h.

I created a class inheriting from FSocketBSD and containing only one custom method to check the connection of the socket.

It’s enough for me and obviously won’t be for everybody, but if you are reading this you’ll get the point.

Again, I know nothing about network programming so this might be awful, maybe even wrong. Still, the thing works for me currently.

So here is the full code:

.h:
#pragma once

#include "Runtime/Sockets/Private/BSDSockets/SocketSubsystemBSD.h"
#include "Runtime/Sockets/Private/BSDSockets/SocketsBSD.h"


// see FSocketBSD code
class FYagSocket
	: public FSocketBSD
{
public:

	FYagSocket(SOCKET InSocket, ESocketType InSocketType, const FString& InSocketDescription, ISocketSubsystem * InSubsystem)
		: FSocketBSD(InSocket, InSocketType, InSocketDescription, InSubsystem)
	{ }

	virtual ~FYagSocket()
	{
		Close();
	}

	bool YagIsConnected(FTimespan WaitTime);
};

-------------------------------------------------------
.cpp
#include "yag.h"
#include "FYagSocket.h"

// returns true if connection is ok
bool FYagSocket::YagIsConnected(FTimespan WaitTime)
{
	// convert WaitTime to a timeval
	timeval Time;
	Time.tv_sec = (int32)WaitTime.GetTotalSeconds();
	Time.tv_usec = WaitTime.GetMilliseconds() * 1000;

	fd_set SocketReadSet;
	fd_set SocketWriteSet;

	FD_ZERO(&SocketReadSet);
	FD_ZERO(&SocketWriteSet);

	FD_SET(Socket, &SocketReadSet);
	FD_SET(Socket, &SocketWriteSet);

	return select(Socket + 1, &SocketReadSet, &SocketWriteSet, NULL, &Time) > 0;
}

And here is how I use it:

// check connection and send message
bool bSucceeded = Socket->YagIsConnected(FTimespan::FromMilliseconds(100)) && Socket->Send(FullMessage.GetData(), FullMessage.Num(), MessageSent);
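One caveat for readers: select() returns a positive count as soon as the socket is ready in either set, so a check like the one above is also true for a socket that is merely writable. Here is a sketch in plain POSIX C++ that reports the two states separately; FReadiness, QueryReadiness and DemoIdlePair are hypothetical names, not engine code.

```cpp
#include <cassert>
#include <chrono>
#include <sys/select.h>
#include <sys/socket.h>
#include <sys/time.h>
#include <unistd.h>

struct FReadiness
{
    bool bCanRead;
    bool bCanWrite;
};

// Hypothetical helper: waits up to WaitTime and reports read and write
// readiness separately instead of collapsing them into one bool.
inline FReadiness QueryReadiness(int SocketFd, std::chrono::milliseconds WaitTime)
{
    assert(SocketFd >= 0 && SocketFd < FD_SETSIZE);

    // split the wait time into whole seconds plus leftover microseconds
    timeval Time;
    Time.tv_sec = static_cast<decltype(Time.tv_sec)>(WaitTime.count() / 1000);
    Time.tv_usec = static_cast<decltype(Time.tv_usec)>((WaitTime.count() % 1000) * 1000);

    fd_set ReadSet;
    fd_set WriteSet;
    FD_ZERO(&ReadSet);
    FD_ZERO(&WriteSet);
    FD_SET(SocketFd, &ReadSet);
    FD_SET(SocketFd, &WriteSet);

    FReadiness Result = {false, false};
    if (select(SocketFd + 1, &ReadSet, &WriteSet, nullptr, &Time) > 0)
    {
        Result.bCanRead = FD_ISSET(SocketFd, &ReadSet) != 0;
        Result.bCanWrite = FD_ISSET(SocketFd, &WriteSet) != 0;
    }
    return Result;
}

// An idle local socket pair: there is room to write, but nothing to read.
inline FReadiness DemoIdlePair()
{
    int Fds[2];
    if (socketpair(AF_UNIX, SOCK_STREAM, 0, Fds) != 0)
    {
        return FReadiness{false, false};
    }
    const FReadiness Result = QueryReadiness(Fds[0], std::chrono::milliseconds(10));
    close(Fds[0]);
    close(Fds[1]);
    return Result;
}
```

Whether "writable" is a good enough definition of "connected" depends on what the caller needs, which is the same caution as above about this workaround.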

I still haven’t tested it over large distances, but locally (in my home) it works well with 100 ms.

I’ll update here if it doesn’t work on a large scale, but for the moment I’ll mark this solved, as I apparently understood the origin of the problem and, at least locally, found a workaround.

Hope this helps !

Cheers

Hi,

John looked over the information that you provided, and he thinks that you may be on to something. Increasing the time should be reasonable, just don’t extend it too much (the connection state is checked only once every 5 seconds, so making sure it times out before then is necessary).

John did mention that there was some additional weirdness occurring here that he wants to investigate further. He is also looking into the possibility of exposing the time to the caller so you can decide what time to use, and have a reasonable default value that will be used if none is specified.
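As a rough illustration of those two constraints (a default when the caller passes nothing, and an upper bound below the 5 second check interval), a wait-time helper could look like the sketch below. ChooseWaitTime is a hypothetical name; the 1 ms default mirrors the value currently hard-coded in GetConnectionState, and none of this is the engine's actual API.

```cpp
#include <algorithm>
#include <cassert>
#include <chrono>

using Ms = std::chrono::milliseconds;

// Hypothetical helper: returns the wait time to use for a connection check.
// Falls back to a 1 ms default and never reaches the check interval.
inline Ms ChooseWaitTime(Ms Requested = Ms(1), Ms CheckInterval = Ms(5000))
{
    assert(CheckInterval.count() > 0);

    const Ms Upper = CheckInterval - Ms(1);                 // must time out before the next check
    return std::min(std::max(Requested, Ms(0)), Upper);    // clamp into [0, Upper]
}
```

For example, a caller asking for 100 ms gets 100 ms, while a request for 10 seconds is cut down to just under the 5 second interval.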

We appreciate your help and patience with this issue.