StringToBytes/BytesToString broken for some characters

Haya,
I’m trying to use StringToBytes to convert a string into a byte array so I can encrypt it, but some Unicode characters seem to overflow, so calling BytesToString on the output byte array gives back a different string. It looks like StringToBytes assumes that your string is UTF-8, but FString can support UTF-16, so any UTF-16 characters seem to break it.

The character I tested with is ™.

This line seems to be the problem:

OutBytes[ NumBytes ] = (int8)(*CharPos - 1);

Since the function is shoving wchar_t’s into an int8, which isn’t big enough to store all the values.

Hello mrooney,

I have a few questions for you that will help narrow down what issue it is that you are experiencing.

Quick questions:

  1. Can you reproduce this issue in a clean project in the 4.12.5 version of the engine?
  2. If so, could you provide a detailed list of steps to reproduce this issue on our end?
  3. Could you provide any code and/or screenshots of any blueprints that may be involved?

Haya,

  1. Not able to test on 4.12.5, but looking at the code in UnrealString.h, I can see that it hasn’t been fixed yet.

2/3. If you pop this code into an exec function you should be able to see what I’m talking about:

    FString inString(TEXT("TestString™"));
    uint32 size = inString.Len();
    TArray<uint8> data;
    data.AddUninitialized(size);
    StringToBytes(inString, data.GetData(), size);

    FString outString = BytesToString(data.GetData(), size);

    ensureMsgf(inString == outString, TEXT("In String: \"%s\" Out String: \"%s\""), *inString, *outString);

inString and outString should be the same string, but inString is “TestString™” and outString is “TestString”" (note extra quotes).
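
The extra quote falls out of the arithmetic. Below is a minimal standalone sketch in standard C++ (not engine code), assuming the conversion behaves the way the quoted line suggests: subtract 1 and keep only 8 bits on the way in, then widen and add 1 on the way out. The names `SimStringToBytes` and `SimBytesToString` are hypothetical stand-ins, and `std::u16string` stands in for FString.

```cpp
#include <cstdint>
#include <string>
#include <vector>

// Hypothetical stand-in for StringToBytes: subtract 1 from each 16-bit
// code unit, then keep only the low 8 bits (this is the lossy step).
std::vector<uint8_t> SimStringToBytes(const std::u16string& In)
{
    std::vector<uint8_t> Out;
    Out.reserve(In.size());
    for (char16_t Ch : In)
    {
        Out.push_back(static_cast<uint8_t>(Ch - 1));
    }
    return Out;
}

// Hypothetical stand-in for BytesToString: widen each byte and add 1 back.
std::u16string SimBytesToString(const std::vector<uint8_t>& In)
{
    std::u16string Out;
    Out.reserve(In.size());
    for (uint8_t Byte : In)
    {
        Out.push_back(static_cast<char16_t>(Byte + 1));
    }
    return Out;
}
```

With these stand-ins, ™ (U+2122) becomes 0x2121 after the subtraction, is truncated to 0x21, and comes back as 0x22, which is the ASCII double quote.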

Hey mrooney-

Thank you for the sample code, I was able to reproduce the issue and have logged a report for it here (Unreal Engine Issues and Bug Tracker (UE-33889)). You can track the report’s status as the issue is reviewed by our development staff.

Cheers

Oh snap public issue tracking!


Thanks much for your follow up :)

I’ve done some looking into this issue, and here are a few notes that I’ve come up with.

  1. The sample code here is not allocating enough space to hold the ™. inString.Len() only returns the number of characters in the string. If you were to store the string with two bytes per character, your array would need to be inString.Len() * sizeof(TCHAR).
  2. StringToBytes casts the TCHAR characters to an int8 which is why it is losing data when converting ™.
  3. On the BytesToString side of things, only a single byte at a time is used to set the characters of the FString.

This leaves you with the question of, “How should this be fixed?” From reading the documentation, it looks like Unreal strings are stored as UCS-2 internally, which I believe means 2 bytes per character. I’m far from an expert on Unicode, but I believe the difference from UTF-16 here is that FString does not support two 16-bit units combining into a single character (surrogate pairs). That means that StringToBytes could be updated to simply store two bytes per character. That would also mean that BytesToString would need to be updated to handle this. Herein lies the problem of backwards compatibility, because anyone who had used StringToBytes prior to this change and saved the result to disk would find that they can no longer use BytesToString on it.
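
The two-bytes-per-character idea can be sketched in standard C++ as follows. This is only an illustration under stated assumptions, not a proposed engine patch: `WideStringToBytes` and `WideBytesToString` are hypothetical names, `std::u16string` stands in for FString, the low byte is stored first (an arbitrary choice for the sketch), and the -1/+1 offset is left out.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <string>
#include <vector>

// Hypothetical helper: store each 16-bit code unit as two bytes, low byte first.
std::vector<uint8_t> WideStringToBytes(const std::u16string& In)
{
    std::vector<uint8_t> Out;
    // The buffer needs Len() * sizeof(unit) bytes, not Len() bytes.
    Out.reserve(In.size() * sizeof(char16_t));
    for (char16_t Ch : In)
    {
        Out.push_back(static_cast<uint8_t>(Ch & 0xFF));
        Out.push_back(static_cast<uint8_t>(Ch >> 8));
    }
    return Out;
}

// Hypothetical helper: rebuild the code units from consecutive byte pairs.
std::u16string WideBytesToString(const std::vector<uint8_t>& In)
{
    assert(In.size() % 2 == 0); // every code unit occupies exactly two bytes
    std::u16string Out;
    Out.reserve(In.size() / 2);
    for (std::size_t i = 0; i + 1 < In.size(); i += 2)
    {
        Out.push_back(static_cast<char16_t>(In[i] | (In[i + 1] << 8)));
    }
    return Out;
}
```

In this sketch “TestString™” takes 22 bytes instead of 11, and the last two bytes are 0x22 0x21, so nothing is lost. Bytes written by the old one-byte format would not decode correctly with it, which is the compatibility problem described above.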

One potential solution to this problem would be to do something similar to what the FString docs mention about serialization. It states that if the TCHAR < 0xff, then it stores a single byte, otherwise, it stores 2-bytes. Updating the two functions in question to work this way might resolve everyone’s problem since you would still be able to use BytesToString on existing saved bytes due to the fact it can handle the bytes as 1-byte or 2-bytes per character. This approach would need approval from the Epic devs though.

StringToBytes and BytesToString do some interesting things as part of their conversion. For instance, when converting to bytes, each character has 1 subtracted from it. On the other end, 1 is added back to each character. That alone makes it seem like you probably don’t want to use those functions if you’re looking for an exact representation of the string in bytes.

Another concern is that these methods don’t define an encoding for the characters in bytes. This is concerning if you wanted to pass the bytes to some other library that accepts an array of bytes that are to have a specific encoding.

You might find help looking into StringConv.h. In there, you will see two helper classes called FTCHARToUTF8 and FUTF8ToTCHAR. I’m not sure how well these are supported, but they appear to work in the following sample code.

void AMyActor::BeginPlay()
{
    Super::BeginPlay();

    FString InString(TEXT("TEST™TEST"));

    // TCHAR -> UTF-8. The converter object owns the converted buffer,
    // so it has to stay alive while Utf8String is in use.
    FTCHARToUTF8 ToUtf8Converter(*InString);
    auto Utf8StringSize = ToUtf8Converter.Length();
    auto Utf8String = ToUtf8Converter.Get();

    // UTF-8 -> TCHAR, again through a converter object.
    FUTF8ToTCHAR ToTCharConverter(Utf8String);
    FString OutString(ToTCharConverter.Get());

    // Show the input, the UTF-8 length, and the round-tripped output.
    GEngine->AddOnScreenDebugMessage(-1, 10.f, FColor::Red, InString);
    GEngine->AddOnScreenDebugMessage(-1, 10.f, FColor::Blue, FString::FromInt(Utf8StringSize));
    GEngine->AddOnScreenDebugMessage(-1, 10.f, FColor::Green, OutString);
}
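
To see what a UTF-8 conversion produces at the byte level, here is a small standalone sketch in standard C++. It does not use the engine converters and is not their implementation. `Ucs2ToUtf8` is a hypothetical helper, `std::u16string` stands in for FString, and it assumes every 16-bit unit is a complete code point (no surrogate pairs), matching the UCS-2 assumption discussed earlier in the thread.

```cpp
#include <cstdint>
#include <string>
#include <vector>

// Hypothetical helper: UTF-8 encode a string whose 16-bit units are each a
// complete code point. Surrogate pairs are NOT handled in this sketch.
std::vector<uint8_t> Ucs2ToUtf8(const std::u16string& In)
{
    std::vector<uint8_t> Out;
    for (char16_t Ch : In)
    {
        const uint32_t Cp = Ch;
        if (Cp < 0x80)
        {
            // ASCII range: one byte, unchanged.
            Out.push_back(static_cast<uint8_t>(Cp));
        }
        else if (Cp < 0x800)
        {
            // Two bytes: 110xxxxx 10xxxxxx
            Out.push_back(static_cast<uint8_t>(0xC0 | (Cp >> 6)));
            Out.push_back(static_cast<uint8_t>(0x80 | (Cp & 0x3F)));
        }
        else
        {
            // Three bytes: 1110xxxx 10xxxxxx 10xxxxxx
            Out.push_back(static_cast<uint8_t>(0xE0 | (Cp >> 12)));
            Out.push_back(static_cast<uint8_t>(0x80 | ((Cp >> 6) & 0x3F)));
            Out.push_back(static_cast<uint8_t>(0x80 | (Cp & 0x3F)));
        }
    }
    return Out;
}
```

For “TEST™TEST” this yields 11 bytes: the eight ASCII letters stay one byte each and ™ (U+2122) becomes the three bytes E2 84 A2. Unlike the StringToBytes output, these bytes have a defined encoding that other libraries can consume.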

The adding and subtracting 1 is to handle null terminators. It’s normal.

I know what the issue is, and I’ve already fixed it locally. I was just reporting the bug so Epic could look into it on their side and figure out what they wanted to do, because there isn’t a great solution that has the functions working the way you expect them to and remains backwards compatible.

Had the same issue with 4.26.2. The Unreal Engine issue report says they will not fix it. (Unreal Engine Issues and Bug Tracker (UE-33889))

I don’t fully understand the implications of the null terminator described in their comment:

// Put the byte into an int16 and add 1 to it, this keeps anything from being put into the string as a null terminator

But anyway I created my own function just removing the +1. So far seems to work ok in my case, where I’m reading a string from an HTTP request.

  inline FString MyBytesToString(const uint8* In, int32 Count)
  {
    FString Result;
    Result.Empty(Count);

    while (Count)
    {
      int16 Value = *In;

      Result += TCHAR(Value);

      ++In;
      Count--;
    }
    return Result;
  }
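
On the null terminator comment quoted above, one reading is that the offset lets arbitrary bytes, including 0x00, survive a trip through a null-terminated string. The standalone sketch below illustrates that reading in standard C++. It is not engine code: all three function names are hypothetical, `std::u16string` stands in for FString, and how FString itself treats an appended null character is not covered here.

```cpp
#include <cstdint>
#include <string>
#include <vector>

// Hypothetical stand-in for BytesToString: widen each byte and add 1, so
// the byte 0x00 becomes code unit 0x0001 instead of a terminator.
std::u16string OffsetBytesToString(const std::vector<uint8_t>& In)
{
    std::u16string Out;
    for (uint8_t Byte : In)
    {
        Out.push_back(static_cast<char16_t>(Byte + 1));
    }
    return Out;
}

// Hypothetical stand-in for StringToBytes: subtract 1 and keep the low 8 bits.
std::vector<uint8_t> OffsetStringToBytes(const std::u16string& In)
{
    std::vector<uint8_t> Out;
    for (char16_t Ch : In)
    {
        Out.push_back(static_cast<uint8_t>(Ch - 1));
    }
    return Out;
}

// Same as OffsetBytesToString but without the +1, like the function above.
std::u16string RawBytesToString(const std::vector<uint8_t>& In)
{
    std::u16string Out;
    for (uint8_t Byte : In)
    {
        Out.push_back(static_cast<char16_t>(Byte));
    }
    return Out;
}
```

With the offset, bytes 0 to 255 map to code units 1 to 256, so no unit is ever zero and every byte value round-trips. Without it, the bytes 0x41 0x00 0x42 produce a string that any null-terminated reader sees as just “A”. That may be why dropping the +1 works for text from an HTTP request, which normally contains no zero bytes, but would be risky for binary data.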

Do you have a fix for StringToBytes()?

Hello mrooney,
Could you please explain why we need the -1 and +1 when converting between String and Bytes?
Thank you very much for your sharing.
