Split text into words

What’s the best way to split sentences into an array of words? I searched through the code for FText and FString, and the closest thing I found would be to use ParseIntoArray, but that has a warning that says to use it for short text or not at all, which is bad since my sentences could easily be dozens of words long.

Is there a better way or should I just take a chance with ParseIntoArray?


Difficult to imagine nobody trying to help for 7 years xD

Anyways, with my still limited experience with C++ in Unreal, I would do as I normally would with C++:

Think of the most efficient solution possible

So naturally, if we can avoid creating new strings for each word (i.e. limit the number of dynamic memory allocations), we can speed it up quite a bit. Unreal gives us an equivalent of std::string_view, TStringView, which we can use to point to individual words inside the string without creating new strings and copying all the characters over. The longer the input string, the larger the performance gain. The algorithm is a bit like this:

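  1. Ensure the input string outlives the views! A string view does not own its characters, so it dangles if the FString is destroyed or modified while we still use the views. (FString is a plain value type, not a garbage-collected object, so holding a local copy is about lifetime, not GC.)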
  2. Create an array to hold the string views: TArray<TStringView<TCHAR>> words;
  3. Iterate through the string and create a new string view each time the delimiting character (e.g. space) is encountered.

Regarding 3., we can create a string view from a pointer to the first character and a length:
TStringView<TCHAR> myView(const TCHAR* data, int32 size). To get that pointer, we can create an iterator from the string, loop through the characters, and take the address of the current one:

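FString input;
for (auto it = input.begin(); it != input.end(); ++it)
{
    if (*it == TEXT(' ')) {
        // Address of the current character inside the string's own buffer.
        const TCHAR* ptr = &*it;
        // Placeholder length of 5: a real splitter must compute the word length and never read past the end of the string.
        TStringView<TCHAR> v(ptr, 5);
    }
}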

There are (quite) a few gotchas that we should keep in mind:

  • we cannot use a string iterator in a range-based for loop
  • a string view is not guaranteed to be null-terminated, so we must not hand its data pointer to anything that expects a C string
  • wchar_t is 16 bits (UTF-16) on Windows but typically 32 bits (UTF-32) on other platforms, so we should use Unreal's TCHAR rather than raw wchar_t and check that we do stuff in a portable manner. (super annoying topic, I blame Microsoft :) )
  • We should keep in mind that there may be an arbitrary amount of whitespace, at both ends of the input string and in between words.
  • We should probably read the C++ spec on string views so we can use them optimally
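
To make the steps and gotchas above concrete, here is a minimal sketch in standard C++, using std::string_view as a stand-in for TStringView. The function name SplitWords and the single-character delimiter are my own assumptions, not anything from the engine, and I have not ported or tested this against Unreal types. It skips leading, trailing, and repeated delimiters, and it never assumes a view is null-terminated.

```cpp
#include <cassert>
#include <cstddef>
#include <string_view>
#include <vector>

// Hypothetical helper: returns views into `input`, one per word.
// No characters are copied, so the caller must keep the underlying
// buffer alive (and unmodified) for as long as the views are used.
std::vector<std::string_view> SplitWords(std::string_view input, char delimiter = ' ')
{
    std::vector<std::string_view> words;
    std::size_t wordStart = 0;

    // Walk one position past the end so the final word is flushed too.
    for (std::size_t i = 0; i <= input.size(); ++i)
    {
        const bool atEnd = (i == input.size());
        if (atEnd || input[i] == delimiter)
        {
            // Only emit non-empty spans; this skips runs of delimiters
            // as well as delimiters at either end of the input.
            if (i > wordStart)
            {
                words.push_back(input.substr(wordStart, i - wordStart));
            }
            wordStart = i + 1;
        }
    }
    return words;
}
```

The whole thing is a single pass over the input with one growing array, so the only allocations are for the array of views itself. An Unreal version would presumably swap in FStringView and TArray, but treat that as an untested assumption.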

Hopefully this will help someone else looking for a solution to this problem.
I might update / extend this post when I have created and tested a stable solution.


Side note: O(n^2) sounds really scary! But for strings shorter than 200 characters (or thereabouts), it isn’t that big of a deal. But if we do this many times per second, we might start to notice :) .

I have doubts that ParseIntoArray is going to give you a huge problem, and I’m not sure where that warning you’re talking about is.

You could also use RegEx, which depending on what you’re actually trying to work with and achieve, might be the better thing to do.
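
For what it's worth, here is a rough sketch of the regex idea using std::regex from standard C++ rather than Unreal's own regex classes. The function name SplitWordsRegex and the pattern \S+ (one or more non-whitespace characters) are assumptions chosen for illustration, and this version copies each word into its own string.

```cpp
#include <cassert>
#include <regex>
#include <string>
#include <vector>

// Hypothetical helper: collects every run of non-whitespace characters.
std::vector<std::string> SplitWordsRegex(const std::string& input)
{
    // Compiled once and reused across calls; compiling a regex is
    // comparatively expensive.
    static const std::regex wordPattern(R"(\S+)");

    std::vector<std::string> words;
    const std::sregex_iterator end;
    for (std::sregex_iterator it(input.begin(), input.end(), wordPattern); it != end; ++it)
    {
        words.push_back(it->str()); // copies the matched word
    }
    return words;
}
```

This handles tabs and newlines between words for free, but it is likely slower than a hand-written loop, so it mostly pays off when the definition of a "word" is more complicated than "stuff between spaces".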