GameMaker: filtering strings

Alternatively titled "string_letters but for any kind of character".

The idea

This one's simple enough: for every character of the string, we append it to the output if it's in the allowed character set.

While we're here, we might also add an option to invert the filter, keeping only the characters that are not in the set.

The code (simple)

So an implementation might look as simple as:

function string_filter_with_string(_str, _chars, _exclude = false) {
    var _result = "";
    var _len = string_length(_str);
    for (var i = 1; i <= _len; i++) {
        var c = string_char_at(_str, i);
        if (_exclude) {
            if (string_pos(c, _chars) == 0) _result += c;
        } else {
            if (string_pos(c, _chars) != 0) _result += c;
        }
    }
    return _result;
}

This works perfectly fine for small strings / small sets, but gets slower for bigger inputs. After all, strings are immutable in GameMaker, so str += str generally re-creates the string, and string_pos has to search through the set character-by-character on every call.

Still, you can do

var _ua_alpha = "АБВГҐДЕЄЖЗИІЇЙКЛМНОПРСТУФХЦЧШЩЬЮЯ"+
    "абвгґдеєжзиіїйклмнопрстуфхцчшщьюя";
var _str = "fgїkц4з";
show_debug_message("Source string: " + _str);
show_debug_message("Ukrainian glyphs only: " + string_filter_with_string(_str, _ua_alpha));
show_debug_message("non-Ukrainian glyphs only: " + string_filter_with_string(_str, _ua_alpha, true));

and get

Source string: fgїkц4з
Ukrainian glyphs only: їцз
non-Ukrainian glyphs only: fgk4

The code (fancy)

To have the code work with bigger inputs, we'll change a few things:

  • To avoid scanning a string of allowed characters for every input character, we can take a map of allowed character codes and do a single lookup instead.
  • To avoid re-allocating a string a bunch of times, we can use a buffer that we can prepare to be just the right size (an output string cannot be bigger than input string).
  • To avoid creating/keeping a bunch of single-character strings, we can work with character codes instead of character strings.
    Currently, the fastest way to read a string charcode-by-charcode in GM is to put it in a buffer.

And thus, we have the following:

function string_filter_with_map(_str, _map, _exclude = false) {
    static _buf = buffer_create(1024, buffer_grow, 1);
    static _out = buffer_create(1024, buffer_grow, 1);
    
    // write the string to an input buffer:
    buffer_seek(_buf, buffer_seek_start, 0);
    buffer_write(_buf, buffer_string, _str);
    buffer_write(_buf, buffer_u32, 0); // add a few more bytes at the end just in case
    
    // prepare the output buffer:
    var _len = buffer_tell(_buf);
    if (buffer_get_size(_out) < _len) buffer_resize(_out, _len);
    var _out_pos = 0;
    
    buffer_seek(_buf, buffer_seek_start, 0);
    var b0, b1, b2, b3;
    var _print = false, _start = 0;
    do {
        b0 = buffer_read(_buf, buffer_u8);
        
        // read a UTF-8 character from a buffer (up to 4 bytes!):
        var c, n;
        if (b0 < 0x80) {
            c = b0;
            n = 1;
        } else if (b0 < 0xE0) {
            b1 = buffer_read(_buf, buffer_u8);
            c = ((b0 & 0x1F) << 6) | (b1 & 0x3F);
            n = 2;
        } else if (b0 < 0xF0) {
            b1 = buffer_read(_buf, buffer_u8);
            b2 = buffer_read(_buf, buffer_u8);
            c = ((b0 & 0x0F) << 12) | ((b1 & 0x3F) << 6) | ((b2 & 0x3F));
            n = 3;
        } else {
            b1 = buffer_read(_buf, buffer_u8);
            b2 = buffer_read(_buf, buffer_u8);
            b3 = buffer_read(_buf, buffer_u8);
            c = ((b0 & 0x07) << 18) | ((b1 & 0x3F) << 12) | ((b2 & 0x3F) << 6) | ((b3 & 0x3F));
            n = 4;
        }
        
        // a character is visible if it's not 0 (end of string) and it passes the filter
        // (in the set normally, not in the set when _exclude is true)
        var _was_print = _print;
        _print = c > 0 && ds_map_exists(_map, c) != _exclude;
        if (_print != _was_print) { // beginning/end of a visible section
            if (_print) {
                // beginning of a visible section
                _start = buffer_tell(_buf) - n;
            } else {
                // end of a visible section
                var _count = buffer_tell(_buf) - n - _start;
                if (_count > 0) {
                    buffer_copy(_buf, _start, _count, _out, _out_pos);
                    _out_pos += _count;
                }
            }
        }
    } until (b0 == 0);
    buffer_poke(_out, _out_pos, buffer_u8, 0);
    buffer_seek(_out, buffer_seek_start, 0);
    return buffer_read(_out, buffer_string);
}

So roughly half of the code here is responsible for converting 1..4 UTF-8 bytes into a charcode, but otherwise things are pretty normal - we write down where "visible" (allowed by filter) sequences of characters start, and copy them to the output buffer when they end.
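
If the decoding half gets in the way of reading the rest, it can be pulled out into a helper. This is only a sketch: buffer_read_utf8_char is a made-up name (not a built-in), it does the same bit-shuffling as the branches above, and like them it does not validate malformed input. The extra function call per character will likely cost some of the speed, which is a good reason to keep it inline in the real thing.

```gml
/// Hypothetical helper (not a built-in function).
/// Reads one UTF-8 character at the buffer's current position,
/// advances the position past it, and returns the character code.
function buffer_read_utf8_char(_buf) {
    var b0 = buffer_read(_buf, buffer_u8);
    if (b0 < 0x80) return b0; // 1 byte: 0xxxxxxx
    var b1 = buffer_read(_buf, buffer_u8);
    if (b0 < 0xE0) { // 2 bytes: 110xxxxx 10xxxxxx
        return ((b0 & 0x1F) << 6) | (b1 & 0x3F);
    }
    var b2 = buffer_read(_buf, buffer_u8);
    if (b0 < 0xF0) { // 3 bytes: 1110xxxx 10xxxxxx 10xxxxxx
        return ((b0 & 0x0F) << 12) | ((b1 & 0x3F) << 6) | (b2 & 0x3F);
    }
    // 4 bytes: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
    var b3 = buffer_read(_buf, buffer_u8);
    return ((b0 & 0x07) << 18) | ((b1 & 0x3F) << 12) | ((b2 & 0x3F) << 6) | (b3 & 0x3F);
}

// usage: "ї" is U+0457, stored in UTF-8 as the two bytes D1 97
var _b = buffer_create(16, buffer_grow, 1);
buffer_write(_b, buffer_string, "ї");
buffer_seek(_b, buffer_seek_start, 0);
var _code = buffer_read_utf8_char(_b);
show_debug_message(string(_code)); // 1111, same as ord("ї")
show_debug_message(string(buffer_tell(_b))); // 2 - the number of bytes consumed
buffer_delete(_b);
```

The byte count that the filter keeps in n is not returned here; comparing buffer_tell before and after the call gives the same number.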

And here's a function that creates a map of character codes from a string for convenience:

function string_filter_create_map_from_string(_string) {
    var _map = ds_map_create();
    var _len = string_length(_string);
    for (var i = 1; i <= _len; i++) {
        _map[? string_ord_at(_string, i)] = true;
    }
    return _map;
}

If we're to test it on the same string as above,

var _ua_alpha = "АБВГҐДЕЄЖЗИІЇЙКЛМНОПРСТУФХЦЧШЩЬЮЯ"+
    "абвгґдеєжзиіїйклмнопрстуфхцчшщьюя";
var _str = "fgїkц4з";
var _ua_map = string_filter_create_map_from_string(_ua_alpha);
show_debug_message("Source string: " + _str);
show_debug_message("Ukrainian glyphs only: " + string_filter_with_map(_str, _ua_map));
show_debug_message("non-Ukrainian glyphs only: " + string_filter_with_map(_str, _ua_map, true));
ds_map_destroy(_ua_map);

it'll yield the same result:

Source string: fgїkц4з
Ukrainian glyphs only: їцз
non-Ukrainian glyphs only: fgk4

And now to performance:

On VM, this performs about the same as the simple function for small strings, and gradually pulls ahead as the string/filter gets bigger (~3x faster for the aforementioned filter and a 500-character string).

But if you build the game for YYC, things are better right away - ~4x on smaller strings and ~10x for a 500-character string.
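
If you'd like to check this on your own machine, a comparison along these lines should do. It is a rough sketch rather than the exact benchmark behind the numbers above: the iteration count and the test string are arbitrary picks, and it assumes that both filter functions and string_filter_create_map_from_string from this post are in the project.

```gml
var _ua_alpha = "АБВГҐДЕЄЖЗИІЇЙКЛМНОПРСТУФХЦЧШЩЬЮЯ"+
    "абвгґдеєжзиіїйклмнопрстуфхцчшщьюя";
var _ua_map = string_filter_create_map_from_string(_ua_alpha);
var _str = string_repeat("fgїkц4з", 72); // 7 * 72 = 504 characters
var _iterations = 1000; // arbitrary; raise it if the timings are too noisy

// get_timer() returns microseconds since the game started
var _t = get_timer();
repeat (_iterations) string_filter_with_string(_str, _ua_alpha);
var _simple_us = get_timer() - _t;

_t = get_timer();
repeat (_iterations) string_filter_with_map(_str, _ua_map);
var _fancy_us = get_timer() - _t;

show_debug_message("simple: " + string(_simple_us / 1000) + "ms");
show_debug_message("fancy: " + string(_fancy_us / 1000) + "ms");
ds_map_destroy(_ua_map);
```

Run it on both VM and YYC, since the gap between the two functions differs quite a bit between the runtimes.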

Conclusions

A function to iterate a string's character codes (akin to string_foreach) and/or to read a UTF-8 character from a buffer could be neat.
