Branchless UTF-8 Encoding

183 points by vortex_ape 9 months ago

comex 9 months ago

Incidentally, this automatic branch-if-zero from LLVM is being improved.

First of all, a recent LLVM patch apparently changes codegen to use CMOV instead of a branch:

https://github.com/llvm/llvm-project/pull/102885

Beyond that, Intel recently updated their manual to retroactively define the behavior of BSR/BSF on zero inputs: it leaves the destination register unmodified. This matches the AMD manual, and I suspect it matches the behavior of all existing x86-64 processors (but that will need to be tested, I guess).

If so, you don't need either a branch or CMOV. Just set a register to 32, then run BSR with the same register as destination. If the BSR input is nonzero, the 32 is overwritten with the trailing-zero count. If the BSR input is zero, then BSR leaves the register unmodified and you get 32.

Since this behavior is now guaranteed for future x86-64 processors, and assuming it's indeed compatible with all existing x86-64 processors (maybe even all x86 processors period?), LLVM will no longer need the old path regardless of what it's targeting.

Note that if you're targeting a newer x86-64 version, LLVM will just emit TZCNT, which just does what you'd expect and returns 32 if the input is zero (or 64 for a 64-bit TZCNT). But as the blog post demonstrates, many people still build for baseline x86_64.

(Intel does document one discrepancy between processors: "On some older processors, use of a 32-bit operand size may clear the upper 32 bits of a 64-bit destination while leaving the lower 32 bits unmodified.")

charleslmunger 9 months ago

>Beyond that, Intel recently updated their manual to retroactively define the behavior of BSR/BSF on zero inputs: it leaves the destination register unmodified.
This is very nice of them to do, but I found while optimizing a routine in protobuf that BSR is dramatically slower on AMD CPUs than LZCNT, and so I never want to use it again - it's pretty rare to have a function using BSR that can't use CLZ instead, and CLZ is faster on arm, AMD, and equivalent on Intel since haswell.
I believe there is also some errata where on some processors Intel LZCNT had a false dependency for the output register as an input, probably because of this BSR behavior, but compilers will insert a self-xor in loops where that carried dependency would matter.
hinkley 9 months ago

I was watching a video ranting about bad benchmarks yesterday and in an aside they pointed out the (gcc) generated code used Conditional Move (cmov) in several places to handle and if/else if in the code with no branches.
I think the days of trying to branches by trying to remove conditional assignments are either gone or close to it. You may still have a subsequent data race, but the conditional assignment isn't your biggest problem with throughput.
- achierius 9 months ago
  
  What makes you say that? I've seen several cases where an over-usage of branchless programming actually slowed things down. Especially once you get past 2 nested conditionals (so 4+ pathways) you do just end up executing a lot of ultimately-unused code. In fact this has been going the other direction, in some ways, for a little while now: people overestimate how much branches cost, particularly small, local, and easy-to-predict ones.

orlp 9 months ago

If you have access to the BMI2 instruction set I can do branchless UTF-8 encoding like in the article using only 9 instructions and 73 bytes of lookup tables:

    branchless_utf8:
        mov     rax, rdi
        lzcnt   ecx, esi
        lea     rdx, [rip + .L__unnamed_1]
        movzx   ecx, byte ptr [rcx + rdx]
        lea     rdx, [rip + example::DEP_AND_OR::h78cbe1dc7fe823a9]
        pdep    esi, esi, dword ptr [rdx + 8*rcx]
        or      esi, dword ptr [rdx + 8*rcx + 4]
        movbe   dword ptr [rdi], esi
        mov     qword ptr [rdi + 8], rcx
        ret

The code:

    static DEP_AND_OR: [(u32, u32); 5] = [
        (0, 0),
        (0b01111111_00000000_00000000_00000000, 0b00000000_00000000_00000000_00000000),
        (0b00011111_00111111_00000000_00000000, 0b11000000_10000000_00000000_00000000),
        (0b00001111_00111111_00111111_00000000, 0b11100000_10000000_10000000_00000000),
        (0b00000111_00111111_00111111_00111111, 0b11110000_10000000_10000000_10000000),
    ];

    const LEN: [u8; 33] = [
        // 0-10 leading zeros: not valid.
        0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
        // 11-15 leading zeros: 4 bytes.
        4, 4, 4, 4, 4,
        // 16-20 leading zeros: 3 bytes.
        3, 3, 3, 3, 3,
        // 21-24 leading zeros: 2 bytes.
        2, 2, 2, 2,
        // 25-32 leading zeros: 1 byte.
        1, 1, 1, 1, 1, 1, 1, 1,
    ];

    pub unsafe fn branchless_utf8(codepoint: u32) -> ([u8; 4], usize) {
        let leading_zeros = codepoint.leading_zeros() as usize;
        let bytes = LEN[leading_zeros] as usize;
        let (mask, or) = *DEP_AND_OR.get_unchecked(bytes);
        let ret = core::arch::x86_64::_pdep_u32(codepoint, mask) | or;
        (ret.swap_bytes().to_le_bytes(), bytes)
    }

adrian_b 9 months ago

While Intel has LZCNT and TZCNT (leading-zero count and trailing-zero count), which replace the wrongly-defined BSR and BSF, only since Haswell (June 2013), AMD has LZCNT since Barcelona (September 2007) and TZCNT since Piledriver (May 2012).
The author has made the mistake of not using the right compilation options for the CPU, in order to enable the use of LZCNT and TZCNT, because it is very likely that the author uses a CPU that supports these instructions, unless it is an older Intel Atom CPU, up to Tremont.
Had the author compiled correctly the program, there should not have been any branches since the beginning.
When Intel has added the BSF and BSR instructions in 1985 to 80386, they have made a very serious mistake in their definition, despite the fact that they should have followed the example of much older ISAs, where these instructions were defined correctly.
AMD has defined LZCNT and TZCNT in order to correct Intel's mistake, but in order to ensure backward compatibility, the corrected instructions use an additional prefix that is ignored by older CPUs, instead of using a new encoding. This makes the encoding of these instructions much longer than it should be.
- hn3er1q 9 months ago
  
  Interesting. What compiler options would you have used? Do you know if the options are applicable for ARM as well?
  - adrian_b 9 months ago
    
    "-march=...", e.g. "-march=skylake" or "-march=znver3" or whatever CPU you are using.
    When you do not know the correct type, and you do not cross-compile, you can just use "-march=native".
    I recommend to never use either gcc or clang without giving an explicit "-march=..." option. Otherwise you do not know which is the default compilation target and it is almost certain that the quality of the generated code will be bad. Even for code that will be distributed in binary form for multiple computers, you must choose consciously a compilation target that is the minimum that should be supported and you must not depend on a random default choice that may change at each compiler update.
    For ARM, there are similar options for specifying the CPU model, e.g. "-march=armv8.2-a". However there are much fewer models and counting leading zeroes has been supported for about 20 years, so even the default ARM CPU model must include it.
    On ARM only the compiler options for specifying the floating-point instruction subset can be tricky (an example of such an option: "-mfpu=fpv4-sp-d16"), when you are targetting embedded computers or microcontrollers, because there you can still encounter a lot of older ARM cores that have a much more restricted floating-point support than in more recent models. Moreover, some vendors of ARM-based devices may choose to reduce the cost by disabling some floating-point features, so knowing the name of an ARM core may be not enough for the selection of the compiler options, you must also know whether the optional core synthesis options have been enabled or disabled.

deathanatos 9 months ago

  /// Encode a UTF-8 codepoint.
  /// […]
  /// Returns a length of zero for invalid codepoints (surrogates and out-of-bounds values);
  /// it's up to the caller to turn that into U+FFFD, or return an error.

It's not a "UTF-8 codepoint", that's horridly mangling the terminology. Code points are just code points.

The input to a UTF-8 encode is a scalar value, not a code point, and encoding a scalar value is infallible. What doubly kills me is that Rust has a dedicated type for scalar values. (`char`.)

(In languages with non-[USV]-strings…, Python raises an exception, JS emits garbage.)

xeeeeeeeeeeenu 9 months ago

> So on x86_64 processors, we have to branch to say “a 32-bit zero value has 32 leading zeros”.

Not if you're targeting x86-64-v3 or higher. Haswell (Intel) and Piledriver (AMD) introduced the LZCNT instruction that doesn't have this problem.

sltkr 9 months ago

You can also very trivially do (codepoint | 1).leading_zeros(), then you can also shave one byte off the LEN table. (This doesn't affect the result because LEN[32] == LEN[33] == 1).
pklausler 9 months ago

Easy to count leading zeroes in a branch-free manner without a hardware instruction using a conditional move and a de Bruijn sequence; see https://github.com/llvm/llvm-project/blob/main/flang/include... .
- hinkley 9 months ago
  x |= x >> 1; x |= x >> 2; x |= x >> 4; x |= x >> 8; x |= x >> 16; x |= x >> 32;
  Isn't there another way to do this without so many data races?
  I feel like this should be
  x |= x >> 1 | x >> ??? ...
  - gpderetta 9 months ago
    
    By data races I assume you actually mean data dependencies?
    
    hinkley 9 months ago
    
    Oops. Yes.
adrian_b 9 months ago

Piledriver (2012) has introduced TZCNT. LZCNT had already been supported by all AMD CPUs since Barcelona (2007).
Moreover, LZCNT is the more important instruction, because it can replace TZCNT with the addition of a couple of instructions, without using any branches, even in the more rare cases when TZCNT is desired instead of LZCNT. Some older software continues to use the "ffs" gcc intrinsic, which corresponds to TZCNT, even if it is trivial to rewrite it to use LZCNT instead. In old gcc, "ffs" was available instead of the more useful "LZCNT", because the ancient DEC VAX computers included a "FFS" (find first set bit, starting from the LSB) instruction and the gcc "ffs" mapped directly to the DEC VAX instruction.
Haswell (2013) has added both LZCNT and TZCNT, being the first Intel CPU which supports them.
Unfortunately, even if Haswell is more than a decade old, the older Intel Atom CPUs until Tremont did not provide support for these instructions, so they have become ubiquitous only since 2021, even if all non-Atom CPUs have supported them for more than a decade (up to 18 years for AMD).

koala_man 9 months ago

I'm surprised there are no UTF-8 specific decode instructions yet, the way ARM has "FJCVTZS - Floating-point Javascript Convert to Signed fixed-point, rounding toward Zero"

jsheard 9 months ago

FJCVTZS isn't really as specific to Javascript as the name suggests, it actually copies the semantics of an x86 instruction, which JS took its semantics from.
JS runtimes do use it but it's useful anywhere you need to do what x86 does, which obviously includes running x86 binaries under emulation.
HeliumHydride 9 months ago

https://developer.arm.com/documentation/ddi0602/2024-12/SIMD...

Arnavion 9 months ago

>So on x86_64 processors, we have to branch to say “a 32-bit zero value has 32 leading zeros”. Put differently, the “count leading zeros” intrinsic isn’t necessarily a branchless instruction. This might look nicer on another architecture!

Yes, RISC-V for example defines the instructions for counting leading / trailing zeros (clz, clzw, ctz, ctzw) such that an N-bit zero value has N of them.

I don't know if I can show it on Rust Godbolt because none of the default RISC-V targets that Rust has support the Zbb extension, but I checked with a custom target that I use locally for my emulator, and `leading_zeros()` indeed compiles to just one `clz` without any further branches. Here's a C demonstration though: https://gcc.godbolt.org/z/cKx3ajsjh

adrian_b 9 months ago

Due to AMD, the x86-64 ISA has been corrected several years before the birth of RISC-V (in 2007).
The 32-bit ARM ISA also had CLZ many years before RISC-V.
IBM POWER had the correct instruction since 1990.
The mnemonic LZCNT comes from Cray-1 (1976), but similar instructions had already existed even in some of the first computers with vacuum tubes.
Few computer ISAs except 80386 (where these had mistaken definitions, while also using mnemonics not encountered elsewhere: BSF and BSR) have included both LZCNT and TZCNT, but many have included at least 1 of them.
LZCNT has been much more widespread, because it is more useful (allowing both the computation of the integer part of the base 2 logarithm of an integer number and of TZCNT, when that is desired), but DEC VAX has been one of the few that had only the equivalent of TZCNT (though VAX also had "FFC", find first clear bit, i.e. first zero bit after a string of "1", starting from the LSB).

purplesyringa 9 months ago

Instead of

    let surrogate_mask = surrogate_bit << 2 | surrogate_bit << 1 | surrogate_bit;
    len &= !surrogate_mask;

consider

    len &= surrogate_bit.wrapping_sub(1);

This should still work better. Alternatively, invert the conditions and do

    len &= non_surrogate_bit.wrapping_neg();

emilfihlman 9 months ago

I mean, isn't the trivial answer to just collapse the if else tree into just math that's evaluated always?

  u32 a = (code <= 0x7F);
  u32 b = (code <= 0x07FF);
  u32 c = ((code < 0xD800) || 
  (0xDFFF < code));
  u32 d = (code <= 0xFFFF) * c;
  u32 e = (code <= 0x10FFFF);
  u32 v = (c && e);
  return(-1 * !v + v * (4 - a - b - d));

Highly likely easy to optimise.

emilfihlman 9 months ago

Can't edit anymore but a better (only slightly) and clearer version is

  u32 length = (code <= 0x7F) + (code <= 0x07FF) + (code <= 0xFFFF);
  u32 limit = (code <= 0x10FFFF);
  u32 allowed = ((code < 0xD800) || 
  (0xDFFF < code));
  u32 valid = (limit && allowed);
  return(-1 * !valid + valid * (4 - length));

Validark 9 months ago

I wrote this stub a while back

https://gist.github.com/Validark/457b6db8aa00ded26a6681d4d25...

marxisttemp 9 months ago

I love weird little tricks with popcnt/leading/trailing zero instructions.

I recently had a lot of fun getting Swift’s OptionSet bitset interface to iterate over active members.

(Unfortunately, because of weird specifics of Swift protocol associated types, I wasn’t able to actually conform OptionSet to Collection like I wanted to originally. I find it amusing that one of the first examples in the official documentation for Swift macros is to make OptionSet used an associated enum like it should.)

lxgr 9 months ago

> Compiler explorer confirms that, with optimizations enabled, this function is branchless.

Only if you don't consider conditional move instructions branching/cheating :)

RenThraysk 9 months ago

Or-ing 1 onto codepoint before calling leading_zeroes() should get a decent compiler to remove the branch.

Dwedit 9 months ago

Wouldn't branchless UTF-8 encoding always write 3 bytes to RAM for every character (possibly to the same address)?

ack_complete 9 months ago

CPUs are surprisingly good at dealing with this in their store queues. I see this write-all-and-increment-some technique used a lot in optimized code, like branchless left-pack routines or overcopying in the copy handler of an LZ/Deflate decompressor.
- atq2119 9 months ago
  
  Yep, same with overlapping unaligned loads. It's just fairly cheap to make that stuff pipelined and run fast. It's only when you mix loads and stores in the same memory region that there are conflicts that can slow you down (and then quite horribly actually, depending on the exact processor).
  - Sesse__ 9 months ago
    
    The place where I see this really hurts goes when Clang/LLVM gets too fancy, in situations like this:
    - Function A calls function B, which returns some struct S (for instance on the stack). - B writes S by individual (small) stores. - A wants to copy S from some place to another (e.g. store it in some other struct). - LLVM coalesces the individual loads/stores needed to copy S, into one or a series of large operations (e.g. 128-bit SSE2 loads+stores). - These large loads are issued while the small stores from B are still pending, and necessarily overlap them.
    Boom, store-to-load forwarding failure, and a bad stall. E.g., the Zen series seem to be really bad at this (only tried up to Zen 3), but there are pretty much no out-of-order CPUs that handle this without some kind of penalty.
    
    ack_complete 9 months ago
    
    This happens with partial autovectorization, too. Compiler fails to vectorize a first loop and then vectorizes the second, result is a store forwarding failure at the start of the second loop trying to read the output of the first loop, erasing the vectorization gains.
ngoldbaum 9 months ago

You could do two passes over the string, first get the total length in bytes, then fill it in codepoint by codepoint.
You could also pessimistically over-allocate assuming four bytes per character and then resize afterwards.
With the API in the linked blog post it's up to the user to decide how they want to use the output [u8;4] array.

sylware 9 months ago

CPU hardware ISA can load tables into regs (RISC-V for instance), then on sufficently big data under UTF-8 processing, you get branchless without cache memory request.

ThatGuyRaion 9 months ago

So is this potentially performance improving?.

not2b 9 months ago

Usually people are interested in branchless implementations for cryptography applications, to avoid timing side channels (though you then have to make sure that the generated instructions don't have different timing for different input values), and will pay some time penalty if they have to.
PhilipRoman 9 months ago

Last time I tested branchless UTF-8 algorithms, I came to the conclusion that they only perform [slightly] better for text consisting of foreign multibyte characters. Unless you expect lots of such inputs on the hot path, just go with traditional algorithms instead. Even in the worst case the difference isn't that big.
Sometimes people fail to appreciate how insanely fast a predictable branch really is.
- dbcurtis 9 months ago
  
  Pretty much. A strongly predicted branch is as fast as straight-line code, for most practical purposes (in modern processors). It is the mis-predicted branch that causes a pipeline flush and a re-fetch and so forth. The whole point of instructions like CMOV is to replace "flakey" branches with a CMOV so that you can execute both code paths and the test condition path all in parallel and grab the right answer at the end. This avoids paying the mis-predict penalty, and gives more time to compute the test condition, which for a branch is almost always only available awkwardly late in the pipeline. So as long as the compiler can do a decent job of identifying "flakey" branches up front for replacement with CMOV, it is a win. And many branches are easy for the compiler to classify. For instance -- if(SomeRareExceptionCondition) handle_exception(); -- for bonus points, move the exception handling code way the heck out to a different text page so that it isn't hanging around taking up I-cache space for no good reason.
- Laiho 9 months ago
  fn validate_ascii(bytes: &[u8]) -> bool{ bytes.iter().fold(true, |acc, b| acc & (\*b <= 127)) }
  This check will likely be the best for english text/code. You can check in varying size chunks depending on how common you think non-ascii will be. If its ascii you can move 128 bytes forward on avx2 in a couple of cycles.

decafbad 9 months ago

Checkout Erlang bit parsing.

SALCKIN 9 months ago

Reply

jan_haker 9 months ago

[flagged]