Improve utf8LengthByLeader #630

Open

meooow25 opened this issue Mar 27, 2025 · 8 comments · May be fixed by #635
@meooow25
Contributor

utf8LengthByLeader is widely used: it appears in most functions that perform iteration, since stream, iter, etc. rely on it. The current definition is:

-- This is a branchless version of
-- utf8LengthByLeader w
--   | w < 0x80 = 1
--   | w < 0xE0 = 2
--   | w < 0xF0 = 3
--   | otherwise = 4
--
-- c `xor` I# (c# <=# 0#) is a branchless equivalent of c `max` 1.
-- It is crucial to write c# <=# 0# and not c# ==# 0#, otherwise
-- GHC is tempted to "optimize" by introduction of branches.
-- | @since 2.0
utf8LengthByLeader :: Word8 -> Int
utf8LengthByLeader w = c `xor` I# (c# <=# 0#)
  where
    !c@(I# c#) = countLeadingZeros (complement w)
{-# INLINE utf8LengthByLeader #-}
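To spell out why the clz trick works (a standalone sketch using plain Data.Bits, not the library code): countLeadingZeros (complement w) counts the leading 1 bits of the leader byte, which is exactly the sequence length for multi-byte leaders and 0 for ASCII; a `max 1` then clamps the ASCII case, which is what the xor expression above computes branchlessly:

```haskell
import Data.Bits (complement, countLeadingZeros)
import Data.Word (Word8)

-- Count the leading 1 bits of the leader byte: 0 for ASCII, 2/3/4 for
-- multi-byte leaders. `max 1` plays the role of the branchless xor.
lengthByClz :: Word8 -> Int
lengthByClz w = max 1 (countLeadingZeros (complement w))

main :: IO ()
main = print (map lengthByClz [0x41, 0xC3, 0xE2, 0xF0])
-- 'A', then leaders of 2-, 3-, 4-byte sequences -> [1,2,3,4]
```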

However,

  1. In most cases where this is used, one needs to read the next Char. This is done by branching on the result and checking whether it's 1, 2, 3, or something else (4). This means the work utf8LengthByLeader does to avoid branches is wasted! One can simply branch on the leading byte, as shown in the comment above.
  2. The definition does not seem to be very efficient, at least on my machine. I have a couple of alternative branchless versions that are consistently faster.

Here are the implementations:

utf8LengthByLeader_branching :: Word8 -> Int
utf8LengthByLeader_branching w
  | w < 0x80  = 1
  | w < 0xE0  = 2
  | w < 0xF0  = 3
  | otherwise = 4
{-# INLINE utf8LengthByLeader_branching #-}

utf8LengthByLeader_simple :: Word8 -> Int
utf8LengthByLeader_simple (W8# w8#) =
  let w# = word8ToWord# w8#
  in   1
     + I# (w# `geWord#` 0x80##)
     + I# (w# `geWord#` 0xE0##)
     + I# (w# `geWord#` 0xF0##)
{-# INLINE utf8LengthByLeader_simple #-}

utf8LengthByLeader_lookup :: Word8 -> Int
utf8LengthByLeader_lookup (W8# w8#) =
  I# (int8ToInt# (indexInt8OffAddr# a# (word2Int# (uncheckedShiftRL# (word8ToWord# w8#) 4#))))
  where
    a# = "\1\1\1\1\1\1\1\1\1\1\1\1\2\2\3\4"#
    -- 0..7 -> 1
    -- 8..B -> invalid leading byte
    -- C..D -> 2
    -- E    -> 3
    -- F    -> 4
{-# INLINE utf8LengthByLeader_lookup #-}
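For what it's worth, the three candidates agree on every valid leader byte; they only differ on bytes that can never start a well-formed sequence (e.g. 0x80–0xBF). A quick sanity check, using plain Data.Bits/Data.Char stand-ins for the primop versions above (a sketch, not the benchmarked code):

```haskell
import Data.Bits (shiftR)
import Data.Char (ord)
import Data.Word (Word8)

branching :: Word8 -> Int
branching w
  | w < 0x80  = 1
  | w < 0xE0  = 2
  | w < 0xF0  = 3
  | otherwise = 4

simple :: Word8 -> Int
simple w = 1 + fromEnum (w >= 0x80) + fromEnum (w >= 0xE0) + fromEnum (w >= 0xF0)

lookupTbl :: Word8 -> Int
lookupTbl w = ord ("\1\1\1\1\1\1\1\1\1\1\1\1\2\2\3\4" !! fromIntegral (w `shiftR` 4))

-- Valid UTF-8 leader bytes: ASCII plus 0xC2..0xF4; 0x80..0xBF are
-- continuation bytes and can never start a well-formed sequence.
validLeaders :: [Word8]
validLeaders = [0x00 .. 0x7F] ++ [0xC2 .. 0xF4]

main :: IO ()
main = print (all (\w -> branching w == simple w && branching w == lookupTbl w) validLeaders)
-- prints True
```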

I benchmarked the implementations on two situations:

  • sum: branches on the result
  • length: does not branch on the result

The results are

  100 ascii
    sum
      current:   242  ns ±  23 ns
      branching: 70.6 ns ± 5.3 ns, 0.29x
      simple:    162  ns ±  11 ns, 0.67x
      lookup:    108  ns ± 5.2 ns, 0.45x
    length
      current:   467  ns ±  43 ns
      branching: 56.7 ns ± 5.5 ns, 0.12x
      simple:    306  ns ±  24 ns, 0.65x
      lookup:    288  ns ±  21 ns, 0.62x
  100 random
    sum
      current:   331  ns ±  21 ns
      branching: 153  ns ±  10 ns, 0.46x
      simple:    227  ns ±  21 ns, 0.69x
      lookup:    188  ns ±  11 ns, 0.57x
    length
      current:   466  ns ±  43 ns
      branching: 152  ns ±  12 ns, 0.33x
      simple:    307  ns ±  23 ns, 0.66x
      lookup:    285  ns ± 5.5 ns, 0.61x
  100000 ascii
    sum
      current:   226  μs ±  11 μs
      branching: 57.4 μs ± 5.4 μs, 0.25x
      simple:    148  μs ±  11 μs, 0.65x
      lookup:    95.0 μs ± 5.3 μs, 0.42x
    length
      current:   447  μs ±  43 μs
      branching: 44.1 μs ± 2.8 μs, 0.10x
      simple:    279  μs ±  27 μs, 0.62x
      lookup:    272  μs ±  22 μs, 0.61x
  100000 random
    sum
      current:   1.00 ms ±  87 μs
      branching: 566  μs ±  46 μs, 0.57x
      simple:    830  μs ±  45 μs, 0.83x
      lookup:    850  μs ±  59 μs, 0.85x
    length
      current:   452  μs ±  45 μs
      branching: 537  μs ±  49 μs, 1.19x
      simple:    282  μs ±  24 μs, 0.62x
      lookup:    274  μs ±  23 μs, 0.61x

So I think it would be good to

  1. Add a branching version of utf8LengthByLeader and use it where one needs to branch on the result
  2. Replace the current version with either simple or lookup. lookup is sometimes faster but it means inlining that table everywhere.

It would also be good if someone could repeat these benchmarks on a different CPU or architecture to see if anything changes. The results above are from GHC 9.10.1 on an x86 Ryzen 5 3600.

Full benchmark source below.

{- cabal:
build-depends: base, random, text, tasty-bench
-}
{-# LANGUAGE BangPatterns #-}
{-# LANGUAGE MagicHash #-}
{-# OPTIONS_GHC -ddump-simpl -ddump-to-file #-}

import qualified Data.List as L
import qualified Data.Text as T
import System.Random (mkStdGen, uniforms)

import qualified Data.Text.Array as A
import Data.Text.Internal (Text(..))
import Data.Text.Internal.Encoding.Utf8 (utf8LengthByLeader, chr2, chr3, chr4)
import Data.Text.Internal.Unsafe.Char (unsafeChr8)
import Data.Text.Unsafe (iterArray, Iter(..))
import GHC.Word (Word8(..))
import GHC.Exts

import Test.Tasty.Bench

main :: IO ()
main = defaultMain
  [ env (pure (mk n)) $ \t ->
    bgroup name
      [ bgroup "sum"
        [ bench "current" $ whnf (mkSum utf8LengthByLeader) t
        , bcompare (currentName "sum") $
            bench "branching" $ whnf (mkSum utf8LengthByLeader_branching) t
        , bcompare (currentName "sum") $
            bench "simple" $ whnf (mkSum utf8LengthByLeader_simple) t
        , bcompare (currentName "sum") $
            bench "lookup" $ whnf (mkSum utf8LengthByLeader_lookup) t
        ]
      , bgroup "length"
        [ bench "current" $ whnf (mkLength utf8LengthByLeader) t
        , bcompare (currentName "length") $
            bench "branching" $ whnf (mkLength utf8LengthByLeader_branching) t
        , bcompare (currentName "length") $
            bench "simple" $ whnf (mkLength utf8LengthByLeader_simple) t
        , bcompare (currentName "length") $
            bench "lookup" $ whnf (mkLength utf8LengthByLeader_lookup) t
        ]
      ]
  | n <- [100, 100000]
  , (typ, mk) <- [("ascii", mkAscii), ("random", mkRandom)]
  , let name = show n ++ " " ++ typ
        currentName which = "All." ++ name ++ "." ++ which ++ ".current"
  ]

----------------------
-- Benchmarked funcs
----------------------

-- Branches on the char length
mkSum :: (Word8 -> Int) -> Text -> Int
mkSum f = \(Text arr off len) -> loop arr (off + len) off 0
  where
    loop !a !n !i !acc
      | i >= n = acc
      | otherwise = loop a n (i + d) (acc + fromEnum c)
      where
        m0 = A.unsafeIndex a i
        m1 = A.unsafeIndex a (i + 1)
        m2 = A.unsafeIndex a (i + 2)
        m3 = A.unsafeIndex a (i + 3)
        d = f m0
        c = case d of
          1 -> unsafeChr8 m0
          2 -> chr2 m0 m1
          3 -> chr3 m0 m1 m2
          _ -> chr4 m0 m1 m2 m3
{-# INLINE mkSum #-}

-- Does not branch on the char length
mkLength :: (Word8 -> Int) -> Text -> Int
mkLength f = \(Text arr off len) -> loop arr (off + len) off 0
  where
    loop !a !n !i !acc
      | i >= n = acc
      | otherwise = loop a n (i + f (A.unsafeIndex a i)) (acc + 1)
{-# INLINE mkLength #-}

---------
-- Data
---------

mkAscii :: Int -> Text
mkAscii n =
  T.pack $
  L.take n $
  L.cycle $
  "The quick brown fox jumps over the lazy dog\n"

-- Each Char's utf-8 byte length is random. This takes away any advantage
-- the branching impl might have due to branch prediction.
mkRandom :: Int -> Text
mkRandom n =
  T.pack $
  L.map mkC $
  L.take n $
  uniforms (mkStdGen 42)
  where
    mkC :: Int -> Char
    mkC i = toEnum $ case i `mod` 4 of
      0 -> 0x00007F
      1 -> 0x0007FF
      2 -> 0x00FFFF
      3 -> 0x10FFFF

-----------------------------
-- utf8LengthByLeader impls
-----------------------------

utf8LengthByLeader_branching :: Word8 -> Int
utf8LengthByLeader_branching w
  | w < 0x80  = 1
  | w < 0xE0  = 2
  | w < 0xF0  = 3
  | otherwise = 4
{-# INLINE utf8LengthByLeader_branching #-}

utf8LengthByLeader_simple :: Word8 -> Int
utf8LengthByLeader_simple (W8# w8#) =
  let w# = word8ToWord# w8#
  in   1
     + I# (w# `geWord#` 0x80##)
     + I# (w# `geWord#` 0xE0##)
     + I# (w# `geWord#` 0xF0##)
{-# INLINE utf8LengthByLeader_simple #-}

utf8LengthByLeader_lookup :: Word8 -> Int
utf8LengthByLeader_lookup (W8# w8#) =
  I# (int8ToInt# (indexInt8OffAddr# a# (word2Int# (uncheckedShiftRL# (word8ToWord# w8#) 4#))))
  where
    a# = "\1\1\1\1\1\1\1\1\1\1\1\1\2\2\3\4"#
    -- 0..7 -> 1
    -- 8..B -> invalid leading byte
    -- C..D -> 2
    -- E    -> 3
    -- F    -> 4
{-# INLINE utf8LengthByLeader_lookup #-}
@meooow25
Contributor Author

@Bodigrim @Lysxia thoughts?

@Lysxia
Contributor

Lysxia commented Mar 31, 2025

+1 I'm inclined to keep it simple and use only the branching version, with the hope that branch prediction will work well enough even in non-English languages.

I'm not qualified to have anything to say about comparing the branchless versions.

I get pretty much the same numbers on my i7-10750H, GHC 9.12.1.

@Bodigrim
Contributor

Bodigrim commented Apr 1, 2025

Benchmarks on aarch64:

All
  100 ascii
    sum
      current:   OK
        553  ns ±  51 ns
      branching: OK
        125  ns ± 7.0 ns, 0.23x
      simple:    OK
        225  ns ±  14 ns, 0.41x
      lookup:    OK
        220  ns ±  15 ns, 0.40x
    length
      current:   OK
        404  ns ±  30 ns
      branching: OK
        126  ns ± 7.0 ns, 0.31x
      simple:    OK
        211  ns ±  16 ns, 0.52x
      lookup:    OK
        218  ns ±  13 ns, 0.54x
  100 random
    sum
      current:   OK
        603  ns ±  52 ns
      branching: OK
        311  ns ±  28 ns, 0.51x
      simple:    OK
        372  ns ±  26 ns, 0.62x
      lookup:    OK
        343  ns ±  28 ns, 0.57x
    length
      current:   OK
        900  ns ±  53 ns
      branching: OK
        280  ns ±  18 ns, 0.31x
      simple:    OK
        707  ns ±  60 ns, 0.79x
      lookup:    OK
        750  ns ±  55 ns, 0.83x
  100000 ascii
    sum
      current:   OK
        521  μs ±  27 μs
      branching: OK
        102  μs ± 6.7 μs, 0.20x
      simple:    OK
        204  μs ±  14 μs, 0.39x
      lookup:    OK
        204  μs ±  13 μs, 0.39x
    length
      current:   OK
        349  μs ±  29 μs
      branching: OK
        102  μs ± 6.7 μs, 0.29x
      simple:    OK
        156  μs ±  14 μs, 0.45x
      lookup:    OK
        104  μs ± 7.1 μs, 0.30x
  100000 random
    sum
      current:   OK
        1.76 ms ± 106 μs
      branching: OK
        1.00 ms ±  56 μs, 0.57x
      simple:    OK
        1.31 ms ± 109 μs, 0.75x
      lookup:    OK
        1.42 ms ± 112 μs, 0.81x
    length
      current:   OK
        886  μs ±  53 μs
      branching: OK
        918  μs ±  59 μs, 1.04x
      simple:    OK
        681  μs ±  53 μs, 0.77x
      lookup:    OK
        723  μs ±  54 μs, 0.82x

@Bodigrim
Contributor

Bodigrim commented Apr 1, 2025

> The definition does not seem to be very efficient, at least on my machine.

It would be interesting to figure out why. The assembly looks good to me: https://play.haskell.org/saved/ryiSjDbk

@meooow25
Contributor Author

meooow25 commented Apr 3, 2025

A couple more lookup-based branchless versions that work slightly better:

utf8LengthByLeader_lookupInt :: Word8 -> Int
utf8LengthByLeader_lookupInt w8 =
  1 + (0x3 .&. unsafeShiftR a (unsafeShiftL (unsafeShiftR (w8ToI w8) 4) 1))
  where
    a = 0xE5000000 :: Int
{-# INLINE utf8LengthByLeader_lookupInt #-}

utf8LengthByLeader_lookupInt64 :: Word8 -> Int
utf8LengthByLeader_lookupInt64 w8 =
  0xF .&. i64ToI (unsafeShiftR a (unsafeShiftL (unsafeShiftR (w8ToI w8) 4) 2))
  where
    a = 0x4322111111111111 :: Int64
{-# INLINE utf8LengthByLeader_lookupInt64 #-}

i64ToI :: Int64 -> Int
i64ToI = fromIntegral

w8ToI :: Word8 -> Int
w8ToI = fromIntegral
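To spell out the packing trick in utf8LengthByLeader_lookupInt64 (a portable sketch with plain Data.Bits in place of the primop helpers): each of the 16 possible high nibbles of the leader byte gets one nibble of the Int64 constant, so the shift-and-mask is an indexed read from a 16-entry table held in a single register:

```haskell
import Data.Bits ((.&.), unsafeShiftR)
import Data.Int (Int64)
import Data.Word (Word8)

-- One nibble of the Int64 constant per possible high nibble of the leader:
-- nibbles 0..7 hold 1 (ASCII), C..D hold 2, E holds 3, F holds 4.
lookupInt64' :: Word8 -> Int
lookupInt64' w8 =
  0xF .&. fromIntegral (unsafeShiftR table (4 * fromIntegral (unsafeShiftR w8 4)))
  where
    table = 0x4322111111111111 :: Int64

main :: IO ()
main = print (map lookupInt64' [0x41, 0xC3, 0xE2, 0xF0])
-- prints [1,2,3,4]
```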

> +1 I'm inclined to keep it simple and use only the branching version, with the hope that branch prediction will work well enough even in non-English languages.

I agree that if you'd rather have just one version, it should be the branching one.

> > The definition does not seem to be very efficient, at least on my machine.
>
> It would be interesting to figure out why. The assembly looks good to me: https://play.haskell.org/saved/ryiSjDbk

I wish I could explain it, but you'll need someone more knowledgeable in this area.
If the branchless version is removed though, this question becomes less important.

@Lysxia
Contributor

Lysxia commented Apr 4, 2025

I propose to move forward with the branching version.

Neat trick to make the lookup-table fit an Int!

@meooow25
Contributor Author

@Bodigrim what do you think?

@Bodigrim
Contributor

Bodigrim commented May 1, 2025

I don't have time to think about it right now; I'm happy to defer judgement to @Lysxia. So I guess you are good to go.

@meooow25 linked a pull request (#635) on May 18, 2025 that will close this issue.