Decoupling byte-level encoding #535

Open
@BurningWitness

Description

When writing a JSON parser (GaloisInc/json#17) I needed some way to decode UTF-8, and to my dismay I found that none of the existing solutions fit my expectations:

  • GHC.Encoding.UTF8 and GHC.IO.Encoding are IO-based, which I don't want in a parser;
  • Data.Text.Internal.Encoding.Utf8, while pure, appears to report every error as a bare Reject and has a rather complex interface;
  • Data.Text.Encoding.* and Data.Text.Lazy.Encoding.* are already parsers themselves, too high-level for this task;
  • utf8-string's Codec.Binary.UTF8.String consumes and returns lists, so it isn't parser-compatible.

I decided to hand-roll the UTF-8 decoding, which allowed me to categorize the errors (see Encoding.Mixed.Error), but it also resulted in a lot of code on the parser side that has little to do with consuming bytes per se (see Codec.Web.JSON.Parse.String).

However, the code I wrote can be generalized to:

-- Assume Error is Encoding.Mixed.Error.Error
import Data.Word (Word8)

-- State machine for decoding a single UTF-8 encoded code point, one byte
-- at a time. Feeding the first byte yields one of:
data UTF8 a = UTF8_1 a                       -- a 1-byte character, finished
            | Part_2 (Word8 -> UTF8_2 a)     -- a 2-byte sequence, awaiting byte 2
            | Part_3_1 (Word8 -> Part_3_1 a) -- a 3-byte sequence, awaiting byte 2
            | Part_4_1 (Word8 -> Part_4_1 a) -- a 4-byte sequence, awaiting byte 2
            | Error_1 Error                  -- an ill-formed first byte


-- A 2-byte sequence after feeding byte 2: either finished or ill-formed.
data UTF8_2 a = UTF8_2 a
              | Error_2 Error


-- A 3-byte sequence after feeding bytes 2 and 3.
data Part_3_1 a = Part_3_2 (Word8 -> UTF8_3 a)
                | Error_3_1 Error

data UTF8_3 a = UTF8_3 a
              | Error_3_2 Error


-- A 4-byte sequence after feeding bytes 2, 3 and 4.
data Part_4_1 a = Part_4_2 (Word8 -> Part_4_2 a)
                | Error_4_1 Error

data Part_4_2 a = Part_4_3 (Word8 -> UTF8_4 a)
                | Error_4_2 Error

data UTF8_4 a = UTF8_4 a
              | Error_4_3 Error


-- Conversion callbacks: how to assemble a value from the bytes of a
-- well-formed 1-, 2-, 3- or 4-byte sequence.
newtype Conv1 a = Conv1 (Word8 -> a)
newtype Conv2 a = Conv2 (Word8 -> Word8 -> a)
newtype Conv3 a = Conv3 (Word8 -> Word8 -> Word8 -> a)
newtype Conv4 a = Conv4 (Word8 -> Word8 -> Word8 -> Word8 -> a)

utf8 :: Conv1 a -> Conv2 a -> Conv3 a -> Conv4 a -> Word8 -> UTF8 a
utf8 = -- I'm omitting the implementation, but it's only 50 lines long
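
For concreteness, here is what such an implementation could look like. This is only a hedged sketch, not the actual code: InvalidFirstByte and InvalidContinuation are hypothetical stand-ins for whatever Encoding.Mixed.Error.Error provides, and the validation is deliberately simplified (a full version would also reject E0/F0 overlongs, ED surrogates and F4 out-of-range forms at the second byte).

import Data.Bits ((.&.))

-- Hypothetical error type, for illustration only.
data Error = InvalidFirstByte Word8 | InvalidContinuation Word8

-- A continuation byte has the shape 10xxxxxx.
isCont :: Word8 -> Bool
isCont w = w .&. 0xC0 == 0x80

utf8 :: Conv1 a -> Conv2 a -> Conv3 a -> Conv4 a -> Word8 -> UTF8 a
utf8 (Conv1 f1) (Conv2 f2) (Conv3 f3) (Conv4 f4) w0
  | w0 <= 0x7F = UTF8_1 (f1 w0)                -- ASCII
  | w0 <  0xC2 = Error_1 (InvalidFirstByte w0) -- stray continuation or overlong C0/C1
  | w0 <= 0xDF = Part_2 $ \w1 ->
                   if isCont w1 then UTF8_2 (f2 w0 w1)
                                else Error_2 (InvalidContinuation w1)
  | w0 <= 0xEF = Part_3_1 $ \w1 ->
                   if isCont w1
                     then Part_3_2 $ \w2 ->
                            if isCont w2 then UTF8_3 (f3 w0 w1 w2)
                                         else Error_3_2 (InvalidContinuation w2)
                     else Error_3_1 (InvalidContinuation w1)
  | w0 <= 0xF4 = Part_4_1 $ \w1 ->
                   if isCont w1
                     then Part_4_2 $ \w2 ->
                            if isCont w2
                              then Part_4_3 $ \w3 ->
                                     if isCont w3 then UTF8_4 (f4 w0 w1 w2 w3)
                                                  else Error_4_3 (InvalidContinuation w3)
                              else Error_4_2 (InvalidContinuation w2)
                     else Error_4_1 (InvalidContinuation w1)
  | otherwise  = Error_1 (InvalidFirstByte w0) -- F5..FF can never start a sequence

Note that nothing here touches input management: the function only classifies bytes, so it can sit under any parser.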

Parsing is then simply a matter of unwrapping UTF8 (see the driver sketch below). This decouples character validation from conversion; the only part of decoding left is ensuring that only the maximal subpart of an ill-formed sequence is consumed, which is the parser's responsibility.
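
To make the unwrapping concrete, below is a hedged sketch of a driver over a strict ByteString. decodeChar and the conv* callbacks are made up for this example; a real parser would also surface the Error and consume exactly the maximal subpart of an ill-formed sequence instead of bailing out.

import Data.Bits ((.&.), (.|.), shiftL)
import Data.Char (chr)
import qualified Data.ByteString as B

-- Hypothetical callbacks assembling a Char from the payload bits of a
-- well-formed sequence.
conv1 :: Conv1 Char
conv1 = Conv1 $ \w0 -> chr (fromIntegral w0)

conv2 :: Conv2 Char
conv2 = Conv2 $ \w0 w1 ->
  chr $ (fromIntegral (w0 .&. 0x1F) `shiftL` 6)
    .|.  fromIntegral (w1 .&. 0x3F)

conv3 :: Conv3 Char
conv3 = Conv3 $ \w0 w1 w2 ->
  chr $ (fromIntegral (w0 .&. 0x0F) `shiftL` 12)
    .|. (fromIntegral (w1 .&. 0x3F) `shiftL` 6)
    .|.  fromIntegral (w2 .&. 0x3F)

conv4 :: Conv4 Char
conv4 = Conv4 $ \w0 w1 w2 w3 ->
  chr $ (fromIntegral (w0 .&. 0x07) `shiftL` 18)
    .|. (fromIntegral (w1 .&. 0x3F) `shiftL` 12)
    .|. (fromIntegral (w2 .&. 0x3F) `shiftL` 6)
    .|.  fromIntegral (w3 .&. 0x3F)

-- Decode one code point from the head of the input, returning the rest.
decodeChar :: B.ByteString -> Maybe (Char, B.ByteString)
decodeChar bs0 = do
  (w0, bs1) <- B.uncons bs0
  case utf8 conv1 conv2 conv3 conv4 w0 of
    UTF8_1 c   -> Just (c, bs1)
    Error_1 _  -> Nothing
    Part_2 k   -> do
      (w1, bs2) <- B.uncons bs1
      case k w1 of
        UTF8_2 c  -> Just (c, bs2)
        Error_2 _ -> Nothing
    Part_3_1 k -> do
      (w1, bs2) <- B.uncons bs1
      case k w1 of
        Error_3_1 _ -> Nothing
        Part_3_2 k' -> do
          (w2, bs3) <- B.uncons bs2
          case k' w2 of
            UTF8_3 c    -> Just (c, bs3)
            Error_3_2 _ -> Nothing
    Part_4_1 k -> do
      (w1, bs2) <- B.uncons bs1
      case k w1 of
        Error_4_1 _ -> Nothing
        Part_4_2 k' -> do
          (w2, bs3) <- B.uncons bs2
          case k' w2 of
            Error_4_2 _  -> Nothing
            Part_4_3 k'' -> do
              (w3, bs4) <- B.uncons bs3
              case k'' w3 of
                UTF8_4 c    -> Just (c, bs4)
                Error_4_3 _ -> Nothing

All byte-level validation lives in utf8; decodeChar only routes bytes and assembles the result. Producing something other than Char (a code point index, a write into a buffer) is just a different set of Conv callbacks.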


My proposal is to create a separate package focused specifically on byte-level decoding/encoding of UTF-8/UTF-16/UTF-32. text could then drop some internal modules in favor of a simpler common interface.

This proposal is, however, naive: I do not know whether GHC can reliably inline code built around these datatypes or, indeed, whether it can at all. Based on my cursory reading of the Secrets of the Glasgow Haskell Compiler inliner paper it should be able to, as each of these expressions is trivial.

This doesn't clash with the issue of GHC's many UTF-8 implementations (outlined in GHC.Encoding.UTF8), as all the other algorithms are in IO.

Other concerns:

  • text is a core library, so I assume an extra dependency can't just be added on a whim;
  • A package named utf already exists and is deprecated; I don't know how hard it is to reclaim a deprecated package name.
