Decoupling byte-level encoding #535

Open
@BurningWitness

Description

When writing a JSON parser (GaloisInc/json#17) I needed some way to decode UTF-8, and to my dismay I found that none of the existing solutions fit my expectations:

  • GHC.Encoding.UTF8 and GHC.IO.Encoding are IO-based, which I don't want in a parser;
  • Data.Text.Internal.Encoding.Utf8, while pure, appears to report every error as a bare Reject and has a rather complex interface;
  • Data.Text.Encoding.* and Data.Text.Lazy.Encoding.* are already parsers themselves, too high-level for this task;
  • utf8-string's Codec.Binary.UTF8.String consumes and returns lists, so it isn't parser-compatible.

I decided to hand-roll the UTF-8 decoding, which allowed me to categorize the errors (see Encoding.Mixed.Error), but it also resulted in a lot of code on the parser side that has little to do with consuming bytes per se (see Codec.Web.JSON.Parse.String).

However, the code I wrote can be generalized to:

-- Assume Error is Encoding.Mixed.Error.Error
import Data.Word (Word8)

-- State machine for decoding a single UTF-8 encoded code point, one byte
-- at a time. Feeding the first byte yields one of:
data UTF8 a = UTF8_1 a                       -- a 1-byte character, finished
            | Part_2 (Word8 -> UTF8_2 a)     -- a 2-byte sequence, awaiting byte 2
            | Part_3_1 (Word8 -> Part_3_1 a) -- a 3-byte sequence, awaiting byte 2
            | Part_4_1 (Word8 -> Part_4_1 a) -- a 4-byte sequence, awaiting byte 2
            | Error_1 Error                  -- an ill-formed first byte


-- A 2-byte sequence after feeding byte 2: either finished or ill-formed.
data UTF8_2 a = UTF8_2 a
              | Error_2 Error


-- A 3-byte sequence after feeding bytes 2 and 3.
data Part_3_1 a = Part_3_2 (Word8 -> UTF8_3 a)
                | Error_3_1 Error

data UTF8_3 a = UTF8_3 a
              | Error_3_2 Error


-- A 4-byte sequence after feeding bytes 2, 3 and 4.
data Part_4_1 a = Part_4_2 (Word8 -> Part_4_2 a)
                | Error_4_1 Error

data Part_4_2 a = Part_4_3 (Word8 -> UTF8_4 a)
                | Error_4_2 Error

data UTF8_4 a = UTF8_4 a
              | Error_4_3 Error


-- Conversion callbacks: how to assemble a value from the bytes of a
-- well-formed 1-, 2-, 3- or 4-byte sequence.
newtype Conv1 a = Conv1 (Word8 -> a)
newtype Conv2 a = Conv2 (Word8 -> Word8 -> a)
newtype Conv3 a = Conv3 (Word8 -> Word8 -> Word8 -> a)
newtype Conv4 a = Conv4 (Word8 -> Word8 -> Word8 -> Word8 -> a)

utf8 :: Conv1 a -> Conv2 a -> Conv3 a -> Conv4 a -> Word8 -> UTF8 a
utf8 = -- I'm omitting the implementation, but it's only 50 lines long
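
For concreteness, here is what such an implementation could look like. This is only a hedged sketch, not the actual code: InvalidFirstByte and InvalidContinuation are hypothetical stand-ins for whatever Encoding.Mixed.Error.Error provides, and the validation is deliberately simplified (a full version would also reject E0/F0 overlongs, ED surrogates and F4 out-of-range forms at the second byte).

import Data.Bits ((.&.))

-- Hypothetical error type, for illustration only.
data Error = InvalidFirstByte Word8 | InvalidContinuation Word8

-- A continuation byte has the shape 10xxxxxx.
isCont :: Word8 -> Bool
isCont w = w .&. 0xC0 == 0x80

utf8 :: Conv1 a -> Conv2 a -> Conv3 a -> Conv4 a -> Word8 -> UTF8 a
utf8 (Conv1 f1) (Conv2 f2) (Conv3 f3) (Conv4 f4) w0
  | w0 <= 0x7F = UTF8_1 (f1 w0)                -- ASCII
  | w0 <  0xC2 = Error_1 (InvalidFirstByte w0) -- stray continuation or overlong C0/C1
  | w0 <= 0xDF = Part_2 $ \w1 ->
                   if isCont w1 then UTF8_2 (f2 w0 w1)
                                else Error_2 (InvalidContinuation w1)
  | w0 <= 0xEF = Part_3_1 $ \w1 ->
                   if isCont w1
                     then Part_3_2 $ \w2 ->
                            if isCont w2 then UTF8_3 (f3 w0 w1 w2)
                                         else Error_3_2 (InvalidContinuation w2)
                     else Error_3_1 (InvalidContinuation w1)
  | w0 <= 0xF4 = Part_4_1 $ \w1 ->
                   if isCont w1
                     then Part_4_2 $ \w2 ->
                            if isCont w2
                              then Part_4_3 $ \w3 ->
                                     if isCont w3 then UTF8_4 (f4 w0 w1 w2 w3)
                                                  else Error_4_3 (InvalidContinuation w3)
                              else Error_4_2 (InvalidContinuation w2)
                     else Error_4_1 (InvalidContinuation w1)
  | otherwise  = Error_1 (InvalidFirstByte w0) -- F5..FF can never start a sequence

Note that nothing here touches input management: the function only classifies bytes, so it can sit under any parser.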

Parsing is then simply a matter of unwrapping UTF8 (see the driver sketch below). This decouples character validation from conversion; the only part of decoding left is ensuring that only the maximal subpart of an ill-formed sequence is consumed, which is the parser's responsibility.
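
To make the unwrapping concrete, below is a hedged sketch of a driver over a strict ByteString. decodeChar and the conv* callbacks are made up for this example; a real parser would also surface the Error and consume exactly the maximal subpart of an ill-formed sequence instead of bailing out.

import Data.Bits ((.&.), (.|.), shiftL)
import Data.Char (chr)
import qualified Data.ByteString as B

-- Hypothetical callbacks assembling a Char from the payload bits of a
-- well-formed sequence.
conv1 :: Conv1 Char
conv1 = Conv1 $ \w0 -> chr (fromIntegral w0)

conv2 :: Conv2 Char
conv2 = Conv2 $ \w0 w1 ->
  chr $ (fromIntegral (w0 .&. 0x1F) `shiftL` 6)
    .|.  fromIntegral (w1 .&. 0x3F)

conv3 :: Conv3 Char
conv3 = Conv3 $ \w0 w1 w2 ->
  chr $ (fromIntegral (w0 .&. 0x0F) `shiftL` 12)
    .|. (fromIntegral (w1 .&. 0x3F) `shiftL` 6)
    .|.  fromIntegral (w2 .&. 0x3F)

conv4 :: Conv4 Char
conv4 = Conv4 $ \w0 w1 w2 w3 ->
  chr $ (fromIntegral (w0 .&. 0x07) `shiftL` 18)
    .|. (fromIntegral (w1 .&. 0x3F) `shiftL` 12)
    .|. (fromIntegral (w2 .&. 0x3F) `shiftL` 6)
    .|.  fromIntegral (w3 .&. 0x3F)

-- Decode one code point from the head of the input, returning the rest.
decodeChar :: B.ByteString -> Maybe (Char, B.ByteString)
decodeChar bs0 = do
  (w0, bs1) <- B.uncons bs0
  case utf8 conv1 conv2 conv3 conv4 w0 of
    UTF8_1 c   -> Just (c, bs1)
    Error_1 _  -> Nothing
    Part_2 k   -> do
      (w1, bs2) <- B.uncons bs1
      case k w1 of
        UTF8_2 c  -> Just (c, bs2)
        Error_2 _ -> Nothing
    Part_3_1 k -> do
      (w1, bs2) <- B.uncons bs1
      case k w1 of
        Error_3_1 _ -> Nothing
        Part_3_2 k' -> do
          (w2, bs3) <- B.uncons bs2
          case k' w2 of
            UTF8_3 c    -> Just (c, bs3)
            Error_3_2 _ -> Nothing
    Part_4_1 k -> do
      (w1, bs2) <- B.uncons bs1
      case k w1 of
        Error_4_1 _ -> Nothing
        Part_4_2 k' -> do
          (w2, bs3) <- B.uncons bs2
          case k' w2 of
            Error_4_2 _  -> Nothing
            Part_4_3 k'' -> do
              (w3, bs4) <- B.uncons bs3
              case k'' w3 of
                UTF8_4 c    -> Just (c, bs4)
                Error_4_3 _ -> Nothing

All byte-level validation lives in utf8; decodeChar only routes bytes and assembles the result. Producing something other than Char (a code point index, a write into a buffer) is just a different set of Conv callbacks.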


My proposal is to create a separate package focused specifically on byte-level decoding/encoding of UTF-8/UTF-16/UTF-32. text could then drop some internal modules in favor of a simpler common interface.

This proposal is, however, naive: I do not know whether GHC can reliably inline code built around these datatypes or, indeed, whether it can at all. Based on my cursory reading of the Secrets of the Glasgow Haskell Compiler inliner paper it should be able to, as each of these expressions is trivial.

This doesn't clash with the issue of GHC's many UTF-8 implementations (outlined in GHC.Encoding.UTF8), as all the other algorithms are in IO.

Other concerns:

  • text is a core library, so I assume an extra dependency can't just be added on a whim;
  • A package named utf already exists and is deprecated; I don't know how hard it is to reclaim a deprecated package name.
