Description
When writing a JSON parser (GaloisInc/json#17) I needed a way to decode UTF-8, and to my dismay I found that none of the existing solutions fit my expectations:

- `GHC.Encoding.UTF8` and `GHC.IO.Encoding` are `IO`-based, and I don't want that in a parser;
- `Data.Text.Internal.Encoding.Utf8`, while pure, appears to return only a bare `Reject` on error and has a rather complex interface;
- `Data.Text.Encoding.*` and `Data.Text.Lazy.Encoding.*` are already parsers themselves, too high-level for this task;
- `utf8-string`'s `Codec.Binary.UTF8.String` consumes and returns lists, so it isn't parser-compatible.
I decided to hand-roll the UTF-8 decoding, which allowed me to categorize the errors (see `Encoding.Mixed.Error`) and resulted in a lot of code on the parser side that has little to do with consuming bytes per se (see `Codec.Web.JSON.Parse.String`).
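For illustration only, a categorization along those lines might look like the sketch below; the actual constructors of `Encoding.Mixed.Error.Error` may well be named and split differently:

```haskell
import Data.Word (Word8)

-- Illustrative only; the real Encoding.Mixed.Error.Error in GaloisInc/json
-- may categorize these differently.
data Error = InvalidFirstByte Word8    -- byte can never begin a sequence
           | InvalidContinuation Word8 -- expected a 10xxxxxx continuation byte
           | OverlongEncoding          -- code point encoded with excess bytes
           | SurrogateCodePoint        -- U+D800..U+DFFF are not scalar values
           | AboveMaxCodePoint         -- code point beyond U+10FFFF
```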
However, the code I wrote can instead be generalized to:
```haskell
-- Assume Error is Encoding.Mixed.Error.Error

data UTF8 a = UTF8_1 a                        -- complete 1-byte character
            | Part_2   (Word8 -> UTF8_2 a)    -- 2-byte sequence, 1 byte missing
            | Part_3_1 (Word8 -> Part_3_1 a)  -- 3-byte sequence, 2 bytes missing
            | Part_4_1 (Word8 -> Part_4_1 a)  -- 4-byte sequence, 3 bytes missing
            | Error_1 Error

data UTF8_2 a = UTF8_2 a
              | Error_2 Error

data Part_3_1 a = Part_3_2 (Word8 -> UTF8_3 a)
                | Error_3_1 Error

data UTF8_3 a = UTF8_3 a
              | Error_3_2 Error

data Part_4_1 a = Part_4_2 (Word8 -> Part_4_2 a)
                | Error_4_1 Error

data Part_4_2 a = Part_4_3 (Word8 -> UTF8_4 a)
                | Error_4_2 Error

data UTF8_4 a = UTF8_4 a
              | Error_4_3 Error

newtype Conv1 a = Conv1 (Word8 -> a)
newtype Conv2 a = Conv2 (Word8 -> Word8 -> a)
newtype Conv3 a = Conv3 (Word8 -> Word8 -> Word8 -> a)
newtype Conv4 a = Conv4 (Word8 -> Word8 -> Word8 -> Word8 -> a)

utf8 :: Conv1 a -> Conv2 a -> Conv3 a -> Conv4 a -> Word8 -> UTF8 a
utf8 = -- I'm omitting the implementation, but it's only 50 lines long
```
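To make the `Conv` side concrete, here is a sketch of how the four conversion callbacks could be instantiated to produce `Char`s. This is just the standard UTF-8 payload arithmetic, none of it prescribed by the proposal, and it is only safe because `utf8` is assumed to have rejected ill-formed sequences before these callbacks run:

```haskell
import Data.Bits ((.&.), (.|.), shiftL)
import Data.Char (chr)
import Data.Word (Word8)

-- Combine validated UTF-8 bytes into code points, one callback per length.
char1 :: Conv1 Char
char1 = Conv1 $ \w0 -> chr (fromIntegral w0)

char2 :: Conv2 Char
char2 = Conv2 $ \w0 w1 ->
  chr $  fromIntegral (w0 .&. 0x1F) `shiftL` 6
     .|. fromIntegral (w1 .&. 0x3F)

char3 :: Conv3 Char
char3 = Conv3 $ \w0 w1 w2 ->
  chr $  fromIntegral (w0 .&. 0x0F) `shiftL` 12
     .|. fromIntegral (w1 .&. 0x3F) `shiftL` 6
     .|. fromIntegral (w2 .&. 0x3F)

char4 :: Conv4 Char
char4 = Conv4 $ \w0 w1 w2 w3 ->
  chr $  fromIntegral (w0 .&. 0x07) `shiftL` 18
     .|. fromIntegral (w1 .&. 0x3F) `shiftL` 12
     .|. fromIntegral (w2 .&. 0x3F) `shiftL` 6
     .|. fromIntegral (w3 .&. 0x3F)
```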
Parsing then is simply unwrapping `UTF8`. This decouples character validation from conversion; the only part of decoding left is ensuring that only the maximal subpart of an ill-formed sequence is consumed, which is the responsibility of the parser.
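As a usage sketch, the unwrapping against a plain `[Word8]` input might look like this, building on the definitions above. `decodeChar` and `Result` are made-up names, and a real parser would additionally track its position to implement the maximal-subpart rule:

```haskell
-- Hypothetical driver, not the proposed API: decode one character from a
-- list of bytes, stopping at the first offending byte.
data Result = Decoded Char [Word8]  -- decoded character and remaining input
            | Malformed Error       -- ill-formed sequence, categorized
            | Incomplete            -- input ended mid-sequence

decodeChar :: [Word8] -> Result
decodeChar []       = Incomplete
decodeChar (w0:ws0) =
  case utf8 char1 char2 char3 char4 w0 of
    Error_1 e  -> Malformed e
    UTF8_1 c   -> Decoded c ws0
    Part_2 k   -> with ws0 $ \w1 ws1 ->
      case k w1 of
        Error_2 e -> Malformed e
        UTF8_2 c  -> Decoded c ws1
    Part_3_1 k -> with ws0 $ \w1 ws1 ->
      case k w1 of
        Error_3_1 e -> Malformed e
        Part_3_2 g  -> with ws1 $ \w2 ws2 ->
          case g w2 of
            Error_3_2 e -> Malformed e
            UTF8_3 c    -> Decoded c ws2
    Part_4_1 k -> with ws0 $ \w1 ws1 ->
      case k w1 of
        Error_4_1 e -> Malformed e
        Part_4_2 g  -> with ws1 $ \w2 ws2 ->
          case g w2 of
            Error_4_2 e -> Malformed e
            Part_4_3 h  -> with ws2 $ \w3 ws3 ->
              case h w3 of
                Error_4_3 e -> Malformed e
                UTF8_4 c    -> Decoded c ws3
  where
    -- feed the next byte to a continuation, or report truncated input
    with :: [Word8] -> (Word8 -> [Word8] -> Result) -> Result
    with []     _ = Incomplete
    with (w:ws) f = f w ws
```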
My proposal is to create a separate package focused specifically on byte-level decoding/encoding of UTF-8/UTF-16/UTF-32. `text` could then drop some internal modules in favor of a simpler common interface.
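Purely as a sketch of what such a common interface might look like beyond UTF-8, the same continuation shape extends naturally to UTF-16 code units. All names below are invented; a byte-level version would additionally deal with splitting `Word16`s into bytes and with endianness:

```haskell
import Data.Word (Word16)

-- Invented names, same shape as the UTF8 family above.
data UTF16 a = UTF16_1 a                       -- single code unit (BMP scalar)
             | Part16_2 (Word16 -> UTF16_2 a)  -- high surrogate, low unit missing
             | Error16_1 Error

data UTF16_2 a = UTF16_2 a
               | Error16_2 Error

utf16 :: (Word16 -> a)            -- convert a single unit
      -> (Word16 -> Word16 -> a)  -- convert a surrogate pair
      -> Error                    -- hypothetical "unpaired surrogate" error
      -> Word16 -> UTF16 a
utf16 one pair unpaired u
  | u >= 0xD800 && u < 0xDC00 =
      Part16_2 $ \l -> if l >= 0xDC00 && l < 0xE000
                         then UTF16_2 (pair u l)
                         else Error16_2 unpaired
  | u >= 0xDC00 && u < 0xE000 = Error16_1 unpaired
  | otherwise                 = UTF16_1 (one u)
```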
This proposal is, however, naive: I do not know whether GHC can inline these datatypes reliably or, indeed, at all. Based on my cursory reading of the *Secrets of the Glasgow Haskell Compiler inliner* paper, it should be able to, as each of these expressions is trivial.
This doesn't clash with the issue of GHC's many UTF-8 implementations (outlined in `GHC.Encoding.UTF8`), as all the other algorithms are in `IO`.
Other concerns:
- `text` is a core library, so I assume an extra dependency can't just be added on a whim;
- a package named `utf` already exists and is deprecated, and I don't know how hard reclaiming deprecated packages is.