`UTF8JsonGenerator` writes supplementary characters as a surrogate pair -- should use 4-byte encoding

When outputting a string value containing a supplementary Unicode code point, UTF8JsonGenerator is encoding the supplementary character as a pair of `\uNNNN` escapes representing the two halves of the surrogate pair that would denote the code point in UTF-16 instead of using the correct multi-byte UTF-8 encoding of the character.  The following Groovy script demonstrates the behaviour:

```
@Grab(group='com.fasterxml.jackson.core', module='jackson-core', version='2.6.2')
import com.fasterxml.jackson.core.JsonFactory

def factory = new JsonFactory()
def bytes1 = new ByteArrayOutputStream()
def gen1 = factory.createGenerator(bytes1) // UTF8JsonGenerator
gen1.writeStartObject()
gen1.writeStringField("test", new String(Character.toChars(0x1F602)))
gen1.writeEndObject()
gen1.close()
System.out.write(bytes1.toByteArray())
println ""
// prints {"test":"\uD83D\uDE02"}


def bytes2 = new ByteArrayOutputStream()
new OutputStreamWriter(bytes2, "UTF-8").withWriter { w ->
  def gen2 = factory.createGenerator(w) // WriterBasedJsonGenerator
  gen2.writeStartObject()
  gen2.writeStringField("test", new String(Character.toChars(0x1F602)))
  gen2.writeEndObject()
  gen2.close()
}
System.out.write(bytes2.toByteArray())
println ""
// prints {"test":"😂"}
```

When generating to a Writer rather than an OutputStream (and letting Java handle the UTF-8 byte conversion) the supplementary character U+1F602 is encoded as the correct UTF-8 four byte sequence `f0 9f 98 82`.


Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Uh oh!

`UTF8JsonGenerator` writes supplementary characters as a surrogate pair -- should use 4-byte encoding #223

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

Uh oh!

UTF8JsonGenerator writes supplementary characters as a surrogate pair -- should use 4-byte encoding #223

Description

Metadata

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

Issue actions

`UTF8JsonGenerator` writes supplementary characters as a surrogate pair -- should use 4-byte encoding #223