bloblang: unicode code-point range handling #752

@jem-davies

Description

Some outputs and downstream systems treat certain Unicode code-point ranges as invalid. For example, the documentation for SQS's SendMessage endpoint carries this warning:

A message can include only XML, JSON, and unformatted text. The following Unicode characters are allowed. 

For more information, see the [W3C specification for characters](http://www.w3.org/TR/REC-xml/#charsets).

#x9 | #xA | #xD | #x20 to #xD7FF | #xE000 to #xFFFD | #x10000 to #x10FFFF

If a message contains characters outside the allowed set, Amazon SQS rejects the message and returns an InvalidMessageContents error. 

Ensure that your message body includes only valid characters to avoid this exception.

While it is possible to reference unicode code points in a bloblang mapping:

root.result = this.data.contains("\uFFFE")

We appear to lack an ergonomic way to handle whole Unicode code-point ranges, as is possible with Go's strings.Map function.
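
For illustration, here is a minimal Go sketch of the kind of range-based filtering that strings.Map makes possible. The helper names isSQSAllowed and stripDisallowed are hypothetical, and dropping the offending characters is only one possible policy (replacing them is another). The ranges are the ones from the SQS warning quoted above.

```go
package main

import (
	"fmt"
	"strings"
)

// isSQSAllowed (hypothetical helper) reports whether r is in the set quoted
// from the SQS docs:
// #x9 | #xA | #xD | #x20 to #xD7FF | #xE000 to #xFFFD | #x10000 to #x10FFFF
func isSQSAllowed(r rune) bool {
	switch {
	case r == 0x9, r == 0xA, r == 0xD:
		return true
	case r >= 0x20 && r <= 0xD7FF:
		return true
	case r >= 0xE000 && r <= 0xFFFD:
		return true
	case r >= 0x10000 && r <= 0x10FFFF:
		return true
	}
	return false
}

// stripDisallowed (hypothetical helper) drops every rune outside the allowed
// set. strings.Map removes a character when the mapping function returns a
// negative value.
func stripDisallowed(s string) string {
	return strings.Map(func(r rune) rune {
		if isSQSAllowed(r) {
			return r
		}
		return -1
	}, s)
}

func main() {
	in := "ok\uFFFE\x00 text"
	fmt.Printf("%q\n", stripDisallowed(in)) // prints "ok text"
}
```

Something with a similar shape in bloblang, i.e. a method that takes a set of code-point ranges and either removes or replaces matching characters, is roughly what this issue is asking for.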

Relevant closed PR: https://github.com/warpstreamlabs/bento/pull/717/changes
