bloblang: unicode code-point range handling

Some outputs and downstream systems treat certain Unicode code point ranges as invalid. For example, the documentation for SQS's SendMessage endpoint includes [this warning](https://docs.aws.amazon.com/AWSSimpleQueueService/latest/APIReference/API_SendMessage.html):

```
A message can include only XML, JSON, and unformatted text. The following Unicode characters are allowed. 

For more information, see the [W3C specification for characters](http://www.w3.org/TR/REC-xml/#charsets).

#x9 | #xA | #xD | #x20 to #xD7FF | #xE000 to #xFFFD | #x10000 to #x10FFFF

If a message contains characters outside the allowed set, Amazon SQS rejects the message and returns an InvalidMessageContents error. 

Ensure that your message body includes only valid characters to avoid this exception.
```

While it is possible to reference unicode code points in a bloblang mapping: 

```
root.result = this.data.contains("\uFFFE")
```

We appear to lack an ergonomic way to handle ranges of Unicode code points, as is possible with [Go's strings.Map func](https://pkg.go.dev/strings#Map).
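
For reference, here is a minimal Go sketch of the kind of range handling `strings.Map` allows, using the allowed set quoted from the SQS documentation above. The names `allowedBySQS` and `stripDisallowed` are hypothetical and exist only for illustration; this is not an existing bento or bloblang API, and dropping the offending characters (rather than replacing them) is just one possible policy.

```go
package main

import (
	"fmt"
	"strings"
)

// allowedBySQS (hypothetical helper) reports whether r is inside the set
// quoted above: #x9 | #xA | #xD | #x20-#xD7FF | #xE000-#xFFFD | #x10000-#x10FFFF.
func allowedBySQS(r rune) bool {
	switch {
	case r == 0x9, r == 0xA, r == 0xD:
		return true
	case r >= 0x20 && r <= 0xD7FF:
		return true
	case r >= 0xE000 && r <= 0xFFFD:
		return true
	case r >= 0x10000 && r <= 0x10FFFF:
		return true
	}
	return false
}

// stripDisallowed (hypothetical helper) drops every rune outside the allowed
// set. strings.Map removes a character when the mapping returns a negative value.
func stripDisallowed(s string) string {
	return strings.Map(func(r rune) rune {
		if allowedBySQS(r) {
			return r
		}
		return -1
	}, s)
}

func main() {
	in := "ok\u0000 \uFFFEdone\n"
	fmt.Printf("%q\n", stripDisallowed(in)) // prints "ok done\n"
}
```

An equivalent bloblang method would presumably need to accept one or more code-point ranges plus a choice between deleting and replacing matches; how that should look is the open question of this issue.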

Relevant closed PR: https://github.com/warpstreamlabs/bento/pull/717/changes 

bloblang: unicode code-point range handling #752