ReadFile/ReadConsoleA append U+FFFD to each emoji read in UTF-8 #19436

@ckwastra

Windows Terminal version

1.23.250825001

Windows build number

10.0.26100.0

Other Software

No response

Steps to reproduce

Test code:

// Compiler flags (MSVC): /std:c++latest /utf-8

#include <exception>
#include <print>
#include <string>

#include <windows.h>

namespace my {
template <auto... Errors> auto check(auto result) {
  static_assert(sizeof...(Errors) != 0);
  if ((... && (result != Errors))) {
    return result;
  }
  std::terminate();
}
void assert_equal(auto x, auto y) {
  if (x != y) {
    std::terminate();
  }
}
} // namespace my

int main() {
  my::check<FALSE>(::SetConsoleCP(CP_UTF8));
  my::check<FALSE>(::SetConsoleOutputCP(CP_UTF8));
  const auto std_input = my::check<INVALID_HANDLE_VALUE, nullptr>(
      ::GetStdHandle(STD_INPUT_HANDLE));
  const auto std_output = my::check<INVALID_HANDLE_VALUE, nullptr>(
      ::GetStdHandle(STD_OUTPUT_HANDLE));
  char c = {};
  std::string s = {};
  ::DWORD number_of_bytes_read = {};
  ::DWORD number_of_bytes_written = {};
  while (true) {
    my::check<FALSE>(
        ::ReadFile(std_input, &c, 1, &number_of_bytes_read, nullptr));
    if (number_of_bytes_read == 0) {
      break;
    }
    std::print("{:02x}{}", c, c == '\n' ? '\n' : ' ');
    s += c;
    if (c == '\n') {
      const auto number_of_bytes_to_write = static_cast<::DWORD>(s.size());
      my::assert_equal(number_of_bytes_to_write, s.size());
      my::check<FALSE>(::WriteFile(std_output, s.data(),
                                   number_of_bytes_to_write,
                                   &number_of_bytes_written, nullptr));
      my::assert_equal(number_of_bytes_written, number_of_bytes_to_write);
      s = {};
    }
  }
}

Run the above code inside Windows Terminal (or in conhost.exe; the result is the same). Type some emoji and press Enter. A possible session, showing the typed input, the hex dump of the bytes read, and the echoed line, is:

😀
f0 9f 98 80 ef bf bd 0d 0a
😀�
😀😀
f0 9f 98 80 ef bf bd f0 9f 98 80 ef bf bd 0d 0a
😀�😀�
^Z

Observe that each emoji read (f0 9f 98 80) is followed by the sequence ef bf bd, which is the UTF-8 encoding of U+FFFD, the replacement character.
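
To make the byte dump easier to read, here is a minimal sketch of a UTF-8 decoder applied to the first line of the session above. `decode_utf8` and `check_observed_bytes` are hypothetical helpers written for this illustration and are not part of the test code. The decoder assumes mostly well-formed input and simply stops at the first byte it cannot interpret.

```cpp
#include <cassert>
#include <cstddef>
#include <string_view>
#include <vector>

// Hypothetical helper: decodes UTF-8 bytes into code points.
// Stops at the first invalid or truncated sequence instead of reporting it.
std::vector<char32_t> decode_utf8(std::string_view bytes) {
  std::vector<char32_t> out;
  std::size_t i = 0;
  while (i < bytes.size()) {
    const auto lead = static_cast<unsigned char>(bytes[i]);
    std::size_t len = 0;
    char32_t cp = 0;
    if (lead < 0x80) {                  // 0xxxxxxx: ASCII
      len = 1; cp = lead;
    } else if ((lead & 0xE0) == 0xC0) { // 110xxxxx: 2-byte sequence
      len = 2; cp = lead & 0x1F;
    } else if ((lead & 0xF0) == 0xE0) { // 1110xxxx: 3-byte sequence
      len = 3; cp = lead & 0x0F;
    } else if ((lead & 0xF8) == 0xF0) { // 11110xxx: 4-byte sequence
      len = 4; cp = lead & 0x07;
    } else {
      break;                            // stray continuation or invalid lead
    }
    if (i + len > bytes.size()) break;  // truncated sequence
    bool ok = true;
    for (std::size_t k = 1; k < len; ++k) {
      const auto cont = static_cast<unsigned char>(bytes[i + k]);
      if ((cont & 0xC0) != 0x80) { ok = false; break; }
      cp = (cp << 6) | (cont & 0x3F);
    }
    if (!ok) break;
    out.push_back(cp);
    i += len;
  }
  return out;
}

// Decodes the bytes from the first dump line: f0 9f 98 80 ef bf bd 0d 0a.
void check_observed_bytes() {
  const auto cps = decode_utf8("\xf0\x9f\x98\x80\xef\xbf\xbd\r\n");
  assert(cps.size() == 4);
  assert(cps[0] == 0x1F600); // the emoji that was typed
  assert(cps[1] == 0xFFFD);  // the unexpected replacement character
  assert(cps[2] == 0x0D);    // CR
  assert(cps[3] == 0x0A);    // LF
}
```

Calling `check_observed_bytes()` confirms that the nine bytes read are four code points: the emoji, one extra U+FFFD, and the CR LF line ending.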

Expected Behavior

These replacement characters should not appear in the read byte stream.
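
For reference, the expected bytes can be derived by hand. The sketch below assumes the emoji arrives as the UTF-16 surrogate pair D83D DE00 (the form a wide-character read such as ReadConsoleW would return) and encodes the resulting code point as UTF-8. `combine_surrogates`, `encode_utf8`, and `check_expected_bytes` are hypothetical helpers for this illustration. The sketch does not claim to identify where the extra U+FFFD comes from.

```cpp
#include <cassert>
#include <string>

// Hypothetical helper: combines a UTF-16 high/low surrogate pair into one
// code point. The caller must pass a valid pair (D800..DBFF, DC00..DFFF).
constexpr char32_t combine_surrogates(char16_t high, char16_t low) {
  return 0x10000 + ((static_cast<char32_t>(high) - 0xD800) << 10) +
         (static_cast<char32_t>(low) - 0xDC00);
}

// Hypothetical helper: encodes one code point as UTF-8.
// Does not reject surrogates or values above U+10FFFF.
std::string encode_utf8(char32_t cp) {
  std::string out;
  if (cp < 0x80) {
    out += static_cast<char>(cp);
  } else if (cp < 0x800) {
    out += static_cast<char>(0xC0 | (cp >> 6));
    out += static_cast<char>(0x80 | (cp & 0x3F));
  } else if (cp < 0x10000) {
    out += static_cast<char>(0xE0 | (cp >> 12));
    out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
    out += static_cast<char>(0x80 | (cp & 0x3F));
  } else {
    out += static_cast<char>(0xF0 | (cp >> 18));
    out += static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
    out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
    out += static_cast<char>(0x80 | (cp & 0x3F));
  }
  return out;
}

// One emoji is one code point, so exactly four UTF-8 bytes are expected.
void check_expected_bytes() {
  const char32_t cp = combine_surrogates(0xD83D, 0xDE00);
  assert(cp == 0x1F600);
  assert(encode_utf8(cp) == "\xf0\x9f\x98\x80");
  assert(encode_utf8(cp).size() == 4);
}
```

Under these assumptions, the expected hex dump for a single emoji followed by Enter would be f0 9f 98 80 0d 0a, with no ef bf bd in between.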

Actual Behavior

In both Windows Terminal and conhost.exe, every emoji read is followed by ef bf bd, for reasons I have not been able to determine. If there is a bug in the test code above, please let me know.

Labels

Issue-Bug, Needs-Triage
