Intermittent Segmentation Faults in Backend Processes #65867

@ShMob

Description

Possibly a continuation of this issue.

Our backends intermittently crash with a segmentation fault. Inspecting the core dump, we found that some threads have the following backtrace at the time of the crash:

(gdb) bt full
#0  0x000000000563cd81 in starrocks::ScalarTypeInfoImpl<(starrocks::LogicalType)51>::from_string (buf=0x7f18f15ee828, 
    scan_key=<error reading variable: Cannot access memory at address 0x8>) at be/src/storage/types.cpp:871
        timestamp = {static MAX_TIMESTAMP_VALUE = {static MAX_TIMESTAMP_VALUE = <same as static member of an already seen type>, static MIN_TIMESTAMP_VALUE = {
              static MAX_TIMESTAMP_VALUE = <same as static member of an already seen type>, 
              static MIN_TIMESTAMP_VALUE = <same as static member of an already seen type>, _timestamp = 1892325482100162560}, _timestamp = 5908208226068291583}, 
          static MIN_TIMESTAMP_VALUE = <same as static member of an already seen type>, _timestamp = 139745105471392}
#1  0x893fafe7d0b7c600 in ?? ()
No symbol table info available.
#2  0x00007f184b614588 in ?? ()
No symbol table info available.
#3  0x0000000000000000 in ?? ()
No symbol table info available.

At be/src/storage/types.cpp:

template <>
struct ScalarTypeInfoImpl<TYPE_DATETIME> : public ScalarTypeInfoImplBase<TYPE_DATETIME> {
    static Status from_string(void* buf, const std::string& scan_key) {
        auto timestamp = unaligned_load<TimestampValue>(buf);
        if (!timestamp.from_string(scan_key.data(), scan_key.size())) { // THIS IS LINE 871
            // Compatible with TYPE_DATETIME_V1
            timestamp.from_string("1400-01-01 00:00:00", sizeof("1400-01-01 00:00:00") - 1);
        }
        unaligned_store<TimestampValue>(buf, timestamp);
        return Status::OK();
    }

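The fault occurs on line 871, where scan_key.data() and scan_key.size() are evaluated. gdb reports the scan_key reference as bound to address 0x8, which is not accessible memory, so reading the string's fields there is the likely cause of the segmentation fault.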
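One possible reading of the 0x8 address, offered only as a hypothesis: a reference can end up at such a small address when it names a member of an object that was reached through a null pointer, since the member's address is then just its offset within the object. The sketch below illustrates that arithmetic without dereferencing anything. `HypotheticalKeyHolder` and both helper functions are invented for illustration and are not StarRocks types; the real owner of scan_key is unknown because the caller frames have no symbols.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <string>

// Hypothetical stand-in for whatever object owns the scan key.
// NOT a StarRocks type; the 8-byte leading field is an assumption.
struct HypotheticalKeyHolder {
    std::uint64_t header;  // 8 bytes placed before the string
    std::string scan_key;  // therefore located 8 bytes into the object
};

// Measures the member's offset on a real, valid instance.
inline std::size_t scan_key_offset() {
    HypotheticalKeyHolder h{};
    const char* base = reinterpret_cast<const char*>(&h);
    const char* member = reinterpret_cast<const char*>(&h.scan_key);
    assert(member >= base);  // a member never precedes its enclosing object
    return static_cast<std::size_t>(member - base);
}

// Address the member reference would carry if the holder pointer were null.
// Pure integer arithmetic; no invalid pointer is formed or dereferenced.
inline std::uintptr_t address_if_holder_were_null() {
    return std::uintptr_t{0} + scan_key_offset();
}
```

With this layout, `address_if_holder_were_null()` yields the same small value gdb printed for scan_key. That match does not prove a null owner is the cause here; a corrupted stack or a dangling pointer could produce a similar address.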

Every thread with this backtrace has LogicalType = 51, which is TYPE_DATETIME.

The frames above from_string (#1 through #3) have no symbol or file information, so we cannot trace the call chain any further.

Steps to reproduce the behavior (Required)

We don't know why this happens, but it might be related to this issue.

StarRocks version (Required)

Version 3.3.18, shared-nothing, deployed with Helm
