Fix Appender API Type/Order Validation to Prevent AccessViolationException #251

@horizonchasers

Description

Background

DuckDB.NET is responsible for providing a safe bridge between .NET applications and the native DuckDB engine, including:

  • Converting .NET data types to native DuckDB types
  • Ensuring data is passed in the correct order and format
  • Providing meaningful errors when something goes wrong

The current Appender API accepts values by position only, with no validation against column types or expected order. When a value doesn't match the expected type or position, the mismatched data is written straight into native memory, which can surface as an AccessViolationException or as memory corruption in native code.

Problem Description

When using the current appender API, values must be provided in the exact schema order. Type mismatches or incorrect ordering aren't detected until a native crash occurs:

// Schema: (directory_id INT, scan_id INT, path VARCHAR, depth INT,
//          status VARCHAR, created_at BIGINT, parent_id BIGINT)
// Repro example:
appender.CreateRow()
    .AppendValue(directoryId)                  // OK: INT for directory_id
    .AppendValue(scanId)                       // OK: INT for scan_id
    .AppendValue(safePath)                     // OK: VARCHAR for path
    .AppendValue(dir.Depth)                    // OK: INT for depth
    .AppendValue(Models.QueueStatusDb.Pending) // OK: VARCHAR for status
    .AppendValue(dir.ParentId)                 // Swapped: created_at (BIGINT) expected here, but parent_id (an INT) is appended
    .AppendValue(timestamp)                    // Swapped: parent_id (BIGINT) expected here, but created_at (the timestamp) is appended
    .EndRow();

Issues Found

  1. No Type Checks: DuckDBAppenderRow.AppendValueInternal<T>() writes directly to native vectors without type verification
  2. Order Dependency: The API requires values in exact schema order with no safety checks
  3. Poor Error Handling: Type/order mismatches result in native crashes rather than meaningful exceptions

🤖 AI-Generated Solution Proposal

I propose adding a safe, column-aware API while preserving the existing fast path:

1. Create a New NamedDuckDBAppenderRow Class

public class NamedDuckDBAppenderRow : IDuckDBAppenderRow
{
    private readonly string qualifiedTableName;
    private readonly DuckDBAppenderRow innerRow;
    private readonly Dictionary<string, (int Index, DuckDBType Type)> columnInfo;

    // Appends are buffered per column index so callers can supply values in any
    // order; they are flushed to the inner row in schema order by EndRow().
    private readonly Dictionary<int, Action> bufferedAppends = new();

    // Constructor that takes the schema information
    internal NamedDuckDBAppenderRow(DuckDBAppenderRow innerRow, string qualifiedTableName,
        Dictionary<string, (int Index, DuckDBType Type)> columnInfo)
    {
        this.innerRow = innerRow;
        this.qualifiedTableName = qualifiedTableName;
        this.columnInfo = columnInfo;
    }

    // New named-value append method
    public IDuckDBAppenderRow AppendNamedValue<T>(string columnName, T value)
    {
        if (!columnInfo.TryGetValue(columnName, out var column))
        {
            throw new ArgumentException($"Column '{columnName}' does not exist in table '{qualifiedTableName}'.");
        }

        if (bufferedAppends.ContainsKey(column.Index))
        {
            throw new ArgumentException($"A value for column '{columnName}' has already been appended to this row.");
        }

        // Perform the type compatibility check before anything reaches native code
        ValidateTypeCompatibility<T>(column.Type, columnName);

        // Capture the append so it can be replayed in schema order by EndRow()
        bufferedAppends[column.Index] = () => innerRow.AppendValue(value);
        return this;
    }

    // Flush buffered values in schema order; any column that was not supplied is filled with NULL
    public void EndRow()
    {
        foreach (var column in columnInfo.OrderBy(c => c.Value.Index))
        {
            if (bufferedAppends.TryGetValue(column.Value.Index, out var append))
            {
                append();
            }
            else
            {
                innerRow.AppendNullValue();
            }
        }

        innerRow.EndRow();
        bufferedAppends.Clear();
    }

    // Type validation helper method
    private void ValidateTypeCompatibility<T>(DuckDBType expectedType, string columnName)
    {
        if (!IsTypeCompatible<T>(expectedType))
        {
            throw new DuckDBAppenderException(
                $"Type mismatch for column '{columnName}'. Expected {expectedType}, but received {typeof(T).Name}.",
                columnName, typeof(T), expectedType);
        }
    }

    // Type compatibility check. The rules below are illustrative only; a complete
    // implementation would cover every DuckDBType plus nullable and enum cases.
    private bool IsTypeCompatible<T>(DuckDBType expectedType)
    {
        var providedType = typeof(T);
        return expectedType switch
        {
            DuckDBType.Integer => providedType == typeof(int),
            DuckDBType.BigInt => providedType == typeof(long) || providedType == typeof(int), // int widens safely to BIGINT
            DuckDBType.Varchar => providedType == typeof(string),
            _ => false
        };
    }

    // Forward other IDuckDBAppenderRow methods to innerRow
    // ...
}
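
For illustration, the repro from earlier could then be written against the proposed named API. This is a hypothetical usage sketch: it assumes the named appender's CreateRow() hands back a NamedDuckDBAppenderRow (see the wrapper sketch in step 2 below), and the table and variable names mirror the repro rather than any existing code.

// Hypothetical usage of the proposed named API: each value is validated against
// the schema, columns may be supplied in any order, and omitted columns become NULL.
using var appender = connection.CreateNamedAppender("directories"); // table name is illustrative

var row = appender.CreateRow();
row.AppendNamedValue("directory_id", directoryId);
row.AppendNamedValue("scan_id", scanId);
row.AppendNamedValue("path", safePath);
row.AppendNamedValue("depth", dir.Depth);
row.AppendNamedValue("status", Models.QueueStatusDb.Pending);
row.AppendNamedValue("created_at", timestamp);   // order no longer matters
row.AppendNamedValue("parent_id", dir.ParentId);
row.EndRow();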

2. Extend DuckDBConnection for Creating Named Appenders

public partial class DuckDBConnection
{
    // New method to create a named appender
    public IDuckDBAppender CreateNamedAppender(string tableName)
    {
        return CreateNamedAppender(null, tableName);
    }
    
    public IDuckDBAppender CreateNamedAppender(string? schemaName, string tableName)
    {
        // Create the standard appender
        var appender = CreateAppender(schemaName, tableName);
        
        // Get column information using information_schema
        var columnInfo = GetColumnInformation(schemaName, tableName);
        
        // Return a named appender that wraps the standard appender
        // (a minimal sketch of NamedDuckDBAppender follows after this class)
        var qualifiedTableName = schemaName is null ? tableName : $"{schemaName}.{tableName}";
        return new NamedDuckDBAppender(appender, qualifiedTableName, columnInfo);
    }
    
    // Helper to query schema information
    private Dictionary<string, (int Index, DuckDBType Type)> GetColumnInformation(string? schemaName, string tableName)
    {
        var result = new Dictionary<string, (int Index, DuckDBType Type)>(StringComparer.OrdinalIgnoreCase);
        
        // Use information_schema.columns to get column details
        using var cmd = CreateCommand();
        cmd.CommandText = @"
            SELECT column_name, ordinal_position, data_type 
            FROM information_schema.columns
            WHERE table_name = $tableName 
            AND ($schemaName IS NULL OR table_schema = $schemaName)
            ORDER BY ordinal_position";
            
        cmd.Parameters.Add(new DuckDBParameter("tableName", tableName));
        cmd.Parameters.Add(new DuckDBParameter("schemaName", schemaName));
        
        using var reader = cmd.ExecuteReader();
        while (reader.Read())
        {
            string columnName = reader.GetString(0);
            int position = Convert.ToInt32(reader.GetValue(1)) - 1; // 0-based index; ordinal_position may surface as a wider integer type
            string dataType = reader.GetString(2);
            
            // Map data_type string to DuckDBType enum
            DuckDBType type = MapStringToType(dataType);
            
            result.Add(columnName, (position, type));
        }
        
        return result;
    }
}
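
CreateNamedAppender above returns a NamedDuckDBAppender that the proposal does not spell out. Below is a minimal sketch of that wrapper, assuming CreateAppender hands back a DuckDBAppender whose CreateRow() yields a DuckDBAppenderRow (as in the repro earlier); it only carries the schema lookup and produces NamedDuckDBAppenderRow instances, delegating everything else.

// Minimal sketch of the wrapper referenced by CreateNamedAppender. Implementing
// the remaining IDuckDBAppender members is elided here.
public class NamedDuckDBAppender : IDisposable
{
    private readonly DuckDBAppender innerAppender;
    private readonly string qualifiedTableName;
    private readonly Dictionary<string, (int Index, DuckDBType Type)> columnInfo;

    internal NamedDuckDBAppender(DuckDBAppender innerAppender, string qualifiedTableName,
        Dictionary<string, (int Index, DuckDBType Type)> columnInfo)
    {
        this.innerAppender = innerAppender;
        this.qualifiedTableName = qualifiedTableName;
        this.columnInfo = columnInfo;
    }

    // Start a new row; values can then be appended by column name in any order.
    public NamedDuckDBAppenderRow CreateRow() =>
        new(innerAppender.CreateRow(), qualifiedTableName, columnInfo);

    public void Dispose() => innerAppender.Dispose();
}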

3. Create a Custom Exception Type

public class DuckDBAppenderException : DuckDBException
{
    public string? ColumnName { get; }
    public Type? ProvidedType { get; }
    public DuckDBType? ExpectedType { get; }
    
    public DuckDBAppenderException(string message) : base(message) { }
    
    public DuckDBAppenderException(string message, string columnName, 
                                  Type providedType, DuckDBType expectedType) 
        : base(message)
    {
        ColumnName = columnName;
        ProvidedType = providedType;
        ExpectedType = expectedType;
    }
}
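
As a usage sketch (the row variable and the value are hypothetical), a mismatch now surfaces as a catchable managed exception that names the offending column, instead of an AccessViolationException deep in native code:

// Hypothetical usage: a wrongly typed value is rejected before it reaches native code.
try
{
    row.AppendNamedValue("created_at", "not-a-bigint");
}
catch (DuckDBAppenderException ex)
{
    Console.WriteLine($"Column:   {ex.ColumnName}");    // created_at
    Console.WriteLine($"Provided: {ex.ProvidedType}");  // System.String
    Console.WriteLine($"Expected: {ex.ExpectedType}");  // BigInt
}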

Benefits

  1. Safer API: Prevents crashes by validating before data reaches native code
  2. Column-Name Access: Supports order-independent insertion using named columns
  3. Better Error Messages: Provides clear, managed exceptions instead of native crashes
  4. Backward Compatibility: Preserves the existing fast API for performance-critical code
  5. Auto-Null Filling: Can automatically handle missing columns with NULL values

Implementation Plan

  1. Add new named appender API alongside the existing API
  2. Create extensive unit tests for both valid and invalid cases
  3. Update documentation to explain the tradeoffs between fast and safe paths
  4. Add schema caching to minimize performance impact (one possible shape is sketched below)
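
For item 4, a minimal sketch of the schema cache, assuming the GetColumnInformation helper from step 2 (the field name and key format are illustrative, and a real implementation would also invalidate entries when a table's schema changes):

// Illustrative per-connection schema cache (requires System.Collections.Concurrent):
// column lookups are computed once per (schema, table) and reused by later
// CreateNamedAppender calls.
private readonly ConcurrentDictionary<string, Dictionary<string, (int Index, DuckDBType Type)>> schemaCache = new();

private Dictionary<string, (int Index, DuckDBType Type)> GetColumnInformationCached(string? schemaName, string tableName)
{
    var key = $"{schemaName ?? "main"}.{tableName}";
    return schemaCache.GetOrAdd(key, _ => GetColumnInformation(schemaName, tableName));
}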

Why Fix in DuckDB.NET and Not DuckDB Core

DuckDB.NET is the appropriate place to fix this issue because:

  1. Validation Responsibility: Bindings should ensure data correctness before sending to native code
  2. Better Error Handling: .NET exceptions are more helpful than native crashes
  3. Separation of Concerns: DuckDB should focus on performance; DuckDB.NET should handle .NET safety
  4. Flexibility: This approach allows both fast and safe paths

The DuckDB core engine (C++) is optimized for performance and expects correctly formatted data. Adding validation there would compromise performance for all users.


This issue description and code examples were generated with AI assistance.
