-
-
Notifications
You must be signed in to change notification settings - Fork 87
Description
Fix Appender API Type/Order Validation to Prevent AccessViolationException
Background
DuckDB.NET is responsible for providing a safe bridge between .NET applications and the native DuckDB engine, including:
- Converting .NET data types to native DuckDB types
- Ensuring data is passed in the correct order and format
- Providing meaningful errors when something goes wrong
The current Appender API allows values to be appended by position only, with no validation against column types or expected order. This can lead to AccessViolationException
when values don't match expected types or order, potentially causing memory corruption in native code.
Problem Description
When using the current appender API, values must be provided in the exact schema order. Type mismatches or incorrect ordering aren't detected until a native crash occurs:
// Schema: (directory_id INT, scan_id INT, path VARCHAR, created_at BIGINT, parent_id BIGINT)
// Repro example:
appender.CreateRow()
.AppendValue(directoryId) // OK: INT for directory_id
.AppendValue(scanId) // OK: INT for scan_id
.AppendValue(safePath) // OK: VARCHAR for path
.AppendValue(dir.Depth) // OK: (assuming schema matches)
.AppendValue(Models.QueueStatusDb.Pending) // OK: VARCHAR for status
.AppendValue(dir.ParentId) // Mismatch: Expected BIGINT (created_at) but provided an INT (parent_id)
.AppendValue(timestamp) // Mismatch: Expected BIGINT (parent_id) but provided BIGINT (created_at)
.EndRow();
Issues Found
- No Type Checks:
DuckDBAppenderRow.AppendValueInternal<T>()
writes directly to native vectors without type verification - Order Dependency: The API requires values in exact schema order with no safety checks
- Poor Error Handling: Type/order mismatches result in native crashes rather than meaningful exceptions
🤖 AI-Generated Solution Proposal
I propose adding a safe, column-aware API while preserving the existing fast path:
1. Create a New NamedDuckDBAppenderRow
Class
public class NamedDuckDBAppenderRow : IDuckDBAppenderRow
{
private readonly string qualifiedTableName;
private readonly DuckDBAppenderRow innerRow;
private readonly Dictionary<string, (int Index, DuckDBType Type)> columnInfo;
private readonly HashSet<string> usedColumns = new();
// Constructor that takes the schema information
internal NamedDuckDBAppenderRow(DuckDBAppenderRow innerRow, string qualifiedTableName,
Dictionary<string, (int Index, DuckDBType Type)> columnInfo)
{
this.innerRow = innerRow;
this.qualifiedTableName = qualifiedTableName;
this.columnInfo = columnInfo;
}
// New named value append method
public IDuckDBAppenderRow AppendNamedValue<T>(string columnName, T value)
{
if (!columnInfo.TryGetValue(columnName, out var column))
{
throw new ArgumentException($"Column '{columnName}' does not exist in table '{qualifiedTableName}'");
}
// Track that we've used this column
usedColumns.Add(columnName);
// Perform type compatibility check
ValidateTypeCompatibility<T>(column.Type, columnName);
// Append the value
return innerRow.AppendValue(value);
}
// Override EndRow to validate all columns are present
public void EndRow()
{
// Fill in any missing columns with NULL
foreach (var column in columnInfo.OrderBy(c => c.Value.Index))
{
if (!usedColumns.Contains(column.Key))
{
innerRow.AppendNullValue();
}
}
innerRow.EndRow();
usedColumns.Clear();
}
// Type validation helper method
private void ValidateTypeCompatibility<T>(DuckDBType expectedType, string columnName)
{
// Check type compatibility based on T and expectedType
bool isCompatible = IsTypeCompatible<T>(expectedType);
if (!isCompatible)
{
throw new DuckDBException(
$"Type mismatch for column '{columnName}'. " +
$"Expected {expectedType}, but received {typeof(T).Name}.");
}
}
// Type compatibility check
private bool IsTypeCompatible<T>(DuckDBType type)
{
// Implement type compatibility rules here
// For example: int is compatible with BIGINT but not with VARCHAR
// ...
}
// Forward other IDuckDBAppenderRow methods to innerRow
// ...
}
2. Extend DuckDBConnection
for Creating Named Appenders
public partial class DuckDBConnection
{
// New method to create a named appender
public IDuckDBAppender CreateNamedAppender(string tableName)
{
return CreateNamedAppender(null, tableName);
}
public IDuckDBAppender CreateNamedAppender(string? schemaName, string tableName)
{
// Create the standard appender
var appender = CreateAppender(schemaName, tableName);
// Get column information using information_schema
var columnInfo = GetColumnInformation(schemaName, tableName);
// Return a new named appender that wraps the standard appender
return new NamedDuckDBAppender(appender, columnInfo);
}
// Helper to query schema information
private Dictionary<string, (int Index, DuckDBType Type)> GetColumnInformation(string? schemaName, string tableName)
{
var result = new Dictionary<string, (int Index, DuckDBType Type)>(StringComparer.OrdinalIgnoreCase);
// Use information_schema.columns to get column details
var cmd = CreateCommand();
cmd.CommandText = @"
SELECT column_name, ordinal_position, data_type
FROM information_schema.columns
WHERE table_name = $tableName
AND ($schemaName IS NULL OR table_schema = $schemaName)
ORDER BY ordinal_position";
cmd.Parameters.Add(new DuckDBParameter("tableName", tableName));
cmd.Parameters.Add(new DuckDBParameter("schemaName", schemaName));
using var reader = cmd.ExecuteReader();
while (reader.Read())
{
string columnName = reader.GetString(0);
int position = reader.GetInt32(1) - 1; // 0-based index
string dataType = reader.GetString(2);
// Map data_type string to DuckDBType enum
DuckDBType type = MapStringToType(dataType);
result.Add(columnName, (position, type));
}
return result;
}
}
3. Create a Custom Exception Type
public class DuckDBAppenderException : DuckDBException
{
public string? ColumnName { get; }
public Type? ProvidedType { get; }
public DuckDBType? ExpectedType { get; }
public DuckDBAppenderException(string message) : base(message) { }
public DuckDBAppenderException(string message, string columnName,
Type providedType, DuckDBType expectedType)
: base(message)
{
ColumnName = columnName;
ProvidedType = providedType;
ExpectedType = expectedType;
}
}
Benefits
- Safer API: Prevents crashes by validating before data reaches native code
- Column-Name Access: Supports order-independent insertion using named columns
- Better Error Messages: Provides clear, managed exceptions instead of native crashes
- Backward Compatibility: Preserves the existing fast API for performance-critical code
- Auto-Null Filling: Can automatically handle missing columns with NULL values
Implementation Plan
- Add new named appender API alongside the existing API
- Create extensive unit tests for both valid and invalid cases
- Update documentation to explain the tradeoffs between fast and safe paths
- Add schema caching to minimize performance impact
Why Fix in DuckDB.NET and Not DuckDB Core
DuckDB.NET is the appropriate place to fix this issue because:
- Validation Responsibility: Bindings should ensure data correctness before sending to native code
- Better Error Handling: .NET exceptions are more helpful than native crashes
- Separation of Concerns: DuckDB should focus on performance; DuckDB.NET should handle .NET safety
- Flexibility: This approach allows both fast and safe paths
The DuckDB core engine (C++) is optimized for performance and expects correctly formatted data. Adding validation there would compromise performance for all users.
This issue description and code examples were generated with AI assistance.