Conversation
Deploying datachain with
|
| Latest commit: |
1d7275c
|
| Status: | ✅ Deploy successful! |
| Preview URL: | https://806471d4.datachain-2g6.pages.dev |
| Branch Preview URL: | https://cleanup-file-path-handling.datachain-2g6.pages.dev |
de4fe51 to
522f929
Compare
Codecov Report❌ Patch coverage is 📢 Thoughts on this report? Let us know! |
106219e to
0a715be
Compare
0a715be to
22a9a61
Compare
Replace Path.resolve(strict=False) with os.path.abspath + os.path.normcase in is_path_in() for deterministic, filesystem-independent path comparison. resolve() varies across Python versions (3.11 uses abspath, 3.12+ uses realpath) and can produce different results depending on which path components already exist on disk. Also use startswith(output + os.sep) instead of is_relative_to() to prevent false positives where /foo/bar2 would match /foo/bar. Add drive-letter absolute path check to validate_local_relpath() so Windows paths like C:/secret/file are rejected.
050bf22 to
df7de37
Compare
There was a problem hiding this comment.
Pull request overview
This pull request addresses critical security vulnerabilities and architectural issues in DataChain's file path handling. The changes fix path traversal vulnerabilities, refactor version ID handling from URL string manipulation to explicit parameters, enforce URI scheme validation for File.source, and eliminate percent-encoding ambiguities.
Changes:
- Fix path traversal vulnerabilities in file export by validating relative paths and introducing
is_subpath()guard - Refactor version ID handling to use explicit
_version_kwargs()method instead of embedding versions in URL strings - Add strict validation for File.source requiring URI schemes (e.g.,
file://,s3://) and File.path to reject traversal segments - Sanitize dataset names in listing URIs to prevent SQL injection from special characters
- Treat percent-encoding as opaque bytes rather than decoding/encoding during path operations
Reviewed changes
Copilot reviewed 31 out of 31 changed files in this pull request and generated no comments.
Show a summary per file
| File | Description |
|---|---|
| src/datachain/lib/file.py | Core File model changes: added source/path validators, refactored export() with traversal protection, replaced get_path_normalized() with client-specific rel_path_for_file(), updated open() to use full_path_for_file() |
| src/datachain/lib/utils.py | Simplified rebase_path() to handle URI schemes cleanly without urlparse |
| src/datachain/lib/listing.py | Added _sanitize_ds_name() to encode special chars in dataset names for SQL safety |
| src/datachain/fs/utils.py | Added is_subpath() helper for path traversal protection |
| src/datachain/client/fsspec.py | Base client refactoring: removed version_path(), added _version_kwargs(), full_path_for_file(), rel_path_for_file() methods |
| src/datachain/client/local.py | Added validate_file_relpath(), _has_drive_letter(), updated path_to_uri() to avoid percent-encoding |
| src/datachain/client/s3.py | Refactored url() to use explicit VersionId parameter instead of query string manipulation |
| src/datachain/client/gcs.py | Added _path_with_generation() and _version_kwargs() for generation handling |
| src/datachain/client/azure.py | Refactored url() to append versionid only for signing, not general I/O |
| src/datachain/client/http.py | Added percent-encoding to get_full_path() preserving query/fragment |
| src/datachain/cache.py | Updated download() to use full_path_for_file() and get_size(file) |
| src/datachain/studio.py | Removed dateparser warnings suppression workaround for Python 3.13 |
| tests/* | Comprehensive test coverage for path validation, traversal protection, Windows paths, special characters, version handling |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
df7de37 to
8f4456f
Compare
851933f to
4fa058c
Compare
56d7df1 to
88705d9
Compare
dreadatour
left a comment
There was a problem hiding this comment.
Big PR, a lot of changes, but all changes looks good and relevant to me 👍
Haven't tested it locally though, but tests seems are passed, so I think we good to go.
8f17b3e to
ad010c5
Compare
There was a problem hiding this comment.
Pull request overview
Copilot reviewed 37 out of 37 changed files in this pull request and generated 2 comments.
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
ad010c5 to
674e0b4
Compare
Fixes:
The DataChain
Filemodel stores two key fields —source(a storage URI likes3://mybucket) andpath(an object key likephotos/cat.jpg). Before thischange, several interrelated problems existed in how these fields were used to
construct filesystem paths, signed URLs, and local export destinations.
1.1 Path Traversal on Local Export
When a user calls
file.export(output_dir)orchain.to_storage(output_dir),the library combines the output directory with a suffix derived from
file.path. Iffile.pathcontained traversal segments (..) or was anabsolute path, the exported file could be written outside the intended
output directory.
Concrete example — tar member with malicious name:
Concrete example — percent-encoded traversal:
1.2 Version ID Embedded in URL Paths
Each backend had a
version_path()class method that mutated the URL stringto embed version identifiers (e.g.,
s3://bucket/key?versionId=abc,gs://bucket/key#generation). This was used throughoutget_full_path,get_size,get_file,open_object, etc.Problems:
?versionId=.../?versionid=...),which broke when URLs already contained query parameters.
#generation), which conflicted with object keysthat legitimately contained
#.get_full_path(path, version_id)threaded a version ID throughstring manipulation — error-prone and inconsistent across backends.
version_pathraisedValueErrorif version was already embedded, requiringcallers to carefully avoid double-encoding.
Concrete example:
1.3 Inconsistent
File.sourceFormatFile.sourcecould be set to bare strings like"mybucket"or"s3"withouta URI scheme, leading to ambiguous behavior in path construction and client
dispatch. Which client handles
"mybucket"? Is it local? S3?1.4 Percent-Encoding Confusion
File.get_uri()usedget_path_normalized()which calledos.path.normpath,then the result was concatenated with source.
get_fs_path()calledunquote(get_uri())— double-decoding percent-encoded characters. Anobject key literally named
file%20name.txt(a valid S3 key) would becomefile name.txtwhen accessed.Local
path_to_uri()usedPath.as_uri()which percent-encodes specialcharacters. Then
split_url()usedurlparsewhich decoded them. A localfile at
/data/dir #hash/file.txtwould lose the#during round-tripping.