Skip to content

File path handling cleanup#1604

Open
shcheklein wants to merge 21 commits intomainfrom
cleanup-file-path-handling
Open

File path handling cleanup#1604
shcheklein wants to merge 21 commits intomainfrom
cleanup-file-path-handling

Conversation

@shcheklein
Copy link
Contributor

Fixes:

The DataChain File model stores two key fields — source (a storage URI like
s3://mybucket) and path (an object key like photos/cat.jpg). Before this
change, several interrelated problems existed in how these fields were used to
construct filesystem paths, signed URLs, and local export destinations.

1.1 Path Traversal on Local Export

When a user calls file.export(output_dir) or chain.to_storage(output_dir),
the library combines the output directory with a suffix derived from
file.path. If file.path contained traversal segments (..) or was an
absolute path, the exported file could be written outside the intended
output directory.

Concrete example — tar member with malicious name:

# A tar archive contains a member named "../../etc/cron.d/evil"
chain = dc.read_storage("s3://bucket").gen(file=process_tar)
chain.to_storage("/home/user/output", placement="filepath")
# OLD: writes to /home/user/etc/cron.d/evil  ← ESCAPED
# NEW: raises FileError("path must not contain '..'")

Concrete example — percent-encoded traversal:

# Object key is literally "..%2f..%2fescape.txt"
# OLD: unquote() decoded it to "../../escape.txt" → traversal
# NEW: percent-encoding is treated as opaque bytes → "..%2f..%2fescape.txt"
#      stays inside the output directory

1.2 Version ID Embedded in URL Paths

Each backend had a version_path() class method that mutated the URL string
to embed version identifiers (e.g., s3://bucket/key?versionId=abc,
gs://bucket/key#generation). This was used throughout get_full_path,
get_size, get_file, open_object, etc.

Problems:

  • S3 and Azure used query-string encoding (?versionId=... / ?versionid=...),
    which broke when URLs already contained query parameters.
  • GCS used fragment encoding (#generation), which conflicted with object keys
    that legitimately contained #.
  • Every call to get_full_path(path, version_id) threaded a version ID through
    string manipulation — error-prone and inconsistent across backends.
  • version_path raised ValueError if version was already embedded, requiring
    callers to carefully avoid double-encoding.

Concrete example:

# OLD: client.get_full_path("photos/cat.jpg", "v123")
#   S3 → "s3://bucket/photos/cat.jpg?versionId=v123"
#   GCS → "gs://bucket/photos/cat.jpg#v123"
#   Azure → "az://container/photos/cat.jpg?versionid=v123"
# Then fs._info("s3://bucket/photos/cat.jpg?versionId=v123")
#   ↑ some fsspec backends don't parse this correctly

# NEW: client.get_full_path("photos/cat.jpg")  → clean path
#      client._version_kwargs("v123")           → {"version_id": "v123"} (S3)
#                                                  {"generation": "v123"} (GCS)
#      fs._info(clean_path, **version_kwargs)   → explicit parameter

1.3 Inconsistent File.source Format

File.source could be set to bare strings like "mybucket" or "s3" without
a URI scheme, leading to ambiguous behavior in path construction and client
dispatch. Which client handles "mybucket"? Is it local? S3?

1.4 Percent-Encoding Confusion

File.get_uri() used get_path_normalized() which called os.path.normpath,
then the result was concatenated with source. get_fs_path() called
unquote(get_uri())double-decoding percent-encoded characters. An
object key literally named file%20name.txt (a valid S3 key) would become
file name.txt when accessed.

Local path_to_uri() used Path.as_uri() which percent-encodes special
characters. Then split_url() used urlparse which decoded them. A local
file at /data/dir #hash/file.txt would lose the # during round-tripping.

@cloudflare-workers-and-pages
Copy link

cloudflare-workers-and-pages bot commented Feb 20, 2026

Deploying datachain with  Cloudflare Pages  Cloudflare Pages

Latest commit: 1d7275c
Status: ✅  Deploy successful!
Preview URL: https://806471d4.datachain-2g6.pages.dev
Branch Preview URL: https://cleanup-file-path-handling.datachain-2g6.pages.dev

View logs

@shcheklein shcheklein force-pushed the cleanup-file-path-handling branch from de4fe51 to 522f929 Compare February 20, 2026 04:50
@codecov
Copy link

codecov bot commented Feb 20, 2026

@shcheklein shcheklein force-pushed the cleanup-file-path-handling branch from 106219e to 0a715be Compare February 23, 2026 00:26
@shcheklein shcheklein force-pushed the cleanup-file-path-handling branch from 0a715be to 22a9a61 Compare February 23, 2026 01:26
Replace Path.resolve(strict=False) with os.path.abspath + os.path.normcase
in is_path_in() for deterministic, filesystem-independent path comparison.
resolve() varies across Python versions (3.11 uses abspath, 3.12+ uses
realpath) and can produce different results depending on which path
components already exist on disk.

Also use startswith(output + os.sep) instead of is_relative_to() to
prevent false positives where /foo/bar2 would match /foo/bar.

Add drive-letter absolute path check to validate_local_relpath() so
Windows paths like C:/secret/file are rejected.
Copy link
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This pull request addresses critical security vulnerabilities and architectural issues in DataChain's file path handling. The changes fix path traversal vulnerabilities, refactor version ID handling from URL string manipulation to explicit parameters, enforce URI scheme validation for File.source, and eliminate percent-encoding ambiguities.

Changes:

  • Fix path traversal vulnerabilities in file export by validating relative paths and introducing is_subpath() guard
  • Refactor version ID handling to use explicit _version_kwargs() method instead of embedding versions in URL strings
  • Add strict validation for File.source requiring URI schemes (e.g., file://, s3://) and File.path to reject traversal segments
  • Sanitize dataset names in listing URIs to prevent SQL injection from special characters
  • Treat percent-encoding as opaque bytes rather than decoding/encoding during path operations

Reviewed changes

Copilot reviewed 31 out of 31 changed files in this pull request and generated no comments.

Show a summary per file
File Description
src/datachain/lib/file.py Core File model changes: added source/path validators, refactored export() with traversal protection, replaced get_path_normalized() with client-specific rel_path_for_file(), updated open() to use full_path_for_file()
src/datachain/lib/utils.py Simplified rebase_path() to handle URI schemes cleanly without urlparse
src/datachain/lib/listing.py Added _sanitize_ds_name() to encode special chars in dataset names for SQL safety
src/datachain/fs/utils.py Added is_subpath() helper for path traversal protection
src/datachain/client/fsspec.py Base client refactoring: removed version_path(), added _version_kwargs(), full_path_for_file(), rel_path_for_file() methods
src/datachain/client/local.py Added validate_file_relpath(), _has_drive_letter(), updated path_to_uri() to avoid percent-encoding
src/datachain/client/s3.py Refactored url() to use explicit VersionId parameter instead of query string manipulation
src/datachain/client/gcs.py Added _path_with_generation() and _version_kwargs() for generation handling
src/datachain/client/azure.py Refactored url() to append versionid only for signing, not general I/O
src/datachain/client/http.py Added percent-encoding to get_full_path() preserving query/fragment
src/datachain/cache.py Updated download() to use full_path_for_file() and get_size(file)
src/datachain/studio.py Removed dateparser warnings suppression workaround for Python 3.13
tests/* Comprehensive test coverage for path validation, traversal protection, Windows paths, special characters, version handling

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

@shcheklein shcheklein force-pushed the cleanup-file-path-handling branch from df7de37 to 8f4456f Compare February 23, 2026 20:51
@shcheklein shcheklein force-pushed the cleanup-file-path-handling branch from 851933f to 4fa058c Compare February 24, 2026 18:04
@shcheklein shcheklein force-pushed the cleanup-file-path-handling branch from 56d7df1 to 88705d9 Compare February 25, 2026 07:01
Copy link
Contributor

@dreadatour dreadatour left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Big PR, a lot of changes, but all changes looks good and relevant to me 👍
Haven't tested it locally though, but tests seems are passed, so I think we good to go.

@shcheklein shcheklein force-pushed the cleanup-file-path-handling branch 3 times, most recently from 8f17b3e to ad010c5 Compare February 28, 2026 20:44
@shcheklein shcheklein requested a review from Copilot February 28, 2026 20:47
Copy link
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 37 out of 37 changed files in this pull request and generated 2 comments.


💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

@shcheklein shcheklein force-pushed the cleanup-file-path-handling branch from ad010c5 to 674e0b4 Compare February 28, 2026 21:31
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants