Skip to content

[SPARK-52334][CORE][K8S] update all files, jars, and pyFiles to reference the working directory after they are downloaded #51037

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Open
wants to merge 3 commits into
base: master
Choose a base branch
from

Conversation

TongWei1105
Copy link
Contributor

@TongWei1105 TongWei1105 commented May 28, 2025

…rking directory after they are downloaded.

What changes were proposed in this pull request?

This PR fixes a bug where submitting a Spark job using the --files option and also calling SparkContext.addFile() for a file with the same name causes Spark to throw an exception
Exception in thread "main" java.lang.IllegalArgumentException: requirement failed: File a.text was already registered with a different path (old path = /tmp/spark-6aa5129d-5bbb-464a-9e50-5b6ffe364ffb/a.text, new path = /opt/spark/work-dir/a.text

Why are the changes needed?

  1. Submit a Spark application using spark-submit with the --files option:
    bin/spark-submit --files s3://bucket/a.text --class testDemo app.jar
  2. In the testDemo application code, call:
    sc.addFile("a.text", true)

This works correctly in YARN mode, but throws an error in Kubernetes mode.
After SPARK-33782, in Kubernetes mode, --files, --jars, --archiveFiles, and --pyFiles are all downloaded to the working directory.

However, in the code, args.files = filesLocalFiles, and filesLocalFiles refers to a temporary download path, not the working directory.
This causes issues when user code like testDemo calls sc.addFile("a.text", true), resulting in an error such as:
Exception in thread "main" java.lang.IllegalArgumentException: requirement failed: File a.text was already registered with a different path (old path = /tmp/spark-6aa5129d-5bbb-464a-9e50-5b6ffe364ffb/a.text, new path = /opt/spark/work-dir/a.text

Does this PR introduce any user-facing change?

This issue can be resolved after this PR.

How was this patch tested?

Existed UT

Was this patch authored or co-authored using generative AI tooling?

no

@github-actions github-actions bot added the CORE label May 28, 2025
@TongWei1105
Copy link
Contributor Author

Hi @dongjoon-hyun ,
Could you please review this PR when you have time? Many thanks!

@TongWei1105 TongWei1105 changed the title [SPARK-52334][CORE][K8S] update all files, jars, archiveFiles, and pyFiles to reference the working directory after they are downloaded [SPARK-52334][CORE][K8S] update all files, jars, and pyFiles to reference the working directory after they are downloaded May 28, 2025
Copy link
Member

@dongjoon-hyun dongjoon-hyun left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do you think you can write a test case, @TongWei1105 ?

@TongWei1105
Copy link
Contributor Author

TongWei1105 commented May 29, 2025

Do you think you can write a test case, @TongWei1105 ?
@dongjoon-hyun
Thanks for the suggestion! I've added a unit test,please feel free to review it when convenient. Appreciate your feedback!

@TongWei1105 TongWei1105 requested a review from dongjoon-hyun May 29, 2025 08:54
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants