[SPARK-51338][INFRA] Add automated CI build for connect-examples
#50187
base: master
Conversation
@HyukjinKwon can you take a look?
@@ -92,6 +92,7 @@ jobs:
 pyspark_pandas_modules=`cd dev && python -c "import sparktestsupport.modules as m; print(','.join(m.name for m in m.all_modules if m.name.startswith('pyspark-pandas')))"`
 pyspark=`./dev/is-changed.py -m $pyspark_modules`
 pandas=`./dev/is-changed.py -m $pyspark_pandas_modules`
+connect_examples=`./dev/is-changed.py -m "connect-examples"`
If we want to run this with Java 21 (and other scheduled builds), we would also need to add this in https://github.com/apache/spark/blob/master/.github/workflows/build_java21.yml, e.g., "connect_examples": "true".
- name: Build server-library-example
  run: |
    cd connect-examples/server-library-example
    mvn clean package
Can we use build/sbt instead? We use SBT in the PR builder.
Looks good otherwise
I've noticed a rather peculiar issue here. For example, when Spark 4.0.0 is released, the version of this project in the published v4.0.0 tag won't be Spark 4.0.0, but rather some previously released version. Am I right about this? If I'm wrong, please correct me. If I'm right, doesn't that seem a bit strange? Therefore, I still believe the way the versions are declared in the following snippet should be revisited:

spark/connect-examples/server-library-example/pom.xml (lines 19 to 42 in f40bf4d)
What do you think about this? @HyukjinKwon @hvanhovell @dongjoon-hyun @cloud-fan @srowen @yaooqinn
I think we should discuss #50187 (comment) clearly before merging
Can the examples module simply point to SNAPSHOT versions like everything else in the build? The main branch code always points at unreleased code, but on release those SNAPSHOT versions are changed to the current release version.
Yes, if it can be done this way, I think it's ok.
-      <spark.version>4.0.0-preview2</spark.version>
       <protobuf.version>4.29.3</protobuf.version>
+      <spark.version>4.1.0-SNAPSHOT</spark.version>
       <connect.guava.version>33.4.0-jre</connect.guava.version>
This project should inherit from Spark's parent pom.xml, and the project version should be consistent with Spark's version; spark.version should then use ${project.version}. Otherwise, the release script currently seems unable to automatically change the project version to the official version during the release process (4.1.0-SNAPSHOT -> 4.1.0).
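For concreteness, here is a minimal sketch of what that suggestion could look like in the example's pom.xml; the parent coordinates, relativePath, and packaging below are assumptions for illustration, not taken from this PR:

```xml
<!-- Sketch only: inherit Spark's parent pom so the release scripts manage the
     version, and derive spark.version from the inherited project version.
     Parent coordinates and relativePath are assumed for illustration. -->
<project xmlns="http://maven.apache.org/POM/4.0.0"
         xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
         xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/xsd/maven-4.0.0.xsd">
  <modelVersion>4.0.0</modelVersion>

  <parent>
    <groupId>org.apache.spark</groupId>
    <artifactId>spark-parent_2.13</artifactId>
    <version>4.1.0-SNAPSHOT</version>
    <relativePath>../../pom.xml</relativePath>
  </parent>

  <artifactId>server-library-example</artifactId>
  <packaging>pom</packaging>

  <properties>
    <!-- Tracks whatever version the parent pom (and the release scripts) set. -->
    <spark.version>${project.version}</spark.version>
  </properties>
</project>
```

The point of the ${project.version} indirection is that the same version bump the release scripts apply to every other module would then cover this pom as well.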
What if we update the release script and add a rule/command to auto-change the project version in this pom file as well? This way, we can satisfy both continuous build compatibility with Spark and be somewhat independent (modulo the dependency on the ASF snapshot repo).
I'd like to avoid inheriting the parent pom as that would lead to the project pulling in Spark's default shading rules, version definitions etc. In this specific case, it wouldn't be favourable as it's intended to demonstrate the extension's development using a minimal set of dependencies (spark-sql-api, spark-connect-client, etc.).
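To illustrate the standalone shape being argued for here, a rough fragment of a pom that does not inherit Spark's parent and declares only a couple of Spark Connect dependencies; the artifact IDs, versions, and scopes below are assumptions for illustration:

```xml
<!-- Sketch only: a self-contained example pom with no Spark parent, pinning just
     the dependencies the extension needs. Artifact IDs/versions are illustrative. -->
<properties>
  <scala.binary.version>2.13</scala.binary.version>
  <spark.version>4.1.0-SNAPSHOT</spark.version>
</properties>

<dependencies>
  <dependency>
    <groupId>org.apache.spark</groupId>
    <artifactId>spark-sql-api_${scala.binary.version}</artifactId>
    <version>${spark.version}</version>
    <scope>provided</scope>
  </dependency>
  <dependency>
    <groupId>org.apache.spark</groupId>
    <artifactId>spark-connect-client-jvm_${scala.binary.version}</artifactId>
    <version>${spark.version}</version>
  </dependency>
</dependencies>
```

The trade-off is the one discussed below: without the parent pom, nothing in the release tooling rewrites spark.version automatically.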
If feasible, it's certainly ok. However, I have a few questions regarding this:
- How should the versions of other dependencies be updated? Do they need to be consistent with Spark? For instance, the current Spark uses Scala 2.13.16, but this project is still using 2.13.15.
- During the release process, after changing the Spark version (e.g., from 4.0.0-SNAPSHOT to 4.0.0), is it necessary to check the build of this project?
- Since it aims to be an independent project, why don't we choose to maintain this examples project in a separate branch (with no Spark code whatsoever), or even create a separate repository like spark-connect-examples? If it were an independent repository, would it be more convenient to also include examples for clients in other programming languages, such as Go or Swift?
> What if we update the release script and add a rule/command to auto-change the project version in this pom file as well? This way, we can satisfy both continuous build compatibility with Spark and be somewhat independent (modulo the dependency on the ASF snapshot repo).

@vicennial Is there any progress on this PR? I think it would be best if we could resolve this issue in Spark 4.0.
@vicennial Is there any progress on this hypothetical plan? Or can we remove this example module from branch-4.0 first?
Thanks for the questions, @LuciferYang. I was AFK last week but am back now.

> How should the versions of other dependencies be updated? Do they need to be consistent with Spark? For instance, the current Spark uses Scala 2.13.16, but this project is still using 2.13.15.

Some (but not all) dependencies need to be consistent, such as the protobuf version. These would need to be updated as the Spark Connect code and dependencies evolve.

> During the release process, after changing the Spark version (e.g., from 4.0.0-SNAPSHOT to 4.0.0), is it necessary to check the build of this project?

Since we've decided to add CI tests, I think it would make sense to do a final check at the time of release as well.

> Since it aims to be an independent project, why don't we choose to maintain this examples project in a separate branch (with no Spark code whatsoever), or even create a separate repository like spark-connect-examples? If it were an independent repository, would it be more convenient to also include examples for clients in other programming languages, such as Go or Swift?

I am not opposed to a separate branch/repository and I could see it working, but I must admit that I do not know the implications or pros/cons of creating a separate repository under the ASF. Perhaps the more seasoned committers may know; any ideas @hvanhovell / @HyukjinKwon / @cloud-fan?
> I'd like to avoid inheriting the parent pom as that would lead to the project pulling in Spark's default shading rules, version definitions, etc. In this specific case, it wouldn't be favourable as it's intended to demonstrate the extension's development using a minimal set of dependencies (spark-sql-api, spark-connect-client, etc.).
After some consideration, if this project does not want to inherit Spark's parent pom.xml, it might be necessary to first deploy the Spark codebase corresponding to this commit to a local repository, and then build the current project with the -Dmaven.repo.local=/path/to/local/repository option.

Another possible approach is to configure the ASF snapshot repository, but in that case the project will not pick up a timely snapshot of the current commit, only a nightly build.
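For reference, a rough sketch of that second approach, i.e. resolving nightly snapshots from the ASF snapshot repository; the repository id and URL below are the commonly used ASF snapshot coordinates, assumed here rather than taken from this PR:

```xml
<!-- Sketch only: resolve nightly Spark snapshots from the ASF snapshot repository.
     The id and URL are assumptions based on the usual ASF setup. -->
<repositories>
  <repository>
    <id>apache-snapshots</id>
    <url>https://repository.apache.org/snapshots</url>
    <releases>
      <enabled>false</enabled>
    </releases>
    <snapshots>
      <enabled>true</enabled>
    </snapshots>
  </repository>
</repositories>
```

As noted above, the artifacts resolved this way are nightly builds rather than the exact commit under test.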
Thanks for the suggestions @LuciferYang, I am exploring the first option atm.
AFAIK some downstream libraries (e.g. Delta Lake) also depend on the Spark nightly build for testing. I think it makes sense for connect-examples to do the same.
Yes, that would be ok. However, the connect examples within the tag for Spark 4.0.0 should be changed to depend on the release version of Spark 4.0.0 rather than the snapshot, right? This seems to require some additional changes and needs to be completed before the release of version 4.0, correct?
What changes were proposed in this pull request?
Adds a build step for the connect-examples directory (specifically the server-library-example project) to the CI steps (build_and_test.yml).

Why are the changes needed?
There is no existing automated CI build at the moment.
Does this PR introduce any user-facing change?
No
How was this patch tested?
CI run in this PR
Was this patch authored or co-authored using generative AI tooling?
Generated-by: o3-mini-high