
[SPARK-51338][INFRA] Add automated CI build for connect-examples #50187


Draft · wants to merge 7 commits into master

Conversation

vicennial (Contributor)

What changes were proposed in this pull request?

Adds a build step for the connect-examples directory (specifically the server-library-example project) to the CI workflow (build_and_test.yml).

Why are the changes needed?

There is currently no automated CI build for these examples.

Does this PR introduce any user-facing change?

No

How was this patch tested?

CI run in this PR

Was this patch authored or co-authored using generative AI tooling?

Generated-by: o3-mini-high

github-actions bot added the INFRA label on Mar 6, 2025
@hvanhovell (Contributor)

@HyukjinKwon can you take a look?

github-actions bot added the BUILD label on Mar 6, 2025
vicennial marked this pull request as draft on March 6, 2025, 21:15
@@ -92,6 +92,7 @@ jobs:
pyspark_pandas_modules=`cd dev && python -c "import sparktestsupport.modules as m; print(','.join(m.name for m in m.all_modules if m.name.startswith('pyspark-pandas')))"`
pyspark=`./dev/is-changed.py -m $pyspark_modules`
pandas=`./dev/is-changed.py -m $pyspark_pandas_modules`
connect_examples=`./dev/is-changed.py -m "connect-examples"`
Member

If we want to run this with Java 21 (and the other scheduled builds), we would need to add this to https://github.com/apache/spark/blob/master/.github/workflows/build_java21.yml too, e.g., "connect_examples": "true"
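
For illustration, a rough sketch of what that addition might look like in build_java21.yml, assuming the scheduled workflow invokes the reusable build_and_test.yml with a JSON map of jobs; everything other than the "connect_examples" entry is illustrative and may differ from the actual file:

# Hypothetical excerpt of .github/workflows/build_java21.yml;
# only "connect_examples": "true" is the proposed addition.
jobs:
  run-build:
    name: Run
    uses: ./.github/workflows/build_and_test.yml
    with:
      java: 21
      jobs: >-
        {
          "build": "true",
          "connect_examples": "true"
        }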

- name: Build server-library-example
  run: |
    cd connect-examples/server-library-example
    mvn clean package
Member

Can we use build/sbt instead? We use SBT in the PR builder.
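
If the example project were given an sbt build definition (today it ships only Maven poms), the CI step could look roughly like the sketch below; the exact tasks and paths are assumptions:

# Hypothetical step using Spark's bundled SBT launcher instead of Maven.
# Assumes connect-examples/server-library-example gains an sbt build definition.
- name: Build server-library-example
  run: |
    cd connect-examples/server-library-example
    ../../build/sbt clean package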

HyukjinKwon (Member) left a comment

Looks good otherwise

HyukjinKwon changed the title from "[SPARK-51338] Add automated CI build for connect-examples" to "[SPARK-51338][INFRA] Add automated CI build for connect-examples" on Mar 6, 2025
@LuciferYang (Contributor)

I've noticed a rather peculiar issue here. It seems that the connect-examples project is dependent on a released version of Spark, which means we can only update to a new version after it has been officially released.

For example, when Spark 4.0.0 is released, the version of this project in the published v4.0.0 tag won't be Spark 4.0.0, but rather some previously released version. Am I right about this? If I'm wrong, please correct me.

If I'm right, doesn't that seem a bit strange? Therefore, I still believe the connect-examples module should live outside the Spark repo, so that we can update and release it in a separate codebase after Spark 4.0.0 (or any other official version) is out.

<project xmlns="http://maven.apache.org/POM/4.0.0"
         xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
         xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/xsd/maven-4.0.0.xsd">
  <modelVersion>4.0.0</modelVersion>
  <groupId>org.apache.connect.examples.serverlibrary</groupId>
  <artifactId>spark-server-library-example</artifactId>
  <version>1.0.0</version>
  <packaging>pom</packaging>
  <modules>
    <module>common</module>
    <module>server</module>
    <module>client</module>
  </modules>
  <properties>
    <maven.compiler.source>17</maven.compiler.source>
    <maven.compiler.target>17</maven.compiler.target>
    <project.build.sourceEncoding>UTF-8</project.build.sourceEncoding>
    <scala.binary>2.13</scala.binary>
    <scala.version>2.13.15</scala.version>
    <protobuf.version>3.25.4</protobuf.version>
    <spark.version>4.0.0-preview2</spark.version>
  </properties>
</project>

What do you think about this? @HyukjinKwon @hvanhovell @dongjoon-hyun @cloud-fan @srowen @yaooqinn

LuciferYang (Contributor) left a comment

I think we should discuss #50187 (comment) clearly before merging

@srowen (Member) commented Mar 8, 2025

Can the examples module simply point to SNAPSHOT versions like everything else in the build? The main branch code is always pointing at unreleased code, but on release, those SNAPSHOT versions are changed to the current release version.

@LuciferYang (Contributor)

> Can the examples module simply point to SNAPSHOT versions like everything else in the build? The main branch code is always pointing at unreleased code, but on release, those SNAPSHOT versions are changed to the current release version.

Yes, if it can be done this way, I think it's ok.

github-actions bot added the DOCS label on Mar 11, 2025
<spark.version>4.0.0-preview2</spark.version>
<protobuf.version>4.29.3</protobuf.version>
<spark.version>4.1.0-SNAPSHOT</spark.version>
<connect.guava.version>33.4.0-jre</connect.guava.version>
LuciferYang (Contributor) commented on Mar 12, 2025

This project should inherit from Spark's parent pom.xml, its version should be kept consistent with Spark's version, and spark.version should then use ${project.version}.

Otherwise, the release script currently seems unable to automatically change the project version to the official version during the release process (4.1.0-SNAPSHOT -> 4.1.0).
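
A hedged sketch of what that could look like in connect-examples/server-library-example/pom.xml; the parent coordinates and relativePath below are assumptions based on Spark's top-level pom:

<!-- Hypothetical: inherit Spark's parent pom so the project version tracks Spark's. -->
<parent>
  <groupId>org.apache.spark</groupId>
  <artifactId>spark-parent_2.13</artifactId>
  <version>4.1.0-SNAPSHOT</version>
  <relativePath>../../pom.xml</relativePath>
</parent>

<!-- Then derive the Spark dependency version from the inherited project version. -->
<properties>
  <spark.version>${project.version}</spark.version>
</properties>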

vicennial (Contributor, Author)

What if we update the release script and add a rule/command to auto-change the project version in this pom file as well? This way, we can maintain continuous build compatibility with Spark while remaining somewhat independent (modulo the dependency on the ASF snapshot repo).

I'd like to avoid inheriting the parent pom, as that would lead to the project pulling in Spark's default shading rules, version definitions, etc. In this specific case, that wouldn't be favourable, as the project is intended to demonstrate extension development using a minimal set of dependencies (spark-sql-api, spark-connect-client, etc.).
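
As a rough illustration of such a rule, the release tooling could stamp the example pom with something like the following; the variable name and exact sed expression are hypothetical, not part of the current scripts:

# Hypothetical addition to the release scripts: rewrite the example project's
# Spark dependency when the release version is stamped (e.g. 4.1.0-SNAPSHOT -> 4.1.0).
RELEASE_VERSION="4.1.0"   # illustrative value; would come from the release script
sed -i "s|<spark.version>.*</spark.version>|<spark.version>${RELEASE_VERSION}</spark.version>|" \
  connect-examples/server-library-example/pom.xml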

LuciferYang (Contributor) commented on Mar 12, 2025

If feasible, it's certainly ok. However, I have a few questions regarding this:

  1. How should the versions of other dependencies be updated? Do they need to be consistent with Spark? For instance, the current Spark uses Scala 2.13.16, but this project is still using 2.13.15.

  2. During the release process, after changing the Spark version (e.g., from 4.0.0-SNAPSHOT to 4.0.0), is it necessary to check the build of this project?

  3. Since it aims to be an independent project, why don't we maintain this examples project in a separate branch (with no Spark code whatsoever), or even create a separate repository like spark-connect-examples? If it were an independent repository, wouldn't it be more convenient to also include examples for clients in other programming languages, such as Go or Swift?

Contributor

> What if we update the release script and add a rule/command to auto-change the project version in this pom file as well? This way, we can maintain continuous build compatibility with Spark while remaining somewhat independent (modulo the dependency on the ASF snapshot repo).

@vicennial Is there any progress on this PR? I think it would be best if we could resolve this issue in Spark 4.0.

Contributor

@vicennial Is there any progress on this hypothetical plan? Or can we remove this example module from branch-4.0 first?

vicennial (Contributor, Author)

Thanks for the questions, @LuciferYang. I was AFK last week; back now.

> How should the versions of other dependencies be updated? Do they need to be consistent with Spark? For instance, the current Spark uses Scala 2.13.16, but this project is still using 2.13.15.

Some (but not all) dependencies need to be consistent, such as the protobuf version. These would need to be updated as the Spark Connect code and dependencies evolve.

> During the release process, after changing the Spark version (e.g., from 4.0.0-SNAPSHOT to 4.0.0), is it necessary to check the build of this project?

Since we've decided to add CI tests, I think it would make sense to do a final check at release time as well.

> Since it aims to be an independent project, why don't we maintain this examples project in a separate branch (with no Spark code whatsoever), or even create a separate repository like spark-connect-examples? If it were an independent repository, wouldn't it be more convenient to also include examples for clients in other programming languages, such as Go or Swift?

I am not opposed to a separate branch/repository and I could see it working, but I must admit I do not know the implications or pros/cons of creating a separate repository under the ASF. Perhaps the more seasoned committers may know; any ideas, @hvanhovell / @HyukjinKwon / @cloud-fan?

Contributor

> I'd like to avoid inheriting the parent pom, as that would lead to the project pulling in Spark's default shading rules, version definitions, etc. In this specific case, that wouldn't be favourable, as the project is intended to demonstrate extension development using a minimal set of dependencies (spark-sql-api, spark-connect-client, etc.).

After some consideration: if this project does not want to inherit Spark's parent pom.xml, it might be necessary to first deploy the Spark codebase corresponding to this commit to a local repository, and then build the example project with the -Dmaven.repo.local=/path/to/local/repository option.

Another possible approach is to configure the ASF snapshot repository, but in that case the project would not pick up a snapshot of the current commit, only a nightly build.
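
A minimal sketch of the first approach, assuming the commands run from the Spark repository root; the local repository path is illustrative:

# Hypothetical flow: install the current Spark commit into a throwaway local
# repository, then build the example project against those artifacts.
LOCAL_REPO=/tmp/spark-local-repo   # illustrative path
./build/mvn -DskipTests -Dmaven.repo.local="$LOCAL_REPO" install
cd connect-examples/server-library-example
mvn -Dmaven.repo.local="$LOCAL_REPO" clean package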

vicennial (Contributor, Author)

Thanks for the suggestions, @LuciferYang; I am exploring the first option at the moment.

@cloud-fan (Contributor)

AFAIK some downstream libraries (e.g. Delta Lake) also depend on the Spark nightly builds for testing. I think it makes sense for connect-examples to do the same.
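
For reference, a hedged sketch of how the example pom could resolve nightly SNAPSHOT artifacts from the ASF snapshot repository; the repository id and URL below are the commonly used ones but should be verified:

<!-- Hypothetical repository block for resolving Spark SNAPSHOT (nightly) artifacts. -->
<repositories>
  <repository>
    <id>apache-snapshots</id>
    <url>https://repository.apache.org/content/repositories/snapshots/</url>
    <releases><enabled>false</enabled></releases>
    <snapshots><enabled>true</enabled></snapshots>
  </repository>
</repositories>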

@LuciferYang (Contributor)

> AFAIK some downstream libraries (e.g. Delta Lake) also depend on the Spark nightly builds for testing. I think it makes sense for connect-examples to do the same.

Yes, that would be ok. However, the connect examples within the tag for Spark 4.0.0 should be changed to depend on the release version of Spark 4.0.0 rather than the snapshot, right? This seems to require some additional changes and needs to be completed before the release of version 4.0, correct?
