
[BUG][Spark] [Cosmos-Changefeed Connector] #46074

@pammusankolli

Description


Describe the bug
The Spark connector for Azure Cosmos DB (azure-cosmos-spark_3-1_2-12) creates an increasing number of Spark tasks over time, even when:

The Cosmos container has a fixed number of physical partitions (75)

The partitioning strategy is set to Custom

The number of partitions is explicitly capped with spark.cosmos.partitioning.targetedCount = 75

Despite this configuration, the number of Spark tasks continues to grow between job runs (e.g., 75 → 150 → 300...), even when nothing else has changed. This results in performance issues and unpredictable scaling.

Exception or Stack Trace
There is no exception. However, the Spark UI shows increasing task counts:

Stage 0 (first run): 75 tasks
Stage 1 (second run): 150 tasks
Stage 2 (third run): 300 tasks
...
This occurs without repartitioning or shuffles in the job.

To Reproduce
Use a Cosmos DB container with 75 physical partitions

Run the following Spark job repeatedly (or across multiple tenants)

Observe the increasing number of tasks in the Spark UI over time

val df = spark.read
  .format("cosmos.oltp")
  .option("spark.cosmos.accountEndpoint", "<>")
  .option("spark.cosmos.accountKey", "<>")
  .option("spark.cosmos.database", "<>")
  .option("spark.cosmos.container", "<>")
  .option("spark.cosmos.read.partitioning.strategy", "Custom")
  .option("spark.cosmos.partitioning.targetedCount", "75")
  .load()

println(s"Spark partition count: ${df.rdd.getNumPartitions}")
Across runs, the printed partition count keeps increasing, even though the Cosmos container's physical partition count stays at 75.
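
For clarity, a minimal sketch of how the growth can be observed by repeating the read with identical options (the placeholder connection values are stand-ins, just like the "<>" values above):

// Read the same container several times with identical options and print the
// Spark partition count after each read. Observed: 75, 150, 300, ... instead
// of a constant 75.
val cosmosOptions = Map(
  "spark.cosmos.accountEndpoint"            -> "<>",
  "spark.cosmos.accountKey"                 -> "<>",
  "spark.cosmos.database"                   -> "<>",
  "spark.cosmos.container"                  -> "<>",
  "spark.cosmos.read.partitioning.strategy" -> "Custom",
  "spark.cosmos.partitioning.targetedCount" -> "75"
)

(1 to 3).foreach { run =>
  val df = spark.read.format("cosmos.oltp").options(cosmosOptions).load()
  println(s"Run $run partition count: ${df.rdd.getNumPartitions}")
}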

Expected behavior
Spark should consistently generate 75 input partitions, matching the spark.cosmos.partitioning.targetedCount setting. This number should remain constant unless the Cosmos container's physical partition count changes or the configuration is explicitly updated.
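
A small sketch of the invariant that is expected to hold after each read (df is the DataFrame from the snippet above; 75 mirrors the targetedCount setting):

// The number of Spark input partitions should match
// spark.cosmos.partitioning.targetedCount on every run.
val expectedPartitions = 75
val actualPartitions = df.rdd.getNumPartitions
assert(actualPartitions == expectedPartitions,
  s"expected $expectedPartitions Cosmos input partitions, got $actualPartitions")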

Screenshots
Spark UI screenshots showing the increasing task counts across stages and runs can be attached if needed.

Setup (please complete the following information):
OS: Ubuntu 20.04 (also tested in Databricks Runtime)

IDE: IntelliJ IDEA (optional)

Library: com.azure.cosmos.spark:azure-cosmos-spark_3-1_2-12:4.37.1 (dependency declaration sketched after this list)

Java version: OpenJDK 11

Environment: Apache Spark 3.1.2 on Databricks Runtime 10.x

Frameworks: Plain Spark batch job (no streaming)
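
For reference, a minimal sketch of the dependency declaration, assuming an sbt build (coordinates taken from the Library line above):

// build.sbt — the artifact name already encodes Spark 3.1 and Scala 2.12
libraryDependencies += "com.azure.cosmos.spark" % "azure-cosmos-spark_3-1_2-12" % "4.37.1"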

Additional Notes
The Cosmos container has 75 physical partitions — verified in Azure Portal.

There are no .repartition(), .groupBy(), or union operations applied.

.coalesce(75) has been tested as a workaround (sketched after these notes); the fact that it caps the task count points to the connector's partition management logic rather than the job itself.

Using strategy Restrictive also leads to similar task count inflation.

This behavior breaks expected scaling guarantees and causes resource waste.
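
A minimal sketch of the .coalesce(75) workaround referenced above (df is the DataFrame from the repro snippet); it caps the partition count after the read but does not change the connector's partition planning:

// Cap the number of Spark partitions after the Cosmos read.
val capped = df.coalesce(75)
println(s"Partition count after coalesce: ${capped.rdd.getNumPartitions}") // prints 75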


    Labels

    Client · Cosmos · Service Attention · customer-reported · needs-team-attention · question
