
Commit c879f28

[SPARK] Set Auto Bucket Partitioner to be the default partitioning strategy (#252)
* default partitioner
* formatting
* tweak
* move example
* release note
* typo
1 parent c8f8c7b

File tree

3 files changed: +80 -64 lines changed

snooty.toml

Lines changed: 1 addition & 0 deletions
@@ -21,6 +21,7 @@ artifact-id-2-13 = "mongo-spark-connector_2.13"
 artifact-id-2-12 = "mongo-spark-connector_2.12"
 spark-core-version = "3.3.1"
 spark-sql-version = "3.3.1"
+mdb-server = "MongoDB Server"

 [substitutions]
 copy = "unicode:: U+000A9"

source/batch-mode/batch-read-config.txt

Lines changed: 72 additions & 64 deletions
@@ -7,7 +7,7 @@ Batch Read Configuration Options
 .. contents:: On this page
    :local:
    :backlinks: none
-   :depth: 1
+   :depth: 2
    :class: singlecol

 .. facet::
@@ -178,26 +178,82 @@ dividing the data into partitions, you can run transformations in parallel.
 This section contains configuration information for the following
 partitioners:

+- :ref:`AutoBucketPartitioner <conf-autobucketpartitioner>`
 - :ref:`SamplePartitioner <conf-samplepartitioner>`
 - :ref:`ShardedPartitioner <conf-shardedpartitioner>`
 - :ref:`PaginateBySizePartitioner <conf-paginatebysizepartitioner>`
 - :ref:`PaginateIntoPartitionsPartitioner <conf-paginateintopartitionspartitioner>`
 - :ref:`SinglePartitionPartitioner <conf-singlepartitionpartitioner>`
-- :ref:`AutoBucketPartitioner <conf-autobucketpartitioner>`

 .. note:: Batch Reads Only

    Because the data-stream-processing engine produces a single data stream,
    partitioners do not affect streaming reads.

+.. _conf-autobucketpartitioner:
+
+AutoBucketPartitioner Configuration (default)
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+The ``AutoBucketPartitioner`` is the default partitioner configuration. It
+samples the data to generate partitions and uses the
+:manual:`$bucketAuto </reference/operator/aggregation/bucketAuto/>`
+aggregation stage to paginate. By using this configuration, you can partition
+the data across one or more fields, including nested fields.
+
+.. note:: Compound Keys
+
+   The ``AutoBucketPartitioner`` configuration requires {+mdb-server+} version
+   7.0 or higher to support compound keys.
+
+To use this configuration, set the ``partitioner`` configuration option to
+``com.mongodb.spark.sql.connector.read.partitioner.AutoBucketPartitioner``.
+
+.. list-table::
+   :header-rows: 1
+   :widths: 35 65
+
+   * - Property name
+     - Description
+
+   * - ``partitioner.options.partition.fieldList``
+     - The list of fields to use for partitioning. The value can be either a
+       single field name or a list of comma-separated fields.
+
+       **Default:** ``_id``
+
+   * - ``partitioner.options.partition.chunkSize``
+     - The average size (in MB) of each partition. Smaller partition sizes
+       create more partitions, each containing fewer documents.
+       Because this configuration uses the average document size to determine
+       the number of documents per partition, partitions might not be the
+       same size.
+
+       **Default:** ``64``
+
+   * - ``partitioner.options.partition.samplesPerPartition``
+     - The number of samples to take per partition.
+
+       **Default:** ``100``
+
+   * - ``partitioner.options.partition.partitionKeyProjectionField``
+     - The field name to use for a projected field that contains all the
+       fields used to partition the collection.
+       We recommend changing the value of this property only if each document
+       already contains the ``__idx`` field.
+
+       **Default:** ``__idx``
+
 .. _conf-mongosamplepartitioner:
 .. _conf-samplepartitioner:

-``SamplePartitioner`` Configuration
-~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+SamplePartitioner Configuration
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

-``SamplePartitioner`` is the default partitioner configuration. This configuration
-lets you specify a partition field, partition size, and number of samples per partition.
+The ``SamplePartitioner`` configuration is similar to the
+:ref:`AutoBucketPartitioner <conf-autobucketpartitioner>` configuration, but
+does not use the ``$bucketAuto`` aggregation stage. This
+configuration lets you specify a partition field, partition size, and number of
+samples per partition.

 To use this configuration, set the ``partitioner`` configuration option to
 ``com.mongodb.spark.sql.connector.read.partitioner.SamplePartitioner``.
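As a minimal PySpark sketch of the configuration documented above (assuming a deployment at ``mongodb://127.0.0.1/``, a hypothetical ``test.movies`` collection, and the 10.5 connector on the classpath), an explicit ``AutoBucketPartitioner`` read might look like the following:

    from pyspark.sql import SparkSession

    # Placeholder connection details; substitute your own deployment.
    spark = (
        SparkSession.builder
        .appName("auto-bucket-partitioner-sketch")
        .config("spark.mongodb.read.connection.uri", "mongodb://127.0.0.1/")
        .getOrCreate()
    )

    df = (
        spark.read.format("mongodb")
        .option("database", "test")        # hypothetical database
        .option("collection", "movies")    # hypothetical collection
        # Name the partitioner explicitly (it is now also the default).
        .option(
            "partitioner",
            "com.mongodb.spark.sql.connector.read.partitioner.AutoBucketPartitioner",
        )
        # Property names from the table above: a two-field compound key
        # (which the note ties to MongoDB Server 7.0 or higher) and
        # ~32 MB average partitions instead of the default 64.
        .option("partitioner.options.partition.fieldList", "year,imdb.rating")
        .option("partitioner.options.partition.chunkSize", "32")
        .load()
    )

    # Each partition can then be processed in parallel.
    print(df.rdd.getNumPartitions())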
@@ -243,8 +299,8 @@ To use this configuration, set the ``partitioner`` configuration option to
 .. _conf-mongoshardedpartitioner:
 .. _conf-shardedpartitioner:

-``ShardedPartitioner`` Configuration
-~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+ShardedPartitioner Configuration
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

 The ``ShardedPartitioner`` configuration automatically partitions the data
 based on your shard configuration.
@@ -262,8 +318,8 @@ To use this configuration, set the ``partitioner`` configuration option to
 .. _conf-mongopaginatebysizepartitioner:
 .. _conf-paginatebysizepartitioner:

-``PaginateBySizePartitioner`` Configuration
-~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+PaginateBySizePartitioner Configuration
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

 The ``PaginateBySizePartitioner`` configuration paginates the data by using the
 average document size to split the collection into average-sized chunks.
@@ -292,8 +348,8 @@ To use this configuration, set the ``partitioner`` configuration option to

 .. _conf-paginateintopartitionspartitioner:

-``PaginateIntoPartitionsPartitioner`` Configuration
-~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+PaginateIntoPartitionsPartitioner Configuration
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

 The ``PaginateIntoPartitionsPartitioner`` configuration paginates the data by dividing
 the count of documents in the collection by the maximum number of allowable partitions.
@@ -320,63 +376,15 @@ To use this configuration, set the ``partitioner`` configuration option to

 .. _conf-singlepartitionpartitioner:

-``SinglePartitionPartitioner`` Configuration
-~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+SinglePartitionPartitioner Configuration
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

 The ``SinglePartitionPartitioner`` configuration creates a single partition.

 To use this configuration, set the ``partitioner`` configuration option to
 ``com.mongodb.spark.sql.connector.read.partitioner.SinglePartitionPartitioner``.

-.. _conf-autobucketpartitioner:
-
-``AutoBucketPartitioner`` Configuration
-~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
-
-The ``AutoBucketPartitioner`` configuration is similar to the
-:ref:`SamplePartitioner <conf-samplepartitioner>`
-configuration, but uses the :manual:`$bucketAuto </reference/operator/aggregation/bucketAuto/>`
-aggregation stage to paginate the data. By using this configuration,
-you can partition the data across single or multiple fields, including nested fields.
-
-To use this configuration, set the ``partitioner`` configuration option to
-``com.mongodb.spark.sql.connector.read.partitioner.AutoBucketPartitioner``.
-
-.. list-table::
-   :header-rows: 1
-   :widths: 35 65
-
-   * - Property name
-     - Description
-
-   * - ``partitioner.options.partition.fieldList``
-     - The list of fields to use for partitioning. The value can be either a single field
-       name or a list of comma-separated fields.
-
-       **Default:** ``_id``
-
-   * - ``partitioner.options.partition.chunkSize``
-     - The average size (MB) for each partition. Smaller partition sizes
-       create more partitions containing fewer documents.
-       Because this configuration uses the average document size to determine the number of
-       documents per partition, partitions might not be the same size.
-
-       **Default:** ``64``
-
-   * - ``partitioner.options.partition.samplesPerPartition``
-     - The number of samples to take per partition.
-
-       **Default:** ``100``
-
-   * - ``partitioner.options.partition.partitionKeyProjectionField``
-     - The field name to use for a projected field that contains all the
-       fields used to partition the collection.
-       We recommend changing the value of this property only if each document already
-       contains the ``__idx`` field.
-
-       **Default:** ``__idx``
-
-Specifying Properties in ``connection.uri``
--------------------------------------------
+Specifying Properties in connection.uri
+---------------------------------------

 .. include:: /includes/connection-read-config.rst
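Because ``SamplePartitioner`` is no longer the default after this change, a job that depended on the old behavior has to name it explicitly. A short sketch, reusing the hypothetical ``spark`` session and ``test.movies`` collection from the example above:

    # Without a "partitioner" option, 10.5 applies the AutoBucketPartitioner;
    # naming the SamplePartitioner restores the pre-10.5 default behavior.
    sample_df = (
        spark.read.format("mongodb")
        .option("database", "test")        # hypothetical names, as above
        .option("collection", "movies")
        .option(
            "partitioner",
            "com.mongodb.spark.sql.connector.read.partitioner.SamplePartitioner",
        )
        .load()
    )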

source/release-notes.txt

Lines changed: 7 additions & 0 deletions
@@ -18,6 +18,13 @@ Release Notes
    :depth: 1
    :class: singlecol

+MongoDB Connector for Spark 10.5
+--------------------------------
+
+The 10.5 connector release includes the following changes and new features:
+
+- Changes the default batch read partitioner configuration to ``AutoBucketPartitioner``
+
 MongoDB Connector for Spark 10.4
 --------------------------------
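One hedged way to confirm the new default after upgrading to 10.5 is to read without any ``partitioner`` option and inspect the resulting partition count, again using the hypothetical session and collection from the earlier sketches:

    # A bare read under 10.5 should be split by the AutoBucketPartitioner,
    # so the count reflects its chunkSize setting (default 64 MB).
    default_df = (
        spark.read.format("mongodb")
        .option("database", "test")
        .option("collection", "movies")
        .load()
    )
    print("partitions under the 10.5 default:", default_df.rdd.getNumPartitions())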
