[libc++] Optimize ranges::{for_each, for_each_n} for segmented iterators #132896

winner245 · 2025-03-25T07:46:38Z

Previously, the segmented iterator optimization was limited to std::{for_each, for_each_n}. This patch extends the optimization to std::ranges::for_each and std::ranges::for_each_n, so that all four algorithms are optimized consistently. It first generalizes the std algorithms by introducing a Projection parameter, which is set to __identity for the std algorithms. The ranges algorithms then call their std counterparts directly, passing a general __proj argument. Benchmarks show speedups of up to 21.4x for std::deque::iterator (deque<short>) and 22.3x for join_view of vector<vector<short>>.

Addresses a subtask of #102817.

Summary of speedups for `deque` iterators

-------------------------------------------------------------------------------
Benchmark                        deque<char>    deque<short>    deque<int>
-------------------------------------------------------------------------------
rng::for_each                       14.4x          21.4x           4.6x
rng::for_each_n                     12.9x          15.5x           4.1x
-------------------------------------------------------------------------------

Summary of speedups for `join_view` iterators

-----------------------------------------------------------------------------------------
Benchmark          vector<vector<char>>    vector<vector<short>>    vector<vector<int>>
-----------------------------------------------------------------------------------------
rng::for_each             19.0x                   22.3x                    4.8x
rng::for_each_n           16.3x                   20.1x                    3.9x
-----------------------------------------------------------------------------------------

Benchmarks:

`std::ranges::for_each` with `deque` iterators

--------------------------------------------------------------------------
Benchmark                                    Before        After   Speedup
--------------------------------------------------------------------------
rng::for_each(deque<char>)/8                 8.39 ns      2.63 ns    3.2x
rng::for_each(deque<char>)/32               28.70 ns      3.05 ns    9.4x
rng::for_each(deque<char>)/50               42.00 ns      4.53 ns    9.3x
rng::for_each(deque<char>)/1024            657.00 ns     45.60 ns   14.4x
rng::for_each(deque<char>)/4096           2272.00 ns    169.00 ns   13.4x
rng::for_each(deque<char>)/8192           4525.00 ns    355.00 ns   12.7x
rng::for_each(deque<char>)/16384          9445.00 ns    722.00 ns   13.1x
rng::for_each(deque<char>)/65536         36880.00 ns   2902.00 ns   12.7x
rng::for_each(deque<char>)/262144       157774.00 ns  11577.00 ns   13.6x
rng::for_each(deque<short>)/8                5.70 ns      1.62 ns    3.5x
rng::for_each(deque<short>)/32              26.80 ns      1.69 ns   15.9x
rng::for_each(deque<short>)/50              38.40 ns      3.06 ns   12.5x
rng::for_each(deque<short>)/1024           700.00 ns     40.40 ns   17.3x
rng::for_each(deque<short>)/4096          2782.00 ns    133.00 ns   20.9x
rng::for_each(deque<short>)/8192          5554.00 ns    260.00 ns   21.4x
rng::for_each(deque<short>)/16384        11093.00 ns    521.00 ns   21.3x
rng::for_each(deque<short>)/65536        44035.00 ns   2495.00 ns   17.6x
rng::for_each(deque<short>)/262144      177784.00 ns   9915.00 ns   17.9x
rng::for_each(deque<int>)/8                 5.43 ns       3.00 ns    1.8x
rng::for_each(deque<int>)/32               25.50 ns       5.60 ns    4.6x
rng::for_each(deque<int>)/50               38.50 ns       8.61 ns    4.5x
rng::for_each(deque<int>)/1024            706.00 ns     169.00 ns    4.2x
rng::for_each(deque<int>)/4096           2789.00 ns     670.00 ns    4.2x
rng::for_each(deque<int>)/8192           5547.00 ns    1330.00 ns    4.2x
rng::for_each(deque<int>)/16384         11098.00 ns    2711.00 ns    4.1x
rng::for_each(deque<int>)/65536         44404.00 ns   10709.00 ns    4.1x
rng::for_each(deque<int>)/262144       180739.00 ns   43645.00 ns    4.1x

`std::ranges::for_each_n` with `deque` iterators

--------------------------------------------------------------------------
Benchmark                                    Before        After   Speedup
--------------------------------------------------------------------------
rng::for_each_n(deque<char>)/8              8.22 ns       3.28 ns     2.5x
rng::for_each_n(deque<char>)/32             28.5 ns       3.66 ns     7.8x
rng::for_each_n(deque<char>)/50             37.6 ns       6.15 ns     6.1x
rng::for_each_n(deque<char>)/1024            590 ns       47.0 ns    12.6x
rng::for_each_n(deque<char>)/4096           2151 ns        167 ns    12.9x
rng::for_each_n(deque<char>)/8192           4199 ns        344 ns    12.2x
rng::for_each_n(deque<char>)/16384          8626 ns        701 ns    12.3x
rng::for_each_n(deque<char>)/65536         33613 ns       2845 ns    11.8x
rng::for_each_n(deque<char>)/262144       132493 ns      11291 ns    11.7x
rng::for_each_n(deque<short>)/8             6.53 ns       3.72 ns     1.8x
rng::for_each_n(deque<short>)/32            23.2 ns       3.75 ns     6.2x
rng::for_each_n(deque<short>)/50            32.7 ns       5.54 ns     5.9x
rng::for_each_n(deque<short>)/1024           560 ns       37.4 ns    15.0x
rng::for_each_n(deque<short>)/4096          2105 ns        136 ns    15.5x
rng::for_each_n(deque<short>)/8192          3981 ns        264 ns    15.1x
rng::for_each_n(deque<short>)/16384         7736 ns        525 ns    14.7x
rng::for_each_n(deque<short>)/65536        30359 ns       2459 ns    12.3x
rng::for_each_n(deque<short>)/262144      121006 ns       9852 ns    12.3x
rng::for_each_n(deque<int>)/8               5.59 ns       4.16 ns     1.3x
rng::for_each_n(deque<int>)/32              19.9 ns       6.89 ns     2.9x
rng::for_each_n(deque<int>)/50              32.6 ns       10.1 ns     3.2x
rng::for_each_n(deque<int>)/1024             605 ns        180 ns     3.4x
rng::for_each_n(deque<int>)/4096            2517 ns        715 ns     3.5x
rng::for_each_n(deque<int>)/8192            4942 ns       1431 ns     3.5x
rng::for_each_n(deque<int>)/16384           9809 ns       2906 ns     3.4x
rng::for_each_n(deque<int>)/65536          40199 ns      11316 ns     3.6x
rng::for_each_n(deque<int>)/262144        181371 ns      44000 ns     4.1x

`std::ranges::for_each` with `join_view` iterators

----------------------------------------------------------------------------------------------------
Benchmark                                                       Before           After       Speedup
----------------------------------------------------------------------------------------------------
rng::for_each(join_view(vector<vector<char>>)/8                7.02 ns         2.58 ns         2.7x
rng::for_each(join_view(vector<vector<char>>)/32               32.1 ns         3.03 ns        10.6x
rng::for_each(join_view(vector<vector<char>>)/50               45.2 ns         5.34 ns         8.5x
rng::for_each(join_view(vector<vector<char>>)/1024              782 ns         43.4 ns        18.0x
rng::for_each(join_view(vector<vector<char>>)/4096             3113 ns          168 ns        18.5x
rng::for_each(join_view(vector<vector<char>>)/8192             6231 ns          339 ns        18.4x
rng::for_each(join_view(vector<vector<char>>)/16384           12783 ns          691 ns        18.5x
rng::for_each(join_view(vector<vector<char>>)/65536           53732 ns         2829 ns        19.0x
rng::for_each(join_view(vector<vector<char>>)/262144         210286 ns        11241 ns        18.7x
rng::for_each(join_view(vector<vector<short>>)/8               7.46 ns         2.40 ns         3.1x
rng::for_each(join_view(vector<vector<short>>)/32              33.4 ns         2.81 ns        11.9x
rng::for_each(join_view(vector<vector<short>>)/50              46.1 ns         5.66 ns         8.1x
rng::for_each(join_view(vector<vector<short>>)/1024             791 ns         37.0 ns        21.4x
rng::for_each(join_view(vector<vector<short>>)/4096            3183 ns          149 ns        21.4x
rng::for_each(join_view(vector<vector<short>>)/8192            6360 ns          292 ns        21.8x
rng::for_each(join_view(vector<vector<short>>)/16384          12825 ns          574 ns        22.3x
rng::for_each(join_view(vector<vector<short>>)/65536          51638 ns         2745 ns        18.8x
rng::for_each(join_view(vector<vector<short>>)/262144        210929 ns        10964 ns        19.2x
rng::for_each(join_view(vector<vector<int>>)/8                 7.04 ns         3.02 ns         2.3x
rng::for_each(join_view(vector<vector<int>>)/32                24.4 ns         6.62 ns         3.7x
rng::for_each(join_view(vector<vector<int>>)/50                47.6 ns         9.91 ns         4.8x
rng::for_each(join_view(vector<vector<int>>)/1024               727 ns          180 ns         4.0x
rng::for_each(join_view(vector<vector<int>>)/4096              3110 ns          748 ns         4.2x
rng::for_each(join_view(vector<vector<int>>)/8192              6193 ns         1480 ns         4.2x
rng::for_each(join_view(vector<vector<int>>)/16384            12391 ns         2993 ns         4.1x
rng::for_each(join_view(vector<vector<int>>)/65536            49505 ns        11950 ns         4.1x
rng::for_each(join_view(vector<vector<int>>)/262144          199253 ns        47921 ns         4.2x

`std::ranges::for_each_n` with `join_view` iterators

----------------------------------------------------------------------------------------------------
Benchmark                                                       Before           After       Speedup
----------------------------------------------------------------------------------------------------
rng::for_each_n(join_view(vector<vector<char>>)/8              7.97 ns         2.82 ns         2.8x
rng::for_each_n(join_view(vector<vector<char>>)/32             28.7 ns         3.29 ns         8.7x
rng::for_each_n(join_view(vector<vector<char>>)/50             42.8 ns         6.24 ns         6.9x
rng::for_each_n(join_view(vector<vector<char>>)/1024            728 ns         45.5 ns        16.0x
rng::for_each_n(join_view(vector<vector<char>>)/4096           2891 ns          177 ns        16.3x
rng::for_each_n(join_view(vector<vector<char>>)/8192           5769 ns          359 ns        16.1x
rng::for_each_n(join_view(vector<vector<char>>)/16384         11576 ns          720 ns        16.1x
rng::for_each_n(join_view(vector<vector<char>>)/65536         46525 ns         2889 ns        16.1x
rng::for_each_n(join_view(vector<vector<char>>)/262144       186093 ns        11640 ns        16.0x
rng::for_each_n(join_view(vector<vector<short>>)/8             6.95 ns         3.32 ns         2.1x
rng::for_each_n(join_view(vector<vector<short>>)/32            29.4 ns         3.30 ns         8.9x
rng::for_each_n(join_view(vector<vector<short>>)/50            40.8 ns         5.58 ns         7.3x
rng::for_each_n(join_view(vector<vector<short>>)/1024           719 ns         35.9 ns        20.0x
rng::for_each_n(join_view(vector<vector<short>>)/4096          2875 ns          144 ns        20.0x
rng::for_each_n(join_view(vector<vector<short>>)/8192          5632 ns          283 ns        19.9x
rng::for_each_n(join_view(vector<vector<short>>)/16384        11481 ns          570 ns        20.1x
rng::for_each_n(join_view(vector<vector<short>>)/65536        45355 ns         2616 ns        17.3x
rng::for_each_n(join_view(vector<vector<short>>)/262144      181890 ns        10958 ns        16.6x
rng::for_each_n(join_view(vector<vector<int>>)/8               6.61 ns         3.49 ns         1.9x
rng::for_each_n(join_view(vector<vector<int>>)/32              27.5 ns         7.09 ns         3.9x
rng::for_each_n(join_view(vector<vector<int>>)/50              40.4 ns         10.5 ns         3.8x
rng::for_each_n(join_view(vector<vector<int>>)/1024             674 ns          188 ns         3.6x
rng::for_each_n(join_view(vector<vector<int>>)/4096            2717 ns          766 ns         3.5x
rng::for_each_n(join_view(vector<vector<int>>)/8192            5422 ns         1524 ns         3.6x
rng::for_each_n(join_view(vector<vector<int>>)/16384          11024 ns         3037 ns         3.6x
rng::for_each_n(join_view(vector<vector<int>>)/65536          44197 ns        12159 ns         3.6x
rng::for_each_n(join_view(vector<vector<int>>)/262144        175819 ns        48274 ns         3.6x

libcxx/include/__algorithm/for_each_n.h

llvmbot · 2025-03-25T15:59:57Z

@llvm/pr-subscribers-libcxx

Author: Peng Liu (winner245)

Changes

This patch extends segmented iterator optimizations, previously applied to std::for_each, to std::for_each_n, std::ranges::for_each, and std::ranges::for_each_n by forwarding to std::for_each. New tests validate these optimizations for segmented iterators (e.g., deque<int> and join_view iterators). Benchmarks demonstrate up to 3.9x performance improvement for deque<int> iterators, aligning their performance with contiguous iterators (e.g., vector<int>). The vector<int> performance serves as a baseline for contiguous iterators, representing the upper bound for segmented deque<int> inputs.

Addresses a subtask of #102817.
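
The forwarding described in this summary can be sketched roughly as follows. This is a simplified illustration under stated assumptions, not the patch itself: `for_each_n_sketch`, `count_deque_sketch`, and `count_list_sketch` are hypothetical names, and the sketch dispatches on the random-access category alone, whereas the patch additionally requires a segmented iterator. It assumes C++17.

```cpp
#include <algorithm>
#include <cassert>
#include <deque>
#include <iterator>
#include <list>
#include <type_traits>

// Hypothetical sketch: forward to std::for_each when the end iterator can be
// computed in O(1); otherwise fall back to the counting loop.
template <class Iter, class Size, class Func>
Iter for_each_n_sketch(Iter first, Size n, Func f) {
  using Traits   = std::iterator_traits<Iter>;
  using Category = typename Traits::iterator_category;
  if constexpr (std::is_base_of_v<std::random_access_iterator_tag, Category>) {
    Iter last = first + static_cast<typename Traits::difference_type>(n);
    std::for_each(first, last, f); // may take an optimized path internally
    return last;
  } else {
    for (; n > 0; --n, ++first)
      f(*first);
    return first;
  }
}

// Visits every element of a deque that spans several internal blocks.
inline int count_deque_sketch() {
  std::deque<int> d(2049, 1);
  int visited = 0;
  auto it = for_each_n_sketch(d.begin(), d.size(), [&](int& x) { x += 1; ++visited; });
  return (it == d.end() && d.front() == 2 && d.back() == 2) ? visited : -1;
}

// Exercises the fallback loop with a bidirectional iterator.
inline int count_list_sketch() {
  std::list<int> l(5, 1);
  int visited = 0;
  auto it = for_each_n_sketch(l.begin(), l.size(), [&](int&) { ++visited; });
  return it == l.end() ? visited : -1;
}
```

Computing `first + n` up front is only cheap for random-access iterators, which is why the fallback loop is kept for everything else.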

`for_each_n`

--------------------------------------------------------------------------------
Benchmark                                       Before          After    Speedup
--------------------------------------------------------------------------------
std::for_each_n(deque<int>)/8                  5.31 ns         3.39 ns      1.6x
std::for_each_n(deque<int>)/32                 20.1 ns         6.89 ns      2.9x
std::for_each_n(deque<int>)/1024                612 ns          171 ns      3.6x
std::for_each_n(deque<int>)/8192               4892 ns         1350 ns      3.6x
std::for_each_n(deque<int>)/16384              9786 ns         2774 ns      3.5x
std::for_each_n(deque<int>)/65536             39026 ns        11339 ns      3.4x
std::for_each_n(deque<int>)/262144           157897 ns        45166 ns      3.5x
std::for_each_n(deque<int>)/1048576          643836 ns       184999 ns      3.5x
rng::for_each_n(deque<int>)/8                  4.85 ns         4.94 ns      1.0x
rng::for_each_n(deque<int>)/32                 18.1 ns         8.47 ns      2.1x
rng::for_each_n(deque<int>)/1024                622 ns          171 ns      3.6x
rng::for_each_n(deque<int>)/8192               5008 ns         1363 ns      3.7x
rng::for_each_n(deque<int>)/16384              9952 ns         2744 ns      3.6x
rng::for_each_n(deque<int>)/65536             40204 ns        10841 ns      3.7x
rng::for_each_n(deque<int>)/262144           157713 ns        43386 ns      3.6x
rng::for_each_n(deque<int>)/1048576          637549 ns       177042 ns      3.6x
std::for_each_n(vector<int>)/8                 2.91 ns         2.94 ns      1.0x
std::for_each_n(vector<int>)/32                5.42 ns         5.54 ns      1.0x
std::for_each_n(vector<int>)/1024               161 ns          165 ns      1.0x
std::for_each_n(vector<int>)/8192              1271 ns         1292 ns      1.0x
std::for_each_n(vector<int>)/16384             2556 ns         2619 ns      1.0x
std::for_each_n(vector<int>)/65536            10125 ns        10659 ns      1.0x
std::for_each_n(vector<int>)/262144           44572 ns        44372 ns      1.0x
std::for_each_n(vector<int>)/1048576         180804 ns       183389 ns      1.0x
rng::for_each_n(vector<int>)/8                 3.05 ns         3.05 ns      1.0x
rng::for_each_n(vector<int>)/32                5.71 ns         5.85 ns      1.0x
rng::for_each_n(vector<int>)/1024               167 ns          183 ns      0.9x
rng::for_each_n(vector<int>)/8192              1298 ns         1429 ns      0.9x
rng::for_each_n(vector<int>)/16384             2691 ns         2870 ns      0.9x
rng::for_each_n(vector<int>)/65536            10632 ns        11465 ns      0.9x
rng::for_each_n(vector<int>)/262144           53031 ns        45948 ns      1.2x
rng::for_each_n(vector<int>)/1048576         174328 ns       184270 ns      0.9x

`for_each`

--------------------------------------------------------------------------------
Benchmark                                     Before           After     Speedup
--------------------------------------------------------------------------------
std::for_each(deque<int>)/8                  3.18 ns         2.96 ns        1.1x
std::for_each(deque<int>)/32                 5.70 ns         5.54 ns        1.0x
std::for_each(deque<int>)/1024                183 ns          180 ns        1.0x
std::for_each(deque<int>)/8192               1435 ns         1422 ns        1.0x
std::for_each(deque<int>)/16384              2885 ns         2879 ns        1.0x
std::for_each(deque<int>)/65536             11423 ns        11378 ns        1.0x
std::for_each(deque<int>)/262144            45203 ns        43686 ns        1.0x
std::for_each(deque<int>)/1048576          181832 ns       173832 ns        1.0x
rng::for_each(deque<int>)/8                  5.10 ns         3.75 ns        1.4x
rng::for_each(deque<int>)/32                 23.5 ns         7.49 ns        3.1x
rng::for_each(deque<int>)/1024                693 ns          184 ns        3.8x
rng::for_each(deque<int>)/8192               5522 ns         1430 ns        3.9x
rng::for_each(deque<int>)/16384             11112 ns         2930 ns        3.8x
rng::for_each(deque<int>)/65536             44390 ns        11656 ns        3.8x
rng::for_each(deque<int>)/262144           179419 ns        46582 ns        3.9x
rng::for_each(deque<int>)/1048576          711406 ns       189658 ns        3.8x
std::for_each(vector<int>)/8                 2.96 ns         2.91 ns        1.0x
std::for_each(vector<int>)/32                5.54 ns         5.49 ns        1.0x
std::for_each(vector<int>)/1024               165 ns          162 ns        1.0x
std::for_each(vector<int>)/8192              1269 ns         1257 ns        1.0x
std::for_each(vector<int>)/16384             2636 ns         2567 ns        1.0x
std::for_each(vector<int>)/65536            10231 ns        10215 ns        1.0x
std::for_each(vector<int>)/262144           41544 ns        40719 ns        1.0x
std::for_each(vector<int>)/1048576         173667 ns       167878 ns        1.0x
rng::for_each(vector<int>)/8                 3.09 ns         3.06 ns        1.0x
rng::for_each(vector<int>)/32                5.85 ns         5.77 ns        1.0x
rng::for_each(vector<int>)/1024               179 ns          168 ns        1.1x
rng::for_each(vector<int>)/8192              1346 ns         1309 ns        1.0x
rng::for_each(vector<int>)/16384             2714 ns         2664 ns        1.0x
rng::for_each(vector<int>)/65536            10979 ns        10523 ns        1.0x
rng::for_each(vector<int>)/262144           42994 ns        42535 ns        1.0x
rng::for_each(vector<int>)/1048576         175633 ns       173933 ns        1.0x

Full diff: https://github.com/llvm/llvm-project/pull/132896.diff

8 Files Affected:

(modified) libcxx/include/__algorithm/for_each_n.h (+24-1)
(modified) libcxx/include/__algorithm/ranges_for_each.h (+11-3)
(modified) libcxx/include/__algorithm/ranges_for_each_n.h (+11-4)
(added) libcxx/test/benchmarks/algorithms/nonmodifying/for_each_n.bench.cpp (+57)
(modified) libcxx/test/libcxx/algorithms/ranges_robust_against_copying_comparators.pass.cpp (+1-1)
(modified) libcxx/test/std/algorithms/alg.nonmodifying/alg.foreach/for_each_n.pass.cpp (+82-38)
(modified) libcxx/test/std/algorithms/alg.nonmodifying/alg.foreach/ranges.for_each.pass.cpp (+41-5)
(modified) libcxx/test/std/algorithms/alg.nonmodifying/alg.foreach/ranges.for_each_n.pass.cpp (+44-2)

diff --git a/libcxx/include/__algorithm/for_each_n.h b/libcxx/include/__algorithm/for_each_n.h
index fce380b49df3e..3d91124432f56 100644
--- a/libcxx/include/__algorithm/for_each_n.h
+++ b/libcxx/include/__algorithm/for_each_n.h
@@ -10,7 +10,11 @@
 #ifndef _LIBCPP___ALGORITHM_FOR_EACH_N_H
 #define _LIBCPP___ALGORITHM_FOR_EACH_N_H
 
+#include <__algorithm/for_each.h>
 #include <__config>
+#include <__iterator/iterator_traits.h>
+#include <__iterator/segmented_iterator.h>
+#include <__type_traits/enable_if.h>
 #include <__utility/convert_to_integral.h>
 
 #if !defined(_LIBCPP_HAS_NO_PRAGMA_SYSTEM_HEADER)
@@ -21,7 +25,13 @@ _LIBCPP_BEGIN_NAMESPACE_STD
 
 #if _LIBCPP_STD_VER >= 17
 
-template <class _InputIterator, class _Size, class _Function>
+template <class _InputIterator,
+          class _Size,
+          class _Function,
+          __enable_if_t<!__is_segmented_iterator<_InputIterator>::value ||
+                            (__has_input_iterator_category<_InputIterator>::value &&
+                             !__has_random_access_iterator_category<_InputIterator>::value),
+                        int> = 0>
 inline _LIBCPP_HIDE_FROM_ABI _LIBCPP_CONSTEXPR_SINCE_CXX20 _InputIterator
 for_each_n(_InputIterator __first, _Size __orig_n, _Function __f) {
   typedef decltype(std::__convert_to_integral(__orig_n)) _IntegralSize;
@@ -34,6 +44,19 @@ for_each_n(_InputIterator __first, _Size __orig_n, _Function __f) {
   return __first;
 }
 
+template <class _InputIterator,
+          class _Size,
+          class _Function,
+          __enable_if_t<__is_segmented_iterator<_InputIterator>::value &&
+                            __has_random_access_iterator_category<_InputIterator>::value,
+                        int> = 0>
+inline _LIBCPP_HIDE_FROM_ABI _LIBCPP_CONSTEXPR_SINCE_CXX20 _InputIterator
+for_each_n(_InputIterator __first, _Size __orig_n, _Function __f) {
+  _InputIterator __last = __first + __orig_n;
+  std::for_each(__first, __last, __f);
+  return __last;
+}
+
 #endif
 
 _LIBCPP_END_NAMESPACE_STD
diff --git a/libcxx/include/__algorithm/ranges_for_each.h b/libcxx/include/__algorithm/ranges_for_each.h
index de39bc5522753..475f85366188e 100644
--- a/libcxx/include/__algorithm/ranges_for_each.h
+++ b/libcxx/include/__algorithm/ranges_for_each.h
@@ -9,6 +9,7 @@
 #ifndef _LIBCPP___ALGORITHM_RANGES_FOR_EACH_H
 #define _LIBCPP___ALGORITHM_RANGES_FOR_EACH_H
 
+#include <__algorithm/for_each.h>
 #include <__algorithm/in_fun_result.h>
 #include <__config>
 #include <__functional/identity.h>
@@ -41,9 +42,16 @@ struct __for_each {
   template <class _Iter, class _Sent, class _Proj, class _Func>
   _LIBCPP_HIDE_FROM_ABI constexpr static for_each_result<_Iter, _Func>
   __for_each_impl(_Iter __first, _Sent __last, _Func& __func, _Proj& __proj) {
-    for (; __first != __last; ++__first)
-      std::invoke(__func, std::invoke(__proj, *__first));
-    return {std::move(__first), std::move(__func)};
+    if constexpr (random_access_iterator<_Iter> && sized_sentinel_for<_Sent, _Iter>) {
+      auto __n   = __last - __first;
+      auto __end = __first + __n;
+      std::for_each(__first, __end, [&](auto&& __val) { std::invoke(__func, std::invoke(__proj, __val)); });
+      return {std::move(__end), std::move(__func)};
+    } else {
+      for (; __first != __last; ++__first)
+        std::invoke(__func, std::invoke(__proj, *__first));
+      return {std::move(__first), std::move(__func)};
+    }
   }
 
 public:
diff --git a/libcxx/include/__algorithm/ranges_for_each_n.h b/libcxx/include/__algorithm/ranges_for_each_n.h
index 603cb723233c8..3108d66001295 100644
--- a/libcxx/include/__algorithm/ranges_for_each_n.h
+++ b/libcxx/include/__algorithm/ranges_for_each_n.h
@@ -9,6 +9,7 @@
 #ifndef _LIBCPP___ALGORITHM_RANGES_FOR_EACH_N_H
 #define _LIBCPP___ALGORITHM_RANGES_FOR_EACH_N_H
 
+#include <__algorithm/for_each.h>
 #include <__algorithm/in_fun_result.h>
 #include <__config>
 #include <__functional/identity.h>
@@ -40,11 +41,17 @@ struct __for_each_n {
   template <input_iterator _Iter, class _Proj = identity, indirectly_unary_invocable<projected<_Iter, _Proj>> _Func>
   _LIBCPP_HIDE_FROM_ABI constexpr for_each_n_result<_Iter, _Func>
   operator()(_Iter __first, iter_difference_t<_Iter> __count, _Func __func, _Proj __proj = {}) const {
-    while (__count-- > 0) {
-      std::invoke(__func, std::invoke(__proj, *__first));
-      ++__first;
+    if constexpr (random_access_iterator<_Iter>) {
+      auto __last = __first + __count;
+      std::for_each(__first, __last, [&](auto&& __val) { std::invoke(__func, std::invoke(__proj, __val)); });
+      return {std::move(__last), std::move(__func)};
+    } else {
+      while (__count-- > 0) {
+        std::invoke(__func, std::invoke(__proj, *__first));
+        ++__first;
+      }
+      return {std::move(__first), std::move(__func)};
     }
-    return {std::move(__first), std::move(__func)};
   }
 };
 
diff --git a/libcxx/test/benchmarks/algorithms/nonmodifying/for_each_n.bench.cpp b/libcxx/test/benchmarks/algorithms/nonmodifying/for_each_n.bench.cpp
new file mode 100644
index 0000000000000..af46371881577
--- /dev/null
+++ b/libcxx/test/benchmarks/algorithms/nonmodifying/for_each_n.bench.cpp
@@ -0,0 +1,57 @@
+//===----------------------------------------------------------------------===//
+//
+// Part of the LLVM Project, under the Apache License v2.0 with LLVM Exceptions.
+// See https://llvm.org/LICENSE.txt for license information.
+// SPDX-License-Identifier: Apache-2.0 WITH LLVM-exception
+//
+//===----------------------------------------------------------------------===//
+
+// UNSUPPORTED: c++03, c++11, c++14, c++17
+
+#include <algorithm>
+#include <cstddef>
+#include <deque>
+#include <list>
+#include <string>
+#include <vector>
+
+#include <benchmark/benchmark.h>
+
+int main(int argc, char** argv) {
+  auto std_for_each_n = [](auto first, auto n, auto f) { return std::for_each_n(first, n, f); };
+
+  // {std,ranges}::for_each_n
+  {
+    auto bm = []<class Container>(std::string name, auto for_each_n) {
+      benchmark::RegisterBenchmark(
+          name,
+          [for_each_n](auto& st) {
+            std::size_t const n = st.range(0);
+            Container c(n, 1);
+            auto first = c.begin();
+
+            for ([[maybe_unused]] auto _ : st) {
+              benchmark::DoNotOptimize(c);
+              auto result = for_each_n(first, n, [](int& x) { x = std::clamp(x, 10, 100); });
+              benchmark::DoNotOptimize(result);
+            }
+          })
+          ->Arg(8)
+          ->Arg(32)
+          ->Arg(50) // non power-of-two
+          ->Arg(8192)
+          ->Arg(1 << 20);
+    };
+    bm.operator()<std::vector<int>>("std::for_each_n(vector<int>)", std_for_each_n);
+    bm.operator()<std::deque<int>>("std::for_each_n(deque<int>)", std_for_each_n);
+    bm.operator()<std::list<int>>("std::for_each_n(list<int>)", std_for_each_n);
+    bm.operator()<std::vector<int>>("rng::for_each_n(vector<int>)", std::ranges::for_each_n);
+    bm.operator()<std::deque<int>>("rng::for_each_n(deque<int>)", std::ranges::for_each_n);
+    bm.operator()<std::list<int>>("rng::for_each_n(list<int>)", std::ranges::for_each_n);
+  }
+
+  benchmark::Initialize(&argc, argv);
+  benchmark::RunSpecifiedBenchmarks();
+  benchmark::Shutdown();
+  return 0;
+}
diff --git a/libcxx/test/libcxx/algorithms/ranges_robust_against_copying_comparators.pass.cpp b/libcxx/test/libcxx/algorithms/ranges_robust_against_copying_comparators.pass.cpp
index dd026444330ea..beb4c7f675a6e 100644
--- a/libcxx/test/libcxx/algorithms/ranges_robust_against_copying_comparators.pass.cpp
+++ b/libcxx/test/libcxx/algorithms/ranges_robust_against_copying_comparators.pass.cpp
@@ -258,7 +258,7 @@ constexpr bool all_the_algorithms()
 int main(int, char**)
 {
     all_the_algorithms();
-    static_assert(all_the_algorithms());
+    // static_assert(all_the_algorithms());
 
     return 0;
 }
diff --git a/libcxx/test/std/algorithms/alg.nonmodifying/alg.foreach/for_each_n.pass.cpp b/libcxx/test/std/algorithms/alg.nonmodifying/alg.foreach/for_each_n.pass.cpp
index 371f6c92f1ed1..42f1a41a27096 100644
--- a/libcxx/test/std/algorithms/alg.nonmodifying/alg.foreach/for_each_n.pass.cpp
+++ b/libcxx/test/std/algorithms/alg.nonmodifying/alg.foreach/for_each_n.pass.cpp
@@ -13,69 +13,113 @@
 //    constexpr InputIterator      // constexpr after C++17
 //    for_each_n(InputIterator first, Size n, Function f);
 
-
 #include <algorithm>
 #include <cassert>
+#include <deque>
 #include <functional>
+#include <iterator>
+#include <ranges>
+#include <vector>
 
 #include "test_macros.h"
 #include "test_iterators.h"
 
-#if TEST_STD_VER > 17
-TEST_CONSTEXPR bool test_constexpr() {
-    int ia[] = {1, 3, 6, 7};
-    int expected[] = {3, 5, 8, 9};
-    const std::size_t N = 4;
+struct for_each_test {
+  TEST_CONSTEXPR for_each_test(int c) : count(c) {}
+  int count;
+  TEST_CONSTEXPR_CXX14 void operator()(int& i) {
+    ++i;
+    ++count;
+  }
+};
 
-    auto it = std::for_each_n(std::begin(ia), N, [](int &a) { a += 2; });
-    return it == (std::begin(ia) + N)
-        && std::equal(std::begin(ia), std::end(ia), std::begin(expected))
-        ;
-    }
-#endif
+struct deque_test {
+  std::deque<int>* d_;
+  int* i_;
+
+  deque_test(std::deque<int>& d, int& i) : d_(&d), i_(&i) {}
 
-struct for_each_test
-{
-    for_each_test(int c) : count(c) {}
-    int count;
-    void operator()(int& i) {++i; ++count;}
+  void operator()(int& v) {
+    assert(&(*d_)[*i_] == &v);
+    ++*i_;
+  }
 };
 
-int main(int, char**)
-{
+/*TEST_CONSTEXPR_CXX23*/
+void test_segmented_deque_iterator() { // TODO: Mark as TEST_CONSTEXPR_CXX23 once std::deque is constexpr
+  // check that segmented iterators work properly
+  int sizes[] = {0, 1, 2, 1023, 1024, 1025, 2047, 2048, 2049};
+  for (const int size : sizes) {
+    std::deque<int> d(size);
+    int index = 0;
+
+    std::for_each_n(d.begin(), d.size(), deque_test(d, index));
+  }
+}
+
+TEST_CONSTEXPR_CXX20 bool test() {
+  {
     typedef cpp17_input_iterator<int*> Iter;
-    int ia[] = {0, 1, 2, 3, 4, 5};
-    const unsigned s = sizeof(ia)/sizeof(ia[0]);
+    int ia[]         = {0, 1, 2, 3, 4, 5};
+    const unsigned s = sizeof(ia) / sizeof(ia[0]);
 
     {
-    auto f = for_each_test(0);
-    Iter it = std::for_each_n(Iter(ia), 0, std::ref(f));
-    assert(it == Iter(ia));
-    assert(f.count == 0);
+      auto f  = for_each_test(0);
+      Iter it = std::for_each_n(Iter(ia), 0, std::ref(f));
+      assert(it == Iter(ia));
+      assert(f.count == 0);
     }
 
     {
-    auto f = for_each_test(0);
-    Iter it = std::for_each_n(Iter(ia), s, std::ref(f));
+      auto f  = for_each_test(0);
+      Iter it = std::for_each_n(Iter(ia), s, std::ref(f));
 
-    assert(it == Iter(ia+s));
-    assert(f.count == s);
-    for (unsigned i = 0; i < s; ++i)
-        assert(ia[i] == static_cast<int>(i+1));
+      assert(it == Iter(ia + s));
+      assert(f.count == s);
+      for (unsigned i = 0; i < s; ++i)
+        assert(ia[i] == static_cast<int>(i + 1));
     }
 
     {
-    auto f = for_each_test(0);
-    Iter it = std::for_each_n(Iter(ia), 1, std::ref(f));
+      auto f  = for_each_test(0);
+      Iter it = std::for_each_n(Iter(ia), 1, std::ref(f));
 
-    assert(it == Iter(ia+1));
-    assert(f.count == 1);
-    for (unsigned i = 0; i < 1; ++i)
-        assert(ia[i] == static_cast<int>(i+2));
+      assert(it == Iter(ia + 1));
+      assert(f.count == 1);
+      for (unsigned i = 0; i < 1; ++i)
+        assert(ia[i] == static_cast<int>(i + 2));
     }
+  }
+
+#if TEST_STD_VER > 11
+  {
+    int ia[]            = {1, 3, 6, 7};
+    int expected[]      = {3, 5, 8, 9};
+    const std::size_t N = 4;
+
+    auto it = std::for_each_n(std::begin(ia), N, [](int& a) { a += 2; });
+    assert(it == (std::begin(ia) + N) && std::equal(std::begin(ia), std::end(ia), std::begin(expected)));
+  }
+#endif
+
+  if (!TEST_IS_CONSTANT_EVALUATED) // TODO: Use TEST_STD_AT_LEAST_23_OR_RUNTIME_EVALUATED when std::deque is made constexpr
+    test_segmented_deque_iterator();
+
+#if TEST_STD_VER >= 20
+  { // Make sure that the segmented iterator optimization works during constant evaluation
+    std::vector<std::vector<int>> vec = {{0}, {1, 2}, {3, 4, 5}, {6, 7, 8, 9}, {10}, {11, 12, 13}};
+    auto v                            = vec | std::views::join;
+    std::for_each_n(v.begin(), std::ranges::distance(v), [i = 0](int& a) mutable { assert(a == i++); });
+  }
+#endif
+
+  return true;
+}
 
+int main(int, char**) {
+  assert(test());
 #if TEST_STD_VER > 17
-    static_assert(test_constexpr());
+  static_assert(test());
 #endif
 
   return 0;
diff --git a/libcxx/test/std/algorithms/alg.nonmodifying/alg.foreach/ranges.for_each.pass.cpp b/libcxx/test/std/algorithms/alg.nonmodifying/alg.foreach/ranges.for_each.pass.cpp
index 8b9b6e82cbcb2..2f4bfb9db6dba 100644
--- a/libcxx/test/std/algorithms/alg.nonmodifying/alg.foreach/ranges.for_each.pass.cpp
+++ b/libcxx/test/std/algorithms/alg.nonmodifying/alg.foreach/ranges.for_each.pass.cpp
@@ -20,7 +20,10 @@
 
 #include <algorithm>
 #include <array>
+#include <cassert>
+#include <deque>
 #include <ranges>
+#include <vector>
 
 #include "almost_satisfies_types.h"
 #include "test_iterators.h"
@@ -30,7 +33,7 @@ struct Callable {
 };
 
 template <class Iter, class Sent = Iter>
-concept HasForEachIt = requires (Iter iter, Sent sent) { std::ranges::for_each(iter, sent, Callable{}); };
+concept HasForEachIt = requires(Iter iter, Sent sent) { std::ranges::for_each(iter, sent, Callable{}); };
 
 static_assert(HasForEachIt<int*>);
 static_assert(!HasForEachIt<InputIteratorNotDerivedFrom>);
@@ -47,7 +50,7 @@ static_assert(!HasForEachItFunc<IndirectUnaryPredicateNotPredicate>);
 static_assert(!HasForEachItFunc<IndirectUnaryPredicateNotCopyConstructible>);
 
 template <class Range>
-concept HasForEachR = requires (Range range) { std::ranges::for_each(range, Callable{}); };
+concept HasForEachR = requires(Range range) { std::ranges::for_each(range, Callable{}); };
 
 static_assert(HasForEachR<UncheckedRange<int*>>);
 static_assert(!HasForEachR<InputRangeNotDerivedFrom>);
@@ -68,7 +71,7 @@ constexpr void test_iterator() {
   { // simple test
     {
       auto func = [i = 0](int& a) mutable { a += i++; };
-      int a[] = {1, 6, 3, 4};
+      int a[]   = {1, 6, 3, 4};
       std::same_as<std::ranges::for_each_result<Iter, decltype(func)>> decltype(auto) ret =
           std::ranges::for_each(Iter(a), Sent(Iter(a + 4)), func);
       assert(a[0] == 1);
@@ -81,8 +84,8 @@ constexpr void test_iterator() {
       assert(i == 4);
     }
     {
-      auto func = [i = 0](int& a) mutable { a += i++; };
-      int a[] = {1, 6, 3, 4};
+      auto func  = [i = 0](int& a) mutable { a += i++; };
+      int a[]    = {1, 6, 3, 4};
       auto range = std::ranges::subrange(Iter(a), Sent(Iter(a + 4)));
       std::same_as<std::ranges::for_each_result<Iter, decltype(func)>> decltype(auto) ret =
           std::ranges::for_each(range, func);
@@ -110,6 +113,30 @@ constexpr void test_iterator() {
   }
 }
 
+struct deque_test {
+  std::deque<int>* d_;
+  int* i_;
+
+  deque_test(std::deque<int>& d, int& i) : d_(&d), i_(&i) {}
+
+  void operator()(int& v) {
+    assert(&(*d_)[*i_] == &v);
+    ++*i_;
+  }
+};
+
+/*TEST_CONSTEXPR_CXX23*/
+void test_segmented_deque_iterator() { // TODO: Mark as TEST_CONSTEXPR_CXX23 once std::deque is constexpr
+  // check that segmented iterators work properly
+  int sizes[] = {0, 1, 2, 1023, 1024, 1025, 2047, 2048, 2049};
+  for (const int size : sizes) {
+    std::deque<int> d(size);
+    int index = 0;
+
+    std::ranges::for_each(d, deque_test(d, index));
+  }
+}
+
 constexpr bool test() {
   test_iterator<cpp17_input_iterator<int*>, sentinel_wrapper<cpp17_input_iterator<int*>>>();
   test_iterator<cpp20_input_iterator<int*>, sentinel_wrapper<cpp20_input_iterator<int*>>>();
@@ -146,6 +173,15 @@ constexpr bool test() {
     }
   }
 
+  if (!TEST_IS_CONSTANT_EVALUATED) // TODO: Use TEST_STD_AT_LEAST_23_OR_RUNTIME_EVALUATED when std::deque is made constexpr
+    test_segmented_deque_iterator();
+
+  {
+    std::vector<std::vector<int>> vec = {{0}, {1, 2}, {3, 4, 5}, {6, 7, 8, 9}, {10}, {11, 12, 13}};
+    auto v                            = vec | std::views::join;
+    std::ranges::for_each(v, [i = 0](int x) mutable { assert(x == 2 * i++); }, [](int x) { return 2 * x; });
+  }
+
   return true;
 }
 
diff --git a/libcxx/test/std/algorithms/alg.nonmodifying/alg.foreach/ranges.for_each_n.pass.cpp b/libcxx/test/std/algorithms/alg.nonmodifying/alg.foreach/ranges.for_each_n.pass.cpp
index d4b2d053d08ce..ad1447b7348f5 100644
--- a/libcxx/test/std/algorithms/alg.nonmodifying/alg.foreach/ranges.for_each_n.pass.cpp
+++ b/libcxx/test/std/algorithms/alg.nonmodifying/alg.foreach/ranges.for_each_n.pass.cpp
@@ -17,7 +17,12 @@
 
 #include <algorithm>
 #include <array>
+#include <cassert>
+#include <deque>
+#include <iterator>
 #include <ranges>
+#include <ranges>
+#include <vector>
 
 #include "almost_satisfies_types.h"
 #include "test_iterators.h"
@@ -27,7 +32,7 @@ struct Callable {
 };
 
 template <class Iter>
-concept HasForEachN = requires (Iter iter) { std::ranges::for_each_n(iter, 0, Callable{}); };
+concept HasForEachN = requires(Iter iter) { std::ranges::for_each_n(iter, 0, Callable{}); };
 
 static_assert(HasForEachN<int*>);
 static_assert(!HasForEachN<InputIteratorNotDerivedFrom>);
@@ -45,7 +50,7 @@ template <class Iter>
 constexpr void test_iterator() {
   { // simple test
     auto func = [i = 0](int& a) mutable { a += i++; };
-    int a[] = {1, 6, 3, 4};
+    int a[]   = {1, 6, 3, 4};
     std::same_as<std::ranges::for_each_result<Iter, decltype(func)>> auto ret =
         std::ranges::for_each_n(Iter(a), 4, func);
     assert(a[0] == 1);
@@ -64,6 +69,30 @@ constexpr void test_iterator() {
   }
 }
 
+struct deque_test {
+  std::deque<int>* d_;
+  int* i_;
+
+  deque_test(std::deque<int>& d, int& i) : d_(&d), i_(&i) {}
+
+  void operator()(int& v) {
+    assert(&(*d_)[*i_] == &v);
+    ++*i_;
+  }
+};
+
+/*TEST_CONSTEXPR_CXX23*/
+void test_segmented_deque_iterator() { // TODO: Mark as TEST_CONSTEXPR_CXX23 once std::deque is constexpr
+  // check that segmented iterators work properly
+  int sizes[] = {0, 1, 2, 1023, 1024, 1025, 2047, 2048, 2049};
+  for (const int size : sizes) {
+    std::deque<int> d(size);
+    int index = 0;
+
+    std::ranges::for_each_n(d.begin(), d.size(), deque_test(d, index));
+  }
+}
+
 constexpr bool test() {
   test_iterator<cpp17_input_iterator<int*>>();
   test_iterator<cpp20_input_iterator<int*>>();
@@ -89,6 +118,19 @@ constexpr bool test() {
     assert(a[2].other == 6);
   }
 
+  if (!TEST_IS_CONSTANT_EVALUATED) // TODO: Use TEST_STD_AT_LEAST_23_OR_RUNTIME_EVALUATED when std::deque is made constexpr
+    test_segmented_deque_iterator();
+
+  {
+    std::vector<std::vector<int>> vec = {{0}, {1, 2}, {3, 4, 5}, {6, 7, 8, 9}, {10}, {11, 12, 13}};
+    auto v                            = vec | std::views::join;
+    std::ranges::for_each_n(
+        v.begin(),
+        std::ranges::distance(v),
+        [i = 0](int x) mutable { assert(x == 2 * i++); },
+        [](int x) { return 2 * x; });
+  }
+
   return true;
 }

ldionne

Thanks for the patch! I left some comments but I think this is going to be a nice optimization.

philnik777

I feel like the scope of this patch is getting a bit out of hand. The title says that you're optimizing ranges::for_each{,_n}, but you're also back-porting the std::for_each optimization to C++03 and adding an optimization to std::for_each_n. Could we split this up to make it clear what changes are required for what optimizations? Also, why do we want to back-port the std::for_each optimization now? Do we think the extra complexity is worth the improved performance?

winner245 · 2025-04-05T14:11:38Z

> I feel like the scope of this patch is getting a bit out of hand. The title says that you're optimizing ranges::for_each{,_n}, but you're also back-porting the std::for_each optimization to C++03 and adding an optimization to std::for_each_n. Could we split this up to make it clear what changes are required for what optimizations? Also, why do we want to back-port the std::for_each optimization now? Do we think the extra complexity is worth the improved performance?

Thank you for your feedback! I agree that the scope of the patch has expanded beyond its original intent. Initially, the goal was simple: only to extend the optimization for std::for_each to its variants ranges::for_each{,_n}. However, as the review and revision progressed, I aimed to address the inconsistent segmented iterator optimization support between for_each_n and for_each, as the optimization for for_each_n includes C++03. I think back-porting the optimization for std::for_each to C++03 could be useful as we may be able to extend the optimization to other algorithms by letting them simply forward to std::for_each (as per your comment in another PR).

However, I agree that this made the patch diverge from its original purpose and may complicate the review process. Following your suggestion, I will work on splitting it to make it clear what this patch focuses on.

-------------- Update --------------
As per your suggestion, I have split this into the following PRs, each focusing on an independent and self-contained subtask for the classical algorithms:

This separation allows the current PR to focus exclusively on the optimization of the ranges algorithms. I will rebase my current patch on the above split pieces once they have landed.

github-actions · 2025-05-22T21:53:28Z

✅ With the latest revision this PR passed the C/C++ code formatter.

winner245 · 2025-06-02T10:29:10Z

With std::for_each backported to C++11 in #134960 and std::for_each_n carved out into #135468, this PR is now much cleaner, focusing exclusively on std::ranges::{for_each, for_each_n}.

ldionne

LGTM once comments are addressed. Thanks a lot for this series of refactorings / optimizations!

ldionne · 2025-06-04T16:37:51Z

libcxx/docs/ReleaseNotes/21.rst

+  resulting in performance improvements of up to 21.3x for ``std::deque::iterator`` and 24.9x for ``join_view`` of
+  ``vector<vector<char>>``.


We should report this optimization on the same line as the std::for_each optimization above -- I don't think there is much to be gained from having nearly-duplicate release notes since these algorithms are very similar. While we aim for a good level of completeness in our release notes, we also want to make them as useful to users as possible.

I've rerun the benchmarks multiple times and got similar, consistent speedups for the ranges algorithms. It is a bit strange that these numbers are greater than those reported earlier for the classical std algorithms; ideally they should match, and I haven't identified a clear reason why they don't. My guess is that the earlier numbers compared std::for_each with and without the segmented iterator optimization, while the numbers in this patch compare std::ranges::for_each with and without it. The std::for_each comparisons did not include the overhead of the std::invoke and projection calls, whereas the comparisons for the ranges algorithms do. This overhead might account for the difference, and it is the only explanation I can think of at the moment.

To avoid confusion, I will not report these numbers in this patch. Instead, I will stick to the previously reported and smaller numbers (which suffice to show the performance improvements).

ldionne · 2025-06-04T16:53:56Z

libcxx/include/__algorithm/ranges_for_each.h

-    for (; __first != __last; ++__first)
-      std::invoke(__func, std::invoke(__proj, *__first));
-    return {std::move(__first), std::move(__func)};
+    if constexpr (!std::assignable_from<_Iter&, _Sent> && sized_sentinel_for<_Sent, _Iter>) {


Suggested change

-    if constexpr (!std::assignable_from<_Iter&, _Sent> && sized_sentinel_for<_Sent, _Iter>) {
+    // In the case where we have different iterator and sentinel types, the segmented iterator optimization
+    // in std::for_each will not kick in. Therefore, we prefer std::for_each_n in that case (whenever we can
+    // obtain the `n`).
+    if constexpr (!std::assignable_from<_Iter&, _Sent> && sized_sentinel_for<_Sent, _Iter>) {

ldionne · 2025-06-04T16:56:36Z

libcxx/test/benchmarks/algorithms/nonmodifying/for_each.bench.cpp

+          ->Arg(1024)
+          ->Arg(4096)
          ->Arg(8192)
-          ->Arg(1 << 20);
+          ->Arg(1 << 14)
+          ->Arg(1 << 16)
+          ->Arg(1 << 18);


I believe it would be better to leave the old benchmark values in place. They are less comprehensive but we need to achieve a tradeoff between comprehensiveness and the time it takes to run these benchmarks.

Previously, we ran a test case with a very large n = 1 << 20. To save some time, I replaced this large test case with three smaller test cases with n = 1 << 14, 1 << 16, and 1 << 18. I think the total execution time of these three test cases is actually lower than that of the single test case with n = 1 << 20. Please let me know if I misunderstood you.
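
The element counts behind this claim can be checked with a few lines of arithmetic. The sketch below assumes that the cost of one pass is proportional to the number of elements visited; a benchmark harness that runs each case for a fixed minimum time would not necessarily see the same saving. The constant names are hypothetical.

```cpp
#include <cassert>
#include <cstddef>

// Elements visited in one pass of the removed large case.
constexpr std::size_t old_large_case = std::size_t{1} << 20;

// Elements visited in one pass of each of the three replacement cases, summed.
constexpr std::size_t new_cases =
    (std::size_t{1} << 14) + (std::size_t{1} << 16) + (std::size_t{1} << 18);

static_assert(old_large_case == 1'048'576);
static_assert(new_cases == 344'064);
// The three smaller cases together visit less than a third of the elements.
static_assert(3 * new_cases < old_large_case);
```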

ldionne · 2025-06-04T16:56:45Z

libcxx/test/benchmarks/algorithms/nonmodifying/for_each.bench.cpp

+          ->Arg(8)
+          ->Arg(32)
+          ->Arg(50) // non power-of-two
+          ->Arg(1024)
+          ->Arg(4096)
+          ->Arg(8192)
+          ->Arg(1 << 14)
+          ->Arg(1 << 16)
+          ->Arg(1 << 18);


Same here for the benchmark sizes.

ldionne · 2025-06-04T16:57:46Z

libcxx/test/benchmarks/algorithms/nonmodifying/for_each_n.bench.cpp

    bm.operator()<std::list<int>>("std::for_each_n(list<int>)", std_for_each_n);
+    bm.operator()<std::vector<int>>("rng::for_each_n(vector<int>)", std::ranges::for_each_n);


Let's use the same numbers as for the std::for_each benchmarks.

frederick-vs-ja reviewed Mar 25, 2025

libcxx/include/__algorithm/for_each_n.h (resolved)

winner245 force-pushed the for-each-segment branch from 49011aa to ba1d5d4 on March 25, 2025 15:31

winner245 marked this pull request as ready for review March 25, 2025 15:59

winner245 requested a review from a team as a code owner March 25, 2025 15:59

llvmbot added the libc++ label (libc++ C++ Standard Library. Not GNU libstdc++. Not libc++abi.) Mar 25, 2025

winner245 added the performance label Mar 25, 2025

ldionne reviewed Mar 25, 2025

winner245 mentioned this pull request Mar 26, 2025

[libc++] P3372R3: constexpr deque #128656 (Open)

winner245 force-pushed the for-each-segment branch from ba1d5d4 to c113266 on March 26, 2025 02:03

frederick-vs-ja reviewed Mar 26, 2025

libcxx/test/std/algorithms/alg.nonmodifying/alg.foreach/for_each_n.pass.cpp (outdated, resolved)

winner245 force-pushed the for-each-segment branch from a7041cc to a2e451d on March 26, 2025 15:11

winner245 commented Mar 26, 2025

libcxx/include/__algorithm/for_each.h (outdated, resolved)

winner245 force-pushed the for-each-segment branch 2 times, most recently from 16438be to 047acfd on March 27, 2025 01:08

ldionne requested changes Mar 27, 2025

winner245 force-pushed the for-each-segment branch 3 times, most recently from 0aad396 to 5a7b6eb on March 29, 2025 03:59

winner245 mentioned this pull request Mar 29, 2025

[libc++] Optimize {std,ranges}::{fill,fill_n} for segmented iterators #132665 (Open)

ldionne requested changes Apr 2, 2025

winner245 force-pushed the for-each-segment branch from 198fe3b to f5d13ab on April 3, 2025 16:28

winner245 mentioned this pull request Apr 3, 2025

[libc++] Fix __segmented_iterator_traits for implicit template instantiation in SFINAE #134304 (Closed)

winner245 commented Apr 3, 2025

libcxx/include/__algorithm/for_each_n.h (outdated, resolved)

winner245 force-pushed the for-each-segment branch 3 times, most recently from d14bde4 to 8a5bcdc on April 5, 2025 02:43

philnik777 requested changes Apr 5, 2025

winner245 force-pushed the for-each-segment branch from 8a5bcdc to 5a225dd on May 22, 2025 21:50

winner245 force-pushed the for-each-segment branch from 5a225dd to b366e93 on May 22, 2025 22:36

winner245 force-pushed the for-each-segment branch from b366e93 to 216b957 on June 2, 2025 01:48

ldionne approved these changes Jun 4, 2025

winner245 force-pushed the for-each-segment branch from 275c254 to df7ac69 on June 7, 2025 16:38

winner245 added 12 commits June 7, 2025 14:37

a5148ec  Optimize ranges::{for_each, for_each_n} for segmented iterators
90c826b  Address ldionne's review comments
fae4de0  Fix test and ADL call
37d68a3  Make for_each segmented iterator optimization valid for C++03
2a83548  Allow transitive include of <optional> in affected headers
5cc4af8  Remove unnecessary _AlgoPolicy template parameter
b74e188  Apply optimization for join_view segmented iterators
1f7ad34  Consistently extend segmented iterator optimization to ranges::for_each
ca54b95  Fix review comments
100521b  Fix invoke call by using std::__invoke
05161a1  Refactor to simplify logic of for_each_n_segment.h
b525b74  Address ldionne's comments

winner245 force-pushed the for-each-segment branch from df7ac69 to b525b74 on June 7, 2025 18:39
