
[libc++] Optimize ranges::{for_each, for_each_n} for segmented iterators #132896


Open
wants to merge 12 commits into
base: main

winner245
Contributor

@winner245 winner245 commented Mar 25, 2025

Previously, the segmented iterator optimization was limited to std::{for_each, for_each_n}. This patch extends the optimization to std::ranges::for_each and std::ranges::for_each_n, so that all four algorithms are optimized consistently. The patch first generalizes the std algorithms by introducing a Projection parameter, which is set to __identity for the std algorithms. The ranges algorithms then call their std counterparts directly, passing a general __proj argument. Benchmarks show speedups of up to 21.4x for std::deque::iterator and up to 22.3x for join_view iterators (the peak is for join_view of vector<vector<short>>).

Addresses a subtask of #102817.
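To illustrate the idea described above, here is a minimal sketch of segment-wise iteration with a projection parameter. It is not the libc++ implementation: `sketch::segmented_for_each` and `sketch::identity_proj` are hypothetical names, a `std::vector<std::vector<T>>` stands in for a segmented range, and plain call syntax is used instead of `std::invoke`. The point is only that the inner loop runs over raw pointers within one contiguous segment, which is the kind of loop compilers tend to optimize well.

```cpp
#include <cassert>
#include <utility>
#include <vector>

namespace sketch {

// Hypothetical stand-in for an identity projection (the PR text refers to
// libc++'s internal __identity, which is not used here).
struct identity_proj {
  template <class T>
  T&& operator()(T&& value) const noexcept {
    return std::forward<T>(value);
  }
};

// Assumption: a vector of vectors models a segmented range. The outer loop
// walks the segments; the inner loop walks one contiguous segment through
// raw pointers and applies func to the projected element.
// Note: T = bool is not supported, since std::vector<bool> has no data().
template <class T, class Func, class Proj = identity_proj>
Func segmented_for_each(std::vector<std::vector<T>>& segments, Func func, Proj proj = Proj()) {
  for (std::vector<T>& segment : segments) {
    T* first = segment.data();
    T* last  = first + segment.size();
    for (; first != last; ++first)
      func(proj(*first));
  }
  return func;
}

} // namespace sketch
```

With the default projection this behaves like a std-style for_each over the flattened elements; passing a projection as the third argument gives the ranges-style behaviour, so one implementation can serve both entry points.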

Summary of speedups for deque iterators

-------------------------------------------------------------------------------
Benchmark                        deque<char>    deque<short>    deque<int>
-------------------------------------------------------------------------------
rng::for_each                       14.4x          21.4x           4.6x
rng::for_each_n                     12.9x          15.5x           4.1x
-------------------------------------------------------------------------------

Summary of speedups for join_view iterators

-----------------------------------------------------------------------------------------
Benchmark          vector<vector<char>>    vector<vector<short>>    vector<vector<int>>
-----------------------------------------------------------------------------------------
rng::for_each             19.0x                   22.3x                    4.8x
rng::for_each_n           16.3x                   20.1x                    3.9x
-----------------------------------------------------------------------------------------

Benchmarks:

std::ranges::for_each with deque iterators

--------------------------------------------------------------------------
Benchmark                                    Before        After   Speedup
--------------------------------------------------------------------------
rng::for_each(deque<char>)/8                 8.39 ns      2.63 ns    3.2x
rng::for_each(deque<char>)/32               28.70 ns      3.05 ns    9.4x
rng::for_each(deque<char>)/50               42.00 ns      4.53 ns    9.3x
rng::for_each(deque<char>)/1024            657.00 ns     45.60 ns   14.4x
rng::for_each(deque<char>)/4096           2272.00 ns    169.00 ns   13.4x
rng::for_each(deque<char>)/8192           4525.00 ns    355.00 ns   12.7x
rng::for_each(deque<char>)/16384          9445.00 ns    722.00 ns   13.1x
rng::for_each(deque<char>)/65536         36880.00 ns   2902.00 ns   12.7x
rng::for_each(deque<char>)/262144       157774.00 ns  11577.00 ns   13.6x
rng::for_each(deque<short>)/8                5.70 ns      1.62 ns    3.5x
rng::for_each(deque<short>)/32              26.80 ns      1.69 ns   15.9x
rng::for_each(deque<short>)/50              38.40 ns      3.06 ns   12.5x
rng::for_each(deque<short>)/1024           700.00 ns     40.40 ns   17.3x
rng::for_each(deque<short>)/4096          2782.00 ns    133.00 ns   20.9x
rng::for_each(deque<short>)/8192          5554.00 ns    260.00 ns   21.4x
rng::for_each(deque<short>)/16384        11093.00 ns    521.00 ns   21.3x
rng::for_each(deque<short>)/65536        44035.00 ns   2495.00 ns   17.6x
rng::for_each(deque<short>)/262144      177784.00 ns   9915.00 ns   17.9x
rng::for_each(deque<int>)/8                 5.43 ns       3.00 ns    1.8x
rng::for_each(deque<int>)/32               25.50 ns       5.60 ns    4.6x
rng::for_each(deque<int>)/50               38.50 ns       8.61 ns    4.5x
rng::for_each(deque<int>)/1024            706.00 ns     169.00 ns    4.2x
rng::for_each(deque<int>)/4096           2789.00 ns     670.00 ns    4.2x
rng::for_each(deque<int>)/8192           5547.00 ns    1330.00 ns    4.2x
rng::for_each(deque<int>)/16384         11098.00 ns    2711.00 ns    4.1x
rng::for_each(deque<int>)/65536         44404.00 ns   10709.00 ns    4.1x
rng::for_each(deque<int>)/262144       180739.00 ns   43645.00 ns    4.1x

std::ranges::for_each_n with deque iterators

--------------------------------------------------------------------------
Benchmark                                    Before        After   Speedup
--------------------------------------------------------------------------
rng::for_each_n(deque<char>)/8              8.22 ns       3.28 ns     2.5x
rng::for_each_n(deque<char>)/32             28.5 ns       3.66 ns     7.8x
rng::for_each_n(deque<char>)/50             37.6 ns       6.15 ns     6.1x
rng::for_each_n(deque<char>)/1024            590 ns       47.0 ns    12.6x
rng::for_each_n(deque<char>)/4096           2151 ns        167 ns    12.9x
rng::for_each_n(deque<char>)/8192           4199 ns        344 ns    12.2x
rng::for_each_n(deque<char>)/16384          8626 ns        701 ns    12.3x
rng::for_each_n(deque<char>)/65536         33613 ns       2845 ns    11.8x
rng::for_each_n(deque<char>)/262144       132493 ns      11291 ns    11.7x
rng::for_each_n(deque<short>)/8             6.53 ns       3.72 ns     1.8x
rng::for_each_n(deque<short>)/32            23.2 ns       3.75 ns     6.2x
rng::for_each_n(deque<short>)/50            32.7 ns       5.54 ns     5.9x
rng::for_each_n(deque<short>)/1024           560 ns       37.4 ns    15.0x
rng::for_each_n(deque<short>)/4096          2105 ns        136 ns    15.5x
rng::for_each_n(deque<short>)/8192          3981 ns        264 ns    15.1x
rng::for_each_n(deque<short>)/16384         7736 ns        525 ns    14.7x
rng::for_each_n(deque<short>)/65536        30359 ns       2459 ns    12.3x
rng::for_each_n(deque<short>)/262144      121006 ns       9852 ns    12.3x
rng::for_each_n(deque<int>)/8               5.59 ns       4.16 ns     1.3x
rng::for_each_n(deque<int>)/32              19.9 ns       6.89 ns     2.9x
rng::for_each_n(deque<int>)/50              32.6 ns       10.1 ns     3.2x
rng::for_each_n(deque<int>)/1024             605 ns        180 ns     3.4x
rng::for_each_n(deque<int>)/4096            2517 ns        715 ns     3.5x
rng::for_each_n(deque<int>)/8192            4942 ns       1431 ns     3.5x
rng::for_each_n(deque<int>)/16384           9809 ns       2906 ns     3.4x
rng::for_each_n(deque<int>)/65536          40199 ns      11316 ns     3.6x
rng::for_each_n(deque<int>)/262144        181371 ns      44000 ns     4.1x

std::ranges::for_each with join_view iterators

----------------------------------------------------------------------------------------------------
Benchmark                                                       Before           After       Speedup
----------------------------------------------------------------------------------------------------
rng::for_each(join_view(vector<vector<char>>)/8                7.02 ns         2.58 ns         2.7x
rng::for_each(join_view(vector<vector<char>>)/32               32.1 ns         3.03 ns        10.6x
rng::for_each(join_view(vector<vector<char>>)/50               45.2 ns         5.34 ns         8.5x
rng::for_each(join_view(vector<vector<char>>)/1024              782 ns         43.4 ns        18.0x
rng::for_each(join_view(vector<vector<char>>)/4096             3113 ns          168 ns        18.5x
rng::for_each(join_view(vector<vector<char>>)/8192             6231 ns          339 ns        18.4x
rng::for_each(join_view(vector<vector<char>>)/16384           12783 ns          691 ns        18.5x
rng::for_each(join_view(vector<vector<char>>)/65536           53732 ns         2829 ns        19.0x
rng::for_each(join_view(vector<vector<char>>)/262144         210286 ns        11241 ns        18.7x
rng::for_each(join_view(vector<vector<short>>)/8               7.46 ns         2.40 ns         3.1x
rng::for_each(join_view(vector<vector<short>>)/32              33.4 ns         2.81 ns        11.9x
rng::for_each(join_view(vector<vector<short>>)/50              46.1 ns         5.66 ns         8.1x
rng::for_each(join_view(vector<vector<short>>)/1024             791 ns         37.0 ns        21.4x
rng::for_each(join_view(vector<vector<short>>)/4096            3183 ns          149 ns        21.4x
rng::for_each(join_view(vector<vector<short>>)/8192            6360 ns          292 ns        21.8x
rng::for_each(join_view(vector<vector<short>>)/16384          12825 ns          574 ns        22.3x
rng::for_each(join_view(vector<vector<short>>)/65536          51638 ns         2745 ns        18.8x
rng::for_each(join_view(vector<vector<short>>)/262144        210929 ns        10964 ns        19.2x
rng::for_each(join_view(vector<vector<int>>)/8                 7.04 ns         3.02 ns         2.3x
rng::for_each(join_view(vector<vector<int>>)/32                24.4 ns         6.62 ns         3.7x
rng::for_each(join_view(vector<vector<int>>)/50                47.6 ns         9.91 ns         4.8x
rng::for_each(join_view(vector<vector<int>>)/1024               727 ns          180 ns         4.0x
rng::for_each(join_view(vector<vector<int>>)/4096              3110 ns          748 ns         4.2x
rng::for_each(join_view(vector<vector<int>>)/8192              6193 ns         1480 ns         4.2x
rng::for_each(join_view(vector<vector<int>>)/16384            12391 ns         2993 ns         4.1x
rng::for_each(join_view(vector<vector<int>>)/65536            49505 ns        11950 ns         4.1x
rng::for_each(join_view(vector<vector<int>>)/262144          199253 ns        47921 ns         4.2x

std::ranges::for_each_n with join_view iterators

----------------------------------------------------------------------------------------------------
Benchmark                                                       Before           After       Speedup
----------------------------------------------------------------------------------------------------
rng::for_each_n(join_view(vector<vector<char>>)/8              7.97 ns         2.82 ns         2.8x
rng::for_each_n(join_view(vector<vector<char>>)/32             28.7 ns         3.29 ns         8.7x
rng::for_each_n(join_view(vector<vector<char>>)/50             42.8 ns         6.24 ns         6.9x
rng::for_each_n(join_view(vector<vector<char>>)/1024            728 ns         45.5 ns        16.0x
rng::for_each_n(join_view(vector<vector<char>>)/4096           2891 ns          177 ns        16.3x
rng::for_each_n(join_view(vector<vector<char>>)/8192           5769 ns          359 ns        16.1x
rng::for_each_n(join_view(vector<vector<char>>)/16384         11576 ns          720 ns        16.1x
rng::for_each_n(join_view(vector<vector<char>>)/65536         46525 ns         2889 ns        16.1x
rng::for_each_n(join_view(vector<vector<char>>)/262144       186093 ns        11640 ns        16.0x
rng::for_each_n(join_view(vector<vector<short>>)/8             6.95 ns         3.32 ns         2.1x
rng::for_each_n(join_view(vector<vector<short>>)/32            29.4 ns         3.30 ns         8.9x
rng::for_each_n(join_view(vector<vector<short>>)/50            40.8 ns         5.58 ns         7.3x
rng::for_each_n(join_view(vector<vector<short>>)/1024           719 ns         35.9 ns        20.0x
rng::for_each_n(join_view(vector<vector<short>>)/4096          2875 ns          144 ns        20.0x
rng::for_each_n(join_view(vector<vector<short>>)/8192          5632 ns          283 ns        19.9x
rng::for_each_n(join_view(vector<vector<short>>)/16384        11481 ns          570 ns        20.1x
rng::for_each_n(join_view(vector<vector<short>>)/65536        45355 ns         2616 ns        17.3x
rng::for_each_n(join_view(vector<vector<short>>)/262144      181890 ns        10958 ns        16.6x
rng::for_each_n(join_view(vector<vector<int>>)/8               6.61 ns         3.49 ns         1.9x
rng::for_each_n(join_view(vector<vector<int>>)/32              27.5 ns         7.09 ns         3.9x
rng::for_each_n(join_view(vector<vector<int>>)/50              40.4 ns         10.5 ns         3.8x
rng::for_each_n(join_view(vector<vector<int>>)/1024             674 ns          188 ns         3.6x
rng::for_each_n(join_view(vector<vector<int>>)/4096            2717 ns          766 ns         3.5x
rng::for_each_n(join_view(vector<vector<int>>)/8192            5422 ns         1524 ns         3.6x
rng::for_each_n(join_view(vector<vector<int>>)/16384          11024 ns         3037 ns         3.6x
rng::for_each_n(join_view(vector<vector<int>>)/65536          44197 ns        12159 ns         3.6x
rng::for_each_n(join_view(vector<vector<int>>)/262144        175819 ns        48274 ns         3.6x

@winner245 winner245 marked this pull request as ready for review March 25, 2025 15:59
@winner245 winner245 requested a review from a team as a code owner March 25, 2025 15:59
@llvmbot llvmbot added the libc++ libc++ C++ Standard Library. Not GNU libstdc++. Not libc++abi. label Mar 25, 2025
@llvmbot
Member

llvmbot commented Mar 25, 2025

@llvm/pr-subscribers-libcxx

Author: Peng Liu (winner245)

Changes

This patch extends segmented iterator optimizations, previously applied to std::for_each, to std::for_each_n, std::ranges::for_each, and std::ranges::for_each_n by forwarding to std::for_each. New tests validate these optimizations for segmented iterators (e.g., deque<int> and join_view iterators). Benchmarks demonstrate up to 3.9x performance improvement for deque<int> iterators, aligning their performance with contiguous iterators (e.g., vector<int>). The vector<int> performance serves as a baseline for contiguous iterators, representing the upper bound for segmented deque<int> inputs.

Addresses a subtask of #102817.
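The forwarding described above can be sketched roughly as follows. This is a simplified illustration, not the patch itself: the `sketch` namespace and `for_each_n_impl` are hypothetical names, and the dispatch here is on iterator category alone via tag dispatch, whereas the diff below also checks an internal segmented-iterator trait and uses `enable_if`. For random-access iterators the end iterator is computed up front and the work is handed to `std::for_each`; other iterators keep the counting loop.

```cpp
#include <algorithm>
#include <cassert>
#include <deque>    // used in the usage note
#include <iterator>
#include <list>     // used in the usage note

namespace sketch {

// Random-access path: compute the end iterator and forward to std::for_each,
// so any optimization inside std::for_each is picked up automatically.
template <class Iter, class Size, class Func>
Iter for_each_n_impl(Iter first, Size n, Func f, std::random_access_iterator_tag) {
  typedef typename std::iterator_traits<Iter>::difference_type Diff;
  Iter last = first + static_cast<Diff>(n);
  std::for_each(first, last, f);
  return last;
}

// Fallback path for weaker iterator categories: plain counting loop.
template <class Iter, class Size, class Func>
Iter for_each_n_impl(Iter first, Size n, Func f, std::input_iterator_tag) {
  for (; n > 0; --n, ++first)
    f(*first);
  return first;
}

// Entry point: picks one of the overloads above from the iterator category.
template <class Iter, class Size, class Func>
Iter for_each_n(Iter first, Size n, Func f) {
  typedef typename std::iterator_traits<Iter>::iterator_category Category;
  return for_each_n_impl(first, n, f, Category());
}

} // namespace sketch
```

For example, `sketch::for_each_n(d.begin(), 2500, f)` on a `std::deque<int>` takes the forwarding path, while the same call on a `std::list<int>` iterator falls back to the counting loop. In both cases the returned iterator points one past the last visited element.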

for_each_n

--------------------------------------------------------------------------------
Benchmark                                       Before          After    Speedup
--------------------------------------------------------------------------------
std::for_each_n(deque<int>)/8                  5.31 ns         3.39 ns      1.6x
std::for_each_n(deque<int>)/32                 20.1 ns         6.89 ns      2.9x
std::for_each_n(deque<int>)/1024                612 ns          171 ns      3.6x
std::for_each_n(deque<int>)/8192               4892 ns         1350 ns      3.6x
std::for_each_n(deque<int>)/16384              9786 ns         2774 ns      3.5x
std::for_each_n(deque<int>)/65536             39026 ns        11339 ns      3.4x
std::for_each_n(deque<int>)/262144           157897 ns        45166 ns      3.5x
std::for_each_n(deque<int>)/1048576          643836 ns       184999 ns      3.5x
rng::for_each_n(deque<int>)/8                  4.85 ns         4.94 ns      1.0x
rng::for_each_n(deque<int>)/32                 18.1 ns         8.47 ns      2.1x
rng::for_each_n(deque<int>)/1024                622 ns          171 ns      3.6x
rng::for_each_n(deque<int>)/8192               5008 ns         1363 ns      3.7x
rng::for_each_n(deque<int>)/16384              9952 ns         2744 ns      3.6x
rng::for_each_n(deque<int>)/65536             40204 ns        10841 ns      3.7x
rng::for_each_n(deque<int>)/262144           157713 ns        43386 ns      3.6x
rng::for_each_n(deque<int>)/1048576          637549 ns       177042 ns      3.6x
std::for_each_n(vector<int>)/8                 2.91 ns         2.94 ns      1.0x
std::for_each_n(vector<int>)/32                5.42 ns         5.54 ns      1.0x
std::for_each_n(vector<int>)/1024               161 ns          165 ns      1.0x
std::for_each_n(vector<int>)/8192              1271 ns         1292 ns      1.0x
std::for_each_n(vector<int>)/16384             2556 ns         2619 ns      1.0x
std::for_each_n(vector<int>)/65536            10125 ns        10659 ns      1.0x
std::for_each_n(vector<int>)/262144           44572 ns        44372 ns      1.0x
std::for_each_n(vector<int>)/1048576         180804 ns       183389 ns      1.0x
rng::for_each_n(vector<int>)/8                 3.05 ns         3.05 ns      1.0x
rng::for_each_n(vector<int>)/32                5.71 ns         5.85 ns      1.0x
rng::for_each_n(vector<int>)/1024               167 ns          183 ns      0.9x
rng::for_each_n(vector<int>)/8192              1298 ns         1429 ns      0.9x
rng::for_each_n(vector<int>)/16384             2691 ns         2870 ns      0.9x
rng::for_each_n(vector<int>)/65536            10632 ns        11465 ns      0.9x
rng::for_each_n(vector<int>)/262144           53031 ns        45948 ns      1.2x
rng::for_each_n(vector<int>)/1048576         174328 ns       184270 ns      0.9x

for_each

--------------------------------------------------------------------------------
Benchmark                                     Before           After     Speedup
--------------------------------------------------------------------------------
std::for_each(deque<int>)/8                  3.18 ns         2.96 ns        1.1x
std::for_each(deque<int>)/32                 5.70 ns         5.54 ns        1.0x
std::for_each(deque<int>)/1024                183 ns          180 ns        1.0x
std::for_each(deque<int>)/8192               1435 ns         1422 ns        1.0x
std::for_each(deque<int>)/16384              2885 ns         2879 ns        1.0x
std::for_each(deque<int>)/65536             11423 ns        11378 ns        1.0x
std::for_each(deque<int>)/262144            45203 ns        43686 ns        1.0x
std::for_each(deque<int>)/1048576          181832 ns       173832 ns        1.0x
rng::for_each(deque<int>)/8                  5.10 ns         3.75 ns        1.4x
rng::for_each(deque<int>)/32                 23.5 ns         7.49 ns        3.1x
rng::for_each(deque<int>)/1024                693 ns          184 ns        3.8x
rng::for_each(deque<int>)/8192               5522 ns         1430 ns        3.9x
rng::for_each(deque<int>)/16384             11112 ns         2930 ns        3.8x
rng::for_each(deque<int>)/65536             44390 ns        11656 ns        3.8x
rng::for_each(deque<int>)/262144           179419 ns        46582 ns        3.9x
rng::for_each(deque<int>)/1048576          711406 ns       189658 ns        3.8x
std::for_each(vector<int>)/8                 2.96 ns         2.91 ns        1.0x
std::for_each(vector<int>)/32                5.54 ns         5.49 ns        1.0x
std::for_each(vector<int>)/1024               165 ns          162 ns        1.0x
std::for_each(vector<int>)/8192              1269 ns         1257 ns        1.0x
std::for_each(vector<int>)/16384             2636 ns         2567 ns        1.0x
std::for_each(vector<int>)/65536            10231 ns        10215 ns        1.0x
std::for_each(vector<int>)/262144           41544 ns        40719 ns        1.0x
std::for_each(vector<int>)/1048576         173667 ns       167878 ns        1.0x
rng::for_each(vector<int>)/8                 3.09 ns         3.06 ns        1.0x
rng::for_each(vector<int>)/32                5.85 ns         5.77 ns        1.0x
rng::for_each(vector<int>)/1024               179 ns          168 ns        1.1x
rng::for_each(vector<int>)/8192              1346 ns         1309 ns        1.0x
rng::for_each(vector<int>)/16384             2714 ns         2664 ns        1.0x
rng::for_each(vector<int>)/65536            10979 ns        10523 ns        1.0x
rng::for_each(vector<int>)/262144           42994 ns        42535 ns        1.0x
rng::for_each(vector<int>)/1048576         175633 ns       173933 ns        1.0x

Full diff: https://github.com/llvm/llvm-project/pull/132896.diff

8 Files Affected:

  • (modified) libcxx/include/__algorithm/for_each_n.h (+24-1)
  • (modified) libcxx/include/__algorithm/ranges_for_each.h (+11-3)
  • (modified) libcxx/include/__algorithm/ranges_for_each_n.h (+11-4)
  • (added) libcxx/test/benchmarks/algorithms/nonmodifying/for_each_n.bench.cpp (+57)
  • (modified) libcxx/test/libcxx/algorithms/ranges_robust_against_copying_comparators.pass.cpp (+1-1)
  • (modified) libcxx/test/std/algorithms/alg.nonmodifying/alg.foreach/for_each_n.pass.cpp (+82-38)
  • (modified) libcxx/test/std/algorithms/alg.nonmodifying/alg.foreach/ranges.for_each.pass.cpp (+41-5)
  • (modified) libcxx/test/std/algorithms/alg.nonmodifying/alg.foreach/ranges.for_each_n.pass.cpp (+44-2)
diff --git a/libcxx/include/__algorithm/for_each_n.h b/libcxx/include/__algorithm/for_each_n.h
index fce380b49df3e..3d91124432f56 100644
--- a/libcxx/include/__algorithm/for_each_n.h
+++ b/libcxx/include/__algorithm/for_each_n.h
@@ -10,7 +10,11 @@
 #ifndef _LIBCPP___ALGORITHM_FOR_EACH_N_H
 #define _LIBCPP___ALGORITHM_FOR_EACH_N_H
 
+#include <__algorithm/for_each.h>
 #include <__config>
+#include <__iterator/iterator_traits.h>
+#include <__iterator/segmented_iterator.h>
+#include <__type_traits/enable_if.h>
 #include <__utility/convert_to_integral.h>
 
 #if !defined(_LIBCPP_HAS_NO_PRAGMA_SYSTEM_HEADER)
@@ -21,7 +25,13 @@ _LIBCPP_BEGIN_NAMESPACE_STD
 
 #if _LIBCPP_STD_VER >= 17
 
-template <class _InputIterator, class _Size, class _Function>
+template <class _InputIterator,
+          class _Size,
+          class _Function,
+          __enable_if_t<!__is_segmented_iterator<_InputIterator>::value ||
+                            (__has_input_iterator_category<_InputIterator>::value &&
+                             !__has_random_access_iterator_category<_InputIterator>::value),
+                        int> = 0>
 inline _LIBCPP_HIDE_FROM_ABI _LIBCPP_CONSTEXPR_SINCE_CXX20 _InputIterator
 for_each_n(_InputIterator __first, _Size __orig_n, _Function __f) {
   typedef decltype(std::__convert_to_integral(__orig_n)) _IntegralSize;
@@ -34,6 +44,19 @@ for_each_n(_InputIterator __first, _Size __orig_n, _Function __f) {
   return __first;
 }
 
+template <class _InputIterator,
+          class _Size,
+          class _Function,
+          __enable_if_t<__is_segmented_iterator<_InputIterator>::value &&
+                            __has_random_access_iterator_category<_InputIterator>::value,
+                        int> = 0>
+inline _LIBCPP_HIDE_FROM_ABI _LIBCPP_CONSTEXPR_SINCE_CXX20 _InputIterator
+for_each_n(_InputIterator __first, _Size __orig_n, _Function __f) {
+  _InputIterator __last = __first + __orig_n;
+  std::for_each(__first, __last, __f);
+  return __last;
+}
+
 #endif
 
 _LIBCPP_END_NAMESPACE_STD
diff --git a/libcxx/include/__algorithm/ranges_for_each.h b/libcxx/include/__algorithm/ranges_for_each.h
index de39bc5522753..475f85366188e 100644
--- a/libcxx/include/__algorithm/ranges_for_each.h
+++ b/libcxx/include/__algorithm/ranges_for_each.h
@@ -9,6 +9,7 @@
 #ifndef _LIBCPP___ALGORITHM_RANGES_FOR_EACH_H
 #define _LIBCPP___ALGORITHM_RANGES_FOR_EACH_H
 
+#include <__algorithm/for_each.h>
 #include <__algorithm/in_fun_result.h>
 #include <__config>
 #include <__functional/identity.h>
@@ -41,9 +42,16 @@ struct __for_each {
   template <class _Iter, class _Sent, class _Proj, class _Func>
   _LIBCPP_HIDE_FROM_ABI constexpr static for_each_result<_Iter, _Func>
   __for_each_impl(_Iter __first, _Sent __last, _Func& __func, _Proj& __proj) {
-    for (; __first != __last; ++__first)
-      std::invoke(__func, std::invoke(__proj, *__first));
-    return {std::move(__first), std::move(__func)};
+    if constexpr (random_access_iterator<_Iter> && sized_sentinel_for<_Sent, _Iter>) {
+      auto __n   = __last - __first;
+      auto __end = __first + __n;
+      std::for_each(__first, __end, [&](auto&& __val) { std::invoke(__func, std::invoke(__proj, __val)); });
+      return {std::move(__end), std::move(__func)};
+    } else {
+      for (; __first != __last; ++__first)
+        std::invoke(__func, std::invoke(__proj, *__first));
+      return {std::move(__first), std::move(__func)};
+    }
   }
 
 public:
diff --git a/libcxx/include/__algorithm/ranges_for_each_n.h b/libcxx/include/__algorithm/ranges_for_each_n.h
index 603cb723233c8..3108d66001295 100644
--- a/libcxx/include/__algorithm/ranges_for_each_n.h
+++ b/libcxx/include/__algorithm/ranges_for_each_n.h
@@ -9,6 +9,7 @@
 #ifndef _LIBCPP___ALGORITHM_RANGES_FOR_EACH_N_H
 #define _LIBCPP___ALGORITHM_RANGES_FOR_EACH_N_H
 
+#include <__algorithm/for_each.h>
 #include <__algorithm/in_fun_result.h>
 #include <__config>
 #include <__functional/identity.h>
@@ -40,11 +41,17 @@ struct __for_each_n {
   template <input_iterator _Iter, class _Proj = identity, indirectly_unary_invocable<projected<_Iter, _Proj>> _Func>
   _LIBCPP_HIDE_FROM_ABI constexpr for_each_n_result<_Iter, _Func>
   operator()(_Iter __first, iter_difference_t<_Iter> __count, _Func __func, _Proj __proj = {}) const {
-    while (__count-- > 0) {
-      std::invoke(__func, std::invoke(__proj, *__first));
-      ++__first;
+    if constexpr (random_access_iterator<_Iter>) {
+      auto __last = __first + __count;
+      std::for_each(__first, __last, [&](auto&& __val) { std::invoke(__func, std::invoke(__proj, __val)); });
+      return {std::move(__last), std::move(__func)};
+    } else {
+      while (__count-- > 0) {
+        std::invoke(__func, std::invoke(__proj, *__first));
+        ++__first;
+      }
+      return {std::move(__first), std::move(__func)};
     }
-    return {std::move(__first), std::move(__func)};
   }
 };
 
diff --git a/libcxx/test/benchmarks/algorithms/nonmodifying/for_each_n.bench.cpp b/libcxx/test/benchmarks/algorithms/nonmodifying/for_each_n.bench.cpp
new file mode 100644
index 0000000000000..af46371881577
--- /dev/null
+++ b/libcxx/test/benchmarks/algorithms/nonmodifying/for_each_n.bench.cpp
@@ -0,0 +1,57 @@
+//===----------------------------------------------------------------------===//
+//
+// Part of the LLVM Project, under the Apache License v2.0 with LLVM Exceptions.
+// See https://llvm.org/LICENSE.txt for license information.
+// SPDX-License-Identifier: Apache-2.0 WITH LLVM-exception
+//
+//===----------------------------------------------------------------------===//
+
+// UNSUPPORTED: c++03, c++11, c++14, c++17
+
+#include <algorithm>
+#include <cstddef>
+#include <deque>
+#include <list>
+#include <string>
+#include <vector>
+
+#include <benchmark/benchmark.h>
+
+int main(int argc, char** argv) {
+  auto std_for_each_n = [](auto first, auto n, auto f) { return std::for_each_n(first, n, f); };
+
+  // {std,ranges}::for_each_n
+  {
+    auto bm = []<class Container>(std::string name, auto for_each_n) {
+      benchmark::RegisterBenchmark(
+          name,
+          [for_each_n](auto& st) {
+            std::size_t const n = st.range(0);
+            Container c(n, 1);
+            auto first = c.begin();
+
+            for ([[maybe_unused]] auto _ : st) {
+              benchmark::DoNotOptimize(c);
+              auto result = for_each_n(first, n, [](int& x) { x = std::clamp(x, 10, 100); });
+              benchmark::DoNotOptimize(result);
+            }
+          })
+          ->Arg(8)
+          ->Arg(32)
+          ->Arg(50) // non power-of-two
+          ->Arg(8192)
+          ->Arg(1 << 20);
+    };
+    bm.operator()<std::vector<int>>("std::for_each_n(vector<int>)", std_for_each_n);
+    bm.operator()<std::deque<int>>("std::for_each_n(deque<int>)", std_for_each_n);
+    bm.operator()<std::list<int>>("std::for_each_n(list<int>)", std_for_each_n);
+    bm.operator()<std::vector<int>>("rng::for_each_n(vector<int>)", std::ranges::for_each_n);
+    bm.operator()<std::deque<int>>("rng::for_each_n(deque<int>)", std::ranges::for_each_n);
+    bm.operator()<std::list<int>>("rng::for_each_n(list<int>)", std::ranges::for_each_n);
+  }
+
+  benchmark::Initialize(&argc, argv);
+  benchmark::RunSpecifiedBenchmarks();
+  benchmark::Shutdown();
+  return 0;
+}
diff --git a/libcxx/test/libcxx/algorithms/ranges_robust_against_copying_comparators.pass.cpp b/libcxx/test/libcxx/algorithms/ranges_robust_against_copying_comparators.pass.cpp
index dd026444330ea..beb4c7f675a6e 100644
--- a/libcxx/test/libcxx/algorithms/ranges_robust_against_copying_comparators.pass.cpp
+++ b/libcxx/test/libcxx/algorithms/ranges_robust_against_copying_comparators.pass.cpp
@@ -258,7 +258,7 @@ constexpr bool all_the_algorithms()
 int main(int, char**)
 {
     all_the_algorithms();
-    static_assert(all_the_algorithms());
+    // static_assert(all_the_algorithms());
 
     return 0;
 }
diff --git a/libcxx/test/std/algorithms/alg.nonmodifying/alg.foreach/for_each_n.pass.cpp b/libcxx/test/std/algorithms/alg.nonmodifying/alg.foreach/for_each_n.pass.cpp
index 371f6c92f1ed1..42f1a41a27096 100644
--- a/libcxx/test/std/algorithms/alg.nonmodifying/alg.foreach/for_each_n.pass.cpp
+++ b/libcxx/test/std/algorithms/alg.nonmodifying/alg.foreach/for_each_n.pass.cpp
@@ -13,69 +13,113 @@
 //    constexpr InputIterator      // constexpr after C++17
 //    for_each_n(InputIterator first, Size n, Function f);
 
-
 #include <algorithm>
 #include <cassert>
+#include <deque>
 #include <functional>
+#include <iterator>
+#include <ranges>
+#include <vector>
 
 #include "test_macros.h"
 #include "test_iterators.h"
 
-#if TEST_STD_VER > 17
-TEST_CONSTEXPR bool test_constexpr() {
-    int ia[] = {1, 3, 6, 7};
-    int expected[] = {3, 5, 8, 9};
-    const std::size_t N = 4;
+struct for_each_test {
+  TEST_CONSTEXPR for_each_test(int c) : count(c) {}
+  int count;
+  TEST_CONSTEXPR_CXX14 void operator()(int& i) {
+    ++i;
+    ++count;
+  }
+};
 
-    auto it = std::for_each_n(std::begin(ia), N, [](int &a) { a += 2; });
-    return it == (std::begin(ia) + N)
-        && std::equal(std::begin(ia), std::end(ia), std::begin(expected))
-        ;
-    }
-#endif
+struct deque_test {
+  std::deque<int>* d_;
+  int* i_;
+
+  deque_test(std::deque<int>& d, int& i) : d_(&d), i_(&i) {}
 
-struct for_each_test
-{
-    for_each_test(int c) : count(c) {}
-    int count;
-    void operator()(int& i) {++i; ++count;}
+  void operator()(int& v) {
+    assert(&(*d_)[*i_] == &v);
+    ++*i_;
+  }
 };
 
-int main(int, char**)
-{
+/*TEST_CONSTEXPR_CXX23*/
+void test_segmented_deque_iterator() { // TODO: Mark as TEST_CONSTEXPR_CXX23 once std::deque is constexpr
+  // check that segmented iterators work properly
+  int sizes[] = {0, 1, 2, 1023, 1024, 1025, 2047, 2048, 2049};
+  for (const int size : sizes) {
+    std::deque<int> d(size);
+    int index = 0;
+
+    std::for_each_n(d.begin(), d.size(), deque_test(d, index));
+  }
+}
+
+TEST_CONSTEXPR_CXX20 bool test() {
+  {
     typedef cpp17_input_iterator<int*> Iter;
-    int ia[] = {0, 1, 2, 3, 4, 5};
-    const unsigned s = sizeof(ia)/sizeof(ia[0]);
+    int ia[]         = {0, 1, 2, 3, 4, 5};
+    const unsigned s = sizeof(ia) / sizeof(ia[0]);
 
     {
-    auto f = for_each_test(0);
-    Iter it = std::for_each_n(Iter(ia), 0, std::ref(f));
-    assert(it == Iter(ia));
-    assert(f.count == 0);
+      auto f  = for_each_test(0);
+      Iter it = std::for_each_n(Iter(ia), 0, std::ref(f));
+      assert(it == Iter(ia));
+      assert(f.count == 0);
     }
 
     {
-    auto f = for_each_test(0);
-    Iter it = std::for_each_n(Iter(ia), s, std::ref(f));
+      auto f  = for_each_test(0);
+      Iter it = std::for_each_n(Iter(ia), s, std::ref(f));
 
-    assert(it == Iter(ia+s));
-    assert(f.count == s);
-    for (unsigned i = 0; i < s; ++i)
-        assert(ia[i] == static_cast<int>(i+1));
+      assert(it == Iter(ia + s));
+      assert(f.count == s);
+      for (unsigned i = 0; i < s; ++i)
+        assert(ia[i] == static_cast<int>(i + 1));
     }
 
     {
-    auto f = for_each_test(0);
-    Iter it = std::for_each_n(Iter(ia), 1, std::ref(f));
+      auto f  = for_each_test(0);
+      Iter it = std::for_each_n(Iter(ia), 1, std::ref(f));
 
-    assert(it == Iter(ia+1));
-    assert(f.count == 1);
-    for (unsigned i = 0; i < 1; ++i)
-        assert(ia[i] == static_cast<int>(i+2));
+      assert(it == Iter(ia + 1));
+      assert(f.count == 1);
+      for (unsigned i = 0; i < 1; ++i)
+        assert(ia[i] == static_cast<int>(i + 2));
     }
+  }
+
+#if TEST_STD_VER > 11
+  {
+    int ia[]            = {1, 3, 6, 7};
+    int expected[]      = {3, 5, 8, 9};
+    const std::size_t N = 4;
+
+    auto it = std::for_each_n(std::begin(ia), N, [](int& a) { a += 2; });
+    assert(it == (std::begin(ia) + N) && std::equal(std::begin(ia), std::end(ia), std::begin(expected)));
+  }
+#endif
+
+  if (!TEST_IS_CONSTANT_EVALUATED) // TODO: Use TEST_STD_AT_LEAST_23_OR_RUNTIME_EVALUATED when std::deque is made constexpr
+    test_segmented_deque_iterator();
+
+#if TEST_STD_VER >= 20
+  { // Make sure that the segmented iterator optimization works during constant evaluation
+    std::vector<std::vector<int>> vec = {{0}, {1, 2}, {3, 4, 5}, {6, 7, 8, 9}, {10}, {11, 12, 13}};
+    auto v                            = vec | std::views::join;
+    std::for_each_n(v.begin(), std::ranges::distance(v), [i = 0](int& a) mutable { assert(a == i++); });
+  }
+#endif
+
+  return true;
+}
 
+int main(int, char**) {
+  assert(test());
 #if TEST_STD_VER > 17
-    static_assert(test_constexpr());
+  static_assert(test());
 #endif
 
   return 0;
diff --git a/libcxx/test/std/algorithms/alg.nonmodifying/alg.foreach/ranges.for_each.pass.cpp b/libcxx/test/std/algorithms/alg.nonmodifying/alg.foreach/ranges.for_each.pass.cpp
index 8b9b6e82cbcb2..2f4bfb9db6dba 100644
--- a/libcxx/test/std/algorithms/alg.nonmodifying/alg.foreach/ranges.for_each.pass.cpp
+++ b/libcxx/test/std/algorithms/alg.nonmodifying/alg.foreach/ranges.for_each.pass.cpp
@@ -20,7 +20,10 @@
 
 #include <algorithm>
 #include <array>
+#include <cassert>
+#include <deque>
 #include <ranges>
+#include <vector>
 
 #include "almost_satisfies_types.h"
 #include "test_iterators.h"
@@ -30,7 +33,7 @@ struct Callable {
 };
 
 template <class Iter, class Sent = Iter>
-concept HasForEachIt = requires (Iter iter, Sent sent) { std::ranges::for_each(iter, sent, Callable{}); };
+concept HasForEachIt = requires(Iter iter, Sent sent) { std::ranges::for_each(iter, sent, Callable{}); };
 
 static_assert(HasForEachIt<int*>);
 static_assert(!HasForEachIt<InputIteratorNotDerivedFrom>);
@@ -47,7 +50,7 @@ static_assert(!HasForEachItFunc<IndirectUnaryPredicateNotPredicate>);
 static_assert(!HasForEachItFunc<IndirectUnaryPredicateNotCopyConstructible>);
 
 template <class Range>
-concept HasForEachR = requires (Range range) { std::ranges::for_each(range, Callable{}); };
+concept HasForEachR = requires(Range range) { std::ranges::for_each(range, Callable{}); };
 
 static_assert(HasForEachR<UncheckedRange<int*>>);
 static_assert(!HasForEachR<InputRangeNotDerivedFrom>);
@@ -68,7 +71,7 @@ constexpr void test_iterator() {
   { // simple test
     {
       auto func = [i = 0](int& a) mutable { a += i++; };
-      int a[] = {1, 6, 3, 4};
+      int a[]   = {1, 6, 3, 4};
       std::same_as<std::ranges::for_each_result<Iter, decltype(func)>> decltype(auto) ret =
           std::ranges::for_each(Iter(a), Sent(Iter(a + 4)), func);
       assert(a[0] == 1);
@@ -81,8 +84,8 @@ constexpr void test_iterator() {
       assert(i == 4);
     }
     {
-      auto func = [i = 0](int& a) mutable { a += i++; };
-      int a[] = {1, 6, 3, 4};
+      auto func  = [i = 0](int& a) mutable { a += i++; };
+      int a[]    = {1, 6, 3, 4};
       auto range = std::ranges::subrange(Iter(a), Sent(Iter(a + 4)));
       std::same_as<std::ranges::for_each_result<Iter, decltype(func)>> decltype(auto) ret =
           std::ranges::for_each(range, func);
@@ -110,6 +113,30 @@ constexpr void test_iterator() {
   }
 }
 
+struct deque_test {
+  std::deque<int>* d_;
+  int* i_;
+
+  deque_test(std::deque<int>& d, int& i) : d_(&d), i_(&i) {}
+
+  void operator()(int& v) {
+    assert(&(*d_)[*i_] == &v);
+    ++*i_;
+  }
+};
+
+/*TEST_CONSTEXPR_CXX23*/
+void test_segmented_deque_iterator() { // TODO: Mark as TEST_CONSTEXPR_CXX23 once std::deque is constexpr
+  // check that segmented iterators work properly
+  int sizes[] = {0, 1, 2, 1023, 1024, 1025, 2047, 2048, 2049};
+  for (const int size : sizes) {
+    std::deque<int> d(size);
+    int index = 0;
+
+    std::ranges::for_each(d, deque_test(d, index));
+  }
+}
+
 constexpr bool test() {
   test_iterator<cpp17_input_iterator<int*>, sentinel_wrapper<cpp17_input_iterator<int*>>>();
   test_iterator<cpp20_input_iterator<int*>, sentinel_wrapper<cpp20_input_iterator<int*>>>();
@@ -146,6 +173,15 @@ constexpr bool test() {
     }
   }
 
+  if (!TEST_IS_CONSTANT_EVALUATED) // TODO: Use TEST_STD_AT_LEAST_23_OR_RUNTIME_EVALUATED when std::deque is made constexpr
+    test_segmented_deque_iterator();
+
+  {
+    std::vector<std::vector<int>> vec = {{0}, {1, 2}, {3, 4, 5}, {6, 7, 8, 9}, {10}, {11, 12, 13}};
+    auto v                            = vec | std::views::join;
+    std::ranges::for_each(v, [i = 0](int x) mutable { assert(x == 2 * i++); }, [](int x) { return 2 * x; });
+  }
+
   return true;
 }
 
diff --git a/libcxx/test/std/algorithms/alg.nonmodifying/alg.foreach/ranges.for_each_n.pass.cpp b/libcxx/test/std/algorithms/alg.nonmodifying/alg.foreach/ranges.for_each_n.pass.cpp
index d4b2d053d08ce..ad1447b7348f5 100644
--- a/libcxx/test/std/algorithms/alg.nonmodifying/alg.foreach/ranges.for_each_n.pass.cpp
+++ b/libcxx/test/std/algorithms/alg.nonmodifying/alg.foreach/ranges.for_each_n.pass.cpp
@@ -17,7 +17,12 @@
 
 #include <algorithm>
 #include <array>
+#include <cassert>
+#include <deque>
+#include <iterator>
 #include <ranges>
+#include <ranges>
+#include <vector>
 
 #include "almost_satisfies_types.h"
 #include "test_iterators.h"
@@ -27,7 +32,7 @@ struct Callable {
 };
 
 template <class Iter>
-concept HasForEachN = requires (Iter iter) { std::ranges::for_each_n(iter, 0, Callable{}); };
+concept HasForEachN = requires(Iter iter) { std::ranges::for_each_n(iter, 0, Callable{}); };
 
 static_assert(HasForEachN<int*>);
 static_assert(!HasForEachN<InputIteratorNotDerivedFrom>);
@@ -45,7 +50,7 @@ template <class Iter>
 constexpr void test_iterator() {
   { // simple test
     auto func = [i = 0](int& a) mutable { a += i++; };
-    int a[] = {1, 6, 3, 4};
+    int a[]   = {1, 6, 3, 4};
     std::same_as<std::ranges::for_each_result<Iter, decltype(func)>> auto ret =
         std::ranges::for_each_n(Iter(a), 4, func);
     assert(a[0] == 1);
@@ -64,6 +69,30 @@ constexpr void test_iterator() {
   }
 }
 
+struct deque_test {
+  std::deque<int>* d_;
+  int* i_;
+
+  deque_test(std::deque<int>& d, int& i) : d_(&d), i_(&i) {}
+
+  void operator()(int& v) {
+    assert(&(*d_)[*i_] == &v);
+    ++*i_;
+  }
+};
+
+/*TEST_CONSTEXPR_CXX23*/
+void test_segmented_deque_iterator() { // TODO: Mark as TEST_CONSTEXPR_CXX23 once std::deque is constexpr
+  // check that segmented iterators work properly
+  int sizes[] = {0, 1, 2, 1023, 1024, 1025, 2047, 2048, 2049};
+  for (const int size : sizes) {
+    std::deque<int> d(size);
+    int index = 0;
+
+    std::ranges::for_each_n(d.begin(), d.size(), deque_test(d, index));
+  }
+}
+
 constexpr bool test() {
   test_iterator<cpp17_input_iterator<int*>>();
   test_iterator<cpp20_input_iterator<int*>>();
@@ -89,6 +118,19 @@ constexpr bool test() {
     assert(a[2].other == 6);
   }
 
+  if (!TEST_IS_CONSTANT_EVALUATED) // TODO: Use TEST_STD_AT_LEAST_23_OR_RUNTIME_EVALUATED when std::deque is made constexpr
+    test_segmented_deque_iterator();
+
+  {
+    std::vector<std::vector<int>> vec = {{0}, {1, 2}, {3, 4, 5}, {6, 7, 8, 9}, {10}, {11, 12, 13}};
+    auto v                            = vec | std::views::join;
+    std::ranges::for_each_n(
+        v.begin(),
+        std::ranges::distance(v),
+        [i = 0](int x) mutable { assert(x == 2 * i++); },
+        [](int x) { return 2 * x; });
+  }
+
   return true;
 }
 

@winner245 winner245 force-pushed the for-each-segment branch 2 times, most recently from 16438be to 047acfd Compare March 27, 2025 01:08
Member

@ldionne ldionne left a comment

Thanks for the patch! I left some comments but I think this is going to be a nice optimization.

@winner245 winner245 force-pushed the for-each-segment branch 3 times, most recently from d14bde4 to 8a5bcdc Compare April 5, 2025 02:43
Contributor

@philnik777 philnik777 left a comment

I feel like the scope of this patch is getting a bit out of hand. The title says that you're optimizing ranges::for_each{,_n}, but you're also back-porting the std::for_each optimization to C++03 and adding an optimization to std::for_each_n. Could we split this up to make it clear what changes are required for what optimizations? Also, why do we want to back-port the std::for_each optimization now? Do we think the extra complexity is worth the improved performance?

@winner245
Contributor Author

winner245 commented Apr 5, 2025

I feel like the scope of this patch is getting a bit out of hand. The title says that you're optimizing ranges::for_each{,_n}, but you're also back-porting the std::for_each optimization to C++03 and adding an optimization to std::for_each_n. Could we split this up to make it clear what changes are required for what optimizations? Also, why do we want to back-port the std::for_each optimization now? Do we think the extra complexity is worth the improved performance?

Thank you for your feedback! I agree that the scope of the patch has expanded beyond its original intent. Initially, the goal was simply to extend the std::for_each optimization to its variants ranges::for_each{,_n}. However, as the review and revision progressed, I also tried to address the inconsistent segmented iterator optimization support between for_each_n and for_each, since the optimization for for_each_n covers C++03. I think back-porting the std::for_each optimization to C++03 could be useful, as we may be able to extend the optimization to other algorithms by letting them simply forward to std::for_each (as per your comment in another PR).

However, I agree that this made the patch diverge from its original purpose and may complicate the review process. Following your suggestion, I will work on splitting it to make it clear what this patch focuses on.

-------------- Update --------------
As per your suggestion, I have split this into the following PRs, each focusing on an independent and self-contained subtask for the classical algorithms:

This separation allows the current PR to focus exclusively on the optimization of the ranges algorithms. I will rebase my current patch on the above split pieces once they are landed.


github-actions bot commented May 22, 2025

✅ With the latest revision this PR passed the C/C++ code formatter.

@winner245
Contributor Author

With std::for_each backported to C++11 in #134960 and std::for_each_n carved out into #135468, this PR is now much cleaner, focusing exclusively on std::ranges::{for_each, for_each_n}.

Member

@ldionne ldionne left a comment

LGTM once comments are addressed. Thanks a lot for this series of refactorings / optimizations!

Comment on lines 80 to 81
resulting in performance improvements of up to 21.3x for ``std::deque::iterator`` and 24.9x for ``join_view`` of
``vector<vector<char>>``.
Member

We should report this optimization on the same line as the std::for_each optimization above -- I don't think there is much to be gained from having nearly-duplicate release notes since these algorithms are very similar. While we aim for a good level of completeness in our release notes, we also want to make them as useful to users as possible.

Contributor Author

I've rerun the benchmarks multiple times and got similar, consistent speedups for the ranges algorithms. It is a bit strange that these numbers are greater than those reported earlier for the classical std algorithms; ideally, they should match. I haven't identified a clear reason for the discrepancy. My guess is that the earlier numbers compared std::for_each with and without the segmented iterator optimization, while the numbers in this patch compare std::ranges::for_each with and without it. The std::for_each comparisons did not include noise such as the std::invoke call and the projection call, whereas the ranges comparisons do, and this noise might account for the difference. This is the only difference I can think of at the moment.

To avoid confusion, I will not report these numbers in this patch. Instead, I will stick to the previously reported and smaller numbers (which suffice to show the performance improvements).

for (; __first != __last; ++__first)
std::invoke(__func, std::invoke(__proj, *__first));
return {std::move(__first), std::move(__func)};
if constexpr (!std::assignable_from<_Iter&, _Sent> && sized_sentinel_for<_Sent, _Iter>) {
Member

Suggested change
if constexpr (!std::assignable_from<_Iter&, _Sent> && sized_sentinel_for<_Sent, _Iter>) {
// In the case where we have different iterator and sentinel types, the segmented iterator optimization
// in std::for_each will not kick in. Therefore, we prefer std::for_each_n in that case (whenever we can
// obtain the `n`).
if constexpr (!std::assignable_from<_Iter&, _Sent> && sized_sentinel_for<_Sent, _Iter>) {

Comment on lines +45 to +50
->Arg(1024)
->Arg(4096)
->Arg(8192)
->Arg(1 << 20);
->Arg(1 << 14)
->Arg(1 << 16)
->Arg(1 << 18);
Member

I believe it would be better to leave the old benchmark values in place. They are less comprehensive but we need to achieve a tradeoff between comprehensiveness and the time it takes to run these benchmarks.

Contributor Author

Previously, we were running a test case with a very large n = 1 << 20. To save some time, I replaced this large test case with three smaller test cases with n = 1 << 14, 1 << 16, and 1 << 18. I think the total execution time of these three test cases is actually lower than that of a single test case with n = 1 << 20. Please let me know if I misunderstood you.
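The element counts behind that estimate can be checked with a few lines of arithmetic. This sketch only compares the number of elements visited per pass; it assumes cost grows linearly with n and says nothing about how many iterations the benchmark harness chooses to run for each case. The helper `total_elements` is a hypothetical name.

```cpp
#include <cstddef>
#include <initializer_list>

// Hypothetical helper: number of elements visited by one pass over each size.
constexpr std::size_t total_elements(std::initializer_list<std::size_t> sizes) {
  std::size_t total = 0;
  for (std::size_t n : sizes)
    total += n;
  return total;
}

constexpr std::size_t one = 1;
constexpr std::size_t three_small_cases = total_elements({one << 14, one << 16, one << 18});
constexpr std::size_t one_large_case    = one << 20;

// 16384 + 65536 + 262144 = 344064, roughly a third of 1048576.
static_assert(three_small_cases == 344064);
static_assert(three_small_cases < one_large_case);
```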

Comment on lines +87 to +95
->Arg(8)
->Arg(32)
->Arg(50) // non power-of-two
->Arg(1024)
->Arg(4096)
->Arg(8192)
->Arg(1 << 14)
->Arg(1 << 16)
->Arg(1 << 18);
Member

Same here for the benchmark sizes.

Comment on lines 53 to +54
bm.operator()<std::list<int>>("std::for_each_n(list<int>)", std_for_each_n);
bm.operator()<std::vector<int>>("rng::for_each_n(vector<int>)", std::ranges::for_each_n);
Member

Let's use the same numbers as for the std::for_each benchmarks.

Labels
libc++ libc++ C++ Standard Library. Not GNU libstdc++. Not libc++abi. performance