Doc. no.: P0024R2
Date: 2016-03-04
Project: Programming Language C++
Reply to: Jared Hoberock <[email protected]>
We survey implementation experience with the recently published C++ Technical Specification for Extensions for Parallelism and conclude that ample experience with its functionality exists to justify standardization in C++17. This paper describes various existing and pre-existing implementations of the TS's content, as well as the additions to be made to the current C++ working paper (N4527) to integrate execution policies and parallel algorithms into the C++ Standard Library.
The technical content of the Parallelism TS was developed by domain experts in parallelism over the course of a few years. In 2012, representatives from NVIDIA (N3408) as well as representatives from Microsoft and Intel (N3429) independently proposed library approaches to parallelism within the C++ Standard Library. At the suggestion of SG1, the authors of these proposals submitted a design in a joint proposal (N3554) to parallelize the existing standard algorithms library. This proposal was refined into the Parallelism TS over the course of two years. During that refinement process, the authors of the Parallelism TS incorporated feedback from experimental implementations into the final design, which was published in 2015. In total, the C++ Standardization Committee has three years of experience with the TS's design.
Several different implementations of the Parallelism TS emerged during its preparation. We are aware of the following publicly documented implementations.
These implementations implement the functionality of the Parallelism TS to
varying degrees and in different ways. For example, Microsoft's
implementation appears complete and is implemented via Windows-specific
tasking facilities. Thibaut Lutz' version also appears complete and is
implemented by manipulating std::thread
in a standard way.
NVIDIA's implementation is partial and is implemented as a thin wrapper
around Thrust, a pre-existing library similar in content to the Parallelism
TS. This variety of implementation approaches exists by design: the
abstractions of the Parallelism TS are intended to maximize flexibility of
implementation.
The design of the Parallelism TS's functionality was inspired by several pre-existing libraries. Each of the following parallel algorithms libraries exposes an iterator-based algorithm interface based on the conventions of the original Standard Template Library. We believe these libraries are a reasonable proxy for the content of the Parallelism TS.
These libraries have existed for several years, and some are widely deployed in production. Accordingly, we believe the features of the Parallelism TS are proven abstractions that represent standard practice and solve real challenges faced by real C++ programmers. These challenges exist because parallel architectures are so pervasive, and programming them correctly with existing low-level standard components is difficult. As a remedy, we believe that the high-level abstractions of the Parallelism TS must be standardized as soon as possible. C++ programmers should not have to wait beyond 2017 for standard parallel algorithms.
The parallel algorithms and execution policies of the Parallelism TS are only a starting point. Already we anticipate opportunities for extending the Parallelism TS's functionality to increase programmer flexibility and expressivity. A fully-realized executors feature (N4414, N4406) will yield new, flexible ways of creating execution, including the execution of parallel algorithms. For example, executors will provide a programmatic means of specifying where execution is allowed to occur during parallel algorithm execution and will open the door for user-defined execution policies in addition to the Parallelism TS's closed set of standard policies. If the first version of the Parallelism TS is standardized in 2017, such additional features for parallelism will be well-positioned for 2020.
We propose to standardize the functionality of the Parallelism TS as specified. In summary:
- Add exception_list as a new subclause to Clause 19.
- Add the header <execution_policy> to Table 14.
- Rename seq to sequential.

Thanks to Jens Maurer for wording suggestions.
Add the following entry to Table 44:

20.15 | Execution policies | <execution_policy> |

Add the header <execution_policy> to Table 14 and update the reported number of headers.
Add a new subclause to Clause 20:
This subclause describes classes that are execution policy types. An object of an execution policy type indicates the kinds of parallelism allowed in the execution of an algorithm and expresses the consequent requirements on the element access functions.
using namespace std;
vector<int> v = ...
// standard sequential sort
sort(v.begin(), v.end());
// explicitly sequential sort
sort(sequential, v.begin(), v.end());
// permitting parallel execution
sort(par, v.begin(), v.end());
// permitting vectorization as well
sort(par_vec, v.begin(), v.end());
Header <execution_policy> synopsis

namespace std {
// 20.15.3, Execution policy type trait
template<class T> struct is_execution_policy;
template<class T> constexpr bool is_execution_policy_v = is_execution_policy<T>::value;
// 20.15.4, Sequential execution policy
class sequential_execution_policy;
// 20.15.5, Parallel execution policy
class parallel_execution_policy;
// 20.15.6, Parallel+Vector execution policy
class parallel_vector_execution_policy;
}
template<class T> struct is_execution_policy { see below };
is_execution_policy
can be used to detect execution policies for the purpose of excluding
function signatures from otherwise ambiguous overload resolution
participation.
is_execution_policy<T>
shall be a UnaryTypeTrait with a BaseCharacteristic of true_type
if T
is the type of a standard or implementation-defined execution policy, otherwise false_type
.
The behavior of a program that adds specializations for is_execution_policy
is undefined.
class sequential_execution_policy { unspecified };
The class sequential_execution_policy
is an execution policy type used as a unique type to disambiguate
parallel algorithm overloading and require that a parallel algorithm's
execution may not be parallelized.
class parallel_execution_policy { unspecified };
The class parallel_execution_policy
is an execution policy type used as a unique type to disambiguate
parallel algorithm overloading and indicate that a parallel algorithm's
execution may be parallelized.
class parallel_vector_execution_policy { unspecified };
The class parallel_vector_execution_policy
is an execution policy type used as a unique type to disambiguate
parallel algorithm overloading and indicate that a parallel algorithm's
execution may be parallelized and vectorized.
constexpr sequential_execution_policy sequential{ unspecified };
constexpr parallel_execution_policy par{ unspecified };
constexpr parallel_vector_execution_policy par_vec{ unspecified };
The header <execution_policy>
declares global objects associated with each type of execution policy.
exception_list
Add the following entry to Table 41:
18.8.? | Exception list | <exception_list> |
Add the header <exception_list> to Table 14 and adjust the reported number of headers.
Add a new section to Clause 18 after exception_ptr:
exception_list
namespace std {
  class exception_list : public exception
  {
    public:
      typedef unspecified iterator;

      size_t size() const noexcept;
      iterator begin() const noexcept;
      iterator end() const noexcept;

      virtual const char* what() const noexcept;
  };
}
The class exception_list
owns a sequence of exception_ptr
objects. Parallel
algorithms use the exception_list
to communicate uncaught exceptions encountered during parallel execution to the
caller of the algorithm.
The type exception_list::iterator
is a constant iterator that meets the requirements of
ForwardIterator
.
size_t size() const noexcept;
Returns: The number of exception_ptr objects contained within the exception_list.
iterator begin() const noexcept;
Returns: An iterator referring to the first exception_ptr object contained within the exception_list.
iterator end() const noexcept;

Returns: An iterator that is past the end of the owned sequence.
virtual const char* what() const noexcept;

Returns: An implementation-defined NTBS.
Insert the following subclause after Section 25.1:
A parallel algorithm is a function template listed in this standard with a template parameter named ExecutionPolicy
.
Parallel algorithms access objects indirectly accessible via their arguments by invoking the following functions, herein called element access functions. [ Example: The sort function may invoke the following element access functions:

- Operations of the random-access iterator of the actual template argument, as implied by the name of the template parameter RandomAccessIterator.
- The swap function on the elements of the sequence (as per 25.4.1.1 [sort]/2).
- The user-provided Compare function object.

— end example ]
Function objects passed into parallel algorithms as objects of type Predicate
, BinaryPredicate
,
Compare
, and BinaryOperation
shall not directly or indirectly modify
objects via their arguments.
Parallel algorithms have template parameters named ExecutionPolicy
which describe
the manner in which the execution of these algorithms may be parallelized and the manner in
which they apply the element access functions.
The invocations of element access functions in parallel algorithms invoked with an execution
policy object of type sequential_execution_policy
are indeterminately sequenced (1.9) [intro.execution] in the calling thread.
The invocations of element access functions in parallel algorithms invoked with an execution
policy object of type parallel_execution_policy
are permitted to execute in either the invoking thread or in a thread implicitly created by the library
to support parallel algorithm execution. Any such invocations executing in the same thread are
indeterminately sequenced (1.9) [intro.execution] with respect to each other.
[ Example:

int a[] = {0,1};
std::vector<int> v;
std::for_each(std::par, std::begin(a), std::end(a), [&](int i) {
  v.push_back(i*2+1); // Error: data race
});

The program above has a data race because of the unsynchronized access to the container v. — end example ]
[ Example:

std::atomic<int> x{0};
int a[] = {1,2};
std::for_each(std::par, std::begin(a), std::end(a), [&](int) {
  x.fetch_add(1, std::memory_order_relaxed);
  // spin wait for another iteration to change the value of x
  while (x.load(std::memory_order_relaxed) == 1) { } // Error: assumes execution order
});

The above example depends on the order of execution of the iterations, and is therefore undefined (may deadlock). — end example ]
[ Example:

int x = 0;
std::mutex m;
int a[] = {1,2};
std::for_each(std::par, std::begin(a), std::end(a), [&](int) {
  std::lock_guard<std::mutex> guard(m);
  ++x;
});

The above example synchronizes access to object x, ensuring that it is incremented correctly. — end example ]
The invocations of element access functions in parallel algorithms invoked with an execution
policy of type parallel_vector_execution_policy
are permitted to execute in an unordered fashion in unspecified threads, and unsequenced
with respect to one another within each thread.
Because parallel_vector_execution_policy allows the execution of element access functions to be interleaved on a single thread, synchronization, including the use of mutexes, risks deadlock. Thus, synchronization with parallel_vector_execution_policy is restricted as follows:
A standard library function is vectorization-unsafe if it is specified to synchronize with
another function invocation, or another function invocation is specified to synchronize with it, and if
it is not a memory allocation or deallocation function. Vectorization-unsafe standard library functions
may not be invoked by user code called from parallel_vector_execution_policy
algorithms.
[ Example:

int x = 0;
std::mutex m;
int a[] = {1,2};
std::for_each(std::par_vec, std::begin(a), std::end(a), [&](int) {
  std::lock_guard<std::mutex> guard(m); // Error: lock_guard constructor calls m.lock()
  ++x;
});

The above program has undefined behavior because the applications of the function object are not guaranteed to run on different threads. This may result in two consecutive calls to m.lock() on the same thread, which may deadlock. — end example ]
[ Note: The semantics of the parallel_execution_policy or the parallel_vector_execution_policy invocation allow the implementation to fall back to sequential execution if the system cannot parallelize an algorithm invocation due to lack of resources. — end note ]
The semantics of parallel algorithms invoked with an execution policy object of implementation-defined type are implementation-defined.
During the execution of a parallel algorithm,
if temporary memory resources are required for parallelization and none are available,
the algorithm throws a bad_alloc
exception.
During the execution of a parallel algorithm, if the invocation of an element access function exits via an uncaught exception, the behavior of the program is determined by the type of execution policy used to invoke the algorithm:
- If the execution policy object is of type parallel_vector_execution_policy, terminate() is called.
- If the execution policy object is of type sequential_execution_policy or parallel_execution_policy, the execution of the algorithm exits via an exception. The exception will be an exception_list containing all uncaught exceptions thrown during the invocations of element access functions, or optionally the uncaught exception if there was only one. [ Note: For example, when for_each is executed sequentially, if an invocation of the user-provided function object throws an exception, for_each can exit via the uncaught exception, or throw an exception_list containing the original exception. — end note ]
[ Note: These guarantees imply that, unless the algorithm fails to allocate memory and exits via bad_alloc, all exceptions thrown during the execution of the algorithm are communicated to the caller. It is unspecified whether an algorithm implementation will "forge ahead" after encountering and capturing a user exception. — end note ]
[ Note: The algorithm may exit via the bad_alloc exception even if one or more user-provided function objects have exited via an exception. For example, this can happen when an algorithm fails to allocate memory while creating or adding elements to the exception_list object. — end note ]
ExecutionPolicy
algorithm overloads
Parallel algorithms are algorithm overloads. Each parallel algorithm overload has an
additional template type parameter named ExecutionPolicy
, which is the first template parameter.
Additionally, each parallel algorithm overload has an additional function parameter
of type ExecutionPolicy&&
, which is the first function parameter.
Unless otherwise specified, the semantics of ExecutionPolicy algorithm overloads are identical to those of their overloads without an ExecutionPolicy parameter.
Parallel algorithms shall not participate in overload resolution unless
is_execution_policy_v<decay_t<ExecutionPolicy>>
is true
.
Table of parallel algorithms:

adjacent_difference | adjacent_find | all_of | any_of
copy | copy_if | copy_n | count
count_if | equal | exclusive_scan | fill
fill_n | find | find_end | find_first_of
find_if | find_if_not | for_each | for_each_n
generate | generate_n | includes | inclusive_scan
inner_product | inplace_merge | is_heap | is_heap_until
is_partitioned | is_sorted | is_sorted_until | lexicographical_compare
max_element | merge | min_element | minmax_element
mismatch | move | none_of | nth_element
partial_sort | partial_sort_copy | partition | partition_copy
reduce | remove | remove_copy | remove_copy_if
remove_if | replace | replace_copy | replace_copy_if
replace_if | reverse | reverse_copy | rotate
rotate_copy | search | search_n | set_difference
set_intersection | set_symmetric_difference | set_union | sort
stable_partition | stable_sort | swap_ranges | transform
transform_exclusive_scan | transform_inclusive_scan | transform_reduce | uninitialized_copy
uninitialized_copy_n | uninitialized_fill | uninitialized_fill_n | unique
unique_copy
The new algorithms should be placed as follows:

- for_each_n should follow for_each
- reduce should follow accumulate
- transform_reduce should follow reduce
- exclusive_scan should follow partial_sum
- inclusive_scan should follow exclusive_scan
- transform_exclusive_scan should follow inclusive_scan
- transform_inclusive_scan should follow transform_exclusive_scan
Add for_each with ExecutionPolicy, sequential for_each_n, and for_each_n with ExecutionPolicy to (25.2.4):
template<class ExecutionPolicy,
class InputIterator, class Function>
void for_each(ExecutionPolicy&& exec,
InputIterator first, InputIterator last,
Function f);
Requires: Function shall meet the requirements of CopyConstructible.

Effects: Applies f to the result of dereferencing every iterator in the range [first,last).

[ Note: If first satisfies the requirements of a mutable iterator, f may apply nonconstant functions through the dereferenced iterator. — end note ]

Complexity: Applies f exactly last - first times.

Remarks: If f returns a result, the result is ignored.

[ Note: Unlike its sequential form, the parallel overload of for_each does not return a copy of its Function parameter, since parallelization may not permit efficient state accumulation. — end note ]
template<class InputIterator, class Size, class Function>
InputIterator for_each_n(InputIterator first, Size n,
Function f);
Requires: Function shall meet the requirements of MoveConstructible. [ Note: Function need not meet the requirements of CopyConstructible. — end note ]

Requires: n >= 0.

Effects: Applies f to the result of dereferencing every iterator in the range [first,first + n) in order.

[ Note: If first satisfies the requirements of a mutable iterator, f may apply nonconstant functions through the dereferenced iterator. — end note ]

Returns: first + n.

Remarks: If f returns a result, the result is ignored.
template<class ExecutionPolicy,
class InputIterator, class Size, class Function>
InputIterator for_each_n(ExecutionPolicy&& exec,
InputIterator first, Size n,
Function f);
Requires: Function shall meet the requirements of CopyConstructible.

Requires: n >= 0.

Effects: Applies f to the result of dereferencing every iterator in the range [first,first + n).

[ Note: If first satisfies the requirements of a mutable iterator, f may apply nonconstant functions through the dereferenced iterator. — end note ]

Returns: first + n.

Remarks: If f returns a result, the result is ignored.
Insert the following entry to Table 113:
26.2 | Definitions |
Insert the following subclause to Clause 26:
Define GENERALIZED_SUM(op, a1, ..., aN) as follows:

- a1 when N is 1
- op(GENERALIZED_SUM(op, b1, ..., bK), GENERALIZED_SUM(op, bM, ..., bN)) where b1, ..., bN may be any permutation of a1, ..., aN and 1 < K+1 = M ≤ N
Define GENERALIZED_NONCOMMUTATIVE_SUM(op, a1, ..., aN) as follows:

- a1 when N is 1
- op(GENERALIZED_NONCOMMUTATIVE_SUM(op, a1, ..., aK), GENERALIZED_NONCOMMUTATIVE_SUM(op, aM, ..., aN)) where 1 < K+1 = M ≤ N
Add reduce, exclusive_scan, inclusive_scan, transform_reduce, transform_exclusive_scan, and transform_inclusive_scan to Clause 26.7.
template<class InputIterator>
typename iterator_traits<InputIterator>::value_type
reduce(InputIterator first, InputIterator last);
Effects: Equivalent to return reduce(first, last, typename iterator_traits<InputIterator>::value_type{}).
template<class InputIterator, class T>
T reduce(InputIterator first, InputIterator last, T init);
Effects: Equivalent to return reduce(first, last, init, plus<>()).
template<class InputIterator, class T, class BinaryOperation>
T reduce(InputIterator first, InputIterator last, T init,
BinaryOperation binary_op);
Returns: GENERALIZED_SUM(binary_op, init, *i, ...) for every i in [first,last).

Requires: binary_op shall neither invalidate iterators or subranges, nor modify elements in the range [first,last).

Complexity: O(last - first) applications of binary_op.

[ Note: The primary difference between reduce and accumulate is that reduce applies binary_op in an unspecified order, which yields a non-deterministic result for non-associative or non-commutative binary_op such as floating-point addition. — end note ]
template<class InputIterator, class OutputIterator, class T>
OutputIterator exclusive_scan(InputIterator first, InputIterator last,
OutputIterator result,
T init);
Effects: Equivalent to return exclusive_scan(first, last, result, init, plus<>()).
template<class InputIterator, class OutputIterator, class T, class BinaryOperation>
OutputIterator exclusive_scan(InputIterator first, InputIterator last,
OutputIterator result,
T init, BinaryOperation binary_op);
Effects: Assigns through each iterator i in [result,result + (last - first)) the value of init, if i is result, otherwise GENERALIZED_NONCOMMUTATIVE_SUM(binary_op, init, *j, ...) for every j in [first,first + (i - result) - 1).

Returns: The end of the resulting range beginning at result.

Requires: binary_op shall neither invalidate iterators or subranges, nor modify elements in the ranges [first,last) or [result,result + (last - first)).

Complexity: O(last - first) applications of binary_op.

[ Note: The difference between exclusive_scan and inclusive_scan is that exclusive_scan excludes the ith input element from the ith sum. If binary_op is not mathematically associative, the behavior of exclusive_scan may be non-deterministic. — end note ]

Remarks: result may be equal to first.
template<class InputIterator, class OutputIterator>
OutputIterator inclusive_scan(InputIterator first, InputIterator last,
OutputIterator result);
Effects: Equivalent to return inclusive_scan(first, last, result, plus<>()).
template<class InputIterator, class OutputIterator, class BinaryOperation>
  OutputIterator inclusive_scan(InputIterator first, InputIterator last,
                                OutputIterator result,
                                BinaryOperation binary_op);

template<class InputIterator, class OutputIterator, class BinaryOperation, class T>
  OutputIterator inclusive_scan(InputIterator first, InputIterator last,
                                OutputIterator result,
                                BinaryOperation binary_op, T init);
Effects: Assigns through each iterator i in [result,result + (last - first)) the value of GENERALIZED_NONCOMMUTATIVE_SUM(binary_op, *j, ...) for every j in [first,first + (i - result)) or GENERALIZED_NONCOMMUTATIVE_SUM(binary_op, init, *j, ...) for every j in [first,first + (i - result)) if init is provided.

Returns: The end of the resulting range beginning at result.

Requires: binary_op shall not invalidate iterators or subranges, nor modify elements in the ranges [first,last) or [result,result + (last - first)).

Complexity: O(last - first) applications of binary_op.

Remarks: result may be equal to first.

[ Note: The difference between exclusive_scan and inclusive_scan is that inclusive_scan includes the ith input element in the ith sum. If binary_op is not mathematically associative, the behavior of inclusive_scan may be non-deterministic. — end note ]
template<class InputIterator, class UnaryOperation, class T, class BinaryOperation>
  T transform_reduce(InputIterator first, InputIterator last,
                     UnaryOperation unary_op, T init, BinaryOperation binary_op);

Returns: GENERALIZED_SUM(binary_op, init, unary_op(*i), ...) for every i in [first,last).

Requires: Neither unary_op nor binary_op shall invalidate subranges, or modify elements in the range [first,last).

Complexity: O(last - first) applications each of unary_op and binary_op.

[ Note: transform_reduce does not apply unary_op to init. — end note ]
template<class InputIterator, class OutputIterator,
class UnaryOperation,
class T, class BinaryOperation>
OutputIterator transform_exclusive_scan(InputIterator first, InputIterator last,
OutputIterator result,
UnaryOperation unary_op,
T init, BinaryOperation binary_op);
Effects: Assigns through each iterator i in [result,result + (last - first)) the value of init, if i is result, otherwise GENERALIZED_NONCOMMUTATIVE_SUM(binary_op, init, unary_op(*j), ...) for every j in [first,first + (i - result) - 1).

Returns: The end of the resulting range beginning at result.

Requires: Neither unary_op nor binary_op shall invalidate iterators or subranges, or modify elements in the ranges [first,last) or [result,result + (last - first)).

Complexity: O(last - first) applications each of unary_op and binary_op.

Remarks: result may be equal to first.

[ Note: The difference between transform_exclusive_scan and transform_inclusive_scan is that transform_exclusive_scan excludes the ith input element from the ith sum. If binary_op is not mathematically associative, the behavior of transform_exclusive_scan may be non-deterministic. transform_exclusive_scan does not apply unary_op to init. — end note ]
template<class InputIterator, class OutputIterator,
         class UnaryOperation,
         class BinaryOperation>
  OutputIterator transform_inclusive_scan(InputIterator first, InputIterator last,
                                          OutputIterator result,
                                          UnaryOperation unary_op,
                                          BinaryOperation binary_op);

template<class InputIterator, class OutputIterator,
         class UnaryOperation,
         class BinaryOperation, class T>
  OutputIterator transform_inclusive_scan(InputIterator first, InputIterator last,
                                          OutputIterator result,
                                          UnaryOperation unary_op,
                                          BinaryOperation binary_op, T init);
Effects: Assigns through each iterator i in [result,result + (last - first)) the value of GENERALIZED_NONCOMMUTATIVE_SUM(binary_op, unary_op(*j), ...) for every j in [first,first + (i - result)) or GENERALIZED_NONCOMMUTATIVE_SUM(binary_op, init, unary_op(*j), ...) for every j in [first,first + (i - result)) if init is provided.

Returns: The end of the resulting range beginning at result.

Requires: Neither unary_op nor binary_op shall invalidate iterators or subranges, or modify elements in the ranges [first,last) or [result,result + (last - first)).

Complexity: O(last - first) applications each of unary_op and binary_op.

Remarks: result may be equal to first.

[ Note: The difference between transform_exclusive_scan and transform_inclusive_scan is that transform_inclusive_scan includes the ith input element in the ith sum. If binary_op is not mathematically associative, the behavior of transform_inclusive_scan may be non-deterministic. transform_inclusive_scan does not apply unary_op to init. — end note ]