Parallel primitives and algorithms¶
Reductions¶
An instance of vex::Reductor<T, ReduceOP=vex::SUM>
allows one to
reduce an arbitrary vector expression to a single value of type T. Supported
reduction operations are vex::SUM
, vex::MIN
, and
vex::MAX
. Reductor objects are steteful – they keep small
temporary buffers on compute devices and receive a list of command queues at
construction.
In the following example an inner product of two vectors is computed:
vex::Reductor<double, vex::SUM> sum(ctx);
double s = sum(x * y);
Also, see Random number generation for an example of estimating value of \(\pi\) with the Monte Carlo method.
Reduce operations may be combined with the
vex::CombineReductors
class. This way several
reduction operations will be fused into single compute kernel. The operations
should return the same scalar type, and the result of the combined reduction
operation will be appropriately sized OpenCL/CUDA vector type.
In the following example minimum and maximum values of the vector are computed at the same time:
vex::Reductor<double, vex::CombineReductors<vex::MIN, vex::MAX>> minmax(ctx);
cl_double2 m = minmax(x);
std::cout << "min(x) = " << m.s[0] << std::endl;
std::cout << "max(x) = " << m.s[1] << std::endl;
In fact, the operation is so common, that VexCL provides a convenience typedef
vex::MIN_MAX
.
-
template<typename ScalarType, class RDC = SUM>
class vex::Reductor¶ Parallel reduction of arbitrary expression.
Reduction uses small temporary buffer on each device present in the queue parameter. One Reductor class for each reduction kind is enough per thread of execution.
Public Functions
-
inline Reductor(const std::vector<backend::command_queue> &queue = current_context().queue())¶
Constructor.
-
inline Reductor(const std::vector<backend::command_queue> &queue = current_context().queue())¶
-
struct SUM¶
Summation.
Subclassed by vex::SUM_Kahan
-
struct MIN¶
Minimum element.
-
struct MAX¶
Maximum element.
-
template<class ...R>
struct CombineReductors¶ Combines several reduce operations.
Sparse matrix-vector products¶
One of the most common operations in linear algebra is the matrix-vector
product. An instance of vex::SpMat
class holds a representation of
a sparse matrix. Its constructor accepts a sparse matrix in common CRS format.
In the example below a vex::SpMat is constructed from an Eigen sparse
matrix:
Eigen::SparseMatrix<double, Eigen::RowMajor, int> E;
vex::SpMat<double, int> A(ctx, E.rows(), E.cols(),
E.outerIndexPtr(), E.innerIndexPtr(), E.valuePtr());
Matrix-vector products may be used in vector expressions. The only restriction is that the expressions have to be additive. This is due to the fact that the operation involves inter-device communication for multi-device contexts.
// Compute residual value for a system of linear equations:
Z = Y - A * X;
This restriction may be lifted for single-device contexts. In this case VexCL
does not need to worry about inter-device communication. Hence, it is possible
to inline matrix-vector product into a normal vector expression with the help of
vex::make_inline()
:
residual = sum(Y - vex::make_inline(A * X));
Z = sin(vex::make_inline(A * X));
-
template<typename val_t, typename col_t = size_t, typename idx_t = size_t>
class vex::SpMat¶ Sparse matrix in hybrid ELL-CSR format.
Public Functions
-
inline SpMat()¶
Empty constructor.
-
inline SpMat(const std::vector<backend::command_queue> &queue, size_t n, size_t m, const idx_t *row, const col_t *col, const val_t *val)¶
Constructor.
Constructs GPU representation of the \(n \times m\)matrix. Input matrix is in CSR format. GPU matrix utilizes ELL format and is split equally across all compute devices.
-
inline size_t rows() const¶
Number of rows.
-
inline size_t cols() const¶
Number of columns.
-
inline size_t nonzeros() const¶
Number of non-zero entries.
-
inline SpMat()¶
-
template<class MVProdExpr>
auto vex::make_inline(const MVProdExpr &expr)¶ Inlines a sparse matrix - vector product.
When applied to a matrix-vector product, the product becomes inlineable. That is, it may be used in any vector expression (not just additive expressions). The user has to guarantee the function is only used in single-device expressions.
Example:
// Get maximum residual value: eps = sum( fabs(f - vex::make_inline(A * x)) );
Sort, scan, reduce-by-key algorithms¶
VexCL provides several standalone parallel primitives that may not be used as
part of a vector expression. These are vex::inclusive_scan_by_key()
,
vex::exclusive_scan_by_key()
, vex::sort()
,
vex::sort_by_key()
, vex::reduce_by_key()
. All of these
functions take VexCL vectors both as input and output parameters.
Sort and scan functions take an optional function object used for comparison
and summing of elements. The functor should provide the same interface as, e.g.
std::less
for sorting or std::plus
for summing; additionally, it should
provide a VexCL function for device-side operations.
Here is an example of such an object comparing integer elements in such a way that even elements precede odd ones:
template <typename T>
struct even_first {
#define BODY \
char bit1 = 1 & a; \
char bit2 = 1 & b; \
if (bit1 == bit2) return a < b; \
return bit1 < bit2;
// Device version.
VEX_FUNCTION(bool, device, (int, a)(int, b), BODY);
// Host version.
bool operator()(int a, int b) const { BODY }
#undef BODY
};
Same functor could be created with the help of VEX_DUAL_FUNCTOR
macro, which takes return type, sequence of arguments (similar to the
VEX_FUNCTION
), and the body of the functor:
template <typename T>
struct even_first {
VEX_DUAL_FUNCTOR(bool, (T, a)(T, b),
char bit1 = 1 & a;
char bit2 = 1 & b;
if (bit1 == bit2) return a < b;
return bit1 < bit2;
)
};
Note that VexCL already provides vex::less<T>
,
vex::less_equal<T>
, vex::greater<T>
,
vex::greater_equal<T>
, and vex::plus<T>
.
The need to provide both host-side and device-side parts of the functor comes from the fact that multidevice vectors are first sorted partially on each of the compute devices and then merged on the host.
Sorting algorithms may also take tuples of keys/values (in fact, any
Boost.Fusion sequence will do). One will have to explicitly specify the
comparison functor in this case. Both host and device variants of the
comparison functor should take 2n
arguments, where n
is the number of
keys. The first n
arguments correspond to the left set of keys, and the
second n
arguments correspond to the right set of keys. Here is an example
that sorts values by a tuple of two keys:
vex::vector<int> keys1(ctx, n);
vex::vector<float> keys2(ctx, n);
vex::vector<double> vals (ctx, n);
struct {
VEX_FUNCTION(bool, device, (int, a1)(float, a2)(int, b1)(float, b2),
return (a1 == b1) ? (a2 < b2) : (a1 < b1);
);
bool operator()(int a1, float a2, int b1, float b2) const {
return std::make_tuple(a1, a2) < std::make_tuple(b1, b2);
}
} comp;
vex::sort_by_key(std::tie(keys1, keys2), vals, comp);
-
template<typename T, class Oper>
void vex::inclusive_scan(vector<T> const &input, vector<T> &output, T init, Oper oper)¶ Inclusive scan.
-
template<typename T, class Oper>
void vex::exclusive_scan(vector<T> const &input, vector<T> &output, T init, Oper oper)¶ Exclusive scan.
-
template<class K, typename V, class Comp, class Oper>
void vex::inclusive_scan_by_key(K &&keys, const vector<V> &ivals, vector<V> &ovals, Comp comp, Oper oper, V init = V())¶ Inclusive scan by key.
-
template<class K, typename V, class Comp, class Oper>
void vex::exclusive_scan_by_key(K &&keys, const vector<V> &ivals, vector<V> &ovals, Comp comp, Oper oper, V init = V())¶ Exclusive scan by key.
-
template<class K, class Comp>
void vex::sort(K &&keys, Comp comp)¶ Sorts the vector into ascending order.
-
template<class K, class V, class Comp>
void vex::sort_by_key(K &&keys, V &&vals, Comp comp)¶ Sorts the elements in keys and values into ascending key order.
-
VEX_DUAL_FUNCTOR(type, args, ...)¶
Defines both device and host versions of a function call operator.
The intended use is the creation of comparison and reduction functors for use with scan/sort/reduce algorithms.
Example:
template <typename T> struct less { VEX_DUAL_FUNCTOR(bool, (T, a)(T, b), return a < b; ) };
-
template<typename T>
struct less : public std::less<T>¶ Function object class for less-than inequality comparison.
The need for host-side and device-side parts comes from the fact that vectors are partially sorted on device and then final merge step is done on host.
-
template<typename T>
struct less_equal : public std::less_equal<T>¶ Function object class for less-than-or-equal inequality comparison.
-
template<typename T>
struct greater : public std::greater<T>¶ Function object class for greater-than inequality comparison.