Parallel primitives and algorithms

Reductions

An instance of vex::Reductor<T, ReduceOP=vex::SUM> allows one to reduce an arbitrary vector expression to a single value of type T. Supported reduction operations are vex::SUM, vex::MIN, and vex::MAX. Reductor objects are steteful – they keep small temporary buffers on compute devices and receive a list of command queues at construction.

In the following example an inner product of two vectors is computed:

vex::Reductor<double, vex::SUM> sum(ctx);
double s = sum(x * y);

Also, see Random number generation for an example of estimating value of \(\pi\) with the Monte Carlo method.

Reduce operations may be combined with the vex::CombineReductors class. This way several reduction operations will be fused into single compute kernel. The operations should return the same scalar type, and the result of the combined reduction operation will be appropriately sized OpenCL/CUDA vector type.

In the following example minimum and maximum values of the vector are computed at the same time:

vex::Reductor<double, vex::CombineReductors<vex::MIN, vex::MAX>> minmax(ctx);
cl_double2 m = minmax(x);
std::cout << "min(x) = " << m.s[0] << std::endl;
std::cout << "max(x) = " << m.s[1] << std::endl;

In fact, the operation is so common, that VexCL provides a convenience typedef vex::MIN_MAX.

template <typename ScalarType, class RDC = SUM>
class vex::Reductor

Parallel reduction of arbitrary expression.

Reduction uses small temporary buffer on each device present in the queue parameter. One Reductor class for each reduction kind is enough per thread of execution.

Public Functions

Reductor(const std::vector<backend::command_queue> &queue = current_context ().queue())

Constructor.

template <class Expr>
auto operator()(const Expr &expr) const

Compute reduction of a vector expression.

template <class Expr>
std::array<result_type, N> operator()(const Expr &expr) const

Compute reduction of a multivector expression.

struct vex::SUM

Summation.

Subclassed by vex::SUM_Kahan

struct vex::MIN

Minimum element.

struct vex::MAX

Maximum element.

template <class... R>
struct vex::CombineReductors

Combines several reduce operations.

typedef CombineReductors<MIN, MAX> vex::MIN_MAX

Combined MIN and MAX operation.

Sparse matrix-vector products

One of the most common operations in linear algebra is the matrix-vector product. An instance of vex::SpMat class holds a representation of a sparse matrix. Its constructor accepts a sparse matrix in common CRS format. In the example below a vex::SpMat is constructed from an Eigen sparse matrix:

Eigen::SparseMatrix<double, Eigen::RowMajor, int> E;

vex::SpMat<double, int> A(ctx, E.rows(), E.cols(),
    E.outerIndexPtr(), E.innerIndexPtr(), E.valuesPtr());

Matrix-vector products may be used in vector expressions. The only restriction is that the expressions have to be additive. This is due to the fact that the operation involves inter-device communication for multi-device contexts.

// Compute residual value for a system of linear equations:
Z = Y - A * X;

This restriction may be lifted for single-device contexts. In this case VexCL does not need to worry about inter-device communication. Hence, it is possible to inline matrix-vector product into a normal vector expression with the help of vex::make_inline():

residual = sum(Y - vex::make_inline(A * X));
Z = sin(vex::make_inline(A * X));
template <typename val_t, typename col_t = size_t, typename idx_t = size_t>
class vex::SpMat

Sparse matrix in hybrid ELL-CSR format.

Public Functions

SpMat()

Empty constructor.

SpMat(const std::vector<backend::command_queue> &queue, size_t n, size_t m, const idx_t *row, const col_t *col, const val_t *val)

Constructor.

Constructs GPU representation of the \(n \times m\)matrix. Input matrix is in CSR format. GPU matrix utilizes ELL format and is split equally across all compute devices.

size_t rows() const

Number of rows.

size_t cols() const

Number of columns.

size_t nonzeros() const

Number of non-zero entries.

template <class MVProdExpr>
auto vex::make_inline(const MVProdExpr &expr)

Inlines a sparse matrix - vector product.

When applied to a matrix-vector product, the product becomes inlineable. That is, it may be used in any vector expression (not just additive expressions). The user has to guarantee the function is only used in single-device expressions.

Example:

// Get maximum residual value:
eps = sum( fabs(f - vex::make_inline(A * x)) );

Sort, scan, reduce-by-key algorithms

VexCL provides several standalone parallel primitives that may not be used as part of a vector expression. These are vex::inclusive_scan_by_key(), vex::exclusive_scan_by_key(), vex::sort(), vex::sort_by_key(), vex::reduce_by_key(). All of these functions take VexCL vectors both as input and output parameters.

Sort and scan functions take an optional function object used for comparison and summing of elements. The functor should provide the same interface as, e.g. std::less for sorting or std::plus for summing; additionally, it should provide a VexCL function for device-side operations.

Here is an example of such an object comparing integer elements in such a way that even elements precede odd ones:

template <typename T>
struct even_first {
    #define BODY                        \
        char bit1 = 1 & a;              \
        char bit2 = 1 & b;              \
        if (bit1 == bit2) return a < b; \
        return bit1 < bit2;

    // Device version.
    VEX_FUNCTION(bool, device, (int, a)(int, b), BODY);

    // Host version.
    bool operator()(int a, int b) const { BODY }

    #undef BODY
};

Same functor could be created with the help of VEX_DUAL_FUNCTOR macro, which takes return type, sequence of arguments (similar to the VEX_FUNCTION), and the body of the functor:

template <typename T>
struct even_first {
    VEX_DUAL_FUNCTOR(bool, (T, a)(T, b),
        char bit1 = 1 & a;
        char bit2 = 1 & b;
        if (bit1 == bit2) return a < b;
        return bit1 < bit2;
    )
};

Note that VexCL already provides vex::less<T>, vex::less_equal<T>, vex::greater<T>, vex::greater_equal<T>, and vex::plus<T>.

The need to provide both host-side and device-side parts of the functor comes from the fact that multidevice vectors are first sorted partially on each of the compute devices and then merged on the host.

Sorting algorithms may also take tuples of keys/values (in fact, any Boost.Fusion sequence will do). One will have to explicitly specify the comparison functor in this case. Both host and device variants of the comparison functor should take 2n arguments, where n is the number of keys. The first n arguments correspond to the left set of keys, and the second n arguments correspond to the right set of keys. Here is an example that sorts values by a tuple of two keys:

vex::vector<int>    keys1(ctx, n);
vex::vector<float>  keys2(ctx, n);
vex::vector<double> vals (ctx, n);

struct {
    VEX_FUNCTION(bool, device, (int, a1)(float, a2)(int, b1)(float, b2),
            return (a1 == b1) ? (a2 < b2) : (a1 < b1);
            );
    bool operator()(int a1, float a2, int b1, float b2) const {
        return std::make_tuple(a1, a2) < std::make_tuple(b1, b2);
    }
} comp;

vex::sort_by_key(std::tie(keys1, keys2), vals, comp);
template <typename T, class Oper>
void vex::inclusive_scan(vector<T> const &input, vector<T> &output, T init, Oper oper)

Inclusive scan.

template <typename T, class Oper>
void vex::exclusive_scan(vector<T> const &input, vector<T> &output, T init, Oper oper)

Exclusive scan.

template <class K, typename V, class Comp, class Oper>
void vex::inclusive_scan_by_key(K &&keys, const vector<V> &ivals, vector<V> &ovals, Comp comp, Oper oper, V init = V())

Inclusive scan by key.

template <class K, typename V, class Comp, class Oper>
void vex::exclusive_scan_by_key(K &&keys, const vector<V> &ivals, vector<V> &ovals, Comp comp, Oper oper, V init = V())

Exclusive scan by key.

template <class K, class Comp>
void vex::sort(K &&keys, Comp comp)

Sorts the vector into ascending order.

template <class K, class V, class Comp>
void vex::sort_by_key(K &&keys, V &&vals, Comp comp)

Sorts the elements in keys and values into ascending key order.

VEX_DUAL_FUNCTOR(type, args, ...) VEX_FUNCTION(type, device, args, __VA_ARGS__); \ VEX_DUAL_FUNCTOR_SINK(type, \ BOOST_PP_SEQ_SIZE(VEXCL_FUNCTION_MAKE_SEQ(args)), \ VEXCL_FUNCTION_MAKE_SEQ(args), __VA_ARGS__)

Defines both device and host versions of a function call operator.

The intended use is the creation of comparison and reduction functors for use with scan/sort/reduce algorithms.

Example:

template <typename T>
struct less {
    VEX_DUAL_FUNCTOR(bool, (T, a)(T, b),
        return a < b;
        )
};

template <typename T>
struct vex::less

Function object class for less-than inequality comparison.

The need for host-side and device-side parts comes from the fact that vectors are partially sorted on device and then final merge step is done on host.

Inherits from std::less< T >

template <typename T>
struct vex::less_equal

Function object class for less-than-or-equal inequality comparison.

Inherits from std::less_equal< T >

template <typename T>
struct vex::greater

Function object class for greater-than inequality comparison.

Inherits from std::greater< T >

template <typename T>
struct vex::greater_equal

Function object class for greater-than-or-equal inequality comparison.

Inherits from std::greater_equal< T >

template <typename T>
struct vex::plus

Binary function object class whose call returns the result of adding its two arguments.

Inherits from std::plus< T >