Added single-dimensional sum for CudaTensor

This adds summation of the form y:sum(x,1). It is supported for up to
4-dimensional tensors in a single kernel call. Tensors with more
dimensions could be supported, if needed, by looping over this kernel.

Internally, two generic reduction kernels are used: one reduces the
innermost dimension, the other reduces one of the outer dimensions. In
either case, global memory accesses are fully coalesced.
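
As a rough illustration of the two cases, here is a minimal sketch. It is
not the code in this commit. It assumes a contiguous row-major float
tensor, and the names sumOuterDim, sumInnerDim, sumOuter and sumInner are
hypothetical. The outer-dimension kernel views the tensor as
[outer][reduce][inner] and gives each thread one output element, so
neighbouring threads read neighbouring addresses. The innermost-dimension
kernel assigns one block per row and lets the threads stride across it
before a shared-memory tree reduction.

```cuda
#include <cassert>
#include <vector>
#include <cuda_runtime.h>

// Hypothetical sketch: sum over an outer dimension of a tensor viewed as
// [outer][reduce][inner] (inner is fastest-varying). Thread i of a block
// reads address base + i, so each load in the loop is coalesced.
__global__ void sumOuterDim(const float* src, float* dst,
                            int reduce, int inner) {
  int i = blockIdx.x * blockDim.x + threadIdx.x;  // inner index
  int o = blockIdx.y;                             // outer index
  if (i >= inner) return;
  float acc = 0.0f;
  for (int r = 0; r < reduce; ++r)
    acc += src[(o * reduce + r) * inner + i];
  dst[o * inner + i] = acc;
}

// Hypothetical sketch: sum over the innermost dimension of [rows][cols].
// One block per row; threads stride across the row (coalesced reads),
// then combine partial sums in shared memory.
// Assumes blockDim.x is a power of two.
__global__ void sumInnerDim(const float* src, float* dst, int cols) {
  extern __shared__ float partial[];
  const float* row = src + (size_t)blockIdx.x * cols;
  float acc = 0.0f;
  for (int c = threadIdx.x; c < cols; c += blockDim.x) acc += row[c];
  partial[threadIdx.x] = acc;
  __syncthreads();
  for (int s = blockDim.x / 2; s > 0; s >>= 1) {
    if (threadIdx.x < s) partial[threadIdx.x] += partial[threadIdx.x + s];
    __syncthreads();
  }
  if (threadIdx.x == 0) dst[blockIdx.x] = partial[0];
}

// Host helper: copy input to the device, run `launch`, copy result back.
// Error checking is omitted to keep the sketch short.
template <typename Launch>
std::vector<float> runReduction(const std::vector<float>& in,
                                size_t outCount, Launch launch) {
  float *src = nullptr, *dst = nullptr;
  cudaMalloc(&src, in.size() * sizeof(float));
  cudaMalloc(&dst, outCount * sizeof(float));
  cudaMemcpy(src, in.data(), in.size() * sizeof(float),
             cudaMemcpyHostToDevice);
  launch(src, dst);
  std::vector<float> out(outCount);
  cudaMemcpy(out.data(), dst, outCount * sizeof(float),
             cudaMemcpyDeviceToHost);
  cudaFree(src);
  cudaFree(dst);
  return out;
}

// Sum over the middle axis of an [outer][reduce][inner] tensor.
// gridDim.y carries the outer index, so outer must fit the grid limit.
std::vector<float> sumOuter(const std::vector<float>& in,
                            int outer, int reduce, int inner) {
  const int threads = 128;
  return runReduction(in, (size_t)outer * inner,
                      [=](const float* src, float* dst) {
    dim3 grid((inner + threads - 1) / threads, outer);
    sumOuterDim<<<grid, threads>>>(src, dst, reduce, inner);
  });
}

// Sum over the last axis of a [rows][cols] tensor.
std::vector<float> sumInner(const std::vector<float>& in,
                            int rows, int cols) {
  const int threads = 128;  // power of two, as the kernel requires
  return runReduction(in, (size_t)rows,
                      [=](const float* src, float* dst) {
    sumInnerDim<<<rows, threads, threads * sizeof(float)>>>(src, dst, cols);
  });
}
```

Under these assumptions, a sum over the first dimension such as
y:sum(x,1) corresponds to sumOuter with outer = 1, and a sum over the
last dimension corresponds to sumInner with all leading dimensions
folded into rows.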
2 files changed