Optimized indexSelect kernel for contiguous inputs

Adds an optimized indexSelect kernel for contiguous inputs indexed
along the first dimension. This removes the need for a separate CUDA
LookupTable forward pass, which can use indexSelect instead.
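
For illustration, here is a CPU reference sketch of the access pattern
this fast path relies on. It is not the kernel from this change. The
function name indexSelectDim0, the row-major [numRows x rowSize] view,
and the 0-based indices are assumptions made for the example. Because
the source is contiguous and indexed along the first dimension, each
selected index maps to one contiguous row, so every output element can
be found from a single flat offset:

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// CPU reference for indexSelect along dimension 0 of a contiguous tensor.
// Assumptions (not taken from the actual kernel): src is viewed as a
// row-major [numRows x rowSize] matrix and indices are 0-based.
std::vector<float> indexSelectDim0(const std::vector<float>& src,
                                   std::size_t rowSize,
                                   const std::vector<std::int64_t>& indices) {
  assert(rowSize > 0 && src.size() % rowSize == 0);
  const std::size_t numRows = src.size() / rowSize;
  std::vector<float> dst(indices.size() * rowSize);

  // One flat loop over output elements. A GPU version could hand each
  // value of `linear` to its own thread, so neighbouring threads would
  // read and write neighbouring addresses within a row.
  for (std::size_t linear = 0; linear < dst.size(); ++linear) {
    const std::size_t dstRow = linear / rowSize;  // which index entry
    const std::size_t col = linear % rowSize;     // offset inside the row
    const std::int64_t srcRow = indices[dstRow];
    assert(srcRow >= 0 && static_cast<std::size_t>(srcRow) < numRows);
    dst[linear] = src[static_cast<std::size_t>(srcRow) * rowSize + col];
  }
  return dst;
}
```

A LookupTable forward pass has the same shape: src is the weight
matrix, rowSize is the embedding width, and indices are the input
tokens, which is why it can be expressed as indexSelect.
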
1 file changed