Code generator for high-performance embedding look-up kernels, supporting Sum, WeightedSum, and Mean reducers

Summary:
Code generator for high-performance embedding look-up kernels, supporting
Sum, WeightedSum, and Mean reducers.
Achieves at least a 1.5x speedup for float and over a 2x speedup for float16,
compared to the existing code. The results below were collected on Broadwell
using the sparse_lengths_sum_benchmark.par benchmark.
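
For context, here is a minimal scalar sketch of the reduction these kernels
compute (illustrative only: the function name and signature are hypothetical,
and the generated kernels are specialized, vectorized variants of this loop,
not this code):

#include <cstdint>

// Hypothetical scalar reference for SparseLengthsSum-style lookups.
// out[i] = reduce(data[indices[j]] for j in segment i), where segment i
// spans lengths[i] consecutive entries of `indices`.
void EmbeddingLookupRef(
    int64_t block_size,        // embedding (row) dimension
    int64_t output_size,       // number of output segments
    const float* data,         // [num_rows, block_size] lookup table
    const int64_t* indices,    // row ids, sum(lengths) entries in total
    const int* lengths,        // per-segment counts
    const float* weights,      // nullptr => Sum; per-index => WeightedSum
    bool normalize_by_lengths, // true => Mean
    float* out) {              // [output_size, block_size]
  int64_t current = 0;
  for (int64_t i = 0; i < output_size; ++i) {
    float* dst = out + i * block_size;
    for (int64_t k = 0; k < block_size; ++k) {
      dst[k] = 0.f;
    }
    for (int j = 0; j < lengths[i]; ++j, ++current) {
      const float w = weights ? weights[current] : 1.f;
      const float* src = data + indices[current] * block_size;
      for (int64_t k = 0; k < block_size; ++k) {
        dst[k] += w * src[k];
      }
    }
    if (normalize_by_lengths && lengths[i] > 0) {
      for (int64_t k = 0; k < block_size; ++k) {
        dst[k] /= lengths[i];
      }
    }
  }
}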

Old
==============
[root@fblearner001.01.ftw1 /home/msmelyan]# numactl -m 0 -C 0 ./sparse_lengths_sum_benchmark.par  --iteration 10000
Preparing lookup table. 2017-08-08 00:10:23.101848
Preparation finished. 2017-08-08 00:10:27.955680
I0808 00:10:27.955732 30700 net.cc:177] Starting benchmark.
I0808 00:10:27.955759 30700 net.cc:178] Running warmup runs.
I0808 00:10:27.956367 30700 net.cc:188] Main runs.
I0808 00:10:31.839035 30700 net.cc:199] Main run finished. Milliseconds per iter: 0.388264. Iters per second: 2575.56
I0808 00:10:35.704169 30700 net.cc:233] Operator #0 (indices, Python) 0.0583264 ms/iter
I0808 00:10:35.704210 30700 net.cc:233] Operator #1 (Y, SparseLengthsSum) 0.327694 ms/iter
I0808 00:10:35.704213 30700 net.cc:237] Time per operator type:
I0808 00:10:35.704217 30700 net.cc:246]        0.327694 SparseLengthsSum
I0808 00:10:35.704221 30700 net.cc:246]       0.0583264 Python
[root@fblearner001.01.ftw1 /home/msmelyan]# numactl -m 0 -C 0 ./sparse_lengths_sum_benchmark.par  --iteration 10000 --dtype float16
Preparing lookup table. 2017-08-08 00:10:59.047159
Preparation finished. 2017-08-08 00:11:05.140565
I0808 00:11:05.140612 31725 net.cc:177] Starting benchmark.
I0808 00:11:05.140635 31725 net.cc:178] Running warmup runs.
I0808 00:11:05.141104 31725 net.cc:188] Main runs.
I0808 00:11:08.371510 31725 net.cc:199] Main run finished. Milliseconds per iter: 0.323039. Iters per second: 3095.6
I0808 00:11:11.671450 31725 net.cc:233] Operator #0 (indices, Python) 0.0609876 ms/iter
I0808 00:11:11.671489 31725 net.cc:233] Operator #1 (Y, SparseLengthsSum) 0.26856 ms/iter
I0808 00:11:11.671494 31725 net.cc:237] Time per operator type:
I0808 00:11:11.671497 31725 net.cc:246]         0.26856 SparseLengthsSum
I0808 00:11:11.671500 31725 net.cc:246]       0.0609876 Python

New (Misha's)
==============
[root@fblearner001.01.ftw1 /home/msmelyan]# numactl -m 0 -C 0 ./sparse_lengths_sum_benchmark.par  --iteration 10000
Preparing lookup table. 2017-08-07 23:44:55.897748
Preparation finished. 2017-08-07 23:45:00.708896
I0807 23:45:00.708945 4178361 net.cc:177] Starting benchmark.
I0807 23:45:00.708971 4178361 net.cc:178] Running warmup runs.
I0807 23:45:00.709444 4178361 net.cc:188] Main runs.
I0807 23:45:03.608551 4178361 net.cc:199] Main run finished. Milliseconds per iter: 0.289909. Iters per second: 3449.36
I0807 23:45:06.536182 4178361 net.cc:233] Operator #0 (indices, Python) 0.0572399 ms/iter
I0807 23:45:06.536224 4178361 net.cc:233] Operator #1 (Y, SparseLengthsSum) 0.23512 ms/iter
I0807 23:45:06.536228 4178361 net.cc:237] Time per operator type:
I0807 23:45:06.536232 4178361 net.cc:246]         0.23512 SparseLengthsSum
I0807 23:45:06.536236 4178361 net.cc:246]       0.0572399 Python
[root@fblearner001.01.ftw1 /home/msmelyan]# numactl -m 0 -C 0 ./sparse_lengths_sum_benchmark.par  --iteration 10000 --dtype float16
Preparing lookup table. 2017-08-07 23:45:17.191579
Preparation finished. 2017-08-07 23:45:23.173668
I0807 23:45:23.173715 4179316 net.cc:177] Starting benchmark.
I0807 23:45:23.173743 4179316 net.cc:178] Running warmup runs.
I0807 23:45:23.174090 4179316 net.cc:188] Main runs.
I0807 23:45:24.939749 4179316 net.cc:199] Main run finished. Milliseconds per iter: 0.176564. Iters per second: 5663.67
I0807 23:45:26.698885 4179316 net.cc:233] Operator #0 (indices, Python) 0.0557303 ms/iter
I0807 23:45:26.698923 4179316 net.cc:233] Operator #1 (Y, SparseLengthsSum) 0.119794 ms/iter
I0807 23:45:26.698927 4179316 net.cc:237] Time per operator type:
I0807 23:45:26.698931 4179316 net.cc:246]        0.119794 SparseLengthsSum
I0807 23:45:26.698935 4179316 net.cc:246]       0.0557303 Python
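
For reference, the per-operator speedups implied by the logs above:
SparseLengthsSum drops from 0.327694 to 0.23512 ms/iter (~1.4x) for float
and from 0.26856 to 0.119794 ms/iter (~2.2x) for float16 on this
Broadwell host.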

Reviewed By: salexspb

Differential Revision: D5582172

fbshipit-source-id: d71f5a55580b734a51b8f30852b75f379acfdaf2