Optimize SumSqrElementsOp for CUDA

Summary: The old version launched a single block of 128 threads, so throughput was too low for the NMT use case (calculating squared gradient norms for every parameter). This change increases the throughput and shaves 7% off CNN model training time per step.
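
For readers unfamiliar with the pattern, here is a minimal sketch of one common way to spread a sum-of-squares reduction over many blocks. It is not the code from this diff: the kernel name `PartialSumKernel`, the host helper `SumSqrOnDevice`, and the cap of 256 blocks are hypothetical choices made for illustration, and error checking is omitted for brevity.

```cuda
#include <cuda_runtime.h>
#include <vector>

constexpr int kThreads = 128;    // threads per block (must be a power of two here)
constexpr int kMaxBlocks = 256;  // assumed cap on the grid size, chosen arbitrarily

// Each block walks the input with a grid-stride loop, reduces its values in
// shared memory, and writes one partial sum. With square=false the same
// kernel just sums its input, so it can also combine the partial sums.
__global__ void PartialSumKernel(const float* x, int n, bool square,
                                 float* partial) {
  __shared__ float buf[kThreads];
  float acc = 0.0f;
  for (int i = blockIdx.x * blockDim.x + threadIdx.x; i < n;
       i += blockDim.x * gridDim.x) {
    acc += square ? x[i] * x[i] : x[i];
  }
  buf[threadIdx.x] = acc;
  __syncthreads();
  // Tree reduction within the block.
  for (int s = blockDim.x / 2; s > 0; s >>= 1) {
    if (threadIdx.x < s) buf[threadIdx.x] += buf[threadIdx.x + s];
    __syncthreads();
  }
  if (threadIdx.x == 0) partial[blockIdx.x] = buf[0];
}

// Hypothetical host helper: copies the data to the device, runs the two
// stages, and returns the sum of squared elements.
float SumSqrOnDevice(const std::vector<float>& host) {
  const int n = static_cast<int>(host.size());
  if (n == 0) return 0.0f;
  int blocks = (n + kThreads - 1) / kThreads;
  if (blocks > kMaxBlocks) blocks = kMaxBlocks;

  float* d_x = nullptr;
  float* d_partial = nullptr;
  float* d_out = nullptr;
  cudaMalloc(&d_x, n * sizeof(float));
  cudaMalloc(&d_partial, blocks * sizeof(float));
  cudaMalloc(&d_out, sizeof(float));
  cudaMemcpy(d_x, host.data(), n * sizeof(float), cudaMemcpyHostToDevice);

  // Stage 1: many blocks, one partial sum of squares per block.
  PartialSumKernel<<<blocks, kThreads>>>(d_x, n, true, d_partial);
  // Stage 2: a single block adds up the partial sums.
  PartialSumKernel<<<1, kThreads>>>(d_partial, blocks, false, d_out);

  float result = 0.0f;
  cudaMemcpy(&result, d_out, sizeof(float), cudaMemcpyDeviceToHost);
  cudaFree(d_x);
  cudaFree(d_partial);
  cudaFree(d_out);
  return result;
}
```

Launching stage 1 with a single block would reproduce the shape of the old behavior described above (one block of 128 threads); allowing more blocks lets more of the GPU take part in the reduction.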

Reviewed By: wickedfoo

Differential Revision: D5263748

fbshipit-source-id: adc3bacd11e49ea00c60381d613d993050e899be
2 files changed
tree: b059af9ba0756fa757dc1a49aa95904020629167
  1. .travis/
  2. caffe/
  3. caffe2/
  4. cmake/
  5. docs/
  6. scripts/
  7. third_party/
  8. .Doxyfile
  9. .Doxyfile-c
  10. .Doxyfile-python
  11. .gitignore
  12. .gitmodules
  13. .travis.yml
  14. appveyor.yml
  15. CMakeLists.txt
  16. LICENSE
  17. Makefile
  18. PATENTS
  19. README.md
  20. release-notes.md
README.md

Caffe2

Caffe2 is a lightweight, modular, and scalable deep learning framework. Building on the original Caffe, Caffe2 is designed with expression, speed, and modularity in mind.

Events

Caffe2 Bay Area Meetup at NVIDIA, May 31 6-8:30, Santa Clara, CA: https://www.meetup.com/Caffe2-Bay-Area/events/239836290/

User Groups

Caffe2 Community Facebook Group: join to ask questions, talk to other users, and keep informed of important Caffe2 updates.

Questions and Feedback

Please use GitHub issues (https://github.com/caffe2/caffe2/issues) to ask questions, report bugs, and request new features.

Please participate in our survey (https://www.surveymonkey.com/r/caffe2). We will send you information about new releases and special developer events/webinars.

License and Citation

Caffe2 is released under the BSD 2-Clause license.

Further Resources on Caffe2.ai