commit | b06c00047813a45c4a8c19b2eb471cc76a61e286 |
---|---|---
author | Adam Paszke <adam.paszke@gmail.com> | Tue Aug 16 21:02:46 2016 -0700
committer | Adam Paszke <adam.paszke@gmail.com> | Tue Aug 16 21:11:10 2016 -0700
tree | 9cb6bfef4f67f845d5f186b73e21288dcd90e8ee |
parent | eaa24dc7c8b8e218e7672eab8197d7a85834766b |
Fix <3.5 compatibility and travis configuration
The project is still under active development and is likely to change drastically over short periods of time. We will announce API changes and important developments via a newsletter and GitHub issues, and post links to those issues on Slack. Please remember that at this stage this is an invite-only closed alpha, so please don't distribute the code further. This lets us control development tightly and iterate rapidly during the initial phases with feedback from you.
```
pip install -r requirements.txt
pip install .
```
We will run the alpha releases weekly for 6 weeks. After that, we will reevaluate progress, and if we are ready, we will hit beta-0. If not, we will do another two weeks of alpha.
The beta phases will lean more towards working with all of you: covering your use cases and actively developing non-core aspects.
We've decided that it's time to rewrite/update parts of the old torch API, even if it means losing some backward compatibility (we can hack up a model converter that converts models correctly). This section lists the biggest changes and suggests how to shift from torch to pytorch.
For now there's no pytorch documentation. Since all currently implemented modules are very similar to the old ones, it's best to use the torch7 docs for now (keeping in mind the differences described below).
All core modules are merged into a single repository. Most of them will be rewritten and will be completely new (more on this below), but we're providing a Python version of the old packages under the torch.legacy namespace.
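For example, a minimal sketch of what using a legacy module might look like (the module and shapes here are illustrative assumptions, not taken from this commit):

```python
# A minimal sketch, assuming old torch7 packages are exposed as
# torch.legacy.nn etc. - module choice and shapes are illustrative.
import torch
import torch.legacy.nn as legacy_nn

m = legacy_nn.Linear(10, 5)            # old-style module with torch7 semantics
out = m.forward(torch.randn(2, 10))    # legacy modules are assumed to keep the explicit forward() call
```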
pytorch uses 0-based indexing everywhere. This includes arguments to index* functions and nn criterion weights.
Under the hood, on the C side, we've changed the logic in TH / THC / THNN / THCUNN to introduce a TH_INDEX_BASE compile-time definition that switches between 0- and 1-based indexing.
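For instance, a short sketch of what 0-based indexing means for `index_select` (the shapes and indices are illustrative assumptions):

```python
# A minimal sketch of 0-based indexing - shapes and indices are illustrative.
import torch

x = torch.randn(3, 4)
i = torch.LongTensor([0, 2])    # 0 now refers to the first row, 2 to the third
rows = x.index_select(0, i)     # the dimension argument is 0-based as well
```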
All methods operating on tensors are now out-of-place by default. This means that although `a.add(b)` used to have the side effect of mutating the elements of `a`, it now returns a new Tensor holding the result. All methods that mutate the Tensor/Storage are now marked with a trailing underscore (including `copy` -> `copy_`, `fill` -> `fill_`, `set` -> `set_`, etc.). Most math methods have in-place counterparts, so the equivalent of Lua's `a.add(b)` is now `a.add_(b)` (or `torch.add(a, a, b)`, which is not recommended in this case).
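A small sketch contrasting the two styles (tensor values are illustrative assumptions):

```python
# Out-of-place vs. in-place - a minimal sketch, values are illustrative.
import torch

a = torch.ones(3)
b = torch.ones(3)

c = a.add(b)    # out-of-place: a is left unchanged, c holds the result
a.add_(b)       # in-place: a is mutated, the trailing underscore marks the side effect
```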
All tensors have their CUDA counterparts in the torch.cuda module. There is no `torch.cuda.setDevice` anymore. By default the 0th device is always selected, but the code can be placed in a `with` statement to change it:
```python
with torch.cuda.device(1):
    a = torch.cuda.FloatTensor(10)  # a is allocated on GPU1
```
Calling `.cuda()` on a tensor no longer converts it to a GPU float tensor, but to a CUDA tensor of the same type, located on the currently selected device. So, for example:

```python
a = torch.LongTensor(10).cuda()  # a is a torch.cuda.LongTensor
```

Calling `.cuda(3)` will send it to device 3. `.cuda()` can also be used to transfer CUDA tensors between devices: calling it on a GPU tensor with a different device selected will copy it onto the current device.
```python
a = torch.LongTensor(10)
b = a.cuda()        # b is a torch.cuda.LongTensor placed on GPU0
c = a.cuda(2)       # c is a torch.cuda.LongTensor placed on GPU2
with torch.cuda.device(1):
    d = b.cuda()    # d is a copy of b, but on GPU1
    e = d.cuda()    # a no-op, d is already on the current GPU (e is d == True)
```
Also, setting the device is now only important for specifying where new Tensors should be allocated. You can perform operations on CUDA Tensors irrespective of the currently selected device (but all arguments have to be on the same device), and the result will be allocated on that device as well. See below for an example:
```python
a = torch.randn(2, 2).cuda()
b = torch.randn(2, 2).cuda()
with torch.cuda.device(1):
    c = a + b                       # c is on GPU0
    d = torch.randn(2, 2).cuda()    # d is on GPU1
```
In the near future, we also plan to use a CUDA allocator, which alleviates the problems caused by cudaMalloc/cudaFree being a synchronization point. This will help us avoid having to use buffers for every intermediate computation in a module when, for example, one wants to do multi-GPU training. See: https://github.com/torch/cutorch/pull/443
Because numpy is a core numerical package in Python, and is used by many other libraries like matplotlib, we've implemented a two-way bridge between pytorch and numpy.
```python
import numpy

a = torch.randn(2, 2)
b = a.numpy()               # b is a numpy array of the type corresponding to a
                            # no memory copy is performed - they share the same storage
c = numpy.zeros((5, 5))
d = torch.DoubleTensor(c)   # it's possible to construct Tensors from numpy arrays
                            # d shares memory with c - there's no copy
```
After looking at several framework designs, examining the current design of `nn`, and thinking through a few original design ideas, this is what we've converged on:
```python
class Network(nn.Container):
    def __init__(self):
        super(Network, self).__init__(
            conv1=nn.SpatialConvolution(3, 16, 3, 3, 1, 1),
            relu1=nn.ReLU(True),
            lstm=nn.LSTM(),
        )

    def __call__(self, input):
        y = self.conv1(input)
        y = self.relu1(y)
        y = self.lstm(y)
        return y

model = Network()
input = nn.Variable(torch.zeros(256, 3, 224, 224))
output = model(input)

loss = 0
for i in range(ITERS):
    input, target = ...
    # That's all you need for an RNN
    for t in range(TIMESTEPS):
        loss += loss_fn(model(input), target)
    loss.backward()
```
Proposed solutions need to address:
Rough solution:
```python
# This is an example of a network that has a data parallel part inside
#
# B is data parallel
#    +->A+-->B+-+
# +--+          +->D
#    +->C+------+
class Network(nn.Container):
    def __init__(self):
        super(Network, self).__init__(
            A = ...,
            B = GPUReplicate(B, [0, 1, 2, 3]),  # Copies the module onto a list of GPUs
            C = ...,
            D = ...
        )

    def __call__(self, x):
        a = self.A(x)
        c = self.C(x)
        a_split = Split(a)                  # a_split is a list of Tensors placed on different devices
        b = ParallelApply(self.B, a_split)  # self.B is a list-like object containing copies of B
        d_input = Join(b + [c])             # gathers Tensors on a single GPU
        return self.D(d_input)
```
Each module is assigned to a single GPU.
For Kernel Launch Latency:
For parameter reductions ASAP:
We plan to make it as straightforward as possible to use pytorch in a multiprocessing environment. For this, we plan to implement a .share() method for tensors that will enable them to be shared across processes seamlessly, so that standard python multiprocessing can be used.
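A purely hypothetical sketch of how this might look, assuming the planned (not yet existing) `.share()` method together with standard python multiprocessing:

```python
# Hypothetical sketch - .share() is a planned API and does not exist yet.
import torch
import multiprocessing as mp

def worker(t):
    t.fill_(1)    # if the storage is truly shared, the parent sees this write

if __name__ == '__main__':
    a = torch.zeros(10).share()    # hypothetical: move the storage into shared memory
    p = mp.Process(target=worker, args=(a,))
    p.start()
    p.join()
    print(a)                       # expected to be all ones once sharing works
```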