---
title: "PyTorch Intro I: SSH, Jupyter and CUDA"
author: Tom Weber
---

## Preliminaries

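Make sure we only use our reserved GPUs. These variables have to be set before CUDA is initialized, so run this cell before anything else touches the GPU.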

``` code
import os
os.environ["CUDA_DEVICE_ORDER"]="PCI_BUS_ID" # order devices by bus id
os.environ["CUDA_VISIBLE_DEVICES"]="0,2" # only make devices 0 and 2 visible
```
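Inside the process, the visible devices are renumbered starting from zero, so `cuda:1` refers to the second entry of `CUDA_VISIBLE_DEVICES`, not to physical GPU 1. The pure-Python sketch below only illustrates that renumbering. `logical_to_physical` is a hypothetical helper (not part of CUDA or PyTorch), it does not query the driver, and it assumes the variable holds a comma-separated list of integer ids.

```python
def logical_to_physical(visible_devices: str) -> dict:
    """Map in-process device indices to the physical ids in CUDA_VISIBLE_DEVICES.

    Hypothetical helper for illustration only; assumes integer ids.
    """
    ids = [int(tok) for tok in visible_devices.split(",") if tok.strip()]
    return dict(enumerate(ids))

mapping = logical_to_physical("0,2")
print(mapping)  # {0: 0, 1: 2}
```

With the setting used here, index 1 inside the notebook therefore points at physical GPU 2.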

## Training a Standard Vision Classifier

### Building a Model with Sequential()

Let's do a standard image classification task.

``` code
import torch
import torch.nn as nn
```

`Sequential` works much like its Keras counterpart: a container wraps the individual layers and applies them in the order they are given.

``` code
net = nn.Sequential(nn.Conv2d(3, 6, 5), # 3 input channels, 6 filters each 5x5
                    nn.ReLU(), # non-linearity
                    nn.MaxPool2d((2,2)), # pooling
                    nn.Conv2d(6, 16, 5), # 16 filters this time
                    nn.ReLU(), # non-linearity
                    nn.MaxPool2d((2,2)), # pooling
                    nn.Flatten(), # flatten feature maps
                    nn.Linear(16*5*5, 100), # 16x5x5 input neurons, 100 output neurons
                    nn.Linear(100, 10)
)

net = net.cuda() # put the model on the GPU
```
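The `16*5*5` in the first linear layer follows from the input size. As a rough check, assuming 32x32 input images (the CIFAR-10 size used further down), stride 1 and no padding for the convolutions, and a pooling stride equal to the pooling kernel, the spatial size can be traced layer by layer. `conv_out` is a hypothetical helper written for this sketch.

```python
def conv_out(size, kernel, stride=1, padding=0):
    """Output size of a convolution or pooling layer along one spatial dimension."""
    return (size + 2 * padding - kernel) // stride + 1

size = 32                                  # assumed input: 32x32 images
size = conv_out(size, kernel=5)            # Conv2d(3, 6, 5)  -> 28
size = conv_out(size, kernel=2, stride=2)  # MaxPool2d((2,2)) -> 14
size = conv_out(size, kernel=5)            # Conv2d(6, 16, 5) -> 10
size = conv_out(size, kernel=2, stride=2)  # MaxPool2d((2,2)) -> 5
flat_features = 16 * size * size           # 16 feature maps of 5x5 each
print(size, flat_features)  # 5 400
```

A different input resolution would change `flat_features`, and the first `nn.Linear` would have to be adjusted accordingly.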

### Creating dataloaders

For simplicity's sake, I will just use a ready-made dataset.
The dataset is part of the torchvision package, which we have not installed yet.

``` code
!pip install torchvision
```

``` code
import torchvision
```

Datasets for custom data can easily be created by subclassing `torch.utils.data.Dataset`; see the next Jupyter notebook.
(The datasets and preprocessing options used here are torchvision-specific.)

``` code
# convert images to tensors, then normalize each channel with the CIFAR-10 mean and std
transform = torchvision.transforms.Compose([
    torchvision.transforms.ToTensor(),
    torchvision.transforms.Normalize((0.4914, 0.4822, 0.4465), (0.247, 0.243, 0.261))])
trainset = torchvision.datasets.CIFAR10(root='./data', train=True,
                                        download=True, transform=transform)
testset = torchvision.datasets.CIFAR10(root='./data', train=False,
                                       download=True, transform=transform)
```
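As a rough idea of what subclassing `torch.utils.data.Dataset` involves, a map-style dataset needs a `__len__` and a `__getitem__` method. The sketch below is not the dataset from the next notebook: `RandomImageDataset` is a hypothetical class that serves random tensors shaped like CIFAR-10 images, purely to show the interface.

```python
import torch
from torch.utils.data import Dataset

class RandomImageDataset(Dataset):
    """Hypothetical stand-in: random 3x32x32 tensors with random labels in [0, 10)."""

    def __init__(self, n_samples=8, transform=None):
        gen = torch.Generator().manual_seed(0)  # fixed seed for reproducibility
        self.images = torch.rand(n_samples, 3, 32, 32, generator=gen)
        self.labels = torch.randint(0, 10, (n_samples,), generator=gen)
        self.transform = transform

    def __len__(self):
        # number of samples; the DataLoader uses this to build its index list
        return len(self.labels)

    def __getitem__(self, idx):
        # return one (image, label) pair, optionally transformed
        img = self.images[idx]
        if self.transform is not None:
            img = self.transform(img)
        return img, int(self.labels[idx])

dataset = RandomImageDataset()
img, lbl = dataset[0]
print(len(dataset), img.shape, lbl)
```

An instance of such a class can be passed to `torch.utils.data.DataLoader` in the same way as the torchvision datasets.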


A dataloader takes a dataset and a bunch of other arguments and provides convenient access to the data for feeding the network.

``` code
trainloader = torch.utils.data.DataLoader(trainset, batch_size=32,
                                          shuffle=True, num_workers=2)

testloader = torch.utils.data.DataLoader(testset, batch_size=32,
                                         shuffle=False, num_workers=2)
```

### Inspect the model with tensorboard

TensorBoard, while originally from TensorFlow, also works pretty well with PyTorch.

``` code
!pip install tensorboard
```

``` code
from torch.utils.tensorboard import SummaryWriter

writer = SummaryWriter('runs') # initialize the writer with folder "./runs"
imgs, _ = next(iter(trainloader)) # get some input to trace the graph
writer.add_graph(net, imgs.cuda()) # trace the graph once and store it
```

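Now we can start TensorBoard from the directory that contains the notebook with `tensorboard --logdir=runs`
and open it in the browser at [localhost:6006](http://localhost:6006), the default port. The cell below passes `--port=6007` instead, in which case the address is [localhost:6007](http://localhost:6007).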

``` code
!tensorboard --logdir=runs --port=6007
```


### Prepare training function

We still need a loss and an optimizer.

``` code
import numpy as np # for later use
loss = nn.CrossEntropyLoss() # takes logits as predictions and int label
optimizer = torch.optim.SGD(net.parameters(), lr=0.001, momentum=0.9) # optimizer needs to be supplied with the parameters to optimize
```
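`nn.CrossEntropyLoss` expects raw logits and integer class labels because it applies the log-softmax itself. To make that concrete, here is a pure-Python sketch of the computation for a single sample; `cross_entropy` is a hypothetical helper, not the PyTorch implementation, and the logits are made up.

```python
import math

def cross_entropy(logits, label):
    """Negative log-softmax probability of the true class for one sample."""
    m = max(logits)  # subtract the maximum for numerical stability
    log_sum_exp = m + math.log(sum(math.exp(z - m) for z in logits))
    return log_sum_exp - logits[label]

# all ten classes equally likely: the loss equals ln(10)
uniform_loss = cross_entropy([0.0] * 10, label=3)
# one large logit on the correct class: the loss is close to zero
confident_loss = cross_entropy([0.0] * 9 + [10.0], label=9)
print(uniform_loss)
print(confident_loss)
```

One practical consequence: a freshly initialized 10-class model that outputs nearly uniform predictions should start at a loss of roughly ln(10), about 2.3.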

Build a function that trains the model on the data for one epoch.

``` code
def train(net, dataloader, optimizer, loss):
    epoch_loss = [] # save a running loss
    net.train() # tell the model that it's training time
    for img, lbl in dataloader:
        img, lbl = img.cuda(), lbl.cuda() # put data on GPU
        optimizer.zero_grad() # clear gradients left over from the previous step
        out = net(img) # compute class logits for the batch
        batch_loss = loss(out, lbl) # compute loss
        batch_loss.backward() # compute gradients
        optimizer.step() # update weights
        epoch_loss.append(batch_loss.item()) # record the batch loss
    return np.mean(epoch_loss) # return the epoch loss
```
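The `optimizer.zero_grad()` call in the loop matters because PyTorch accumulates gradients across `backward()` calls instead of overwriting them. A minimal CPU-only sketch with a made-up two-element tensor shows the effect:

```python
import torch

w = torch.tensor([1.0, 2.0], requires_grad=True)

(w ** 2).sum().backward()     # gradient of sum(w^2) is 2w
first = w.grad.clone()        # tensor([2., 4.])

(w ** 2).sum().backward()     # second backward pass without clearing
accumulated = w.grad.clone()  # tensor([4., 8.]): the gradients were summed

w.grad.zero_()                # manual reset of the stored gradient
print(first, accumulated, w.grad)
```

`optimizer.zero_grad()` performs this reset for every parameter the optimizer holds; depending on the PyTorch version, it either zeroes the gradients or sets them to `None`.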


### Train the model

Train the model for a couple of epochs and save a checkpoint every five epochs.

``` code
import os
os.makedirs("../saved_models", exist_ok=True) # torch.save does not create missing directories

for epoch in range(5):
    epoch_loss = train(net, trainloader, optimizer, loss)
    print("Epoch ",epoch+1," finished, Loss: ", epoch_loss)
    writer.add_scalar("epoch loss", epoch_loss, epoch+1)
    if (epoch+1) % 5 == 0: # save the weights every fifth epoch
        torch.save(net.state_dict(), "../saved_models/net_{}_epochs.pth".format(epoch+1))
```
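A saved `state_dict` only contains the weights, so restoring a checkpoint means building the same architecture first and then loading the weights into it. The sketch below shows the round trip on the CPU with a small `nn.Linear` standing in for the trained network; the temporary directory and the file name `checkpoint.pth` are placeholders chosen for this example.

```python
import os
import tempfile

import torch
import torch.nn as nn

model = nn.Linear(4, 2)     # small stand-in for the trained network
restored = nn.Linear(4, 2)  # same architecture, different random weights

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "checkpoint.pth")    # placeholder file name
    torch.save(model.state_dict(), path)          # write the weights to disk
    state = torch.load(path, map_location="cpu")  # load without requiring a GPU
    restored.load_state_dict(state)               # copy the weights into the new model

# compare every parameter tensor of the two models
same = all(torch.equal(a, b) for a, b in
           zip(model.state_dict().values(), restored.state_dict().values()))
print(same)  # True
```

`map_location` is useful when a checkpoint written on a GPU machine is loaded on a machine without one.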

``` code
!tensorboard --logdir=runs --port=6007
```


### Evaluate the Model

Since the images are small, we can run the evaluation just fine on the CPU. The model has to be moved back to the CPU for that purpose.

Each model has `.train()` and `.eval()` methods that switch certain layers, such as dropout and batch normalization, between their training and evaluation behaviour.

``` code
net = net.cpu() # bring the network back from the GPU
net.eval() # tell the network that it's testing time
correct = 0
total = 0
with torch.no_grad(): # no gradients are needed for evaluation
    for img, lbl in testloader:
        out = net(img)
        logits, indices = torch.max(out, 1) # largest logit and its class index per image
        correct += torch.sum(indices == lbl).item()
        total += len(lbl)
print("The model correctly classified ", correct/total*100, "% of the images.")
```
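When called with a dimension, `torch.max` returns a pair: the maximum values along that dimension and their indices. The indices along dimension 1 are the predicted classes. A tiny sketch with made-up logits and labels for two samples and three classes:

```python
import torch

out = torch.tensor([[0.1, 2.0, -1.0],
                    [1.5, 0.2, 0.3]])  # made-up logits: 2 samples, 3 classes
lbl = torch.tensor([1, 2])             # made-up ground-truth labels

values, indices = torch.max(out, 1)    # per-row maximum and its position
correct = torch.sum(indices == lbl).item()
print(indices.tolist(), correct)  # [1, 0] 1
```

Only the first sample is classified correctly here; `out.argmax(1)` would give the same indices when the maximum values themselves are not needed.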

### Train the model on multiple GPUs

Create the network again, then wrap it in `nn.DataParallel`. The `device_ids` refer to the visible devices, which are numbered from zero regardless of their physical ids.

``` code
net_parallel = nn.Sequential(
    nn.Conv2d(3, 6, 5), # 3 input channels, 6 filters each 5x5
    nn.ReLU(), # non-linearity
    nn.MaxPool2d(2,2), # pooling
    nn.Conv2d(6, 16, 5), # 16 filters this time
    nn.ReLU(), # non-linearity
    nn.MaxPool2d(2,2), # pooling
    nn.Flatten(), # flatten feature maps
    nn.Linear(16*5*5, 100), # 16x5x5 input neurons, 100 output neurons
    nn.Linear(100, 10)
)
net_parallel = torch.nn.DataParallel(net_parallel, device_ids=[0,1])
net_parallel = net_parallel.cuda() # put the model on the first GPU
optimizer_parallel = torch.optim.SGD(net_parallel.parameters(),
                                     lr=0.001, momentum=0.9) # don't forget to inform the optimizer
```
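`nn.DataParallel` keeps a replica of the model on each listed device and splits every input batch along its first dimension, so each GPU only sees part of the batch. The pure-Python sketch below estimates the per-device chunk sizes under the assumption that the split behaves like `torch.chunk`; `split_batch` is a hypothetical helper and does not call PyTorch.

```python
import math

def split_batch(batch_size, n_devices):
    """Chunk sizes for a batch split along dim 0, assuming torch.chunk-like behaviour."""
    chunk = math.ceil(batch_size / n_devices)
    sizes = []
    remaining = batch_size
    while remaining > 0:
        sizes.append(min(chunk, remaining))
        remaining -= sizes[-1]
    return sizes

print(split_batch(32, 2))          # [16, 16]
print(split_batch(50000 % 32, 2))  # last CIFAR-10 training batch (16 images) -> [8, 8]
```

With a batch size of 32 and two GPUs, each replica processes only 16 images per step, which is one reason the small model here gains nothing from the second GPU.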

Take it for a test drive. Keep an eye on a terminal running e.g. `watch -d nvidia-smi`. There will be no speed-up in this case, as the model is relatively small. On the contrary, the overhead of replicating the model and distributing the data across the GPUs will probably make training slower overall.

``` code
for epoch in range(10):
    epoch_loss = train(net_parallel, trainloader, optimizer_parallel, loss)
    print("Epoch ",epoch+1," finished, Loss: ", epoch_loss)
```