---
title: "PyTorch Intro I: SSH, Jupyter and Cuda"
author: Tom Weber
---
## Connecting to remote servers for heavy computing
The `ssh` command can be equipped with additional options to allow port forwarding.
This way, one can use Jupyter notebooks running on remote servers. *(not recommended for actual projects!)*
E.g. `ssh -L 8000:localhost:8888 tomweber@REMOTESERVER`
This makes our local machine listen on port 8000 and forward it to port 8888 on the remote server (the default Jupyter notebook port), so the remote notebook can be opened at `localhost:8000` in a local browser.
## Setting up the environment
This notebook assumes that it is run in a virtual environment. Using environments is encouraged in order to avoid package conflicts.
**Quick setup**
Close the jupyter server and execute the following shell commands, one after the other:
```shell
python3 -m venv .venv # create the virtual environment
source .venv/bin/activate # activate the environment
pip install jupyter # install jupyter into the environment
```
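After relaunching the Jupyter server from the activated environment, a quick sanity check (assuming the environment was created as `.venv` above) is to confirm which interpreter the notebook is actually using:
``` code
import sys
print(sys.executable)  # should point into the .venv directory, e.g. .../.venv/bin/python
```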
## Installing PyTorch
In order to see if torch is installed, check the output of the next cell
(prepending an exclamation mark executes shell code inside the jupyter notebook).
``` code
!pip list --format columns | grep torch
```
If there is no output, it is _not_ installed. In that case, install PyTorch with:
``` code
!pip install torch
```
As long as the environment is activated (and we are hopefully running the notebook from there), pip will install the package and dependencies into the appropriate venv folder.
Global packages are masked and will not conflict with our local packages.
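As a quick sanity check, one can print the installed version and the CUDA version the binaries were built against (the exact output will of course depend on the installed wheel):
``` code
import torch
print(torch.__version__)   # installed PyTorch version
print(torch.version.cuda)  # CUDA version the build targets (None for CPU-only builds)
```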
## Figuring out CUDA with PyTorch
``` code
import torch
```
Let's begin by checking if CUDA works with PyTorch at all:
``` code
if torch.cuda.is_available():
    print("CUDA available")
else:
    print("Could not find CUDA, possibly encountering problems with current CUDA version")
```
In contrast to most local machines, the servers are usually equipped with multiple GPUs. Let's see how many there are:
``` code
print("GPUs available: ", torch.cuda.device_count()) # show number of cuda devices
```
### Computing with tensors on the GPU
In PyTorch, tensors are always associated with a device on which they live, i.e. the CPU or a GPU/CUDA device. Operations can be executed on tensors no matter which device they are on, but all tensors involved in a single operation must reside on the same device.
By default, tensors are created on the "CPU" device.
``` code
x = torch.ones((3,3)) # create 3x3 tensor consisting of ones
print(x.device) # show associated device of x
```
In order to run computations on the GPU, the associated tensors must be explicitly copied there.
``` code
x = x.cuda() # copy tensor to cuda device
print(x.device) # show associated device of x
```
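Note that tensors cannot be mixed across devices within a single operation. A small illustration (assuming `x` is still the CUDA tensor from the cell above):
``` code
y = torch.ones((3,3))      # this tensor lives on the CPU
try:
    x + y                  # x is on the GPU, y on the CPU
except RuntimeError as e:
    print("Mixing devices fails:", e)
```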
Let's look at an example and compare the speed of a matrix multiplication on the CPU and on the GPU:
``` code
cpu1 = torch.rand((400,400)) # create 400x400 tensor of uniformly distributed random numbers
cpu2 = torch.rand((400,400))
%timeit torch.matmul(cpu1,cpu2) # time the execution of matrix multiplication
```
``` code
gpu1 = torch.rand((400,400)).cuda() # create 400x400 tensor of uniformly distributed random numbers and copy to CUDA device
gpu2 = torch.rand((400,400)).cuda()
%timeit torch.matmul(gpu1,gpu2) # time the execution of matrix multiplication
```
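Keep in mind that CUDA operations are launched asynchronously, so naive timings of GPU code can be misleading. A sketch of a fairer measurement (reusing the tensors from the cell above) is to synchronize before reading the clock:
``` code
import time

torch.cuda.synchronize()              # wait for any pending GPU work before starting the clock
start = time.perf_counter()
torch.matmul(gpu1, gpu2)
torch.cuda.synchronize()              # wait for the multiplication to actually finish
print("GPU matmul took", time.perf_counter() - start, "seconds")
```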
### Single GPU use case
By default, PyTorch will always use the "first" GPU (i.e. lowest device number) as the current device.
CAUTION: CUDA numbering is not necessarily the same as it is shown in `nvidia-smi`!
`nvidia-smi` orders by PCI-Bus.
We can check the selected device number with:
``` code
print("The currently selected GPU is number:", torch.cuda.current_device(),
", it's a ", torch.cuda.get_device_name(device=None))
```
One should always cross-check whether that is actually the device one wants to use. This is easy when the server has different GPU models; in our case, however, there are two GPUs with the same name.
``` code
!nvidia-smi -L # show the GPUs installed on the machine
```
If one wants to change the current device, there are several possible ways to achieve this.
1.) Best practice is to explicitly whitelist the GPUs your code is allowed to see, effectively masking the rest. This avoids accidentally occupying GPUs that you did not book.
Note: This way we can also make the ordering consistent by telling CUDA to order the GPUs by PCI bus ID.
Due to how Jupyter notebooks work, executing the cell below will have no effect at this point, because we have already imported torch and initialized CUDA.
Therefore, restart the kernel and execute the cell again.
``` code
import os
os.environ["CUDA_DEVICE_ORDER"]="PCI_BUS_ID" # order devices by bus id
os.environ["CUDA_VISIBLE_DEVICES"]="0,2" # only make device 0 visible
```
Now let's check how many devices we can see:
``` code
import torch
print("GPUs available: ", torch.cuda.device_count())
for device in range(torch.cuda.device_count()):
    print("Device", device, ":", torch.cuda.get_device_name(device=device))
```
2.) One can set the cuda device manually.
``` code
torch.cuda.set_device(1) # make cuda device nr. 1 the current device
print(torch.cuda.get_device_name(device=None))
```
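After this call, operations that do not specify an explicit device index, such as `.cuda()` without arguments, will use the newly selected device:
``` code
y = torch.zeros((2, 2)).cuda()  # no index given: the tensor lands on the current device
print(y.device)                 # should now report cuda:1
```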
3.) A better practice, however, is to embed your code in a CUDA device context:
``` code
with torch.cuda.device(0): # context manager for a specific cuda device
    # your code here
    print(torch.cuda.get_device_name(device=None))
```
Alternatively, one can also copy the tensors to a specific device:
``` code
x = torch.ones((3,3))
x_on_1 = x.to("cuda:0")
x_on_2 = x.to("cuda:1")
print(x_on_1.device)
print(x_on_2.device)
```
In various tutorials on the internet, you will often find the following:
``` code
torch.device("cuda" if torch.cuda.is_available() else "cpu")
x = x.to(device)
print(x.device)
```
This way one can create CUDA-agnostic code that works on machines both with and without a GPU. However, only use this if you have made your reserved GPU explicitly visible and hidden the rest; otherwise this will automatically select GPU 0 as your CUDA device.
### Parallelize on multiple GPUs
Parallelizing training across multiple GPUs is in most cases a one-liner.
PyTorch comes with `torch.nn.DataParallel`, which makes it easy to split batches across GPUs.
Essentially, the model is copied to each GPU and each copy receives a part of the minibatch to process.
``` code
net = torch.nn.DataParallel(net, device_ids=[0,1])
```
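A minimal, self-contained sketch of how this is used (the `Linear` model and the random batch here are made up purely for illustration):
``` code
import torch

net = torch.nn.Linear(10, 2)                         # toy model, purely illustrative
net = net.to("cuda:0")                               # parameters must live on the first listed device
net = torch.nn.DataParallel(net, device_ids=[0, 1])  # replicate the model on GPUs 0 and 1

batch = torch.rand((8, 10)).to("cuda:0")             # toy minibatch; DataParallel splits it across the GPUs
output = net(batch)                                  # forward pass runs on both devices in parallel
print(output.shape)                                  # results are gathered back on GPU 0: torch.Size([8, 2])
```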