---
title: "PyTorch Intro I: SSH, Jupyter and Cuda"
author: Tom Weber
---

## Preliminaries

Make sure we are only using our reserved GPUs.

``` code
import os

os.environ["CUDA_DEVICE_ORDER"] = "PCI_BUS_ID"    # order devices by bus id
os.environ["CUDA_VISIBLE_DEVICES"] = "0,2"        # only make devices 0 and 2 visible
```
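
To double-check that the restriction took effect, we can query the devices PyTorch can see. This is a small sanity check added here (not part of the original notebook) and assumes the environment variables were set before the first CUDA call in the process.

``` code
import torch

# should report only the two reserved devices
print("Visible CUDA devices:", torch.cuda.device_count())
for i in range(torch.cuda.device_count()):
    print(i, torch.cuda.get_device_name(i))
```
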
## Using Torch Modules and Datasets

This part of the PyTorch introduction will focus on creating custom torch modules and datasets, while applying those concepts to a fun character-level text generation task.

### Preparation

``` code
import torch
import numpy as np
from urllib.request import urlopen  # for importing the data
```

Let us borrow a nice text dataset from TensorFlow.

``` code
text_source = "https://storage.googleapis.com/download.tensorflow.org/data/shakespeare.txt"
text = urlopen(text_source).read().decode(encoding="utf-8")
```

Do some general NLP preprocessing.

``` code
def preprocess(text):
    alphabet = sorted(set(text))                                    # unique characters in the corpus
    letter_to_int = {let: ind for ind, let in enumerate(alphabet)}  # map characters to integers
    int_to_letter = {ind: let for ind, let in enumerate(alphabet)}  # map integers back to characters
    letter_ints = [letter_to_int[letter] for letter in text]        # encode the whole corpus
    alphabet_size = len(alphabet)
    return int_to_letter, letter_to_int, alphabet_size, letter_ints
```

Now we can transform our text into a sequence of integers, where each integer represents a character.

``` code
int_to_letter, letter_to_int, alphabet_size, letter_ints = preprocess(text)
print("Alphabet size:", alphabet_size)
print("Length of letter sequence:", len(text))
```
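
As a quick sanity check (a small addition, not in the original notebook), we can decode the first few integers back into characters and compare them with the raw text:

``` code
# round-trip check: decode the first 20 integers back to characters
print(text[:20])
print("".join(int_to_letter[i] for i in letter_ints[:20]))
```
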
## Custom Datasets

Previously we imported a pre-made dataset and created a dataloader. This time, we want to create our own dataset that can be used to construct a dataloader.

A custom dataset needs to implement at least the `__len__(self)` and the `__getitem__(self, index)` methods. `__len__(self)` only needs to return the size/length of the dataset, while `__getitem__(self, index)` needs to map an index to a tuple of (sample, label). Batching will be handled automatically by the dataloader, so there is no need to think about that for now.
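
To make this contract concrete, here is a minimal, hypothetical skeleton with dummy data (a sketch only, not the dataset we build below):

``` code
class MinimalDataset(torch.utils.data.Dataset):  # hypothetical toy example
    def __init__(self):
        self.samples = torch.arange(10, dtype=torch.float32).reshape(-1, 1)  # 10 dummy samples
        self.labels = torch.zeros(10, dtype=torch.long)                      # 10 dummy labels

    def __len__(self):
        return len(self.samples)  # how many samples the dataset contains

    def __getitem__(self, index):
        return self.samples[index], self.labels[index]  # map an index to (sample, label)
```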

We want our model to predict the probability of all possible characters that can succeed the input character. Hence, our samples will be sequences of a certain length, while the ground truth will be the same sequence shifted forward by one character.

CAUTION: A dataloader is not always the fastest method. If the dataset is sufficiently simple and small (as in our case here), manual batching is probably faster.

``` code
class Shakespeare_Dataset(torch.utils.data.Dataset):
    def __init__(self, text, seq_len):
        self.x = torch.LongTensor(text[:-1])  # inputs: all characters except the last
        self.y = torch.LongTensor(text[1:])   # targets: the same sequence shifted by one character
        self.seq_len = seq_len                # set the sequence length

    def __len__(self):
        return len(self.x) - self.seq_len  # length of corpus minus sequence length minus shift

    def __getitem__(self, index):
        return (self.x[index:index+self.seq_len],
                self.y[index:index+self.seq_len])  # return a tuple of (sample, label)
```

Now we can easily instantiate our dataset and let a dataloader handle the shuffling, batching etc.

``` code
shakespeare_dset = Shakespeare_Dataset(letter_ints, seq_len=100)
trainloader = torch.utils.data.DataLoader(shakespeare_dset, batch_size=32,
                                          shuffle=True, num_workers=2,
                                          drop_last=True)
```
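
To see what the dataloader produces, we can peek at a single batch (a quick check, not in the original notebook); it should yield two LongTensors of shape (32, 100):

``` code
X, y = next(iter(trainloader))  # fetch one batch
print(X.shape, y.shape)         # expected: torch.Size([32, 100]) for both
```
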
## Custom Modules (models, layers, operations...)

The majority of high-level computations in PyTorch are modeled as torch.nn.Modules, be it whole models or individual layers. An nn.Module needs to implement the `forward(self, input)` method, which defines the operations the module computes.
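
As a minimal, made-up illustration of this contract (before the actual model below), a module that simply squares its input could look like this:

``` code
class Square(torch.nn.Module):  # hypothetical toy module
    def __init__(self):
        super().__init__()

    def forward(self, x):
        return x ** 2  # the computation this module performs

print(Square()(torch.tensor([1.0, 2.0, 3.0])))  # calling the module invokes forward()
```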

Let us define a recurrent network consisting of an embedding, two GRU layers and a dense output layer (called a linear layer in PyTorch terms).

``` code
class RNN(torch.nn.Module):
    def __init__(self, vocab_size, hidden_size, embedding_size, batch=32, layers=2):
        super(RNN, self).__init__()
        self.hidden_size = hidden_size  # size of the GRU layers
        self.batch = batch              # batch size
        self.layers = layers            # how many GRU layers
        self.word_embeds = torch.nn.Embedding(vocab_size, embedding_size)  # embedding layer
        self.gru = torch.nn.GRU(embedding_size, hidden_size, layers, batch_first=True)  # GRU layer(s)
        self.output_layer = torch.nn.Linear(hidden_size, vocab_size)  # dense output layer

    def forward(self, inputs, hidden):
        x = self.word_embeds(inputs)          # transform the input integers into high dimensional embeddings
        output, hidden = self.gru(x, hidden)  # compute the output of the GRU layer(s)
        output = self.output_layer(output)    # compute the logits
        return output, hidden

    def initHidden(self):
        return torch.zeros(self.layers, self.batch, self.hidden_size)  # initial hidden state
```

### Training

Let us set up the model, some hyperparameters and define a training function.

``` code
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")  # let us do it the quick way this time
rnn = RNN(alphabet_size, 1024, 256, layers=2)
loss = torch.nn.CrossEntropyLoss()
optimizer = torch.optim.Adam(rnn.parameters(), lr=0.005)
```
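
Optionally (not part of the original notebook), we can count the trainable parameters to get a feeling for the model size:

``` code
n_params = sum(p.numel() for p in rnn.parameters() if p.requires_grad)
print("Trainable parameters:", n_params)
```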

``` code
def train(model, optim, loss, device):
    current_loss = []                       # record running loss
    model.to(device)                        # put the model on the specified device
    hidden = model.initHidden().to(device)  # create the hidden state
    model.train()                           # tell the model it is training time
    for X, y in trainloader:
        X, y = X.to(device), y.to(device)   # collect the data and labels from the dataloader and put them on the device
        optim.zero_grad()                   # empty the gradients
        output, hidden = model(X, hidden)   # compute the output
        hidden = hidden.detach()            # take the hidden state out of the graph
        batch_loss = loss(output.transpose(1, 2), y)  # compute the loss
        batch_loss.backward()               # compute the gradients
        optim.step()                        # update the weights
        current_loss.append(batch_loss.item())  # record the loss
    epoch_loss = np.mean(current_loss)
    return epoch_loss
```
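
A note on the `transpose(1, 2)` above: `CrossEntropyLoss` expects the class dimension at position 1, i.e. logits of shape (batch, classes, sequence) for targets of shape (batch, sequence), while the GRU output comes as (batch, sequence, classes). A small shape sketch with made-up numbers (65 standing in for the alphabet size):

``` code
logits = torch.randn(32, 100, 65)           # model output: (batch, seq, classes)
targets = torch.randint(0, 65, (32, 100))   # targets: (batch, seq)
ce = torch.nn.CrossEntropyLoss()
print(ce(logits.transpose(1, 2), targets))  # class dimension must be at position 1
```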

Train the model for some epochs.

``` code
epochs = 200
for e in range(epochs):
    l = train(rnn, optimizer, loss, device)
    print("Epoch", e + 1, ", Loss:", l)

torch.save(rnn.state_dict(), "../saved_models/rnn_{}epochs.pth".format(epochs))  # save the trained weights
```

### Text generation

Load our previously saved model.

``` code
rnn = RNN(alphabet_size, 1024, 256, layers=2, batch=1)  # instantiate the model with batch size 1
rnn.load_state_dict(torch.load("../saved_models/rnn_2epochs.pth"))  # load the saved weights
rnn.eval()  # tell the model it is time to evaluate
```

Give the model a starting sequence.

``` code
seq = "NICO: "      # starting sequence which we give the model
max_seq_len = 1000  # maximum length of the generated sequence
temp = 0.7          # sampling temperature: the higher the temperature, the more random the sampling; the lower, the more conservative
hidden = rnn.initHidden()
input_idx = torch.LongTensor([[letter_to_int[s] for s in seq]])  # encode the input characters as ints
```

``` code
for i in range(max_seq_len):
    output, hidden = rnn(input_idx, hidden)  # predict the logits for the next character
    pred = torch.squeeze(output, 0)[-1]      # logits of the last time step
    pred = pred / temp                       # apply the temperature
    pred_id = torch.distributions.categorical.Categorical(logits=pred).sample()  # sample from the distribution
    input_idx = torch.cat((input_idx[:, 1:], pred_id.reshape(1, -1)), 1)  # add the predicted character to our input
    seq += int_to_letter[pred_id.item()]     # add the predicted character to the sequence

print(seq)  # show us the generated sequence
```