| title | author |
| --- | --- |
| PyTorch Intro I: SSH, Jupyter and Cuda | Tom Weber |
Preliminaries
Make sure we are only using our reserved GPUs.
import os
os.environ["CUDA_DEVICE_ORDER"]="PCI_BUS_ID" # order devices by bus id
os.environ["CUDA_VISIBLE_DEVICES"]="0,2" # only make device 0 visible
Using Torch Modules and Datasets
This part of the PyTorch introduction will focus on creating custom torch modules and datasets, while applying those concepts to a fun character-level text generation task.
Preparation
import torch
import numpy as np
from urllib.request import urlopen # for importing the data
Let us borrow a nice text dataset from TensorFlow.
text_source = "https://storage.googleapis.com/download.tensorflow.org/data/shakespeare.txt"
text = urlopen(text_source).read().decode(encoding="utf-8")
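A quick peek at the raw text is a cheap sanity check on the download (the slice size is arbitrary):
print(len(text), "characters in total")
print(text[:250])  # first few lines of the corpus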
Do some general NLP preprocessing.
def preprocess(text):
    alphabet = sorted(set(text))  # unique characters in the corpus
    letter_to_int = {let: ind for ind, let in enumerate(alphabet)}  # character -> integer
    int_to_letter = {ind: let for ind, let in enumerate(alphabet)}  # integer -> character
    letter_ints = [letter_to_int[letter] for letter in text]        # encode the whole corpus
    alphabet_size = len(alphabet)
    return int_to_letter, letter_to_int, alphabet_size, letter_ints
Now we can transform our text into a sequence of integers, where each integer represents a character.
int_to_letter, letter_to_int, alphabet_size, letter_ints = preprocess(text)
print("Alphabet size:", alphabet_size)
print("Length of letter sequence:", len(text))
Custom Datasets
Previously we imported a pre-made dataset and created a dataloader. This time, we want to create our own dataset that can be used to construct a dataloader.
A custom dataset needs to implement at least the `__len__(self)` and the `__getitem__(self, index)` methods. `__len__(self)` only needs to return the size/length of the dataset, while `__getitem__(self, index)` needs to map an index to a tuple of (sample, label). Batching will be handled automatically by the dataloader, so there is no need to think about it for now.
We want our model to predict, for every possible character, the probability that it succeeds the input character. Hence, our samples will be sequences of a certain length, while the ground truth will be the same sequence shifted forward by one character.
CAUTION: A custom dataset is not always the fastest method. If the dataset is sufficiently simple and small (as in our case here), manual batching is probably faster.
class Shakespeare_Dataset(torch.utils.data.Dataset):
    def __init__(self, text, seq_len):
        self.x = torch.LongTensor(text[:-1])  # inputs: all characters except the last
        self.y = torch.LongTensor(text[1:])   # targets: the same sequence shifted by one
        self.seq_len = seq_len                # set the sequence length

    def __len__(self):
        return len(self.x) - self.seq_len  # length of the corpus minus sequence length minus shift

    def __getitem__(self, index):
        return (self.x[index:index+self.seq_len],
                self.y[index:index+self.seq_len])  # return a tuple of (sample, label)
Now, we can easily instantiate our dataset and let a dataloader handle the shuffling, batching, etc.
shakespeare_dset = Shakespeare_Dataset(letter_ints, seq_len=100)
trainloader = torch.utils.data.DataLoader(shakespeare_dset, batch_size=32,
                                          shuffle=True, num_workers=2,
                                          drop_last=True)
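Pulling a single batch from the dataloader is a cheap way to verify the shapes and the one-character shift; this is purely a sanity check and not part of training:
X, y = next(iter(trainloader))
print(X.shape, y.shape)  # both should be torch.Size([32, 100])
print(X[0, 1:6])
print(y[0, :5])          # same values: the label is the sample shifted by one character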
Custom Modules (models, layers, operations...)
The majority of high-level computations in PyTorch are modeled as torch.nn.Modules, be it whole models or individual layers. An nn.Module needs to implement the forward(self, input) method, which defines the operations the module computes.
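As a minimal illustration of the pattern (a toy module, not part of the Shakespeare model), a module only has to register its parameters and sub-modules in __init__ and describe its computation in forward:
class Scale(torch.nn.Module):
    def __init__(self):
        super().__init__()
        self.weight = torch.nn.Parameter(torch.ones(1))  # a single learnable scalar
    def forward(self, inputs):
        return inputs * self.weight                      # the computation this module performs

print(Scale()(torch.arange(3.0)))  # tensor([0., 1., 2.], grad_fn=...)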
Let us define a Recurrent Network consisting of an embedding, two GRU layers and a dense output layer (called a linear layer in PyTorch terms).
class RNN(torch.nn.Module):
    def __init__(self, vocab_size, hidden_size, embedding_size, batch=32, layers=2):
        super(RNN, self).__init__()
        self.hidden_size = hidden_size  # size of the GRU layers
        self.batch = batch
        self.layers = layers            # how many GRU layers
        self.word_embeds = torch.nn.Embedding(vocab_size, embedding_size)  # Embedding layer
        self.gru = torch.nn.GRU(embedding_size, hidden_size, layers, batch_first=True)  # GRU layer(s)
        self.output_layer = torch.nn.Linear(hidden_size, vocab_size)

    def forward(self, inputs, hidden):
        x = self.word_embeds(inputs)          # transform the input integers into high-dimensional embeddings
        output, hidden = self.gru(x, hidden)  # compute the output of the GRU layer(s)
        output = self.output_layer(output)    # compute the logits
        return output, hidden

    def initHidden(self):
        return torch.zeros(self.layers, self.batch, self.hidden_size)
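Before training, a dry run with a dummy batch is a cheap way to confirm that the shapes line up. The sizes below are small placeholder values chosen only for this check:
test_rnn = RNN(alphabet_size, hidden_size=64, embedding_size=16, batch=4)
dummy = torch.randint(0, alphabet_size, (4, 10))  # batch of 4 sequences of length 10
out, h = test_rnn(dummy, test_rnn.initHidden())
print(out.shape)  # torch.Size([4, 10, alphabet_size]) -- one logit vector per position
print(h.shape)    # torch.Size([2, 4, 64]) -- (layers, batch, hidden_size)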
Training
Let us set up the model and some hyperparameters, and define a training function.
device = torch.device("cuda" if torch.cuda.is_available() else "cpu") # let us do the quick way this time
rnn = RNN(alphabet_size, 1024, 256, layers=2)
loss = torch.nn.CrossEntropyLoss()
optimizer = torch.optim.Adam(rnn.parameters(), lr=0.005)
def train(model, optim, loss, device):
    current_loss = []                       # record the running loss
    model.to(device)                        # put the model on the specified device
    hidden = model.initHidden().to(device)  # create the hidden state
    model.train()                           # tell the model it's training time
    for X, y in trainloader:
        X, y = X.to(device), y.to(device)   # collect the data and labels from the dataloader and put them on the device
        optim.zero_grad()                   # empty the gradients
        output, hidden = model(X, hidden)   # compute the output
        hidden = hidden.detach()            # take the hidden state out of the graph
        batch_loss = loss(output.transpose(1, 2), y)  # compute the loss (CrossEntropyLoss expects (batch, classes, seq))
        batch_loss.backward()               # compute gradients
        optim.step()                        # update weights
        current_loss.append(batch_loss.item())  # record the loss
    epoch_loss = np.mean(current_loss)
    return epoch_loss
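Since the returned loss is an average cross-entropy in nats per character, its exponential is the per-character perplexity, which is often easier to interpret. This is just a convenience metric, not something the training function needs:
def perplexity(epoch_loss):
    return float(np.exp(epoch_loss))  # e.g. a loss of ~1.5 corresponds to a perplexity of ~4.5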
Train the model for some epochs.
epochs = 200
for e in range(epochs):
    l = train(rnn, optimizer, loss, device)
    print("Epoch", e + 1, ", Loss:", l)

torch.save(rnn.state_dict(), "../saved_models/rnn_{}epochs.pth".format(epochs))
Text Generation
Load our previously saved model.
rnn = RNN(alphabet_size, 1024, 256, layers=2, batch=1) # instantiate the model with batch size 1
rnn.load_state_dict(torch.load("../saved_models/rnn_{}epochs.pth".format(epochs))) # load weights
rnn.eval() # tell the model it's time to evaluate
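If the saved weights are later loaded on a machine without a GPU (or on a different device), passing map_location to torch.load avoids device mismatches; a variant of the cell above, assuming the same file path:
state = torch.load("../saved_models/rnn_{}epochs.pth".format(epochs), map_location="cpu")
rnn.load_state_dict(state)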
Give the model a starting sequence.
seq = "NICO: " # starting sequence which we give the model
max_seq_len = 1000 # max sequence length
temp = 0.7 # temperature for sampling: the higher the temperature, the more random the sampling; the lower the temperature, the more conservative
hidden = rnn.initHidden()
input_idx = torch.LongTensor([[letter_to_int[s] for s in seq]]) # input characters to ints
for i in range(max_seq_len):
    output, hidden = rnn(input_idx, hidden)  # predict the logits for the next character
    pred = torch.squeeze(output, 0)[-1]      # logits at the last position of the window
    pred = pred / temp                       # apply temperature
    pred_id = torch.distributions.categorical.Categorical(logits=pred).sample()  # sample from the distribution
    input_idx = torch.cat((input_idx[:, 1:], pred_id.reshape(1, -1)), 1)  # the predicted character is appended to our input window
    seq += int_to_letter[pred_id.item()]     # add the predicted character to the sequence

print(seq)  # show us the sequence
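To see what the temperature actually does, one can compare the softmax of a small toy logits vector at different temperatures (the numbers here are arbitrary and purely illustrative):
logits = torch.tensor([2.0, 1.0, 0.1])
for t in (0.5, 1.0, 2.0):
    print(t, torch.softmax(logits / t, dim=0))
# low temperature  -> probability mass concentrates on the most likely character
# high temperature -> the distribution flattens and sampling becomes more random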