This blog post takes you through an implementation of multi-class classification on tabular data using PyTorch.
Mar 18 ·11min read
We will use the wine dataset available on Kaggle. This dataset has 12 columns where the first 11 are the features and the last column is the target column. The data set has 1599 rows.
Import Libraries
We’re using tqdm to enable progress bars for training and testing loops.
import numpy as np
import pandas as pd
import seaborn as sns
from tqdm.notebook import tqdm
import matplotlib.pyplot as plt

import torch
import torch.nn as nn
import torch.optim as optim
from torch.utils.data import Dataset, DataLoader, WeightedRandomSampler

from sklearn.preprocessing import MinMaxScaler
from sklearn.model_selection import train_test_split
from sklearn.metrics import confusion_matrix, classification_report
Read Data
df = pd.read_csv("data/tabular/classification/winequality-red.csv")
df.head()
EDA and Preprocessing
To make the data fit for a neural net, we need to make a few adjustments to it.
Class Distribution
First off, we plot the output column to observe the class distribution. There’s a lot of imbalance here. Classes 3, 4, and 8 have very few samples.
sns.countplot(x = 'quality', data=df)
Encode Output Class
Next, we see that the output labels range from 3 to 8. That needs to change because PyTorch’s classification losses expect class labels that start from 0, that is, labels in [0, n-1] for n classes. We need to remap our labels to start from 0.
To do that, let’s create a dictionary called class2idx and use the .replace() method from the Pandas library to change it. Let’s also create a reverse mapping called idx2class which converts the IDs back to their original classes.
To create the reverse mapping, we use a dictionary comprehension and simply swap the key and value.
class2idx = {
    3:0,
    4:1,
    5:2,
    6:3,
    7:4,
    8:5
}

idx2class = {v: k for k, v in class2idx.items()}

# Assign the result back rather than using inplace=True on a column selection
df['quality'] = df['quality'].replace(class2idx)
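To make the round trip concrete, here is a minimal sketch on a small, made-up Series standing in for the quality column (the values are an assumption for illustration). It uses Series.map instead of .replace(), which gives the same result for this mapping.

```python
import pandas as pd

# Hypothetical toy labels standing in for the 'quality' column.
quality = pd.Series([3, 5, 8, 6, 5, 4, 7])

class2idx = {3: 0, 4: 1, 5: 2, 6: 3, 7: 4, 8: 5}
idx2class = {v: k for k, v in class2idx.items()}

encoded = quality.map(class2idx)   # ratings 3..8 -> indices 0..5
decoded = encoded.map(idx2class)   # indices back to the original ratings

print(encoded.tolist())  # [0, 2, 5, 3, 2, 1, 4]
print(decoded.tolist())  # [3, 5, 8, 6, 5, 4, 7]
```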
Create Input and Output Data
In order to split our data into train, validation, and test sets using train_test_split from Sklearn, we need to separate out our inputs and outputs.
Input X is all but the last column. Output y is the last column.
X = df.iloc[:, 0:-1]
y = df.iloc[:, -1]
Train — Validation — Test
To create the train-val-test split, we’ll use train_test_split() from Sklearn.
First we’ll split our data into train+val and test sets. Then, we’ll further split our train+val set to create our train and val sets.
Because there’s a class imbalance, we want each output class to appear in roughly the same proportion in our train, validation, and test sets. To do that, we use the stratify option of train_test_split().
# Split into train+val and test
X_trainval, X_test, y_trainval, y_test = train_test_split(X, y, test_size=0.2, stratify=y, random_state=69)

# Split train into train-val
X_train, X_val, y_train, y_val = train_test_split(X_trainval, y_trainval, test_size=0.1, stratify=y_trainval, random_state=21)
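As a quick illustration of what stratify does, the sketch below splits a synthetic, imbalanced label array (the 80/15/5 class mix is an assumption, not the wine data) and compares class proportions before and after the split.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic, imbalanced labels (assumed for illustration): 160 / 30 / 10 samples.
y_demo = np.array([0] * 160 + [1] * 30 + [2] * 10)
X_demo = np.arange(len(y_demo)).reshape(-1, 1)

_, _, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, stratify=y_demo, random_state=0
)

def proportions(labels):
    # Fraction of samples that fall into each of the three classes.
    return np.bincount(labels, minlength=3) / len(labels)

print(proportions(y_demo))
print(proportions(y_tr))
print(proportions(y_te))
```

All three printed vectors should be close to each other; without stratify, the rarest class could end up over- or under-represented in the small test split.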
Normalize Input
Neural networks generally train better when the input features lie in a small, common range such as (0,1). There’s a ton of material available online on why we need to do it.
To scale our values, we’ll use the MinMaxScaler() from Sklearn. The MinMaxScaler transforms features by scaling each feature to a given range, which is (0,1) in our case.
x_scaled = (x - min(x)) / (max(x) - min(x))
Notice that we use .fit_transform() on X_train while we use .transform() on X_val and X_test.
We do this because we want to scale the validation and test sets with the same parameters as the train set to avoid data leakage. .fit_transform() calculates the scaling values and applies them, while .transform() only applies the already calculated values.
scaler = MinMaxScaler()
X_train = scaler.fit_transform(X_train)
X_val = scaler.transform(X_val)
X_test = scaler.transform(X_test)

X_train, y_train = np.array(X_train), np.array(y_train)
X_val, y_val = np.array(X_val), np.array(y_val)
X_test, y_test = np.array(X_test), np.array(y_test)
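The difference between fitting and transforming is easiest to see on a tiny example. The sketch below uses made-up one-column arrays (an assumption for illustration) and checks the scaler against the min-max formula computed by hand with the train statistics.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

train = np.array([[1.0], [3.0], [5.0]])   # train min = 1, train max = 5
test = np.array([[0.0], [3.0], [7.0]])    # includes values outside the train range

scaler = MinMaxScaler()
train_scaled = scaler.fit_transform(train)  # learns min/max from train only
test_scaled = scaler.transform(test)        # reuses the train min/max

# Same formula as in the text, using the train statistics.
manual = (test - train.min()) / (train.max() - train.min())

print(train_scaled.ravel())  # [0.  0.5 1. ]
print(test_scaled.ravel())   # [-0.25  0.5   1.5 ]
```

Because the test set is scaled with the train set’s minimum and maximum, its values can fall slightly outside (0,1). That is expected and is the price of not leaking test statistics into preprocessing.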
Visualize Class Distribution in Train, Val, and Test
Once we’ve split our data into train, validation, and test sets, let’s make sure the distribution of classes is similar in all three sets.
To do that, let’s create a function called get_class_distribution(). This function takes as input the object y, i.e. y_train, y_val, or y_test. Inside the function, we initialize a dictionary which contains the output classes as keys and their counts as values. The counts are all initialized to 0.
We then loop through our y object and update our dictionary.
def get_class_distribution(obj):
    count_dict = {
        "rating_3": 0,
        "rating_4": 0,
        "rating_5": 0,
        "rating_6": 0,
        "rating_7": 0,
        "rating_8": 0,
    }

    for i in obj:
        if i == 0:
            count_dict['rating_3'] += 1
        elif i == 1:
            count_dict['rating_4'] += 1
        elif i == 2:
            count_dict['rating_5'] += 1
        elif i == 3:
            count_dict['rating_6'] += 1
        elif i == 4:
            count_dict['rating_7'] += 1
        elif i == 5:
            count_dict['rating_8'] += 1
        else:
            print("Check classes.")

    return count_dict
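For reference, the same counts can be obtained more compactly with np.bincount. This is only a sketch of an alternative; the helper name get_class_distribution_compact and the toy labels are hypothetical, and unlike the function above it raises a KeyError for labels outside 0 to 5 instead of printing a message.

```python
import numpy as np

idx2class = {0: 3, 1: 4, 2: 5, 3: 6, 4: 7, 5: 8}

def get_class_distribution_compact(labels, idx2class=idx2class):
    # Count occurrences of each encoded label 0..5 in one vectorized call.
    counts = np.bincount(np.asarray(labels), minlength=len(idx2class))
    return {f"rating_{idx2class[i]}": int(c) for i, c in enumerate(counts)}

y_demo = np.array([2, 2, 3, 0, 5, 3, 2])  # toy encoded labels
result = get_class_distribution_compact(y_demo)
print(result)
```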
Once we have the dictionary of counts, we use the Seaborn library to plot the bar charts. To make the plot, we first convert our dictionary to a dataframe using pd.DataFrame.from_dict([get_class_distribution(y_train)]). Subsequently, we .melt() our dataframe to convert it into the long format and finally use sns.barplot() to build the plots.
fig, axes = plt.subplots(nrows=1, ncols=3, figsize=(25,7))

# Train
sns.barplot(data = pd.DataFrame.from_dict([get_class_distribution(y_train)]).melt(), x = "variable", y="value", hue="variable", ax=axes[0]).set_title('Class Distribution in Train Set')

# Validation
sns.barplot(data = pd.DataFrame.from_dict([get_class_distribution(y_val)]).melt(), x = "variable", y="value", hue="variable", ax=axes[1]).set_title('Class Distribution in Val Set')

# Test
sns.barplot(data = pd.DataFrame.from_dict([get_class_distribution(y_test)]).melt(), x = "variable", y="value", hue="variable", ax=axes[2]).set_title('Class Distribution in Test Set')
Neural Network
We’ve now reached what we all had been waiting for!
Custom Dataset
First up, let’s define a custom dataset. This dataset will be used by the dataloader to pass our data into our model.
We initialize our dataset by passing X and y as inputs. Make sure X is a float tensor while y is a long tensor.
class ClassifierDataset(Dataset):

    def __init__(self, X_data, y_data):
        self.X_data = X_data
        self.y_data = y_data

    def __getitem__(self, index):
        return self.X_data[index], self.y_data[index]

    def __len__(self):
        return len(self.X_data)


train_dataset = ClassifierDataset(torch.from_numpy(X_train).float(), torch.from_numpy(y_train).long())
val_dataset = ClassifierDataset(torch.from_numpy(X_val).float(), torch.from_numpy(y_val).long())
test_dataset = ClassifierDataset(torch.from_numpy(X_test).float(), torch.from_numpy(y_test).long())
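The dtype conversions matter because NumPy defaults to 64-bit floats while PyTorch layers default to 32-bit weights. A small sketch with made-up arrays (shapes and values are assumptions for illustration) shows what .float() and .long() do:

```python
import numpy as np
import torch

X_np = np.random.rand(5, 11)          # NumPy floats are float64 by default
y_np = np.array([0, 2, 5, 3, 1])      # encoded class labels

X_t = torch.from_numpy(X_np)          # keeps the NumPy dtype: torch.float64
X_f = X_t.float()                     # torch.float32, matching nn.Linear's default weights
y_l = torch.from_numpy(y_np).long()   # torch.int64, the dtype expected for class-index targets

print(X_t.dtype, X_f.dtype, y_l.dtype)
```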
Weighted Sampling
Because there’s a class imbalance, we used a stratified split to create our train, validation, and test sets.
While it helps, it still does not ensure that each mini-batch our model sees contains all our classes. We need to over-sample the classes with fewer samples. To do that, we use the WeightedRandomSampler.
First, we obtain a list called target_list which contains all our outputs. This list is then converted to a tensor. We keep it in dataset order, because the sampler needs the i-th weight to belong to the i-th sample.
target_list = []
for _, t in train_dataset:
    target_list.append(t)

# Keep dataset order: shuffling here would pair samples with the wrong weights
target_list = torch.tensor(target_list)
Then, we obtain the count of all classes in our training set. We use the reciprocal of each count to obtain its weight. Now that we’ve calculated the weights for each class, we can proceed.
class_count = [i for i in get_class_distribution(y_train).values()]
class_weights = 1./torch.tensor(class_count, dtype=torch.float)

print(class_weights)

###################### OUTPUT ######################

tensor([0.1429, 0.0263, 0.0020, 0.0022, 0.0070, 0.0714])
WeightedRandomSampler expects a weight for each sample. We obtain these by indexing the class weights with each sample’s target, as follows.
class_weights_all = class_weights[target_list]
Finally, let’s initialize our WeightedRandomSampler. We’ll pass this to our dataloader below.
weighted_sampler = WeightedRandomSampler(
    weights=class_weights_all,
    num_samples=len(class_weights_all),
    replacement=True
)
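To see the effect of the sampler, the following sketch draws indices from a made-up two-class target tensor with a 90/10 imbalance (the data and the number of draws are assumptions for illustration) and measures how often each class is drawn.

```python
import torch
from torch.utils.data import WeightedRandomSampler

torch.manual_seed(0)

# Hypothetical imbalanced targets: 90 samples of class 0, 10 of class 1.
targets = torch.tensor([0] * 90 + [1] * 10)

class_count = torch.bincount(targets)
class_weights = 1.0 / class_count.float()
sample_weights = class_weights[targets]   # weight i belongs to sample i

sampler = WeightedRandomSampler(
    weights=sample_weights, num_samples=10_000, replacement=True
)

# The sampler yields dataset indices; look up the class of each drawn sample.
drawn = targets[torch.tensor(list(sampler))]
fractions = torch.bincount(drawn, minlength=2) / len(drawn)
print(fractions)
```

Both fractions should come out near 0.5 even though class 1 makes up only 10% of the data. This only works when sample_weights is in the same order as the dataset.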
Model Parameters
Before we proceed any further, let’s define a few parameters that we’ll use down the line.
EPOCHS = 400
BATCH_SIZE = 64
LEARNING_RATE = 0.001

NUM_FEATURES = len(X.columns)
NUM_CLASSES = 6
Dataloader
Let’s now initialize our dataloaders.
For train_loader we’ll use batch_size = 64 and pass our sampler to it. Note that we’re not using shuffle=True in our train_loader because we’re already using a sampler. These two are mutually exclusive.
For test_loader and val_loader we’ll use batch_size = 1.
train_loader = DataLoader(dataset=train_dataset,
                          batch_size=BATCH_SIZE,
                          sampler=weighted_sampler
)
val_loader = DataLoader(dataset=val_dataset, batch_size=1)
test_loader = DataLoader(dataset=test_dataset, batch_size=1)
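The mutual exclusivity of sampler and shuffle is enforced by DataLoader itself. A small sketch on a throwaway TensorDataset (the data and sizes are assumptions for illustration) shows the error and the working alternative:

```python
import torch
from torch.utils.data import DataLoader, TensorDataset, WeightedRandomSampler

dataset = TensorDataset(torch.randn(8, 3), torch.randint(0, 2, (8,)))
sampler = WeightedRandomSampler(weights=torch.ones(8), num_samples=8, replacement=True)

try:
    # Passing both a sampler and shuffle=True is rejected by DataLoader.
    DataLoader(dataset, batch_size=4, sampler=sampler, shuffle=True)
    raised = False
except ValueError as err:
    raised = True
    print(err)

# With only the sampler, the loader works and the sampler decides the order.
loader = DataLoader(dataset, batch_size=4, sampler=sampler)
batch_shapes = [tuple(x.shape) for x, _ in loader]
print(batch_shapes)  # [(4, 3), (4, 3)]
```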
Define Neural Net Architecture
Let’s define a simple 3-layer feed-forward network with dropout and batch-norm.
class MulticlassClassification(nn.Module):
    def __init__(self, num_feature, num_class):
        super(MulticlassClassification, self).__init__()

        self.layer_1 = nn.Linear(num_feature, 512)
        self.layer_2 = nn.Linear(512, 128)
        self.layer_3 = nn.Linear(128, 64)
        self.layer_out = nn.Linear(64, num_class)

        self.relu = nn.ReLU()
        self.dropout = nn.Dropout(p=0.2)
        self.batchnorm1 = nn.BatchNorm1d(512)
        self.batchnorm2 = nn.BatchNorm1d(128)
        self.batchnorm3 = nn.BatchNorm1d(64)

    def forward(self, x):
        x = self.layer_1(x)
        x = self.batchnorm1(x)
        x = self.relu(x)

        x = self.layer_2(x)
        x = self.batchnorm2(x)
        x = self.relu(x)
        x = self.dropout(x)

        x = self.layer_3(x)
        x = self.batchnorm3(x)
        x = self.relu(x)
        x = self.dropout(x)

        x = self.layer_out(x)

        return x
Check if GPU is active.
device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")
print(device)

###################### OUTPUT ######################

cuda:0
Initialize the model, optimizer, and loss function, and transfer the model to the GPU. We’re using nn.CrossEntropyLoss because this is a multiclass classification problem, and we pass it the class weights computed earlier through its weight argument. We don’t have to manually apply a log_softmax layer after our final layer because nn.CrossEntropyLoss does that for us. We do, however, apply log_softmax ourselves when we turn the raw outputs into predicted classes during validation and testing.
model = MulticlassClassification(num_feature = NUM_FEATURES, num_class=NUM_CLASSES)
model.to(device)

criterion = nn.CrossEntropyLoss(weight=class_weights.to(device))
optimizer = optim.Adam(model.parameters(), lr=LEARNING_RATE)
print(model)

###################### OUTPUT ######################

MulticlassClassification(
  (layer_1): Linear(in_features=11, out_features=512, bias=True)
  (layer_2): Linear(in_features=512, out_features=128, bias=True)
  (layer_3): Linear(in_features=128, out_features=64, bias=True)
  (layer_out): Linear(in_features=64, out_features=6, bias=True)
  (relu): ReLU()
  (dropout): Dropout(p=0.2, inplace=False)
  (batchnorm1): BatchNorm1d(512, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
  (batchnorm2): BatchNorm1d(128, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
  (batchnorm3): BatchNorm1d(64, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
)
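The claim that nn.CrossEntropyLoss applies log_softmax internally can be checked directly. The sketch below uses random logits and made-up targets (both assumptions for illustration, and without class weights) and compares the loss against log_softmax followed by the negative log-likelihood loss.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(0)
logits = torch.randn(4, 6)            # raw model outputs: 4 samples, 6 classes
targets = torch.tensor([0, 2, 5, 3])  # class indices in [0, 5]

# Cross-entropy applied straight to the raw logits.
ce = nn.CrossEntropyLoss()(logits, targets)

# The same quantity computed in two explicit steps.
nll = F.nll_loss(F.log_softmax(logits, dim=1), targets)

print(ce.item(), nll.item())
```

The two printed values should agree up to floating-point error, which is why the model’s forward() returns raw logits.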
Train the model
Before we start our training, let’s define a function to calculate accuracy per epoch.
This function takes y_pred and y_test as input arguments. We then apply log_softmax to y_pred and extract the class which has the highest probability.
After that, we compare the predicted classes and the actual classes to calculate the accuracy.
def multi_acc(y_pred, y_test):
    y_pred_softmax = torch.log_softmax(y_pred, dim = 1)
    _, y_pred_tags = torch.max(y_pred_softmax, dim = 1)

    correct_pred = (y_pred_tags == y_test).float()
    acc = correct_pred.sum() / len(correct_pred)

    # Scale to a percentage before rounding; rounding first would collapse acc to 0 or 1
    acc = torch.round(acc * 100)

    return acc
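As a sanity check of the accuracy logic, here is a sketch on hand-written logits and labels (both made up for illustration). It also shows that taking the argmax of the raw logits selects the same classes as taking it after log_softmax, since log_softmax preserves the ordering within each row.

```python
import torch

logits = torch.tensor([[2.0, 0.5, 0.1],
                       [0.2, 1.5, 0.3],
                       [0.1, 0.2, 3.0],
                       [1.0, 0.9, 0.8]])
labels = torch.tensor([0, 1, 2, 1])

pred_from_logits = logits.argmax(dim=1)
pred_from_logprobs = torch.log_softmax(logits, dim=1).argmax(dim=1)

# Three of the four predictions match the labels.
acc = (pred_from_logits == labels).float().mean() * 100

print(pred_from_logits.tolist())  # [0, 1, 2, 0]
print(acc.item())                 # 75.0
```

Note that the order of rounding matters: torch.round(acc) * 100 can only ever yield 0 or 100, whereas torch.round(acc * 100) keeps the percentage.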
We’ll also define 2 dictionaries which will store the accuracy/epoch and loss/epoch for both train and validation sets.
accuracy_stats = {
    'train': [],
    "val": []
}
loss_stats = {
    'train': [],
    "val": []
}
Let’s TRAAAAAIN our model!
print("Begin training.")

for e in tqdm(range(1, EPOCHS+1)):

    # TRAINING
    train_epoch_loss = 0
    train_epoch_acc = 0

    model.train()
    for X_train_batch, y_train_batch in train_loader:
        X_train_batch, y_train_batch = X_train_batch.to(device), y_train_batch.to(device)
        optimizer.zero_grad()

        y_train_pred = model(X_train_batch)

        train_loss = criterion(y_train_pred, y_train_batch)
        train_acc = multi_acc(y_train_pred, y_train_batch)

        train_loss.backward()
        optimizer.step()

        train_epoch_loss += train_loss.item()
        train_epoch_acc += train_acc.item()

    # VALIDATION
    with torch.no_grad():

        val_epoch_loss = 0
        val_epoch_acc = 0

        model.eval()
        for X_val_batch, y_val_batch in val_loader:
            X_val_batch, y_val_batch = X_val_batch.to(device), y_val_batch.to(device)

            y_val_pred = model(X_val_batch)

            val_loss = criterion(y_val_pred, y_val_batch)
            val_acc = multi_acc(y_val_pred, y_val_batch)

            # Accumulate the validation metrics (not the last training batch's)
            val_epoch_loss += val_loss.item()
            val_epoch_acc += val_acc.item()

    loss_stats['train'].append(train_epoch_loss/len(train_loader))
    loss_stats['val'].append(val_epoch_loss/len(val_loader))
    accuracy_stats['train'].append(train_epoch_acc/len(train_loader))
    accuracy_stats['val'].append(val_epoch_acc/len(val_loader))

    print(f'Epoch {e+0:03}: | Train Loss: {train_epoch_loss/len(train_loader):.5f} | Val Loss: {val_epoch_loss/len(val_loader):.5f} | Train Acc: {train_epoch_acc/len(train_loader):.3f}| Val Acc: {val_epoch_acc/len(val_loader):.3f}')

###################### OUTPUT (from the original run; exact numbers will differ) ######################

Epoch 001: | Train Loss: 1.55731 | Val Loss: 1.48898 | Train Acc: 5.556| Val Acc: 0.000
Epoch 002: | Train Loss: 1.55930 | Val Loss: 1.27569 | Train Acc: 50.000| Val Acc: 100.000
.
.
.
Epoch 399: | Train Loss: 0.11390 | Val Loss: 0.10750 | Train Acc: 100.000| Val Acc: 100.000
Epoch 400: | Train Loss: 0.11665 | Val Loss: 0.07421 | Train Acc: 100.000| Val Acc: 100.000
You can see we’ve put a model.train() call before the inner training loop. model.train() tells PyTorch that you’re in training mode.
Well, why do we need to do that? If you’re using layers such as Dropout or BatchNorm, which behave differently during training and evaluation (for example, dropout is not applied during evaluation), you need to tell PyTorch to act accordingly.
Similarly, we call model.eval() when we validate or test our model.
Back to training; we start a for-loop. At the top of this for-loop, we initialize our loss and accuracy per epoch to 0. After every epoch, we’ll print out the loss/accuracy and reset it back to 0.
Then we have another for-loop. This for-loop is used to get our data in batches from the train_loader.
We call optimizer.zero_grad() before we make any predictions. Since the backward() function accumulates gradients, we need to reset them manually for every mini-batch.
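Gradient accumulation is easy to observe on a single parameter. The sketch below uses a made-up one-element weight and a trivial loss (both assumptions for illustration) to show what happens with and without the reset.

```python
import torch

w = torch.tensor([1.0], requires_grad=True)
optimizer = torch.optim.SGD([w], lr=0.1)

(w * 2).sum().backward()
first = w.grad.clone()        # gradient of 2*w with respect to w is 2

(w * 2).sum().backward()      # no zero_grad(): the new gradient is added on top
accumulated = w.grad.clone()  # 2 + 2 = 4

optimizer.zero_grad()         # reset before the next mini-batch
(w * 2).sum().backward()
after_reset = w.grad.clone()  # back to 2

print(first.item(), accumulated.item(), after_reset.item())  # 2.0 4.0 2.0
```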
From our defined model, we then obtain a prediction and compute the loss (and accuracy) for that mini-batch. We perform back-propagation with train_loss.backward() to compute the gradients and then call optimizer.step() to update the weights.
Finally, we add up the losses (and accuracies) of all mini-batches and divide the total by the number of mini-batches, i.e. the length of train_loader, to obtain the average loss (and accuracy) per epoch.
The procedure we follow for validation is the same as for training, except that we wrap it in torch.no_grad() and do not perform any back-propagation. torch.no_grad() tells PyTorch not to track operations for gradient computation, which reduces memory usage and speeds up computation.
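Both switches can be demonstrated in isolation. The sketch below uses a standalone Dropout layer and a made-up input of ones (assumptions for illustration) to show the train/eval difference, then checks that results computed under torch.no_grad() carry no gradient information.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
layer = nn.Dropout(p=0.5)
x = torch.ones(1000)

layer.train()
train_out = layer(x)   # entries are zeroed at random, survivors are scaled by 1/(1-p) = 2

layer.eval()
eval_out = layer(x)    # dropout is a no-op in evaluation mode

w = torch.ones(3, requires_grad=True)
with torch.no_grad():
    y = (w * 2).sum()  # computed without building an autograd graph

print(train_out.unique().tolist())  # [0.0, 2.0]
print(torch.equal(eval_out, x))     # True
print(y.requires_grad)              # False
```

Note that model.eval() and torch.no_grad() are independent: the first changes layer behaviour, the second turns off gradient tracking, so evaluation code typically uses both.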
Visualize Loss and Accuracy
To plot the loss and accuracy line plots, we again create a dataframe from the accuracy_stats and loss_stats dictionaries.
# Create dataframes
train_val_acc_df = pd.DataFrame.from_dict(accuracy_stats).reset_index().melt(id_vars=['index']).rename(columns={"index":"epochs"})
train_val_loss_df = pd.DataFrame.from_dict(loss_stats).reset_index().melt(id_vars=['index']).rename(columns={"index":"epochs"})

# Plot the dataframes
fig, axes = plt.subplots(nrows=1, ncols=2, figsize=(20,7))
sns.lineplot(data=train_val_acc_df, x = "epochs", y="value", hue="variable", ax=axes[0]).set_title('Train-Val Accuracy/Epoch')
sns.lineplot(data=train_val_loss_df, x = "epochs", y="value", hue="variable", ax=axes[1]).set_title('Train-Val Loss/Epoch')