Chapter 3 - Data Loading and Pre-Processing

In this chapter, I will address the very important issue of dealing with the data. To me and most practitioners, data collection is the most important issue in machine learning.

Copyright and License

Copyright © by Ricardo A. Calix.
All rights reserved. No part of this work may be reproduced or transmitted in any form or by any means, without written permission of the copyright owner.
MIT License.

FTC and Amazon Disclaimer

This post/page/article includes Amazon Affiliate links to products. This site receives income if you purchase through these links. This income helps support content such as this one.

Data Loading and Pre-processing

In this chapter, I will address the very important issue of dealing with the data. To me and most practitioners, data collection is the most important issue in machine learning.
There are many aspects that must be addressed when dealing with data. These include:

getting the data
cleaning data
pre-processing data
building a corpus and annotating it
performing inter-annotator agreement
etc.

Here in this chapter I will also address general issues about data as well as data issues related to deep learning such as one-hot encoding.

Loading the Data

Data can be obtained from the web such as text from twitter or web pages. Specific data sets can also be obtained from the machine learning libraries. Sklearn has a “dataset” module. This dataset module can be used to obtain certain data sets such as the iris dataset. An example of this code can be seen below.

Here, the first index in the data matrix represents the rows and the second index represents the columns. Data can also be obtained from text files. Many practitioners and academics will have their own data in text files. This data can be formatted in many different ways and loaded into the code. In most of the examples in this book, the data is assumed to be formatted in csv (comma separated) format. The code to load the data to both SKlearn based traditional machine learning algorithms and to PyTorch code files is shown below. In the code we can see that we are using the numpy library (or namespace) to obtain the data. We assume the data is stored in the file data/12559_Training_Dataset.csv and is read by loadtxt() into the python variable Matrix_data. Since we are using csv format, the data file can look like the following.

Once the data is in Matrix_data, it can be processed as a numpy array matrix. This means that it is no longer just an array but instead it is more a vector or matrix as in linear algebra. Many operations are now simplified like extracting certain columns or rows. This is usually referred to as slicing.

In the previous code we can see that we can calculate the dimensions of the matrix as in

A = len( Matrix_data[0,:] )

which gives you the number of columns or the number of features plus the class.

Data and Feature Pre-Processing

Feature scaling is very important to achieving good results in modeling tasks. For example, in Principal Component Analysis (PCA) which is a feature reduction technique, feature scaling is very important. The purpose of PCA is to project data to a vector that captures the most variability in the data. If the features are not scaled properly, one feature could dominate over the others and therefore be considered as the most variable feature. With feature scaling, features with real valued numbers from any range can be mapped to other ranges such as from -1.0 to 1.0. This is performed for all features so that no one feature will dominate in the model. Other classifiers and Deep learning models are also susceptible to feature scaling. The code below shows how the data can be scaled from X_train to X_train_normalized.

In the following code I am showing you a general way in which you can take a standard dataset such as iris or mnist and save it to a csv file. This approach allows you to do the deep learning modeling by reading data from text files where the data is formatted in a very standard and well know format such as csv (comma separated format).

One Hot Encoding

Algorithm implementations in PyTorch use one-hot encoding. Quite simply, one hot encoding means that you take labels in the following format.

and convert them to labels in the equivalent one-hot encoded format as shown below. So, we create new vectors where all values are zero except for the value of the position of the correct class in the vector.

One-hot encoding transforms the labels vector (y) of size n into a matrix of size n by B where B represents the number of classes in the data set. For the case of the iris dataset, we have 3 classes and, therefore, for a sample with label 2 we would get an equivalent one-hot encoded vector equal to [0,0,1]. The code to convert the data to one-hot encoded format is provided below. There are several ways of implementing one-hot encoding. I have used the approach below because I think it is the easiest to understand (while possibly not the most efficient or pythony). Notice how “a” is a 2-dimensional array initialized with all zeros. You use the value in the input vector data to determine the index i and j that are used to assign a 1 to the correct position in the matrix.

Features

A supervised machine learning algorithm is only as good as the features that are provided to it. This statement used to be very true and many people made careers of just developing features for problems in different domains. For instance, in NLP, many people would spend a lot of time developing parsers and other techniques to find and create features from a text based problem. Similarly, in image processing, researchers developed many techniques to filter data out of images to perform efficient image classification. Today, deep learning has somewhat changed this. It has managed to introduce approaches to decrease the amount of human involvement in the feature extraction process. Basically, deep learning methods, in some cases, have the ability to extract features from data using only un-supervised or semi-supervised techniques. In some way, you can say that deep learning algorithms can extract the features themselves without human involvement. This ability has had a very strong impact in the performance of the algorithms implemented in industry and in the work performed by machine learning specialists. These abilities for enhanced feature extraction are available in the main mediums of text processing and image processing.

Features from Text

Feature extraction from text usually involves the processing of text documents to extract tokens, words, chunks, or other phrases to be used to create the features. There are many, many types of features that can be extracted from text. Some of the methods (Jurafsky and Martin 2008) that can be used are:

the bag of words approach
frequency histograms of the words
Part of speech tagging of the words
Syntactic parsing of sentences
Anaphora resolution
word embeddings (the main method used today)

With the information provided by the previous methods, many features can be extracted. In general, text features can be binary or numeric. Binary features include the presence or absence of words, part of speech tags, chunks, syntactic parses, semantic parses, etc. Numeric based text features can be derived from performing counts or calculating distance metrics between words or higher level semantic concepts.

One technique in particular that has had success to extract features for supervised machine learning is called the gramulator (McCarthy et al. 2012). The gramulator is a feature extraction technique used particularly in natural language processing. The main idea is that, for a 2 class problem, you want to extract features (e.g. words) that are very frequent in one class but infrequent in the other. This helps to better discriminate between the classes. The downside of this approach, however, is that it needs a lot of annotated or labeled data to extract the grams or words from each class that are infrequent in the opposite class. If the grams are representative of the entire population; then, it can be expected that a classifier will have good performance in the classification task.

The downside of most of these techniques is that in the past they have required annotated data. Deep learning has several new techniques that address this issue and obtain good performance without needing large amounts of annotated data. These mainly involve some type of word embedding. Word embeddings are vector representations of words or syllables or subwords. The vectors are dense and usually of fixed size such as 256. The values of each vector representing the token in question are learned through some algorithmic scheme such as word2vec, or Transformers (covered in next chapters).

Features from Images

Feature extraction from images is another area where deep learning is revolutionizing the way features are engineered and obtained. The input to a classifier when performing image processing and classification is an image. In general images are 2 dimensional arrays that contain the pixel intensities that define the image. For color images, you have 3 two-dimensional matrices where each stores the intensity for the R, G, and B colors of the image.

Traditional Feature Extraction Techniques

In the past, these images where converted into feature vectors for use in machine learning using image processing based feature extraction techniques. Again, here, researchers spent considerable amounts of time performing feature engineering. Some of the methods widely used in image processing (Gonzales and Woods) include:

Image segmentation
Region growing
Image morphing
Fourier transformations
Etc.

Many features could be extracted from these methods. Generally speaking, many of these techniques are implemented as a filter that is applied to the images to convert them into another matrix of the same size or of another size. This new image will store new information about the image such as the pixel positions where an edge was detected. Here, once again, deep learning changed how this process is done.

Deep neural networks can be used to discover the features simply by providing the inputs and some annotation. This information alone allows the neural network architecture to connect the neurons in ways where it can discover features by itself. In fact, deep neural networks are not just used for classification but can also be used just for feature extraction. The extracted features can then be used with other classifiers such as SVM, logistic regression, etc. In these cases, the input layer and hidden layers are used for feature extraction without having to use the output layer for classification. Convolutional Neural Networks (CNNs) can provide this functionality in image processing and classification.

A Class to Load Images into Tensors

Loading images into tensors can be challenging. The following code listing presents an object oriented approach to load images from files into pytorch tensors. This class was written by my students Danielle Turner, David Highley, and Joseph Shapiro and I thank them for it.

First we import the libraries.

Next we define the class to load the images from folders into tensors.

class ImageDataset(Dataset):
  
    def __init__(self, dataset_save="data.pt", raw_data=None, train=True, shuffle=False, transform=None, target_transform=None, convert=False, size=32):
        self.targets = []
        self.labels = []
        self.data = []

self.X_train = []
        self.X_test = []
        self.y_train = []
        self.y_test = []
        
        self.transform = transform
        self.target_transform = target_transform

if convert:
            self.convert(dataset_save, raw_data, size)
        else:
            self.load(dataset_save)

seed = int(random.random() * 100) if shuffle else 42

self.X_train, self.X_test, self.y_train, self.y_test = train_test_split(self.data, self.targets, test_size=0.2, random_state=seed)

if train:
            self.data = self.X_train
            self.targets = self.y_train
        else:
            self.data = self.X_test
            self.targets = self.y_test

def __len__(self):
        return len(self.data)

...

The Dataset of images is data\_path which is the path to the images folder. This should contain multiple folders of each class

The root is where the pt file will be stored. The train parameter is for whether to load train or test data. The transform parameter indicates the torch transform to apply to the image data. The target\_transform is the transform to apply to the targets. The convert parameter is used to indicate if the data should be converted from images, or loaded from a pt file. The size parameter is to set the size of the images to convert to. This should be the same as the size of the images in the pt file.

The next code segment includes the convert and load functions for the images.

class ImageDataset(Dataset):
  
    ...

def __getitem__(self, index):
        image, target = self.data[index], self.targets[index]
        if self.transform is not None:
            image = self.transform(image)
        if self.target_transform is not None:
            target = self.target_transform(target)
        return image, target
    
    def convert(self, dataset_save, raw_data, size):
        dataset = []
        self.labels = []
        targets = []
        if not os.path.exists(raw_data):
            raise ValueError('Raw image directory does not exist.') 
        for folder in os.listdir(raw_data):
            if folder == ".DS_Store":
                continue

for image in os.listdir(os.path.join(raw_data, folder)):
                if folder not in self.labels:
                    self.labels.append(folder)
                targets.append(self.labels.index(folder))

img_arr = imageio.imread(os.path.join(raw_data, folder, image), pilmode="RGB")
                resize = torchvision.transforms.Resize(size)
                crop_center = torchvision.transforms.CenterCrop(size)

img = torch.from_numpy(img_arr).permute(2, 0, 1).float()
                img = resize(img)
                img = crop_center(img)
                img /= 255

dataset.append(img)

self.data = torch.stack(dataset)
        self.targets = torch.Tensor(targets).type(torch.LongTensor)

torch.save((self.data, self.targets, self.labels), dataset_save)

def load(self, dataset_save):
        if not os.path.exists(dataset_save):
            raise ValueError('Dataset file does not exist. Try creating the dataset by running with convert=True first.') 
        self.data, self.targets, self.labels = torch.load(dataset_save)

Corpora

A supervised machine learning algorithm is only as good as the data that is provided to it. Learning models learn how human annotators assign the labels to a sample and a system can only be expected to be as good as the human annotator. Some tasks are more subjective than others for human annotators and this is usually reflected in the classifier performance. Therefore, many practitioners recommend that before measuring classifier performance, and analysis of the subjectivity of the human annotation should be performed. This is usually referred to as inter annotator agreement. When annotating a new resource, the quality of the annotation process must be measured in some way. Inter-annotator metrics refer to techniques used to measure the overall agreement between the annotations of two or more individuals. Important metrics used to evaluate inter-annotator agreement ( artsteinRef ) include:

average observed agreement
Pi
Alpha
S
Kappa

These metrics differ in how they correct for expected chance agreement.

Expected chance agreement

Expected chance agreement is a probability that 2 annotators will agree on their annotation for an item by chance. This probability depends on the number of classes. Formally, this probability is calculated as follows:

$ A_e = \sum_{k \in K} P( k | c_1) \cdot P(k | c_2) $

where $ c_i $ is the annotator $ i $ , and $ k $ is the assigned category. Inter-annotator agreement metrics are important because they help to set theoretical boundaries on the accuracy that a given machine learning methodology can achieve using the annotated corpora (Bird and Loper). A brief description of some of the techniques is provided in the following discussion.

Average Observed Agreement (Ao)

Averaged observed agreement is the easiest metric to compute. It is the percentage of annotations that two annotators agreed upon. The metric is formulated as follows where the variable “samples” represents the total number of annotation samples and “agreed” is the amount of samples for which both annotators agreed.

$ A_o = \frac{1}{samples} \sum Agreed $

Chance-corrected Metrics (Acorr)

Chance corrected metrics are those that take into account the expected chance agreement Ae. Once chance agreement is defined, the metric can be corrected. These types of metrics include: s, alpha, and kappa. Formally, the main concept in these metrics is defined as follows:

$ A_{corr} = \frac{A_o - A_e}{1 - A_e} $

The following code shows how to perform this analysis using the well known nltk ( www.nltk.org ) framework.

More details about data collection, corpora development, and web scraping can be found from the following sources: Web Scraping with Python by Ryan Mitchell, Calix and Knapp (2011), and Natural Language Processing with Python by Bird, Klein, and Loper.

Summary

In this chapter, a discussion about issues related to data and data pre-processing was provided. In particular, the following topics were addressed: data pre-processing, reading data from csv files, types of features, corpora processing, and one-hot encoding. In the next chapter, we will begin our discussion of how to program out first deep learning models with PyTorch.