EMPLOYING MULTI-SOURCE TRANSFER LEARNING AND WEB SCRAPING TO ENHANCE MODEL ACCURACY WHERE DATASET IS LIMITED

Author(s): M. Yussif, G. Dovonon

Keywords: Transfer learning, PyTorch, FloydHub

Brief Introduction

The size of the training data plays a vital role in Machine Learning (ML): larger datasets make it easier to learn representations that generalise well to the task at hand. It is well understood that a small dataset gives only a sparse approximation of the underlying regularity, typically resulting in a model with poor performance. Several techniques have been proposed in the ML literature to help select the best model in the face of a small training dataset, including using a smaller model with fewer parameters, cross-validation [1], and transfer learning [3], among others. However, when the dataset is very small, these approaches often yield little improvement. Most importantly, in the African context, where data is often highly unstructured and available only in minimal quantities, building models with high accuracy calls for a different approach. This paper demonstrates how to achieve models with high predictive accuracy, when faced with data that is highly unstructured or unavailable, using Multi-Source Transfer Learning (model ensembling) or Web Scraping to assemble a structured dataset for most ML tasks. Multi-Source Transfer Learning is a common practice among participants in ML competitions to build winning models, while Web Scraping can be applied to create datasets in contexts where no data infrastructure exists to facilitate dataset creation.
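As an illustration of the Web Scraping approach described above, the sketch below downloads images linked from a listing page into a per-class folder, ready for a standard image-folder dataset loader. The URL, folder names, and helper function are illustrative assumptions, not the exact pipeline used in this work.

    # Minimal web-scraping sketch: save every image linked from a listing page
    # under out_dir/class_name so it can be read by an image-folder loader.
    import os
    import requests
    from bs4 import BeautifulSoup

    def scrape_class_images(listing_url, class_name, out_dir="scraped_data", limit=100):
        """Fetch a listing page and save the images it links to under out_dir/class_name."""
        os.makedirs(os.path.join(out_dir, class_name), exist_ok=True)
        page = requests.get(listing_url, timeout=10)
        soup = BeautifulSoup(page.text, "html.parser")
        saved = 0
        for img in soup.find_all("img"):
            src = img.get("src")
            if not src or not src.startswith("http"):
                continue
            try:
                data = requests.get(src, timeout=10).content
            except requests.RequestException:
                continue  # skip unreachable images rather than aborting the whole scrape
            path = os.path.join(out_dir, class_name, f"{class_name}_{saved}.jpg")
            with open(path, "wb") as f:
                f.write(data)
            saved += 1
            if saved >= limit:
                break
        return saved

    # Example with a hypothetical search page:
    # scrape_class_images("https://example.com/search?q=jollof+rice", "jollof_rice")

Scraped images typically need a manual or automated cleaning pass (removing duplicates and mislabelled pictures) before they are added to the training set.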

Abstract

Machine learning and, more specifically, deep learning have recently driven many innovations. The availability of massive datasets and computation resources has made it possible to create deeper neural networks that learn more meaningful representations of the data. These possibilities are not always accessible to the average African company trying to leverage deep learning to increase profit. In that setting, scarcity of data in particular can be a limitation, since neural networks are known to be data-hungry. When faced with the unavailability of public data, a company can either increase the size of the dataset by collecting data itself or increase the size and complexity of the model. The first option studied here is to use web scraping to assemble and clean a larger dataset. For the second option, increasing the size and complexity of the model, a transfer learning approach was used to avoid overfitting: weights are transferred from models pre-trained on several source datasets and combined through model ensembling. All these methods were tested on a rice meal classification problem, which consists of classifying images of four rice-based dishes: jollof rice, fried rice, plain rice, and waakye. The dataset contains 60 training images and 20 test images per class, for a total of 240 training images and 80 test images. A baseline accuracy of 75% was achieved using a DenseNet convolutional neural network (CNN). The web scraping method used to increase the dataset size attained an accuracy of 87%. A multi-source transfer learning approach was also used, in which models were pre-trained on the Food-101 and Food-256 datasets; it achieved an accuracy of 90%. Together, these two methods provide practical ways to significantly increase the accuracy of a model when the original dataset is small.
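To make the multi-source transfer learning (model ensembling) idea concrete, below is a minimal PyTorch sketch: one DenseNet is fine-tuned per source dataset, and their class probabilities are averaged at prediction time. The checkpoint paths, backbone variant, and training details are illustrative assumptions and are not taken from the paper.

    # Multi-source transfer learning sketch (PyTorch): fine-tune one DenseNet per
    # source dataset, then ensemble the models by averaging their softmax outputs.
    import torch
    import torch.nn as nn
    from torchvision import models

    NUM_CLASSES = 4  # jollof rice, fried rice, plain rice, waakye

    def build_finetune_model(source_checkpoint=None):
        """DenseNet backbone with a fresh 4-way classifier head."""
        model = models.densenet121(weights=models.DenseNet121_Weights.IMAGENET1K_V1)
        if source_checkpoint is not None:
            # Weights fine-tuned on a source dataset such as Food-101 (hypothetical path).
            state = torch.load(source_checkpoint, map_location="cpu")
            # Drop the source classifier head; only the convolutional features transfer.
            state = {k: v for k, v in state.items() if not k.startswith("classifier")}
            model.load_state_dict(state, strict=False)
        model.classifier = nn.Linear(model.classifier.in_features, NUM_CLASSES)
        return model

    class MultiSourceEnsemble(nn.Module):
        """Average the softmax outputs of models fine-tuned from different source datasets."""
        def __init__(self, members):
            super().__init__()
            self.members = nn.ModuleList(members)

        def forward(self, x):
            probs = [torch.softmax(m(x), dim=1) for m in self.members]
            return torch.stack(probs).mean(dim=0)

    # Example with hypothetical checkpoint names:
    # ensemble = MultiSourceEnsemble([
    #     build_finetune_model("densenet_food101.pth"),
    #     build_finetune_model("densenet_food256.pth"),
    # ])

Each member is first fine-tuned on the small target dataset; the ensemble is only used at evaluation time, so no additional training of the combined model is required.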

Conclusion

In summary, this study has shown that a significant improvement in accuracy over the baseline results (up to 15%) can be obtained using Multi-Source Transfer Learning. The study also showed a significant improvement in accuracy over the baseline (up to 12%) by employing a Web Scraping technique to acquire more data for a specific task. There is still considerable room to improve upon these accuracies. We estimate that a very high overall accuracy can be achieved when these two techniques are combined, and in future work we hope to test this hypothesis.

Contact author

seed@ashesi.edu.gh