Automatic data preparation

The first stage of a typical machine learning pipeline deals with data preparation (recall the pipeline of Figure 1). There are two main aspects that should be taken into account: data cleansing and data synthesis.

Data cleansing is about improving the quality of data by checking for wrong data types, missing values, and errors, and by applying data normalization, bucketization, scaling, and encoding. A robust AutoML pipeline should automate all of these mundane but extremely important steps as much as possible.
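As a minimal sketch of what such cleansing steps look like in practice, the following Python snippet uses pandas and scikit-learn (neither is prescribed by this chapter; the toy DataFrame and its column names are hypothetical) to coerce a wrong data type, impute missing values, scale, and bucketize:

import pandas as pd
from sklearn.preprocessing import StandardScaler

# Hypothetical toy data with a wrong type and a missing value.
df = pd.DataFrame({
    "age": [25, None, 42, 31],
    "income": ["50000", "62000", "bad_value", "58000"],
})

# Fix the wrong data type; unparseable entries become missing values.
df["income"] = pd.to_numeric(df["income"], errors="coerce")

# Impute missing values with the column median.
df = df.fillna(df.median(numeric_only=True))

# Scale numeric features to zero mean and unit variance.
df[["age", "income"]] = StandardScaler().fit_transform(df[["age", "income"]])

# Bucketize age into three discrete ranges.
df["age_bucket"] = pd.cut(df["age"], bins=3, labels=["low", "mid", "high"])

An AutoML engine would apply steps like these automatically, typically after inferring column types and statistics from the raw data.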

Data synthesis is about generating synthetic data via augmentation for training, evaluation, and validation. Normally, this step is domain-specific. For instance, we have seen how to generate synthetic CIFAR10-like images (Chapter 4, Convolutional Neural Networks) by using cropping, rotation, resizing, and flipping operations. One can also think about generating additional images or videos via GANs (see Chapter 6, Generative Adversarial Networks) and using the augmented synthetic dataset for training. A different approach should be taken for text, where it is possible to train RNNs (Chapter 9, Autoencoders) to generate synthetic text, or to adopt NLP techniques such as BERT, seq2seq, or Transformers to annotate text, or to translate it into another language and then back to the original one, which is another domain-specific form of augmentation.
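To make the image case concrete, here is a short sketch of those CIFAR10-style augmentation operations using tf.keras's ImageDataGenerator (the specific parameter values are illustrative assumptions, not the settings used in Chapter 4):

import tensorflow as tf
from tensorflow.keras.preprocessing.image import ImageDataGenerator

# Augmentation pipeline mirroring the operations mentioned above:
# shifts (crop-like), rotations, zooms (resize-like), and flips.
datagen = ImageDataGenerator(
    rotation_range=15,       # random rotation in degrees
    width_shift_range=0.1,   # random horizontal shift
    height_shift_range=0.1,  # random vertical shift
    zoom_range=0.1,          # random zoom in/out
    horizontal_flip=True,    # random horizontal flip
)

(x_train, y_train), _ = tf.keras.datasets.cifar10.load_data()

# Yields an endless stream of augmented batches for training.
augmented_batches = datagen.flow(x_train, y_train, batch_size=32)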

A different approach is to generate synthetic environments where machine learning can take place. This has become very popular in reinforcement learning and gaming, especially with toolkits such as OpenAI Gym, which aims to provide an easy-to-set-up simulation environment with a variety of different (gaming) scenarios.
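As an illustration, the classic Gym API (versions before 0.26; newer releases live in the gymnasium package and return slightly different tuples) lets you run a full simulated episode in a few lines:

import gym

# Create a ready-made simulation environment.
env = gym.make("CartPole-v1")

obs = env.reset()
done = False
total_reward = 0.0
while not done:
    action = env.action_space.sample()  # random policy as a placeholder
    obs, reward, done, info = env.step(action)
    total_reward += reward
env.close()

print("Episode reward:", total_reward)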

Put simply, we can say that synthetic data generation is another option that should be provided by AutoML engines. Frequently, the tools used are very domain-specific, and what works for images or video will not necessarily work in other domains such as text. Therefore, we need a (quite) large set of tools for performing synthetic data generation across domains.

Automatic feature engineering

Feature engineering is the second step of a typical machine learning pipeline (see Figure 1). It consists of three major steps: feature selection, feature construction, and feature mapping. Let's look at each of them in turn:

Feature selection aims at selecting a subset of meaningful features by discarding those that contribute little to the learning task. In this context, meaningful is truly dependent on the application and the domain of your specific problem.
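As a minimal sketch of feature selection, the following snippet uses scikit-learn's SelectKBest (the dataset and the choice of k=2 are illustrative, not prescribed by the text) to keep only the features that score highest against the labels:

from sklearn.datasets import load_iris
from sklearn.feature_selection import SelectKBest, f_classif

# A small labeled dataset with four features.
X, y = load_iris(return_X_y=True)

# Keep the two features with the highest ANOVA F-score
# with respect to the class labels.
selector = SelectKBest(score_func=f_classif, k=2)
X_selected = selector.fit_transform(X, y)

print("Original shape:", X.shape)           # (150, 4)
print("Selected shape:", X_selected.shape)  # (150, 2)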

