This section explains how to prepare wrangled or machine-learning-ready data for modelling.
Training & Target Features
Navigate to Data Engine > Define Dataset > Training & Target Features. This is where input and target features are selected for preprocessing.
To select input and target features
From the list in "My Data", select the data you want to work with
A list of all the features (columns) in the selected data is shown. If there are many columns, you can search for a feature by column name
Check the box beside a feature or column to select it as an input feature for modelling
Click the radio button beside a feature to select it as the target (output) feature
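Conceptually, the selection above separates the data into input columns and one target column. A minimal sketch in plain Python, assuming a small made-up dataset (the column names and values are hypothetical, not taken from the platform):

```python
# Hypothetical rows standing in for a dataset picked from "My Data".
rows = [
    {"age": 34, "income": 52000, "city": "Accra", "churned": "no"},
    {"age": 51, "income": 61000, "city": "Lagos", "churned": "yes"},
    {"age": 27, "income": 38000, "city": "Accra", "churned": "no"},
]

input_features = ["age", "income"]  # the checked boxes
target_feature = "churned"          # the radio button

# Keep only the chosen input columns, in order, for every row.
X = [[row[name] for name in input_features] for row in rows]
# The target is a single column.
y = [row[target_feature] for row in rows]
```

Columns that are neither checked nor chosen as the target ("city" here) are simply left out of modelling.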
The next step is to preprocess the selected features so they are ready to be fed to machine learning algorithms for modelling.
This section explains how feature preprocessing algorithms can be applied to features in the data for modelling.
Before you begin
Feature preprocessing simply means transforming the features or columns to formats that can be easily understood by algorithms that learn from data. This way, the algorithms are able to learn from the data and use them in making decisions.
For instance, some algorithms cannot learn from text categories. If a feature called "gender" contains the values Male and Female, preprocessing it means converting the text values to numbers, e.g. Male becomes 1 and Female becomes 2.
The steps and algorithms for preprocessing data are shown and explained in this section. The algorithms for preprocessing Tabular, text (NLP) and time series data are different from those used for Vision data.
To get here, navigate to Data Engine > Define Dataset > Feature Pre-processing
Tabular, NLP & Time Series:
To preprocess features for Tabular, NLP & Time Series data, the platform makes recommendations, as seen in (#6).
Select "Feature(s) for preprocessing" (#1)
Choose whether to apply preprocessing algorithms to specific feature(s) or to all the features in the dataset at once (#2)
"On single feature" means the preprocessing algorithm selected is applied to only the feature selected in step 2
"On Dataset" means the preprocessing algorithm is applied to all the features in the entire data set
A list of the Features selected in step 2 for preprocessing is shown (#3).
Note: Whenever features are selected and preprocessed with a particular algorithm, clear the features. Then select other features and apply a different preprocessing algorithm to them. Keep doing this until all the features in the data are preprocessed.
Select a preprocessing algorithm (#4) to apply to the selected feature(s) listed in step 4, then click "Add to Step" (#5).
For custom preprocessing algorithms, provide a name and edit the code. For other algorithms, fill in any information the algorithm requires (#5)
List of features and the corresponding preprocessing algorithms that will be applied once defined and saved (#6).
The data preprocessing algorithms for Tabular, NLP, and Time Series data, and how they work, are described below (follow steps 2 to 7 above)
How it works
New Feature Extractor - add custom feature preprocessing algorithm or function
Default MLP Preprocessing
The platform examines the features selected for preprocessing and then applies the appropriate preprocessing algorithms to the selected features.
Scale the values in a feature or column between 0 and 1. Applied to features with a Num data type.
Standardize data along any axis, center to the mean and component-wise scale to unit variance.
The range of values of raw data may vary widely and that may not be suitable for some machine learning algorithms. For example, many classifiers calculate the distance between two points by the Euclidean distance. If one of the features has a broad range of values, the distance will be governed by this particular feature. Therefore, the range of all features should be normalized so that each feature contributes approximately proportionately to the final distance
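The effect described above can be seen with two made-up points. This is only an illustrative sketch: the features, values, and assumed ranges (age 18 to 90, income 20,000 to 120,000) are hypothetical.

```python
import math

# Two hypothetical customers described by (age in years, income in dollars).
a = (25, 50_000)
b = (60, 51_000)

def euclidean(p, q):
    """Straight-line distance between two points."""
    return math.sqrt(sum((pi - qi) ** 2 for pi, qi in zip(p, q)))

# On raw values the income difference (1000) swamps the age difference (35).
raw_distance = euclidean(a, b)

def rescale(point):
    """Map each feature to [0, 1] using the assumed ranges."""
    age, income = point
    return ((age - 18) / (90 - 18), (income - 20_000) / (120_000 - 20_000))

# After rescaling, the large age gap is what drives the distance.
scaled_distance = euclidean(rescale(a), rescale(b))
```

Before scaling the distance is about 1000.6 and is almost entirely due to income. After scaling it is about 0.49 and is almost entirely due to age.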
Standardize features by removing the mean and scaling to unit variance.
Scale down the values of a feature such that it has the properties of a standard normal distribution with a mean of zero and a standard deviation of 1.
Scale features if you intend to use algorithms that rely on Euclidean distance, or on gradient descent, which reaches a minimum more quickly when features are on similar scales. This means that you should scale features when using algorithms such as KNN, K-means clustering, linear and logistic regression, and deep learning and artificial neural network algorithms such as CNNs.
You do not have to scale features when using tree-based algorithms such as decision trees, random forest, and XGBoost, or the bagging and boosting algorithms built on them, because the values in a feature are only used to create branches based on conditions and rules.
Standardization of a dataset is a common requirement for many machine learning estimators. They might behave poorly if the individual features do not more or less look like standard normally distributed data.
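A minimal sketch of standardization in plain Python, assuming the population standard deviation is used (as scikit-learn's StandardScaler does). The helper name and sample values are illustrative only:

```python
from statistics import fmean, pstdev

def standardize(values):
    """Return z-scores: (x - mean) / standard deviation."""
    mean = fmean(values)
    std = pstdev(values)  # population standard deviation
    if std == 0:
        return [0.0 for _ in values]  # constant column: nothing to scale
    return [(x - mean) / std for x in values]

heights_cm = [150.0, 160.0, 170.0, 180.0, 190.0]
z = standardize(heights_cm)  # mean of z is 0, standard deviation is 1
```

The middle value (170, equal to the mean) maps to 0, and the others spread symmetrically around it.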
Transform features by scaling each feature to a given range. It scales and translates each feature individually such that it is in the given range on the training set, e.g. rescale the range of features between 0 and 1 or -1 and 1.
For example, suppose that we have the students' weight data, and the students' weights span [160 pounds, 200 pounds]. To rescale this data, we first subtract 160 from each student's weight and divide the result by 40 (the difference between the maximum and minimum weights).
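The weight example above can be written out as a short sketch. The function name and the `new_min`/`new_max` parameters are illustrative, not the platform's own:

```python
def min_max_scale(values, new_min=0.0, new_max=1.0):
    """Rescale values linearly so min(values) -> new_min and max(values) -> new_max."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [new_min for _ in values]  # constant column
    span = hi - lo
    return [new_min + (x - lo) / span * (new_max - new_min) for x in values]

weights_lb = [160, 170, 180, 200]
scaled = min_max_scale(weights_lb)             # [0.0, 0.25, 0.5, 1.0]
scaled_pm1 = min_max_scale(weights_lb, -1, 1)  # [-1.0, -0.5, 0.0, 1.0]
```

For 170 pounds, (170 - 160) / 40 = 0.25, matching the subtract-then-divide steps in the example.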
Scale each feature by its maximum absolute value. Each feature is scaled and translated such that the maximal absolute value of each feature in the training set is 1. This feature preprocessing technique does not shift or center the data, and therefore does not destroy any sparsity.
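A minimal sketch of maximum-absolute-value scaling, using a made-up feature that is mostly zeros to show why sparsity survives:

```python
def max_abs_scale(values):
    """Divide every value by the largest absolute value in the feature."""
    peak = max(abs(x) for x in values)
    if peak == 0:
        return list(values)  # all zeros: nothing to scale
    return [x / peak for x in values]

sparse_feature = [0, 0, -4, 0, 2, 0, 8]
scaled = max_abs_scale(sparse_feature)
# [0.0, 0.0, -0.5, 0.0, 0.25, 0.0, 1.0]
```

Because nothing is subtracted from the values, every zero stays zero and signs are preserved.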
L1 and L2 Normalization
Note: Normalization works only on rows and not on columns.
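A sketch of what row-wise normalization means, assuming the usual definitions: L1 divides a row by the sum of its absolute values, and L2 divides it by its Euclidean length. The function name is illustrative:

```python
import math

def normalize_row(row, norm="l2"):
    """Scale one sample (row) so that its L1 or L2 norm equals 1."""
    if norm == "l1":
        length = sum(abs(x) for x in row)
    elif norm == "l2":
        length = math.sqrt(sum(x * x for x in row))
    else:
        raise ValueError("norm must be 'l1' or 'l2'")
    if length == 0:
        return list(row)  # an all-zero row is left unchanged
    return [x / length for x in row]

samples = [[3.0, 4.0], [1.0, 1.0]]
l2 = [normalize_row(r, "l2") for r in samples]  # first row -> [0.6, 0.8]
l1 = [normalize_row(r, "l1") for r in samples]  # second row -> [0.5, 0.5]
```

Each row is handled independently of the others, which is why this operates on rows rather than on columns.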
Note: This transform is non-linear. It may distort linear correlations between variables measured at the same scale but renders variables measured at different scales more directly comparable.
Categorical to numeric - transform non-numerical categories into numerical categories
Encode categorical data to numbers with values between 0 and the number of classes (n_class)-1.
Put simply, this transforms character or text categories in a feature into numerical categories, e.g. male, female, and transgender become 0, 1, and 2.
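A minimal sketch of this kind of encoding. It assumes codes are assigned in order of first appearance so that the result matches the example above. The platform's actual ordering is not documented here, and some libraries sort the categories alphabetically instead.

```python
def fit_label_codes(values):
    """Assign 0..n_class-1 to categories in order of first appearance."""
    codes = {}
    for value in values:
        if value not in codes:
            codes[value] = len(codes)
    return codes

gender = ["male", "female", "transgender", "female", "male"]
codes = fit_label_codes(gender)          # {"male": 0, "female": 1, "transgender": 2}
encoded = [codes[value] for value in gender]  # [0, 1, 2, 1, 0]
```

The same `codes` mapping has to be reused on validation data so that a category always receives the same number.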
This section explains how vision data is preprocessed for modelling.
Feature Preprocessing for vision data works as follows.
Select the image Feature
Select the Feature Pre-processing algorithm to be applied to the images
Set parameters, if any are needed, then click "Add to step"
A list of all the feature preprocessing algorithms and their respective parameters that will be applied to the feature selected in step 1 once saved.
List of data pre-processing algorithms for Vision Data and how they work.
How it works
New Feature Extractor - add custom algorithm or function for preprocessing vision data
Refer to how to add New Feature Extractor.
Vision Feature Extractor
Categorical to numeric - transform non-numerical categories to numerical categories
Encode categorical data to numbers with values between 0 and the number of classes (n_class)-1. For instance, red, green, and blue become 0, 1, and 2.
Resize Image - resize image
Select "Resize Image" (Figure 21.2 #2)
Parameters (#3):
The raw image is not modified; rather, a new image with the new dimensions is returned.
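To illustrate the "new image is returned" behaviour, here is a sketch that resizes a tiny grayscale image stored as a list of rows. Nearest-neighbour sampling is an assumption made for brevity, and the interpolation the platform uses is not stated here.

```python
def resize_nearest(image, new_width, new_height):
    """Return a new image (list of rows); the input is left untouched."""
    old_height, old_width = len(image), len(image[0])
    return [
        [
            image[y * old_height // new_height][x * old_width // new_width]
            for x in range(new_width)
        ]
        for y in range(new_height)
    ]

raw = [[0, 50], [100, 150]]         # 2x2 grayscale image
bigger = resize_nearest(raw, 4, 4)  # new 4x4 image; raw is unchanged
```

Each pixel of the 2x2 input becomes a 2x2 block in the output, and `raw` still holds its original four values afterwards.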
Work on RGB image
By default, all images loaded into the platform are converted to grayscale. This option converts the images to RGB instead.
Note that working on RGB is at least three times more computationally expensive.
Convert to Grayscale image
Convert all images to grayscale
Convert to BW image - Convert images to black and white
Select "Convert to BW image"
Parameters:
Use only R from RGB
Use only the red colour in an RGB image
Use only G from RGB
Use only the green colour in an RGB image
Use only B from RGB
Use only the blue colour in an RGB image
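The colour options above (grayscale, black and white, and single channel) can be sketched on a tiny image stored as rows of (R, G, B) tuples. The luma weights and the `threshold` parameter are assumptions for illustration, and the platform may use different values.

```python
# A tiny hypothetical 1x3 RGB image: one red, one green, one white pixel.
rgb = [[(255, 0, 0), (0, 255, 0), (255, 255, 255)]]

def to_grayscale(image):
    """Weighted sum of R, G, B using ITU-R BT.601 luma weights (a common choice)."""
    return [[round(0.299 * r + 0.587 * g + 0.114 * b) for r, g, b in row]
            for row in image]

def to_black_and_white(gray, threshold=128):
    """Map each gray value to 0 or 255; threshold is a hypothetical parameter."""
    return [[255 if value >= threshold else 0 for value in row] for row in gray]

def single_channel(image, index):
    """Keep one colour channel: index 0 = R, 1 = G, 2 = B."""
    return [[pixel[index] for pixel in row] for row in image]

gray = to_grayscale(rgb)            # [[76, 150, 255]]
bw = to_black_and_white(gray)       # [[0, 255, 255]]
red_only = single_channel(rgb, 0)   # [[255, 0, 255]]
```

Each result has one value per pixel instead of three, which is why these options are cheaper to work with than full RGB.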
Pytorch RandomResizedCrop - crop a given image to a random size and aspect ratio.
Select "Pytorch RandomResizedCrop"
Parameter:
Note that a crop with a random size (relative to the original size) and a random aspect ratio (relative to the original aspect ratio) is created. The crop is then resized to the given size (Parameter).
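The sampling step can be sketched in plain Python. This is a simplified illustration of how a crop box might be chosen, not the torchvision implementation: the `scale` and `ratio` defaults mirror torchvision's documented defaults, while the function name and the whole-image fallback are assumptions of this sketch. The final resize to the given size is not shown.

```python
import math
import random

def sample_crop_box(width, height, scale=(0.08, 1.0), ratio=(3 / 4, 4 / 3),
                    rng=random):
    """Pick (left, top, crop_w, crop_h) for a random-size, random-ratio crop."""
    area = width * height
    log_ratio = (math.log(ratio[0]), math.log(ratio[1]))
    for _ in range(10):
        # Random fraction of the original area, random aspect ratio (log-uniform).
        target_area = area * rng.uniform(*scale)
        aspect = math.exp(rng.uniform(*log_ratio))
        crop_w = round(math.sqrt(target_area * aspect))
        crop_h = round(math.sqrt(target_area / aspect))
        if 0 < crop_w <= width and 0 < crop_h <= height:
            left = rng.randint(0, width - crop_w)
            top = rng.randint(0, height - crop_h)
            return left, top, crop_w, crop_h
    return 0, 0, width, height  # fallback in this sketch: use the whole image

box = sample_crop_box(640, 480, rng=random.Random(0))
```

Because the box changes on every call, the model sees a slightly different view of the same image each time, which acts as data augmentation.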
Pytorch Normalize - Normalize a tensor image with a mean and standard deviation.
Select "Pytorch Normalize"
Parameters:
Note: Given mean [M1,...Mn] and std[S1,...Sn] for n channels, each channel of the input will be normalized.
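Per channel, the normalization computes output[channel] = (input[channel] - mean[channel]) / std[channel]. A plain-Python sketch on a made-up image stored as channels of rows (no PyTorch required; the helper name and values are illustrative):

```python
def normalize_channels(image, mean, std):
    """image[c][y][x] -> (value - mean[c]) / std[c] for every channel c."""
    return [
        [[(value - m) / s for value in row] for row in channel]
        for channel, m, s in zip(image, mean, std)
    ]

# Hypothetical 2-channel, 1x2 image with values already in [0, 1].
image = [[[0.5, 1.0]], [[0.0, 0.25]]]
out = normalize_channels(image, mean=[0.5, 0.25], std=[0.5, 0.25])
# [[[0.0, 1.0]], [[-1.0, 0.0]]]
```

Each channel uses its own mean and standard deviation, so `mean` and `std` need one entry per channel.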
Principal Component Analysis (PCA) - project data to a lower-dimensional space or, put simply, reduce the dimensions of a feature set while keeping as much of the variance in the data as possible.
#Linear Dimensionality Reduction
Select "Principal Component Analysis" (PCA)
Parameter:
Note that PCA rotates the data and projects it onto the directions of greatest variance. These directions are the principal components.
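A compact sketch of PCA using NumPy's singular value decomposition. The `n_components` argument and the sample points are hypothetical, and this is one standard way to compute PCA rather than a description of the platform's implementation.

```python
import numpy as np

def pca(X, n_components):
    """Project the rows of X onto the top principal components."""
    X_centered = X - X.mean(axis=0)
    # Rows of Vt are the principal directions, ordered by decreasing variance.
    _, singular_values, Vt = np.linalg.svd(X_centered, full_matrices=False)
    components = Vt[:n_components]
    explained_variance = singular_values[:n_components] ** 2 / (len(X) - 1)
    return X_centered @ components.T, components, explained_variance

# Hypothetical 2-D points lying close to the line y = x.
X = np.array([[1.0, 1.1], [2.0, 1.9], [3.0, 3.2], [4.0, 3.9], [5.0, 5.1]])
projected, components, variance = pca(X, n_components=1)
```

Here the two columns collapse into one, and the single component points roughly along the diagonal, where the data varies most.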
Linear Discriminant Analysis (LDA) - reduce the dimensions of the feature set by projecting data in a way that class separability is maximised.
#Linear Dimensionality Reduction
Select "Linear Discriminant Analysis" (LDA)
Parameter:
Note that the projection places data points from the same class close together and data points from different classes far apart.
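For two classes, this idea reduces to Fisher's linear discriminant. The sketch below uses NumPy, assumes labels 0 and 1, and uses made-up points. It illustrates the principle only and is not the platform's implementation.

```python
import numpy as np

def fisher_lda_direction(X, y):
    """Return the unit vector that best separates two classes (labels 0 and 1)."""
    X0, X1 = X[y == 0], X[y == 1]
    m0, m1 = X0.mean(axis=0), X1.mean(axis=0)
    # Within-class scatter: how spread out each class is around its own mean.
    Sw = (X0 - m0).T @ (X0 - m0) + (X1 - m1).T @ (X1 - m1)
    # Direction that maximises between-class distance relative to that spread.
    w = np.linalg.solve(Sw, m1 - m0)
    return w / np.linalg.norm(w)

X = np.array([[1.0, 2.0], [2.0, 3.0], [3.0, 3.5],
              [6.0, 5.0], [7.0, 7.5], [8.0, 8.0]])
y = np.array([0, 0, 0, 1, 1, 1])
w = fisher_lda_direction(X, y)
projected = X @ w  # one number per sample instead of two
```

After projection, every class 1 value is larger than every class 0 value, so the two classes can be separated by a single threshold on one dimension.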
Independent Component Analysis (ICA) - separate independent sources from mixed signals. Unlike PCA, which focuses on maximizing the variance of the data points, ICA focuses on statistical independence.
#Linear Dimensionality Reduction
Select "Independent Component Analysis" (ICA)
Parameter:
Singular Value Decomposition (SVD) - reduce the dimensions of data using truncated SVD.
#Linear Dimensionality Reduction
Select "Singular Value Decomposition" (SVD)
Parameter:
Kernel Principal Component Analysis (KPCA) - use kernels to reduce the dimensions of data
#Non-Linear Dimensionality Reduction
Select "Kernel Principal Component Analysis" (KPCA)
Review and Save Dataset
Once data has been preprocessed, it has to be saved and then split into training and validation sets. The split dataset is what is used for modelling.
Click on "Review and Save Dataset"
Dataset Name: name for the preprocessed data set
Click "Define Dataset"
Now that data has been wrangled, preprocessed, and saved, we have to split the data into training and validation sets for modelling.
This section explains how to split data into training and validation sets.
Note: Cross validation is done automatically in the background when data is split.
To split data for modelling
Click the "Cross-Validation Dataset" tab. You can toggle between this tab and the "Explore Dataset" tab. The Cross-Validation Dataset tab allows you to split data and the Explore Dataset tab allows you to explore and review everything that has been done to the data up until this point.
Select the data you have been working on
All preprocessed data sets saved under the selected data in step 2 show up here.
Check the box beside the dataset you would like to split. By default, the data is set to split: 80% for training and 20% for validation
Move the slider to select percentages for training and validation sets.
Click "Generate Dataset"
Click on the data icon in #3 to view the split datasets
You are ready for modelling. Check Machine Learning Engine for details on how to apply ML algorithms to the data.
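The default 80% / 20% split described in the steps above can be sketched as follows. Shuffling before splitting and the `seed` parameter are assumptions of this sketch, and how the platform orders samples is not stated here.

```python
import random

def split_dataset(rows, train_fraction=0.8, seed=0):
    """Shuffle a copy of rows and split it into (training, validation)."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    cut = round(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]

rows = list(range(100))                  # stand-in for 100 preprocessed samples
train, validation = split_dataset(rows)  # 80 training rows, 20 validation rows
```

Moving the slider corresponds to changing `train_fraction`. Every sample ends up in exactly one of the two sets.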
Once data has been wrangled, preprocessed, saved, and split into training and validation sets, the Explore Dataset tab allows you to review the dataset. The review includes:
Training and validation datasets
Training and target features
Feature preprocessing steps and algorithms applied to every feature
How missing values were treated if any
View configuration of the entire data and its features in a .json format
If any changes are needed, click Edit Dataset to go back to Define Dataset and make them.
Feature preprocessing steps applied to any preprocessed and saved data set can be saved and re-used or applied to the same or different data sets at any time.
To save a feature preprocessing recipe:
To reuse a feature preprocessing recipe: