Beginner’s Guide to Machine Learning

Ah, the two famous words of this day and age: “Machine Learning”. Machine Learning itself isn’t a new concept; it has been around since the 1940s. Yep, you heard that right! So what exactly is Machine Learning? Well, let’s take a look.

Machine Learning is a broad field of Computer Science that studies the ways in which computers can mimic intelligence by predicting an output from a given set of inputs. The simplest form of Machine Learning is linear regression. Remember the typical y = mx + c model that was taught in Secondary School? Yep, that’s machine learning. Imagine I have a set of x and y values. I can then choose the m and c values that minimise the error between the predicted and the actual y values. For simplicity, assume the line y = mx + c has to pass through the origin, so c is equal to 0. The subsequent task of choosing m is then called training the model.

Let’s use an example. Assume I have 2 data points: one with an x value of 1 and a y value of 2, and another with an x value of 3 and a y value of 6. In this case, m would be 2. So if I were to provide another x with a value of 5, I can easily deduce that my y would be 10.
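To make this concrete, here is a minimal sketch of that training step in Python, assuming numpy is available (it is installed alongside the libraries below). With c fixed at 0, the least-squares choice of m is sum(x*y) / sum(x*x); this snippet is purely illustrative and isn’t reused later in the tutorial.

import numpy as np

# The two data points from the example above: (1, 2) and (3, 6)
x = np.array([1.0, 3.0])
y = np.array([2.0, 6.0])

# With c fixed at 0, the least-squares slope is sum(x*y) / sum(x*x)
m = np.sum(x * y) / np.sum(x * x)
print(m)      # 2.0

# "Predict" y for a new x of 5
print(m * 5)  # 10.0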

So let’s get right into code. First, make sure you have the PyCharm IDE installed (any IDE works, but I have personally found PyCharm especially easy for beginners to use).


Perform the following commands to install the TensorFlow, pandas, and seaborn libraries.

$ pip install tensorflow
$ pip install pandas
$ pip install seaborn
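To check that everything installed correctly, you can optionally run a quick import from Python. This is a minimal sanity check, assuming pip installed into the same interpreter you run:

# Quick check that the install worked: this should print a version
# number such as 2.x.x without raising an ImportError
import tensorflow as tf
print(tf.__version__)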

Here are the steps for creating a Machine Learning model that takes in multiple input variables and performs regression analysis to produce a numerical value:

Step 1: Import header libraries

Step 2: Import the data

Step 3: View the data

Step 4: Clean the data

Step 5: Split the data

Step 6: Normalize the data

Step 7: Build the model

Step 8: Use the model to make predictions

Step 1: Import header libraries

Import the various libraries
# Import these 3 libraries for use later
import tensorflow as tf
import pandas as pd
import numpy as np

# This is to prevent a backend error when importing matplotlib in MACOSX
import matplotlib
matplotlib.use('TkAgg')
import matplotlib.pyplot as plt

# Within the Tensorflow library import keras, an API for deep learning
from tensorflow import keras
from tensorflow.keras import layers
from tensorflow.keras.layers.experimental import preprocessing

Import these header libraries into your PyCharm IDE. For macOS, a few additional lines are needed to prevent a backend error when importing matplotlib.

Step 2: Import the data

column_names = ['V', 'A', 'Degrees', 'Charge']
raw_dataset = pd.read_csv("~/PycharmProjects/ML/test.csv", names=column_names,
                          na_values='?', comment='\t', sep=',',
                          skipinitialspace=True)

The pandas library provides a simple method called read_csv() that allows CSV data to be imported easily. Six main arguments are passed to the method here: the path of the CSV file, the column names, which characters to treat as NaN (Not a Number), which character marks the start of a comment, which character is the delimiter, and whether to skip spaces after the delimiter.

Step 3: View the data

print(raw_dataset.tail(5))
print(raw_dataset.head(3))

This prints the last 5 rows of the dataset, followed by the first 3 rows. Viewing the dataset is always a good way to develop an appreciation of the data being fed to the model.

Step 4: Clean the data

print(raw_dataset.isna().sum())
raw_dataset = raw_dataset.dropna()

The first line counts the NaN values in each column, and the second line drops the rows containing NaN values from the overall data. This simple method of cleaning works when the number of NaN values is significantly smaller than the overall dataset.

Step 5: Split the data

# Choose 80% of the data from the original dataset for training
train_dataset = raw_dataset.sample(frac=0.8, random_state=0)
# The remaining 20% of the data goes into testing
test_dataset = raw_dataset.drop(train_dataset.index)

# This prints out the count, mean, standard deviation, min value, 25th, 50th, and 75th percentiles, followed by the max value
print(str(train_dataset.describe().transpose()) + "\n")

# Copy the training and testing data into sets of features used as inputs to the model
train_features = train_dataset.copy()
test_features = test_dataset.copy()

# Pop the 'Charge' column off each set of features to form the labels, the outputs of the model
train_labels = train_features.pop('Charge')
test_labels = test_features.pop('Charge')

First, create a variable called train_dataset and take a fraction of the original raw_dataset to be used for training the model. In this case we have chosen 0.8, or 80%, of the data for training; the remaining 20% is used for testing. The .describe().transpose() method prints summary information about the training dataset. Finally, pop the 'Charge' column off each dataset so the data is divided into features, which are the inputs to the model, and labels, which are the outputs. We end up with features and labels for training, followed by features and labels for testing.

Step 6: Normalize the data

normalizer = preprocessing.Normalization(input_shape=[3, ])
normalizer.adapt(np.array(train_features))

Define a normalizer based on the shape of the input, which in this case is 3 because we have 3 features (‘V’, ‘A’, ‘Degrees’) and one label (‘Charge’). Adapting the normalizer to the train_features data lets it learn the mean and variance of each feature; the model then applies those same statistics to any data passed through it, including the test features later. Normalizing the data helps the model perform better.
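As an optional sanity check, here is a minimal sketch (assuming TensorFlow’s default eager execution) that applies the adapted layer to the first training row; the normalized values should come out roughly centred around 0.

# Optional sanity check: apply the adapted normalizer to the first row.
# After adapt(), each feature is shifted and scaled by the mean and
# variance learned from the training data.
first = np.array(train_features[:1])
print('Original row:  ', first)
print('Normalized row:', normalizer(first).numpy())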

Step 7: Build the model

dnn_model = keras.Sequential([
    normalizer,
    layers.Dense(64, activation='relu'),
    layers.Dense(64, activation='relu'),
    layers.Dense(1)
])

dnn_model.compile(loss='mean_absolute_error',
                  optimizer=tf.keras.optimizers.Adam(0.001))

history = dnn_model.fit(
    train_features, train_labels,
    validation_split=0.2,
    verbose=0, epochs=1000)

Build the model and fit it using the training features and training labels with 1000 epochs, where an epoch refers to one pass through the fitting process. Increasing the number of epochs may lead to overfitting, while decreasing it may lead to underfitting. In this case a deep neural network with two hidden layers is used as the model, and validation_split holds out 20% of the training data for validation during fitting.
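If you are curious how training progressed, here is a minimal sketch that plots the loss per epoch using the matplotlib import from Step 1. The 'loss' and 'val_loss' keys are recorded by fit() because validation_split was used; the exact shape of the curves will depend on your data.

# Plot the training and validation loss recorded by fit()
plt.plot(history.history['loss'], label='loss')
plt.plot(history.history['val_loss'], label='val_loss')
plt.xlabel('Epoch')
plt.ylabel('Mean absolute error [Charge]')
plt.legend()
plt.show()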

Step 8: Use the model to make predictions

test_results = {}
test_results['dnn_model'] = dnn_model.evaluate(test_features, test_labels, verbose=0)
print(pd.DataFrame(test_results, index=['Mean absolute error [SOC]']).T)

These lines evaluate the model using the 20% of the data that was split off to form the test features and labels. An overall summary of the result can be viewed using the DataFrame method from pandas.

dnn_model.predict(validation_dataset[:])

New datasets can also be passed into the model to be used for predictions. The model will output a single regression value dictating the ‘Charge’ in our specific example. Of course, validation_dataset is a .csv file that has to be imported just like how raw_dataset was imported in Step 2.
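For example, here is a hedged sketch of importing such a validation file. The file name validation.csv is a placeholder, and it is assumed to contain only the three feature columns (no ‘Charge’ column, since that is what we are predicting):

# Hypothetical validation file: the path and name are placeholders
validation_dataset = pd.read_csv("~/PycharmProjects/ML/validation.csv",
                                 names=['V', 'A', 'Degrees'],
                                 na_values='?', sep=',',
                                 skipinitialspace=True)
predictions = dnn_model.predict(validation_dataset)
print(predictions)  # one predicted 'Charge' value per row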

Cheers, there you have it: your first foray into Machine Learning with Python (TensorFlow, pandas, numpy).

An aspiring Robotics Researcher. I am currently in my 4th year of undergraduate studies. I am working on optimising the navigation packages on ROS.