Build your first neural network in Python

Anni Sapountzi
10 min read · Dec 8, 2017

Artificial Neural Networks have gained attention, mainly because of deep learning algorithms. In this post, we will use a multilayer neural network in the machine learning workflow to classify flower species with sklearn and other Python libraries.

Topics: #machine learning workflow, #supervised classification model, #feedforward neural networks, #perceptron, #python, #linear discrimination analysis, # data scaling & encoding, #iris

The post contains only the basic part of the code. For the full one together with many comments, please see here.

The machine learning workflow consists of 8 steps, of which the first 3 are more theory-oriented:

  1. Formulate the problem
  2. Describe the dataset
  3. Select the proper processing techniques, algorithm and model
  4. Build the model
  5. Train the model
  6. Test the model
  7. Bias vs Variance Trade Off
  8. Deploy the model to solve the real-world problem

1. Formulate the problem:

“Process and transform the iris flowers dataset to create a prediction model. This model must predict in which of the three specific flower species each flower is likely to belong, with 95% or greater accuracy.”

Accuracy is defined as the fraction of predictions the model gets right (freedom from error)

2. Data description:

Iris is a genus of flowering plants. It takes its name from the Greek goddess of the rainbow.

This dataset is perhaps best known from its appearance in the pattern recognition literature, specifically in linear discriminant analysis (LDA). LDA is used as a dimensionality reduction technique or for finding a linear combination of features that separates the data points into different regions, one per class. A perceptron finds such a linear discriminant function (an approximating function).

Discriminant function h(x): we use training data to learn a function h(x) that maps inputs x directly onto a class label y

Supervised Prediction model on Iris dataset

The Iris dataset has three classes where one class is linearly separable from the other 2; the latter two are not linearly separable from each other. Each class refers to a type of iris plant and contains 50 instances.

Iris: The flower of the rainbow

The command below prints a table with statistics for each numerical column in our dataset. We need to see how representative the dataset is and what kind of preprocessing techniques it may need.
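A minimal sketch of that step, assuming the Iris data has been loaded into a pandas DataFrame called dataset (the file name and variable name are illustrative, not from the original notebook):

    import pandas as pd

    dataset = pd.read_csv('iris.csv')

    # count, mean, std, min, quartiles and max for each numerical column
    print(dataset.describe())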

The next step is to visualize the dataset to capture the relationships between the features and the associated classes. Since we have more than two features with relationships between them (i.e., multivariate data), we use bivariate visualisations for every pair of features to capture all of their relationships.
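One possible way to produce these pairwise plots (a sketch, assuming the same dataset DataFrame with a species column) is seaborn's pairplot:

    import seaborn as sns
    import matplotlib.pyplot as plt

    # one scatter plot for every pair of features, coloured by class
    sns.pairplot(dataset, hue='species')
    plt.show()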

There is an obvious separation into 2 (rather than 3) groups: 1. Iris setosa, and 2. Iris virginica together with Iris versicolor.

We can see that the dataset is highly structured, with linear relationships among the features. In a three-dimensional plot, the separation among the classes would be even clearer.

That finding is important for the algorithm selection part because it tells us that a simpler model such as an SVM, a Random Forest, or even logistic regression could achieve the same results as a neural net. Additionally, the dataset is so small that there is no need for a complex model like a neural net. Neural nets tend to work well on non-linear functions with many data points and unstructured data.

3. Selecting the proper preprocessing techniques, algorithm, and model

Preprocessing:

Before diving into preprocessing we have to split our data into training (80%) and test data (20%). We will use the training data (which contains the classes of our iris species) to learn the model, and the test data (whose classes are withheld at prediction time) to measure the accuracy of our prediction model.
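A sketch of the split, assuming the feature matrix X and label vector y are taken from the dataset DataFrame (the species column name is illustrative):

    from sklearn.model_selection import train_test_split

    X = dataset.drop('species', axis=1).values   # the four measurements
    y = dataset['species'].values                # the class labels

    # 80% training data, 20% test data
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.2, random_state=42)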

We will encode categorical variables with LabelEncoder() since a prediction model cannot work with categorical variables.

LabelEncoder introduces a new problem (noise) into our dataset: it adds numerical relationships to labels that have now become ordinal variables. That means the model thinks that Iris-versicolor (1) is higher than Iris-setosa (0) and that Iris-setosa (0) is smaller than Iris-virginica (2). To solve that, we would use a three-dimensional vector rather than a one-dimensional vector with 3 values ([0, 1, 2]). Thus, setosa would be [1,0,0], versicolor would be [0,1,0] and virginica would be [0,0,1].
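A hedged sketch of both encodings, reusing the names from the split above (the one-hot step is mainly illustrative here, since sklearn's MLPClassifier can also work with the integer labels directly):

    from sklearn.preprocessing import LabelEncoder, label_binarize

    le = LabelEncoder()
    y_train_enc = le.fit_transform(y_train)   # e.g. 'Iris-setosa' -> 0
    y_test_enc = le.transform(y_test)

    # one-hot version: 0 -> [1,0,0], 1 -> [0,1,0], 2 -> [0,0,1]
    y_train_onehot = label_binarize(y_train_enc, classes=[0, 1, 2])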

The second preprocessing technique is to scale our data with StandardScaler() as MLP (and gradient descent) is sensitive to un-normalized features. This helps us to speed up our optimization algorithm (gradient descent) and obtain a more accurate classifier.
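A minimal sketch of the scaling step (fit the scaler on the training features only, then reuse the same transformation for the test features):

    from sklearn.preprocessing import StandardScaler

    scaler = StandardScaler()
    X_train_scaled = scaler.fit_transform(X_train)  # mean 0, std 1 per feature
    X_test_scaled = scaler.transform(X_test)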

We can use a plot to see the effect of feature scaling on our data values. It rescales the features so that they have a mean of zero and a standard deviation of one.

The features’ values are centered and scaled (μ=0, σ=1)

Model & Algorithm

What is a Multilayer Perceptron?

The short answer: a supervised feedforward classifier or regressor. The idea is to find a good set of weights and biases.

Classifier -the learning task-: correlates features of the data with class properties to group data points. For example, some points will be labelled as 1 for, say, the setosa class and as 0 for the other two classes, and we don’t know in advance whether that classification is correct for all the data points. Classification is different from clustering, which is an unsupervised type of learning. For further info on the different machine learning types, models, and outputs, please see this post.

Perceptron -the learning algorithm-: the most basic form of a neural network, a simple supervised linear feedforward classifier. The perceptron is a binary threshold neuron, meaning the neuron is activated by a function that returns boolean values (1 = a pattern is discovered, namely a feature combination that leads to this output), since we have a binary classification task. In this way, patterns are detected from the input-output examples. The weights are updated based on these boolean values, so only the input can be used for weight correction (reinforcement learning).

Feedforward pass = the sum of calculations that occur when the input passes through the neural network.

Single Layer Perceptron with a binary-threshold activation function (image by Nahua Kang). Note that each neuron is a matrix. We can show mathematically that a certain neural network architecture — consisting of the input layer, an output layer and a single logistic/sigmoid neuron, trained with a certain loss — cross entropy — coincides exactly with logistic regression at the optimal parameters.

Reinforcement: The perceptron is an example of a reinforcement learning algorithm (which is different from the reinforcement learning approach, where there are no target outputs to train the model with). After each presentation of an input-output example, we only know whether the network produces the expected output or not. The alternative would be learning with error correction, where weight updates are affected not only by the input but also by the magnitude of the error.

Types of learning by: ‘Turing’s Connectionism: An Investigation of Neural Network Architectures’. Supervised learning is different from reinforcement learning on the grounds that the former needs labelled examples to learn the input-output mapping.

Multilayer Perceptron (MLP): abandons the perceptron rule and uses the delta rule, a.k.a. backpropagation, which is more robust and capable of learning weights across layers. The delta rule is applied to the function known as the error, cost, or loss in order to see whether a data point was correctly classified. It’s the “hello world” of deep learning models.

Multilayer -a modeling approach-: the network consists of more than one layer of nodes; that means there is at least one hidden layer, with more than one neuron, that passes the inputs to the output layer (the simplest model). The more layers you add, the more complex the model becomes, meaning it can become more accurate. But there is a price to pay: you need more data points and have to be aware of the high chance of overfitting.

This is achieved through training via backpropagation, which is the reverse of a forward pass. The whole point is to measure the error the network makes when classifying the data points and then modify the weights so that this error becomes very small.

Alternative to MLP: probabilistic neural network based on the Bayesian approach. It supports statistical inference (discover distributions/properties to draw conclusions with) apart from parameters estimation (find an estimate for an unknown value).

Traditional NN acts deterministically: a single set of fixed weights; whereas Bayesian NN acts probabilistically: probability distribution over weights

MLP with n features, k neurons, and one hidden layer. The inputs are the n feature values, one stored in each neuron of the input layer. Each neuron in the hidden layer transforms the values from the previous layer with a weighted linear summation, followed by a non-linear activation function (we use ReLU in our example). The output layer receives the values from the last hidden layer and transforms them into output values (scored with categorical cross entropy here for the iris dataset). In our example, we use stochastic gradient descent to find the optimal weight values while sending the error backwards through the weights (backpropagation).

4. Build: MLP in our species prediction

In the building step of a supervised classifier, we have to decide about the training set, the data preparation, and the model architecture. The model architecture is about the number of nodes in each layer, the number of layers, and the way the nodes are connected together (feedforward with no loops between the units, the optimizer, regularization, etc.).

Hyperparameters such as the number of hidden nodes and layers define the structure of the network. The price we pay for automatic feature engineering (as done by the neural net) is making all these crucial decisions about the architecture of the neural net.

The number of nodes in the input and output layers is easy to determine. In our example, we have 4 features as input units and 3 classes as output units. The size of the hidden layer should be around the mean of these two, so roughly 4 should be adequate. We instead add 10 nodes in the hidden layer, hidden_layer_sizes=(10,), making the model more complex. We shouldn’t need more than one layer, as we have a really small dataset (150 rows). Still, determining the number of hidden nodes is not a straightforward decision. The optimal size of the hidden layer is usually between the size of the input and the size of the output layers. There are some rules of thumb, but they fall outside the scope of this post.

Stochastic Gradient Descent (SGD) optimizer/solver: updates the weight values in order to minimize the loss function in batches (hence “stochastic”). As the name says, SGD uses the gradient of the loss function (the gradient is the vector of partial derivatives of a function of several variables). Use the parameter solver='sgd'. The Adam solver tends to work better when large data sets are available.

learning_rate_init=0.01: controls how much to update the weights in the right direction given the residual error (true output minus predicted output). The idea is to keep it close to a value proportional to 1/n, where n is the number of features. Its value shouldn’t be too large (quick, but it fails to converge to the optimal values) nor too small (accurate, but slow learning/convergence).

Learning rate: how fast the steps (red marbles) move until convergence over 50 iterations

max_iter=500: maximum number of epochs (how many times each data point will be used before the solver converges). Use a higher number to check whether the model performs better (given that you have a large data set).
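Putting these choices together, one plausible way to build the network (the exact arguments used in the original notebook may differ) is:

    from sklearn.neural_network import MLPClassifier

    mlp = MLPClassifier(hidden_layer_sizes=(10,),   # one hidden layer, 10 neurons
                        activation='relu',
                        solver='sgd',
                        learning_rate_init=0.01,
                        max_iter=500,
                        alpha=0.0001,               # L2 penalty (weight decay)
                        random_state=42)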

5: Train the model

Training = Minimize the loss function (categorical cross entropy here, because of the nature of the output)

Solver = Gradient Descent = Find the values of the weights that minimize the loss function (backpropagation here, explained in part 3).
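With the classifier built above, training is a single call (a sketch; it runs the forward passes, loss computation, and backpropagation updates until convergence or max_iter):

    mlp.fit(X_train_scaled, y_train_enc)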

Hyperparameters for MLP training as taken from sklearn

** Some useful terminology for understanding the parameters

Multi-class classifier: Classify instances into one of 3 or more classes.

Multinomial categorical: Distribution of the output, the target data

ReLU: The units in the hidden layer apply the activation function to introduce non-linearity into the network. ReLU keeps a positive weighted sum unchanged (a feature combination that contributes to the output) and sets a negative one to zero.

Neurons: entries in a matrix, and as such, their number is necessary for specifying the matrices

Categorical Cross Entropy: The loss function we optimize (it measures how good the parameters are); it takes a vector of probability estimates per sample.

1 Epoch = 1 pass over the entire dataset

Batch size = the number of training examples in one forward/backward pass. The higher the batch size, the more memory space you’ll need.

Regularization L2/Ridge: It helps avoid overfitting (i.e., it helps the model generalise to unseen examples, such as the test data set) by shrinking the values of the weights (also known as weight decay or penalty). L2 is the sum of the squares of the weights; the chosen alpha is the one at which our model reaches the minimum of the cross-validation error.

Comparison of the 3 different types of regularisation for automatic feature selection: Lasso/L1, Ridge/L2 and the Elastic Net/L1+L2. Each sub-plot is a 2-d contour plot of 2 weights on the two axes. The contour plot shows the cost function for this pair of weights. The minimum of the cost is found at the center of the ellipses, and as such these are the optimal values for the weights. All data points on the same ellipse have the same error. Which of the three penalties do you think can set a certain feature (or features) to a zero contribution and thus lead to a sparse solution? Image Citation: Zou, H., & Hastie, T. (2005) & From Linear Regression to Ridge Regression, the Lasso, and the Elastic Net.
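As a rough sketch of picking alpha by cross-validation (the grid values below are illustrative, not the ones used in the original post):

    from sklearn.model_selection import GridSearchCV

    param_grid = {'alpha': [1e-5, 1e-4, 1e-3, 1e-2, 1e-1]}
    search = GridSearchCV(
        MLPClassifier(hidden_layer_sizes=(10,), solver='sgd',
                      learning_rate_init=0.01, max_iter=500, random_state=42),
        param_grid, cv=5)
    search.fit(X_train_scaled, y_train_enc)
    print(search.best_params_)   # the alpha with the best cross-validation score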

My Intuition behind MLP:

  1. First synapses: Given a random set of weights, feed the inputs forward: 10 weighted sums (z1 = w1*x1 + w2*x2 + w3*x3 + w4*x4 + b1), i.e., 10 combinations of the input variables.
  2. Hidden Layer Nodes: The z values are passed into ReLU, which creates a non-linear mapping (this helps in solving the vanishing gradient) between z and the response variable. Feed the result forward.
  3. Second synapses*: The non-linear z values pass through a classifier like softmax, which gives us a class.
  4. Output Layer Node: We only know whether the network produces the expected output or not. Training happens based on backpropagation: compare the output produced by the assigned weights with the actual value.
  5. Second synapses: Propagate the difference (error) backward. SGD uses the error to update the weights.
  6. Hidden Layer Nodes: Compute the error w.r.t. the input layer.
  7. First synapses: SGD uses the error to update the weights.
  8. Rerun. After several training epochs, when the error between the actual output and the computed output is less than a previously specified value, the NN is considered trained. Then we replace the random weights, with the trained ones.

*Each synapse (neuron connection) has an associated weight that represents the influence of one neuron on the other.

For a pseudocode of MLP, please see here.
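As a rough NumPy illustration of steps 1–8 above (a didactic sketch, not the sklearn internals): one hidden layer with ReLU, a softmax output, and gradient-descent updates, reusing the X_train_scaled and y_train_onehot names from the earlier snippets.

    import numpy as np

    rng = np.random.default_rng(0)
    n_features, n_hidden, n_classes = 4, 10, 3

    # step 1: random initial weights and biases (the "first synapses")
    W1 = rng.normal(scale=0.1, size=(n_features, n_hidden))
    b1 = np.zeros(n_hidden)
    W2 = rng.normal(scale=0.1, size=(n_hidden, n_classes))
    b2 = np.zeros(n_classes)

    def forward(X):
        z1 = X @ W1 + b1                    # weighted sums into the hidden layer
        a1 = np.maximum(0, z1)              # step 2: ReLU activation
        z2 = a1 @ W2 + b2                   # step 3: second synapses
        e = np.exp(z2 - z2.max(axis=1, keepdims=True))
        probs = e / e.sum(axis=1, keepdims=True)   # softmax class probabilities
        return z1, a1, probs

    def backward_step(X, y_onehot, lr=0.01):
        """Steps 4-7: compare with the target, propagate the error, update."""
        global W1, b1, W2, b2
        z1, a1, probs = forward(X)
        n = X.shape[0]
        d_z2 = (probs - y_onehot) / n       # output error (softmax + cross entropy)
        d_W2 = a1.T @ d_z2
        d_a1 = d_z2 @ W2.T                  # step 6: error pushed back to the hidden layer
        d_z1 = d_a1 * (z1 > 0)              # gradient through ReLU
        d_W1 = X.T @ d_z1
        W2 -= lr * d_W2; b2 -= lr * d_z2.sum(axis=0)
        W1 -= lr * d_W1; b1 -= lr * d_z1.sum(axis=0)

    # step 8: repeat over many epochs until the error is small enough
    for epoch in range(500):
        backward_step(X_train_scaled, y_train_onehot)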

6: Test the model

Use the appropriate metrics to evaluate your model & prevent Accuracy Paradox

It is time to see our model’s accuracy, which is 0.966666666667. Pretty good, right? The next step is combating overfitting with cross-validation.
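A sketch of the evaluation, reusing the names from the earlier snippets: accuracy on the held-out test set, followed by 5-fold cross-validation as a first guard against overfitting.

    from sklearn.metrics import accuracy_score
    from sklearn.model_selection import cross_val_score

    y_pred = mlp.predict(X_test_scaled)
    print(accuracy_score(y_test_enc, y_pred))   # fraction of test flowers classified correctly

    scores = cross_val_score(mlp, X_train_scaled, y_train_enc, cv=5)
    print(scores.mean(), scores.std())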

We can then use a learning curve, which reveals model overfitting or under-fitting (if any).

Until then, Hap-py-coding ⇧⏎

Other Articles you may enjoy:

  1. Learning cost in machines and humans
  2. Learning = Representation + Evaluation + Optimization
  3. Descriptive Analysis

Feel free to respond, ask a question, express a request or correct a mistake you may find.
