## Part 3 – A simple neural network with TensorFlow

# First steps with TensorFlow – Part 3

In the third part of the series *“First steps with TensorFlow”* I will show how to build a very simple neural network.

The goal is the same as in First steps with TensorFlow – Part 2, i.e. we want to classify the flowers in the iris dataset by class.

## A *quick* review of neural networks

A full discussion of what neural networks (NN) are and how they work is well beyond the scope of this blog post. Nonetheless, I will review the topics necessary to follow this post.

Neural networks are designed to mimic the neural connections in a brain. A neural network consists of a number of layers, and each layer consists of a number of units (or neurons). The task of every neuron is to process the information it receives and then transmit it to the neurons in the next layer.

The most frequent question that comes up when implementing a NN is how to choose the number of hidden layers and hidden neurons. There is no precise rule: in general the number of hidden layers depends strongly on the problem, unlike the sizes of the input and output layers, which are fixed by the data. In particular:

- *Input layer*: The input layer is where the network starts. The number of units in this layer is fixed and corresponds exactly to the number of input features.
- *Hidden layers*: The number of hidden layers and the number of units per layer are the *free* parameters that one has to fix. There is no rule for choosing these two parameters; they depend strongly on the problem.
- *Output layer*: The output layer is where the network ends and the predictions are given. In the case of a classification problem, like the one we are facing, the number of units in the output layer corresponds exactly to the number of classes.
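As a small illustration of how these layer sizes fit together, the sketch below counts the weights and biases of a fully connected network. It assumes the iris setting (4 features, 3 classes) and the hidden sizes 10, 20 and 10 used later in this post; the helper `count_parameters` is a hypothetical name introduced only for this example.

```python
# Input and output sizes are fixed by the data: 4 features, 3 classes.
n_x, n_y = 4, 3
# The hidden sizes are the free parameters (values taken from later in this post).
hidden = [10, 20, 10]
layer_sizes = [n_x] + hidden + [n_y]


def count_parameters(sizes):
    """Count weights and biases of a fully connected network (hypothetical helper)."""
    total = 0
    for n_in, n_out in zip(sizes[:-1], sizes[1:]):
        total += n_in * n_out + n_out  # one weight matrix plus one bias vector
    return total


n_params = count_parameters(layer_sizes)
print(layer_sizes, n_params)
```

Changing the entries of `hidden` shows how quickly the number of trainable parameters grows with wider or deeper networks.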

## Training of a neural network

Once the architecture of a NN has been set, i.e. the number and type of hidden layers have been defined, we can proceed to the training of the neural network.

### Feedforward

The input features are fed to the input layer. From there the information passes through all the hidden layers until it reaches the output layer, where the predictions are produced. Given a layer \(i\) and its values \(x_i\), we can write the values \(h_j\) of the next layer \(j\) as follows:

\( h_j = f(W_{j,i}x_i + b_{j,i}) \)

where \(f\) is the activation function, \(W_{j,i}\) is the weight matrix and \(b_{j,i}\) is the bias. The activation function *activates* a neuron and can take several forms, but in most cases it is either a Rectified Linear Unit (ReLU) or a logistic function.

Hence, between two adjacent layers there is always a weight matrix which is responsible for *transmitting* the information. An N-layer NN therefore has N-1 weight matrices, and the values of the *j-th* layer are a function of all the weight matrices in the previous layers.
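The feedforward step can be sketched in a few lines of NumPy. This is only an illustration of the formula \( h_j = f(W_{j,i}x_i + b_{j,i}) \), not the TensorFlow implementation: the layer sizes, the random weights, the input vector and the function name `feedforward` are all assumptions made for the example, and ReLU is applied to every layer for simplicity.

```python
import numpy as np


def relu(z):
    """Rectified Linear Unit: max(z, 0) element-wise."""
    return np.maximum(z, 0.0)


def feedforward(x, weights, biases, activation=relu):
    """Apply h_j = f(W x + b) layer by layer and return the values of every layer."""
    values = [x]
    for W, b in zip(weights, biases):
        values.append(activation(W @ values[-1] + b))
    return values


rng = np.random.default_rng(0)
sizes = [4, 10, 3]  # hypothetical: 4 inputs, one hidden layer of 10 units, 3 outputs
weights = [rng.normal(size=(n_out, n_in)) for n_in, n_out in zip(sizes[:-1], sizes[1:])]
biases = [np.zeros(n_out) for n_out in sizes[1:]]

x = np.array([5.1, 3.5, 1.4, 0.2])  # one example with 4 features
layers = feedforward(x, weights, biases)
print([h.shape for h in layers])
```

Note that a network with three layers (`sizes` has three entries) ends up with two weight matrices, in line with the N-1 count above.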

### Cost function

The output layer delivers a prediction, which must be compared to the true values. This comparison, expressed by the cost function, is an estimate of the error. Of course one wants to minimize the error, hence to optimize the parameters of the problem, i.e. the weight matrices. One of the best-known optimization algorithms is gradient descent.
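To make the idea of minimizing a cost function concrete, here is a minimal NumPy sketch of gradient descent. It assumes a softmax cross-entropy cost and a toy classifier with a single weight matrix and no hidden layers; the text above does not prescribe a particular cost function, and the data are random values generated only for illustration.

```python
import numpy as np


def softmax(z):
    """Row-wise softmax, shifted for numerical stability."""
    e = np.exp(z - z.max(axis=1, keepdims=True))
    return e / e.sum(axis=1, keepdims=True)


def cross_entropy(probs, y_onehot):
    """Mean cross-entropy between predicted probabilities and one-hot labels."""
    return -np.mean(np.sum(y_onehot * np.log(probs + 1e-12), axis=1))


# Random toy data (assumption): 30 examples, 4 features, 3 classes.
rng = np.random.default_rng(0)
X = rng.normal(size=(30, 4))
labels = rng.integers(0, 3, size=30)
Y = np.eye(3)[labels]  # one-hot encoding

W = np.zeros((4, 3))
learning_rate = 0.1
losses = []
for step in range(100):
    probs = softmax(X @ W)
    losses.append(cross_entropy(probs, Y))
    grad = X.T @ (probs - Y) / len(X)  # gradient of the mean cross-entropy w.r.t. W
    W -= learning_rate * grad          # gradient descent update

print(losses[0], losses[-1])
```

Each iteration moves the weights a small step against the gradient, so the recorded cost decreases from its starting value.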

### Backpropagation

The cost function described above estimates the error of the output layer only, but the parameters to be optimized are all the N-1 weight matrices. Therefore, the optimization of a NN relies not only on the optimization algorithm, but also on the so-called **backpropagation**.

Backpropagation consists of propagating the error backwards from the output layer to the input layer in order to obtain an estimate of the error for every layer. The error estimates obtained in this way make it possible to optimize each weight matrix individually, so that the NN *learns* (or converges) more quickly.

From the mathematical point of view, this is a consequence of the dependence of each layer's values on all the weight matrices in the previous layers: the estimate of the error of a layer is obtained by simply applying the chain rule for derivatives.
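The chain rule argument can be checked numerically on a tiny network. The sketch below is an illustration under stated assumptions, not the procedure TensorFlow uses internally: it takes one hidden layer with a logistic activation, a squared-error cost, and random weights, and compares the backpropagated gradient of one weight with a finite-difference estimate.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=4)                 # one input example (random, for illustration)
y = np.array([0.0, 1.0, 0.0])          # its one-hot target
W1, b1 = rng.normal(size=(5, 4)), np.zeros(5)
W2, b2 = rng.normal(size=(3, 5)), np.zeros(3)


def sigmoid(z):
    """Logistic activation function."""
    return 1.0 / (1.0 + np.exp(-z))


def loss_fn(W1, W2):
    """Squared-error cost of the two-layer network (assumed cost function)."""
    h = sigmoid(W1 @ x + b1)
    out = W2 @ h + b2
    return 0.5 * np.sum((out - y) ** 2)


# Forward pass
h = sigmoid(W1 @ x + b1)
out = W2 @ h + b2

# Backward pass: propagate the error from the output towards the input
delta_out = out - y                          # error at the output layer
grad_W2 = np.outer(delta_out, h)             # gradient for the last weight matrix
delta_h = (W2.T @ delta_out) * h * (1 - h)   # chain rule through W2 and the sigmoid
grad_W1 = np.outer(delta_h, x)               # gradient for the first weight matrix

# Finite-difference check of a single entry of W1
eps = 1e-6
W1_plus, W1_minus = W1.copy(), W1.copy()
W1_plus[0, 0] += eps
W1_minus[0, 0] -= eps
numeric = (loss_fn(W1_plus, W2) - loss_fn(W1_minus, W2)) / (2 * eps)
print(grad_W1[0, 0], numeric)
```

The two printed numbers should agree closely, which shows that the error of the hidden layer really is obtained from the output error via the chain rule.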

### Importing the data

First of all, we start by importing the packages we need and the data. I will not describe the data import in detail here, because it has already been described in First steps with TensorFlow – Part 2, but I include the code for completeness.

The iris dataset consists of 150 examples divided into 3 iris classes, with 50 examples per class.

```python
# Import packages
import tensorflow as tf
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn import preprocessing

# Import data
data = pd.read_csv(
    "https://archive.ics.uci.edu/ml/machine-learning-databases/iris/iris.data",
    sep=",",
    names=["sepal_length", "sepal_width", "petal_length", "petal_width", "iris_class"],
)

# Shuffle data
data = data.sample(frac=1).reset_index(drop=True)

# Then split `x`, whose columns are normalized to 1, and `y`, one-hot encoded
all_x = data[["sepal_length", "sepal_width", "petal_length", "petal_width"]]
min_max_scaler = preprocessing.MinMaxScaler()
all_x = min_max_scaler.fit_transform(all_x)
all_y = pd.get_dummies(data.iris_class)

# ... and split training and test set
train_x, test_x, train_y, test_y = train_test_split(all_x, all_y, test_size=1 / 3)

# Check the dimensions
print(train_x.shape)
print(train_y.shape)
print(test_x.shape)
print(test_y.shape)

# and define number of features, n_x, and number of classes, n_y
n_x = np.shape(train_x)[1]
n_y = np.shape(train_y)[1]
```

```
(100, 4)
(100, 3)
(50, 4)
(50, 3)
```
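For readers who want to see what the scaling and one-hot encoding steps do to the data, the following NumPy-only sketch reproduces both transformations by hand on a few made-up values. It is meant as an illustration of the idea behind `MinMaxScaler` (with its default range of 0 to 1) and `pd.get_dummies`, not as a replacement for them; the toy feature matrix is an assumption.

```python
import numpy as np

# Toy feature matrix (made-up values): 3 examples, 2 features
X = np.array([[5.0, 1.0],
              [6.0, 3.0],
              [7.0, 2.0]])

# Min-max scaling: every column is mapped linearly onto the interval [0, 1]
X_scaled = (X - X.min(axis=0)) / (X.max(axis=0) - X.min(axis=0))

# One-hot encoding: one column per class, with a 1 marking the class of each example
labels = np.array(["Iris-setosa", "Iris-virginica", "Iris-setosa"])
classes, idx = np.unique(labels, return_inverse=True)
Y = np.eye(len(classes))[idx]

print(X_scaled)
print(classes)
print(Y)
```

After scaling, all features live on the same range, and after encoding each row of `Y` has exactly one non-zero entry, which matches the one-unit-per-class layout of the output layer.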

### Definition of a NN

First of all we define the placeholders for `x` and `y`, the learning rate, and we start defining a graph, as we did for the logistic regression.

For an introduction to graphs and sessions in TensorFlow, see First steps with TensorFlow – Part 1.

```python
# Reset graph
tf.reset_default_graph()

# Define learning rate
learning_rate = 0.01

# Start graph definition...
g = tf.Graph()

# ... and placeholders
with g.as_default():
    x = tf.placeholder(tf.float32, [None, n_x], name="x")
    y = tf.placeholder(tf.float32, [None, n_y], name="y")
```

Within the graph we define the NN. For this purpose we use the package `tf.contrib.layers`. In this package one can find several layer types (e.g. fully connected, convolutional, flatten, …).

The NN we are defining will be composed only of fully connected layers.

*Input layer*

We start by defining the input layer. The input layer has `n_x = 4` units and the number of output units is set to 10, i.e. the number of units of the first hidden layer.

```python
# Define the number of neurons for each hidden layer:
h1 = 10
h2 = 20
h3 = 10

# From input to 1st hidden layer
with g.as_default():
    fully_connected1 = tf.contrib.layers.fully_connected(
        inputs=x, num_outputs=h1, activation_fn=tf.nn.relu, scope="Fully_Conn1"
    )
```

*Hidden layers*

In this case the NN has three hidden layers with 10, 20, and 10 units respectively. Every hidden layer takes as input the output of the previous layer.

```python
# From 1st to 3rd hidden layer, through the 2nd
with g.as_default():
    fully_connected2 = tf.contrib.layers.fully_connected(
        inputs=fully_connected1, num_outputs=h2, activation_fn=tf.nn.relu, scope="Fully_Conn2"
    )
    fully_connected3 = tf.contrib.layers.fully_connected(
        inputs=fully_connected2, num_outputs=h3, activation_fn=tf.nn.relu, scope="Fully_Conn3"
    )
```
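To see how the shapes change as a batch of examples travels through the stacked layers, here is a NumPy sketch that mimics the chain of fully connected ReLU layers with sizes 4, 10, 20 and 10. It is a stand-in written for illustration, assuming that each layer computes `relu(inputs @ W + b)` with a weight matrix of shape `(n_in, n_out)`; the batch of 5 random examples and the random weights are assumptions, and no TensorFlow code is involved.

```python
import numpy as np

rng = np.random.default_rng(0)
batch = rng.random(size=(5, 4))   # 5 hypothetical examples with 4 features each
sizes = [4, 10, 20, 10]           # input, 1st, 2nd and 3rd hidden layer

activations = batch
shapes = [activations.shape]
for n_in, n_out in zip(sizes[:-1], sizes[1:]):
    W = rng.normal(scale=0.1, size=(n_in, n_out))        # weights of this layer
    b = np.zeros(n_out)                                  # biases of this layer
    activations = np.maximum(activations @ W + b, 0.0)   # ReLU(inputs @ W + b)
    shapes.append(activations.shape)

print(shapes)
```

The first dimension (the batch size) stays untouched, which is why the placeholders can leave it as `None`, while the second dimension follows the number of units of each layer.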

*Output layer*

Finally, the output layer takes as input the output of the third hidden layer and makes the prediction. Therefore this layer has `n_y = 3` units.

# From 3rd hidden layer to output with g.as_default(): prediction = tf.contrib.layers.fully_connected(inputs=fully_connected3, num_