First steps with TensorFlow – Part 2

Part 2 – Logistic Regression

If you have had some exposure to classical statistical modelling and wonder what neural networks are about, then multinomial logistic regression is the perfect starting point: It is a well-known statistical classification method and can, without any modifications, be interpreted as a neural network.

Logistic regression recap

The task of logistic regression is to predict a categorical variable from a set of continuous predictor variables. The idea is to use “something like” linear regression: in several steps, logistic regression transforms the classification task into “something like” a regression problem. It turns out that these steps are very similar to what you do when you build a neural network.

So, assume we want to predict one of \(n\) categories from \(m\) continuous input variables.

Step 1: Make the predicted variable continuous

Instead of predicting one of the \(n\) categories we try to predict an \(n\)-dimensional vector \(p = (p_1, …, p_n)^\top\) which contains the probabilities of each of the categories for a given set of inputs. Now, at least technically, we can predict the outcome by a linear model

\(p \stackrel{?}{=} w \cdot x + b\)

where \(x\) is the \(m\)-dimensional vector of input variables, \(w\) is an \(n \times m\) matrix consisting of the row vectors \(w_i\), and \(b\) is a vector of size \(n\). If we were talking about a neural network, \(w\) would be called the weight matrix and the elements of \(b\) would be called the biases of the model.

N.B.: All vectors are assumed to be column vectors.
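
To make the dimensions concrete, here is a minimal NumPy sketch of this tentative linear model; the sizes \(n = 3\) and \(m = 4\) and the random values are just example choices for illustration.

import numpy as np

n, m = 3, 4                        # n categories, m input variables
rng = np.random.default_rng(0)

w = rng.normal(size=(n, m))        # weight matrix with row vectors w_i
b = np.zeros(n)                    # biases
x = rng.normal(size=m)             # one input vector

linear_output = w @ x + b          # shape (n,), but not yet a valid probability vector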

Step 2: Make the predicted values positive

In order to make \(w \cdot x + b\) a sensible predictor of \(p\) we first want to make sure that all predicted probabilities are positive. To that end we transform each component of the model output using the exponential function

\(p_i \stackrel{?}{=} e^{w_i \cdot x + b_i}\)

Step 3: Make sure that the predicted values sum up to one

And, of course, we want the probabilities to sum up to one. Therefore we normalize the predictions from the previous, tentative, equation and get

\(p_i = \frac{e^{w_i \cdot x + b_i}}{\sum\limits_{j} e^{w_j \cdot x + b_j}}\)

This is the model for multinomial logistic regression. In the context of neural networks it is common to rewrite this expression in terms of the softmax function, which is defined by

\(\mathit{softmax}(z)_i = \frac{e^{z_i}}{\sum\limits_{j} e^{z_j}}\)

resulting in

\(p = \mathit{softmax}(w x + b)\)
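
A minimal NumPy sketch (with a made-up example vector \(z\)) shows that the softmax function indeed turns an arbitrary vector into positive values that sum up to one:

import numpy as np

def softmax(z):
    # subtracting the maximum before exponentiating improves numerical
    # stability and does not change the result
    e = np.exp(z - np.max(z))
    return e / e.sum()

z = np.array([2.0, 1.0, 0.1])      # example value of w x + b
p = softmax(z)
print(p, p.sum())                  # positive entries that sum up to 1.0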

Step 4: Estimate the parameters

Assume that our training data consist of a set of input vectors \(x^{(k)}\) with associated labels (true categories) \(y^{(k)}\). The labels are assumed to be one-hot encoded, i.e. each \(y^{(k)}\) is a vector of length \(n\) which has the value \(1\) at the position of the true category and the value \(0\) otherwise.
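
For example, with \(n = 3\) categories a sample belonging to the second category gets the label \(y^{(k)} = (0, 1, 0)^\top\). A minimal NumPy sketch of this encoding (the iris example below uses pd.get_dummies() instead):

import numpy as np

labels = np.array([0, 2, 1, 1])    # true categories of four samples, n = 3
one_hot = np.eye(3)[labels]        # each row contains a single 1 at the position of the label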

As usual we will determine the model parameters \(w_{i, j}\) and \(b_j\) using maximum likelihood estimation. More specifically, we will try to maximize the (log-)likelihood that the model, specified by the parameters \(w\) and \(b\), assigns to the true categories \(y^{(k)}\) given the values \(x^{(k)}\) of the predictor variables. Assume that the true category for the \(k\)-th sample from the training set is \(i\). Then the log likelihood of the \(k\)-th sample is

\( \log \mathcal{L}(w, b \mid x^{(k)}, y^{(k)}) = \log p(x^{(k)})_i\)

which can be re-written as

\( \log \mathcal{L}(w, b \mid x^{(k)}, y^{(k)}) = \sum\limits_{j} y^{(k)}_j \cdot \log p(x^{(k)})_j \)

(Recall that \(y^{(k)}\) is one-hot encoded, i.e. that \(y^{(k)}_i=1\) and that \(y^{(k)}_j=0\) for \(j \neq i\).)

In the context of neural networks it is common to rewrite this expression in terms of the categorical cross-entropy function

\( H(y, p) = -\sum\limits_{i} y_i \cdot \log p_i \)

resulting in

\( \log \mathcal{L}(w, b \mid x^{(k)}, y^{(k)}) = -H(y^{(k)}, p(x^{(k)})) =-H(y^{(k)}, \mathit{softmax}(w x^{(k)} + b))\)
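
As a quick numerical check, here is a minimal NumPy sketch of the per-sample cross-entropy for a made-up model output and label; for a one-hot \(y\) the sum collapses to the negative log of the probability assigned to the true category:

import numpy as np

def softmax(z):
    e = np.exp(z - np.max(z))
    return e / e.sum()

def cross_entropy(y, p):
    # H(y, p) = -sum_i y_i * log(p_i)
    return -np.sum(y * np.log(p))

z = np.array([2.0, 1.0, 0.1])      # example value of w x^(k) + b
y = np.array([1.0, 0.0, 0.0])      # one-hot label: the true category is the first one

neg_log_likelihood = cross_entropy(y, softmax(z))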

This is exactly the definition of a perfectly normal neural network: an input layer, no hidden layers, an output layer with a softmax activation function, and a categorical cross-entropy loss function.

In order to obtain our estimates of \(w\) and \(b\) we maximize the log likelihood of all training samples combined, i.e. we determine

\( \underset{w, b}{\arg\max} \, \sum\limits_{k} -H(y^{(k)}, \mathit{softmax}(w x^{(k)} + b))\)

or, equivalently,

\( \underset{w, b}{\arg\min} \; \underset{k}{\mathrm{avg}} \, H(y^{(k)}, \mathit{softmax}(w x^{(k)} + b))\)
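
Written out in NumPy for a small made-up batch, the quantity to be minimized is simply the average cross-entropy over all samples (the TensorFlow version follows below):

import numpy as np

def softmax_rows(z):
    # row-wise softmax for a whole batch of linear outputs
    e = np.exp(z - z.max(axis=1, keepdims=True))
    return e / e.sum(axis=1, keepdims=True)

x_batch = np.array([[5.8, 2.8, 5.1, 2.4],      # two example samples with m = 4 inputs
                    [6.0, 2.2, 4.0, 1.0]])
y_batch = np.array([[0., 0., 1.],              # made-up one-hot labels, n = 3 categories
                    [0., 1., 0.]])

w = np.zeros((3, 4))                           # parameters to be estimated
b = np.zeros(3)

p = softmax_rows(x_batch @ w.T + b)            # predicted probabilities, one row per sample
loss = -np.mean(np.sum(y_batch * np.log(p), axis=1))   # average cross-entropy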

In neural network terms, estimating \(w\) and \(b\) from all training samples at once corresponds to batch gradient descent (as opposed to stochastic gradient descent and mini-batch gradient descent).

We will show below how batch gradient descent in TensorFlow can be used to estimate a logistic regression model.

N.B.: Specialized logistic regression packages will use algorithms like L-BFGS rather than gradient descent to estimate the model parameters. We use gradient descent because we are mainly interested in the analogy to neural networks.
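
For comparison, here is a sketch of how such a specialized fit might look with scikit-learn's LogisticRegression, using the L-BFGS solver; note that scikit-learn applies L2 regularization by default, so a large value of C is passed to get close to the plain maximum likelihood estimate discussed above (defaults may vary between scikit-learn versions):

from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression

X, y = load_iris(return_X_y=True)

# C is the inverse regularization strength; a large value effectively
# switches the regularization off
model = LogisticRegression(solver="lbfgs", C=1e6, max_iter=1000)
model.fit(X, y)

print(model.coef_)       # estimate of w
print(model.intercept_)  # estimate of b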

N.B.: Because the probabilities are constrained to sum up to \(1\), one of the output dimensions is redundant. Therefore most regression packages reduce the number of dimensions of \(w\) and \(b\) by \(1\) by defining one of the output probabilities as a “pivot” relative to which the others are expressed. We skip that step because we are more interested in preserving the symmetry of the formulation and the analogy to neural networks.

Example: Iris data

We implement the steps outlined above using the classical iris data set. The task is to predict one of three iris species from four measurements on the iris flower, i.e. length and width of sepals and petals.

import tensorflow as tf
import numpy as np
import pandas as pd
import sklearn as skl

We first download a CSV file with the iris data into a pandas data frame

data = pd.read_csv("https://archive.ics.uci.edu/ml/machine-learning-databases/iris/iris.data", sep=",",
                   names=["sepal_length", "sepal_width", "petal_length", "petal_width", "iris_class"])

Because the data set is ordered by iris species, we shuffle it

np.random.seed(0)
data = data.sample(frac=1).reset_index(drop=True)

The downloaded data set now looks like this

data.head()
   sepal_length  sepal_width  petal_length  petal_width       iris_class
0           5.8          2.8           5.1          2.4   Iris-virginica
1           6.0          2.2           4.0          1.0  Iris-versicolor
2           5.5          4.2           1.4          0.2      Iris-setosa
3           7.3          2.9           6.3          1.8   Iris-virginica
4           5.0          3.4           1.5          0.2      Iris-setosa

Then we extract the predictor variables

all_x = data[['sepal_length', 'sepal_width', 'petal_length', 'petal_width']]
all_x.head()
   sepal_length  sepal_width  petal_length  petal_width
0           5.8          2.8           5.1          2.4
1           6.0          2.2           4.0          1.0
2           5.5          4.2           1.4          0.2
3           7.3          2.9           6.3          1.8
4           5.0          3.4           1.5          0.2

and the target variable, which is one-hot encoded using the pandas function pd.get_dummies()