Part 2 – Logistic Regression

First steps with TensorFlow – Part 2

If you have had some exposure to classical statistical modelling and wonder what neural networks are about, then multinomial logistic regression is the perfect starting point: It is a well-known statistical classification method and can, without any modifications, be interpreted as a neural network.

Logistic regression recap

The task of logistic regression is to predict a categorical variable from a set of continuous predictor variables. The idea is to use "something like" linear regression. Logistic regression takes several steps to transform the task into "something like" a regression problem. It turns out that the steps are very similar to what you do when you build a neural network.

So, assume we want to predict one of \(n\) categories from \(m\) continuous input variables.

Step 1: Make the predicted variable continuous

Instead of predicting one of the \(n\) categories we try to predict an \(m\)-dimensional vector \(p = (p_1, …, p_m)^\top\) which contains the probabilities of each of the categories for a given set of inputs. Now, at least technically, we can predict the outcome by a linear model

\(p \stackrel{?}{=} w \cdot x + b\)

where \(x\) is the \(m\)-dimensional vector of input variables, \(w\) is an \(n \times m\) matrix consisting of the row vectors \(w_i\), and \(b\) is a vector of size \(n\). If we were talking about a neural network, \(w\) would be called the weight matrix and the elements of \(b\) would be called the biases of the model.

N.B.: All vectors are assumed to be column vectors.

Step 2: Make the predicted values positive

In order to make \(w \cdot x + b\) a sensible predictor of \(p\) we first want to make sure that all predicted probabilities are positive. To that end we transform each row of the model output using the exponential function

\(p_i \stackrel{?}{=} e^{w_i \cdot x + b_i}\)

Step 3: Make sure that the predicted values sum up to one

And, of course, we want the probabilities to sum up to one. Therefore we normalize the predictions from the previous, tentative, equation and get

\(p_i = \frac{e^{w_i \cdot x + b_i}}{\sum\limits_{j} e^{w_j \cdot x + b_j}}\)

This is the model for multinomial logistic regression. In the context of neural networks it is common to rewrite this expression in terms of the softmax function, which is defined by

\(softmax(z)_i = \frac{e^{z_i}}{\sum\limits_{j} e^{z_j}}\)

resulting in

\(p = softmax(w x + b)\)

Step 4: Estimate the parameters

Assume that our training data consist of a set of input vectors \(x^{(k)}\) with associated labels (true categories) \(y^{(k)}\). The labels are assumed to be one-hot encoded, i.e. each \(y^{(k)}\) is a vector of length \(n\) which has the value \(1\) at the position of the true category and the value \(0\) otherwise.

As usual we will determine the model parameters \(w_{i, j}\) and \(b_j\) using maximum likelihood estimation. More specifically we will try to maximize the (log-)likelihood that the model, which is specified by the parameters \(w\) and \(b\), predicts for the true categories \(y^{(k)}\) given the values \(x^{(k)}\) of the predictor variables. Assume that the true category for the \(k\)-th sample from the training set is \(i\). Then the log likelihood of the \(k\)-th sample is

\( \log \mathcal{L}(w, b \mid x^{(k)}, y^{(k)}) = \log p(x^{(k)})_i\)

which can be re-written as

\( \log \mathcal{L}(w, b \mid x^{(k)}, y^{(k)}) = \sum\limits_{j} y^{(k)}_j \cdot \log p(x^{(k)})_j \)

(Recall that \(y^{(k)}\) is one-hot encoded, i.e. that \(y^{(k)}_i=1\) and that \(y^{(k)}_j=0\) for \(j \neq i\).)

In the context of neural networks it is common to rewrite this expression in terms of the categorical crossentropy function

\( H(y, p) = -\sum\limits_{i} y_i \cdot \log p_i \)

resulting in

\( \log \mathcal{L}(w, b \mid x^{(k)}, y^{(k)}) = -H(y^{(k)}, p(x^{(k)})) =-H(y^{(k)}, \mathit{softmax}(w x^{(k)} + b))\)

This is the definition of a perfectly normal neural network with an input layer, no hidden layers, an output layer that uses the softmax activation function and a categorical cross-entropy loss function.

In order to obtain our estimates of \(w\) and \(b\) we maximize the log likelihood of all training samples combined, i.e. we determine

\( \underset{w, b}{arg max} \sum\limits_{k} -H(y^{(k)}, \mathit{softmax}(w x^{(k)} + b))\)

or, equivalently,

\( \underset{w, b}{arg min} : \underset{k}{avg} \, H(y^{(k)}, \mathit{softmax}(w x^{(k)} + b))\)

In neural networks terms estimating \(w\) and \(b\) from all training samples at once corresponds to batch gradient descent (as opposed to stochastic gradient descent and mini-batch gradient descent).

We will show below how batch gradient descent in TensorFlow can be used to estimate a logistic regression model.

N.B.: Specialized logistic regression packages will use algorithms like L-BFGS rather than gradient descent to estimate the model parameters. We use gradient descent because we are mainly interested in the analogy to neural networks.

N.B.: Due to the fact that the probabilities are constrained to sum up to \(1\), one of the output dimensions is redundant. Therefore most regression packages reduce the number of dimensions of \(w\) and \(b\) by \(1\) by defining one of the output probabilities as a "pivot" relative to which the others are expressed. We skip that step because we are more interested in preserving the symmetry of the formulation and the analogy to neural networks.

Example: Iris data

We implement the steps outlined above using the classical iris data set from the UCI repository. Other data sets for example for textclassification, a typical Natural Language Processing task. The task is to predict one of three iris species from four measurements on the iris flower, i.e. length and width of sepals and petals.

import tensorflow as tf
import numpy as np
import pandas as pd
import sklearn as skl

We first download a csv file with the iris data into a pandas data frame

data = pd.read_csv("", sep=",",
                   names=["sepal_length", "sepal_width", "petal_length", "petal_width", "iris_class"])

Because the data are ordered by iris species we shuffle it

data = data.sample(frac=1).reset_index(drop=True)

The downloaded data set now looks like this


Then we extract the predictor variables

all_x = data[['sepal_length', 'sepal_width', 'petal_length', 'petal_width']]

and the target variable, which is one-hot encoded using the pandas function pd.get_dummies()

all_y = pd.get_dummies(data.iris_class)

We remember the number of input and output dimensions

n_x = len(all_x.columns)
n_y = len(all_y.columns)

and split the data into 2/3 training and 1/3 test using the utility function model_selection.train_test_split() from scikit learn

train_x, test_x, train_y, test_y = skl.model_selection.train_test_split(all_x, all_y, test_size=1 / 3)
(100, 4)
(100, 3)
(50, 4)
(50, 3)

We are now ready to build our TensorFlow graph. We work with the default graph and the default (=interactive) session because it is more convenient in a notebook.

sess = tf.InteractiveSession()

We define the input for our predictor variable x as a matrix with n_x = 4 columns, and the input for our one-hot encoded target variable as a matrix with n_y = 3 columns

x = tf.placeholder(tf.float32, shape=[None, n_x])
y = tf.placeholder(tf.float32, shape=[None, n_y])

The parameters to be trained are a 3 * 4 weights-matrix w and a 3-dimensonal vector b of biases

W = tf.Variable(tf.zeros(shape=[n_x, n_y]))
b = tf.Variable(tf.zeros(shape=[n_y]))   

We define the predictor predictor function \(p = softmax(w x + b)\) using the TensorFlow builtins for matrix multiplication and softmax

prediction = tf.nn.softmax(tf.matmul(x, W) + b)

The function call tf.matmul(x, W) corresponds to the mathematical expression \(w x\), with a few qualifications:

  • While the variable \(x\) was a single vector, the TensorFlow variable x is a matrix containing multiple input vectors.
  • While the variable \(x\) was a column vector, the TensorFlow placeholder x contains input vectors as rows. This order of dimensions follows the machine learning convention.
  • Correspondingly the order of dimensions in the TensorFlow variable W is swapped compared to the variable \(w\),
  • and the order of the parameters in the function call tf.matmul(x, W) is swapped compared to the corresponding expression \(w x\).

Also note that the TensorFlow operations + b and tf.nn.softmax() are written in matrix notation, which means that + b and tf.nn.softmax() are applied to each row of their respective input matrices separately.

We have now defined the prediction as a matrix with 3 columns corresponding to the three different iris species

tf.Tensor 'Softmax:0' shape=(?, 3) dtype=float32

Defining our cost function using the cross-entropy loss \( \underset{k}{avg} \, H(y^{(k)}, p^{(k)}) = -\underset{k}{avg} \, \sum\limits_{i} y_i^{(k)} \cdot \log p_i^{(k)} \) is slightly more involved that defining the prediction, but can still be written in one line

cost = tf.reduce_mean(-tf.reduce_sum(y * tf.log(prediction), axis=1))

We will look at this expression step-by-step.

The function call tf.log(prediction) simply calculates the logarithm for each cell of the prediction matrix.

The matrix of true labels y has the same dimensions as tf.log(prediction) and the multiplication y * tf.log(prediction) is performed element-wise

y * tf.log(prediction)
tf.Tensor 'mul_1:0' shape=(?, 3) dtype=float32

The expression tf.reduce_sum() sums the elements of y * tf.log(prediction) along axis 1 (the axis count starts at 1), i.e. it calculates the sum over all columns for each row, resulting in a column vector containing a single value for the cross-entropy of each sample

tf.reduce_sum(y * tf.log(prediction), axis=1)
tf.Tensor 'Sum_1:0' shape=(?,) dtype=float32

Conceptually this expression just calculates the log of the predicted probability of the true category of each training sample.

The last step in the definition of the cost function is a simple average over the negation of this vector using tf.reduce_mean(...)

tf.Tensor 'Mean:0' shape=() dtype=float32

As we can see cost is a single float, i.e. shape=(). We use this cost function to define a simple gradient descent optimizer

optimizer = tf.train.GradientDescentOptimizer(learning_rate=0.01).minimize(cost)

Now we are ready to perform the actual training.

We initialize the variables (w and b) and run the 1000 steps of gradient descent. Note that we can pass the pandas data frames train_x and train_y directly to TensorFlow without having to convert them to numpy ndarrays.
for epoch in range(10000):[optimizer], feed_dict={x: train_x, y: train_y})
W_hat, b_hat =[W, b])

Now we have trained a model

[[ 0.84457296  0.78274035 -1.62730658]
 [ 2.07996345 -0.33891651 -1.74104333]
 [-2.79475284 -0.19219251  2.98692822]
 [-1.25635636 -0.96077484  2.21712947]]
 [ 0.4169949   0.67145401 -1.08844793]

and can apply it to a sample record

926., feed_dict={x: test_x[:1], y: test_y[:1]}).tolist()[0]
[7.980591180967167e-05, 0.04710648953914642, 0.9528136849403381]

and get 95% probability for the last category, which is the correct species


In order to evaluate the performance of the model on the complete test dataset we check for every sample if the predicted label is correct.

  • We define a decision function which simply predicts the most likely category as tf.argmax(prediction, 1).
  • We convert the one-hot encoded label y back to the index of the category using tf.argmax(y, 1).
  • We compare the two using tf.equal() and convert the resulting boolean to a float (0=False, 1=True) using tf.cast().
prediction_is_correct = tf.cast(tf.equal(tf.argmax(prediction, 1), tf.argmax(y, 1)), tf.float32)

By averaging this function over all test samples we get the accuracy of the prediction.

(Note that we pass W_hat and b_hat to explicitly. Technically this is redundant because the values are already available in the session, but it would be essential if training and evaluation happened separately from each other.)

print(, feed_dict={W: W_hat, b: b_hat, x: test_x, y: test_y}))

For our test set 98% accuracy means that 49 out of 50 iris species were predicted correctly.

Below we show two additional tweaks which are commonly used in TensorFlow models.

  • We use the builtin TensorFlow function tf.nn.softmax_cross_entropy_with_logits() which is said to be more numerically stable than defining the cost function manually as we did above.
  • Instead of the basic gradient descent optimizer we use the Adam optimizer which converges faster, because it is able to automatically adjust the learning rate.
cost2 = tf.reduce_mean(tf.nn.softmax_cross_entropy_with_logits(logits = tf.matmul(x, W) + b, labels = y))
optimizer2 = tf.train.AdamOptimizer(learning_rate=0.1).minimize(cost)
for epoch in range(101):[optimizer2], feed_dict={x: train_x, y: train_y})
    if epoch % 10 == 0:
        print(epoch,, feed_dict={x: test_x, y: test_y}))
0 0.38
10 0.74
20 0.98
30 0.96
40 0.98
50 0.98
60 0.96
70 0.98
80 0.98
90 0.98
100 0.98


Continue reading this article with First Steps with TensorFlow – Part 3 | Building a simple neural network

Missed something? See First steps with TensorFlow – Part 1 | Basics