Linear Regression with Tensorflow


We'll learn how a powerful statistical learning technique benefits from the efficiency and flexibility of Tensorflow.
We'll learn about:
  • implementing linear regression with tensorflow (using normal equation)
  • implementing gradient descent (manually computing gradients)
  • using tensorflow autodiff to automatically compute gradients
  • using optimizers

Linear regression

Linear regression is a machine learning technique in the form of a linear equation, which takes in a set of input features and produces an output:

ŷ = θ0 + θ1·x1 + θ2·x2 + … + θn·xn

Here x1, x2, … xn are the input features and ŷ (y-hat) is the predicted output. θ0, θ1, … θn are the parameters that are determined during training, based on the training data. In vector form this becomes ŷ = θT · x, where θ is the parameter vector and x is the feature vector with x0 = 1. We'll discuss three techniques of linear regression with Tensorflow.
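For example, with made-up numbers (both θ and x below are hypothetical, purely to show the arithmetic):
import numpy as np

theta = np.array([4.0, 3.0, -1.5])   # [θ0, θ1, θ2]
x = np.array([1.0, 2.0, 0.5])        # [1 (bias term), x1, x2]
y_hat = theta.dot(x)                 # 4.0 + 3.0*2.0 + (-1.5)*0.5 = 9.25
print(y_hat)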

1. Using normal equation technique

We’ll first use a statistical technique called the Normal Equation for determining the optimal θ vector (with Tensorflow). The Normal Equation gives the closed-form solution θ = (XT . X)-1 . XT . y, where X is the matrix of input features (with a bias column of ones) and y is the vector of target values.
Code example: We’ll use the California housing prices dataset.
import numpy as np
import tensorflow as tf
from sklearn.datasets import fetch_california_housing

# Download the data. California housing is a standard sklearn dataset, so we'll just use it from there.
housing = fetch_california_housing()
m, n = housing.data.shape

# Add a bias column (with all ones)
housing_data_with_bias = np.c_[np.ones((m, 1)), housing.data]

# Initialize X and y constants in tensorflow
X = tf.constant(housing_data_with_bias, dtype = tf.float32, name='X')
y = tf.constant(housing.target.reshape(-1, 1), dtype = tf.float32, name='y')

# Define the value of theta with normal equation
XT = tf.transpose(X)
XTdotX = tf.matmul(XT, X)
XTdotX_inverse = tf.matrix_inverse(XTdotX)
XTdotY = tf.matmul(XT, y)
theta = tf.matmul(XTdotX_inverse, XTdotY)

# Evaluate theta
with tf.Session() as sess:
    theta_value = theta.eval()

print(theta_value)
Note: You can also solve the normal equation directly with NumPy; the advantage of Tensorflow is that it will automatically run the computation on the GPU if your computer has one.
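For comparison, a minimal NumPy-only version might look like this (the variable names below are just for this snippet, reusing housing_data_with_bias from above):
X_np = housing_data_with_bias
y_np = housing.target.reshape(-1, 1)
theta_numpy = np.linalg.inv(X_np.T.dot(X_np)).dot(X_np.T).dot(y_np)
print(theta_numpy)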
The normal equation is great; however, it doesn't scale well when we have a large number of features. The computational complexity of computing (XT . X)-1 is typically between O(n^2.4) and O(n^3), depending on the implementation (n = number of features), which quickly becomes an issue when n is large.
Q. The normal equation algorithm is efficient when we have a lot of training instances but relatively few features.
In the next section, we'll discuss the gradient descent technique, which scales much better with the number of features.

2. Linear regression with gradient descent (manually computing gradients)

We’ll use batch gradient descent, manually computing the gradients. The gradient of the MSE cost function with respect to θ is (2/m) · XT · (X·θ − y), which is exactly what the code below computes. Remember to normalize the input features first when using gradient descent, otherwise training may be slow.
import numpy as np
import tensorflow as tf
from sklearn.datasets import fetch_california_housing

# Download the data. California housing is a standard sklearn dataset, so we'll just use it from there.
housing = fetch_california_housing()
m, n = housing.data.shape

# Normalize the input features, then add a bias column (with all ones).
# Scale before adding the bias column, otherwise the constant column of ones
# would be zeroed out by the scaler.
from sklearn.preprocessing import StandardScaler
scaled_housing_data = StandardScaler().fit_transform(housing.data)
housing_data_with_bias_scaled = np.c_[np.ones((m, 1)), scaled_housing_data]

n_epochs = 1000
learning_rate = 0.01

# Define X, y, theta
X = tf.constant(housing_data_with_bias_scaled, dtype = tf.float32, name = 'X')
y = tf.constant(housing.target.reshape(-1, 1), dtype = tf.float32, name = 'y')
theta = tf.Variable(tf.random_uniform([n+1, 1], -1.0, 1.0), name = 'theta')
y_prediction = tf.matmul(X, theta, name = 'y_prediction')

# Compute mean squared error
error = y_prediction - y
mse = tf.reduce_mean(tf.square(error), name = 'mse')

# Compute gradients
gradients = 2/m * tf.matmul(tf.transpose(X), error)

# Update theta
theta_new = theta - learning_rate * gradients
theta_update_op = tf.assign(theta, theta_new)

init = tf.global_variables_initializer()

# Run
with tf.Session() as sess:
    sess.run(init)
    for epoch in range(n_epochs):
        if epoch % 100 == 0:
            print('Epoch', epoch, 'MSE =', mse.eval())
        sess.run(theta_update_op)

    best_theta = theta.eval()

print(best_theta)
This works fine; however, it requires us to manually compute the gradients from the MSE cost function. Manually computing gradients is easy for linear regression, but for more complicated machine learning algorithms (e.g. neural networks), it can get tedious and error prone. In the next section, we'll use Tensorflow's autodiff feature to compute the gradients.

3. Computing gradients using Tensorflow’s “autodiff”

Consider the function f(x) = exp(exp(exp(x))). To run gradient descent we need its derivative, f'(x) = exp(x) * exp(exp(x)) * exp(exp(exp(x))). There are two ways of computing f'(x):
  • Compute a = exp(x), b = exp(exp(x)) and c = exp(exp(exp(x))) independently, then f'(x) = a * b * c. The problem is that exp(x) gets computed three times and exp(exp(x)) twice, so the exp function ends up being executed six times, which is inefficient and can be avoided if we reuse values once they are computed.
  • Or, compute a = exp(x), then b = exp(a), then c = exp(b), and finally f'(x) = a * b * c. This is efficient because each value is computed once and reused (the exp function is executed only three times); see the sketch after this list.
  • The first method above corresponds to symbolic differentiation, which is clearly not efficient. Tensorflow's computation happens the second way, which is what the autodiff feature is about: it reuses values that have already been computed instead of recomputing them.
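Here's a minimal plain-Python sketch of the value-reuse idea from the second bullet (not Tensorflow itself; the function name is made up for this illustration):
from math import exp

def f_and_derivative(x):
    a = exp(x)            # exp(x)
    b = exp(a)            # exp(exp(x))
    c = exp(b)            # exp(exp(exp(x))) = f(x)
    return c, a * b * c   # f(x) and f'(x); exp is evaluated only three times

print(f_and_derivative(1.0))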
# Compute gradients w.r.t. theta with Tensorflow's autodiff
gradients = tf.gradients(mse, [theta])[0]

# Update theta
theta_new = theta - learning_rate * gradients
theta_update_op = tf.assign(theta, theta_new)
Full code
This is great and efficient. However, we're still applying the update to the θ vector manually. We can use an optimizer instead, which will both compute the gradients and update θ out of the box.

Using an optimizer instead of manually updating the θ vector

Code example: We’ll use the Gradient Descent optimizer.
# Use an optimizer to perform gradient descent (compute gradients, compute and update theta values)
optimizer = tf.train.GradientDescentOptimizer(learning_rate = learning_rate)
training_op = optimizer.minimize(mse)
Full code
Tensorflow also provides other types of optimizers, e.g. MomentumOptimizer, which often converges much faster than plain gradient descent.
# Use a momentum optimizer (computes gradients and updates theta, with momentum)
optimizer = tf.train.MomentumOptimizer(learning_rate = learning_rate, momentum = 0.9)
training_op = optimizer.minimize(mse)

Full code
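For reference, here's a minimal end-to-end sketch of the optimizer version, assembled from the snippets above (same dataset and variable names as earlier; the linked full code may be organized differently):
import numpy as np
import tensorflow as tf
from sklearn.datasets import fetch_california_housing
from sklearn.preprocessing import StandardScaler

# Load the data, scale the features, and add a bias column of ones
housing = fetch_california_housing()
m, n = housing.data.shape
scaled_housing_data = StandardScaler().fit_transform(housing.data)
housing_data_with_bias_scaled = np.c_[np.ones((m, 1)), scaled_housing_data]

n_epochs = 1000
learning_rate = 0.01

# Define X, y, theta, the predictions and the MSE cost
X = tf.constant(housing_data_with_bias_scaled, dtype=tf.float32, name='X')
y = tf.constant(housing.target.reshape(-1, 1), dtype=tf.float32, name='y')
theta = tf.Variable(tf.random_uniform([n + 1, 1], -1.0, 1.0), name='theta')
y_prediction = tf.matmul(X, theta, name='y_prediction')
error = y_prediction - y
mse = tf.reduce_mean(tf.square(error), name='mse')

# The optimizer computes the gradients and applies the theta update for us
optimizer = tf.train.GradientDescentOptimizer(learning_rate=learning_rate)
training_op = optimizer.minimize(mse)

init = tf.global_variables_initializer()

with tf.Session() as sess:
    sess.run(init)
    for epoch in range(n_epochs):
        if epoch % 100 == 0:
            print('Epoch', epoch, 'MSE =', mse.eval())
        sess.run(training_op)
    best_theta = theta.eval()

print(best_theta)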