Machine Learning & Big Data Blog

Introduction to TensorFlow and Logistic Regression

Walker Rowe

Here we introduce TensorFlow, an open-source machine learning library developed by Google. We explain what it does and show how to use it to do logistic regression.

(This tutorial is part of our Guide to Machine Learning with TensorFlow & Keras. Use the right-hand menu to navigate.)

Background

TensorFlow has many applications in machine learning, including neural networks. One application of neural networks is handwriting analysis; another is facial recognition. TensorFlow is designed to let such problems scale without limit, because the nodes in its computation graph can be run across a distributed network. Google uses TensorFlow in some of its production applications.

One interesting aspect of TensorFlow is that its logic can run not only on a machine's CPU but also on its GPU, or graphics processing unit. That provides more computing power per machine, since GPUs are built for the high-speed parallel arithmetic needed to drive a display.

Install and Basic Concepts

To follow this tutorial, first install TensorFlow using the directions here.

The basic unit in TensorFlow is the tensor. A tensor is an array of any number of dimensions. For example:

[1] is a one-dimensional array
[[1,1]] is a two-dimensional array
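
The same idea can be illustrated outside of TensorFlow with NumPy arrays (a sketch, assuming NumPy is installed), where ndim reports the number of dimensions:

```python
import numpy as np

a = np.array([1])        # a one-dimensional array
b = np.array([[1, 1]])   # a two-dimensional array

print(a.ndim)   # 1
print(b.ndim)   # 2
print(b.shape)  # (1, 2): one row, two columns
```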

To get started, first run Python and import TensorFlow:

import tensorflow as tf

You can assign values directly or create a placeholder whose value you assign later. For example, a single value can be written:

x = tf.constant(3.0, dtype=tf.float32)

Here x is an immutable constant, meaning you cannot change its value.

But a tensor has no value until you create a Session and run it:

import tensorflow as tf
sess = tf.Session()
x =  tf.constant(3.0, dtype=tf.float32)
print(sess.run([x])) 
Outputs:
[3.0]

Or you can write:

import tensorflow as tf
sess = tf.Session()
y = tf.Variable([3.0], dtype=tf.float32)
init = tf.global_variables_initializer()
sess.run(init) 
print(sess.run([y]))
Outputs:
[array([ 3.], dtype=float32)]

In the example above, the Variable(s) have no value until you run tf.global_variables_initializer().

You can add tensors and do other math, like this:

x = tf.constant([3, 3], dtype=tf.float32)
y = tf.constant([4, 4], dtype=tf.float32)
print(x + y)
print(sess.run([x + y]))
Outputs:
Tensor("add_4:0", shape=(2,), dtype=float32)
[array([ 7.,  7.], dtype=float32)]

As you can see, the expression x + y is not evaluated until you call run.

Here is another example: the graph of a line, f(x) = mx + b, where m is the slope and b is the y-intercept.

m = tf.Variable([2], dtype=tf.float32)
b = tf.Variable([3], dtype=tf.float32)
x = tf.placeholder(tf.float32)
y = m * x + b

You can pass an array of n values to the placeholder x, and the function is evaluated once for each. Here we use [1, 2, 3, 4]:

init = tf.global_variables_initializer()
sess.run(init)
print(sess.run(y, {x: [1, 2, 3, 4]}))
Outputs:
[  5.   7.   9.  11.]
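
For comparison, here is the same computation in plain NumPy (a sketch; unlike a TensorFlow graph, NumPy evaluates each expression immediately, with no session required):

```python
import numpy as np

m = 2.0
b = 3.0
x = np.array([1.0, 2.0, 3.0, 4.0])

# Evaluated eagerly: y holds the result as soon as this line runs
y = m * x + b
print(y)  # [ 5.  7.  9. 11.]
```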

Logistic Regression with tf.estimator

For background on logistic regression and how to interpret the results, you can read this article on Wikipedia; we also take our test data from it. The goal is to predict the likelihood that a student will pass a test given how many hours they have studied.
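
As a reminder of the underlying math, logistic regression models the probability of passing as the sigmoid of a linear function of hours studied. A minimal sketch in plain Python (the intercept −4.0777 and coefficient 1.5046 are the fitted values reported in the Wikipedia example, not values produced by this tutorial's code):

```python
import math

def sigmoid(z):
    # Logistic function: maps any real number into the interval (0, 1)
    return 1.0 / (1.0 + math.exp(-z))

# Fitted coefficients from the Wikipedia worked example
b0 = -4.0777   # intercept
b1 = 1.5046    # coefficient for hours studied

def prob_pass(hours):
    # Probability of passing given hours studied
    return sigmoid(b0 + b1 * hours)

print(round(prob_pass(2.0), 2))  # about 0.26
print(round(prob_pass(4.0), 2))  # about 0.87
```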

Having installed TensorFlow, run python. Then copy and paste the code below into the Python interpreter as we explain each step.

First we import pandas, as it is the easiest way to work with columnar data. The hours are floating-point numbers, like x.xx. We multiply them by 100 and convert them to integers, since the TensorFlow functions we use for logistic regression require either strings or integers.

import pandas
hours = [0.50,0.75,1.00,1.25,1.50,1.75,1.75,2.00,2.25,2.50,2.75,3.00,3.25,3.50,4.00,4.25,4.50,4.75,5.00,5.50]
passx = [0,0,0,0,0,0,1,0,1,0,1,0,1,0,1,1,1,1,1,1]
df = pandas.DataFrame(passx)
df['hours'] = hours
df.columns = ['pass', 'hours']
h = df['hours'].apply(lambda x: x * 100).astype(int)
df['hours']=h
print(df)
Outputs:
    pass  hours
0      0     50
1      0     75
2      0    100
3      0    125
...

We create a function input_fn that we can pass into the LinearClassifier model below. It returns an input function built with the tf.estimator.inputs.pandas_input_fn method.

def input_fn(df):
    labels = df["pass"]
    return tf.estimator.inputs.pandas_input_fn(
        x=df,
        y=labels,
        batch_size=100,
        num_epochs=10,
        shuffle=False,
        num_threads=5)

TensorFlow writes its working data to disk, so we give it a place to do that. We also have to create a NumericColumn object, since our independent variable is continuous rather than categorical. Then we create the LinearClassifier model.

import tensorflow as tf
import tempfile
model_dir = tempfile.mkdtemp()
hours = tf.feature_column.numeric_column("hours")
base_columns = [hours]
m = tf.estimator.LinearClassifier(model_dir=model_dir, feature_columns=base_columns)

Now we run the train method.

m.train(input_fn(df), steps=None)
Outputs:
INFO:tensorflow:Create CheckpointSaverHook.
INFO:tensorflow:Saving checkpoints for 1 into /tmp/tmpS8OD2H/model.ckpt.
INFO:tensorflow:loss = 69.3147, step = 1
INFO:tensorflow:Saving checkpoints for 10 into /tmp/tmpS8OD2H/model.ckpt.
INFO:tensorflow:Loss for final step: 54.1885.
<tensorflow.python.estimator.canned.linear.LinearClassifier object at 0x7f103b560390>

We use the same data for the test set as for the training set. In real life you would split them in two, but we have very little data here.

results = m.evaluate(input_fn(df),steps=None)
Outputs:
INFO:tensorflow:Starting evaluation at 2017-11-02-14:20:16
INFO:tensorflow:Restoring parameters from /tmp/tmpS8OD2H/model.ckpt-10
INFO:tensorflow:Finished evaluation at 2017-11-02-14:20:16
INFO:tensorflow:Saving dict for global step 10: accuracy = 0.75, accuracy_baseline = 0.5, auc = 0.895, auc_precision_recall = 0.907308, average_loss = 0.535767, global_step = 10, label/mean = 0.5, loss = 53.5767, prediction/mean = 0.585759

Here we print the same results as above, but in an easier-to-read format.

print("model directory = %s" % model_dir)
for key in sorted(results):
    print("%s: %s" % (key, results[key]))
Outputs:
accuracy: 0.75
accuracy_baseline: 0.5
auc: 0.895
auc_precision_recall: 0.907308
average_loss: 0.535767
global_step: 10
label/mean: 0.5
loss: 53.5767
prediction/mean: 0.585759

The accuracy could be improved. You could create a larger data set and split the input into separate training and test sets, or adjust num_epochs and other parameters.
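
For example, a random 70/30 train/test split of the data frame could be sketched with pandas like this (the 70/30 ratio and random_state are arbitrary choices, not from the original code):

```python
import pandas

hours = [0.50, 0.75, 1.00, 1.25, 1.50, 1.75, 1.75, 2.00, 2.25, 2.50,
         2.75, 3.00, 3.25, 3.50, 4.00, 4.25, 4.50, 4.75, 5.00, 5.50]
passx = [0, 0, 0, 0, 0, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 1, 1, 1, 1, 1]
df = pandas.DataFrame({'pass': passx, 'hours': hours})

# Sample 70% of the rows for training; the remainder becomes the test set
train = df.sample(frac=0.7, random_state=1)
test = df.drop(train.index)

print(len(train), len(test))  # 14 6
```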



These postings are my own and do not necessarily represent BMC's position, strategies, or opinion.

See an error or have a suggestion? Please let us know by emailing blogs@bmc.com.


About the author

Walker Rowe

Walker Rowe is an American freelance tech writer and programmer living in Cyprus. He writes tutorials on analytics and big data and specializes in documenting SDKs and APIs. He is the founder of the Hypatia Academy Cyprus, an online school that teaches secondary school children programming. You can find Walker here and here.