Here we introduce TensorFlow, an open-source machine learning library developed by Google. We explain what it does and show how to use it to perform logistic regression.
(This tutorial is part of our Guide to Machine Learning with TensorFlow & Keras. Use the right-hand menu to navigate.)
Background
TensorFlow has many applications in machine learning, including neural networks. One application of neural networks is handwriting analysis; another is facial recognition. TensorFlow is designed to let such problems scale without limit, since the nodes in the computation graph can be run across a distributed network. Google uses TensorFlow in some of its production applications.
One interesting aspect of TensorFlow is that it can use not only the CPU of a machine but also the GPU (graphics processing unit). That provides more computing power per machine, as GPUs offer a lot of parallel processing power: driving a computer's display requires speed.
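If you want to check whether your installation can actually see a GPU, here is a minimal sketch using the TensorFlow 1.x API that the rest of this tutorial assumes. Note that device_lib lives in an internal module, so treat it as a quick diagnostic rather than a stable API.

import tensorflow as tf
from tensorflow.python.client import device_lib

# True if TensorFlow was built with GPU support and a GPU is visible
print(tf.test.is_gpu_available())

# Names of all devices TensorFlow can use on this machine (CPU and any GPUs)
print([d.name for d in device_lib.list_local_devices()])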
Install and Basic Concepts
To follow this tutorial, first install TensorFlow using the directions here.
The basic unit in TensorFlow is the tensor. A tensor is an array of any number of dimensions. For example:
[1] is a 1-dimensional array
[[1,1]] is a 2-dimensional array
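As a quick sketch of this idea, the shape attribute of a tensor tells you how many dimensions it has and how large each one is:

import tensorflow as tf

a = tf.constant([1])        # 1-dimensional tensor
b = tf.constant([[1, 1]])   # 2-dimensional tensor

print(a.shape)   # (1,)
print(b.shape)   # (1, 2)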
To get started, first run Python and import TensorFlow:
import tensorflow as tf
You can assign values directly, or create a placeholder and assign the value later. For example, a single value can be written:
x = tf.constant(3.0, dtype=tf.float32)
Here x is an immutable constant (meaning you cannot change it).
But the tensor has no value until you create a Session and run it:
import tensorflow as tf

sess = tf.Session()
x = tf.constant(3.0, dtype=tf.float32)
print(sess.run([x]))

Outputs:

[3.0]
Or you can write:
import tensorflow as tf

sess = tf.Session()
y = tf.Variable([3.0], dtype=tf.float32)
init = tf.global_variables_initializer()
sess.run(init)
print(sess.run([y]))

Outputs:

[array([ 3.], dtype=float32)]
In the example above, the Variables have no value until you run tf.global_variables_initializer().
You can add tensors and do other math, like this:
x = tf.constant([3,3], dtype=tf.float32)
y = tf.constant([4,4], dtype=tf.float32)
print(x + y)
print(sess.run([x+y]))

Outputs:

Tensor("add_4:0", shape=(2,), dtype=float32)
[array([ 7.,  7.], dtype=float32)]
As you can see, x + y is just a Tensor object with no value until you call run.
Here is another example. This is the equation of a line, f(x) = mx + b, where m is the slope and b is the y-intercept.
m = tf.Variable([2], dtype=tf.float32)
b = tf.Variable([3], dtype=tf.float32)
x = tf.placeholder(tf.float32)
y = m * x + b
You can pass an array of n values to x and evaluate the function for all n of them at once. Here we use [1, 2, 3, 4]:
init = tf.global_variables_initializer()
sess.run(init)
print(sess.run(y, {x: [1, 2, 3, 4]}))

Outputs:

[  5.   7.   9.  11.]
Logistic Regression with tf.estimator
For background on logistic regression, and how to interpret the results, you can read this article on Wikipedia. We also take our test data from that article. The goal is to predict the likelihood that a student will pass an exam given how many hours they have studied.
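The model we fit below estimates that likelihood with the logistic (sigmoid) function. As a rough sketch of the idea in plain Python (the slope m and intercept b here are made-up values for illustration only; the LinearClassifier below learns its own weights):

import math

def pass_probability(hours, m=1.5, b=-4.0):
    # Logistic function: probability = 1 / (1 + e^-(m*hours + b))
    # m and b are placeholder values, not the fitted ones
    return 1.0 / (1.0 + math.exp(-(m * hours + b)))

print(pass_probability(1.0))   # low probability after 1 hour of study
print(pass_probability(4.0))   # higher probability after 4 hours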
Having installed TensorFlow, run Python, then copy and paste the code below into the interpreter as we explain each step.
First we import pandas, as it is the easiest way to work with columnar data. The hours are floating-point numbers, like x.xx. We multiply them by 100 and convert them to integers, since the TensorFlow functions we use for logistic regression require either strings or integers.
import pandas

hours = [0.50,0.75,1.00,1.25,1.50,1.75,1.75,2.00,2.25,2.50,2.75,3.00,3.25,3.50,4.00,4.25,4.50,4.75,5.00,5.50]
passx = [0,0,0,0,0,0,1,0,1,0,1,0,1,0,1,1,1,1,1,1]

df = pandas.DataFrame(passx)
df['hours'] = hours
df.columns = ['pass', 'hours']
h = df['hours'].apply(lambda x: x * 100).astype(int)
df['hours'] = h
print(df)

Outputs:

   pass  hours
0     0     50
1     0     75
2     0    100
3     0    125
...
We create a function input_fn that we can pass into the LinearClassifier model below. This function takes our data frame and returns an input function built with the tf.estimator.inputs.pandas_input_fn method, using the pass column as the labels.
def input_fn(df):
    labels = df["pass"]
    return tf.estimator.inputs.pandas_input_fn(
        x=df,
        y=labels,
        batch_size=100,
        num_epochs=10,
        shuffle=False,
        num_threads=5)
TensorFlow writes its working data to disk, so we give it a temporary directory for that. We also have to create a NumericColumn object, since our independent variable is continuous rather than categorical. Then we create the LinearClassifier model.
import tensorflow as tf
import tempfile

model_dir = tempfile.mkdtemp()
hours = tf.feature_column.numeric_column("hours")
base_columns = [hours]

m = tf.estimator.LinearClassifier(model_dir=model_dir, feature_columns=base_columns)
Now we run the train method.
m.train(input_fn(df), steps=None)

Outputs:

INFO:tensorflow:Create CheckpointSaverHook.
INFO:tensorflow:Saving checkpoints for 1 into /tmp/tmpS8OD2H/model.ckpt.
INFO:tensorflow:loss = 69.3147, step = 1
INFO:tensorflow:Saving checkpoints for 10 into /tmp/tmpS8OD2H/model.ckpt.
INFO:tensorflow:Loss for final step: 54.1885.
<tensorflow.python.estimator.canned.linear.LinearClassifier object at 0x7f103b560390>
We use the same data for the test data set as for the training set. In real life you would split them in two, but we have very little data here.
results = m.evaluate(input_fn(df), steps=None)

Outputs:

INFO:tensorflow:Starting evaluation at 2017-11-02-14:20:16
INFO:tensorflow:Restoring parameters from /tmp/tmpS8OD2H/model.ckpt-10
INFO:tensorflow:Finished evaluation at 2017-11-02-14:20:16
INFO:tensorflow:Saving dict for global step 10: accuracy = 0.75, accuracy_baseline = 0.5, auc = 0.895, auc_precision_recall = 0.907308, average_loss = 0.535767, global_step = 10, label/mean = 0.5, loss = 53.5767, prediction/mean = 0.585759
Here we print the same results as above, but in an easier-to-read format.
print("model directory = %s" % model_dir) for key in sorted(results): print("%s: %s" % (key, results[key])) Outputs: accuracy: 0.75 accuracy_baseline: 0.5 auc: 0.895 auc_precision_recall: 0.907308 average_loss: 0.535767 global_step: 10 label/mean: 0.5 loss: 53.5767 prediction/mean: 0.585759
The accuracy could be improved. You could create a larger data set and split the input into separate training and test data sets. You could also adjust num_epochs and other values.
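As a sketch of what that would look like, here is one way to split the data frame and pull predictions back out of the trained model, assuming the same tf.estimator 1.x API used above. The 80/20 split fraction is an arbitrary choice, and the 'probabilities' key is what the canned LinearClassifier returns for each example.

# Split the DataFrame into training and test sets (80/20 is an arbitrary choice)
train_df = df.sample(frac=0.8, random_state=1)
test_df = df.drop(train_df.index)

# Train and evaluate on the separate sets
m.train(input_fn(train_df), steps=None)
results = m.evaluate(input_fn(test_df), steps=None)

# predict() returns a generator of dicts; 'probabilities' holds the
# estimated probability of each class (fail, pass) for one student
for prediction in m.predict(input_fn(test_df)):
    print(prediction['probabilities'])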