Machine Learning & Big Data Blog

Using StringIO to Read Delimited Text Files into NumPy

Walker Rowe

In this tutorial, we’ll show you how to read delimited text data into a NumPy array using np.genfromtxt and the StringIO class from Python’s io module.

(This tutorial is part of our Pandas Guide.)
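
As a quick illustration of the idea before we get to the real data: np.genfromtxt reads from a file-like object, and StringIO turns a string that is already in memory into one. This is only a minimal sketch; the two-row sample string is made up for the example and is not part of the tutorial's data set.

```python
import numpy as np
from io import StringIO

# Hypothetical two-row sample, invented for this sketch.
sample = "1,2.5,3\n4,5.5,6\n"

# StringIO wraps the string so genfromtxt can read it like a file.
arr = np.genfromtxt(StringIO(sample), delimiter=",")

print(arr.shape)  # (2, 3)
print(arr[1, 1])  # 5.5
```

With no dtype given, every value is read as a float, which is why the result is a plain 2-D array.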

Data we used

We will read this crime data:

,crime$cluster,Murder,Assault,UrbanPop,Rape
Alabama,4,13.2,236,58,21.2
Alaska,4,10,263,48,44.5
Arizona,4,8.1,294,80,31
Arkansas,3,8.8,190,50,19.5
California,4,9,276,91,40.6
Colorado,3,7.9,204,78,38.7
Connecticut,2,3.3,110,77,11.1
Delaware,4,5.9,238,72,15.8
Florida,4,15.4,335,80,31.9

Parameters

In the code below, we download the data using urllib. Then we use np.genfromtxt to load it into a NumPy array. Note the following parameters:

delimiter="," The delimiter between columns.
skip_header=1 Skips the first line, since it holds column headers and not data.
dtype=dtypes A list of (name, dtype) tuples. NumPy converts each column to the given dtype (data type) and assigns it the given name.

If we don’t want to assign names we would use (dtype1, dtype2, …).
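
To make the difference concrete, here is a minimal sketch that passes a plain sequence of types instead of (name, dtype) tuples. The two inline rows are copied from the data above. As far as we know, NumPy then falls back to default field names of the form f0, f1, and so on; treat that naming as an assumption to verify on your own NumPy version.

```python
import numpy as np
from io import StringIO

# Two rows copied from the crime data above, without the header line.
rows = "Alabama,4,13.2,236,58,21.2\nAlaska,4,10,263,48,44.5\n"

# A sequence of types with no names attached.
unnamed = ("U12", float, float, float, float, float)
arr = np.genfromtxt(StringIO(rows), delimiter=",", dtype=unnamed)

print(arr.dtype.names)  # default field names chosen by NumPy
print(arr["f2"])        # the Murder column, addressed by its default name
```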

Note that we use the type float, which NumPy maps to a 64-bit float. NumPy also provides fixed-size numeric types modeled on C types, such as 32-bit integers (np.int32).

We use S12, a byte string of up to 12 bytes, for the state name, because plain str carries no length and can convert this data to an empty string. You could also use U12, a Unicode string of up to 12 characters.

We also could have written np.string_ and np.unicode_ (named np.bytes_ and np.str_ in newer NumPy releases), but those do not give any length either, so the column could come back blank.

We could have used object as well.

Note that NumPy reports these names and types for the resulting array:

· dtype=[('crime', 'S12'), ('cluster', '<f8'), ('Murder', '<f8'), ('Assault', '<f8'), ('UrbanPop', '<f8'), ('Rape', '<f8')]

· The < sign refers to the byte order: < means little-endian and > means big-endian. f8 is an 8-byte (64-bit) float.
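
The byte-order prefix changes how a value is laid out in memory, not the value itself. A small sketch of that, using a one-element array chosen just for illustration:

```python
import numpy as np

little = np.array([1.5], dtype="<f8")  # little-endian 64-bit float
big = little.astype(">f8")             # same value, big-endian layout

print(little[0] == big[0])                      # True: the values match
print(little.tobytes() == big.tobytes()[::-1])  # True: the 8 bytes are mirrored
```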

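usecols=(1,5) We did not use this parameter. If we had used it, NumPy would have read only columns 1 and 5 (cluster and Rape), counting from 0. To skip just the first column, we would pass usecols=(1,2,3,4,5).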
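
Here is a short sketch of how usecols selects columns, run on two rows copied from the data above. Column indexes start at 0, so index 0 is the state name.

```python
import numpy as np
from io import StringIO

rows = "Alabama,4,13.2,236,58,21.2\nAlaska,4,10,263,48,44.5\n"

# usecols=(1, 5) keeps only columns 1 and 5 (cluster and Rape).
two_cols = np.genfromtxt(StringIO(rows), delimiter=",", usecols=(1, 5))

# Listing columns 1 through 5 drops only the state-name column.
numeric = np.genfromtxt(StringIO(rows), delimiter=",", usecols=(1, 2, 3, 4, 5))

print(two_cols.shape)  # (2, 2)
print(numeric.shape)   # (2, 5)
```

Dropping the text column this way lets the default float dtype work without producing NaN values.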

The code explained

Here is the code:

import urllib.request
import numpy as np
from io import StringIO

url = "https://raw.githubusercontent.com/werowe/MLexamples/master/crime_data.csv"

# Download the file and decode it line by line into one string.
file = urllib.request.urlopen(url)
data = ""
for d in file:
    data = data + d.decode('utf-8')

# One (name, dtype) pair per column.
dtypes = [('crime', "S12"),
          ('cluster', float),
          ('Murder', float),
          ('Assault', float),
          ('UrbanPop', float),
          ('Rape', float)]

# Wrap the string in StringIO so genfromtxt can read it like a file.
arr = np.genfromtxt(StringIO(data), delimiter=",", skip_header=1,
                    dtype=dtypes)

Results in:

array([(b'Alabama', 4., 13.2, 236., 58., 21.2),
(b'Alaska', 4., 10. , 263., 48., 44.5),

Note that NumPy returned bytes values (shown with a b prefix) for the string column. If we want regular strings, we can use a Unicode dtype:

dtypes = [('crime', 'U25'),
          ('cluster', '>f'),  # '>f' is a big-endian 32-bit float
          ('Murder', float),
          ('Assault', float),
          ('UrbanPop', float),
          ('Rape', float)]

Results in:

array([('Alabama', 4., 13.2, 236., 58., 21.2),
('Alaska', 4., 10. , 263., 48., 44.5),

If we leave off dtype and let NumPy use its default, it assigns NaN (missing data) to the string column, because the state names cannot be parsed as numbers. The default type for every column is float.

arr=np.genfromtxt(StringIO(data), delimiter=",", skip_header=1)

Results in:

array([[  nan,   4. ,  13.2, 236. ,  58. ,  21.2],
[  nan,   4. ,  10. , 263. ,  48. ,  44.5],
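
If the goal is to keep the state names without spelling out every dtype, one option is dtype=None, which asks genfromtxt to infer a type for each column. The following is a minimal sketch on two rows from the data above. The explicit encoding argument is our addition, so that the text column comes back as str rather than bytes regardless of NumPy version.

```python
import numpy as np
from io import StringIO

rows = "Alabama,4,13.2,236,58,21.2\nAlaska,4,10,263,48,44.5\n"

# dtype=None lets genfromtxt pick a type per column.
arr = np.genfromtxt(StringIO(rows), delimiter=",", dtype=None,
                    encoding="utf-8")

print(arr.dtype)   # a structured dtype with one field per column
print(arr["f0"])   # the state names, no longer NaN
```

Note that with inferred types, whole-number columns such as cluster come back as integers rather than floats.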

Having assigned names to the columns, we can refer to them by name instead of by index:

arr['Murder']
array([13.2, 10. ,  8.1,  8.8,  9. ,  7.9,  3.3,  5.9, 15.4, 17.4,  5.3,
2.6, 10.4,  7.2,  2.2,  6. ,  9.7, 15.4,  2.1, 11.3,  4.4, 12.1,
2.7, 16.1,  9. ,  6. ,  4.3, 12.2,  2.1,  7.4, 11.4, 11.1, 13. ,
0.8,  7.3,  6.6,  4.9,  6.3,  3.4, 14.4,  3.8, 13.2, 12.7,  3.2,
2.2,  8.5,  4. ,  5.7,  2.6,  6.8])
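
Named columns also combine with ordinary NumPy operations. The sketch below rebuilds a three-row array from rows in the data above (so it runs without the download) and then filters one column by another.

```python
import numpy as np
from io import StringIO

rows = ("Alabama,4,13.2,236,58,21.2\n"
        "Alaska,4,10,263,48,44.5\n"
        "Arizona,4,8.1,294,80,31\n")
dtypes = [("crime", "U12"), ("cluster", float), ("Murder", float),
          ("Assault", float), ("UrbanPop", float), ("Rape", float)]
arr = np.genfromtxt(StringIO(rows), delimiter=",", dtype=dtypes)

# Boolean mask built from one named column, applied to another.
high = arr["crime"][arr["Murder"] >= 10]

print(high)                  # ['Alabama' 'Alaska']
print(arr["Assault"].max())  # 294.0
```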

Missing values

We can tell NumPy which strings mark a missing value using missing_values, and which value to plug in for them, like -1, using filling_values. The default fill for floats is np.nan. For int it is -1.

For example, suppose the Murder value for Arizona is blank:

Alaska,4,10,263,48,44.5
Arizona,4, ,294,80,31
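
A minimal sketch of the fill behavior, assuming an input where Arizona's Murder field has been left empty on purpose. The state-name column is skipped with usecols so the default float dtype applies to every remaining column.

```python
import numpy as np
from io import StringIO

# Arizona's Murder field is empty in this made-up variant of the data.
rows = "Alaska,4,10,263,48,44.5\nArizona,4,,294,80,31\n"
cols = (1, 2, 3, 4, 5)  # skip the state-name column

default_fill = np.genfromtxt(StringIO(rows), delimiter=",", usecols=cols)
custom_fill = np.genfromtxt(StringIO(rows), delimiter=",", usecols=cols,
                            filling_values=-1)

print(np.isnan(default_fill[1, 1]))  # True
print(custom_fill[1, 1])             # -1.0
```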

That concludes this tutorial.


About the author

Walker Rowe

Walker Rowe is an American freelance tech writer and programmer living in Cyprus. He writes tutorials on analytics and big data and specializes in documenting SDKs and APIs. He is the founder of the Hypatia Academy Cyprus, an online school to teach secondary school children programming.