Machine Learning & Big Data Blog

MongoDB Aggregate Functions Explained

3 minute read
Walker Rowe

This is the second part of the tutorial on how to use NodeJS with MongoDB. Here we switch to using the regular MongoDB shell and commands to make the study of aggregate functions simpler.

To show how to use aggregate functions, we will first explain how to do basic queries. Then we will show how to do the WordCount program, which is what people start with when they are first learning, for example, Apache Spark.

(This article is part of our MongoDB Guide. Use the right-hand menu to navigate.)

Basically, there are two aggregate functions: aggregate and MapReduce. Aggregate functions are the same as the familiar SQL command:

SQL select xxx, count (xxx) from table group by xxx

The aggregate functions are count, sum, average, min, max, etc.

The data that we used in the first part was smoker survey date. It has a complex structure. Recall that the schema is this:

var schema = new mongoose.Schema({
cachedContents : {
largest : String,
non_nill : Number,
nill : Number,
tip : [{
item : String,
count: Number
}] ,
smallest : String,
format : {
displayStyle : String,
align : String
}
}
});

And the sample data looks like this:

"_id" : ObjectId("59e9a6144bde260fa07e68f1"),
"cachedContents" : {
"nill" : 0,
"non_nill" : 14069,
"largest" : "WY",
"tip" : [
{
"item" : "NC",
"count" : 20,
"_id" : ObjectId("59e9a6144bde260fa07e6905")
},
{
"item" : "ND",
"count" : 19,
"_id" : ObjectId("59e9a6144bde260fa07e6904")
},
{
"item" : "OH",
"count" : 18,
"_id" : ObjectId("59e9a6144bde260fa07e6903")
},
…. (shortened)…..
{
"item" : "WY",
"count" : 2,
"_id" : ObjectId("59e9a6144bde260fa07e68f3")
},
{
"item" : "IN",
"count" : 1,
"_id" : ObjectId("59e9a6144bde260fa07e68f2")
}
]
},
"__v" : 0
}

That is complicated to work with, since it has an array of embedded documents: tip:[{item, count}]. So we will first show how to query this data, then we will make simpler data to show the aggregate and mapreduce functions.

Simple Queries

A simple find query will output all fields. Like this:

db.smokersmodels.find({ "cachedContents.largest" : "2016" }).pretty()
"_id" : ObjectId("59e9a6144bde260fa07e6fda"),
"cachedContents" : {
"nill" : 0,
"non_nill" : 14069,
"largest" : "2016",
"tip" : [
{
"item" : "2014-2015",
"count" : 20,
"_id" : ObjectId("59e9a6144bde260fa07e6fe5")
},
… (and so forth) ….

Notice that we use pretty() to format the output to JSON so that it is easier to read.

Notice too that we put quote marks around the field names. If you look at the instructions from MongoDB they leave then off. But that causes a syntax error when we use the mongo shell. There seems to be some confusion about that if you look at stackoverflow. We are using MongoDB version mongodb-linux-x86_64-debian81-3.4.9.

And we use the dot notation like cachedContents.largest to pull fields nested in our document.

To output specific fields we put them after the query and put the number 1 to indicate which ones to print. Notice here how the output looks when we leave off pretty().

db.smokersmodels.find({ "cachedContents.largest" : "2016" } , { "cachedContents.largest" : 1 })
{ "_id" : ObjectId("59e9a6144bde260fa07e688c"), "cachedContents" : { "largest" : "2016" } }
{ "_id" : ObjectId("59e9a6144bde260fa07e68ad"), "cachedContents" : { "largest" : "2016" } }
{ "_id" : ObjectId("59e9a6144bde260fa07e68e5"), "cachedContents" : { "largest" : "2016" } }
{ "_id" : ObjectId("59e9a6144bde260fa07e693d"), "cachedContents" : { "largest" : "2016" } }
{ "_id" : ObjectId("59e9a6144bde260fa07e69c3"), "cachedContents" : { "largest" : "2016" } }

Simple Aggregation

Use the distinct operation to do simple aggregation:

db.smokersmodels.distinct("cachedContents.largest")
[
"2016",
"WY",
"Wyoming",
"Tobacco Use – Survey Data",
"Cigarette Use (Adults)",
"Smoking Status",
"BRFSS",
"Some Days",
"%",
"Percentage",
"87.9",
"*",
"Data in these cells have been suppressed because of a small sample size",
"14.9",
"85.4",
"90.4",
"39839",
"Overall",
"White",
"All Ages",
"All Grades",
"BEH",
"101BEH"
]

Word Count

Now we make some simpler data to illustrate how to do the WordCount program with MongoDB.

First, create some data:

db.words.insertMany( [
{ word: "the" },
{ word: "rain" },
{ word: "in" },
{ word: "spain" },
{ word: "spain" }
]);

They print it out to see what it looks like:

db.words.find().pretty()
{ "_id" : ObjectId("59eb051e878eb4d6aca74243"), "word" : "the" }
{ "_id" : ObjectId("59eb051e878eb4d6aca74244"), "word" : "rain" }
{ "_id" : ObjectId("59eb051e878eb4d6aca74245"), "word" : "in" }
{ "_id" : ObjectId("59eb051e878eb4d6aca74246"), "word" : "spain" }
{ "_id" : ObjectId("59eb051e878eb4d6aca74247"), "word" : "spain" }

The trick with the WordCount program has always been to make the (key,value) pairs (word, 1) and then sum the number 1:

db.words.aggregate([ {$group: { _id: "$word", cnt: {$sum: 1} } } ])
{ "_id" : "in", "cnt" : 1 }
{ "_id" : "rain", "cnt" : 1 }
{ "_id" : "spain", "cnt" : 2 }
{ "_id" : "the", "cnt" : 1 }

As you can see, the word spain occurs 2 times.

MapReduce

Doing this with mapReduce basically does the same thing. We first create (key,value) pairs (word, 1) using the emit function. Then we use the JavaScript Array.sum function to sum the number 1.

db.words.mapReduce(
function() { emit(this.word,1); },
   function(key, values) {return Array.sum(values)}, {
query:{},
   } out:"total_matches"
).find()
{ "_id" : "in", "value" : 1 }
{ "_id" : "rain", "value" : 1 }
{ "_id" : "spain", "value" : 2 }
{ "_id" : "the", "value" : 1 }

The result is the same.

Additional Resources

Free e-book: The Beginner’s Guide to MongoDB

MongoDB is the most popular NoSQL database today and with good reason. This e-book is a general overview of MongoDB, providing a basic understanding of the database.


These postings are my own and do not necessarily represent BMC's position, strategies, or opinion.

See an error or have a suggestion? Please let us know by emailing blogs@bmc.com.

Business, Faster than Humanly Possible

BMC empowers 86% of the Forbes Global 50 to accelerate business value faster than humanly possible. Our industry-leading portfolio unlocks human and machine potential to drive business growth, innovation, and sustainable success. BMC does this in a simple and optimized way by connecting people, systems, and data that power the world’s largest organizations so they can seize a competitive advantage.
Learn more about BMC ›

About the author

Walker Rowe

Walker Rowe is an American freelancer tech writer and programmer living in Cyprus. He writes tutorials on analytics and big data and specializes in documenting SDKs and APIs. He is the founder of the Hypatia Academy Cyprus, an online school to teach secondary school children programming. You can find Walker here and here.