MongoDB vs Cassandra: NoSQL Databases Compared

Walker Rowe

Here we make a side-by-side comparison of MongoDB and Cassandra. We provide examples with commands and code, not just a narrative explanation.

In sum, Cassandra is a modern take on the relational database, albeit one where data is organized by column family rather than strictly by row, for fast retrieval. MongoDB stores records as JSON-like documents (held internally in a binary form called BSON). It has a JavaScript shell and a rich set of functions that make it easy to work with. Both systems are designed to scale enormously.

Data Structure

Cassandra is a column-oriented database. MongoDB stores records in JSON format. The MongoDB shell also supports JavaScript, so you can build up queries, data conversions, and manipulations in steps, saving each operation in a JavaScript variable.

A JSON data record is self-describing, because the field name and the data value are stored in the same place, i.e., inside the document. While a schema is not required with MongoDB, as JSON by definition does not need one, you can define one, for example with the Mongoose library for Node.js:

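// Mongoose is an object-modeling library for MongoDB on Node.js.
const mongoose = require('mongoose');

// Each field name maps to the type its values must have.
var schema = new mongoose.Schema({
  cachedContents : {
    largest : String,
    non_null : Number,
    null : Number,
    // An array of sub-documents, each with an item and a count.
    top : [{
      item : String,
      count : Number
    }],
    smallest : String,
    format : {
      displayStyle : String,
      align : String
    }
  }
});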
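
To make "self-describing" concrete, here is a minimal sketch in plain JavaScript with no database involved. The sample records and the helper name fieldNames are made up for illustration. Two records meant for the same collection carry different fields, and each can still be interpreted on its own.

```javascript
// Two hypothetical records for the same collection. Neither relies on an
// external schema: every value travels together with its field name.
const records = [
  { isbn: "1234", title: "Bible" },
  { isbn: "5678", title: "Atlas", copies: 3, tags: ["maps", "reference"] }
];

// Hypothetical helper: list the field names found in one record.
function fieldNames(record) {
  return Object.keys(record);
}

// Serializing to JSON text and parsing it back keeps names and values.
const roundTripped = JSON.parse(JSON.stringify(records[1]));

console.log(fieldNames(records[0]));
console.log(fieldNames(roundTripped));
```

The second record has two fields the first lacks, which is exactly what a schema-less collection allows.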

In this Introduction to Cassandra, we explain that Cassandra tables use a schema, like a traditional relational database. But Cassandra does away with the notion of database normalization. Oracle administrators would say that the Cassandra schema is flat. The reason is that storage across a network of commodity servers is cheap, so there is no need for joins, which slow down data retrieval. Instead, redundancy is OK. Also, Cassandra tables do not require every field to be populated, so there is no overhead of storing empty data.
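
The trade-off can be sketched in plain JavaScript. This is only an illustration of the idea, not how Cassandra stores data internally; the arrays and function names are hypothetical. The normalized layout needs a join on every read, while the flat layout repeats the customer name in each row and reads with a simple filter.

```javascript
// Normalized layout: two "tables" that must be joined at read time.
const customers = [{ id: 123, name: "Greenville Hardware" }];
const sales = [
  { customerId: 123, item: "tape", amount: 100 },
  { customerId: 123, item: "wire", amount: 200 }
];

// The join a relational database would perform for every read.
function readWithJoin(customerId) {
  const customer = customers.find((c) => c.id === customerId);
  return sales
    .filter((s) => s.customerId === customerId)
    .map((s) => ({ name: customer.name, item: s.item, amount: s.amount }));
}

// Flat, denormalized layout: the customer name is repeated in every row,
// so a read needs no join.
const salesByCustomer = [
  { customerId: 123, name: "Greenville Hardware", item: "tape", amount: 100 },
  { customerId: 123, name: "Greenville Hardware", item: "wire", amount: 200 }
];

function readFlat(customerId) {
  return salesByCustomer
    .filter((row) => row.customerId === customerId)
    .map(({ name, item, amount }) => ({ name, item, amount }));
}
```

Both functions return the same rows; the flat version pays for its simpler read with duplicated data.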

Cassandra can also hold nested, JSON-like data. Here a user-defined type stores a list of sales inside each customer row:

use json;
CREATE type json.sale ( id int, item text, amount int );
CREATE TABLE json.customers ( id int PRIMARY KEY, name text, balance int, sales list<frozen<sale>> );
INSERT INTO json.customers (id, name, balance, sales) 
VALUES (123, 'Greenville Hardware', 700,
[{ id: 5544, item : 'tape', amount : 100},
{ id: 5545, item : 'wire', amount : 200}]) ;
select * from customers;
id  | balance | name                | sales
-----+---------+---------------------+--------------------------------------------------------------------------------
123 |     700 | Greenville Hardware | [{id: 5544, item: 'tape', amount: 100}, {id: 5545, item: 'wire', amount: 200}]

Adding Data

Cassandra:

CREATE TABLE Library.book (       
ISBN text, 
copy int, 
title text,  
PRIMARY KEY (ISBN, copy)
);
INSERT INTO  Library.book (ISBN, copy, title) VALUES('1234',1, 'Bible');

MongoDB:

db.collection.insertOne( { isbn: 100 } )
{
"acknowledged" : true,
"insertedId" : ObjectId("5c4493aa750820eae9756a15")
}

Create Index

Cassandra:

CREATE TABLE  Library.patron (      
ssn int PRIMARY KEY,  
checkedOut set<text>
);
INSERT INTO Library.patron (ssn, checkedOut) values (123,{'1234','5678'});
create index on Library.patron (checkedOut);

MongoDB:

db.address.createIndex( { "location": "2dsphere"} )

Clustering

Separate tutorials explain how to Set Up a Cassandra Cluster and cover MongoDB Cluster Installation.

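A MongoDB cluster is built from mongod daemons playing different roles: shards, which hold the data (each typically a replica set with one primary and several secondaries), and config servers, which hold the cluster metadata. The query server, aka router (mongos daemon), determines to which server to send queries and data operations, like adding a record.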

Cassandra has no configuration server to control the operation of other servers. There is no master-slave relationship. Instead there is a ring of servers, each serving equal functions, albeit storing different parts (i.e., shards) of the data.
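
A rough sketch of the ring idea in plain JavaScript follows. The node names, token values, and function names are hypothetical, and the hash is a 32-bit FNV-1a stand-in; Cassandra's default partitioner uses Murmur3. The point is that any node can work out which peer owns a key from the hash alone, without asking a coordinator.

```javascript
// Stand-in hash (32-bit FNV-1a), used here only for illustration.
function hashToken(key) {
  let h = 0x811c9dc5;
  for (let i = 0; i < key.length; i++) {
    h ^= key.charCodeAt(i);
    h = Math.imul(h, 0x01000193) >>> 0; // keep the value unsigned
  }
  return h;
}

// Hypothetical four-node ring in token order. Each node owns the tokens
// after the previous node's token, up to and including its own.
const ring = [
  { node: "node-a", token: 0x3fffffff },
  { node: "node-b", token: 0x7fffffff },
  { node: "node-c", token: 0xbfffffff },
  { node: "node-d", token: 0xffffffff }
];

// Find the first node whose token is at or past the key's token,
// wrapping around to the first node if none is found.
function ownerOf(partitionKey) {
  const token = hashToken(partitionKey);
  const owner = ring.find((entry) => token <= entry.token);
  return (owner || ring[0]).node;
}

console.log(ownerOf("1234"));
```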

Sharding

Sharding means distributing data across a cluster. This is done by applying an algorithm to part or all of a document or index. MongoDB and Cassandra both provide a fine level of control over this.

Cassandra does this with the partition key, composite keys, and clustering columns, and by using tokens to distribute data. In brief, each table requires a primary key that uniquely identifies a row. The first field listed in the primary key is the partition key; its hashed value (the token) determines which node stores the data. If the first several fields are wrapped in their own set of parentheses, they together form a composite partition key. Any fields listed after the partition key are called clustering columns. These store data in ascending or descending order within the partition for the fast retrieval of similar values. All the fields together make up the primary key.
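
As an illustration, take the Library.book table from the Adding Data section, whose primary key is (ISBN, copy): ISBN is the partition key and copy is the clustering column. The sketch below mimics that layout in plain JavaScript; the sample rows and the function name buildPartitions are made up, and real Cassandra storage is considerably more involved.

```javascript
// Hypothetical rows for Library.book with PRIMARY KEY (ISBN, copy).
const rows = [
  { isbn: "1234", copy: 2, title: "Bible" },
  { isbn: "5678", copy: 1, title: "Atlas" },
  { isbn: "1234", copy: 1, title: "Bible" }
];

// Group rows by partition key, then keep each partition sorted by the
// clustering column (ascending in this sketch).
function buildPartitions(allRows) {
  const partitions = new Map();
  for (const row of allRows) {
    if (!partitions.has(row.isbn)) partitions.set(row.isbn, []);
    partitions.get(row.isbn).push(row);
  }
  for (const partition of partitions.values()) {
    partition.sort((a, b) => a.copy - b.copy);
  }
  return partitions;
}

const partitions = buildPartitions(rows);
console.log(partitions.get("1234").map((row) => row.copy));
```

All copies of one ISBN end up together and in order, which is why a query for a single partition can be answered by one node with a sequential read.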

And here MongoDB sharding is explained. Basically, you assign different parts of the data to different servers using a shard key, which is backed by an index. For example, records whose key value is customers could be on one set of servers and those with vendors on another. But if you want a completely random distribution, then you use a hashed value of the key. You can also assign data to servers using a range of values.

sh.enableSharding("tobacco")
{ "ok" : 1 }
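
The difference between ranged and hashed distribution can be sketched in plain JavaScript. Everything here is hypothetical: the shard names, the split points, and the hash, which is a simple stand-in rather than MongoDB's own. Real MongoDB also splits hashed values into chunk ranges instead of taking a modulo, so treat this as an illustration of the principle only.

```javascript
// Hypothetical shard names, for illustration only.
const shards = ["shard0", "shard1", "shard2"];

// Ranged sharding: contiguous ranges of the key live together.
// Keys below "h" go to shard0, keys below "q" to shard1, the rest to shard2.
function rangedShard(key) {
  if (key < "h") return shards[0];
  if (key < "q") return shards[1];
  return shards[2];
}

// Simple stand-in hash over the characters of the key.
function simpleHash(key) {
  let h = 0;
  for (let i = 0; i < key.length; i++) {
    h = (Math.imul(h, 31) + key.charCodeAt(i)) >>> 0;
  }
  return h;
}

// Hashed sharding: the hash picks the shard, so neighbouring keys scatter.
function hashedShard(key) {
  return shards[simpleHash(key) % shards.length];
}

console.log(rangedShard("customers"), rangedShard("vendors"));
console.log(hashedShard("customers"), hashedShard("vendors"));
```

Ranged placement keeps related keys on the same servers, which suits range queries; hashed placement spreads load more evenly at the cost of that locality.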

Data Consistency through Replication

Replication means making more than one copy of data to protect against data loss due to hardware or other failure.

MongoDB uses a concept called replica sets, configured with rs.initiate:

rs.initiate()
{
"info2" : "no configuration specified. Using a default configuration for the set",
ConfigReplSet:SECONDARY> rs.status()
{
"set" : "ConfigReplSet",

We explain Configuring Apache Cassandra Data Consistency. In Cassandra, you configure replication when you create the keyspace:

CREATE KEYSPACE Library
WITH REPLICATION = { 'class' : 'SimpleStrategy', 'replication_factor' : 3 };
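
With SimpleStrategy, the first replica goes to the node that owns the partition and the remaining replicas go to the next nodes clockwise around the ring. A minimal sketch of that placement rule in plain JavaScript, with a hypothetical five-node ring and a made-up function name:

```javascript
// Hypothetical five-node ring, listed in token order.
const ringNodes = ["node-a", "node-b", "node-c", "node-d", "node-e"];

// Place replicas on the owning node and then on the following nodes,
// wrapping around the end of the ring. The replica count is capped at the
// number of nodes so no node is listed twice.
function replicasFor(ownerIndex, replicationFactor) {
  const count = Math.min(replicationFactor, ringNodes.length);
  const replicas = [];
  for (let i = 0; i < count; i++) {
    replicas.push(ringNodes[(ownerIndex + i) % ringNodes.length]);
  }
  return replicas;
}

// A partition owned by node-d with replication_factor 3.
console.log(replicasFor(3, 3));
```

With a replication factor of 3, any single node can fail and two copies of each partition remain available.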

Support for GEO Operations

MongoDB has superior support for storing and querying geographical data. One obvious reason is that the universal format for such data is GeoJSON, which fits naturally into MongoDB's JSON documents. Take a look at Track Tweets by Geographic Location and Query MongoDB by GeoLocation.

Here is a MongoDB query to find records near a certain longitude and latitude:

db.address.find ({
location: {
$near: {
$geometry: {
type: "Point" ,
coordinates: [ -72.7738706,41.6332836 ]
},
$maxDistance: 4,
$minDistance: 0
}
}
})
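
With GeoJSON points, $minDistance and $maxDistance are expressed in meters, so the query above only matches records within 4 meters of the point. To get a feel for what such a distance filter computes, here is a rough sketch in plain JavaScript using the haversine formula. The function names and the Earth radius of 6,371,000 m are assumptions of this sketch, not MongoDB's internal calculation.

```javascript
// Great-circle distance in meters between two [longitude, latitude] pairs,
// using the haversine formula and an assumed mean Earth radius.
function distanceMeters([lon1, lat1], [lon2, lat2]) {
  const toRad = (deg) => (deg * Math.PI) / 180;
  const dLat = toRad(lat2 - lat1);
  const dLon = toRad(lon2 - lon1);
  const a =
    Math.sin(dLat / 2) ** 2 +
    Math.cos(toRad(lat1)) * Math.cos(toRad(lat2)) * Math.sin(dLon / 2) ** 2;
  return 2 * 6371000 * Math.asin(Math.sqrt(a));
}

// Hypothetical filter mirroring the $minDistance / $maxDistance bounds.
function isNear(center, point, minDistance, maxDistance) {
  const d = distanceMeters(center, point);
  return d >= minDistance && d <= maxDistance;
}

const center = [-72.7738706, 41.6332836];
const aBitNorth = [-72.7738706, 41.6342836]; // 0.001 degrees of latitude away

console.log(distanceMeters(center, aBitNorth));
console.log(isNear(center, aBitNorth, 0, 4));
```

A shift of 0.001 degrees of latitude is already on the order of a hundred meters, so a 4 meter limit is a very tight search radius.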

And since the mongo shell supports JavaScript, you can save queries in variables:

var connecticut = db.address.find ({location:
{$geoWithin:
{$geometry: {
type: "Polygon",
coordinates:
[[ [ -73.564453125, 41.178653972331674 ],
[ -71.69128417968749, 41.178653972331674 ],
[ -71.69128417968749, 42.114523952464246 ],
[ -73.564453125, 42.114523952464246 ],
[ -73.564453125, 41.178653972331674 ]
]]
}}}});
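
What a polygon containment test does can be sketched in plain JavaScript with a ray-casting check. This version treats coordinates as a flat plane, which only approximates the spherical geometry a 2dsphere index works with; the function and variable names are hypothetical.

```javascript
// Ray-casting point-in-polygon test on plain [longitude, latitude] pairs.
// A ray is cast from the point; an odd number of edge crossings means
// the point is inside.
function pointInRing([x, y], ring) {
  let inside = false;
  for (let i = 0, j = ring.length - 1; i < ring.length; j = i++) {
    const [xi, yi] = ring[i];
    const [xj, yj] = ring[j];
    const crosses =
      (yi > y) !== (yj > y) &&
      x < ((xj - xi) * (y - yi)) / (yj - yi) + xi;
    if (crosses) inside = !inside;
  }
  return inside;
}

// The same rectangle used in the query above; the first and last
// points are identical because a GeoJSON ring must be closed.
const connecticutRing = [
  [-73.564453125, 41.178653972331674],
  [-71.69128417968749, 41.178653972331674],
  [-71.69128417968749, 42.114523952464246],
  [-73.564453125, 42.114523952464246],
  [-73.564453125, 41.178653972331674]
];

console.log(pointInRing([-72.7738706, 41.6332836], connecticutRing));
console.log(pointInRing([-75, 41.5], connecticutRing));
```

The first point is the one used in the $near query earlier and falls inside the rectangle; the second lies to the west of it.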

Stress Testing

We talk about memory management and performance tuning in Stress Testing and Performance Tuning Apache Cassandra and in MongoDB Memory Usage.

About the author

Walker Rowe

Walker Rowe is an American freelance tech writer and programmer living in Cyprus. He writes tutorials on analytics and big data and specializes in documenting SDKs and APIs. He is the founder of the Hypatia Academy Cyprus, an online school to teach secondary school children programming.