Machine Learning & Big Data Blog

Partition Key vs Composite Key vs Clustering Columns in Cassandra

Mongosh commands.
4 minute read
Walker Rowe

Here we explain the differences between partition key, composite key and clustering key in Cassandra.

In brief, each table requires a unique primary key. The first field listed is the partition key, since its hashed value is used to determine the node to store the data. If those fields are wrapped in parentheses then the partition key is composite. Otherwise the first field is the partition key. Any fields listed after the partition key are called clustering columns. These store data in ascending or descending order within the partition for the fast retrieval of similar values. All the fields together are the primary key.

We discussed partitioning data here.

(This article is part of our Cassandra Guide. Use the right-hand menu to navigate.)

Clustering columns

Clustering columns determines the order of data in partitions.

What is the reason for having clustering columns? The whole point of a column-oriented database like Cassandra is to put adjacent data records next to each other for fast retrieval.

Remember than in a regular rdbms database, like Oracle, each row stores all values, including empty ones. But in a column oriented database one row can have columns (a,b,c) and another (a,b) or just (a). So if we are only interested in the value a then why not store that in the same data center, rack, or drive for fast retrieval? Remember that SQL select statements create subsets. So the column-oriented approach makes the prime data structure a type of subset.

This approach makes logical sense since we are usually only interested in a part of the data at any one time. For example why retrieve employee tax IDs, salary, manager’s name, when we just want their name and phone number?

Examples

Let’s look at books. The ISBN is a serial number of a book used by publishers. They are supposed to be unique. But let’s suppose they do not need to be for these examples.

Create a books keyspace, table, and put some data into it. (We discussed keyspaces here.)

Note that the primary key is PRIMARY KEY (isbn, author, publisher). In this case isbn is the partition key and author and publisher are clustering keys. It would make sense that in a collection of books you would want to store them by author and then publisher.

CREATE KEYSPACE books
       WITH REPLICATION = { 'class' : 'NetworkTopologyStrategy', 'datacenter1' : 1};

CREATE TABLE books (
   isbn text,
   title text,
   author text,
  publisher text,
   category text,
   PRIMARY KEY (isbn, author, publisher)
);

insert into books (isbn, title, author, publisher, category) values ('111', 'Fishing', 'Fred', 'Penguin Group', 'Sports');

insert into books (isbn, title, author, publisher, category) values ('111', 'Sailing', 'Sally', 'Penguin Group', 'Sports');

insert into books (isbn, title, author, publisher, category) values ('111', 'Archery', 'Fred', 'Penguin Group', 'Sports');

insert into books (isbn, title, author, publisher, category) values ('111', 'Boating', 'Joe', 'Penguin Group', 'Sports');

Notice that all of the values in the primary key must be unique, so it dropped one record because author Fred wrote and published more than one book with published Penguin Group.

Now select the partition key and the primary key. They are all the same since we want them all stored on the same virtual node.

select token(isbn), isbn from books;

 system.token(isbn)  | isbn
---------------------+------
 -175843201295106731 |  111
 -175843201295106731 |  111
 -175843201295106731 |  111

Now select all records and notices that the data is sorted by author and then publisher within the partition key 111.

select * from books;

 isbn | author | publisher     | category | title
------+--------+---------------+----------+---------
  111 |   Fred | Penguin Group |   Sports | Archery
  111 |    Joe | Penguin Group |   Sports | Boating
  111 |  Sally | Penguin Group |   Sports | Sailing

Now add another record but give it a different primary key value, which could result it in being stored in a different partition. What virtual node it is stored on depends on the token range assigned to the virtual node.

insert into books (isbn, title, author, publisher, category) values ('333', 'Trees', 'Charles Darwin', 'Hachette', 'Nature');


insert into books (isbn, title, author, publisher, category) values ('333', 'Trees', 'Charles Darwin', 'Amazon', 'Nature');

Observe again that the data is sorted on the cluster columns author and publisher. And the token is different for the 333 primary key value.

Recall that the partitioner has function configured in cassandra.yaml calculated the hash value and then distributes the data based upon partitioner. The default is org.apache.cassandra.dht.Murmur3Partitioner

select token(isbn), isbn, author, publisher from books;

 system.token(isbn)  | isbn | author         | publisher
---------------------+------+----------------+---------------
 -175843201295106731 |  111 |           Fred | Penguin Group
 -175843201295106731 |  111 |            Joe | Penguin Group
 -175843201295106731 |  111 |          Sally | Penguin Group
 7036029452358700311 |  333 | Charles Darwin |        Amazon
 7036029452358700311 |  333 | Charles Darwin |      Hachette

Composite Keys

This is just a table with more than one column used in the calculation of the partition key. We denote that with parentheses like this: PRIMARY KEY ((isbn, author), publisher). In this case isbn and author are the partition key and publisher is a clustering key.

drop table books;

CREATE TABLE books (
   isbn text,
   title text,
   author text,
  publisher text,
   category text,
   PRIMARY KEY ((isbn, author), publisher)
);

insert into books (isbn, title, author, publisher, category) values ('111', 'Fishing', 'Fred', 'Penguin Group', 'Sports');

insert into books (isbn, title, author, publisher, category) values ('111', 'Sailing', 'Sally', 'Penguin Group', 'Sports');

insert into books (isbn, title, author, publisher, category) values ('111', 'Archery', 'Fred', 'Penguin Group', 'Sports');

insert into books (isbn, title, author, publisher, category) values ('111', 'Boating', 'Joe', 'Penguin Group', 'Sports');

Now to show the partition key value we use the SQL token function and give it both the isbn and author values:

select token(isbn,author), isbn from books;

 system.token(isbn, author) | isbn
----------------------------+------
         725505645253967381 |  111
         960809148155353310 |  111
        5462216525918432145 |  111

Add the same data as above with the insert SQL statements. Notice that adding this data also drops one book because one author wrote more than one book with the same ISBN. By definition the primary key must be unique. That includes clustering columns, since they are part of the primary key. All we have changed with the compound key is the calculation of the partition key and thus where the data is stored.

select * from books;

 isbn | author | publisher     | category | title
------+--------+---------------+----------+---------
  111 |   Fred | Penguin Group |   Sports | Archery
  111 |    Joe | Penguin Group |   Sports | Boating
  111 |  Sally | Penguin Group |   Sports | Sailing

Learn ML with our free downloadable guide

This e-book teaches machine learning in the simplest way possible. This book is for managers, programmers, directors – and anyone else who wants to learn machine learning. We start with very basic stats and algebra and build upon that.


These postings are my own and do not necessarily represent BMC's position, strategies, or opinion.

See an error or have a suggestion? Please let us know by emailing blogs@bmc.com.

Business, Faster than Humanly Possible

BMC empowers 86% of the Forbes Global 50 to accelerate business value faster than humanly possible. Our industry-leading portfolio unlocks human and machine potential to drive business growth, innovation, and sustainable success. BMC does this in a simple and optimized way by connecting people, systems, and data that power the world’s largest organizations so they can seize a competitive advantage.
Learn more about BMC ›

About the author

Walker Rowe

Walker Rowe is an American freelancer tech writer and programmer living in Cyprus. He writes tutorials on analytics and big data and specializes in documenting SDKs and APIs. He is the founder of the Hypatia Academy Cyprus, an online school to teach secondary school children programming. You can find Walker here and here.