How To Run a Multi-Node Cluster Database with Cassandra on Ubuntu 14.04


Apache Cassandra is a highly scalable open source database system, achieving great performance on multi-node setups.

Previously, we went over how to run a single-node Cassandra cluster. In this tutorial, you’ll learn how to install and use Cassandra to run a multi-node cluster on Ubuntu 14.04.


Because you’re about to build a multi-node Cassandra cluster, you must determine how many servers you’d like to have in your cluster and configure each of them. It is recommended, but not required, that they have the same or similar specifications.

To complete this tutorial, you’ll need the following:

Scylla release: version 1.0

The Scylla team is pleased to announce the release of Scylla 1.0 (GA), the first production ready Scylla release. Scylla is an open source, Apache-Cassandra-compatible NoSQL database, with superior performance and consistent low latency.

From now on, only critical bugs (showstoppers) will be fixed in branch-1.0.x. We will continue to fix bugs and add features on the master branch toward 1.1 and beyond. Follow-up minor releases (1.1, 1.2, etc.) will be time-based releases at the end of each month; Scylla 1.1 is due at the end of April.

Release 1.0 adds no new functionality over RC2; only showstopper bugs were fixed. A full list of contributions and known issues is available on the Scylla wiki. More on Scylla 1.0 status and compatibility with Cassandra here. More on the Scylla road map here. We invested a lot of effort in testing Scylla 1.0; if you do find any issue, please let us know.

Get started with Scylla 1.0 here. If you have any questions about the new release, please post to the scylladb-users mailing list.

Follow @ScyllaDB on Twitter or subscribe to this site’s RSS feed to keep up with future releases and news.


Druid is a fast, column-oriented, distributed data store.

Sub-Second Queries

Druid supports fast aggregations and sub-second OLAP queries. Druid is designed for multi-tenancy and is ideal for powering user-facing analytic applications.

Real-time Streams

Druid supports streaming data ingestion and offers insights on events immediately after they occur. Retain events indefinitely and unify real-time and historical views.

Scalable to Petabytes

Existing Druid clusters have scaled to petabytes of data and trillions of events, ingesting millions of events every second. Druid is extremely cost effective, even at scale.

Deploy Anywhere

Druid runs on commodity hardware. Deploy it in the cloud or on-premise. Integrate with existing data systems such as Hadoop, Spark, Kafka, Storm, and Samza.

Vibrant Community

Druid is a community-led project. Join the fast-growing community and work with developers from across the world.

KVM creators open-source fast Cassandra drop-in replacement Scylla

“Two key figures behind popular open-source hypervisor KVM are today unveiling a new NoSQL database that they describe as a far faster drop-in replacement for Apache Cassandra.

The Scylla database, from KVM inventor Avi Kivity and the man who oversaw the hypervisor’s development, Dor Laor, offers what they say is 10 times better throughput and latency than wide column store Cassandra, while maintaining complete compatibility.”…

“Scylla has been written in C++ 14 – together with the project’s Seastar programming model. The Seastar C++ application framework is designed for high concurrency server applications and described on GitHub as “an event-driven framework allowing you to write non-blocking, asynchronous code in a relatively straightforward manner”…

TITAN Distributed Graph Database

“Titan is a scalable graph database optimized for storing and querying graphs containing hundreds of billions of vertices and edges distributed across a multi-machine cluster. Titan is a transactional database that can support thousands of concurrent users executing complex graph traversals in real time.

In addition, Titan provides the following features:

Understanding the Impact of Cassandra Compact Storage

“At Librato, our primary data store for time-series metrics is Apache Cassandra built using a custom schema we’ve developed over time. We’ve written and presented on it several times in the past. We store both real-time metrics and historical rollup time-series in Cassandra. Cassandra storage nodes have the largest footprint in our infrastructure and hence drive our costs, so we are always looking for ways to improve the efficiency of our data model.

As part of our ongoing efficiency improvements and development of new backend functionality, we recently took the time to reevaluate our storage schema. Coming from the early days of Cassandra 0.8.x, our schema has always been built atop the legacy Thrift APIs, and whenever we stood up a new ring, we migrated it using the `nodetool` command. We’ve been closely following the development of CQL and had already moved parts of our read-path to the new native interface in 2.0.x. However, we wanted to take a closer look at fully constructing our schema migrations (creating the CQL tables, or “column families” as they were called) using the native CQL interface…”
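The "compact storage" of the title is a CQL-level option that maps a table onto the legacy Thrift storage layout. As a rough illustration only (the table and column names below are invented for this sketch, not Librato's actual schema), a compact-storage time-series table in Cassandra 2.x CQL might be declared like this:

```sql
-- Hypothetical time-series table using the legacy Thrift-compatible layout.
-- Omitting the WITH COMPACT STORAGE clause yields a standard CQL table instead.
CREATE TABLE metrics (
    metric_name text,
    ts timestamp,
    value double,
    PRIMARY KEY (metric_name, ts)
) WITH COMPACT STORAGE;
```

The trade-off being evaluated in the article is exactly this choice: the compact layout can be smaller on disk, but it restricts the schema (for example, limiting non-key columns), which is why a reevaluation against regular CQL tables is worthwhile.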

Building a Distributed Fault-Tolerant Key-Value Store

“First of all, let’s discuss briefly what a Key-Value store is, and how it compares to a relational database.

Key-Value Stores offer a simple abstraction over your data, working like a dictionary data structure. Such a database provides a mechanism for storage and retrieval of data that is modeled and manipulated by means of basic CRUD operations (create, read, update, delete). The API of these databases is usually kept simple, and even when they provide an SQL-like language, such as Cassandra's Query Language, it is intentionally kept much simpler than full-blown SQL.
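The dictionary-like abstraction described above can be sketched in a few lines. This is a minimal in-memory illustration of the CRUD interface, not any particular database's API; the class and method names are invented here:

```python
class KeyValueStore:
    """A minimal in-memory key-value store exposing the four CRUD operations."""

    def __init__(self):
        self._data = {}  # the underlying dictionary data structure

    def create(self, key, value):
        if key in self._data:
            raise KeyError(f"key {key!r} already exists")
        self._data[key] = value

    def read(self, key):
        return self._data[key]  # raises KeyError if the key is absent

    def update(self, key, value):
        if key not in self._data:
            raise KeyError(f"key {key!r} does not exist")
        self._data[key] = value

    def delete(self, key):
        del self._data[key]
```

Everything a real distributed store adds on top (replication, partitioning, persistence) sits behind this same small surface, which is what keeps the API simple.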


This simpler functionality means that Key-Value Stores, and NoSQL databases in general, often give more responsibility to the user, who now needs to do manually a lot of the work that a relational database takes care of automatically. They sacrifice the expressiveness of a full-blown language like SQL and the integrity checks provided by schema-based models. This in turn means that NoSQL systems are free to choose other trade-offs that result in higher availability, performance, scalability, or other specific qualities.

One important thing regarding RDBMS and NoSQL systems is their respective theoretical models, which establish the guarantees that such a system provides to the end user. They are known as the ACID and BASE models, in one of those fancy metaphors made out of acronyms.

So, why are NoSQL systems popular nowadays? Well, not without some controversy, but the main selling points could be summarized as:

  • Speed
  • Single Point of Failure (SPOF) avoidance
  • Better support for large amounts of unstructured data
  • Lower TCO (total cost of ownership; fewer sysadmins needed)
  • Incremental scalability

These are the sort of qualities that define Google, Facebook, and the other big players and their business models, so it makes perfect sense for them. Whether it makes sense for your particular situation (probably not) is the core of that controversy, and it’s not really the intention of this post to dig into that…”