Discord continues to grow faster than we expected, and so does our user-generated content. With more users come more chat messages. In July, we announced 40 million messages a day; in December, we announced 100 million; and as of this blog post we are well past 120 million. We decided early on to store all chat history forever so users can come back at any time and have their data available on any device. This is a large amount of data that is ever increasing in velocity and size, and it must remain available. How do we do it? Cassandra!
In an industry where so many people want to change the world, it’s fair to say that low-cost object storage has done just that. Building a business that requires flexible low-latency storage is now affordable in a way we couldn’t imagine before.
When building the Exoscale public cloud offering, we knew that a simple object storage service, protected by Swiss privacy laws, would be crucial. After looking at the existing object storage software projects, we decided to build our own solution: Pithos.
Pithos is an open source S3-compatibility layer for Apache Cassandra, the distributed column-oriented database. In other words, it allows you to use standard S3 tools to store objects in your own Cassandra cluster. If this is the first time that you’ve looked at object storage software, you may wonder why Pithos is built on top of a NoSQL database, but it’s not all that unusual.
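As a sketch of what “standard S3 tools” means in practice, a client such as s3cmd only needs its endpoint pointed at Pithos instead of Amazon. The endpoint and credentials below are hypothetical placeholders; substitute your own Pithos host and the access keys configured in your Pithos keystore:

```ini
; ~/.s3cfg -- excerpt; hypothetical values
access_key = YOUR_ACCESS_KEY
secret_key = YOUR_SECRET_KEY
host_base = s3.example.com
host_bucket = %(bucket)s.s3.example.com
use_https = True
```

With that in place, ordinary commands like `s3cmd mb s3://backups` or `s3cmd put file.txt s3://backups/` talk to the Cassandra-backed store rather than AWS.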
At Spotify we have over 60 million active users who have access to a vast music catalog of over 30 million songs. Our users have a choice to follow thousands of artists and hundreds of their friends and create their own music graph. On our service they also discover new and existing content by experiencing a variety of music promotions (album releases, artist promos), which get served over our ad platform. These options have empowered our users and made them really engaged. Over time they have created over 1.5 billion playlists, and just last year they streamed over 7 billion hours’ worth of music.
But at times an abundance of options has also made our users feel a bit lost. How do you find that right playlist for your workout from over a billion playlists? How do you discover new albums which are relevant to your taste? We help our users discover and experience relevant content by personalizing their experience on our platform.
Personalizing user experience involves learning their tastes and distastes in different contexts. A metal listener might not enjoy an announcement for a metal album while playing kid’s music to put their child to sleep at night; serving them a recommendation for a kid’s music album might be more relevant in that context. But this experience might not be right for another metal listener who doesn’t mind receiving metal album recommendations in any context. These two users with similar listening habits might have different preferences. Personalizing their experiences on Spotify according to their respective tastes in different contexts helps us keep them engaged.
Given these product insights, we set out to build a personalization system that could analyze both real-time and historic data to understand users’ context and behavior, respectively. Over time we’ve evolved our personalization tech stack, thanks to a flexible architecture, and ensured we used the right tools to solve the problem at scale.
Titan is a scalable graph database optimized for storing and querying graphs containing hundreds of billions of vertices and edges distributed across a multi-machine cluster. Titan is a transactional database that can support thousands of concurrent users executing complex graph traversals in real time.
In addition, Titan provides the following features:
- Elastic and linear scalability for a growing data and user base.
- Data distribution and replication for performance and fault tolerance.
- Multi-datacenter high availability and hot backups.
- Support for ACID and eventual consistency.
- Support for various storage backends:
  - Apache Cassandra
  - Apache HBase
  - Oracle BerkeleyDB
- Support for global graph data analytics, reporting, and ETL through integration with big data platforms:
  - Apache Spark
  - Apache Giraph
  - Apache Hadoop
- Support for geo, numeric range, and full-text search via:
  - ElasticSearch
  - Solr
  - Lucene
- Native integration with the TinkerPop graph stack:
  - Gremlin graph query language
  - Gremlin graph server
- Open source with the liberal Apache 2 license.
Deletes in Cassandra
Cassandra uses a log-structured storage engine. Because of this, deletes do not remove rows and columns immediately and in place. Instead, Cassandra writes a special marker, called a tombstone, indicating that a row, column, or range of columns was deleted. These tombstones are kept for at least the period of time defined by the gc_grace_seconds per-table setting. Only then can a tombstone be permanently discarded by compaction.
This scheme allows for very fast deletes (and writes in general), but it’s not free: aside from the obvious RAM/disk overhead of tombstones, you might have to pay a certain price when reading data back if you haven’t modelled your data well.
Specifically, tombstones will bite you if you do lots of deletes (especially column-level deletes) and later perform slice queries on rows with a lot of tombstones.
Symptoms of a wrong data model
To illustrate this scenario, let’s consider the most extreme case – using Cassandra as a durable queue, a known anti-pattern, e.g.
CREATE TABLE queues (
    name text,
    enqueued_at timeuuid,
    payload blob,
    PRIMARY KEY (name, enqueued_at)
);
Having enqueued 10000 10-byte messages and then dequeued 9999 of them, one by one, let’s peek at the last remaining message using cqlsh with TRACING ON:
SELECT enqueued_at, payload FROM queues WHERE name = 'queue-1' LIMIT 1;
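To see why that last read is expensive, here is a deliberately simplified sketch, in plain Python, of a log-structured partition where deletes append tombstone markers instead of removing data. This is a toy model, not the actual Cassandra storage engine, but it shows how a slice query for the one live message must first scan past every tombstone:

```python
# Toy model of a log-structured partition: deletes append tombstone
# markers (value None) rather than removing cells in place.
class ToyPartition:
    def __init__(self):
        self.cells = []  # append-only log of (clustering_key, value)

    def insert(self, key, value):
        self.cells.append((key, value))

    def delete(self, key):
        self.cells.append((key, None))  # tombstone marker

    def slice_first_live(self):
        """Merge the log by key (latest write wins), then scan in key
        order until a live cell is found, counting cells examined."""
        merged = {}
        for key, value in self.cells:
            merged[key] = value
        scanned = 0
        for key in sorted(merged):
            scanned += 1
            if merged[key] is not None:
                return key, merged[key], scanned
        return None, None, scanned

p = ToyPartition()
for i in range(10000):
    p.insert(i, b"0123456789")   # 10000 10-byte messages
for i in range(9999):
    p.delete(i)                  # one tombstone per dequeued message

key, value, scanned = p.slice_first_live()
# Finding the single live message required stepping over all 9999
# tombstoned keys first: scanned == 10000.
```

The `LIMIT 1` in the CQL query above hits the same wall: the result set is one row, but the read path still has to churn through thousands of tombstones to produce it.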
Apache Cassandra is a highly scalable open source database system, achieving great performance on multi-node setups.
Previously, we went over how to run a single-node Cassandra cluster. In this tutorial, you’ll learn how to install and use Cassandra to run a multi-node cluster on Ubuntu 14.04.
Because you’re about to build a multi-node Cassandra cluster, you must determine how many servers you’d like to have in your cluster and configure each of them. It is recommended, but not required, that they have the same or similar specifications.
To complete this tutorial, you’ll need the following:
- At least two Ubuntu 14.04 servers configured using this initial setup guide.
- Each server must be secured with a firewall using this IPTables guide.
- Each server must also have Cassandra installed by following this Cassandra installation guide.
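With the prerequisites in place, clustering is configured in each node’s `/etc/cassandra/cassandra.yaml`. A minimal sketch follows; the cluster name and IP addresses are placeholders for your own servers:

```yaml
# /etc/cassandra/cassandra.yaml -- excerpt; IPs are placeholders
cluster_name: 'My Cluster'
seed_provider:
  - class_name: org.apache.cassandra.locator.SimpleSeedProvider
    parameters:
      - seeds: "192.0.2.10,192.0.2.11"
listen_address: 192.0.2.10   # this node's own address; differs per node
rpc_address: 192.0.2.10
endpoint_snitch: GossipingPropertyFileSnitch
```

Every node lists the same seeds, while `listen_address` and `rpc_address` are set to each node’s own IP. After editing, restart Cassandra on each node and verify cluster membership with `nodetool status`.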
The Scylla team is pleased to announce the release of Scylla 1.0 (GA), the first production ready Scylla release. Scylla is an open source, Apache-Cassandra-compatible NoSQL database, with superior performance and consistent low latency.
From now on, only critical bugs (showstoppers) will be fixed in branch-1.0.x. We will continue to fix bugs and add features on the master branch toward 1.1 and beyond. Follow-up minor releases (1.1, 1.2, etc.) will be time-based releases at the end of each month; the Scylla 1.1 due date is the end of April.
Release 1.0 does not add new functionality over RC2; only showstopper bugs were fixed. A full list of contributions and known issues is available on the Scylla wiki. More on Scylla 1.0 status and compatibility with Cassandra here. More on the Scylla road map here. We invested a lot of effort in testing Scylla 1.0; if you do find any issue, please let us know.
Druid supports fast aggregations and sub-second OLAP queries. Druid is designed for multi-tenancy and is ideal for powering user-facing analytic applications.
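As an illustration of the kind of query Druid serves at interactive latency, a native timeseries query is a small JSON document. The datasource name, interval, and metric below are hypothetical:

```json
{
  "queryType": "timeseries",
  "dataSource": "pageviews",
  "granularity": "hour",
  "intervals": ["2016-01-01/2016-01-02"],
  "aggregations": [
    { "type": "count", "name": "events" },
    { "type": "longSum", "name": "bytes", "fieldName": "bytes" }
  ]
}
```

Posted to a Druid broker, a query like this rolls a day of events up into hourly aggregates, typically in well under a second.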
Druid supports streaming data ingestion and offers insights on events immediately after they occur. Retain events indefinitely and unify real-time and historical views.
Scalable to Petabytes
Existing Druid clusters have scaled to petabytes of data and trillions of events, ingesting millions of events every second. Druid is extremely cost effective, even at scale.
Druid runs on commodity hardware. Deploy it in the cloud or on-premise. Integrate with existing data systems such as Hadoop, Spark, Kafka, Storm, and Samza.
Druid is a community-led project. Join the fast-growing community and work with developers from across the world.