thoughts…

rants and bookmarks about programming stuff…


Why MongoDB is a bad choice for storing our scraped data

“MongoDB was used early on at Scrapinghub to store scraped data because it’s convenient. Scraped data is represented as (possibly nested) records which can be serialized to JSON. The schema is not known ahead of time and may change from one job to the next. We need to support browsing, querying and downloading the stored data. This was very easy to implement using MongoDB (easier than the alternatives available a few years ago) and it worked well for some time.

Usage has grown from a simple store for scraped data used on a few projects to the back end of our Scrapy Cloud platform. Now we are experiencing limitations with our current architecture and rather than continue to work with MongoDB, we have decided to move to a different technology (more in a later blog post). Many customers are surprised to hear that we are moving away from MongoDB; I hope this blog post helps explain why it didn’t work for us…”

http://blog.scrapinghub.com/2013/05/13/mongo-bad-for-scraped-data/


Redis as the primary data store? WTF?!

“Redis is an in-memory key-value data store typically used for caches and other such mechanisms to speed up web applications. We, however, store all our data in Redis as our primary database.

The web abounds with warnings and cautionary tales about going this route. There are horror stories about lost data, hitting memory limits, or people unable to effectively manage the data within Redis, so you might be wondering “What on earth were you thinking?!” So here is our story, why we decided to use Redis anyway, and how we overcame those issues.

First of all, I want to stress that most applications shouldn’t even worry about the engineering hurdles involved with going this route. It was important for our use case, but we may very well be an edge case…”

http://moot.it/blog/technology/redis-as-primary-datastore-wtf.html


RabbitMQ bindings for Lua

“The amqp.lua package adds support for sending messages to RabbitMQ via LuaJIT FFI. This allows a Lua programmer to communicate with other programs using enterprise grade messaging infrastructure. The module amqp.lua makes use of LuaJIT’s awesome foreign function interface (FFI) to invoke librabbitmq’s functions directly. It exposes a simplified interface to librabbitmq natively, but also preserves the ability of an intrepid programmer to use the full depth of the upstream library…”

https://github.com/cthulhuology/amqp.lua


bitmapist: Powerful realtime analytics with Redis 2.6’s bitmaps and Python

“bitmapist (GitHub) – a powerful realtime analytics library that can help you answer the following questions:

  • Has user 123 been online today? This week? This month?
  • Has user 123 performed action “X”?
  • How many users have been active this month? This hour?
  • How many unique users have performed action “X” this week?
  • What percentage of users that were active last week are still active?
  • What percentage of users that were active last month are still active this month?

This library is very easy to use and enables you to create your own reports easily…”
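To make the bitmap idea concrete, here is a minimal sketch in plain Python. It is not bitmapist's API and it does not talk to Redis: the class `BitmapEvents`, its method names, and the period labels are all made up for illustration. Each (event, period) pair gets one integer used as a bitmap, and bit N stands for user N, which mirrors what the Redis commands SETBIT, GETBIT, BITCOUNT and BITOP AND do on the server side.

```python
class BitmapEvents:
    """Hypothetical in-memory stand-in for Redis bitmaps (not bitmapist's API)."""

    def __init__(self):
        # One Python int per (event, period) key; bit N represents user N.
        self._bitmaps = {}

    def mark(self, event, period, user_id):
        # Same idea as Redis: SETBIT key user_id 1
        key = (event, period)
        self._bitmaps[key] = self._bitmaps.get(key, 0) | (1 << user_id)

    def contains(self, event, period, user_id):
        # Same idea as Redis: GETBIT key user_id
        return bool((self._bitmaps.get((event, period), 0) >> user_id) & 1)

    def count(self, event, period):
        # Same idea as Redis: BITCOUNT key
        return bin(self._bitmaps.get((event, period), 0)).count("1")

    def count_in_both(self, event, period_a, period_b):
        # Same idea as Redis: BITOP AND dest key_a key_b, then BITCOUNT dest
        a = self._bitmaps.get((event, period_a), 0)
        b = self._bitmaps.get((event, period_b), 0)
        return bin(a & b).count("1")


events = BitmapEvents()
events.mark("active", "2013-W19", 123)
events.mark("active", "2013-W19", 456)
events.mark("active", "2013-W20", 123)

was_active = events.contains("active", "2013-W19", 123)
active_last_week = events.count("active", "2013-W19")
still_active = events.count_in_both("active", "2013-W19", "2013-W20")
retention_percent = 100.0 * still_active / active_last_week
```

With the sample data above, two users were active in the first week and one of them came back in the second, so the retention figure works out to 50%. The appeal of the approach is that each question in the list becomes a single bit test, a bit count, or a bitwise AND.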

http://amix.dk/blog/post/19714


Data Placement in Swift

“One of the hard problems that needs to be solved in a distributed storage system is to figure out how to effectively place the data within the storage cluster. Swift has a “unique-as-possible” placement algorithm which ensures that the data is placed efficiently and with as much protection from hardware failure as possible.

Swift places data into distinct availability zones to ensure both high durability and high availability. An availability zone is a distinct set of physical hardware with unique failure mode isolation. In a large deployment, availability zones may be defined as unique facilities in a large data center campus. In a single-DC deployment, the availability zones may be unique rooms, separated by firewalls and powered with different utility providers. A multi-rack cluster may choose to define availability zones as a rack and everything behind a single top-of-rack switch. Swift allows a deployer to choose how to define availability zones based on the particular details of the available infrastructure…”
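As a rough illustration of what “unique-as-possible” means, the toy sketch below picks devices for the replicas of one object so that they land in as many distinct zones as it can, and only doubles up on a zone when it runs out of unused ones. This is a greedy simplification written for this post, not Swift's actual ring-building code; the function name `place_replicas` and the device and zone labels are invented.

```python
from collections import Counter


def place_replicas(devices, replicas=3):
    """Greedy toy placement (not Swift's real algorithm).

    devices: list of (device_id, zone) tuples.
    Returns the chosen device ids, spreading replicas over zones
    as evenly as the available devices allow.
    """
    chosen = []
    zone_use = Counter()          # how many replicas each zone already holds
    remaining = list(devices)
    for _ in range(min(replicas, len(remaining))):
        # Prefer the device whose zone holds the fewest replicas so far;
        # ties are broken by input order.
        best = min(remaining, key=lambda d: zone_use[d[1]])
        chosen.append(best[0])
        zone_use[best[1]] += 1
        remaining.remove(best)
    return chosen


three_zones = [("d1", "z1"), ("d2", "z1"), ("d3", "z2"), ("d4", "z3")]
two_zones = [("d1", "z1"), ("d2", "z1"), ("d3", "z2")]

spread = place_replicas(three_zones)
squeezed = place_replicas(two_zones)
```

With three zones available, each replica lands in its own zone (`d1`, `d3`, `d4`). With only two zones, the first two replicas still go to different zones and the third falls back to a second device in an already used zone.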

http://swiftstack.com/blog/2013/02/25/data-placement-in-swift/


From SimpleDB to Cassandra: Data Migration for a High Volume Web Application at Netflix

“There will come a time in the life of most systems serving data, when there is a need to migrate data to a more reliable, scalable and high performance data store while maintaining or improving data consistency, latency and efficiency. This document explains the data migration technique we used at Netflix to migrate the user’s queue data between two different distributed NoSQL storage systems…”

http://nosql.mypopescu.com/post/43387882910/from-simpledb-to-cassandra-data-migration-for-a-high

http://techblog.netflix.com/2013/02/netflix-queue-data-migration-for-high.html?m=1
