Automatically Archive Items to S3 Using DynamoDB Time to Live (TTL) with AWS Lambda and Amazon Kinesis Firehose

Earlier this year, Amazon DynamoDB released Time to Live (TTL) functionality, which automatically deletes expired items from your tables, at no additional cost. TTL eliminates the complexity and cost of scanning tables and deleting items that you don’t want to retain, saving you money on provisioned throughput and storage. One AWS customer, TUNE, purged 85 terabytes of stale data and reduced their costs by over $200K per year.

Today, DynamoDB made TTL better with the release of a new CloudWatch metric for tracking the number of items deleted by TTL, which is also viewable for no additional charge. This new metric helps you monitor the rate of TTL deletions to validate that TTL is working as expected. For example, you could set a CloudWatch alarm to fire if too many or too few automated deletes occur, which might indicate an issue in how you set expiration time stamps for your items.

In this post, I’ll walk through an example of a serverless application using TTL to  automate a common database management task: moving old data from your database into archival storage automatically. Archiving old data helps reduce costs and meet regulatory requirements governing data retention or deletion policies. I’ll show how TTL—combined with DynamoDB Streams, AWS Lambda, and Amazon Kinesis Firehose—facilitates archiving data to a low-cost storage service like Amazon S3, a data warehouse like Amazon Redshift, or to Amazon Elasticsearch Service.,Amazon_DynamoDB&linkId=50979940


Indexing Metadata in Amazon Elasticsearch Service Using AWS Lambda and Python

You can use Amazon S3 to implement a data lake architecture as the single source of truth for all your data. Taking this approach not only allows you to reliably store massive amounts of data but also enables you to ingest the data at a very high speed and do further analytics on it. Ease of analytics is important because as the number of objects you store increases, it becomes difficult to find a particular object—one needle in a haystack of billions.

Objects in S3 contain metadata that identifies those objects along with their properties. When the number of objects is large, this metadata can be the magnet that allows you to find what you’re looking for. Although you can’t search this metadata directly, you can employ Amazon Elasticsearch Service to store and search all of your S3 metadata. This blog post gives step-by-step instructions about how to store the metadata in Amazon Elasticsearch Service (Amazon ES) using Python and AWS Lambda.,Amazon_DynamoDB&linkId=50822242

Amazon Elasticsearch Service support for Elasticsearch 5.1

The Amazon Elasticsearch Service is a fully managed service that provides easier deployment, operation, and scale for the Elasticsearch open-source search and analytics engine. We are excited to announce that Amazon Elasticsearch Service now supports Elasticsearch 5.1 and Kibana 5.1.

Elasticsearch 5 comes with a ton of new features and enhancements that customers can now take advantage of in Amazon Elasticsearch service. Elements of the Elasticsearch 5 release are as follow:

  • Indexing performance: Improved Indexing throughput with updates to lock implementation & async translog fsyncing
  • Ingestion Pipelines: Incoming data can be sent to a pipeline that applies a series of ingestion processors, allowing transformation to the exact data you want to have in your search index. There are twenty processors included, from simple appending to complex regex applications
  • Painless scripting: Amazon Elasticsearch Service supports Painless, a new secure and performant scripting language for Elasticsearch 5. You can use scripting to change the precedence of search results, delete index fields by query, modify search results to return specific fields, and more.
  • New data structures: Lucene 6 data structures, new data types; half_float, text, keyword, and more complete support for dots-in-fieldnames
  • Search and Aggregations: Refactored search API, BM25 relevance calculations, Instant Aggregations, improvements to histogram aggregations & terms aggregations, and rewritten percolator & completion suggester
  • User experience: Strict settings and body & query string parameter validation, index management improvement, default deprecation logging, new shard allocation API, and new indices efficiency pattern for rollover & shrink APIs
  • Java REST client: simple HTTP/REST Java client that works with Java 7 and handles retry on node failure, as well as, round-robin, sniffing, and logging of requests
  • Other improvements: Lazy unicast hosts DNS lookup, automatic parallel tasking of reindex, update-by-query, delete-by-query, and search cancellation by task management API

Get Started with Amazon Elasticsearch Service: How Many Shards Do I Need?

Welcome to this introductory series on Elasticsearch and Amazon Elasticsearch Service (Amazon ES). In this and future blog posts, we provide the basic information that you need to get started with Elasticsearch on AWS.

How many shards?
Elasticsearch can take in large amounts of data, split it into smaller units, called shards, and distribute those shards across a dynamically changing set of instances. When you create an Elasticsearch index, you set the shard count for that index. Because you can’t change the shard count of an existing index, you have to make the decision on shard count before sending your first document. To begin, set the shard count based on your calculated index size, using 30 GB as a target size for each shard.

Number of Shards = Index Size / 30GB

To learn how to calculate your index size, see the blog post Get Started with Amazon Elasticsearch Service: How Many Data Instances Do I Need?

As you send data and queries to the cluster, continuously evaluate the resource usage and adjust the shard count based on the performance of the cluster.

What is a shard?
WhatIsShardA search engine has two jobs: Create an index from a set of documents, and search that index to compute the best, matching documents. If your index is small enough, a single data structure on a single machine can easily store that index. For larger document sets, in cases where a single machine is not large enough to hold the index, or in cases where a single machine can’t compute your search results, the index is split into pieces. These pieces are called shards in Elasticsearch. Each document is routed to a shard that is calculated, by default, by using a hash of that document’s ID.

A shard is both a unit of storage and a unit of computation. Elasticsearch deploys shards independently to the instances in the cluster to parallelize the storage and processing for the index. And it does this elastically (hence the “elastic” in the name “Elasticsearch”). If you add more instances to a cluster, Amazon Elasticsearch Service automatically rebalances the shards of the cluster, moving them between instances.

As storage, shards are distinct from one another. The document set in one shard doesn’t overlap the document set in any other shard. This approach makes shards independent for storage.

As computational units, shards are also distinct from one another. Each shard is an instance of an Apache Lucene index that computes results on the documents it holds. Because all of the shards comprise the index, they must function together to process each query and update request for that index. To process a query, Elasticsearch routes the query to all shards in the index. Each shard computes its response locally and then these responses are aggregated for the final response. To process a write request (a document addition or an update to an existing document), Elasticsearch routes the request to the appropriate shard.

Get Started with Amazon Elasticsearch Service: How Many Data Instances Do I Need?

Welcome to the first in a series of blog posts about Elasticsearch and Amazon Elasticsearch Service, where we will provide the information you need to get started with Elasticsearch on AWS.

How many instances will you need?
When you create an Amazon Elasticsearch Service domain, this is one of the first questions to answer.

Reindexing Data with Elasticsearch

Sooner or later, you’ll run into a problem of reindexing the data of your Elasticsearch instances. When we do Elasticsearch consulting for clients we always look at whether they have some way to efficiently reindex previously indexed data. The reasons for reindexing vary – from data type changes, analysis changes, to introduction of new fields that that need to be populated. No matter the case, you may either reindex from your source of truth or treat your Elasticsearch instance as such. Up to Elasticsearch 2.3 we had to use external tools to help us with this operation, like Logstash or stream2es. We even wrote about how to approach reindexing of data with Logstash. However, today we would like to look at the new functionality that will be added to Elasticsearch 2.3 – the re-index API.

The pre-requisites are quite low – you only need Elasticsearch 2.3 (not yet officially released as of this writing) and you need to be able to run a command on it. And that’s it, nothing more is needed and Elasticsearch will do the rest for us.