Recently we’ve needed to run queries on our Cloudfront web logs, for example, to determine the number of Googlebot hits to certain page types. We configured Cloudfront to write logs to an S3 bucket, and we set up an AWS Athena table to query those logs.
(By the way, if you haven’t used Athena before, I think it’s nothing short of amazing — the ability to use SQL to query fields from a bunch of gzipped text files sitting in S3? That’s incredible.)
But over time our queries became really slow. For every query, Athena had to scan the entire log history, reading through all the log files in our S3 bucket. This churned through a lot of data (about 120 GB) and made the queries slow and pretty expensive — taking over 65 seconds to run and costing about $0.70 per query.
So we did some searching and came across this AWS blog post, which describes how to set up Athena partitioning to vastly speed up date-based queries. We decided to use AWS Lambda to partition our logs, along with the Serverless framework to easily deploy the Lambda code.
AWS just announced the release of S3 Batch Operations. This is a hotly-anticpated release that was originally announced at re:Invent 2018. With S3 Batch, you can run tasks on existing S3 objects. This will make it much easier to run previously difficult tasks like retagging S3 objects, copying objects to another bucket, or processing large numbers of objects in bulk.
In this post, we’ll do a deep dive into S3 Batch. You will learn when, why, and how to use S3 Batch. First, we’ll do an overview of the key elements involved in an S3 Batch job. Then, we’ll walkthrough an example by doing sentiment analysis on a group of existing objects with AWS Lambda and Amazon Comprehend.
If you’ve ever tried to run operations on a large number of objects in S3, you might have encountered a few hurdles. Listing all files and running the operation on each object can get complicated and time consuming as the number of objects scales up. Many decisions have to be made: is running the operations from my personal computer fast enough? Or should I run it from a server that’s closer to the AWS resources, benefiting from AWS’s fast internal network? If so, I’ll have to provision resources (e.g. ec2 instance, lambda functions, containers, etc) to run the job.
Thankfully, AWS has heard our pains and announced AWS S3 Batch Operations preview during the last AWS Reinvent conference. This new service (which you can access by asking AWS politely) allows you to easily run operations on very large numbers of S3 objects in your bucket. Curious to know how it works? Let’s get going.
The mechanism for uploading files from a browser has been around since the early days of the Internet. In the server-full environment it’s very easy to use Django, Express, or any other popular framework. It’s not an exciting topic — until you experience the scaling problem.
Imagine this scenario — you have an application that uploads files. All is well until the site suddenly gains popularity. Instead of handling a gigabyte of uploads a month, usage grows to 100Gb an hour for the month leading up to tax day. Afterwards, the usage drops back down again for another year. This is exactly the problem we had to solve.
File uploading at scale gobbles up your resources — network bandwidth, CPU, storage. All this data is ingested through your web server(s), which you then have to scale — if you’re lucky this means auto-scaling in AWS, but if you’re not in the cloud you’ll also have to contend with the physical network bottleneck issues.
You can also face some difficult race conditions if your server fails in the middle of handling the uploaded file. Did the file make to its end destination? What was the state of the processing? It can be very hard to replay the steps to failure or know the state of transactions when the server is overloaded.
Fortunately, this particular problem turns out to be a great use case for serverless — as you can eliminate the scaling issues entirely. For mobile and web apps with unpredictable demand, you can simply allow the application to upload the file directly to S3. This has the added benefit of enabling an https endpoint for the upload, which is critical for keeping the file’s contents secure in transit.
All this sounds great — but how does this work in practice when the server is no longer there to do the authentication and intermediary legwork?
Corral is a MapReduce framework designed to be deployed to serverless platforms, like AWS Lambda. It presents a lightweight alternative to Hadoop MapReduce. Much of the design philosophy was inspired by Yelp’s mrjob — corral retains mrjob’s ease-of-use while gaining the type safety and speed of Go.
Corral’s runtime model consists of stateless, transient executors controlled by a central driver. Currently, the best environment for deployment is AWS Lambda, but corral is modular enough that support for other serverless platforms can be added as support for Go in cloud functions improves.
Corral is best suited for data-intensive but computationally inexpensive tasks, such as ETL jobs.
More details about corral’s internals can be found in this blog post.
Earlier this year, Amazon DynamoDB released Time to Live (TTL) functionality, which automatically deletes expired items from your tables, at no additional cost. TTL eliminates the complexity and cost of scanning tables and deleting items that you don’t want to retain, saving you money on provisioned throughput and storage. One AWS customer, TUNE, purged 85 terabytes of stale data and reduced their costs by over $200K per year.
Today, DynamoDB made TTL better with the release of a new CloudWatch metric for tracking the number of items deleted by TTL, which is also viewable for no additional charge. This new metric helps you monitor the rate of TTL deletions to validate that TTL is working as expected. For example, you could set a CloudWatch alarm to fire if too many or too few automated deletes occur, which might indicate an issue in how you set expiration time stamps for your items.
In this post, I’ll walk through an example of a serverless application using TTL to automate a common database management task: moving old data from your database into archival storage automatically. Archiving old data helps reduce costs and meet regulatory requirements governing data retention or deletion policies. I’ll show how TTL—combined with DynamoDB Streams, AWS Lambda, and Amazon Kinesis Firehose—facilitates archiving data to a low-cost storage service like Amazon S3, a data warehouse like Amazon Redshift, or to Amazon Elasticsearch Service.
Many companies are suffering data breaches because attackers gain access to data in AWS S3 buckets. I don’t want to repeat all the news articles outlining all the S3 data breaches. A Google search will give many examples, and it seems like by the time I write this another one will be in the news. Instead, I’d like to jump to why these S3 bucket breaches are happening and how to securely store data in an S3 bucket.
In my free time, I run a small blog that uses Amazon S3 to host static content and Amazon CloudFront to distribute it world-wide. I use a home-grown, static website generator to create and upload my blog content onto S3.
My blog uses two S3 buckets: one for staging and testing, and one for production. As a website owner, I want to update the production bucket with all changes from the staging bucket in a reliable and efficient way, without having to create and populate a new bucket from scratch. Therefore, to synchronize files between these two buckets, I use AWS Lambda and AWS Step Functions.
In this post, I show how you can use Step Functions to build a scalable synchronization engine for S3 buckets and learn some common patterns for designing Step Functions state machines while you do so.
s3-lambda enables you to run lambda functions over a context of S3 objects. It has a stateless architecture with concurrency control, allowing you to process a large number of files very quickly. This is useful for quickly prototyping complex data jobs without an infrastructure like Hadoop or Spark.
At Littlstar, we use
s3-lambda for all sorts of data pipelining and analytics.