Recently we’ve needed to run queries on our Cloudfront web logs, for example, to determine the number of Googlebot hits to certain page types. We configured Cloudfront to write logs to an S3 bucket, and we set up an AWS Athena table to query those logs.
(By the way, if you haven’t used Athena before, I think it’s nothing short of amazing — the ability to use SQL to query fields from a bunch of gzipped text files sitting in S3? That’s incredible.)
But over time our queries became really slow. For every query, Athena had to scan the entire log history, reading through all the log files in our S3 bucket. This churned through a lot of data (about 120 GB) and made the queries slow and pretty expensive — taking over 65 seconds to run and costing about $0.70 per query.
So we did some searching and came across this AWS blog post, which describes how to set up Athena partitioning to vastly speed up date-based queries. We decided to use AWS Lambda to partition our logs, along with the Serverless framework to easily deploy the Lambda code.