Reindexing Data with Elasticsearch

Sooner or later, you’ll run into the problem of reindexing the data in your Elasticsearch instances. When we do Elasticsearch consulting for clients, we always look at whether they have an efficient way to reindex previously indexed data. The reasons for reindexing vary – from data type changes and analysis changes to the introduction of new fields that need to be populated. No matter the case, you may either reindex from your source of truth or treat your Elasticsearch instance as such. Up to Elasticsearch 2.3 we had to use external tools to help us with this operation, like Logstash or stream2es. We even wrote about how to approach reindexing data with Logstash. Today, however, we would like to look at new functionality that will be added in Elasticsearch 2.3 – the reindex API.

The prerequisites are minimal – you only need Elasticsearch 2.3 (not yet officially released as of this writing) and the ability to run a command against it. That’s it; nothing more is needed, and Elasticsearch does the rest for us.
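As a rough sketch of what that single command looks like, the snippet below builds a `_reindex` request body and posts it to a local node. The index names (`old_index`, `new_index`) and the `localhost:9200` address are placeholder assumptions, not values from the article.

```python
import json
from urllib import request

# Minimal _reindex body: copy every document from "old_index" into
# "new_index" (both index names are made up for this sketch).
reindex_body = {
    "source": {"index": "old_index"},
    "dest": {"index": "new_index"},
}

def reindex(host="http://localhost:9200"):
    """POST the body to the _reindex endpoint of an Elasticsearch 2.3 node."""
    req = request.Request(
        host + "/_reindex",
        data=json.dumps(reindex_body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with request.urlopen(req) as resp:  # requires a running cluster
        return json.loads(resp.read().decode("utf-8"))
```

Elasticsearch reads the documents out of `old_index` and writes them into `new_index` server-side, which is what removes the need for external tools like Logstash or stream2es.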

Caching at Reddit

Performance matters. One of the first tools we as developers reach for when looking to get more performance out of a system is caching. As Reddit has grown in users and response times have improved, the amount of caching has grown to be quite large as well.

In this post we’ll talk about some of the nuts-and-bolts numbers of Reddit’s caching infrastructure—the number of instances, size of instances, and overall throughput. We hope that sharing this information may help others gauge what type of performance and sizing they can expect when building similar clusters. At the very least, we hope you’ll find it interesting to see a bit more about how Reddit works under the hood.

We’ll also go over the Reddit-specific type of work our caches do, how we use mcrouter to manage our caches more effectively, and the custom monitoring (MemcachedSlabCollector and mcsauna) we’ve written to help us understand what’s going on behind the scenes. We’ll also talk about some of the more subtle issues that we’ve run into when deploying changes to our caches.
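To make the caching pattern concrete, here is a toy cache-aside read in the style such a system uses; the in-process `TTLCache` class and the `comment_tree` key are illustrative stand-ins, not Reddit’s actual code or memcached itself.

```python
import time

class TTLCache:
    """Toy in-process stand-in for a memcached node, for illustration only."""
    def __init__(self):
        self._store = {}  # key -> (value, expires_at)

    def get(self, key):
        entry = self._store.get(key)
        if entry is None:
            return None
        value, expires_at = entry
        if time.monotonic() > expires_at:
            del self._store[key]
            return None
        return value

    def set(self, key, value, ttl=60):
        self._store[key] = (value, time.monotonic() + ttl)

cache = TTLCache()

def get_comment_tree(post_id, load_from_db):
    """Cache-aside read: try the cache first, fall back to the database."""
    key = "comment_tree:%s" % post_id
    tree = cache.get(key)
    if tree is None:
        tree = load_from_db(post_id)   # the expensive query we want to avoid
        cache.set(key, tree, ttl=300)
    return tree
```

The second read of the same key is served from the cache without touching the database, which is where the throughput gains come from.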


Monetize your APIs in AWS Marketplace using API Gateway

Amazon API Gateway helps you quickly build highly scalable, secure, and robust APIs. Today, we are announcing an integration of API Gateway with AWS Marketplace. You can now easily monetize your APIs built with API Gateway, market them directly to AWS customers, and reuse AWS bill calculation and collection mechanisms.

AWS Marketplace offers over 3,500 software listings across 35 product categories, with over 100K active customers. With the recent announcement of SaaS Subscriptions, API sellers can, for the first time, take advantage of the full suite of Marketplace features, including customer acquisition, unified billing, and reporting. For AWS customers, this means that they can now subscribe to API products through AWS Marketplace and pay on an existing AWS bill. This gives you direct access to the AWS customer base.

To get started, identify the API on API Gateway that you want to sell on AWS Marketplace. Next, package that API into usage plans. Usage plans allow you to set throttling limits and quotas for your APIs, giving you control over third-party usage. You can create multiple usage plans with different limits (e.g., Silver, Gold, Platinum) and offer them as different API products on AWS Marketplace.
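To illustrate the tiering idea, the sketch below defines three hypothetical plans using the throttle/quota shape that a usage plan carries (a steady request rate, a burst allowance, and a periodic quota). The tier names and all the numeric limits are invented for this example.

```python
# Hypothetical tier definitions mirroring the throttle/quota settings a
# usage plan holds; names and limits are made up for illustration.
USAGE_PLANS = {
    "Silver":   {"throttle": {"rateLimit": 5.0,   "burstLimit": 10},
                 "quota": {"limit": 10000,   "period": "MONTH"}},
    "Gold":     {"throttle": {"rateLimit": 50.0,  "burstLimit": 100},
                 "quota": {"limit": 100000,  "period": "MONTH"}},
    "Platinum": {"throttle": {"rateLimit": 500.0, "burstLimit": 1000},
                 "quota": {"limit": 1000000, "period": "MONTH"}},
}

def plan_for_tier(tier):
    """Return the throttle/quota settings to attach when creating the plan."""
    return USAGE_PLANS[tier]
```

Each entry would back one listing on AWS Marketplace, so customers subscribing to “Gold” get the higher rate and quota without any change to the underlying API.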

AWS Lambda – A Look Back at 2016

2016 was an exciting year for AWS Lambda, Amazon API Gateway, and serverless compute technology, to say the least. But just in case you have been hiding away and haven’t heard of serverless computing with AWS Lambda and Amazon API Gateway, let me introduce these great services to you. AWS Lambda lets you run code without provisioning or managing servers: it is an event-driven, serverless compute service that allows developers to bring their functions to the cloud easily for virtually any type of application or backend. Amazon API Gateway helps you quickly build highly scalable, secure, and robust APIs at scale and provides the ability to maintain and monitor the APIs you create.

With the momentum of serverless in 2016, of course, the year had to end with a bang as the AWS team launched some powerful service features at re:Invent to make it even easier to build serverless solutions.  These features include:

Amazon Simple Queue Service (SQS) Gains FIFO Queues

Amazon’s Simple Queue Service (SQS) recently gained FIFO (first-in, first-out) queues, which are designed to “guarantee that messages are processed exactly once, in the order that they are sent, and without duplicates”. AWS rolled out this new queue type in the US East (Ohio) and US West (Oregon) regions and “plans to make it available in many others in early 2017”.
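To show what those ordering and deduplication guarantees look like in practice, the sketch below assembles the parameters a FIFO send requires: messages sharing a `MessageGroupId` are delivered in order, and a `MessageDeduplicationId` (here a content hash) lets SQS drop duplicates. The queue URL and payload are invented for this example.

```python
import hashlib
import json

def build_fifo_send(queue_url, body, group_id):
    """Build the parameter set for sending a message to a FIFO queue.

    FIFO queue names must end in ".fifo"; the URL here is a made-up example.
    """
    serialized = json.dumps(body, sort_keys=True)
    return {
        "QueueUrl": queue_url,
        "MessageBody": serialized,
        # Ordering is guaranteed within a message group.
        "MessageGroupId": group_id,
        # Identical content yields the same ID, so retries are deduplicated.
        "MessageDeduplicationId": hashlib.sha256(
            serialized.encode("utf-8")).hexdigest(),
    }

params = build_fifo_send(
    "https://sqs.us-east-2.amazonaws.com/123456789012/orders.fifo",
    {"order_id": 42, "action": "ship"},
    group_id="customer-7",
)
```

Because the deduplication ID is derived from the message content, resending the same payload produces the same ID, which is one common way to get the “exactly once, without duplicates” behavior the new queue type promises.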

Amazon SQS is described as a “fast, reliable, scalable, fully managed message queuing service [designed to] decouple the components of a cloud application [and] transmit any volume of data, without losing messages or requiring other services to be always available”.

Introducing the ‘Startup Kit Serverless Workload’

“What’s the easiest way to get started on AWS?” is a common question. Although there are many well established paths to getting started, including using AWS Elastic Beanstalk, serverless computing is a rapidly growing alternative.

Serverless computing allows you to build and run applications and services without thinking about servers. On AWS, the AWS Lambda service is the central building block for serverless computing. AWS also provides several other services to support serverless architectures. These include Amazon API Gateway, which you can use with Lambda to create a RESTful API, and Amazon DynamoDB, a NoSQL cloud database service that frees you from the burden of setting up a database cluster.
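A minimal piece of such an architecture is the Lambda function sitting behind API Gateway. The sketch below is a handler in the proxy-integration event/response shape (a `pathParameters` map in, a `statusCode`/`body` dict out); the greeting logic is invented for this example.

```python
import json

def handler(event, context):
    """Minimal Lambda handler in the API Gateway proxy-integration shape:
    read a path parameter and return a JSON response."""
    name = (event.get("pathParameters") or {}).get("name", "world")
    return {
        "statusCode": 200,
        "headers": {"Content-Type": "application/json"},
        "body": json.dumps({"message": "hello " + name}),
    }
```

API Gateway maps an HTTP request like `GET /greet/{name}` onto the `event` dict and turns the returned dict back into an HTTP response, so no server process is ever managed by you.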

A completely serverless architecture is shown in the following diagram.



How to Steal an AI (reverse engineer machine-learning)

In a paper they released earlier this month titled “Stealing Machine Learning Models via Prediction APIs,” a team of computer scientists at Cornell Tech, the Swiss institute EPFL in Lausanne, and the University of North Carolina detail how they were able to reverse engineer machine learning-trained AIs based only on sending them queries and analyzing the responses. By training their own AI with the target AI’s output, they found they could produce software that was able to predict with near-100% accuracy the responses of the AI they’d cloned, sometimes after a few thousand or even just hundreds of queries.
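For the simplest model class the paper considers, extraction reduces to solving equations. The toy sketch below hides a linear scorer behind a query function and recovers its weights exactly from n + 1 confidence queries; the “secret” weights and the probing strategy are invented here to illustrate the idea, not taken from the paper’s experiments.

```python
# A hidden "target model": a linear scorer whose parameters we pretend
# not to know (the values are arbitrary for this demo).
_SECRET_W = [1.5, -2.0]
_SECRET_B = 0.25

def query_target(x):
    """The prediction API: returns the raw confidence score for input x."""
    return _SECRET_B + sum(w * xi for w, xi in zip(_SECRET_W, x))

def extract_linear_model(query, n_features):
    """Equation-solving extraction: with exact confidence outputs,
    n_features + 1 queries recover a linear model perfectly."""
    b = query([0.0] * n_features)       # the origin reveals the bias
    w = []
    for i in range(n_features):
        x = [0.0] * n_features
        x[i] = 1.0                      # probe one basis vector at a time
        w.append(query(x) - b)
    return w, b
```

Real targets return rounded probabilities and are nonlinear, which is why the paper’s attacks need hundreds to thousands of queries and train a substitute model rather than solving a clean linear system, but the black-box-in, parameters-out flavor is the same.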

“You’re taking this black box and through this very narrow interface, you can reconstruct its internals, reverse engineering the box,” says Ari Juels, a Cornell Tech professor who worked on the project. “In some cases, you can actually do a perfect reconstruction.”


Amazon States Language

This document describes a JSON-based language used to describe state machines declaratively. The state machines thus defined may be executed by software. In this document, the software is referred to as “the interpreter”.

Copyright © 2016 Amazon.com, Inc. or its Affiliates.

The operation of a state machine is specified by states, which are represented by JSON objects, fields in the top-level “States” object. In this example, there is one state named “Hello World”.

    "Comment": "A simple minimal example of the States language",
    "StartAt": "Hello World",
    "States": {
    "Hello World": { 
      "Type": "Task",
      "Resource": "arn:aws:lambda:us-east-1:123456789012:function:HelloWorld",
      "End": true

When this state machine is launched, the interpreter begins execution by identifying the Start State. It executes that state, and then checks to see if the state is marked as an End State. If it is, the machine terminates and returns a result. If the state is not an End State, the interpreter looks for a “Next” field to determine what state to run next; it repeats this process until it reaches a Terminal State (Succeed, Fail, or an End State) or a runtime error occurs.

In this example, the machine contains a single state named “Hello World”. Because “Hello World” is a Task State, the interpreter tries to execute it. Examining the value of the “Resource” field shows that it points to a Lambda function, so the interpreter attempts to invoke that function. Assuming the Lambda function executes successfully, the machine will terminate successfully.
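The interpreter loop described above can be sketched in a few lines. This toy version handles only the Task/Next/End subset shown in the example; instead of invoking a real Lambda function, it looks the `Resource` ARN up in a dictionary of local Python callables, which is an assumption made purely for the demo.

```python
def run_state_machine(machine, resources, data=None):
    """Toy interpreter for the Task/Next/End subset of the States language.

    `resources` maps Resource ARNs to local callables standing in for
    Lambda invocations.
    """
    state_name = machine["StartAt"]          # identify the Start State
    while True:
        state = machine["States"][state_name]
        if state["Type"] == "Task":
            data = resources[state["Resource"]](data)  # "invoke" the resource
        if state.get("End"):                 # End State: terminate with a result
            return data
        state_name = state["Next"]           # otherwise follow the Next field

machine = {
    "StartAt": "Hello World",
    "States": {
        "Hello World": {
            "Type": "Task",
            "Resource": "arn:aws:lambda:us-east-1:123456789012:function:HelloWorld",
            "End": True,
        }
    },
}

result = run_state_machine(
    machine,
    {"arn:aws:lambda:us-east-1:123456789012:function:HelloWorld":
         lambda _: "Hello, World!"},
)
```

Running it executes the single Task state and terminates, returning the stand-in function’s output, mirroring the walk-through in the paragraph above.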

A State Machine is represented by a JSON object.