Parsing 18 billion JSON lines with Go

At my employer Tjek we recently decided to rebuild our event pipeline and move it to Google BigQuery to reduce complexity in the stack and to remove some services that were no longer maintainable.

BigQuery offers a bunch of nice tools for querying and visualizing our data, which would enable a number of our internal teams to work directly with the data and share it with customers without having to make a request to the engineering department.

The new service that would replace the old HTTP service was easy to write, but then came the question of moving the historic data into BigQuery.

To migrate the old data, we had to backfill our entire dataset, accumulated over the last 10 years, into BigQuery, and this is the story of how it was done.

https://itnext.io/parsing-18-billion-lines-json-with-go-738be6ee5ed2

The Lesson to Unlearn

“The most damaging thing you learned in school wasn’t something you learned in any specific class. It was learning to get good grades.

When I was in college, a particularly earnest philosophy grad student once told me that he never cared what grade he got in a class, only what he learned in it. This stuck in my mind because it was the only time I ever heard anyone say such a thing.

For me, as for most students, the measurement of what I was learning completely dominated actual learning in college. I was fairly earnest; I was genuinely interested in most of the classes I took, and I worked hard. And yet I worked by far the hardest when I was studying for a test.

In theory, tests are merely what their name implies: tests of what you’ve learned in the class. In theory you shouldn’t have to prepare for a test in a class any more than you have to prepare for a blood test. In theory you learn from taking the class, from going to the lectures and doing the reading and/or assignments, and the test that comes afterward merely measures how well you learned…”

http://paulgraham.com/lesson.html

Introducing AWS Lambda Destinations

Today we’re announcing AWS Lambda Destinations for asynchronous invocations. This is a feature that provides visibility into Lambda function invocations and routes the execution results to AWS services, simplifying event-driven applications and reducing code complexity.

Asynchronous invocations

When a function is invoked asynchronously, Lambda sends the event to an internal queue. A separate process reads events from the queue and executes your Lambda function. When the event is added to the queue, Lambda previously only returned a 2xx status code to confirm that the queue has received this event. There was no additional information to confirm whether the event had been processed successfully.

A common event-driven microservices architectural pattern is to use a queue or message bus for communication. This helps with resilience and scalability. Lambda asynchronous invocations can put an event or message on Amazon Simple Notification Service (SNS), Amazon Simple Queue Service (SQS), or Amazon EventBridge for further processing. Previously, you needed to write the SQS/SNS/EventBridge handling code within your Lambda function and manage retries and failures yourself.

With Destinations, you can route asynchronous function results as an execution record to a destination resource without writing additional code. An execution record contains details about the request and response in JSON format including version, timestamp, request context, request payload, response context, and response payload. For each execution status such as Success or Failure you can choose one of four destinations: another Lambda function, SNS, SQS, or EventBridge. Lambda can also be configured to route different execution results to different destinations.

https://aws.amazon.com/blogs/compute/introducing-aws-lambda-destinations/
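
As a rough illustration of the execution record described above (not code from the AWS post), the following Go sketch decodes a record carrying the listed fields. The camelCase JSON key names, the `ExecutionRecord` type, the `parseExecutionRecord` helper, and the sample values are all assumptions made for this example; check the Lambda documentation for the exact schema.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// ExecutionRecord mirrors the fields the announcement lists for an
// execution record. The JSON key names are an assumption (camelCase
// forms of those field names), not a confirmed schema.
type ExecutionRecord struct {
	Version         string          `json:"version"`
	Timestamp       string          `json:"timestamp"`
	RequestContext  json.RawMessage `json:"requestContext"`
	RequestPayload  json.RawMessage `json:"requestPayload"`
	ResponseContext json.RawMessage `json:"responseContext"`
	ResponsePayload json.RawMessage `json:"responsePayload"`
}

// parseExecutionRecord (hypothetical helper) decodes one record.
// Context and payload fields stay as raw JSON so the caller can
// decode them into whatever types fit the application.
func parseExecutionRecord(data []byte) (ExecutionRecord, error) {
	var rec ExecutionRecord
	err := json.Unmarshal(data, &rec)
	return rec, err
}

// sampleRecord is hand-written illustrative data, not real Lambda output.
const sampleRecord = `{
  "version": "1.0",
  "timestamp": "2019-11-25T10:00:00Z",
  "requestContext": {"requestId":"example-id"},
  "requestPayload": {"orderId":42},
  "responseContext": {"statusCode":200},
  "responsePayload": {"status":"shipped"}
}`

func main() {
	rec, err := parseExecutionRecord([]byte(sampleRecord))
	if err != nil {
		fmt.Println("decode failed:", err)
		return
	}
	fmt.Println("version:", rec.Version)
	fmt.Println("request payload:", string(rec.RequestPayload))
	fmt.Println("response payload:", string(rec.ResponsePayload))
}
```

A function used as a destination could decode its incoming event along these lines and then branch on the response payload, but that wiring is outside the scope of this sketch.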

ID Card Digitization and Information Extraction using Deep Learning – A Review

In this article, we will discuss how any organisation can use deep learning to automate ID card information extraction, data entry and reviewing procedures to achieve greater efficiency and cut costs. We will review different deep learning approaches that have been used in the past for this problem, compare the results and look into the latest in the field. We will discuss graph neural networks and how they are being used for digitization.

While we will be looking at the specific use-case of ID cards, anyone who deals with documents of any form (invoices, receipts, etc.) and is interested in building a technical understanding of how deep learning and OCR can solve the problem will find the information useful.

https://nanonets.com/blog/id-card-digitization-deep-learning/

Go: Goroutine, OS Thread and CPU Management

Creating an OS thread or switching from one to another can be costly for your programs in terms of memory and performance. Go aims to take as much advantage of the cores as possible. It has been designed with concurrency in mind from the beginning.

M, P, G orchestration

To solve this problem, Go has its own scheduler to distribute goroutines over the threads. This scheduler defines three main concepts, as explained in the code itself:

The main concepts are:
G - goroutine.
M - worker thread, or machine.
P - processor, a resource that is required to execute Go code.
    M must have an associated P to execute Go code[...].

Here is a diagram of this PMG model:

[Diagram: P, M, G model]

Each goroutine (G) runs on an OS thread (M) that is assigned to a logical CPU (P). Let’s take a simple example to see how Go manages them…

https://medium.com/a-journey-with-go/go-goroutine-os-thread-and-cpu-management-2f5a5eaf518a

What is Serverless? The “2020” edition

Serverless (in some people’s minds) currently encompasses:

  • Anything that looks like “Function as a Service” like AWS Lambda, Google Cloud Functions, and Azure Functions
  • Anything that can run a Function as a Service system, like OpenFaaS and similar
  • Ok… lots of people think it’s a synonym for Function as a Service (spoiler: it’s not)
  • Any solution that runs “on demand compute” such as Google App Engine (spoiler: it’s not)
  • Anything that runs a container on demand like Google Cloud Run or Fargate (note: I like Cloud Run)
  • Basically “on demand compute” of some description, some of which “scales to zero”

https://medium.com/swlh/what-is-serverless-the-2020-edition-5a2f21581fe5
