AWS provides many tools for integrating data through an ingestion pipeline. In this article, we implement a pipeline using AWS Lambda.
## Getting Started
You are faced with the task of taking data from many sources and processing it as transactions into your own software system in a reliable and consistent manner. The data could come from spreadsheets, databases, or other software systems. You also need to be able to add new data sources easily and scale to thousands or even millions of transactions a day. While the software community has made immense progress in areas such as standardized data formats (JSON, CSV, XML) and APIs (REST, OpenAPI), this is still one of the most challenging systems a team can build.
To tackle this problem, let's break it down into its key components, with our goal being a flexible and reusable solution: a data integration pipeline. A data integration pipeline is a system that takes data from one system and processes it into another using a sequence of steps. With this type of pipeline, we can take input data from many different sources and process the rows as transactions into our system. Typically, these transactions involve significant business logic, and each row of data received may result in updates or inserts across multiple tables.
We will walk through the process of designing a data integration pipeline and launching it on AWS. By utilizing the serverless technologies AWS Lambda, SQS (Simple Queue Service), and API Gateway, we will be able to break our process into smaller steps, optimize for cost, and provide a mechanism that scales easily. In addition, we will provision our infrastructure (including environments for dev, staging, and prod) using Nullstone. This post is one of many in a series of Nullstone Use Cases.
## Defining the Problem
Before we start building any infrastructure or cracking open our favorite IDE to begin coding, let's first break down the problem we are trying to solve. To illustrate this process, I will use an example as the target for our integration pipeline. In this example, let's imagine that we are a business that needs to integrate with each of our customers. As we onboard each customer, we will need to ensure that we can take the data they will supply to us and add this data to our system.
While we won't dive into every detail in this post, we will focus on the higher-level concepts that shape the design. In this scenario, the pipeline must accept data from many different sources and formats, process each row reliably and consistently, make it easy to onboard new customers, and scale to millions of transactions a day.
## Scope
I have designed and built a few different integration pipelines throughout my career, but in this post, I will cover one general strategy that works well for an extensive range of scenarios. The goal is to produce a simple design that remains flexible. Keeping complexity down helps us reduce risk and increases developer understanding of the solution. We want a system that our development teams can reason about, contribute to, and debug easily. To achieve this, we will build using interchangeable parts that follow contracts and each handle a single responsibility.
## Simplify Using Single Responsibility
To keep our thinking and solution as simple as possible, we will start by thinking about the problem using the Single Responsibility Principle. This principle states that you should write your code in small pieces where each piece only has one responsibility or reason to change. We should apply this line of thinking to our code and our entire system, including the infrastructure. Once we define our building blocks with clear boundaries and contracts, we should compose our system from those blocks and interchange them as we need. So what are the basic building blocks of our pipeline?
### Phase 1: Receive the Data
Accept data from a customer over an HTTPS endpoint, store it, and acknowledge that it was received.
### Phase 2: Parse/Validate
Validate the incoming payload and parse it into individual records.
### Phase 3: Transform
Apply customer-specific mapping and transformation so each record follows a standard contract.
### Phase 4: Final Transaction
Execute the transaction against our system through our standard API.
By breaking the pipeline down into these parts, the individual responsibilities become much clearer. Phases 1, 2, and 4 can be developed in a completely generic way. If we need to support a new data format, we can write a new parser and nothing else needs to change. Each part of the system is also easy to test individually.
## Constructing and Hosting our Pipeline
Now that we have laid out the various stages of our integration pipeline, we need to think about how we will host it. As a software developer, you might initially look to build all of this yourself with custom code. That is a good approach and one I have used before.
However, for this example, I'm going to take a slightly different approach that will allow us to move more quickly and get to market faster. We will use AWS as our hosting platform and utilize AWS Lambda, SQS queues, and an API Gateway. Using this model, we will be able to keep our costs down, maintain a high degree of resiliency, and scale with ease.
Instead of hosting our system on dedicated machines or even containers, using serverless allows us to pay only for what we use and to scale up and down quickly. Serverless is a great fit for workloads with big bursts of activity followed by quiet periods: you aren't charged anything while the functions aren't executing.
By using SQS, we get a queueing mechanism known to be highly resilient and scalable, without having to build or operate one ourselves.
### Phase 1: Receive the Data
The pipeline begins with an API Gateway that exposes an HTTPS endpoint for customers to call. The request is passed to a Lambda function that stores the data and acknowledges that it was received successfully. The final step is to publish the data onto an SQS queue.
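Here is a minimal sketch of what that receiving function could look like. The post doesn't prescribe a runtime or a storage location, so this example assumes Python with boto3, stores the raw payload in an S3 bucket (SQS messages are capped at 256 KB, so passing a reference is safer than passing the data itself), and reads the bucket and queue names from hypothetical environment variables.

```python
import json
import os
import uuid

import boto3

s3 = boto3.client("s3")
sqs = boto3.client("sqs")

# Illustrative names; supply your own bucket and queue via environment variables.
RAW_BUCKET = os.environ["RAW_DATA_BUCKET"]
PARSE_QUEUE_URL = os.environ["PARSE_QUEUE_URL"]


def handler(event, context):
    """API Gateway proxy handler: store the raw payload, then enqueue it for parsing."""
    body = event.get("body") or ""
    key = f"raw/{uuid.uuid4()}.json"

    # Persist the raw submission so it can be inspected or replayed if a later phase fails.
    s3.put_object(Bucket=RAW_BUCKET, Key=key, Body=body.encode("utf-8"))

    # Hand the work off to Phase 2; only a reference to the stored object is sent.
    sqs.send_message(QueueUrl=PARSE_QUEUE_URL, MessageBody=json.dumps({"s3_key": key}))

    # Acknowledge receipt to the customer.
    return {"statusCode": 200, "body": json.dumps({"status": "received", "id": key})}
```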
### Phase 2: Parse/Validate
The second phase begins with a Lambda function that is subscribed to an SQS queue. When the Lambda function from Phase 1 publishes data onto this queue, this function is triggered. It validates the data it receives and parses it into an array of hashes. Each hash is then published onto the next SQS queue for the next phase.
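As a rough sketch of this phase, the handler below (again assuming Python, boto3, and the hypothetical names from the Phase 1 example) pulls the stored payload, treats it as CSV purely for illustration, and fans out one message per row. A real implementation would plug in whichever parser and validation rules each data format requires.

```python
import csv
import io
import json
import os

import boto3

s3 = boto3.client("s3")
sqs = boto3.client("sqs")

# Illustrative names carried over from the Phase 1 sketch.
RAW_BUCKET = os.environ["RAW_DATA_BUCKET"]
TRANSFORM_QUEUE_URL = os.environ["TRANSFORM_QUEUE_URL"]


def handler(event, context):
    """SQS-triggered handler: load the stored payload, parse it into rows, and fan out one message per row."""
    for record in event["Records"]:
        message = json.loads(record["body"])
        obj = s3.get_object(Bucket=RAW_BUCKET, Key=message["s3_key"])
        raw = obj["Body"].read().decode("utf-8")

        # Parse CSV rows into dictionaries. Blank rows are skipped here; a real
        # pipeline would apply stricter validation and route bad rows to a
        # dead-letter queue for review.
        rows = [row for row in csv.DictReader(io.StringIO(raw)) if any(row.values())]

        for row in rows:
            sqs.send_message(QueueUrl=TRANSFORM_QUEUE_URL, MessageBody=json.dumps(row))
```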
### Phase 3: Transform
The third phase is also a Lambda function; it receives a single hash from the SQS queue it is subscribed to. It does any mapping and transformation required to get the data into a structure that follows a standard contract. This phase is where all of the customer-specific logic resides. Because we have isolated this as a separate phase, we can develop new logic for each customer in this one spot without changing the rest of the pipeline. When we are ready to test new customer-specific changes, we deploy them to this phase in our QA environment; once they have been tested, we promote them to production.
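The heart of this phase is the customer-specific mapping. The sketch below (the customer names and field mappings are made up for illustration) shows one way to express those mappings declaratively, so onboarding a new customer is mostly a matter of adding a new entry.

```python
# Illustrative per-customer field mappings; in a real pipeline these might live
# in configuration or in separate modules deployed per customer.
CUSTOMER_MAPPINGS = {
    "acme": {"cust_ref": "customer_id", "amt": "amount", "dt": "transaction_date"},
    "globex": {"CustomerID": "customer_id", "Total": "amount", "Date": "transaction_date"},
}


def transform(customer, row):
    """Map one customer-specific row into the standard contract our API expects."""
    mapping = CUSTOMER_MAPPINGS[customer]
    return {standard_field: row[source_field] for source_field, standard_field in mapping.items()}


# Example: a row from "acme" normalized into the standard shape.
print(transform("acme", {"cust_ref": "C-1001", "amt": "42.50", "dt": "2021-06-01"}))
# => {'customer_id': 'C-1001', 'amount': '42.50', 'transaction_date': '2021-06-01'}
```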
### Phase 4: Final Transaction
Finally, the Lambda function from Phase 3 calls our standard API to execute the transaction.
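Because Phases 3 and 4 run in the same Lambda function, the handler ties them together: it transforms each incoming row and then posts the result to our API. This is a sketch only; the endpoint, token, and environment variable names below are placeholders, and transform() refers to the Phase 3 sketch above.

```python
import json
import os
import urllib.request

# Placeholder endpoint and credentials; substitute your own service's values.
API_URL = os.environ.get("TRANSACTIONS_API_URL", "https://api.example.com/transactions")
API_TOKEN = os.environ.get("TRANSACTIONS_API_TOKEN", "")


def submit_transaction(payload):
    """POST a normalized payload to our standard transactions API."""
    request = urllib.request.Request(
        API_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Bearer {API_TOKEN}",
        },
        method="POST",
    )
    with urllib.request.urlopen(request) as response:
        return response.status


def handler(event, context):
    """SQS-triggered handler for Phases 3 and 4: transform each row, then execute the transaction."""
    # How the customer is identified is up to you; a per-deployment environment
    # variable is one simple option.
    customer = os.environ.get("CUSTOMER", "acme")
    for record in event["Records"]:
        row = json.loads(record["body"])
        payload = transform(customer, row)  # customer-specific mapping from the Phase 3 sketch
        submit_transaction(payload)
```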
## Building our Environment
There are a few different ways to build out this environment in AWS.
If you are just getting started with SQS or Lambda functions, standing all of this up can feel overwhelming, and it is hard to know where to start.
At Nullstone, we have seen this scenario play out many times. In the following video, we walk you through how easy it can be to stand up the infrastructure for this scenario. In the end, you will be left with infrastructure that works out of the box, can be repeated across as many environments as you like, and can be modified easily as you scale.
Now that our pipeline is established, we just need to write the functions to handle each part of our pipeline.
## Additional Environments
In the video above, we launched our solution into our dev environment to begin developing. Once we are ready for QA, staging, or production environments, we will need to duplicate our setup reliably to ensure a smooth rollout. As we add new customers and integrations, these additional environments are vital. The new code to handle each customer should first be rolled out to our QA environments to be tested and validated. Any bug fixes should likewise be verified in QA before going into production.
Using Nullstone, we can add those additional environments and launch the infrastructure. All the configuration is automatically duplicated across environments. Your production environment will likely need more CPU, memory, and capacity, so adjust those values as you launch it. To see the process of adding additional environments, please check out the video below.
## Wrap Up
In this post, we took a look at designing and launching an integration pipeline, including multiple environments. While we couldn't dive into every detail, hopefully this helped you establish an architecture and design for your own integration pipeline.
If you would like help designing your integration pipeline or have any other infrastructure or cloud questions, please contact our team of cloud experts at info@nullstone.io.