Updated: Jun 30
Despite the fact that big data has been a buzz word since the early 2010s, building a big data strategy can be quite difficult for every organization because of its complexities, from building EMR clusters which are a glorified combination of various open-source tools and software products, to running SparkJobs, which is a huge undertaking that requires building a bare metal infrastructure. The challenge of big data initiatives is becoming increasingly challenging, and there are only a few companies willing to go the extra mile.
In early 2018, 1 of our clients (logistics startup), we were ask to build a very simple data analytics platform, to generate ETA and usage analytics. This is a greenfield development, that is a dream of every technology team. From our research + audit of the current system, we found right away that this is quite a climb because of continuously moving pieces.
The team just recently implemented microservices architecture, and that team separated the microservices’s database to an actual RDS instance! The tables we need for are from 3 or 4 different RDS database instance.
One of the required data for the usage and ETA, comes from a log that is generated from a spring injected java code. That goes to our elastic search cluster that house all other logs with its size 8TB (that time, last time we checked its 18TB and growing).
We don’t have any facility to do ETL and/or ELT yet.
For POCs, DevOps/platform team recommended Apache Hive, which is part of AWS EMR, so they buit it. They also provided a simple exporting data from Postgres —> S3 using AWS DataPipeline. It worked on a very simple table, but then, we started having problems with complex and large tables and eventually we started to have issues with it, And upon reflection here is what we concluded:
Its expensive, these open source tools want their own instance, if not, they are memory hungry you know (JVM)
Configuration is very very cumbersome, automating this is always a mess. We always have to rebuild it
Dependencies versioning are hard to track. We don’t know which matched well with other tools. This created a nightmare where at one point all our tables have varchar columns.
Bugs! These is an issue where some libraries contains bugs that affect the functionality of the tool
Infrastructure challenges: Networking, where tools need to communicate on certain port. storage, where we need to setup certain storage requirements. CPU & memory, where we need to allocate certain resources for an app.
Operational, efficiencies, ability for SRE/Platform team to diagnose an issue at hand if the tools started breaking or when it had problems.
Maintenance and upgrades, need to be always current with new releases for security and bug fixes.
With all the issues, above, the end goal for us is really to provide a large database, a data warehouse, that will house are very basic report. ETA as well as Usage analytics. The first hurdle is figuring out how to load data to snowflake, our new CTO (who had experience with Snowflake and had pushed us to move to SF) told us that we can load via S3. Which was perfect since we already have the portion of extract and transform using AWS Datapipeline.
Main requirement for loading data is the file_format, this describes how to parse your CSV, what rules mainly. if your file-format is not properly set, you will see a lot of error mainly there is discrepancy between rows_parsed vs. rows_loaded.
In the example above, we have 1 failure, which could be anything, so we can double check that file at time of export before loading to see if there are mostly bad character added which is almost always the case.
This is done manually to show the capabitity and hand-it off to Analytics folks so they can start building their analysis, queries and reports.
Effectively, we need to automate this process continuously update and load data. AWS DataPipline route, is ideal for exporting data, but then we need to add a node, a pipeline node to load the data.
The design of the pipeline is:
Source Node (Postgres) —> Bash (Export Command From Postgres) —> Upload Node (S3) —> Snowflake Copy Node (Bash Calling A Python Script)
Snowflake Node, we’ve used a an AWS Image AMI that has python libs and installed
And we have this command embedded. to load the data to Snowflake.
We have a method the python code generated, the following is just part of the code snippet.
adding this Snowflake load node on AWS DataPipeline, ensure the continuous load of data to Snowflake.
On the next series – Part 2, we’ll discuss internals of Snowflake of and other challenges we have with ingestion like ingesting 8TB of logs as well and others.