Simple, serverless data warehousing with AWS S3 and Athena

Simple, serverless data warehousing with AWS S3 and Athena

Serverless data warehouse AWS S3 and Athena

I’ve lost count of how many services are available on Amazon Web Services (AWS). Amazon’s website currently states 175. Whilst this frustrates me (as I never feel I know enough), it does provide me with a steady stream of ‘ah…this makes life easier’ moments as I discover new services. One of these moments came recently with Amazon Athena.

The case for Athena – data…and lots of it!

We’re creating data in ever-increasing volumes via a multitude of sources. According to Forbes, over 90% of the data in existence today was only created in the last two years. Take a moment to let that sink in! The simple truth is no matter what business we’re in, we will rely more and more on the rapid consumption of data to help inform our decisions.

For most of us (at least for now) this will involve processing text files. This will include (amongst others) legacy systems exporting the ‘tried and tested’ CSV file, millennial systems generating XML dumps and the modern, trendy systems spitting out JSON.

How would we typically go about getting this data into a useable form, so that it’s ready for querying? Well, until recently it would go a little something like this…

  • Start by being extra nice to the IT department 😊
  • Get a sample of each of your datasets (CSV, TSV, XML, JSON etc)
  • Analyse each of them and build a schema (structure) for your data warehouse
  • Estimate how much space you need to store this data
  • Configure a database server
  • Write lots of code to extract, transform and load (ETL) your files into the database
  • Spent lots of time testing, tweaking and bug fixing

Launch your data warehouse, then…

  • Worry about security and disaster recovery
  • Realise that you will soon need more space and processing power
  • Note that you might only process data once a month and that your new server will spend 96.6% of its time doing nothing but burning money

The whole process is slow and rigid. Fortunately, there are faster, more agile and more cost-effective ways of doing this. One of them is Amazon Athena.

An alternative approach

In a nutshell, Athena is a tool that lets you query text files stored in Amazon S3. You simply create a series of S3 buckets (think of them as ‘folders’), upload your CSV/TSV/XML/JSON files into the correct buckets, do a little bit of configuration and Athena does the rest.

In a short space of time you have a data warehouse that you can query with SQL or connect directly to your Business Intelligence software (e.g. Tableau, PowerBI and Looker). Athena invalidates many of the obstacles and frustrations you’d face in a traditional environment, for example:

  • Rapid and massively simplified configuration:
    Upload your files to S3, create a database in Athena, tell it where to find the data and specify the columns you want to include. There’s no more complicated extraction, transformation and load (ETL) processes that need to be written.
  • It’s serverless:
    Serverless doesn’t mean there aren’t any servers – there are servers, but you don’t need to worry about them. Amazon automatically takes care of all the servers behind the scenes.  There’s no server build, no software configuration, no patching and no maintenance.
  • Unlimited storage:
    For all intents and purposes, S3 provides infinite storage. The amount of capacity they have is truly mind boggling. You literally need to do nothing but keep on uploading your data files. S3 will just handle it and you only pay for the storage you use.
  • Solid security and disaster recovery model:
    Your data is stored in S3, which by default has 99.999999999% durability. Translated, this means you have a 0.000000001% chance of losing a piece of data in one year. If you’re data is totally non-replaceable, you can always set S3 to automatically create an additional backup of your data. Amazon also supports encryption in transit and at rest for your data, coupled with a robust security access model.
  • Processing power when you need it:
    Because it’s serverless, Athena only allocates processing capacity when it’s needed. When you’re not using it, Athena will cost you nothing (except storage costs, which are very low). When you need to crunch huge amounts of data, Athena will automatically scale-up its processing power and you’ll only be charged a flat fee of $5 per terabyte of data you scan.
  • Flexibility with little fuss:
    As much as CTO’s and technical leads like to pin things down, we know that things are constantly changing. The data you need in your warehouse today, is not the data you need tomorrow. With Athena you can quickly and easily amend your data schemas/models to fit your requirements.

Summary

I’ve used Athena several times now and I love it, so much so that I’m writing this blog post! With that said, I’m aware there are scenarios where Athena might not be the best fit (I can think of several as I type this paragraph). However, for those simply wishing to be able to query flat data files quickly and, in a cost, effective way, Athena is well worth a look.

Authors note: I’ve deliberately used generic terminology in this article. For example, I’m aware that you could argue that I’m actually building a data lake, rather than a warehouse. But I’ll leave that for another article 😊