Earlier this year I wrote about how to use Amazon S3 and Athena to turn your text data files into a highly performant and extremely cost-effective data warehouse. I love the simplicity of that approach and you can’t beat it for cost, but there are times when you might have more complex requirements than just processing text files.
You might need to bring in data from a multitude of systems including CRM, marketing, ERP and customer support tools, external databases and just about any system you can think of. With that in mind, I wanted to do a post about a tool I have been using for the last several months called ‘Stitch’.
The problems with building a data warehouse
I’ve worked on several data warehousing projects recently and two things have really struck me. Firstly, there’s quite a bit of complexity that needs to be worked through – especially when you’re pulling data from multiple systems. Each must be analysed to understand what data is available, the structure of the data, how the data is retrieved, the variables at play, dependencies and how errors are handled etc. This is before a developer has written their first line of code!
Secondly, there’s a lot of re-inventing the wheel. Most of us are using systems that others are using. Others who, like you are trying to build their own data warehouses. For example, let’s say you use Zendesk for customer support and want to use this data in your data warehouse. I’d bet there’s a lot of other organisations trying to do the same thing. This leads to a huge amount of near identical code, wasted time and energy across the industry.
Where Stitch comes to the rescue
Imagine that someone else had done all the hard work of getting data out of a specific system. They’d analysed the data that was available, worked out how to pull the data out and found fixes for any common problems. In fact, they’d done this for well-over 100 systems and packaged their work into a ‘menu’ of ‘Connectors’ that you could pick and choose from.
Well, that’s exactly what Stitch did. All the hard work has been done and there’s no code to write!
The whole process can broadly be split into two steps:
- Choose a destination: Select and connect to your data warehouse. Stitch supports all the likely candidates you’d use for your data warehouse, including PostgreSQL, AWS RedShift, Google BigQuery, Azure Synapse Analytics and Snowflake. (You’ll obviously need an empty data warehouse to get going.)
- Choose your systems: Select and configure the systems you want to pull data from (known as Connectors). For each one, Stitch will ask you a series of questions relating to which data to pull over and on what schedule
I’m sure you can see the advantages of this approach! As mentioned, analysing and writing the code to pull data from a single system takes a lot of time. And then you need to multiply this time, effort and cost by however many systems you want to pull data from. But with Stitch, this work has been done for you!
Sounds pretty good to me!
‘Other’ data sources
The first obvious question is what happens if Stitch doesn’t have a Connector for the system you want to pull data from. Despite having over 100 pre-defined Connectors, you may find yourself in a situation where your system isn’t included. Here, you’ve got three main options:
- Export data from the system into a file and place it into Azure Blob Storage or AWS S3. Stitch has built in connectors to pull data from these. This is the easiest solution, but requires a bit of manual effort from time-to-time
- Use a ‘Push API’. I don’t want to get too technical in this article, but you can write a little bit of code to push data directly into Stitch, bypassing the standard Connectors. Granted, this is more complex and requires some coding knowledge
- Build your own Connector. So, this is the ‘nuclear’ option and does require an experienced developer. In the vast majority of cases your system will be covered by one of the pre-built Connectors. But if you have a fairly niche system and a little more budget (2-4 weeks development time, depending upon the system), then this may be an option for you
Technicalities of using Stitch
Stitch hides away most of the technicalities of developing a data warehouse. For many Connectors the setup is simple – give it your credentials, choose your data and select a schedule.
However, be aware that some Connectors are slightly more complicated. A primary example of this is when pulling data from other databases. Stitch needs to be able to workout which records have been changed since it last pulled data from the source, and this is where the options can get a little complicated. This is beyond the scope of this article but it’s important to get this bit right, or else you might end up spending more money than you need to.
Sounds good, but what about pricing?
Stitch is surprisingly cost-effective. It includes a free plan and paid plans starting from $99 per month. Pricing is based on how many rows of data you process per month, combined with how many Connectors you use. Compare this to a five figure (or sometimes six figure) sum you may pay for a bespoke solution and you can see the benefits of using something like Stitch.
In summary
I like Stitch. It’s a fantastic tool for those looking for a small-mid sized data warehouse without spending large amounts of money on custom development. I have a couple of gripes with it (namely the lack of flexibility with scheduling), but overall, I’m very impressed with the offering.
If you are pulling from your own databases or flat files then be careful when configuring your Connectors (see ‘Technicalities of using Stitch’), or else you can end up blowing your row count and spending more money.
In addition, it’s worth noting that Stitch only performs minimal data transformations (e.g. standardising column types and small bits and pieces). If you are looking to do extensive transformation of your data, then you may need to combine Stitch with another tool. With that said, I’d highly encourage people to register for a free trial to see what the platform offers. You could end up having a fully functional data warehouse in a fraction of the time than you anticipated and a few pounds left over!