Should you have an ETL window in your Modern Data Warehouse?
Ah, the ETL (Extract-Transform-Load) window: the schedule by which the Business Intelligence developer sets their clock, the nail-biting nightly period during which the on-call support hopes their phone won’t ring. It’s a cornerstone of the data warehousing approach… and we shouldn’t have one. There, I said it.
Hear me out – back in the on-premises days, our data loading processes connected directly to our source system databases and ran huge extract queries as the start of one long, monolithic pipeline that ended in our data warehouse. Those extract queries could massively impact users of our business-critical applications, and data preparation ate up all of our warehouse’s performance, so we rightly moved the extracts out of business hours. But business users still wanted their reports ready when they started work the next day, so we were left with an ever-shortening window in which to process data.
This all makes absolute sense, but it’s become more than a solution to a problem; it’s now tradition. If you currently wait until midnight to start processing your data, and then process it all at once, take a moment to think about why you’re doing it. Are you still bound by source-system access?
Modern Data Warehousing
The majority of systems I’ve worked with no longer have this issue. We’re moving away from large, heavily coupled, on-premises data analytics systems. As a general rule, you don’t want your cloud systems strongly tied to your on-premises systems – if there is some bad data in your extract, you need to be able to deal with it without going back to the source system. To achieve this, we build a very lightweight data transfer that lands the data into unstructured storage before performing any validation or quality checks. Then, if there are any problems with our downstream data processing, we can go back to this landed data rather than back to our source. We deliberately build loosely coupled data processing systems to gain independence from the source.
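To make that concrete, here’s a minimal sketch of what a landing step could look like in Python, assuming an Azure storage account connection string and a “landing” container – the names, folder layout and payload here are purely illustrative, not a prescribed implementation.

```python
from datetime import datetime, timezone
from azure.storage.blob import BlobServiceClient

# Hypothetical names - swap in your own storage account and container.
CONNECTION_STRING = "<storage-account-connection-string>"
LANDING_CONTAINER = "landing"

def land_raw_extract(source_name: str, payload: bytes) -> str:
    """Persist a raw extract to the landing zone, untouched.

    No validation, no transformation - we store exactly what the source
    gave us, so downstream reprocessing never has to go back to the
    source system.
    """
    service = BlobServiceClient.from_connection_string(CONNECTION_STRING)
    container = service.get_container_client(LANDING_CONTAINER)

    # Partition by source and load date so re-runs are easy to locate.
    now = datetime.now(timezone.utc)
    blob_path = f"{source_name}/{now:%Y/%m/%d}/extract_{now:%H%M%S}.json"

    container.upload_blob(name=blob_path, data=payload, overwrite=True)
    return blob_path
```

The important design point is that this step does nothing clever – all the validation and shaping happens later, against the landed copy.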
In addition to this, the systems doing our data processing and preparation have changed. The scalability of the cloud means we can provision temporary compute just for data processing, get the data ready and then push it into our warehouse, without affecting the performance our warehouse users receive.
Finally – where we get our data from is changing massively. Seldom do we have full access to the source system database, with software-as-a-service solutions becoming more popular and additional data feeds being made available in various forms. If we’re pulling our data from an API, it’s generally better to do lots of smaller extracts rather than one big extract a day. The “internet of things” (IoT) space is growing rapidly too, and that world is all about a constant feed of streaming events rather than daily batches of data.
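As a rough illustration of the “lots of smaller extracts” idea, an incremental pull from an API usually boils down to a watermark-driven loop like the sketch below – the endpoint, authentication and paging parameters are hypothetical stand-ins for whatever your particular SaaS API exposes.

```python
import requests

# Hypothetical API - most SaaS platforms offer something similar:
# a "modified since" filter plus simple paging.
API_URL = "https://api.example.com/v1/orders"
API_KEY = "<api-key>"

def extract_since(watermark: str) -> list[dict]:
    """Pull only the records changed since the last successful extract."""
    records, page = [], 1
    while True:
        resp = requests.get(
            API_URL,
            headers={"Authorization": f"Bearer {API_KEY}"},
            params={"modified_since": watermark, "page": page, "page_size": 500},
            timeout=30,
        )
        resp.raise_for_status()
        batch = resp.json()
        if not batch:
            break
        records.extend(batch)
        page += 1
    return records
```

Run that every fifteen minutes with an updated watermark and each extract stays small, rather than hammering the source once a day.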
In summary, the velocity at which we receive data is growing steadily, and yet we still find people waiting to process data until their nightly ETL window… just because that’s what they know.
Data-Driven Processing
What is the alternative? In Microsoft Azure we have Event Grid and various messaging services, and we can attach triggers to these so that processing kicks off the moment an event occurs. Data Factory itself can be triggered via its REST API, so anything that can call a web service can now trigger our data processing workflows.
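As an example of the REST route, here’s a hedged sketch of starting a Data Factory pipeline run through the Azure management API – it assumes you’ve already obtained an Azure AD access token (for example via a service principal), and the subscription, resource group, factory and pipeline names are placeholders.

```python
import requests

# Placeholders - fill in with your own Azure resources.
SUBSCRIPTION = "<subscription-id>"
RESOURCE_GROUP = "<resource-group>"
FACTORY = "<data-factory-name>"
PIPELINE = "<pipeline-name>"

def trigger_pipeline(access_token: str, parameters: dict) -> str:
    """Kick off a Data Factory pipeline run and return its run id."""
    url = (
        f"https://management.azure.com/subscriptions/{SUBSCRIPTION}"
        f"/resourceGroups/{RESOURCE_GROUP}/providers/Microsoft.DataFactory"
        f"/factories/{FACTORY}/pipelines/{PIPELINE}/createRun"
        "?api-version=2018-06-01"
    )
    resp = requests.post(
        url,
        headers={"Authorization": f"Bearer {access_token}"},
        json=parameters,  # pipeline parameters, if the pipeline takes any
        timeout=30,
    )
    resp.raise_for_status()
    return resp.json()["runId"]
```

Anything that can make that HTTP call – a logic app, a function, a supplier’s webhook – can now start the processing workflow.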
Have we received a new file in blob storage? Kick off Data Factory to process it through. Someone’s added a new file to SharePoint? Execute the processing pipeline. We can be incredibly dynamic and reactive, starting to process the data as we receive it. Something we’ve used successfully in the past is implementing “Data Contracts” with third-party data suppliers – when they’ve sent us new data, they drop a message onto a queue that both defines how we should process that data and triggers the loading process.
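To give a flavour of the data-contract pattern, here’s a rough sketch of a consumer polling an Azure Storage queue for contract messages – the queue name, message shape and dataset-to-pipeline mapping are all assumptions made for illustration, not the actual contract we used.

```python
import json
from azure.storage.queue import QueueClient

# Hypothetical contract: a JSON message agreed with the supplier, e.g.
# {"dataset": "daily_sales", "path": "supplier-a/2024/01/15/sales.csv",
#  "schema_version": 3, "row_count": 184220}
CONNECTION_STRING = "<storage-account-connection-string>"
CONTRACT_QUEUE = "data-contracts"

# Which processing pipeline handles which dataset (illustrative mapping).
PIPELINE_FOR_DATASET = {
    "daily_sales": "pl_process_sales",
    "stock_levels": "pl_process_stock",
}

def poll_for_contracts() -> None:
    """Read data-contract messages and route each one to its pipeline."""
    queue = QueueClient.from_connection_string(CONNECTION_STRING, CONTRACT_QUEUE)
    for message in queue.receive_messages():
        contract = json.loads(message.content)
        pipeline = PIPELINE_FOR_DATASET[contract["dataset"]]
        print(f"Contract received: start {pipeline} for {contract['path']}")
        # In practice you'd call Data Factory's createRun here (see the
        # earlier sketch), passing the contract values as pipeline parameters.
        queue.delete_message(message)
```

The message is doing double duty: it tells us that new data has arrived and how that data should be handled.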
You can definitely go too far with this approach – remember that there are end users working with the data, and if it constantly changes throughout the day, it’ll be hard for them to understand what’s going on. But you can certainly get the data ready and do all of the processing up-front, so that when it comes time to refresh the business models, it’s a much smaller job.
This gives a huge amount of flexibility to the architecture – if users want that dreaded thing that is a “near real-time” report, they can go to the warehouse layer; if they want speed and flexibility and don’t mind a slower refresh cadence, they can query the semantic layer.
Our ETL period in this case, assuming the semantic layer gets refreshed nightly, is now something like the following:
There’s a logical next step, which would be to take the semantic layer and start refreshing that more frequently too – maybe hourly. That requires a bit of a culture shift in your organisation, but if you’re climbing that analytics maturity curve, maybe that’s not such a bad thing.
I’m not promising this is easy – there are complexities around late-arriving facts, file dependencies and the usual objections. But it’s no longer a problem of technology; it’s a problem of good data modelling, assuming you’re still using Kimball, which is a whole different discussion – more on that here!
Disclaimer: If you’re processing data overnight, and have a good reason to do so, that’s not a problem! It’s only when you’re processing overnight because of tradition, and potentially missing out on opportunities because of it, that we have a problem!
If you’d like to know more about augmenting your warehouses with lakes, or our approaches to managing lakes and agile data delivery, please get in touch at simon@advancinganalytics.co.uk. We also offer training on the modern data warehouse toolset in Azure – take a look at our training offerings here.
Till next time.
Simon Whiteley
Director of Engineering
Data Platform Microsoft MVP
You can follow Simon on Twitter at @MrSiWhiteley to hear more about cloud warehousing & next-gen data engineering.