Architecting a Successful Analytics Platform in Microsoft Fabric
It’s really easy to get started with Microsoft Fabric. We can ingest data, transform it, and surface it in Power BI in no time, without creating a single Azure resource. That’s the easy part, but how do we turn that into a robust, logical architecture that gives us a successful Analytics Platform?
We first need to think about how Fabric is structured. What’s the hierarchy of artifacts and containers inside Fabric, and what does each layer mean?
Workspaces
We start with Workspaces at the core of Fabric. If you’re coming from the Power BI world, you’ll know workspaces already, but they’re no longer just for organising reports, datasets, and dashboards. They are every user’s main window into Fabric. At a base level, Workspaces are where security is set: they are where you define who can create, edit, delete, or just consume everything you create in Fabric.
Workspaces are where many technical configurations are controlled too. Source control over all of your artifacts (only Power BI artifacts are currently supported in the preview) is set at the Workspace level. Going deeper, it’s also the layer where we control the Spark cluster used to execute notebooks.
All of this positions Workspaces well to sit under the umbrella of a specific business team or department, where access to data is aligned and demand patterns are likely to align too.
Domains
Workspaces won’t meet the needs of every organisation, though, and can present a fragmented security model for businesses where many teams come under a single department or regional office. This is where Domains come into play, and start to hint at the Data Mesh architecture that Microsoft is trying to weave into Fabric (pun intended!).
Domains are essentially a container layer above your workspaces, providing a layer of control over many workspaces and facilitating easier collaboration between them. This is the kind of thing that might align with a global sales group, with regional sales teams defined at the workspace level.
This approach aligns well with a cornerstone of the Data Mesh architecture: domain-oriented, decentralised data ownership.
It’s actually quite clever how Microsoft are supporting the Data Mesh architecture from a business perspective whilst also alleviating the technical challenges, with all of Fabric sitting on top of OneLake: a single, yet distributed, storage layer under the hood.
One of my main concerns with the Data Mesh architecture has always been that it relies on an ideal world of data ownership, curation, and control that exists in so few organisations.
Lakehouse or Warehouse
Aimee has written a great analysis and comparison of the question on everyone’s mind, Lakehouse or Warehouse, so I won’t revisit it here, but I urge you to check it out. These are probably the two core artifacts in your workspace, and the first decision point when architecting your Analytics Platform.
Although that decision is mainly driven by your team’s skillset and the business’s requirements, I’d advocate for a full Lakehouse architecture.
This involves establishing a Lakehouse artifact for each of our intended zones or layers. You might follow the medallion architecture, our own approach of Raw, Base, and Curated, or perhaps something else entirely, but the premise remains the same: data is ingested and progressively curated as it moves through each zone.
Let’s look at a quick overview of what we would like to achieve:
Use Pipelines to ingest data into the RAW zone
Use Spark notebooks to apply cleansing rules and transform data into BASE
Model our data in CURATED for Power BI to connect to directly using Direct Lake
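To make the middle step concrete, here’s a minimal sketch of what a Fabric Spark notebook moving data from Raw to Base might look like. The lakehouse, path, table, and column names are all hypothetical, and it assumes the Base lakehouse is attached as the notebook’s default lakehouse (Fabric notebooks provide the `spark` session for you):

```python
# Minimal sketch of the Raw -> Base step in a Fabric Spark notebook.
# Assumes "Base" is the notebook's default lakehouse, so relative table
# writes resolve there. The Raw path is the hypothetical OneLake location
# a pipeline landed files into - all names here are illustrative.

from pyspark.sql import functions as F

# `spark` is the SparkSession Fabric provides in every notebook
raw_path = (
    "abfss://Sales@onelake.dfs.fabric.microsoft.com/"
    "Raw.Lakehouse/Files/landing/customers/"
)
raw_df = spark.read.option("header", "true").csv(raw_path)

# Simple cleansing rules on the way into Base
base_df = (
    raw_df
    .dropDuplicates(["customer_id"])
    .withColumn("customer_name", F.trim(F.col("customer_name")))
    .withColumn("ingested_at", F.current_timestamp())
)

# Persist as a Delta table in the Base lakehouse
base_df.write.format("delta").mode("overwrite").saveAsTable("customers")
```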
Pipelines, Flows, Notebooks, and SQL
As we keep highlighting, there are many ways to get to the same answer in Microsoft Fabric, and the same applies to data processing. For the full Lakehouse approach, we’re looking at Pipelines to ingest data and Spark notebooks to process it through each zone (Lakehouse). Don’t forget you can also write SQL in notebooks, as the sketch below shows. You also have Dataflows, which can be called from pipelines, to do processing, but it’s not a tool I would recommend. If you opt for a Data Warehouse, you have stored procedures and queries in there too. There are a lot of different ways to work with data in Fabric and, as before, skillsets will influence this decision, but costs will inevitably make it.
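As a quick illustration of SQL inside a notebook, the same kind of cleansing can be expressed through `spark.sql()` against tables visible to the attached lakehouse. The table names here are hypothetical:

```python
# The earlier cleansing logic expressed in SQL from a Spark notebook cell.
# Table names ("customers", "customers_clean") are illustrative; `spark`
# is the session Fabric provides.

clean_df = spark.sql("""
    SELECT DISTINCT
        customer_id,
        TRIM(customer_name) AS customer_name
    FROM customers
""")

clean_df.write.format("delta").mode("overwrite").saveAsTable("customers_clean")
```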
Development Practices
With our Domains set up and our Workspaces holding our Lakehouse artifacts, we need to think about how this architecture in Fabric fits into the development practices of a data team. This hasn’t been a strong focus in Power BI in the past, as the primary audience was business users. That’s set to change, though (not quite yet, as Fabric is still in Preview).
With source control set at the workspace level, the development cycle within Fabric is performed on an entire workspace. This presents the challenge of how to develop without impacting users. Much like the established Power BI practice, this means creating a workspace for each environment (Development, Test, and Production, for example), which can increase the complexity of security management but is the only logical approach.
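One practical consequence is that notebooks shouldn’t hard-code workspace-specific paths. A hypothetical sketch of parameterising the environment instead, with all workspace and lakehouse names assumed for illustration:

```python
# Hypothetical sketch: one notebook promoted across Dev/Test/Prod workspaces,
# with the environment injected (e.g. from a pipeline parameter) rather than
# hard-coded. Workspace and lakehouse names are assumptions.

env = "dev"  # would be set per-workspace, e.g. via a pipeline parameter

workspace_by_env = {
    "dev": "Sales-Dev",
    "test": "Sales-Test",
    "prod": "Sales",
}

# OneLake addresses items as <workspace>/<item>.<type>/<path>
raw_path = (
    f"abfss://{workspace_by_env[env]}@onelake.dfs.fabric.microsoft.com/"
    "Raw.Lakehouse/Files/landing/"
)

df = spark.read.option("header", "true").csv(raw_path)
```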
Basic Principles
As with the Lakehouse vs Warehouse question, there are many options in Fabric to build out an Analytics Platform using different combinations of artifacts and tools, with some more likely to be successful than others. Microsoft takes care of a lot of those decision points, with everything in Fabric writing to OneLake in Delta format. That alone gives you a solid (but not vendor-locked) Data Lakehouse platform to build the parts on top that meet your needs.
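That “not vendor-locked” point is worth demonstrating: because everything lands in OneLake as Delta, any Delta-capable engine can read it, not just Fabric. A minimal sketch using the open-source `deltalake` package; the workspace, lakehouse, and table names are assumptions, and the auth option shown is illustrative, so check the deltalake docs for the supported Azure credential settings:

```python
# Reading a Fabric-curated Delta table from outside Fabric with the
# open-source `deltalake` package. All names are hypothetical, and the
# bearer-token auth shown is an assumed mechanism - consult the deltalake
# docs for the Azure storage options your version supports.

from deltalake import DeltaTable

table_uri = (
    "abfss://Sales@onelake.dfs.fabric.microsoft.com/"
    "Curated.Lakehouse/Tables/customers"
)

# An AAD token with access to the workspace (placeholder, assumed auth)
dt = DeltaTable(table_uri, storage_options={"bearer_token": "<aad-token>"})
df = dt.to_pandas()
print(df.head())
```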
Beyond these basic Lakehouse principles, which tools you use to transform data, what layers you process that data through, and how you surface and visualise it will always depend on a vast array of factors specific to your business. Microsoft have simplified the Where and the Why, but you still need to build out the How.
At Advancing Analytics, we live and breathe the Data Lakehouse. Review our other Fabric blogs and videos for more information and get in touch for us to help you assess if Fabric meets the needs of your business.