Getting Your Data Organized in a Centralized Location with Microsoft Fabric’s OneLake

December 6, 2024 | December 17th, 2024 | AI, Automation

Microsoft Fabric launched last year, bringing a host of exciting features that we explored. Even with countless options in the data and analytics market, Microsoft recognized that these choices still left data teams struggling to effectively share and utilize outputs between departments.

We’ve experienced this firsthand in our recent projects, where we faced challenges when trying to create data products because we had to pull data from various teams. 

Even though the organization used one main vendor for data tools, its teams still found it difficult to combine and use data from different sources effectively.

Since its debut, we’ve been closely following Fabric’s product architecture and are particularly impressed with its innovative approach to data storage. 

In this article, we will look at OneLake, focusing on traditional data lake challenges and OneLake’s approach to modernizing data management and storage solutions.

What are Fabric and OneLake?

Keeping up with the annual wave of news and updates at Microsoft Build helps us see what the company values for its developer audience.

At Build 2023, AI and ML took center stage as Microsoft introduced a comprehensive method for creating AI solutions, starting from your data and building upward, meaning you begin with data management and then add layers of functionality.

One major announcement was the launch of Microsoft Fabric, a set of cloud-based tools designed to handle large amounts of data, specifically geared toward data science and engineering. Building custom AI applications starts with identifying and providing the right data for designing and training ML models. 

Fabric isn’t just about building applications; it also focuses on managing those applications and providing real-time analytics, which is crucial for operating a modern business effectively.

Microsoft is setting a new standard for analytics with Fabric which provides an intelligent data foundation to handle various analytic workloads efficiently. 

It combines several Microsoft tools – Power BI (for business intelligence and visualization), Data Factory (for data integration and orchestration), and the next generation of Synapse (for big data and analytics) – into a single, cohesive solution. 

This means users can create visualizations, perform data integration, and run large-scale analytics all within one environment. 

The platform is designed to offer good performance relative to its cost, meaning it delivers high efficiency and value for the price paid by customers. It also aims to simplify the management of analytics processes, making it user-friendly for organizations to handle their data needs without extensive overhead. 

In a nutshell, Fabric is a modern analytics solution that simplifies data handling and analytics tasks, making them more cost-effective and easier to manage.

At the heart of this platform is OneLake, a smart, unified data lake service that aims to simplify data management, eliminate silos, and accelerate analytics across the entire organization. It essentially serves as the storage account for all the data you use within Fabric.

As the name implies, it acts as a unified “data lake” that supports nearly all your Fabric workloads. For those who enjoy analogies, Microsoft puts it this way: “OneLake is to your data what OneDrive is to your files and documents”.

Typically, in data management, data has to be moved between different systems or duplicated for various uses, which can be time-consuming and costly. 

OneLake avoids this by providing a unified platform where data can be accessed without the need for such movement or duplication, minimizing data management time and effort. It aims to give you one spot for all your data, and along the way, eliminate silos and streamline management (security, governance, and data discovery) for your data.

OneLake is designed to efficiently handle all types of data analysis activities like generating reports, running complex calculations, or finding patterns in the data.

The Challenge of Traditional Data Lakes

Before the introduction of OneLake, organizations struggled with data management due to fragmented data lakes. Their vision was to create a single, pristine data lake that served as a centralized repository where all organizational data could be consolidated.

This approach aimed to dismantle data silos, facilitating easier data blending, analysis, and access across the organization. The goal was to simplify management, governance, and discovery, ensuring that users and applications could easily retrieve the data they needed.

However, realizing this vision proved challenging. 

The experience of working with data lakes often mirrored the early days of file sharing before cloud storage services like OneDrive. 

Back then, organizations had to invest in physical servers and manually organize files into folders. This process required a lot of hands-on work to ensure that files were properly stored and accessible.

When OneDrive and similar services emerged, they revolutionized file sharing by simplifying the process. Instead of investing in and managing your own servers (data lakes), you pay for cloud storage and let the service handle the technical details. 

If you need more sophisticated data management (like a custom data lake for complex data analysis), you can build it using the storage provided by cloud services. 

However, in practice, building a well-organized data lake involves significant challenges including coordinating efforts through a central team to ensure everything is properly integrated and managed.

In essence, while data lakes offer a streamlined way to manage and analyze data, getting them to function as envisioned involves a lot of manual work and effort.

To address these challenges, the data mesh pattern emerged. This approach advocates for decentralizing data management, allowing individual business units or teams to manage their own data lakes. 

While this can enable teams to work more independently and responsively, it introduces additional overhead and complexity. Each team ends up with its own data lake, leading to fragmented data storage across the organization.

The result of this fragmentation is often multiple isolated data lakes, each serving different business domains. 

While this approach allows individual teams to handle their specific needs, it often results in fragmented data systems that are not easily integrated, each requiring its own governance. This leads to data duplication and, ultimately, higher costs and inefficiencies. 

To overcome this, you often need to develop solutions that break down these silos. This might involve moving data between lakes, which is a cumbersome and complex process.

Even after data is moved or integrated, users and applications may not have direct access to these data lakes. Instead, organizations often need to build additional layers like data marts, data warehouses, and cubes, as well as Power BI datasets to serve the data to users. 

The issue is that these structures are not just references to the original data in the lake but are often copies of the data. 

Sometimes, this results in multiple layers of data copies and complex systems to manage them. The data ends up flying all over the place, complicating data management and access.

Despite these complexities, the effort is justified by the valuable insights and information derived from the data. Organizations invest in building and maintaining these complex systems because of the significant value data can provide.

OneLake, the OneDrive for Data

OneLake changes all of that. Built right into Microsoft Fabric, OneLake offers a “Data Lake as a Service” experience, transforming how organizations manage and interact with their data.

Instead of dealing with complex setups and multiple, siloed data lakes, OneLake provides a unified, streamlined data lake that is built for you.

With OneLake, each Microsoft Fabric tenant (distinct user or organization) gets a single, unified data lake – always one, never more, never less. There’s no need to set up or maintain any infrastructure. 

As a SaaS service, OneLake leverages the concept of a tenant, automatically providing a unique management and governance boundary for your entire organization. 

The tenant admin has full control over this boundary, ensuring that every piece of data in OneLake benefits from built-in governance features like data lineage, protection, certification, catalog integration, and more, right from the start.

Workspaces

While all data in OneLake is ultimately under the control of the tenant admin, it’s designed to ensure that different business groups can operate independently without needing constant oversight.

Just like an Office user doesn’t have to go through their admin to create a new Teams channel or SharePoint site, OneLake offers a similar approach to distributed ownership through workspaces (groups or projects).

Each workspace within OneLake functions like a separate data environment where different parts of the organization can manage their own data, set access controls, and handle their specific needs – all while still contributing to the central data lake. 

A workspace can be customized with its own administrator, access controls, region, and even its own billing capacity. 

Fabric offers an intuitive way to manage workspaces. Users can view all existing workspaces and create new ones with minimal effort, thanks to the platform’s lightweight approach to workspace creation. 

A new workspace inherits the rules and governance set by the tenant admin. This means there’s no need to reimplement the same governance policies or struggle to get different resources to communicate effectively.

Workspace admins also have the flexibility to manage access to data within their specific workspace, ensuring that teams can collaborate securely while maintaining control. 

If your organization operates in multiple countries and has strict requirements for data residency, OneLake has you covered. It supports multiple regions, allowing different workspaces to reside in different countries, ensuring compliance with local data laws and regulations.

Built on Azure Data Lake Storage Gen2, OneLake uses multiple storage accounts across various regions but presents them as a single, logical data lake. This approach enables seamless data management on a global scale, combining the benefits of local data residency with the simplicity of a unified data environment.

Fabric Data Items

In OneLake, all data is stored as part of what’s known as a Fabric data item. These items are designed to be pre-wired to store data in OneLake using open file formats, ensuring compatibility and flexibility.

So, what exactly is a Fabric data item? 

If you’re familiar with Power BI, you’ve already interacted with one type of data item: Power BI datasets. 

Fabric extends this concept with several new types of data items, each tailored to different user needs. For example, there’s a fully transactional data warehouse designed specifically for SQL developers, and a lakehouse that caters to data engineers. 

The lakehouse provides a familiar, lake-like experience for those used to working with traditional storage solutions but also offers additional capabilities.

Within a selected workspace, various data items are displayed, such as data warehouses, which can include multiple tables and schemas.

Regardless of the type of data item you start with, all data is ultimately stored in OneLake. This is similar to how Word, Excel, and PowerPoint documents are saved in OneDrive. 

When you access OneLake directly, you won’t see the data items or workspaces as separate entities; instead, you’ll interact with files and folders just as you would with other data lake solutions.

Delta Lake Parquet

In OneLake, all tabular data is stored using the Delta Lake Parquet format. There are no new proprietary file formats created just for Fabric. This approach avoids creating isolated or incompatible data formats (data silos), which can make it hard to share and use data across different systems.

Even for SQL data warehouses (which store and manage large amounts of data using SQL), Fabric uses Delta Lake Parquet. This consistency helps ensure that data management is unified and standardized across different types of data storage.
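To make the open-format point concrete, here is a minimal Python sketch of how any tool could work out which Parquet files make up the current version of a Delta table, by replaying the JSON commit files in the table's _delta_log folder. It assumes the log contains no checkpoint files and handles only "add" and "remove" actions; the file names in the sample commits are invented for illustration.

```python
import json


def active_files(commits):
    """Replay Delta commits (oldest first) and return the live data files.

    Each commit is the text of one _delta_log/*.json file, which holds one
    JSON action per line. Only "add" and "remove" actions are handled here.
    """
    live = set()
    for commit in commits:
        for line in commit.splitlines():
            if not line.strip():
                continue
            action = json.loads(line)
            if "add" in action:
                live.add(action["add"]["path"])
            elif "remove" in action:
                live.discard(action["remove"]["path"])
    return sorted(live)


# Invented sample: commit 0 adds two files, commit 1 replaces one of them.
commit_0 = "\n".join([
    json.dumps({"add": {"path": "part-0000.parquet"}}),
    json.dumps({"add": {"path": "part-0001.parquet"}}),
])
commit_1 = "\n".join([
    json.dumps({"remove": {"path": "part-0000.parquet"}}),
    json.dumps({"add": {"path": "part-0002.parquet"}}),
])

print(active_files([commit_0, commit_1]))
```

Because both the log and the data files are plain JSON and Parquet, nothing in this sketch is specific to Fabric, which is the point of using an open format.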

Versatile Storage

Despite this standardization, OneLake remains highly versatile. Built on top of Azure Data Lake Storage (ADLS) Gen2, it supports widely used APIs and a broad range of file types, both structured and unstructured, avoiding vendor lock-in.

This makes OneLake not just a data lake for Fabric or Microsoft but an open data lake designed for broad compatibility and flexibility. 

It’s fully compatible with ADLS Gen2 DFS APIs and SDKs. This ensures that existing applications and workflows built on ADLS Gen2, such as those running in Azure Databricks and Azure HDInsight, can integrate smoothly with OneLake.
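As a rough illustration of that compatibility, the sketch below lists the contents of a lakehouse with the standard ADLS Gen2 Python SDK, pointed at the OneLake endpoint instead of a regular storage account. It assumes the azure-identity and azure-storage-file-datalake packages are installed and that the signed-in identity can access the workspace; the workspace and lakehouse names in the commented call are hypothetical.

```python
ONELAKE_URL = "https://onelake.dfs.fabric.microsoft.com"


def list_onelake_paths(workspace, item_path):
    """List paths under a OneLake item using the standard ADLS Gen2 SDK.

    The Azure packages are imported lazily so this module loads without them.
    """
    from azure.identity import DefaultAzureCredential
    from azure.storage.filedatalake import DataLakeServiceClient

    service = DataLakeServiceClient(ONELAKE_URL, credential=DefaultAzureCredential())
    # The Fabric workspace plays the role of the ADLS file system (container).
    file_system = service.get_file_system_client(workspace)
    return [p.name for p in file_system.get_paths(path=item_path)]


# Hypothetical names; uncomment to run against a real tenant.
# print(list_onelake_paths("SalesWorkspace", "SalesLakehouse.Lakehouse/Files"))
```

The only OneLake-specific detail is the account URL; the client classes and calls are the same ones an existing ADLS Gen2 application would already use.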

Unified Storage

All of the data for a tenant is treated as being part of a single large storage account. This means that no matter how many projects or departments are using the storage, it’s all managed centrally. 

Within this large storage account, you can have different workspaces. Each workspace is treated like a container in the storage system. Within each container, the data is further organized into folders. This hierarchical structure helps keep data organized and accessible.

You don’t have to worry about where your data is physically located or how the storage infrastructure works. Instead, OneLake takes care of managing all that complexity behind the scenes, ensuring scalability and performance.

If you’ve used ADLS APIs (for accessing and managing data in Azure Data Lake Storage), addressing and managing data in OneLake will feel familiar. This is because OneLake supports the same API structure and methods. There’s no need to remember multiple storage accounts – just one virtual account for all OneLake data. 

When you access data in OneLake, the workspace name becomes part of the container portion of the URL. The data item name and its item type (such as a lakehouse or a warehouse) are included in the rest of the URL path, followed by the folders and files inside the item.
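The addressing scheme is easier to see written out. The following sketch builds both the HTTPS and the ABFS form of a OneLake address from a workspace name, an item name, and an item type. The helper names and the "Finance" and "Ledger" values are ours, chosen for illustration.

```python
from urllib.parse import quote

ONELAKE_HOST = "onelake.dfs.fabric.microsoft.com"


def onelake_https_url(workspace, item, item_type, path=""):
    """Build an HTTPS URL: workspace first, then item.itemtype, then the path."""
    parts = [workspace, f"{item}.{item_type}"] + [p for p in path.split("/") if p]
    return f"https://{ONELAKE_HOST}/" + "/".join(quote(p) for p in parts)


def onelake_abfs_path(workspace, item, item_type, path=""):
    """Build the ABFS form, where the workspace sits in the container slot."""
    tail = "/".join(p for p in path.split("/") if p)
    base = f"abfss://{workspace}@{ONELAKE_HOST}/{item}.{item_type}"
    return f"{base}/{tail}" if tail else base


print(onelake_https_url("Finance", "Ledger", "Lakehouse", "Tables/invoices"))
print(onelake_abfs_path("Finance", "Ledger", "Lakehouse", "Tables/invoices"))
```

Note that the host is the same for every workspace, which is what "one virtual account for all OneLake data" means in practice.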

Data Management

Since OneLake is essentially the OneDrive for data, it allows users to explore and manage all workspaces and data directly from within the Windows environment, without needing to leave the operating system or use complex external tools.

Specifically, in a data warehouse, a clear structure is presented, with folders for tables and schemas.

An example of its flexibility is when a user creates a new table using T-SQL and loads data into it. The newly created table, complete with its Delta log and Parquet data files, appears as soon as the folder view is refreshed.

While T-SQL was used to create and manage the table, the data itself is stored in open formats within OneLake. This means it can be easily integrated with other tools and avoids creating isolated data pockets, improving data accessibility and usability.

Lakehouses

Fabric also introduces the concept of a lakehouse, a highly flexible data item within the platform. Unlike fully transactional data warehouses that support SQL operations, lakehouses allow for a broader range of data types and loading methods.

This capability extends to semi-structured and unstructured data, such as images or multimedia files, alongside more traditional structured datasets.

For example, a set of images stored locally on a machine is ready to be used for training an ML model. To facilitate this, the user can navigate to the newly created lakehouse. 

By simply copying and pasting the images from their local machine into the lakehouse’s file section, these files are quickly available in OneLake. This efficient process ensures that the images are seamlessly integrated into the lakehouse environment and ready for use in ML model training within seconds.
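The same copy can be scripted when there are too many files to drag by hand. This is a sketch of one possible approach, not the workflow described above: it maps local image paths to destinations under the lakehouse's Files section and uploads them with the ADLS Gen2 SDK. It assumes the azure-identity and azure-storage-file-datalake packages; the "images" folder and all names passed in are hypothetical.

```python
from pathlib import PurePosixPath

IMAGE_SUFFIXES = {".jpg", ".jpeg", ".png"}


def plan_uploads(relative_paths, lakehouse, target_folder="images"):
    """Map each local image (by relative path) to its path inside the lakehouse."""
    plan = {}
    for rel in relative_paths:
        if PurePosixPath(rel).suffix.lower() in IMAGE_SUFFIXES:
            plan[rel] = f"{lakehouse}.Lakehouse/Files/{target_folder}/{rel}"
    return plan


def upload(plan, local_root, workspace):
    """Upload the planned files through the ADLS Gen2 SDK (needs Azure packages)."""
    from azure.identity import DefaultAzureCredential
    from azure.storage.filedatalake import DataLakeServiceClient

    service = DataLakeServiceClient(
        "https://onelake.dfs.fabric.microsoft.com",
        credential=DefaultAzureCredential(),
    )
    file_system = service.get_file_system_client(workspace)
    for rel, destination in plan.items():
        with open(f"{local_root}/{rel}", "rb") as data:
            file_system.get_file_client(destination).upload_data(data, overwrite=True)


print(plan_uploads(["cats/a.JPG", "notes.txt"], "Vision"))
```

Keeping the planning step separate from the upload makes it possible to review where each file will land before anything is written.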

Data Pipelines

OneLake’s open access also facilitates the integration of existing data pipelines with new systems like Fabric lakehouses.

For example, a data pipeline built in Databricks to manage a data lake can be seamlessly adapted to store data within a new Fabric lakehouse. Lakehouses combine the capabilities of data lakes and data warehouses, making them suitable for existing pipelines.

Suppose your pipeline currently uses storage accounts: it reads from one storage account, processes the data, and then writes the result to another. Transitioning to a lakehouse offers improved performance and flexibility.

To integrate data into the lakehouse environment, navigate to your lakehouse in OneLake, click on it, select the “Properties” window, and copy the ABFS path. The lakehouse path tells Databricks where to find and save data in the new lakehouse. 

The next step is to update your Databricks notebook where you have your data pipeline. Replacing the storage path with the new ABFS path ensures that your output goes to the lakehouse rather than the old storage location.

The lakehouse path includes both the workspace name and the lakehouse name along with the data item type. This allows you to seamlessly read from and write to the lakehouse in Delta Lake format. 

Running the notebook is the final step in the process. 

By simply updating the path in the notebook to point to the Fabric lakehouse, the data is seamlessly loaded into the lakehouse. Upon returning to the lakehouse, users can see that the tables are now populated with the newly added data, ready for further analysis.
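The path swap described above can be sketched as follows. This is a minimal example under stated assumptions, not the exact notebook: the workspace, lakehouse, and table names are hypothetical, and in practice you would paste the ABFS path copied from the lakehouse properties rather than build it by hand. The write call is the standard Spark DataFrame writer API.

```python
def lakehouse_table_path(workspace, lakehouse, table):
    """ABFS path of a lakehouse table: workspace, then item name plus item type."""
    return (
        f"abfss://{workspace}@onelake.dfs.fabric.microsoft.com/"
        f"{lakehouse}.Lakehouse/Tables/{table}"
    )


def write_output(df, path):
    """Write a Spark DataFrame to `path` in Delta format, replacing old data."""
    df.write.format("delta").mode("overwrite").save(path)


# Previously this variable held the path of the old storage account.
output_path = lakehouse_table_path("SalesWorkspace", "SalesLakehouse", "orders")
print(output_path)

# In the Databricks notebook, with processed_df being the pipeline's result:
# write_output(processed_df, output_path)
```

Only the destination path changes; the read, transform, and write logic of the pipeline stays as it was.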

OneLake ensures that there are no data silos by storing data in open formats. Additionally, open access to data allows users to leverage familiar tools and services, making data management more flexible and efficient.

Domains

OneLake effectively addresses data silos within organizations and supports the implementation of a data mesh pattern. Now, the approach is being enhanced even further: Fabric introduces domains as a core component of the OneLake experience, offering a true data mesh as a service.

A domain is a logical grouping of data within an organization, designed to align with specific areas or fields of interest. For example, you might define separate domains for marketing data, finance data, and sales data.

This structure not only supports better data management and accessibility but also aligns with the data mesh pattern by organizing data based on business needs. Domains enable federated governance and business-optimized data consumption, making it easier for organizations to manage and utilize their data effectively.

Let’s explore in detail how easy it is to define, associate, manage, and consume domains in Fabric, enabling organizations to organize their data according to their business requirements, and effectively implementing a data mesh as a service.

Defining and Managing Domains

Domains are managed by domain admins, who have the ability to add descriptions, configure settings, and oversee the domain’s structure.

Domains are defined through the admin portal. In the domain screen, all existing domains in the organization are listed, along with their respective administrators. From here, admins can edit or delete domains as needed.

To create a new domain, such as a finance domain for the finance department, the steps are simple. Start by adding a domain name and a description. Custom branding, such as selecting an image, can also be applied to help users easily identify the domain across different data consumption experiences. 

Administrators for the domain can also be defined at this stage, which delegates control from the tenant level to the domain level. Domain contributors can be assigned, determining who can associate workspaces with the domain. The assignment can be open to the entire organization, specific security groups, or just tenant and domain admins.

Associating Workspaces to Domains

Once a domain is created, workspaces can be associated with it. You can assign one or multiple workspaces to a domain either by their names, by specific security groups, or by the capacity they are assigned to.

For example, assigning workspaces by name allows users to easily search for relevant terms and select multiple workspaces at once, streamlining the assignment process. 

Upon returning to the domain screen, all workspaces assigned to the finance domain can be viewed in one central location.
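To show how these pieces relate, here is a small in-memory model of a domain and of assignment by name. It is purely illustrative: the Domain class, the assign_by_name function, and the user and workspace names are all hypothetical and do not correspond to any Fabric API.

```python
from dataclasses import dataclass, field


@dataclass
class Domain:
    name: str
    description: str = ""
    admins: set = field(default_factory=set)
    contributors: set = field(default_factory=set)
    workspaces: set = field(default_factory=set)


def assign_by_name(domain, all_workspaces, search_term, user):
    """Associate every workspace whose name contains search_term with the domain."""
    if user not in domain.admins | domain.contributors:
        raise PermissionError(f"{user} may not assign workspaces to {domain.name}")
    matches = {w for w in all_workspaces if search_term.lower() in w.lower()}
    domain.workspaces |= matches
    return sorted(matches)


finance = Domain(
    "Finance", "Finance department data", admins={"dana"}, contributors={"lee"}
)
workspaces = ["Finance-Reporting", "Finance-Planning", "Marketing-Campaigns"]

print(assign_by_name(finance, workspaces, "finance", user="lee"))
print(sorted(finance.workspaces))
```

The permission check mirrors the role of domain contributors: only admins and contributors can attach workspaces, and everyone else is rejected.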

Federated Governance for Business Needs

Achieving true federated governance is possible by delegating settings from the tenant level to the domain level. This means domain admins will have more granular control over their specific business areas. 

For instance, while the Export to Excel feature is enabled for the entire organization, it can be blocked specifically in the finance domain, ensuring tighter data management.
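The delegation logic amounts to a lookup with an override. The sketch below models it with plain dictionaries: a tenant-wide default, a set of settings the tenant admin has delegated, and per-domain overrides. The setting key and the data structures are hypothetical and only illustrate the resolution order.

```python
TENANT_SETTINGS = {"export_to_excel": True}
DELEGATED = {"export_to_excel"}  # settings the tenant admin lets domains override
DOMAIN_OVERRIDES = {"Finance": {"export_to_excel": False}}


def effective_setting(name, domain=None):
    """Return the domain's value if the setting is delegated and overridden,
    otherwise fall back to the tenant-wide value."""
    override = DOMAIN_OVERRIDES.get(domain, {})
    if name in DELEGATED and name in override:
        return override[name]
    return TENANT_SETTINGS[name]


print(effective_setting("export_to_excel", "Finance"))
print(effective_setting("export_to_excel", "Marketing"))
```

A domain override only takes effect for settings that appear in the delegated set, so the tenant admin keeps the final say over what can be changed.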

Business-Optimized Data Consumption

Consumers benefit from optimized discovery and consumption experiences in the OneLake Data Hub. Users can filter data by domain, which adjusts branding to reflect the business context and filters the available data to show only the items relevant to their specific needs.

This targeted approach simplifies data discovery and use, making it easier for users to explore and find relevant information across the organization, ultimately improving efficiency and supporting more effective data utilization.

Data Endorsement

To prevent data swamps – where irrelevant or outdated data accumulates – domains enable the endorsement of specific data. By tagging data as certified or promoted within a domain, organizations can highlight valuable and reliable data, ensuring it surfaces to the top. 

This practice encourages regular reviews and ensures that users are engaging with the most pertinent and trusted data.
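Filtering by domain and surfacing endorsed data can be pictured as a filter followed by a sort. In this sketch the item records, their field names, and the ranking order (certified first, then promoted, then everything else) are assumptions made for illustration, not a description of how the OneLake Data Hub is implemented.

```python
ENDORSEMENT_RANK = {"certified": 0, "promoted": 1, None: 2}


def discover(items, domain):
    """Keep only the items in `domain`, with endorsed items listed first."""
    in_domain = [i for i in items if i["domain"] == domain]
    return sorted(
        in_domain, key=lambda i: (ENDORSEMENT_RANK[i["endorsement"]], i["name"])
    )


items = [
    {"name": "raw_exports", "domain": "Finance", "endorsement": None},
    {"name": "invoices", "domain": "Finance", "endorsement": "certified"},
    {"name": "campaigns", "domain": "Marketing", "endorsement": "promoted"},
    {"name": "budget_draft", "domain": "Finance", "endorsement": "promoted"},
]

for item in discover(items, "Finance"):
    print(item["name"], item["endorsement"])
```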

Final Thoughts

Keeping your data organized and easy to access is super important for streamlining processes across your whole organization. The tidier your data, the easier it is for your team to use it effectively.

Take a salesperson, for instance: if they have all the info they need about a potential client right at their fingertips, they can have a more personalized chat, which ups the chances of closing that deal. 

And for someone in finance, managing payments, invoices, and reports becomes way simpler when everything is neatly compiled in one spot. 

OneLake provides a transformative solution to the long-standing challenges of data silos, duplication, and governance. 

With OneLake, Microsoft is not just offering another data storage solution; they are redefining how we think about and use data in a modern, AI-driven world. It’s time to break down the silos and embrace a new era of data management.

If you’re interested in exploring how OneLake can enhance your data management practices, let’s connect. We would love to discuss your challenges and share insights on how these tools can fit into your organization’s data strategy. 

Raj Sanghvi

Raj Sanghvi is a technologist and founder of BitCot, a full-service award-winning software development company. With over 15 years of innovative coding experience creating complex technology solutions for businesses like IBM, Sony, Nissan, Micron, Dicks Sporting Goods, HDSupply, Bombardier and more, Sanghvi helps both major brands and entrepreneurs build and launch their own technology platforms. Visit Raj Sanghvi on LinkedIn and follow him on Twitter.