
Achieving Modern Data Management With One Copy in Microsoft Fabric’s OneLake

December 6, 2024 | AI, Automation

In our previous article, we explored how data is organized, stored, and managed within OneLake, highlighting its key features.

Next, let’s shift our focus to One Copy, a capability that lets you access and use data across lakes and clouds without creating copies, reducing data silos and improving data governance.

Organizations often find themselves needing to copy data out of the data lake into various engines to provide access for users and applications. This process not only creates inefficiencies but also leads to data silos and increased management complexity.

With OneLake’s One Copy feature, this challenge is addressed head-on. One Copy is designed to maximize the value derived from a single copy of data, eliminating the need for unnecessary data movement or duplication. 

Let’s check out how One Copy works!

Streamlining Data Access and Management with Shortcuts

Just like Windows shortcuts, OneLake shortcuts point you to different storage locations, making it super easy to find what you need. They streamline your workflow, so you can get to your data faster and keep things flowing smoothly.

Connecting Data Domains with Shortcuts

Let’s revisit the various domains within OneLake. In large organizations, it’s common to have numerous data domains, each managed by different data owners. When we zoom out to view these domains collectively, we see how they contribute to the broader data landscape.

To achieve a comprehensive, 360-degree view of your business, it’s essential for a single data item to span multiple domains. This is where shortcuts come into play.

Shortcuts act as connectors between domains, enabling data to be virtualized into a unified data product. Importantly, this process occurs without the need for data movement or duplication, preserving the original ownership of the data.

Essentially, a shortcut functions as a symbolic link, directing users from one data location to another seamlessly. This innovative approach allows organizations to maximize their data utility while maintaining the integrity and organization of their data domains.

Seamless Data Access with Shortcuts

Just like creating shortcuts in Windows or Linux, OneLake allows data to appear in the shortcut location as if it were physically there. 

Previously, if you wanted to make tables from a data warehouse available alongside other tables or files in a lakehouse, you would need to copy that data out of the warehouse. 

With OneLake, you simply create a shortcut in the lakehouse that points to the warehouse, and the data will be accessible in your lakehouse as if you had physically copied it.

The beauty of this approach is that because the data isn’t actually copied, any updates made in the warehouse are automatically reflected in the lakehouse. Additionally, shortcuts enable the consolidation of data across various workspaces and domains without altering the ownership of the data. 
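Shortcuts can be created in the Fabric portal, but they can also be scripted. The sketch below builds the JSON body for Fabric’s shortcut REST API with a OneLake target (a lakehouse shortcut pointing at a warehouse table). The exact endpoint and field names are my reading of the Fabric API, not something confirmed by this article, so treat them as assumptions and verify against Microsoft’s documentation:

```python
import json

FABRIC_API = "https://api.fabric.microsoft.com/v1"  # assumed base URL


def onelake_shortcut_payload(name: str, source_workspace_id: str,
                             source_item_id: str, source_path: str) -> dict:
    """Build the JSON body for a shortcut whose target lives in OneLake.

    Field names mirror the Fabric 'Create Shortcut' REST API as we
    understand it; verify them against the official docs before use.
    """
    return {
        "path": "Tables",          # where the shortcut appears in the lakehouse
        "name": name,              # shortcut (table) name
        "target": {
            "oneLake": {
                "workspaceId": source_workspace_id,
                "itemId": source_item_id,   # e.g. the source warehouse
                "path": source_path,        # e.g. "Tables/orders"
            }
        },
    }


# The request itself needs an Entra ID bearer token (not executed here):
# requests.post(f"{FABRIC_API}/workspaces/{ws}/items/{lakehouse}/shortcuts",
#               headers={"Authorization": f"Bearer {token}"},
#               json=onelake_shortcut_payload("orders", src_ws, src_wh,
#                                             "Tables/orders"))

payload = onelake_shortcut_payload("orders", "ws-guid", "wh-guid", "Tables/orders")
print(json.dumps(payload, indent=2))
```

Because the payload only references the source, nothing is copied: deleting the shortcut later removes the reference, not the warehouse data.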

Maintaining Data Ownership and Virtualization

In such a setup, the source workspace (say, Workspace B) retains ownership of the data, maintaining ultimate control over access and updates. Many organizations have existing data lakes stored in ADLS Gen2 or Amazon S3 buckets, and these lakes can continue to operate and be managed outside of Fabric.

OneLake extends the functionality of shortcuts to include lakes outside of OneLake and even beyond Azure, allowing for comprehensive virtualization. 

This means that all data – regardless of its original location – can be mapped to a unified namespace within OneLake. Users can access this data seamlessly using the same ADLS Gen2 APIs, even if it originates from Amazon S3. 

This integration ensures a cohesive data experience, simplifying access while preserving existing data governance and management structures.
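Concretely, the unified namespace means every OneLake path follows the same ADLS Gen2-style addressing, with the workspace in the container position and the item below it. A minimal sketch (workspace and item names are illustrative):

```python
ONELAKE_ENDPOINT = "https://onelake.dfs.fabric.microsoft.com"


def onelake_url(workspace: str, item: str, item_type: str, path: str) -> str:
    """Compose an ADLS Gen2-style URL for a OneLake path.

    OneLake exposes the same DFS addressing as ADLS Gen2: the workspace
    takes the container position, followed by the item (e.g.
    'sales.Lakehouse') and the path inside it.
    """
    return f"{ONELAKE_ENDPOINT}/{workspace}/{item}.{item_type}/{path}"


# Any ADLS Gen2 client can then read this path, even when the underlying
# bytes sit in S3 behind a shortcut. For example (not executed here):
# from azure.storage.filedatalake import DataLakeServiceClient
# svc = DataLakeServiceClient(ONELAKE_ENDPOINT, credential=credential)
# fs = svc.get_file_system_client("myWorkspace")
# data = (fs.get_file_client("sales.Lakehouse/Files/customers.csv")
#           .download_file().readall())

print(onelake_url("myWorkspace", "sales", "Lakehouse", "Files/customers.csv"))
```

The client never needs to know whether the bytes physically live in OneLake, ADLS Gen2, or S3; the shortcut resolves that behind the same URL scheme.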

Integrating Compute for Enhanced Data Experiences

Up to this point, our discussions have focused primarily on storage. However, it’s important to highlight that compute is what truly powers all the analytical experiences within Fabric. In this framework, compute is completely separate from storage, a concept that, while not entirely new, is leveraged effectively in Fabric.

The Power of Compute in Fabric

Unlike traditional systems that often provide a single, multipurpose compute engine, Fabric offers multiple dedicated compute engines. Each of these engines can access the same copy of data directly, eliminating the need to import data into a separate instance. 

This architecture ensures that users can choose the best engine for their specific analytical tasks, maximizing efficiency and performance. Whether you need advanced analytics, real-time processing, or batch jobs, Fabric’s flexible compute options empower you to get the most out of your data without unnecessary duplication or complexity.

Seamless Integration with Fabric

Consider a scenario where a team of SQL engineers is tasked with building a fully transactional data warehouse. They leverage the T-SQL engine, utilizing its robust capabilities to create tables, transform data, and manage loads efficiently. 

Traditionally, if a data scientist wanted to access this data, they would have to rely on a connector that routed through the SQL engine or resort to copying the data out of SQL and into the lake.

However, with Fabric, this process becomes much more streamlined. The T-SQL engine natively stores data in OneLake using Delta Parquet format. This integration allows data scientists to tap into the full potential of Spark and various open-source libraries, enabling them to read data directly from the data warehouse within OneLake.
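Since warehouse tables land in OneLake as Delta, a data scientist can address them directly with an ABFS-style URI and read them from Spark. A sketch, assuming the `<item>.Warehouse/Tables/<table>` layout (check the exact path, including any schema folder, in the OneLake file explorer):

```python
def warehouse_table_uri(workspace: str, warehouse: str, table: str) -> str:
    """ABFS-style URI for a Delta table the T-SQL engine wrote into OneLake.

    Assumes the '<item>.Warehouse/Tables/<table>' layout under which
    warehouse tables surface in OneLake.
    """
    return (f"abfss://{workspace}@onelake.dfs.fabric.microsoft.com/"
            f"{warehouse}.Warehouse/Tables/{table}")


# In a Fabric notebook (or any Delta-capable Spark session) -- not executed here:
# df = spark.read.format("delta").load(warehouse_table_uri("Sales", "DW", "orders"))
# df.groupBy("region").count().show()

print(warehouse_table_uri("Sales", "DW", "orders"))
```

No connector through the SQL engine, and no export step: Spark reads the same Delta Parquet files the warehouse maintains.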

The advantages extend beyond data scientists; business users can effortlessly access this data for reporting in Power BI without any intermediary steps. 

This seamless access not only enhances collaboration across teams but also fosters a more efficient data ecosystem, empowering users to derive insights and make informed decisions faster than ever.

Enhancing Power BI with Direct Lake Mode

Power BI reports utilize the analysis services engine to query data, traditionally offering two connection methods: importing data into memory and querying directly from the source.

Importing data creates an additional copy that must be maintained, while direct querying avoids this duplication but can often be slower due to the lack of in-memory caching.

With the introduction of the new direct lake mode, the analysis services engine can now read Delta Parquet files directly into memory without creating a separate copy. This innovation effectively combines the strengths of both import and direct query methods, optimizing performance and simplifying data management.

For organizations where data engineering teams favor Spark over SQL, this mode allows for the full utilization of Spark’s capabilities, enabling teams to transform and load data into the lakehouse seamlessly using Notebooks. 

Unified Data Strategy for All Teams

The T-SQL engine remains a powerful tool for creating views and serving data to business analysts executing SQL queries. 

With direct lake mode in the analysis services engine, business users can access their Power BI reports using the same copy of data, ensuring consistency and accuracy in reporting.

When defining your organization’s data strategy, there’s no longer a need to optimize for different teams with varying skill sets and preferences. Teams that prefer SQL can seamlessly work within that environment, while those inclined toward Spark can utilize its capabilities without conflict. 

This collaborative approach allows everyone to build and contribute to the same data lake, eliminating silos and fostering a more integrated data culture.

Moreover, with open access to OneLake, even teams using external engines, such as Databricks Notebooks, can efficiently land data directly into the lakehouse using ADLS DFS APIs. 

Virtualizing Data with Shortcuts

Teams can also create shortcuts to existing ADLS Gen2 or S3 accounts set up through Databricks, effectively virtualizing that data within the lakehouse. This means that all engines will operate over the same copy of data, ensuring consistency and minimizing redundancy.

Microsoft is optimizing Fabric’s compute engines to work natively with Delta Parquet, the standard format for tabular data in OneLake. Working directly with Delta Parquet files improves performance and simplifies data access.

For example, the T-SQL engine for data warehousing and direct lake mode for analysis services have been adapted to leverage Delta Parquet. This enables faster query execution and streamlined data access, allowing users to efficiently analyze large datasets without the need for complex transformations.

Achieving a Unified View with Shortcuts

Think about an organization where multiple teams are tasked with managing various data sets that reside in OneLake, across Azure, and even in other cloud environments.

To obtain a comprehensive 360-degree view of the business using a common data mesh, it’s essential to combine data from these diverse domains. 

Traditionally, this process would involve significant data movement, but with OneLake, shortcuts can be leveraged to reference the data without creating additional copies.

Suppose one of the business domains has prepared centrally managed, certified data in a lakehouse in OneLake. To reuse this data, you add a new shortcut that references it directly within your lakehouse. Selecting OneLake shows all accessible lakehouses, and you can choose the certified one containing the order table.

Once the table is selected, it appears in the lakehouse as if it had been copied. However, since no data has actually been duplicated, there is always access to the most up-to-date information. This streamlined approach enhances data accessibility and promotes a unified view across the organization.

Integrating Data from Amazon S3 with Shortcuts

For teams utilizing Amazon S3 for data storage, the shortcut creation process remains equally seamless. By linking data outside of OneLake, including Azure Data Lake Storage and Amazon S3, teams can efficiently access diverse data sources.

For example, suppose you need a second dataset that is stored in Amazon S3 in Delta Parquet format. Copy the path to the customer dataset and, when creating the shortcut, select S3 as the source.

Next, enter the connection information to establish access, then choose the specific data location within the S3 bucket where the relevant dataset resides.

With these straightforward steps completed, the customer table instantly appears in the lakehouse, all without the data ever needing to leave S3.
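The same steps can be scripted against Fabric’s shortcut API with an S3 target instead of a OneLake one. As before, the field names here (`amazonS3`, `location`, `subpath`, `connectionId`) are assumptions about the API shape, and `connection_id` refers to a Fabric connection holding the S3 credentials, created beforehand in the portal:

```python
def s3_shortcut_payload(name: str, bucket_url: str, subpath: str,
                        connection_id: str) -> dict:
    """Body for a shortcut whose target is an Amazon S3 bucket.

    Field names are assumptions about the Fabric 'Create Shortcut' API;
    verify them against the official docs before use.
    """
    return {
        "path": "Tables",
        "name": name,
        "target": {
            "amazonS3": {
                "location": bucket_url,    # e.g. "https://mybucket.s3.us-west-2.amazonaws.com"
                "subpath": subpath,        # folder holding the Delta table
                "connectionId": connection_id,
            }
        },
    }


payload = s3_shortcut_payload(
    "customers",
    "https://mybucket.s3.us-west-2.amazonaws.com",
    "/delta/customers",
    "conn-guid",
)
print(payload["target"]["amazonS3"]["location"])
```

Once the shortcut exists, the customer table is queryable through the lakehouse while the bytes stay in S3.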

Now, the lakehouse boasts a comprehensive dataset, including the order table from OneLake, the customer data set from S3, and, thanks to another shortcut created previously, the product data set from Azure Data Lake Storage. 

This integration fosters a more cohesive and accessible data environment, enhancing overall data management and analysis capabilities.

With OneLake’s One Copy approach, the same data can be accessed by multiple compute engines, whether stored natively in OneLake or logically available through shortcuts. 

Data scientists can train their models directly over this data using a Spark Notebook. For instance, a query can run on the three data sets simultaneously. Data warehouse professionals can perform queries and analyses, joining across these data sets seamlessly. 
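A query spanning the three shortcut-backed tables might look like the sketch below. The table and column names are illustrative, not taken from the article; in a Fabric notebook you would pass the string to `spark.sql(...)`, and a warehouse analyst could run essentially the same join in T-SQL:

```python
# One query over all three shortcut-backed tables; in a Fabric lakehouse the
# tables would already be registered, so only the SQL is composed here.
CROSS_DOMAIN_QUERY = """
SELECT c.region,
       p.category,
       SUM(o.amount) AS revenue
FROM   orders o                                       -- certified OneLake lakehouse
JOIN   customers c ON c.customer_id = o.customer_id   -- shortcut to Amazon S3
JOIN   products  p ON p.product_id  = o.product_id    -- shortcut to ADLS Gen2
GROUP  BY c.region, p.category
"""

# In a Spark session: spark.sql(CROSS_DOMAIN_QUERY).show()
print(CROSS_DOMAIN_QUERY.strip())
```

The point is that the join crosses three storage systems without a single copy step; every engine resolves the same files through the shortcuts.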

Business analysts can navigate to the modeling view, developing their data models and creating rich BI reports with exceptional performance using Direct Lake.

OneLake simplifies the organization of a company’s data into a unified logical lake, allowing one copy of data to be utilized across domains and projects by various compute engines. 

This empowers data engineers, data scientists, SQL analysts, and BI analysts to collaborate effectively within a common data mesh, all while avoiding data movement or duplication.

Empowering Discovery and Integration in the Data Hub

The OneLake Data Hub serves as the central hub within Fabric for discovering, managing, and reusing data, catering to a wide range of users, from data engineers to business professionals.

Organized for Easy Discovery

One of the standout features of the OneLake Data Hub is its organization by domain – whether it’s finance, HR, or sales, users can easily locate information relevant to their needs. 

This domain-specific approach not only simplifies navigation but also enhances data discovery through robust features like advanced search, filters, and sorting. Users can traverse a hierarchy of workspaces, making access to essential data straightforward and intuitive.

Once a user selects an item of interest, they can delve deeper into its details, exploring and reusing related items seamlessly. 

The detail page provides valuable metadata, including descriptions, endorsements, and sensitivity labels, along with a comprehensive view of all related items – both downstream and upstream – that leverage that specific data artifact.

Users can perform various actions on the selected item, such as previewing data, conducting explorations, analyzing in Excel, and creating reports, all of which can be accomplished even by those without technical expertise. 

Understanding Data Flow with Lineage

The Data Hub plays a crucial role in understanding data flow within Fabric. It provides access to a lineage view, enabling users to conduct lineage and impact analysis. This feature helps assess the potential effects of any upcoming changes on data dependencies and usage.

A Consistent Experience Across Fabric

The Data Hub offers a consistent and pervasive experience throughout your organization. 

The same intuitive data discovery capabilities are available throughout Fabric, ensuring that users can easily find the data they need in different contexts. This context-aware functionality helps teams collaborate better and work more efficiently by leveraging the right data at the right time in their processes.

For example, the Data Hub serves as a compact view during key operations such as creating a shortcut in OneLake, importing data in a dataflow, connecting to a KQL database, creating datasets, and attaching a Notebook to a lakehouse.

By leveraging the OneLake Data Hub across these diverse applications, organizations empower users to discover and utilize data efficiently, fostering collaboration and informed decision-making throughout the enterprise.

Integration with Microsoft Office

The Data Hub serves as a vital connection to Office applications, enabling users to seamlessly discover, use, and explore OneLake data directly within their familiar workflows. 

Currently available in Microsoft Teams, this integration enhances accessibility for both technical and non-technical users. Users can easily find data that is pertinent to their specific business domain, making the discovery process intuitive and efficient.

Within the Data Hub, you can delve into metadata to understand the context and details of the data, as well as access lineage information to trace data origins and transformations. This visibility empowers users to gain valuable insights, allowing them to make informed decisions and take actionable steps based on the data at hand. 

Optimized Filtering for Enhanced Exploration

Users can easily filter data by HR or finance categories, for example, facilitating business-optimized consumption and reinforcing a data mesh paradigm. The Data Hub offers various filtering options, including recommended items and navigation by workspace, allowing exploration of properties like type, owner, and sensitivity.

Additionally, the “Endorsed in My Org” view helps identify curated data that has been certified as reliable sources of truth. Users can also apply filters based on specific data item types like datasets, lakehouses, warehouses, and more, empowering teams to access the most relevant and trusted data for their needs.

For example, users can filter for lakehouses and navigate to a workspace of interest. Upon selection, they can explore its details, reviewing all related items that utilize this lakehouse and taking necessary actions. 

Tracking Data Connections with Lineage View

The Lineage view is another powerful feature of the Data Hub.

By opening the Lineage view for a data item, users can track its connections to the relevant notebooks and pipelines that depend on it.

This functionality also allows for impact analysis, showing how changes could affect various items across all workspaces, facilitating informed decision-making and effective data management.

Consistent Data Access Across Applications

The Data Hub offers a consistent experience across the Fabric environment.

The Lakehouse editor can be opened to perform various tasks, including creating a new shortcut to the data stored in the lake. When accessing data, selecting OneLake will bring up the same data hub experience in a compact mode, ensuring consistency and completeness while exploring data across Fabric. 

Users can filter their search by business domain or item type, making it easier to find the specific data needed for their projects. Select an item to proceed with creating a shortcut. 

The same Data Hub experience is also accessible in Power Query Online, widely used for data manipulation. In this case, opening the data flow editor allows users to click on “Get Data.” Here, users can either connect to external sources like Excel or access existing data through the Data Hub.

To gather more information, users can click on the dedicated tab to browse data items and their details, just as previously mentioned, and select an item to connect with and retrieve data from. 

Data Discovery in Microsoft Office Applications

The Data Hub is not limited to Fabric or hosted environments; it also serves as the gateway for discovering and exploring data within Microsoft Office applications.

In Microsoft Teams, the Data Hub is readily accessible, providing a comprehensive view of all Fabric data across different item types. Users can explore this data by properties or specific business domains, allowing for meaningful insights and actionable outcomes. 

This integration ensures that data discovery and interaction remain seamless, empowering teams to leverage their data effectively within the familiar Teams environment.

One Security

One Security is an upcoming data protection feature designed to allow users to secure their data once and use it anywhere. This initiative aims to establish a shared universal security model that can be defined within OneLake, ensuring that these security definitions reside alongside the data itself.

This model is crucial because it means security will be living natively with the data, rather than being managed downstream in the serving or presentation layers. 

To achieve this, OneLake will offer robust security features at the lake level, supporting a diverse range of analytics scenarios. Users will be able to define more granular data security at the data item level within OneLake.

This will encompass table, column, and row-level security. For instance, if security is defined on a data warehouse, that security will automatically extend to any shortcuts referencing that data, ensuring enforcement across all engines. 

This includes access through T-SQL for analysts and engineers, Spark for data engineers and scientists, Power BI for business users, and even non-Fabric engines using the ADLS DFS APIs.

Final Thoughts

That wraps up our look at OneLake and how the OneLake Data Hub can help your organization break down data silos. With these tools, teams can easily discover, manage, and share data, collaborating more effectively and accessing the information they need.

This unified approach not only boosts accessibility but also helps everyone make better decisions based on solid data. 

As businesses keep changing and growing, using something like OneLake will be key to getting the most out of your data and building a culture where teamwork thrives. It’s not just about storing data; it’s about creating a foundation where data is accessible, manageable, and most importantly, usable by everyone in the organization.

If you’re looking to optimize your data strategy or have questions about implementation, we invite you to connect with us. Let’s explore how OneLake can support your organization in achieving a unified, efficient approach to data. 

Raj Sanghvi

Raj Sanghvi is a technologist and founder of BitCot, a full-service award-winning software development company. With over 15 years of innovative coding experience creating complex technology solutions for businesses like IBM, Sony, Nissan, Micron, Dick’s Sporting Goods, HDSupply, Bombardier and more, Sanghvi helps both major brands and entrepreneurs build and launch their own technology platforms. Visit Raj Sanghvi on LinkedIn and follow him on Twitter.