Navigating Data Management with Microsoft Purview

Navigating Data Management with Microsoft Purview

Article Data Platform Solutions Data & AI Technology & Platforms

As our clients see their data volume and velocity grow, there is a big need to establish a catalog of these valuable assets. Having implemented several Azure Data Platforms, sooner or later the default Microsoft tooling appears as an option. Will it help us find the way in our ‘data estate’? This story will teach you what to expect when you are the one setting it up: what design considerations to make before you start.

In September 2021, Microsoft first released its new unified data governance solution which is now named Microsoft Purview. Purview offers a technological solution for unified governance and management of information by assisting in managing data resources. Whether they are on-premise or in the cloud, across companies and organizations, insight into all of this should be possible. That would in its turn help improve the quality, compliance, security, and decision-making. Amazing, right?

Based on client requests and years of BI experience, we put Purview to the test on the following 3 topics:

  • Data Catalog – What data does our organization have
  • Data Lineage – Where does our data come from and where does it go
  • Data Glossary/Dictionary – What does our data mean

As a test case, we recreated business intelligence & analytical ETL dataflows similar to the practical use cases and projects that companies and organizations face in their day-to-day data journey. We extract data from an on-prem SQL database and let it land in data lake storage, Synapse, Azure SQL database and Fabric Data Warehouse. Data factory (ADF) was used to move data and Power BI was used as a visualization tool in each case.

1.1 Think about what you want to scan

In general, we recommend planning what you aim to achieve with a scan of your Data Estate first. Purview connects to most Azure tools and we will detail out some specific scan findings below. For instance, SQL server can be scanned, but you might as well connect to the Data Factory instance that loads from this server. Similarly, you can point to your gold data layer but you could connect to the Power BI dataset that consumes it as well. Both scans will return overlapping or related sets of objects: The Power BI dataset sourcing from Data Lake Storage might show a storage container, whereas a scan of the storage account returns the individual files or delta tables. The goal, or required level of detail of the data catalog, must be determined before embarking on the Purview journey.

1.2 Manage expectations of the scan results for catalog and lineage

Before judging the lineage and catalog we should consider what to expect. Documenting data flows and other data assets is an unpopular activity, so we hope that Purview provides both business users and IT folks with detailed insights. A business user might ask ‘Do these sales figures come from our order intake system or from our accounting department’s software’? or ‘What is the formula used to calculate the Net Margin in this report?’. In IT we might wonder ‘If we stop feeding data into this bronze table, what data products are impacted?’ or ‘where in the flow of data should we make an adjustment to fulfil the change requested on the Customer Life Time Value calculation’.

One main finding is that the business and IT lineage questions do not get solved automatically. In our investigation, we were never presented with an end-to-end lineage chart. Generally, the Power BI scan displays the direct source of the PBI dataset, the ADF scan displays the lineage from on-premise to cloud storage, but the combination of the two is not connected. As an example, the left three and right three items in below image were connected, but the complete flow was not. In many cases a data steward can add the lineage manually. However, when we only used the new Fabric components, we could not manually connect items. Relationships could only be established through automated scanning.

Second, for business users we were looking for a way to display lineage to the original source. They might not be interested in all the storage and transformation steps the data passes through. We tried using the metamodel for this as well, which allows you to display the business assets and connect to your technical assets. For instance, system A (business asset) feeds Power BI dataset B (data asset). It is possible and the relationship is displayed clearly on the asset’s overview pages. However, the link has to be maintained manually by a data steward.

What’s good to know is that a lot can be manipulated via the API. If you manage to collect some dependencies in (for instance) Excel, a savvy data engineer can make them appear in Purview. But there is no way around some manual work.

On another aspect of the catalog and lineage side we were positively surprised. Purview is able to identify the dependency between different measures and calculated columns created in our Power BI example dataset. Have a look at the image below: when selecting the “TaxAmt” column or “Cash Ratio” measure, it displayed the related columns (“Amount”) and measures (“Cash”, “Liability”), respectively – but not the formula.

Last but not least, when a data asset is sufficiently covered in your data catalog, you can decide to assign it status ‘certified’ to indicate that it is approved by a data steward or owner.

1.3 Curb maintenance expectations: the glossary

Quite some parts of the catalog get filled automatically after running a scan. Tables, fields, and technical dataset attributes such as database and schema are all there. We still need to research, but the classification and scan rule set capabilities look promising, and the resulting labels and tags could even be consumed by Security Copilot. But first, what about the glossary?

Data glossaries in Microsoft Purview should help to standardize the business language, understand the definition of particular columns and fields, and see the relation with associated/dependent fields. It should help the business answer what formula is used to calculate a certain metric. We created several data glossary terms within a “Finance Glossary” and linked them with earlier identified data sources and related schemas. That linkage allows for a two-sided search of both terms and assets where these terms have been utilized. Additionally, we created parent-child relationships between the associated terms, e.g. “Amount” – “TaxAmt” or “Amount” – “Cash” – “Cash Ratio”. Then, using the search bar on Microsoft Purview’s home page, we have looked for “TaxAmt”. The search results were as follows:

We could easily find our glossary term and click through into the details, where we find the description, hierarchy nodes and linked catalog assets easily. Last but not least, a contact page displays the experts and stewards in our organization that know more about the glossary term. Note that we set up all glossary items and linkages manually. 

Takeaways

Purview can support your business when you are looking for a data catalog and glossary. End-to-end data management insights on your data can be achieved. However, the success of your governance tool implementation stands or falls by completeness and correctness of its catalog items, lineage and glossary terms. If you are to start using Purview today, be sure to find dedicated data stewards for the manual tasks mentioned before. Come up with quality standards on catalog, lineage and glossary before calling a dataset ‘certified’. Additionally, if manipulating data via the Purview API is a requirement, get a skilled engineer on board and plan for the future maintenance so that your data estate insight remains up to date.

Looking for help growing in data management? Need help with your data domain setup or technical nitty gritty work in Azure? Eraneos can help! Reach out to our experts.


Stay up to date!

Are you enjoying this content? Sign up for our (Dutch) Newsletter to get highlighted insights written by our experts.


Eser Çomak
By Eser Çomak
Junior Machine Learning Engineer – Data & AI
Jasper Stoop
By Jasper Stoop
Senior Data & AI Consultant
Javid Muradov
By Javid Muradov
BI Data & Warehouse Consultant – Data & AI
Paul Schoondermark
By Paul Schoondermark
Senior Data & AI Consultant – Data & AI

13 Mar 2024
Knowledge Hub overview