Unity Catalog in Databricks is a data governance solution that helps ensure security, access control, and metadata management across your data lakehouse. To make it clearer, let's break down the core concepts with an example:
Example Scenario: Data Governance for a Sales Data Lakehouse
Imagine you have a Sales Data Lakehouse in Databricks with data related to sales transactions, customer information, and inventory data.
Unity Catalog Structure:
- Metastore:
  - You create a metastore to store all your metadata (such as tables, schemas, etc.). This metastore will be connected to your ADLS Gen2 storage account.
  - Example: You have a metastore called sales_metastore.
- Catalog:
  - A catalog is a top-level grouping of data, such as SalesData, CustomerData, etc. You can think of it like an environment (Dev, Prod).
  - Example: Create a catalog called SalesCatalog.
- Schema:
  - Inside each catalog, you have schemas that organize the data. A schema is similar to a database in traditional relational databases.
  - Example: Inside SalesCatalog, you can have schemas like Transactions, Customers, Inventory.
- Tables:
  - Inside each schema, you store data in tables (Managed or External).
  - Example: Inside the Transactions schema, you might have a table transaction_data, and in the Customers schema, a table customer_info.
- Views:
  - A view is a read-only object created from one or more tables, which can be used to simplify queries.
  - Example: Create a view monthly_sales_summary from the transaction_data table to summarize monthly sales.
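To make the hierarchy concrete, here is a minimal Python sketch that builds the three-level name (catalog.schema.table) and the SQL statements that would create the example objects. It only prints the statements; in a Databricks notebook each one could be passed to spark.sql. The column names (transaction_id, sale_date, amount) and the choice to place the view in the Transactions schema are assumptions made for illustration, not part of the scenario above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TableRef:
    """A Unity Catalog table addressed by its three-level name."""
    catalog: str
    schema: str
    name: str

    @property
    def full_name(self) -> str:
        # Backticks quote each identifier separately.
        return ".".join(f"`{part}`" for part in (self.catalog, self.schema, self.name))


def setup_statements(table: TableRef, view_name: str) -> list[str]:
    """Return DDL for the catalog, schema, table, and a monthly summary view."""
    catalog = f"`{table.catalog}`"
    schema = f"{catalog}.`{table.schema}`"
    view = f"{schema}.`{view_name}`"
    return [
        f"CREATE CATALOG IF NOT EXISTS {catalog}",
        f"CREATE SCHEMA IF NOT EXISTS {schema}",
        # Hypothetical columns, chosen only so the view below has something to sum.
        f"CREATE TABLE IF NOT EXISTS {table.full_name} "
        "(transaction_id BIGINT, sale_date DATE, amount DECIMAL(12, 2))",
        f"CREATE OR REPLACE VIEW {view} AS "
        "SELECT date_trunc('MONTH', sale_date) AS sales_month, SUM(amount) AS total_sales "
        f"FROM {table.full_name} GROUP BY date_trunc('MONTH', sale_date)",
    ]


transactions = TableRef("SalesCatalog", "Transactions", "transaction_data")
statements = setup_statements(transactions, "monthly_sales_summary")
for stmt in statements:
    print(stmt)  # in a Databricks notebook: spark.sql(stmt)
```

Keeping the catalog, schema, and table parts separate until the last moment makes it easy to point the same code at a different catalog, for example one per environment.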
Access Control & Roles:
- Permissions:
  - Unity Catalog allows you to grant specific permissions to users or groups at different levels: catalog, schema, and table.
  - Example: You can grant read access (SELECT) on the transaction_data table to the Sales Analysts group, while only the Sales Managers group receives write access (MODIFY).
- Roles:
  - Databricks uses admin roles alongside Unity Catalog privileges to manage access. Typical roles are Account Admin, Metastore Admin, Workspace Admin, and Workspace User.
  - Example:
    - Workspace Admin can create and manage clusters but, by default, cannot alter catalogs or schemas without being granted privileges on them.
    - Metastore Admin can create catalogs and schemas and manage access to them.
    - Workspace User can access shared clusters and notebooks but may not be able to create clusters unless given permission.
- Role-based Access Control (RBAC):
  - Syncing with Azure Active Directory (Azure AD, now Microsoft Entra ID) keeps user groups and their permissions consistent across the Databricks workspace.
  - Example: The SalesAnalysts group in Azure AD is synced with Databricks. This group gets read-only access to the transaction_data table but cannot modify the data.
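The read/write split described above can be sketched as a small Python helper that generates the GRANT statements. This is an illustrative sketch, not a prescribed setup: the group name SalesManagers is an assumed spelling of the Sales Managers group, and the helper assumes each group also needs USE CATALOG and USE SCHEMA on the parent objects before table-level privileges take effect.

```python
def table_grants(catalog: str, schema: str, table: str,
                 group: str, privileges: list[str]) -> list[str]:
    """Build GRANT statements giving a group the listed privileges on one table."""
    principal = f"`{group}`"
    schema_name = f"`{catalog}`.`{schema}`"
    # Parent-level privileges so the group can reach the table at all.
    statements = [
        f"GRANT USE CATALOG ON CATALOG `{catalog}` TO {principal}",
        f"GRANT USE SCHEMA ON SCHEMA {schema_name} TO {principal}",
    ]
    # One table-level grant per requested privilege.
    statements += [
        f"GRANT {privilege} ON TABLE {schema_name}.`{table}` TO {principal}"
        for privilege in privileges
    ]
    return statements


# Analysts can only read; managers can read and write.
analyst_grants = table_grants(
    "SalesCatalog", "Transactions", "transaction_data", "SalesAnalysts", ["SELECT"]
)
manager_grants = table_grants(
    "SalesCatalog", "Transactions", "transaction_data", "SalesManagers", ["SELECT", "MODIFY"]
)

for stmt in analyst_grants + manager_grants:
    print(stmt)  # in a Databricks notebook: spark.sql(stmt)
```

Granting to groups rather than individual users means that membership changes in the identity provider carry over without any new GRANT statements.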
SCIM Provisioning:
- SCIM (System for Cross-domain Identity Management) can automatically sync users and groups from Azure AD to Databricks Unity Catalog.
- Example: You set up SCIM so that when new employees are added to the sales team group in Azure AD, they are provisioned in Databricks automatically and inherit the permissions granted to that group.
Audit and Security:
- Unity Catalog helps with security by tracking who accessed which data and what actions were performed. This audit logging is built-in.
- Example: If someone accesses the transaction_data table, Unity Catalog logs the action (e.g., read access) and the user’s identity.
Delta Sharing:
- With Unity Catalog, you can share data securely with external platforms using Delta Sharing.
- Example: You want to share the monthly_sales_summary view with a partner using Delta Sharing. This data is shared securely and can be consumed on any platform that supports Delta Sharing.
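As a rough sketch of what sharing the view involves, the Python snippet below assembles the SQL statements for creating a share, adding the view, creating a recipient, and granting access. The share name sales_share, the recipient name partner_recipient, and the view's location in the Transactions schema are hypothetical. Sharing views has its own requirements on Databricks, so treat this as an outline to check against the Delta Sharing documentation rather than a finished procedure.

```python
def share_statements(share: str, recipient: str, view_full_name: str) -> list[str]:
    """Build the SQL needed to share one view with one recipient."""
    return [
        f"CREATE SHARE IF NOT EXISTS {share}",
        f"ALTER SHARE {share} ADD VIEW {view_full_name}",
        f"CREATE RECIPIENT IF NOT EXISTS {recipient}",
        f"GRANT SELECT ON SHARE {share} TO RECIPIENT {recipient}",
    ]


# Hypothetical names for the share and the recipient.
sharing_sql = share_statements(
    share="sales_share",
    recipient="partner_recipient",
    view_full_name="SalesCatalog.Transactions.monthly_sales_summary",
)

for stmt in sharing_sql:
    print(stmt)  # in a Databricks notebook: spark.sql(stmt)
```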
Data Lineage:
- Unity Catalog captures data lineage to track how data is transformed and derived.
- Example: You create a new column in the transaction_data table to calculate the monthly revenue, and Unity Catalog tracks the lineage of that new column to show how it was derived.
Steps to Set Up Unity Catalog:
- Create Storage Account: Set up an Azure Data Lake Storage (ADLS Gen2) account to act as the metastore's storage location (it holds the data for managed tables).
- Create and Link Metastore:
  - Create a Metastore and link it to the ADLS storage.
  - Once the metastore is created, you can link it to multiple workspaces for shared access.
- Set Permissions:
  - Assign appropriate permissions (e.g., read, write, manage) to users/groups at the catalog, schema, or table level.
- Enable SCIM Provisioning: Sync users and groups from Azure Active Directory to manage access automatically.
Step-by-Step Guide to Create a SCIM Connector:
Step 1: Generate SCIM Token in Databricks
- Go to Databricks Account Console > Settings.
- Click User Provisioning > Generate SCIM Token.
- Save the SCIM Token and SCIM URL.
Step 2: Configure SCIM in Azure Active Directory
- Navigate to Azure Active Directory > Enterprise Applications.
- Create a new application by searching for Azure Databricks SCIM Provisioning Connector.
- Go to Provisioning and set the Provisioning Mode to Automatic.
- Enter SCIM URL as Tenant URL and SCIM Token as Secret Token.
- Click Test Connection.
- Under Scope, select Sync only assigned users and groups.
- Turn Provisioning Status on.
- Manually assign users/groups to be synced.
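Once provisioning is turned on, one way to check that users have arrived is to list them through the SCIM endpoint. The sketch below uses only the Python standard library and follows the generic SCIM 2.0 protocol (a Users resource under the SCIM base URL, a bearer token, and a ListResponse body). The URL and token shown are placeholders for the values saved in Step 1, and the sample response is made up for illustration; the request is built but not sent.

```python
import json
import urllib.request


def build_list_users_request(scim_url: str, scim_token: str,
                             count: int = 10) -> urllib.request.Request:
    """Build (but do not send) a SCIM 2.0 request that lists provisioned users."""
    url = f"{scim_url.rstrip('/')}/Users?count={count}"
    return urllib.request.Request(
        url,
        headers={
            "Authorization": f"Bearer {scim_token}",
            "Accept": "application/scim+json",
        },
        method="GET",
    )


def user_names(list_response_body: str) -> list[str]:
    """Extract userName values from a SCIM ListResponse JSON document."""
    payload = json.loads(list_response_body)
    return [user["userName"] for user in payload.get("Resources", [])]


# Placeholder values: substitute the SCIM URL and SCIM Token saved in Step 1.
request = build_list_users_request("https://example.invalid/scim/v2", "<scim-token>")
print(request.full_url)

# Made-up response body, shaped like a SCIM ListResponse.
sample_body = json.dumps({
    "schemas": ["urn:ietf:params:scim:api:messages:2.0:ListResponse"],
    "totalResults": 1,
    "Resources": [{"userName": "analyst@example.com"}],
})
print(user_names(sample_body))

# To call the real endpoint (needs network access and a valid token):
# with urllib.request.urlopen(request) as response:
#     print(user_names(response.read().decode("utf-8")))
```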
Summary:
- SCIM Token & URL are generated in Databricks and used to securely connect Azure AD with Databricks.
- Automatic Provisioning in Azure AD enables user and group synchronization.
- After configuring, users/groups from Azure AD are synced to Databricks, allowing seamless user management and access control.
Summary of Key Features with Example:
- Centralized Metadata Management: Manage data across multiple workspaces (e.g., SalesCatalog).
- Role-Based Access Control: Control access to datasets using roles like Metastore Admin or Workspace User.
- Delta Sharing: Securely share data (e.g., monthly_sales_summary) with external platforms.
- Audit and Lineage: Track data access and transformations (e.g., revenue calculation lineage).