Unity Catalog

Unity Catalog is Databricks' data governance solution: it provides centralized access control, auditing, lineage, and metadata management across your data lakehouse. To make these ideas concrete, let's break down the core concepts with an example:

Example Scenario: Data Governance for a Sales Data Lakehouse

Imagine you have a Sales Data Lakehouse in Databricks with data related to sales transactions, customer information, and inventory data.

Unity Catalog Structure:

  1. Metastore:

    • You create a metastore, the top-level container for all your metadata (catalogs, schemas, tables, etc.). On Azure, the metastore is assigned a root storage location in an ADLS Gen2 account, where data for managed tables is stored.
    • Example: You have a metastore called sales_metastore.
  2. Catalog:

    • A catalog is a top-level grouping of data, such as SalesData, CustomerData, etc. You can think of it like an environment (Dev, Prod).
    • Example: Create a catalog called SalesCatalog.
  3. Schema:

    • Inside each catalog, you have schemas that organize the data. A schema is similar to a database in traditional relational databases.
    • Example: Inside SalesCatalog, you can have schemas like Transactions, Customers, Inventory.
  4. Tables:

    • Inside each schema, you store data in tables (Managed or External).
    • Example: Inside the Transactions schema, you might have a table transaction_data, and in the Customers schema, a table customer_info.
  5. Views:

    • A view is a read-only object created from one or more tables, which can be used to simplify queries.
    • Example: Create a view monthly_sales_summary from the transaction_data table to summarize monthly sales.
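The hierarchy above can be sketched as plain SQL. A minimal sketch (object names come from the example; the column definitions and exact DDL are illustrative, and in a workspace each statement would run via spark.sql or the Databricks SQL editor):

```python
# Minimal sketch of Unity Catalog's three-level namespace
# (catalog.schema.table). Column definitions are illustrative.

def full_name(catalog: str, schema: str, table: str) -> str:
    """Build the fully qualified three-level name Unity Catalog uses."""
    return f"{catalog}.{schema}.{table}"

# One statement per level of the example hierarchy.
ddl = [
    "CREATE CATALOG IF NOT EXISTS SalesCatalog",
    "CREATE SCHEMA IF NOT EXISTS SalesCatalog.Transactions",
    f"CREATE TABLE IF NOT EXISTS {full_name('SalesCatalog', 'Transactions', 'transaction_data')}"
    " (txn_id BIGINT, amount DECIMAL(10, 2), txn_date DATE)",
    "CREATE VIEW IF NOT EXISTS SalesCatalog.Transactions.monthly_sales_summary AS"
    " SELECT date_trunc('MONTH', txn_date) AS month, SUM(amount) AS total"
    " FROM SalesCatalog.Transactions.transaction_data GROUP BY 1",
]

for stmt in ddl:
    print(stmt)  # in Databricks: spark.sql(stmt)
```

Because every table is addressed by its full catalog.schema.table name, the same query works from any workspace attached to the metastore.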

Access Control & Roles:

  1. Permissions:

    • Unity Catalog allows you to grant specific permissions to users or groups at different levels: catalog, schema, and table.
    • Example: You can grant SELECT (read) access on the transaction_data table to the Sales Analysts group while reserving MODIFY (write) access for the Sales Managers group.
  2. Roles:

    • Unity Catalog uses administrative roles to manage access. Common roles include Account Admin, Metastore Admin, Workspace Admin, and Workspace User.
    • Example:
      • Workspace Admin can create and manage clusters but cannot alter the catalog or schema.
      • Metastore Admin can create catalogs, schemas, and manage access to them.
      • Workspace User can access shared clusters and notebooks but may not be able to create clusters unless given permission.
  3. Role-based Access Control (RBAC):

    • Syncing with Azure Active Directory (Azure AD) ensures that user groups and permissions are consistent across the Databricks workspace.
    • Example: The SalesAnalysts group in Azure AD is synced with Databricks. This group gets read-only access to the transaction_data table but cannot modify the data.
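The permission example above maps directly onto GRANT statements. A hedged sketch (the group names sales-analysts and sales-managers are illustrative; SELECT and MODIFY are Unity Catalog's table-level read and write privileges):

```python
# Sketch: composing Unity Catalog GRANT statements for the
# access-control example. Group names are illustrative.

def grant_sql(privilege: str, securable: str, name: str, principal: str) -> str:
    """Render a GRANT statement; backticks protect principals with dashes."""
    return f"GRANT {privilege} ON {securable} {name} TO `{principal}`"

grants = [
    # Analysts may read the table...
    grant_sql("SELECT", "TABLE",
              "SalesCatalog.Transactions.transaction_data", "sales-analysts"),
    # ...while only managers may write to it.
    grant_sql("MODIFY", "TABLE",
              "SalesCatalog.Transactions.transaction_data", "sales-managers"),
]

for g in grants:
    print(g)  # in Databricks: spark.sql(g)
```

Granting to groups rather than individual users is what makes the Azure AD sync valuable: membership changes in Azure AD automatically change effective permissions.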

SCIM Provisioning:

  • SCIM (System for Cross-domain Identity Management) can automatically sync users and groups from Azure AD to Databricks Unity Catalog.
  • Example: You set up SCIM so that when new employees join the sales team in Azure AD, they automatically receive the appropriate permissions in Databricks.
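Under the hood, SCIM is a standard protocol: the identity provider sends JSON payloads to the Databricks SCIM endpoint. A hedged sketch of the SCIM 2.0 user payload Azure AD would send when provisioning a new user (the email address is illustrative; you never craft these by hand, Azure AD does it for you):

```python
import json

# Sketch: the SCIM 2.0 payload an identity provider (here, Azure AD)
# POSTs to the Databricks SCIM endpoint when provisioning a user.
def scim_user_payload(email: str) -> dict:
    return {
        "schemas": ["urn:ietf:params:scim:schemas:core:2.0:User"],
        "userName": email,
        "active": True,
    }

payload = scim_user_payload("new.analyst@example.com")
print(json.dumps(payload, indent=2))
# Azure AD sends this to <SCIM URL>/Users with the SCIM token as a
# Bearer credential (see the SCIM connector guide further below).
```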

Audit and Security:

  • Unity Catalog helps with security by tracking who accessed which data and what actions were performed. This audit logging is built in and can be queried through Databricks system tables.
  • Example: If someone accesses the transaction_data table, Unity Catalog logs the action (e.g., read access) and the user’s identity.
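Audit events are exposed through the built-in system.access.audit table. A hedged sketch of a query that looks up who touched transaction_data (the column and request-parameter names follow that table's documented schema but may vary by release):

```python
# Sketch: querying Unity Catalog audit logs via the system.access.audit
# system table. Column/parameter names are illustrative and may differ
# by Databricks release.
audit_query = """
SELECT event_time, user_identity.email, action_name
FROM system.access.audit
WHERE request_params.table_full_name =
      'SalesCatalog.Transactions.transaction_data'
ORDER BY event_time DESC
LIMIT 20
""".strip()

print(audit_query)  # in Databricks: display(spark.sql(audit_query))
```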

Delta Sharing:

  • With Unity Catalog, you can share data securely with external platforms using Delta Sharing.
  • Example: You want to share the monthly_sales_summary view with a partner using Delta Sharing. This data is shared securely and can be consumed on any platform that supports Delta Sharing.
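On the recipient's side, shared data is addressed with a profile file plus a share/schema/table path. A hedged sketch of how the open-source delta-sharing Python client names a shared table (the profile path and share name sales_share are illustrative):

```python
# Sketch: how a Delta Sharing recipient addresses a shared table.
# The delta-sharing client uses <profile-file>#<share>.<schema>.<table>.
def table_url(profile: str, share: str, schema: str, table: str) -> str:
    return f"{profile}#{share}.{schema}.{table}"

url = table_url("config.share", "sales_share", "Transactions",
                "monthly_sales_summary")
print(url)
# With the client installed (pip install delta-sharing), the recipient
# would then load it without any Databricks dependency:
#   df = delta_sharing.load_as_pandas(url)
```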

Data Lineage:

  • Unity Catalog captures data lineage to track how data is transformed and derived.
  • Example: You derive a monthly revenue column from the transaction_data table, and Unity Catalog records column-level lineage showing which source columns and transformations produced it.
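Captured lineage is also queryable through system tables. A hedged sketch using system.access.table_lineage (column names follow its documented schema and may differ by release):

```python
# Sketch: reading Unity Catalog lineage from the
# system.access.table_lineage system table. Column names are
# illustrative and may differ by Databricks release.
lineage_query = """
SELECT source_table_full_name, target_table_full_name, event_time
FROM system.access.table_lineage
WHERE target_table_full_name LIKE 'SalesCatalog.Transactions.%'
ORDER BY event_time DESC
""".strip()

print(lineage_query)  # in Databricks: display(spark.sql(lineage_query))
```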

Steps to Set Up Unity Catalog:

  1. Create Storage Account:
    Set up an Azure Data Lake Storage (ADLS Gen2) account to serve as the metastore's root storage, where managed table data is kept.

  2. Create and Link Metastore:

    • Create a Metastore and link it to the ADLS storage.
    • Once the metastore is created, you can link it to multiple workspaces for shared access.
  3. Set Permissions:

    • Assign appropriate permissions (e.g., read, write, manage) to users/groups at the catalog, schema, or table level.
  4. Enable SCIM Provisioning:
    Sync users and groups from Azure Active Directory to manage access automatically.
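Once the metastore is linked to a workspace, you can sanity-check the setup from any notebook. A hedged sketch (the statements are standard Databricks SQL; SalesCatalog is the example catalog from above):

```python
# Sketch: sanity checks to run after linking a metastore to a workspace.
checks = [
    "SELECT current_metastore()",           # confirms the workspace is attached
    "SHOW CATALOGS",                        # lists catalogs you can see
    "SHOW GRANTS ON CATALOG SalesCatalog",  # verifies permissions from step 3
]

for stmt in checks:
    print(stmt)  # in Databricks: display(spark.sql(stmt))
```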


Step-by-Step Guide to Create a SCIM Connector:


Step 1: Generate SCIM Token in Databricks

  1. Go to Databricks Account Console > Settings.
  2. Click User Provisioning > Generate SCIM Token.
  3. Save the SCIM Token and SCIM URL.

Step 2: Configure SCIM in Azure Active Directory

  1. Navigate to Azure Active Directory > Enterprise Applications.
  2. Create a new application by searching for Azure Databricks SCIM Provisioning Connector.
  3. Go to Provisioning > Automatic Provisioning.
  4. Enter SCIM URL as Tenant URL and SCIM Token as Secret Token.
  5. Click Test Connection.
  6. Under Scope, select Sync only assigned users and groups.
  7. Turn Provisioning Status on.
  8. Manually assign users/groups to be synced.

Summary:

  • SCIM Token & URL are generated in Databricks and used to securely connect Azure AD with Databricks.
  • Automatic Provisioning in Azure AD enables user and group synchronization.
  • After configuring, users/groups from Azure AD are synced to Databricks, allowing seamless user management and access control.

Summary of Key Features with Example:

  • Centralized Metadata Management: Manage data across multiple workspaces (e.g., SalesCatalog).
  • Role-Based Access Control: Control access to datasets using roles like Metastore Admin or Workspace User.
  • Delta Sharing: Securely share data (e.g., monthly_sales_summary) with external platforms.
  • Audit and Lineage: Track data access and transformations (e.g., revenue calculation lineage).
Akash

I am working as a Data Engineer
