Skip to main content

Hub

Hub

Hubs represent core business entities, like "Customer" or "Product." They store only unique business keys (e.g., Customer ID) and connect all related data over time. Think of hubs as the anchors for main business concepts.

To generate a hub you requre the medata of the source table and a hub YAML configuration file.

Information Schema

Customer Information Schema
TABLE_SCHEMA,TABLE_NAME,COLUMN_NAME,ORDINAL_POSITION,DATA_TYPE
TPCH_SF1,CUSTOMER,C_CUSTKEY,1,NUMBER
TPCH_SF1,CUSTOMER,C_NAME,2,TEXT
TPCH_SF1,CUSTOMER,C_ADDRESS,3,TEXT
TPCH_SF1,CUSTOMER,C_NATIONKEY,4,NUMBER
TPCH_SF1,CUSTOMER,C_PHONE,5,TEXT
TPCH_SF1,CUSTOMER,C_ACCTBAL,6,NUMBER
TPCH_SF1,CUSTOMER,C_MKTSEGMENT,7,TEXT
TPCH_SF1,CUSTOMER,C_COMMENT,8,TEXT
TPCH_SF1,CUSTOMER,C_CDC_FLAG,8,TEXT
TPCH_SF1,CUSTOMER,C_CDC_TIMESTAMP,8,TIMESTAMP

Hub Customer

The following YAML file is an example of a configuraton that can be used to produce the customer hub definition.

hub_customer.yaml
entity_type: hub
name: HUB_CUSTOMER

concatenate_business_keys: false
requires_source_business_key: false

target_business_key_columns:
- CUSTOMER_ID

entity_sources:
- urn:s2v:hub_source:src_1:
entity_source: (TPCH_SF1, CUSTOMER)
source_filter: ''
source_system_configuration_urn: urn:s2v:source_setting:yaml_interface
business_key_mapping:
- CUSTOMER_ID:
- C_CUSTKEY
source_business_key: ''

Hub DDL Definition

Following are the scrpts generated by the s2v generate command.


-- The generate process creates two objecs

------------------------------------------------------------------------------------
------ STREAMS ON DT HUB_CUSTOMER --------------------------------------------------
------------------------------------------------------------------------------------
CREATE OR REPLACE STREAM "&{TARGET_DATABASE}"."&{TARGET_SCHEMA}"."HUB_CUSTOMER_STREAM"
ON DYNAMIC TABLE "&{TARGET_DATABASE}"."&{TARGET_SCHEMA}"."HUB_CUSTOMER"
SHOW_INITIAL_ROWS = TRUE;

------------------------------------------------------------------------------------
------ HUB_CUSTOMER ----------------------------------------------------------------
------------------------------------------------------------------------------------
CREATE OR REPLACE DYNAMIC TABLE "&{TARGET_DATABASE}"."&{TARGET_SCHEMA}"."HUB_CUSTOMER"
(
"HKY_HUB_CUSTOMER"
,"CUSTOMER_ID"
,"LOAD_DATE"
,"REC_SRC"
)
TARGET_LAG = '&{TARGET_LAG}'
WAREHOUSE = &{TARGET_WAREHOUSE}
REFRESH_MODE = INCREMENTAL
INITIALIZE = ON_SCHEDULE
WITH TAG ("&{TARGET_DATABASE}"."&{TARGET_SCHEMA}".S2V_RELEASE_ID='&{S2V_RELEASE_ID}')
AS
SELECT
"HKY_HUB_CUSTOMER",
"CUSTOMER_ID",
"LOAD_DATE",
"REC_SRC"
FROM
(
SELECT
UPPER(SHA1(CONCAT(IFNULL(TRIM(REPLACE(CAST("C_CUSTKEY" as VARCHAR), '##', CONCAT('\\', '##'))), '-1'),'##'))) AS "HKY_HUB_CUSTOMER"
,IFNULL(TRIM(REPLACE(CAST("C_CUSTKEY" as VARCHAR), '##', CONCAT('\\', '##'))), '-1') AS "CUSTOMER_ID"
,MIN(TO_TIMESTAMP_LTZ("C_CDC_TIMESTAMP")) AS "LOAD_DATE"
,'TPCH_SF1.CUSTOMER' AS "REC_SRC"
FROM "&{SOURCE_DATABASE}"."TPCH_SF1"."CUSTOMER"
GROUP BY ALL
UNION ALL
SELECT
OBJECT_H_KEY AS "HKY_HUB_CUSTOMER",
'-1' AS "CUSTOMER_ID",
"LOAD_DATE",
RECORD_SOURCE AS "REC_SRC"
FROM "&{TARGET_DATABASE}"."&{TARGET_SCHEMA}_PREP"."GHOST_RECORD"
)
;

Explanation

This SQL script is designed to implement a Hub structure for customer data in a Data Vault architecture on Snowflake. The script achieves this by creating a dynamic table (HUB_CUSTOMER) and a corresponding stream (HUB_CUSTOMER_STREAM). Together, these components ensure that customer data is captured, centralized, and continuously updated in real time. The Hub structure is critical in a Data Vault as it acts as the anchor point for uniquely identifying business entities—in this case, customers—across multiple systems.

The HUB_CUSTOMER dynamic table consolidates customer data by generating a hashed key (HKY_HUB_CUSTOMER) for each unique customer. This hashed key is derived using a cryptographic SHA1 function applied to the source system's customer identifier (C_CUSTKEY). By hashing and normalizing these keys, the script ensures data consistency and prevents duplication, even when data originates from disparate sources. Additional metadata fields, such as LOAD_DATE (indicating when a record was added) and REC_SRC (documenting the data’s origin), are included to maintain traceability and support auditability, which are key principles of the Data Vault methodology.

The data for the HUB_CUSTOMER table is sourced from two locations. The primary source is the CUSTOMER table in the TPCH_SF1 schema. Here, the script processes the raw customer data, generating hashed keys and capturing the earliest timestamp (C_CDC_TIMESTAMP) for each record as the LOAD_DATE. The second source, known as the "ghost record," is a placeholder entry added to ensure the Hub is initialized even if no actual data is present. This ensures the structure remains robust and ready to handle downstream integrations, even during initial setup.

In parallel, the script creates a stream (HUB_CUSTOMER_STREAM) to track changes in the HUB_CUSTOMER dynamic table. This stream enables real-time updates by capturing data modifications such as inserts, updates, or deletions. With this functionality, downstream processes can consume the latest data as it becomes available, eliminating the delays associated with traditional batch processing methods.

The HUB_CUSTOMER table is configured with dynamic table properties to optimize its functionality. For example, the refresh mode is set to INCREMENTAL, meaning that only new or modified data is processed during each update, improving efficiency. Additionally, the table is initialized on a schedule, ensuring it is ready for use as soon as the process is deployed. The architecture also includes configurable parameters, such as TARGET_LAG and TARGET_WAREHOUSE, allowing customization based on workload and performance requirements.

In summary, this script creates a foundation for managing customer data in a scalable, auditable, and real-time manner, adhering to Data Vault principles. It centralizes customer records, tracks their origins, and ensures continuous updates, making it easier to integrate data from multiple systems into a unified, secure hub. This approach reduces complexity, supports compliance, and provides a solid basis for analytical and operational use cases.

Configuration

ParameterDescriptionExample
entity_typeData Vault entity typehub
namename of the entityHUB_CUSTOMER
concatenate_business_keysBoolean value indicating if the business keys should be concatenated in the target tablefalse
requires_source_business_keyBoolean value indicating if source business key should be included in the target tablefalse
target_business_key_columnsList of business key columns in the target table. Can have more than one.CUSTOMER_ID
entity_sourcesList of sources to be used to load into the hub. Each source must have a unique nameurn:s2v:hub_source:src_1
entity_sourceCombination of the TABLE_SCHEMA and TABLE_NAME from the information_schema file.(TPCH_SF1, CUSTOMER)
source_filterInclude any specific filtering for extraction of business keys from the source
source_system_configuration_urnReference to the source system yaml for the specific sourceurn:s2v:source_setting:yaml_interface
business_key_mappingBusiness key mapping from source to target tables
source_business_keySource table business key