Hub
Hubs represent core business entities, like "Customer" or "Product." They store only unique business keys (e.g., Customer ID) and connect all related data over time. Think of hubs as the anchors for main business concepts.
To generate a hub, you require the metadata of the source table and a hub YAML configuration file.
Information Schema
TABLE_SCHEMA,TABLE_NAME,COLUMN_NAME,ORDINAL_POSITION,DATA_TYPE
TPCH_SF1,CUSTOMER,C_CUSTKEY,1,NUMBER
TPCH_SF1,CUSTOMER,C_NAME,2,TEXT
TPCH_SF1,CUSTOMER,C_ADDRESS,3,TEXT
TPCH_SF1,CUSTOMER,C_NATIONKEY,4,NUMBER
TPCH_SF1,CUSTOMER,C_PHONE,5,TEXT
TPCH_SF1,CUSTOMER,C_ACCTBAL,6,NUMBER
TPCH_SF1,CUSTOMER,C_MKTSEGMENT,7,TEXT
TPCH_SF1,CUSTOMER,C_COMMENT,8,TEXT
TPCH_SF1,CUSTOMER,C_CDC_FLAG,9,TEXT
TPCH_SF1,CUSTOMER,C_CDC_TIMESTAMP,10,TIMESTAMP
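How this metadata file is produced is not prescribed; as a minimal sketch, and assuming the source table sits in the database referenced by &{SOURCE_DATABASE}, a query against Snowflake's information schema along these lines returns the same columns:

-- Sketch only: export the column metadata for the source table and save the
-- result as the CSV shown above.
SELECT TABLE_SCHEMA, TABLE_NAME, COLUMN_NAME, ORDINAL_POSITION, DATA_TYPE
FROM "&{SOURCE_DATABASE}".INFORMATION_SCHEMA.COLUMNS
WHERE TABLE_SCHEMA = 'TPCH_SF1'
  AND TABLE_NAME = 'CUSTOMER'
ORDER BY ORDINAL_POSITION;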
Hub Customer
The following YAML file is an example of a configuration that can be used to produce the customer hub definition.
entity_type: hub
name: HUB_CUSTOMER
concatenate_business_keys: false
requires_source_business_key: false
target_business_key_columns:
  - CUSTOMER_ID
entity_sources:
  - urn:s2v:hub_source:src_1:
      entity_source: (TPCH_SF1, CUSTOMER)
      source_filter: ''
      source_system_configuration_urn: urn:s2v:source_setting:yaml_interface
      business_key_mapping:
        - CUSTOMER_ID:
            - C_CUSTKEY
      source_business_key: ''
Hub DDL Definition
The following scripts are generated by the s2v generate command.
-- The generate process creates two objects: a stream and a dynamic table
------------------------------------------------------------------------------------
------ STREAMS ON DT HUB_CUSTOMER --------------------------------------------------
------------------------------------------------------------------------------------
CREATE OR REPLACE STREAM "&{TARGET_DATABASE}"."&{TARGET_SCHEMA}"."HUB_CUSTOMER_STREAM"
ON DYNAMIC TABLE "&{TARGET_DATABASE}"."&{TARGET_SCHEMA}"."HUB_CUSTOMER"
SHOW_INITIAL_ROWS = TRUE;
------------------------------------------------------------------------------------
------ HUB_CUSTOMER ----------------------------------------------------------------
------------------------------------------------------------------------------------
CREATE OR REPLACE DYNAMIC TABLE "&{TARGET_DATABASE}"."&{TARGET_SCHEMA}"."HUB_CUSTOMER"
(
"HKY_HUB_CUSTOMER"
,"CUSTOMER_ID"
,"LOAD_DATE"
,"REC_SRC"
)
TARGET_LAG = '&{TARGET_LAG}'
WAREHOUSE = &{TARGET_WAREHOUSE}
REFRESH_MODE = INCREMENTAL
INITIALIZE = ON_SCHEDULE
WITH TAG ("&{TARGET_DATABASE}"."&{TARGET_SCHEMA}".S2V_RELEASE_ID='&{S2V_RELEASE_ID}')
AS
SELECT
"HKY_HUB_CUSTOMER",
"CUSTOMER_ID",
"LOAD_DATE",
"REC_SRC"
FROM
(
SELECT
UPPER(SHA1(CONCAT(IFNULL(TRIM(REPLACE(CAST("C_CUSTKEY" as VARCHAR), '##', CONCAT('\\', '##'))), '-1'),'##'))) AS "HKY_HUB_CUSTOMER"
,IFNULL(TRIM(REPLACE(CAST("C_CUSTKEY" as VARCHAR), '##', CONCAT('\\', '##'))), '-1') AS "CUSTOMER_ID"
,MIN(TO_TIMESTAMP_LTZ("C_CDC_TIMESTAMP")) AS "LOAD_DATE"
,'TPCH_SF1.CUSTOMER' AS "REC_SRC"
FROM "&{SOURCE_DATABASE}"."TPCH_SF1"."CUSTOMER"
GROUP BY ALL
UNION ALL
SELECT
OBJECT_H_KEY AS "HKY_HUB_CUSTOMER",
'-1' AS "CUSTOMER_ID",
"LOAD_DATE",
RECORD_SOURCE AS "REC_SRC"
FROM "&{TARGET_DATABASE}"."&{TARGET_SCHEMA}_PREP"."GHOST_RECORD"
)
;
Explanation
This SQL script implements a Hub structure for customer data in a Data Vault architecture on Snowflake. It does so by creating a dynamic table (HUB_CUSTOMER) and a corresponding stream (HUB_CUSTOMER_STREAM). Together, these objects ensure that customer data is captured, centralized, and kept current within the configured target lag. The Hub is a critical Data Vault structure because it acts as the anchor point for uniquely identifying business entities, in this case customers, across multiple systems.
The HUB_CUSTOMER dynamic table consolidates customer data by generating a hashed key (HKY_HUB_CUSTOMER) for each unique customer. This hashed key is derived by applying the SHA1 function to the normalized source business key (C_CUSTKEY): the key is cast to text, trimmed, any occurrence of the '##' delimiter is escaped, and NULLs are replaced with '-1' before hashing. Normalizing the keys in this way keeps hash values consistent across sources, and the grouping in the query collapses duplicates so each business key appears only once. Additional metadata fields, such as LOAD_DATE (indicating when a record was added) and REC_SRC (documenting the data's origin), are included to maintain traceability and support auditability, which are key principles of the Data Vault methodology.
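To make the key derivation concrete, the stand-alone query below applies the same expression used in the generated DDL to a literal business key; the value 12345 is purely illustrative:

-- Illustration only: the hub hash key expression from the DDL applied to one
-- literal business key value.
SELECT
    UPPER(SHA1(CONCAT(IFNULL(TRIM(REPLACE(CAST(12345 AS VARCHAR), '##', CONCAT('\\', '##'))), '-1'), '##'))) AS HKY_HUB_CUSTOMER,
    IFNULL(TRIM(REPLACE(CAST(12345 AS VARCHAR), '##', CONCAT('\\', '##'))), '-1') AS CUSTOMER_ID;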
The data for the HUB_CUSTOMER table is sourced from two locations. The primary source is the CUSTOMER table in the TPCH_SF1 schema. Here, the script processes the raw customer data, generating hashed keys and capturing the earliest change timestamp (MIN of C_CDC_TIMESTAMP) per business key as the LOAD_DATE. The second source, known as the "ghost record", is a placeholder entry that ensures the Hub contains a default row even before any actual data is present, keeping the structure ready for downstream integrations during initial setup.
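Because the ghost record carries the default business key '-1', its presence can be verified once the dynamic table has refreshed; the following check is illustrative only:

-- Illustrative check: the ghost record appears with the default key '-1'.
SELECT "HKY_HUB_CUSTOMER", "CUSTOMER_ID", "LOAD_DATE", "REC_SRC"
FROM "&{TARGET_DATABASE}"."&{TARGET_SCHEMA}"."HUB_CUSTOMER"
WHERE "CUSTOMER_ID" = '-1';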
In parallel, the script creates a stream (HUB_CUSTOMER_STREAM) to track changes to the HUB_CUSTOMER dynamic table. The stream records inserts, updates, and deletions, so downstream processes can consume only the changed rows as they become available instead of reprocessing the full table in scheduled batches.
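As a sketch of how a downstream loader might read the stream, the standard Snowflake stream metadata columns describe each change; note that reading the stream within a DML statement advances its offset:

-- Sketch only: read pending changes from the stream. METADATA$ACTION and
-- METADATA$ISUPDATE are standard Snowflake stream columns.
SELECT
    "HKY_HUB_CUSTOMER",
    "CUSTOMER_ID",
    METADATA$ACTION,
    METADATA$ISUPDATE
FROM "&{TARGET_DATABASE}"."&{TARGET_SCHEMA}"."HUB_CUSTOMER_STREAM";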
The HUB_CUSTOMER table is configured with dynamic table properties that control how it is maintained. The refresh mode is set to INCREMENTAL, so only new or modified data is processed during each refresh. Initialization is set to ON_SCHEDULE, which means the table is first populated at its first scheduled refresh rather than at creation time. Configurable parameters such as TARGET_LAG and TARGET_WAREHOUSE allow the refresh frequency and compute resources to be tuned to workload and performance requirements.
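For example, the lag can be adjusted after deployment and the refresh behavior inspected with standard Snowflake commands; the '15 minutes' value is only an illustration:

-- Example only: adjust the refresh lag after deployment.
ALTER DYNAMIC TABLE "&{TARGET_DATABASE}"."&{TARGET_SCHEMA}"."HUB_CUSTOMER"
    SET TARGET_LAG = '15 minutes';

-- Inspect recent refreshes of the dynamic table.
SELECT *
FROM TABLE(INFORMATION_SCHEMA.DYNAMIC_TABLE_REFRESH_HISTORY(NAME => 'HUB_CUSTOMER'));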
In summary, this script creates a foundation for managing customer data in a scalable, auditable, and continuously refreshed manner, adhering to Data Vault principles. It centralizes customer records, tracks their origins, and keeps them up to date, making it easier to integrate data from multiple systems into a unified hub. This approach reduces complexity, supports compliance, and provides a solid basis for analytical and operational use cases.
Configuration
Parameter | Description | Example |
---|---|---|
entity_type | Data Vault entity type | hub |
name | Name of the entity | HUB_CUSTOMER |
concatenate_business_keys | Boolean indicating whether the business keys should be concatenated in the target table | false |
requires_source_business_key | Boolean indicating whether the source business key should be included in the target table | false |
target_business_key_columns | List of business key columns in the target table; more than one column can be listed | CUSTOMER_ID |
entity_sources | List of sources used to load the hub; each source must have a unique name | urn:s2v:hub_source:src_1 |
entity_source | Combination of the TABLE_SCHEMA and TABLE_NAME from the information schema file | (TPCH_SF1, CUSTOMER) |
source_filter | Optional filter applied when extracting business keys from the source | |
source_system_configuration_urn | Reference to the source system YAML for the specific source | urn:s2v:source_setting:yaml_interface |
business_key_mapping | Business key mapping from source to target columns | |
source_business_key | Source table business key | |