Skip to main content

Data Vault

Data Vault is a hybrid data modeling approach that combines the best of 3rd Normal Form (3NF) and Star Schema. It's designed to provide long-term historical storage of data coming from multiple operational systems and to be resilient to changes in the source systems. Data Vault is particularly well-suited for enterprise data warehouses (EDW) and data integration projects.

The methodology focuses on separating structural information (keys and relationships) from descriptive attributes, which makes the model highly adaptable and scalable.

Core Components

The Data Vault model is built around three primary types of entities:

  1. Hubs (Hub Tables):

    • Represent core business concepts or entities (e.g., Customer, Product, Order).
    • Contain a unique business key (or a hash of the business key) that identifies the entity across the enterprise.
    • Store minimal metadata, such as load date and record source.
    • Hubs are designed to be stable and rarely change.
  2. Links (Link Tables):

    • Represent relationships or transactions between Hubs.
    • Contain the hash keys of the Hubs they connect.
    • Can also store metadata like load date and record source.
    • Links capture the associations between business concepts. For example, a Link table might connect a Customer Hub and an Order Hub to represent a customer placing an order.
  3. Satellites (Satellite Tables):

    • Store descriptive attributes (contextual information) about Hubs or Links.
    • Are connected to a single parent Hub or Link table.
    • Track historical changes to attributes over time (Type 2 slowly changing dimensions are a common pattern).
    • Each Satellite typically groups attributes by their source system or rate of change.
    • This separation of attributes allows for flexibility; new attributes or source systems can be added by creating new Satellites without altering existing structures.

Key Principles & Benefits

  • Auditability: Data Vault stores raw, unaltered data from source systems, providing a clear audit trail. Load dates and record sources are tracked for every piece of data.
  • Scalability: The model is designed to scale out. Adding new data sources or attributes often involves adding new Satellites or Links without major refactoring of the existing model.
  • Flexibility & Adaptability: Changes in source systems (e.g., new attributes, modified relationships) can be incorporated with minimal impact on the existing Data Vault structure.
  • Parallel Loading: The decoupled nature of Hubs, Links, and Satellites allows for high degrees of parallel data loading, improving ETL/ELT performance.
  • Integration: Provides a robust framework for integrating data from disparate source systems into a unified view.
  • Historical Tracking: Satellites are designed to capture historical changes to data, enabling point-in-time reporting and analysis.

Data Vault 2.0

Data Vault 2.0 is an evolution of the original methodology, emphasizing aspects like:

  • Hash keys for all primary keys and relationships (improves join performance and integration).
  • Separation of concerns (modeling, methodology, architecture).
  • Emphasis on automation in the build and deployment process.

Interested in learning more? Check out Data Vault Documentation