Data Warehouse

 

The data warehouse is an integrated, subject-oriented, time-variant, nonvolatile collection of data that provides support for decision making. A data warehouse is a system used for reporting and data analysis, which is considered a core component of business intelligence. It is a large store of data collected from a wide range of sources within a company and used to guide management decisions.

Data warehouses are typically used to handle analytics on the data, which can come from a variety of sources, such as transactional databases, logs, external data feeds, etc. The data in a warehouse is usually structured in a way that's more suitable for analysis and reporting.

Key features and concepts in data warehouses include:

  1. Subject-Oriented: A data warehouse is organized around key subjects (like customers, products, sales) of an organization.

  2. Integrated: A data warehouse integrates data from multiple data sources, making the data consistent.

  3. Non-Volatile: Once data is in the data warehouse, it will not change. This is contrary to operational systems where data can be updated or deleted for business operations.

  4. Time-Variant: Data warehouse data represents data over time (historical data) rather than current data, which allows for trend analysis.

  5. Schema Design: Data warehouses often use different types of schema designs like Star Schema, Snowflake Schema, etc., for data modeling.

Data warehouses differ from operational databases in several ways:

  • They are designed for read-intensive workloads (analytics, reports) rather than transactional workloads.

  • They often store historical data, enabling analysts to track changes over time, while operational databases usually only hold current data.

  • They are optimized for complex queries that may involve aggregating large volumes of data, while operational databases are optimized for simple, atomic read and write operations.

Common data warehouse technologies include Google BigQuery, Amazon Redshift, Microsoft SQL Server Analysis Services, and many more.

One of the key processes involved in data warehousing is ETL (Extract, Transform, Load). This involves extracting data from different sources, transforming it to fit the business needs (which might involve cleaning the data, aggregating it, reformatting it, etc.), and then loading it into the data warehouse for analysis.

In the modern data ecosystem, there's also a concept called a data lake, which is a storage repository that holds a large amount of raw data in its native format until it's needed. While a hierarchical data warehouse stores data in files or folders, a data lake uses a flat architecture to store data. Data lakes and data warehouses often complement each other, with the data lake being used for data exploration and raw data storage, and the data warehouse used for storing cleaned, aggregated data that's ready for analysis.

Rule No.

Description

1

The data warehouse and operational environments are separated.

2

The data warehouse data is integrated.

3

The data warehouse contains historical data over a long time.

4

The data warehouse data is snapshot data captured at a given point in time.

5

The data warehouse data is subject oriented.

6

The data warehouse data is mainly read-only with periodic batch updates from operational data.

7

The data warehouse development life cycle differs from classical systems development.

8

The data warehouse contains data with several levels of detail: current detail data, old detail data, lightly summarized data, and highly summarized data

9

The data warehouse environment is characterized by read-only transactions to very large data sets.

10

The data warehouse environment has a system that traces data sources, transformations, and storage.

11

The data warehouse’s metadata is a critical component of this environment.

12

The data warehouse contains a chargeback mechanism for resource usage that enforces optimal use of the data by end users.

Data Marts

A data mart is a small, single-subject data warehouse subset that provides decision support to a small group of people. Benefits of data marts over data warehouses include the following:

  • Lower cost and shorter implementation time

  • Technologically advanced

  • Inevitable “people issues”