DatawarehousingBasics


 * []

In the data warehousing field, we often hear about discussions on where a person / organization's philosophy falls into Bill Inmon's camp or into Ralph Kimball's camp. We describe below the difference between the two. **Bill Inmon's paradigm **: Data warehouse is one part of the overall business intelligence system. An enterprise has one data warehouse, and data marts source their information from the data warehouse. In the data warehouse, information is stored in 3rd normal form. **Ralph Kimball's paradigm **: Data warehouse is the conglomerate of all data marts within the enterprise. Information is always stored in the dimensional model. There is no right or wrong between these two ideas, as they represent different data warehousing philosophies. In reality, the data warehouse in most enterprises are closer to Ralph Kimball's idea. This is because most data warehouses started out as a departmental effort, and hence they originated as a data mart. Only when more data marts are built later do they evolve into a data warehouse. Inmon and Kimball have created a great debate in Information Technology during the last decade. They both relentlessly thrived for conceptualizing information management for decision support. They approached the problem with different philosophies, design techniques, and implementation strategies. This article is an analysis of these two approaches based on the issues raised and discovered. Mr. William (Bill) Inmon is known as the “**Father of Data Warehousing**”, entitled for coining the term “Data Warehouse” in 1991. He defined a model to support “single version of the truth” and championed the concept for more than a decade. He also created “Corporate Information Factory” in collaboration with Ms. Claudia Imhoff. Mr. Inmon is known to have published 40+ books and 600+ articles. Mr. Ralph Kimball is known as the “**Father of Business Intelligence**” for defining the concept behind “Data Marts”, for developing the science behind the analytical tools that utilize dimensional hierarchies, and for conceptualizing star-schemas and snowflake data structures. He defined a model to support analytical analysis and championed data marts for more than a decade. Though Kimball’s writings do not exceed Inmon’s by quantity, Kimball’s books are all-time best sellers on data warehousing. Inmon and Kimball are two pioneers that started different philosophies for enterprise-wide information gathering, information management, and analytics for decision support. Inmon believes in creating a single enterprise-wide data warehouse for achieving an overall business intelligence system (TOP DOWN). Kimball believes in creating several smaller data marts for achieving department-level analysis and reporting (BOTTOM UP). APPROACHES Inmon’s philosophy recommends to start with building a large centralized enterprise-wide data warehouse, followed by several satellite databases to serve the analytical needs of departments (later known as “data marts”). Hence, his approach has received the “Top Down” title. Kimball’s philosophy recommends to start with building several data marts that serve the analytical needs of departments, followed by “virtually” integrating these data marts for consistency through an Information Bus. Hence, his approach received the “Bottom Up” title. Mr. Kimball believes in various data marts that store information in dimensional models to quickly address the needs of various departments and various areas of the enterprise data. STRUCTURES Besides the differences in approaches, Inmon and Kimball also differ in the structure of the data. Inmon believes in creating a relational-model (third normal form: 3NF) where as Kimball believes in creating a multi-dimension model (star-schema and snowflakes). Inmon argues that once the data is in a relational model, it will attain the enterprise-wide consistency which makes it easier to spawn-off the data-marts in dimensional-models. Kimball argues that the actual users can understand, analyze, aggregate, and explore data-inconsistencies in an easier manner if the data is structured in a dimensional-model. Additionally, to enable the Information Bus, data marts are categorized [Imhoff, Mastering Data warehouse design] as atomic data marts, and aggregated data marts that both use dimensional-models. Irrespective of the structural differences in the model, both Inmon and Kimball agrees that there is a need to separate the detailed-level data from aggregated-level data. COMMON GOALS Though Mr. Inmon and Mr. Kimball have different philosophies to their approach, they do tend to agree with each other in an indirect manner. Though Inmon’s basis is on a single data warehouse, he stressed on iterative approach and discouraged the “big bang” approach. On the other hand, though Kimball’s philosophy is to quickly create few successful data marts at a time, he stresses on integration for consistency via an Information Bus. **DATA WAREHOUSE vs. BUSINESS INTELLIGENCE ** Business Intelligence = Inmon’s Corporate Data Warehouse + Kimball’s Data Marts + Data Mining + Unstructured Data. An analysis of Inmon vs. Kimball is not complete without referring to the first published work in 1988 on data warehousing by Barry Devlin and Paul Murphy of IBM Ireland. They coined the less-widely know term “Information Warehousing” which is defined as “A structured environment supporting end users in managing the complete business and supporting Information Systems in ensuring data quality”. Mr. Devlin has finally published his work as a book in 1997 called “Data Warehouse – from Architecture to Implementation”.
 * INTRODUCTION TO WILLIAM INMON AND RALPH KIMBALL **
 * PHILOSOPHIES: QUEST FOR A COMMON GOAL **
 * INMON, KIMBALL, … AND THE OTHERS **

What do these rules mean when contemplating the practical design of a database? It’s actually quite simple. The first rule dictates that we must not duplicate data within the same row of a table. Within the database community, this concept is referred to as the atomicity of a table. Tables that comply with this rule are said to be atomic. Let’s explore this principle with a classic example – a table within a human resources database that stores the manager-subordinate relationship. For the purposes of our example, we’ll impose the business rule that each manager may have one or more subordinates while each subordinate may have only one manager. Intuitively, when creating a list or spreadsheet to track this information, we might create a table with the following fields: However, recall the first rule imposed by 1NF: eliminate duplicative columns from the same table. Clearly, the Subordinate1-Subordinate4 columns are duplicative. Take a moment and ponder the problems raised by this scenario. If a manager only has one subordinate – the Subordinate2-Subordinate4 columns are simply wasted storage space (a precious database commodity). Furthermore, imagine the case where a manager already has 4 subordinates – what happens if she takes on another employee? The whole table structure would require modification. At this point, a second bright idea usually occurs to database novices: We don’t want to have more than one column and we want to allow for a flexible amount of data storage. Let’s try something like this: where the Subordinates field contains multiple entries in the form "Mary, Bill, Joe" This solution is closer, but it also falls short of the mark. The subordinates column is still duplicative and non-atomic. What happens when we need to add or remove a subordinate? We need to read and write the entire contents of the table. That’s not a big deal in this situation, but what if one manager had one hundred employees? Also, it complicates the process of selecting data from the database in future queries. Here’s a table that satisfies the first rule of 1NF: In this case, each subordinate has a single entry, but managers may have multiple entries. Now, what about the second rule: identify each row with a unique column or set of columns (the primary key)? You might take a look at the table above and suggest the use of the subordinate column as a primary key. In fact, the subordinate column is a good candidate for a primary key due to the fact that our business rules specified that each subordinate may have only one manager. However, the data that we’ve chosen to store in our table makes this a less than ideal solution. What happens if we hire another employee named Jim? How do we store his manager-subordinate relationship in the database? It’s best to use a truly unique identifier (such as an employee ID) as a primary key. Our final table would look like this: Now, our table is in first normal form! If you'd like to continue learning about normalization, read the other articles in this series:
 * __First Normal Form (1NF) sets the very basic rules for an organized database:__**
 * Eliminate duplicative columns from the same table.
 * Create separate tables for each group of related data and identify each row with a unique column (the primary key).
 * Manager
 * Subordinate1
 * Subordinate2
 * Subordinate3
 * Subordinate4
 * Manager
 * Subordinates
 * Manager
 * Subordinate
 * Manager ID
 * Subordinate ID

These rules can be summarized in a simple statement: 2NF attempts to reduce the amount of redundant data in a table by extracting it, placing it in new table(s) and creating relationships between those tables.
 * __Recall the general requirements of 2NF:__**
 * Remove subsets of data that apply to multiple rows of a table and place them in separate tables.
 * Create relationships between these new tables and their predecessors through the use of foreign keys.

Let's look at an example. Imagine an online store that maintains customer information in a database. They might have a single table called Customers with the following elements: A brief look at this table reveals a small amount of redundant data. We're storing the "Sea Cliff, NY 11579" and "Miami, FL 33157" entries twice each. Now, that might not seem like too much added storage in our simple example, but imagine the wasted space if we had thousands of rows in our table. Additionally, if the ZIP code for Sea Cliff were to change, we'd need to make that change in many places throughout the database.
 * CustNum
 * FirstName
 * LastName
 * Address
 * City
 * State
 * ZIP

In a 2NF-compliant database structure, this redundant information is extracted and stored in a separate table. Our new table (let's call it ZIPs) might have the following fields: If we want to be super-efficient, we can even fill this table in advance -- the post office provides a directory of all valid ZIP codes and their city/state relationships. Surely, you've encountered a situation where this type of database was utilized. Someone taking an order might have asked you for your ZIP code first and then knew the city and state you were calling from. This type of arrangement reduces operator error and increases efficiency.
 * ZIP
 * City
 * State

Now that we've removed the duplicative data from the Customers table, we've satisfied the first rule of second normal form. We still need to use a foreign key to tie the two tables together. We'll use the ZIP code (the primary key from the ZIPs table) to create that relationship. Here's our new Customers table: We've now minimized the amount of redundant information stored within the database and our structure is in second normal form!
 * CustNum
 * FirstName
 * LastName
 * Address
 * ZIP

If you'd like to ensure your database is normalized, explore our other articles in this series:

There are two basic requirements for a database to be in third normal form: Imagine that we have a table of widget orders that contains the following attributes: Remember, our first requirement is that the table must satisfy the requirements of 1NF and 2NF. Are there any duplicative columns? No. Do we have a primary key? Yes, the order number. Therefore, we satisfy the requirements of 1NF. Are there any subsets of data that apply to multiple rows? No, so we also satisfy the requirements of 2NF. Now, are all of the columns fully dependent upon the primary key? The customer number varies with the order number and it doesn't appear to depend upon any of the other fields. What about the unit price? This field could be dependent upon the customer number in a situation where we charged each customer a set price. However, looking at the data above, it appears we sometimes charge the same customer different prices. Therefore, the unit price is fully dependent upon the order number. The quantity of items also varies from order to order, so we're OK there. What about the total? It looks like we might be in trouble here. The total can be derived by multiplying the unit price by the quantity, therefore it's not fully dependent upon the primary key. We must remove it from the table to comply with the third normal form. Perhaps we use the following attributes: Now our table is in 3NF. But, you might ask, what about the total? This is a derived field and it's best not to store it in the database at all. We can simply compute it "on the fly" when performing database queries. For example, we might have previously used this query to retrieve order numbers and totals:
 * __Third Normal Form (3NF) sets the very basic rules for an organized database:__**
 * Already meet the requirements of both [|1NF] and [|2NF]
 * Remove [|columns] that are not fully dependent upon the [|primary key].
 * Order Number
 * Customer Number
 * Unit Price
 * Quantity
 * Total
 * Order Number
 * Customer Number
 * Unit Price
 * Quantity

SELECT OrderNumber, Total FROM WidgetOrders

We can now use the following query:

SELECT OrderNumber, UnitPrice * Quantity AS Total FROM WidgetOrders

to achieve the same results without violating normalization rules.