Discussion
1. What Is a Data Warehouse
2. The Corporate Information Factory
3. The Multipurpose Nature of the Data Warehouse
4. The Data Warehouse Data Model
Introduction
Welcome to the first book that thoroughly describes the data modeling techniques used in constructing a multipurpose, stable, and sustainable data ware house used to support business intelligence (BI). This chapter introduces the data warehouse by describing the objectives of BI and the data warehouse and by explaining how these fit into the overall Corporate Information Factory (CIF) architecture. It discusses the iterative nature of the data warehouse construction and demonstrates the importance of the data warehouse data model and the justification for the type of data model format suggested in this book. We discuss why the format of the model should be based on relational design techniques, illustrating the need to maximize non redundancy, stability, and maintainability.
Overview of Business Intelligence
BI, in the context of the data warehouse, is the ability of an enterprise to study past behaviors and actions in order to understand where the organization has been, determine its current situation, and predict or change what will happen in the future. BI has been maturing for more than 20 years. Let’s briefly go over the past decade of this fascinating and innovative history.
Data warehousing is another good example of the adoption curve. In fact, if you haven’t started your first data warehouse project, there has never been a better time. Executives today expect, and often get, most of the good, timely information they need to make informed decisions to lead their companies into the next decade. But this wasn’t always the case.
Just a decade ago, these same executives sanctioned the development of executive information systems (EIS) to meet their needs. The concept behind EIS initiatives was sound—to provide executives with easily accessible key performance information in a timely manner. However, many of these systems fell short of their objectives, largely because the underlying architecture could not respond fast enough to the enterprise’s changing environment. Another significant shortcoming of the early EIS days was the enormous effort required to provide the executives with the data they desired. Data acquisition or the extract, transform, and load (ETL) process is a complex set of activities whose sole purpose is to attain the most accurate and integrated data possible and make it accessible to the enterprise through the data warehouse or operational data store (ODS).
BI Architecture
One of the most significant developments during the last 10 years has been the introduction of a widely accepted architecture to support all BI technological demands. This architecture recognized that the EIS approach had several major flaws, the most significant of which was that the EIS data structures were often fed directly from source systems, resulting in a very complex data acquisition environment that required significant human and computer resources to maintain. The Corporate Information Factory (CIF) (see Figure 1.2), the architecture used in most decision support environments today, addressed that deficiency by segregating data into five major databases (operational systems, data warehouse, operational data store, data marts, and oper marts) and incorporating processes to effectively and efficiently move data from the source systems to the business users.
These components were further separated into two major groupings of components and processes:
=> Getting data in consists of the processes and databases involved in acquiring data from the operational systems, integrating it, cleaning it up, and putting it into a database for easy usage. The components of the CIF that are found in this function:
=> The operational system databases (source systems) contain the data used to run the day-to-day business of the company. These are still the major source of data for the decision support environment.
=> The data warehouse is a collection or repository of integrated, detailed, historical data to support strategic decision-making.
=> The operational data store is a collection of integrated, detailed, cur- rent data to support tactical decision making.
=> Data acquisition is a set of processes and programs that extracts data for the data warehouse and operational data store from the operational systems.
=> Getting information out consists of the processes and databases involved in delivering BI to the ultimate business consumer or analyst. The components of the CIF that are found in this function:
=> The data marts are derivatives from the data warehouse used to pro- vide the business community with access to various types of strategic analysis.
=> The oper marts are derivatives of the ODS used to provide the business community with dimensional access to current operational data
=> Data delivery is the process that moves data from the data warehouse into data and oper marts. Like the data acquisition layer, it manipulates the data as it moves it. In the case of data delivery, however, the origin is the data warehouse or ODS, which already contains high- quality, integrated data that conforms to the enterprise business rules.
What Is a Data Warehouse?
Before we get started with the actual description of the modeling techniques, we need to make sure that all of us are on the same page in terms of what we mean by a data warehouse, its role and purpose in BI, and the architectural components that support its construction and usage.
Role and Purpose of the Data Warehouse
As we see in the first section of this chapter, the overall BI architecture has evolved considerably over the past decade. From simple reporting and EIS systems to multidimensional analyses to statistical and data mining requirements to exploration capabilities, and now the introduction of customizable analytical applications, these technologies are part of a robust and mature BI environment.
Different methodologies and data warehouse gurus have given this component various names such as:
=> A staging area.
A variation on the data warehouse is the “back office” staging area where data from the operational systems is first brought together. It is an informally designed and maintained grouping of data whose only purpose is to feed multidimensional data marts.
=> The information warehouse.
This was an early name for the data ware- house used by IBM and other vendors. It was not as clearly defined as the staging area and, in many cases, encompassed not only the repository of historical data but also the various data marts in its definition.
The Corporate Information Factory
The Corporate Information Factory (CIF) is a widely accepted conceptual architecture that describes and categorizes the information stores used to operate and manage a successful and robust BI infrastructure. These information stores support three high-level organizational processes:
=> Business operations are concerned with the ongoing day-to-day operations of the business. It is within this function that we find the operational transaction-processing systems and external data.
=> Business intelligence is concerned with the ongoing search for a better understanding of the company, of its products, and of its customers. Whereas business operations processes are static, business intelligence includes processes that are constantly evolving, in addition to static processes.
=> Business management is the function in which the knowledge and new insights developed in business intelligence are institutionalized and introduced into the daily business operations throughout the enterprise.
Let’s examine each component of the CIF in a bit more detail.
Operational Systems
Operational systems are the ones supporting the day-to-day activities of the enterprise. They are focused on processing transactions, ranging from order entry to billing to human resources transactions.
These operational systems are the source of most of the electronically maintained data within the CIF. Because these systems support time-sensitive real- time transaction processing, they have usually been optimized for performance and transaction throughput. Data in the operational systems environment may be duplicated across several systems, and is often not synchronized. These operational systems represent the first application of business rules to an organization’s data, and the quality of data in the operational systems has a direct impact on the quality of all other information used in the organization.
Data Acquisition
Many companies are tempted to skip the crucial step of truly integrating their data, choosing instead to deploy a series of uncoordinated, unintegrated data marts. But without the single set of business rule transformations that the data acquisition layer contains, these companies end up building isolated, user- or department-specific data marts.
Data Warehouse
The universally accepted definition of a data warehouse developed by Bill Inmon in the 1980s is “a subject-oriented, integrated, time variant and non- volatile collection of data used in strategic decision making”. The data ware- house acts as the central point of data integration—the first step toward turning data into information. Due to this enterprise focus, it serves the following purposes.
Operational Data Store
The operational data store (ODS) is used for tactical decision making, whereas the data warehouse supports strategic decisions. It has some characteristics that are similar to those of the data warehouse but is dramatically different in other aspects:
=> It is subject oriented like a data warehouse.
=> Its data is fully integrated like a data warehouse.
=> Its data is current—or as current as technology will allow
=> Its data is volatile or updatable.
=> Its data is almost entirely detailed with a small amount of dynamic aggregation or summarization.
Data Delivery
Data delivery is generally limited to operations such as aggregation of data, filtering by specific dimensions or business requirements, reformatting data to ease end-user access or to support specific BI access software tools, and finally delivery or transmittal of data across the organization. The data delivery infra- structure remains fairly static in a mature CIF environment; however, the data requirements of the data marts evolve rapidly to keep pace with changing business information needs. This means that the data delivery layer must be flexible enough to keep pace with these demands.
Data Marts
Data marts are a subset of data warehouse data and are where most of the analytical activities in the BI environment take place. The data in each data mart is usually tailored for a particular capability or function, such as product profitability analysis, KPI analyses, customer demographic analyses, and so on. Each specific data mart is not necessarily valid for other uses. All varieties of data marts have universal and unique characteristics. The universal ones are that they contain a subset of data warehouse data, they may be physically co- located with the data warehouse or on their own separate platform, and they range in size from a few megabytes to multiple gigabytes to terabytes! To maximize your data warehousing ROI, you need to embrace and implement data warehouse architectures that enable this full spectrum of analysis.
Information Workshop
The information workshop is the set of tools available to business users to help them use the resources of the Corporate Information Factory. The information workshop typically provides a way to organize and categorize the data and other resources in the CIF, so that users can find and use those resources.
The library and toolbox usually represent the organization’s first attempts to create an information workshop. The library component provides a directory of the resources and data available in the CIF, organized in a way that makes sense to business users.
Operations and Administration
Operation and administration include the crucial support and infrastructure functions that are necessary for a growing, sustainable Corporate Information Factory. In early CIF implementations, many companies did not recognize how important these functions were, and they were often left out during CIF planning and development. The operation and administration functions include CIF Data Management, Systems Management, Data Acquisition Management, Service Management, and Change Management. Each of these functions contains a set of procedures and policies for maintaining and enhancing these critically important processes.
The Multipurpose Nature of the Data Warehouse
Hopefully by now, you have a good understanding of the role the data ware- house plays in your BI environment. It not only serves as the integration point for your operational data, it must also serve as the distribution point of this data into the hands of the various business users. If the data warehouse is to act as a stable and permanent repository of historical data for use in your strategic BI applications, it should have the following characteristics:
=> It should be enterprise focused
=> Its design should be as resilient to change as possible
=> It should be designed to load massive amounts of data in very short amounts of time
=> It should be designed for optimal data extraction processing by the data delivery programs
=> Its data should be in a format that supports any and all possible BI analyses in any and all technologies
Types of Data Marts Supported
Today, we have a plethora of technologies supporting different analytical needs—Online Analytical Processing (OLAP), exploration, data mining and statistical data marts, and now customizable analytical applications. The unique characteristics come from the specificity of the technology supporting each type of data mart :
=> OLAP data mart.
These data marts are designed to support generalized multidimensional analysis, using OLAP software tools.
=> Exploration warehouse.
While most common data marts are designed to support specific types of analysis and reporting, the exploration ware- house is built to provide exploratory or true “ad hoc” navigation through data.
=> Data-mining or statistical warehouse.
=> Customizable analytical applications.
Types of Data Marts Supported
=> Data-mining or statistical warehouse.
The data-mining or statistical ware- house is a specialized data mart designed to give researchers and analysts the ability to delve into the known and unknown relationships of data and events without having preconceived notions of those relationships.
=> Customizable analytical applications.
These new additions permit inexpensive and effective customization of generic applications. These “canned” applications meet a high percentage of every company’s generic needs yet can be customized for the remaining specific functionality.
Types of BI Technologies Supported
The reality is that database structures for data marts vary across a spectrum from normalized to de normalized to flat files of transactions. The ideal situation is to craft the data mart schemas after the requirements are established. Unfortunately, the database structure/solution is often selected before the specific business needs are known. Those of us in the data warehouse consulting business have witnessed development teams debating star versus normalized designs before even starting business analysis.
Our recommendation for data mart designs is that the schemas should be based on the usage of the data and the type of information requested. There are no absolutes, of course, but we feel that the best design to support all the types of data marts will be one that does not pre establish or predetermine the data relationships. An important caveat here is that the data warehouse that feeds the marts will be required to support any and all forms of analysis—not just multidimensional forms.
To determine the best database design for your business requirements and ensuing data mart, we recommend that you develop a simple matrix that plots the volatility of the data against a type of database design required. Such a matrix allows designers, architects, and database administrators (DBAs) to view where the overall requirements lie in terms of the physical database drivers, that is, volatility, latency, multiple subject areas, and so on, and the analytical vehicle that will supply the information (via the scenarios that were developed), for example, repetitive delivery, ad hoc reports, production reports, algorithmic analysis, and so on.
Characteristics of a Maintainable Data Warehouse Environment
With this as a background, what does a solid, maintainable data warehouse data model look like? What are the characteristics that should be considered when designing any data warehouse, whether for a company just beginning its BI initiative or for a company having a sophisticated set of technologies and users, whether the company has only one BI access tool today or has a plethora of BI technologies available?
The methodology for building a BI environment is iterative in nature. We are fortunate today to have many excellent books devoted to describing this methodology. (See the “Recommended Reading” section at the end of this book.) In a nutshell, here are the steps:
In a nutshell, here are the steps:
1. First, select and document the business problem to be solved with a business intelligence capability (data mart of some sort).
2. Gather as many of the requirements as you can. These will be further refined in the next step.
3. Determine the appropriate end-user technology to support the solution (OLAP, mining, exploration, analytical application, and so on).
4. Build a prototype of the data mart to test its functionality with the business users, redesigning it as necessary.
5. Develop the data warehouse data model, based on the user requirements and the business data model.
6. Map the data mart requirements to the data warehouse data model and ultimately back to the operational systems, themselves.
7. Generate the code to perform the ETL and data delivery processes. Be sure to include error detection and correction and audit trail procedures in these processes.
8. Test the data warehouse and data mart creation processes. Measure the data quality parameters and create the appropriate meta data for the environment.
9. Upon acceptance, move the first iteration of the data warehouse and the data mart into production, train the rest of the business community, and start planning for the next iteration.
The Data Warehouse Data Model
Given that we recommend a relational DBMS for your data warehouse, what should the characteristics of the data model for that structure look like? Again, let’s look at some assumptions before going into the characteristics of the model :
=> The data warehouse is assumed to have an enterprise focus at its heart.
=> As a corollary to the above assumption, it is assumed that the data within data warehouse does not violate any business rules established by the enterprise
=> The data warehouse must be loaded with new data as quickly and efficiently as possible
=> The data warehouse must be set up from the beginning to support multiple BI technologies—even if they are not known at the time of the first data mart project
=> The data warehouse must gracefully accommodate change in its data and data structures
With these assumptions in mind, let’s look at the characteristics of the ideal data warehouse data model.
Non redundant
To accommodate the limited load cycles and the massive amount of data that most data warehouses must have, the data model for the data warehouse should contain a minimum amount of redundancy. Redundancy adds a tremendous burden to the load utilities and to the designers who must worry about ensuring that all redundant data elements and entities get the correct data at the correct time.
Stable
As mentioned earlier, we build the data warehouse in an iterative fashion, which has the benefit of getting a data mart created quickly but runs the risk of missing or misstating significant business rules or data elements. These would be determined or highlighted as more and more data marts came online. It is inevitable that change will happen to the data warehouse and its data model.
Consistent
Perhaps the most essential characteristic of any data warehouse data model is the consistency it brings to the business for its most important asset—its data. The data models contain all the meta data (definitions, physical characteristics, aliases, business rules, data owners and stewards, domains, roles, and so on) that is critically important to the ultimate understanding of the business users of what they are analyzing.
Flexible in Terms of the Ultimate Data Usage
The single most important purpose for the data warehouse is to serve as a solid, reliable, consistent foundation of data for any and all BI capabilities. It should be clear by now that, regardless of what your first BI capability is, you must be able to serve all business requirements regardless of their technologies.
The Codd and Date Premise
Given all of the above characteristics of a good data warehouse data model, we submit that the best data-modeling technique you can use is one based on the original relational database design—the entity-relationship diagram (ERD) developed by Chris Date and Ted Codd.
The resultant data model for your data warehouse is translatable, using any technology, into a database design that is :
=> Reliable across the business
It contains no contradictions in the way that data elements or entities are named, related to each other, or documented.
=> Sharable across the enterprise
The data warehouse resulting from the implementation of this data model can be accessed by multiple data delivery processes and users from anywhere in the enterprise
=> Flexible in the types of data marts it supports
The resulting database will not bias your BI environment in one direction or another. All technological opportunities will still be available to you and your enterprise.
=> Correct across the business
=> Adaptable to changes
=> Correct across the business
The data warehouse data model will provide an accurate and faithful representation of the way information is used in the business.
=> Adaptable to changes
The resulting database will be able to accommodate new elements and entities, while maintaining the integrity of the implemented ones.
Impact on Data Mart Creation
Now that we have described the characteristics of a solid data warehouse data model and have recommended an ERD or normalized (in the sense of Date and Codd) approach, let’s look at the ramifications that decision will have on our overall BI environment.
The most common applications that use the data warehouse data are multidimensional ones—at least today. The dimensions used in the star schemas correlate roughly to the subject areas developed in the subject area model—order, customer, product, market segment—and time.
To answer the questions, “How many orders for what products did we get in the Northeast section from January to June this year?”
So, what we can see about this situation is that data warehouse access will have to be restricted and used by only those business users who are very sophisticated in database design and SQL.
THANK YOU THANK YOU THANK YOU THANK YOU
Tidak ada komentar:
Posting Komentar