Method of conforming a data warehouse to the enterprise’s variable information needs

West Pomeranian University of Technology

Summary

For data warehouses to fulfil their role in integrated computer systems supporting management, they must reproduce the state of a company at any point in time and take into account the evolutionary character of the company and its environment as well as the time-variable needs of their users. To reproduce on-line the state of the company, its environment and the needs of the data warehouse’s users, it is necessary to extend the functions of the data warehouse so that it can systematically measure and evaluate how well the data warehouse conforms to the needs of its users and to the company, and how well the company conforms to its environment. This makes it possible to conform the data warehouse iteratively to new states of the company and its environment and to new needs of the data warehouse’s users (decision-makers). The concept of such a method is presented in this article.

Keywords: data warehouse, information needs, evaluation data schema

1. Introduction

Data warehouses are essential elements of computer systems that support decision-making processes at companies, in particular processes of a strategic and tactical character.

If any information (data) D has been loaded into the data warehouse at the moment t1, its usefulness in the decision-making process PD at a later moment t (t1<t) might be different than at the moment t1.

Factors which affect the variable usefulness of data stored in the company’s data warehouse are:

• Time-variable state of the company and its environment.

• Specificity of decision-making methods and the dependence of the decision-making process on the time-variable knowledge and experience of the decision-maker.

• Time-variable access to data sources (new data sources emerge, and decision-makers can give up the data sources used to date).

• Variable management and decision-making methods.

• Non-integrity and incompleteness of data.

• Internal capabilities of the data warehouse.

Since the usefulness of data (information) stored in a data warehouse is time-variable, the problems that arise include:

• Determining the influence of the usefulness of the data stored in the data warehouse on the decision-making processes that use them at any given point in time.

• Determining the value of the usefulness of any given datum at any point on the time axis.

• Determining the influence of the usefulness of one datum on the usefulness of other data stored in the data warehouse, or the lack of such influence.

On account of the time-variable usefulness of the data stored in the data warehouse, an information deficit (gap) emerges in the decision-making process.

2. Information deficit in the warehouse-aided decision-making process

The fundamental aim of applying a data warehouse in the decision-making process is to minimize the existing information deficit between the information needed to generate a decision and the information which can be received from the data warehouse.

The measure of the information deficit in the decision-making process, between the data collected in the data warehouse and the needs of decision-makers, can be the relationship defined by the formula (1), namely:

$$\lim_{\Delta t \to \infty} \sum_{i \in \Im} \int_{t}^{t+\Delta t} \left( P_i^{FH}(\tau) - M_i^{H}(\tau) \right)^{2} \, d\tau \qquad (1)$$

where $\Im$ is the set of all information that can be obtained from the data warehouse or is necessary in the decision-making process, and $M_i^{H}(t)$ is the chance of getting the information $i \in \Im$ based on the data collected in the data warehouse at the moment t.
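As a rough numerical illustration of a deficit measure of this kind, the sketch below assumes that, for every item of information, both the need for it and the chance of obtaining it from the warehouse are available as numbers sampled at equal time steps, and it accumulates the squared gap over a finite window. The function name, the sampling scheme and the use of a finite window instead of a limit are assumptions made only for this example.

```python
from typing import Dict, List


def information_deficit(needs: Dict[str, List[float]],
                        chances: Dict[str, List[float]],
                        step: float = 1.0) -> float:
    """Accumulate the squared gap between need and availability.

    needs[i][k]   -- assumed need for information item i at sample k
    chances[i][k] -- assumed chance of obtaining item i at sample k
    step          -- assumed spacing between samples (rectangle rule)
    """
    total = 0.0
    for item, need_series in needs.items():
        chance_series = chances[item]
        for need, chance in zip(need_series, chance_series):
            total += (need - chance) ** 2 * step
    return total


# Hypothetical data: two information items observed over a short window.
needs = {"sales_by_region": [1.0, 1.0], "supplier_prices": [0.5]}
chances = {"sales_by_region": [1.0, 0.5], "supplier_prices": [0.0]}
print(information_deficit(needs, chances))  # 0.5
```

A value of zero would mean that, over the sampled window, the warehouse was able to deliver every item of information exactly to the extent it was needed.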

3. Methods of minimization of information deficit in data warehouse

To minimize an information deficit in the data warehouse, one of the following methods can be used:

• Design methods of a data warehouse oriented to minimizing the information deficit.

• Methods based on the application of special data structures and models in a data warehouse.

• Minimization of the information deficit by using special query languages.

• Minimization of the information deficit at the stage of use of the data warehouse, by measuring and modifying the warehouse (its structures and algorithms) so that the usefulness of the data collected in it improves.

The design methods of a data warehouse which are oriented to minimizing the information deficit are insufficient, because the future needs and requirements of its users are unknown at the moment of designing. Therefore, when the data warehouse turns out to be of little use in decision-making, a new data warehouse is developed at the company. This approach is in accordance with the evolutionary or active design methods of a warehouse, which have been described in the literature, for example in [1], [4] and [5]. These methods are oriented to creating a new data warehouse when the information usefulness and efficiency of the existing one are low or insufficient. The costs of developing a new data warehouse can be too high to be acceptable. Additionally, the design methods of a warehouse, including the active and evolutionary methods, have a non-automatic (off-line) character and do not contain a procedure to determine the moment at which the next phase of conforming the warehouse to the new, changed needs of decision-makers should be started. In order to reduce the costs of creating a data warehouse, a warehouse life cycle model composed of many stages is often used. At each stage of this model the full life cycle of a subject data warehouse, limited to a thematically determined segment of the company’s activity, is realized. When one aspect of the decision-making processes is supported by a subject data warehouse, one usually sets about creating the next subject warehouse, connected with other aspects of the company’s activity. This is the method of developing a data warehouse step by step (the successive stages are successive subject data warehouses). The warehouse is then in the phase of creation or restructuring, and the subject data warehouses created earlier (the results of preceding stages) also require reorganizing and conforming to new needs because of changes in business processes.

Methods based on the multi-version data model are among the methods that enable the information deficit of a warehouse to be reduced. They have been characterized in the literature (for example [4], [6]). They make it possible to simulate business scenarios in the form of prognoses based on the data stored in the warehouse. Potentially this can increase the usefulness of warehouse data, but not necessarily. The application of multi-version data warehouses to simulate future business scenarios is important, but the credibility of such a simulation can be low. Usually it is not possible to create a highly useful prognosis from data whose usefulness is low, and even when the usefulness of the data currently stored in the warehouse is high, this does not mean that a future prognosis would be generated from the same data, since these data may not be useful in the future. The methods of generating business scenarios based on multi-version data warehouses also do not identify the points in time, the range and the way of introducing changes in the form of a new data version. They do not answer such questions as when creating a new data version is justified, or how the process of developing a new so-called “real version” of the warehouse, whose data would be more useful, should be supported. These tasks lie with the data warehouse’s analyst and administrator.

Query languages used in warehouses (the interaction of the decision-maker with the data warehouse) do not provide any mechanisms to identify the usefulness of data. They only make it possible to take into account the imprecise character of a decision, by using fuzzy or rule-based query languages. Fuzzy query languages make it possible to apply fuzzy conditions and quantifiers of data selection, whereas rule-based query languages are based on logic rules of the form A→B. Fuzzy and rule-based query languages can be used in warehouses by analogy with relational databases.

Data quality indexes such as usefulness, integrity, reliability and data freshness can be entered and controlled by hand (non-automatically), most frequently by the warehouse’s administrator, when the DWQ (Data Warehouse Quality) standard, described in the literature, e.g. [2], has been implemented in the warehouse. Owing to this, the usefulness and quality of data can be controlled. However, this requires the warehouse’s administrator to be involved personally in improving the data quality, and it also requires re-designing the warehouse, even with the evolutionary or active method.

The analysis of the methods of minimizing the information deficit in the decision-making process using a data warehouse (design methods, methods applying special data structures and models, and special query languages) shows that:

• They have a non-automatic (off-line) character and depend on a human (the warehouse’s administrator, the decision-maker, and so on).

• They do not have any procedure to determine the points in time at which the next phase of conforming the warehouse to the time-variable needs of users should be started.

Therefore it is necessary to elaborate a method which will take into account the time-variable usefulness of the data stored in the warehouse and conform the warehouse structures and methods to the new, time-variable needs of decision-makers, that is, the warehouse’s users.

4. Adaptive methods of conforming a warehouse to variable needs

The method of conforming the data warehouse to the variable information needs of decision-makers is based on operational system engineering. According to this field [3], a company including a data warehouse is a system operating in its environment. Subsystems of the manufacturing (working), managing (leading) and information type are distinguished in the company system. The data warehouse is one of these subsystems: it is an information subsystem. The relations of the data warehouse with the company and those of the company with its environment are then defined in accordance with Fig. 1. Each of the company subsystems (among them the warehouse), as well as the company system as a whole, can be characterized by a pair of time-dependent equations [3] which describe its operation. They are the equations of the potential Z(t) and the usefulness U(t) of a subsystem (system) at any point in time t. The general form of the equations is given by the formula (2).

Z(t) = Z(0) – A(t) + R*B(t)

U(t) = U(0) – V(t) + C*W(t) (2)

In the formula (2), A(t) denotes the use of the system (subsystem) in order to create the so-called outputs of the system (subsystem); B(t) is the so-called renovation of the input potential of the system (subsystem), which is necessary for the processes in this system (subsystem) to run; the quantity R is called a transfer function; V(t) means the expenditure (costs) at the moment t connected with the conversion of the input potential flow into the output one; and W(t) is the profit (income) connected with the operation of the system (subsystem).
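The pair of equations (2) can be evaluated directly once its terms are known. The following minimal sketch assumes, purely for illustration, that R and C are constant factors and that all terms are plain numbers; the class and function names are hypothetical and do not come from the method itself.

```python
from dataclasses import dataclass


@dataclass
class SubsystemParameters:
    z0: float  # Z(0): initial potential
    u0: float  # U(0): initial usefulness
    r: float   # R: transfer function, simplified here to a constant (assumption)
    c: float   # C: multiplier of W(t) in formula (2), simplified to a constant (assumption)


def potential(p: SubsystemParameters, a_t: float, b_t: float) -> float:
    """Z(t) = Z(0) - A(t) + R*B(t)."""
    return p.z0 - a_t + p.r * b_t


def usefulness(p: SubsystemParameters, v_t: float, w_t: float) -> float:
    """U(t) = U(0) - V(t) + C*W(t)."""
    return p.u0 - v_t + p.c * w_t


# Hypothetical values for the "data warehouse" subsystem.
warehouse = SubsystemParameters(z0=100.0, u0=50.0, r=0.5, c=2.0)
print(potential(warehouse, a_t=30.0, b_t=20.0))   # 80.0
print(usefulness(warehouse, v_t=10.0, w_t=5.0))   # 50.0
```

The same two functions could be applied to the company as a whole and to each of its subsystems, each with its own parameter set.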

A measure of the degree of conforming in the scope of information, material, energy, technical provisions and finances (business) at the moment t can be the coefficient of use of the capabilities d(t) and the coefficient of meeting the needs h(t), defined according to the formulas (3).

d(t) = A(t)/M(t)

h(t) = B(t)/P(t) (3)

The equations (2) and (3) are created separately for the entire system (the company) and for all its subsystems, including the data warehouse.
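The coefficients from the formulas (3) are simple ratios, so they can be computed per system or subsystem as soon as A(t), B(t), M(t) and P(t) are known. The sketch below assumes that M(t) (capabilities) and P(t) (needs) are positive numbers; the function names and the sample values are hypothetical.

```python
def use_of_capabilities(a_t: float, m_t: float) -> float:
    """d(t) = A(t) / M(t), the coefficient of use of the capabilities."""
    if m_t <= 0:
        raise ValueError("M(t) is assumed to be positive")
    return a_t / m_t


def meeting_of_needs(b_t: float, p_t: float) -> float:
    """h(t) = B(t) / P(t), the coefficient of meeting the needs."""
    if p_t <= 0:
        raise ValueError("P(t) is assumed to be positive")
    return b_t / p_t


# Hypothetical terms for two systems at the same moment t.
systems = {
    "company": {"A": 30.0, "M": 60.0, "B": 20.0, "P": 80.0},
    "data warehouse": {"A": 9.0, "M": 12.0, "B": 3.0, "P": 4.0},
}
for name, s in systems.items():
    d = use_of_capabilities(s["A"], s["M"])
    h = meeting_of_needs(s["B"], s["P"])
    print(name, d, h)
```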

For the data warehouse, being a company subsystem of the information type, the following can be determined at any point in time τ within the time interval [t, t+∆t] (t<=τ<=t+∆t):

• Coefficient of meeting the company’s needs in its data warehouse, hH(τ) (calculated for the system “data warehouse” from the formula (3)).

• Coefficient of using the capabilities of the data warehouse, dH(τ) (calculated for the system “data warehouse” from the formula (3)).

The knowledge of the terms of the equation (2) for the data warehouse and for the company including its data warehouse as a whole, together with the knowledge of the coefficients dH(τ) and hH(τ), where t<=τ<=t+∆t, makes it possible to determine the coefficient of conforming the data warehouse to the company within any time interval [t, t+∆t]. This coefficient is denoted with the symbol fFH(∆t). It can be determined from the formula (4).

$$f_{FH}(\Delta t) = \frac{M_{H}(\Delta t)}{P_{FH}(\Delta t)} \qquad (4)$$

Fig. 1. Enterprise (company, organization) model

where: PZF - flows of material, technical and information streams between the environment and the company; PZH - flows of source data into the data warehouse; PZO - flows of products and information streams between the company and its environment; PZFH - flows of information streams between the data warehouse system and the enterprise; PZNZ - external source data, i.e. flows of information streams between the data warehouse system and the enterprise’s environment.

In order to conform (adapt) the data warehouse to the variable needs of its users and to reproduce the variability of the company, its environment and the users’ information needs over time, it is necessary to:

• Introduce an additional layer of metadata of the data warehouse, called an evolutionary layer, which will contain additional data for the company model and for the evaluation of conforming the data warehouse to variable needs.

• Implement the multi-version model in the data warehouse tools.

The indexes of evaluation of conforming the data warehouse to the needs of the company and its environment, determined from the formulas (3) and (4), should be stored in the evolutionary metadata layer. The evolutionary layer can be implemented using an ontology or dictionaries.
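A dictionary-based variant of such a layer could look roughly like the sketch below. It assumes that evaluation indexes are recorded per real-world warehouse version and per elapsed interval; the record fields, class names and sample values are hypothetical and serve only to illustrate the idea of storing the evaluations over time.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass(frozen=True)
class EvaluationRecord:
    version: int   # index of the real-world warehouse version (hypothetical field)
    delta_t: int   # interval elapsed since that version was created
    d_h: float     # coefficient of using the warehouse capabilities
    h_h: float     # coefficient of meeting the needs in the warehouse
    f_fh: float    # coefficient of conforming the warehouse to the company


@dataclass
class EvolutionaryLayer:
    # Maps a warehouse version to its chronologically ordered evaluations.
    records: Dict[int, List[EvaluationRecord]] = field(default_factory=dict)

    def store(self, record: EvaluationRecord) -> None:
        self.records.setdefault(record.version, []).append(record)

    def latest(self, version: int) -> Optional[EvaluationRecord]:
        history = self.records.get(version, [])
        return history[-1] if history else None


layer = EvolutionaryLayer()
layer.store(EvaluationRecord(version=1, delta_t=1, d_h=0.6, h_h=0.7, f_fh=0.9))
layer.store(EvaluationRecord(version=1, delta_t=2, d_h=0.5, h_h=0.6, f_fh=0.8))
print(layer.latest(1).f_fh)  # 0.8
```

Keeping the full history per version, rather than only the latest values, would let an analyst see how the conformity drifted before a new version was generated.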

The method of conforming the data warehouse to the variable information needs of decision-makers then follows the algorithm whose block diagram is represented in Fig. 2, in which the fuzzy operator “at least greater equal”, denoted with the symbol >≅, has been used. This operator expresses the comparison of approximate numbers. To realize the algorithm represented in Fig. 2, it is necessary to determine the values of the coefficients dH(∆t), hF(∆t), dF(∆t) as well as hH(∆t), and also to calculate the capabilities MH(∆t) and the needs PH(∆t) of the data warehouse and the capabilities MF(∆t) and the needs PFH(∆t) of the company within the time interval [t, t+∆t], where t is the moment of creating the most recent version of the real-world data warehouse.

All the above-discussed coefficients and parameters of the equations (2), (3) and (4) should be stored within the metadata of the data warehouse or in the so-called identification tables of the company including its data warehouse, as well as in the identification tables of the data warehouse. These tables should additionally contain such characteristics as UF0, ZF0, ZHk0, CHk0, UHk0, ZHk0, calculated from the formula (2) for the point in time t at which the most recent real-world version of the data warehouse was created; the symbol k, used in the algorithm in Fig. 2, denotes the index of the dimension of the data defined in the logical model of warehouse data. Additionally, the membership function for the fuzzy operator “at least greater equal” should be defined and stored within the evolutionary metadata layer of the warehouse. Owing to the conforming algorithm in Fig. 2 and the implemented evolutionary metadata layer, the functionality of the data warehouse will be extended.
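The text does not fix the shape of the membership function for the operator “at least greater equal”. One possible choice, shown here only as an assumption, is a linear ramp: full membership when the left value is not smaller than the right one, falling to zero once the shortfall exceeds a relative tolerance. The function name and the default tolerance are hypothetical.

```python
def at_least_greater_equal(x: float, y: float, tolerance: float = 0.1) -> float:
    """Degree in [0, 1] to which x is approximately greater than or equal to y.

    Assumed shape: 1.0 when x >= y, then a linear fall to 0.0 as x drops
    to y - tolerance * |y| (or to y - tolerance when y == 0).
    """
    if x >= y:
        return 1.0
    width = tolerance * abs(y) if y != 0 else tolerance
    if width <= 0:
        return 0.0
    shortfall = y - x
    return max(0.0, 1.0 - shortfall / width)


print(at_least_greater_equal(1.2, 1.0))                  # 1.0
print(at_least_greater_equal(0.75, 1.0, tolerance=0.5))  # 0.5
print(at_least_greater_equal(0.5, 1.0))                  # 0.0
```

Storing the tolerance together with the other identification parameters would make the comparison reproducible across successive warehouse versions.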

Apart from the basic tasks and functions assigned to the data warehouse to date (among others integration, access to historical data, analytical data processing, decision support and knowledge deduction), it will now automatically measure how well the data collected in it conform to the company’s variable information needs.

After designing:

• the structure of the data warehouse (for example using the company model, the active warehouse design method or a multi-dimensional data modeling method, for example by means of modified UML diagrams realized in three layers (levels)),

• the structure of data within the metadata layer to describe the company model,

as well as after determining (defining) the identification parameters necessary for the completeness of the company model and the realization of the algorithm from Fig. 2, it will be possible to start creating the first version of the data warehouse, the so-called real-world version. This stage will be followed by full use of the real-world version of the data warehouse together with systematic measurement of the conforming of the data warehouse to the company’s information needs.

[Fig. 2. Block diagram of the algorithm of conforming the data warehouse to variable information needs. Elements recoverable from the diagram: the initial values UF0, ZF0, ZHk0, CHk0, UHk0, ZHk0; the calculation of dH(∆t), fFH(∆t), dF(∆t), hF(∆t) and of PH(∆t), MF(∆t), MH(∆t), PFH(∆t); the step ∆t = ∆t + 1; the tests hH(∆t) >≅ dH(∆t), PFH(∆t) >≅ MH(∆t) and hF(∆t) >≅ dF(∆t).]

If one of the following elements:

• the relations “at least greater equal” (the symbol >≅ in the algorithm in Fig. 2) between the respective coefficients hH and dH, hF and dF, and also between the company’s needs PFH and the warehouse capabilities MH, within the time interval from the moment at which the most recent version of the real-world data warehouse was created,

• the company model, the evaluation of which results from the capabilities MF(∆t) and the company’s needs PF(∆t),

• users’ access to the data,

• access to data from the environment of the company,

is not satisfactory (in the expert’s opinion the coefficient fFH(∆t) is bad), the following actions should be taken, as the situation requires:

• if the company model has changed (other flows of material, technical and information streams in the model from Fig. 1) or a change of the company environment which influences the functioning of the company has occurred, the company model should be changed,

• if a new demand for information in the data warehouse has arisen as a result of a change of decision-making methods, or if a new data source has emerged, the data structures and the methods of supplying the warehouse should be changed.
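The checking part of this procedure can be sketched as follows. The example assumes that each relation is tested in the order in which the text names its two sides, that a relation counts as unsatisfied when its fuzzy membership degree falls below a threshold, and that the membership function is a linear ramp; all three points, as well as the names used, are assumptions of this sketch rather than elements of the algorithm in Fig. 2.

```python
from typing import Dict, List


def fuzzy_ge(x: float, y: float, tolerance: float = 0.1) -> float:
    """Assumed membership degree of 'x is approximately >= y'."""
    if x >= y:
        return 1.0
    width = tolerance * abs(y) if y != 0 else tolerance
    return max(0.0, 1.0 - (y - x) / width) if width > 0 else 0.0


def unsatisfied_relations(values: Dict[str, float],
                          threshold: float = 0.5,
                          tolerance: float = 0.1) -> List[str]:
    """Return the names of the relations whose degree is below the threshold."""
    checks = {
        "hH >≅ dH": (values["hH"], values["dH"]),
        "PFH >≅ MH": (values["PFH"], values["MH"]),
        "hF >≅ dF": (values["hF"], values["dF"]),
    }
    return [name for name, (left, right) in checks.items()
            if fuzzy_ge(left, right, tolerance) < threshold]


# Hypothetical measurements for one interval since the last version.
measured = {"hH": 0.9, "dH": 0.8, "PFH": 100.0, "MH": 120.0,
            "hF": 0.7, "dF": 0.7}
print(unsatisfied_relations(measured))  # ['PFH >≅ MH']
```

A non-empty result would be the signal for the analyst or administrator to decide which of the two actions listed above applies.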

Carrying out these actions brings the next version of the real-world data warehouse into existence. This version can be implemented in the data warehouse owing to the tools existing in the management system of the multi-version data warehouse. The described concept of a multi-version data warehouse containing an evolutionary metadata layer, together with the described algorithm of conforming the data warehouse to the company and its business environment, will make it possible to:

• Simulate business scenarios.

• Monitor and evaluate the company model and data quality indexes.

• Acquire new data sources in the environment of the company.

• Verify the needs of the warehouse’s users.

5. Conclusions

The method based on the algorithm represented in Fig. 2 differs from the methods and concepts of conforming the data warehouse to the company’s time-variable information needs, which can be found in the literature, in the following features:

• It combines business modeling with a data model.

• Within business modeling, it covers the determination of the so-called requirements of a system including its data warehouse, based on the integrated identification method and on creating praxiological, cybernetic, mathematical and evaluation models of a company.

• It takes into account the changeability of the future with reference to the past and present (the creation of business scenarios by applying the multi-version data model in the warehouse).

• It integrates the application of the multi-version data model with the evolution of the data warehouse schema.

• It makes it possible to automate the “adjusting” and conforming of the data warehouse to new needs: the metadata of the warehouse and the evaluations of the company and the data warehouse at successive points on the time axis are stored in the evolutionary layer, and the software created for this purpose, operating on the metadata, permits the system’s analyst to modify the company model interactively and the data warehouse’s administrator to generate a new real-world version of the data warehouse.

• It makes it possible to conform the warehouse even to sudden changes of the state, purposes, missions and strategies of the company in its variable environment.

Bibliography

1. Bębel B., Morzy M. (2002): Projektowanie schematów logicznych dla magazynów danych, Ploug, Poznań.

2. Jarke M., Lenzerini M., Vassiliou Y., Vassiliadis P. (2003): Hurtownie danych. Podstawy organizacji i funkcjonowania. Wydawnictwo szkolne i pedagogiczne, Warsaw.

3. Śmiałkowska B. (1985): Enterprise's identification modeling for integrated control. Międzynarodowa konferencja "Cybernetyka'85", Warszawa.

4. Śmiałkowska B. (2005): Multi-version metadata model at enterprise’s data warehouse. Image Analysis, Computer Graphics, Security Systems and Artificial Intelligence Applications, pp. 245-253.

5. Śmiałkowska B. (2007): Multi-version model for enterprise’s data warehouse. Studies & Proceedings of Polish Association for Knowledge Management, Ciechocinek, 2007.

6. Wrembel R., Bębel B. (2005): Metadata Management in a Multi-version Data Warehouse. In Proceedings of Ontologies, Databases, and Applications of Semantics (ODBASE), Cyprus.

Bożena Śmiałkowska

West Pomeranian University of Technology
Żołnierska 49, 71-210 Szczecin, Poland
e-mail: bsmialkowska@wi.zut.edu.pl


