An overview of the methods and tools used in selected areas of sourcing knowledge from text databases and data warehouses

(1)

their structure and diversity of liaisons defined in stored information recourses. It also contains content that concerns standardized concepts occurring in literature related to this problem that ends with conclusions that are a result of implementation experiences. One can describe different methods and also logic and technical structures, whose scope essentially exceeds the solutions applied in cases discussed in this monograph. In the first part are quoted issues closed to the types and diversity of the data being analysed and in the second basic mechanisms of processing are described.

2.1 The notion and types of data warehouses

The term data warehouse appeared for the first time in 1988 and was introduced by Barry Devlin and Paul Murphy of IBM to serve, on one hand, as an environment intended to support users of business information and, on the other hand, as enablers for IT specialists to take care of the quality of data. Now, the generally accepted definition is the one proposed by Bill Inmon in 2005, according to which a data warehouse is a subject-oriented, integrated, time variant and non-volatile collection of data (database) – essentially supporting decision-making processes.

More specifically, “Subject oriented” means that the data collection is limited to a specific business aspect, e.g. sales, rather than to business operations or processes. In practice, a data warehouse may embrace a number of business aspects. Therefore, a data warehouse, which focuses on one business area, or even one component, is usually referred to as a data mart, i.e. a local data warehouse, typically being a subset of a corporate warehouse. “Integrated” means standardization of the data into a specific format and range of acceptable values, and – in a wider sense – data mapping obtained from different sources; “Time-variant” means that all data uploaded into a data warehouse is time-stamped, which allows users to monitor changes and to perform analytical evaluations at given time intervals.

- “Non-volatile” means that data placed in a data warehouse is recorded and kept there as read-only records, because reading is the only available operation, at least formally. [2] The idea of a data warehouse is based on the assumption that its user can freely access the data at any level, down to the minutest detail. Therefore, it is a centralized base, consisting of cleaned and standardized data, organized according to hierarchy, content and time. In the database, archived data is collected to meet the various needs of a number of users in an organization. In the collection process, the data recorded in the organization’s operating systems is standardized and aggregated according to pre-defined codes. Information stored in a data warehouse can be divided into four separate classes of factual information (facts). These facts describe the occurrence of specific events in the real world, and they are the actual subject of analysis, expressed predominantly in numerical values (e.g. value of sales, number of units, number of breakdowns, complaints, etc.).

- Reference information (descriptions): They describe the dimensions or, in other words, categories, in which the factual data can be analyzed; they represent the key aspects of an organization’s operations, such as time, product, service, geographical area, the customer,

(2)

distribution channel, staff, fixed assets etc. and indicate the cross-sections to aggregate (sum, average) the facts.

- Collective information (aggregated facts) is collected data or facts, aggregated at numerous levels (e.g. weekly, monthly or quarterly sales value totals), and stored to accelerate the time of response to users’ queries.

- Metadata, which describe the data, stored in a warehouse, explaining its meaning and indicating the location, the method of retrieval, processing and use [3].

The definitions and descriptions mentioned above show the complexity of the idea of a data warehouse. The complexity results from the need to acquire information from different sources, which requires standardization (Table 2.1) as well as from various information needs of the data warehouse users (Table 2.2) and a large number of stored data and the necessity to provide efficient (fast) access to the data.

Table 2.1 Selected sources of information acquired for a data warehouse to support sales

Data type Source

Sales value Sales operational database Sales volume Sales operational database

Capital cost of goods sold Inventory turnover or purchasing operational database Cost of sales (manufacturing

cost of goods sold)

Inventory turnover or production cost database

Cost of sales of services Service orders database, which includes the direct costs of services or working time and an accounting of the materials used in connection with a payroll database and goods turnover or purchasing operational database. Logistic costs Logistic operational database, which contains cost information or which is

connected with relevant records in a financial and accounting operational database.

Timeliness of payment Financial and accounting operational database

Sales team activity CRM operational database (Customer Relationship Management). Sales promotion and

advertising campaigns

CRM operational database or another operational database with information concerning promotional and advertising actions that have been carried out. Sales team labour costs CRM operational database, connected to a financial and accounting database Cost of promotion CRM operational database or another operational database concerning

promotional and advertising actions that have been carried out, in connection with financial and accounting information.

Number of complaints and their reasons

CRM operational database or another operational database concerning recorded complaints

Cost of complaints CRM operational database or another operational database relating to complaints and connected with information from a financial and accounting database

Customer satisfaction CRM operational database or another operational database concerning the evaluation of customer satisfaction

Competitors’ sales performance

External sources of information Market sales figures External sources of information Market development trends External sources of information Source: Own research, based on market experience.

(3)

Table 2.2 Examples of information needs of different users in a sales data warehouse

User type/User group Required information

Board of Managers (Managing Director)

- Value, volume and profitability of sales in total, within areas and product groups,

- Own sales trends, - Industry sales trends, - Industry market share.

Sales Director - Value, volume and profitability of sales in total, within areas, product groups and customer groups,

- Own sales trends, - Industry sales trends, - Industry market share,

- Value of sales to strategic accounts, - Trends of sales to strategic accounts,

- Efficiency of marketing and advertising/promotional campaigns. Area Sales Manager/

BU Manager

- Value, volume and profitability of sales in total, within areas, product groups and customer groups,

- Own sales trends in the area/industry (business),

- Industry sales trends, esp. with regards to competition in the area/industry, - Value of sales to strategic accounts and other customers in the area/industry, - Trends of sales to strategic accounts and other customers in the area/industry, - Marketing and advertising/promotional campaigns’ efficiency in the area/industry,

- Average timeliness of a customer’s payment in the area,

- Value, volume and profitability of sales generated by area/industry sales representatives,

- Efficiency of sales representatives in the area/industry.

Sales Staff - Value, volume and profitability of sales in total, within areas, product groups and customer groups,

- Value of sales to strategic accounts and other customers in the area/industry, - Trends of sales to strategic accounts and other customers in the area/industry, - Average lead times and availability of the offered products,

- Average timeliness of payment from customers served in the area, - Detailed information concerning the customer’s transaction history and offered products.

Production planning - Product sales throughout the year,

- Long-term product sales trends (upward and downward). Procurement - Quantitative distribution in the annual sale of products and goods,

- Upward and downward trends in the long-term sale of products and goods. Services - Quantitative distribution in the annual sales of services,

- Upward and downward trends in the long-term sale of services.

Logistics - Quantitative distribution in the annual sale of products and goods by volume and weight,

- Upward and downward trends in the long-term sale of products and goods by volume and weight,

- Average turnover of product and goods inventories throughout the year and long-term turnover trends.

Finances - Value distribution in the annual sale of products, goods and services, - Upward and downward trends in the long-term sale of products, goods and

(4)

User type/User group Required information services,

- Average annual accounts receivable turnover and its long-term trends. HR/Personnel - Quantitative distribution in the annual sale of products, goods and services

with a consideration for the required qualifications of the personnel,

- Upward and downward trends in the long-term sale of products, goods and services with a consideration for the required qualifications of the personnel, - Average annual personnel turnover and its long-term trends.

Administration/ Maintenance

- Annual activity distribution of the sales team with a consideration for their need for technical resources,

- Upward and downward trends in the long-term sale of products, goods and services with a consideration for the technical requirements.

Marketing - Annual activity distribution of the sales team with a consideration for their requirements for marketing resources,

- Upward and downward trends in the long-term sale of products, goods and services with a consideration for marketing resources.

Source: Own research, based on market experience.

- The characteristics of a data warehouse also show a large amount of stored data, which has to be made efficiently accessible to users. For example, in a standard operational database the data are normally stored for 3 to 5 years or less, whereas data warehouses are designed to store data infinitely. Therefore, it is essential that a warehouse stores only the data that are necessary from the point of view of intended analyses and that the data are suitably prepared in technical terms. Because of this, as well as due to the scope and variety of stored data, when developing a data warehouse, one should choose the right solution for the specific circumstances. In practical application of the idea of a data warehouse in business, the following examples (types) can be observed: one data warehouse is designed to meet the needs of all decision-makers;

- a few smaller data warehouses (data marts), subject-oriented, each one of which is designed to meet specific information needs in one area (discipline), e.g. marketing, finance, production, or sales;

- a centralized data warehouse, Enterprise Data Warehouse (EDW), designed to supply information to subject-oriented warehouses and not to be used for direct analysis.

In the case of a multi-branch organization (corporation, holding, etc.), each of the relevant branches may also operate its own comprehensive data warehouse. This type of solution is called a Departmental Data Warehouse (DDW). On the HQ level, complex organizations may have an additional data warehouse, storing data from the entire corporation. Such a central warehouse can be called a Corporate Data Warehouse (CDW) [3].

2.2 Logic and technical structure of data warehouses and methods of data analysis Structure (or Architecture) is a framework made from rules and structures, on which a system will be built [4], whereas data structure defines the data sources supplied by the system, the location and transfer of data within the system, and the ways in which the data are applied by users. The structure of a system, based on a data warehouse, has a number of characteristic properties [4]:

(5)

- data are extracted and uploaded from source systems, databases and files;

- data from source systems are integrated before they have been transferred to a warehouse;

- a data warehouse is a separate, permanent database, designed to support the decision-making processes;

- users can access the data warehouse through front-end tools.

Figure 2.1 General structure of a data warehouse

Source: Poe, V., Klauer, P., and Brobst, S. (2000), Tworzenie hurtowni danych, Wydawnictwa Naukowo-Techniczne, Warsaw.

In practice, there are many different variants of the structure of a decision-support system (DSS) based on a data warehouse and OLAP analytics. The structure of a data warehouse is mainly determined by two

– factors: the types of data warehouses and data stores within the system;

- types of structures of warehouse databases and database management systems. The structure of a data warehouse primarily depends on which one of the following databases is to be included:

- a regular data warehouse (central); - data marts (stores);

- an operational data store.

Systems which include both a central data warehouse and data marts (departmental or subject-specific data stores) are often used in practice (Fig. 2.2), as well as such systems where only data marts are included (Fig. 2.3). In some cases, before going to the warehouse, the data from source systems are transferred to an Operational Data Store (ODS) which has a relational model (Fig. 2.4). When a number of warehouses are used in one system, one should make sure the data stored in individual warehouses are consistent so that results of analyses, based on the data, can be compared and evaluated.

(6)

Figure 2.2 System based on a central data warehouse and data marts

Source: Poe, V., Klauer, P., Brobst, S. (2000), Tworzenie hurtowni danych, Wydawnictwa Naukowo-Techniczne, Warsaw.

Figure 2.3 System based only on data marts

(7)

Figure 2.4 System based on an ODS, a central data warehouse and data marts

It is also possible to extract data from source systems or an operational data store directly to multi-dimensional virtual cubes, which are stored in the operating memory and undergo dynamic modifications in analytical processes. This kind of solution is referred to as a ‘virtual data warehouse’ and requires the storage of a data repository containing descriptions of methods of development of the virtual content. A well-known form of this solution implies MS Excel OLAP cubes. The other factor which affects the structure of a data warehouse is the adoption of specific technological solutions with regards to database management systems and the types of database structures and schemas linked with them. Warehouse databases can be managed by means of a relational system (ROLAP) or a multi-dimensional database management system (MOLAP), or both systems at the same time. The adoption of a specific approach will substantially affect the database which is to store data aggregates (i.e. consolidated data). In the ROLAP structure (Fig. 2.5), elementary data and data aggregates are stored in relational database tables, whose records are used by analytical processing services (OLAP), making them available to users to carry out multi-dimensional analyses.

(8)

Figure 2.5 ROLAP structure of Microsoft Analysis Services

Source: Januszewski, A. (2008), Funkcjonalność informatycznych systemów zarządzania, Tom 2 – Systemy Business Intelligence, Wydawnictwa Naukowe PWN.

On the other hand, in the MOLAP structure (Fig. 2.6) all data to be processed in multi-dimensional analyses (elementary and aggregated data) are placed in a special multi-multi-dimensional database (in numerous data cubes). If a given table is used in different data cubes, it needs to be multiplied, which takes up disk space and extends transfer times. That is why so-called virtual cubes are used, allowing one copy to be used in all the analyzed cubes.

(9)

Figure 2.6 MOLAP structure in Microsoft Analysis Services

Source: A. Januszewski (2008), Funkcjonalność informatycznych systemów zarządzania, Tom 2 – Systemy Business Intelligence, Wydawnictwa Naukowe PWN.

- In practice, there are also hybrid solutions, combining both structures described above. A technical infrastructure comprises all elements which are necessary to implement a data warehouse system and to ensure its efficient performance and expected functionalities. Such elements are the following:

- A computer system (servers, workstations), a local and wide area network, and a means of communication;

- A database management system (for relational and multi-dimensional bases); - Analytical processing servers (OLAP);

- Data conversion tools; - Data access tools;

- Metadata repository administration tools;

- Tools and equipment for the visualization of the results of analyses; - Training for decision-support system designers;

- Training for administrators and users.

The configuration of a technical infrastructure for a data warehouse is determined by the arrangement of its four key elements: Data sets, Metadata repository, DSS software (i.e. the OLAP server), and End-user software. The basic components of the aforementioned technical infrastructure can be supplemented with special equipment for visualization of the results of analyses. There are three principal configurations of the infrastructure [5]:

(10)

- one-tiered structure; - two-tiered structure; - three-tiered structure.

In the case of a one-tiered structure, all elements are based on a client workstation, and the entire system can be referred to as a Desktop OLAP. With the two-tiered type, the client workstation is used to store only user’s or users’ software, whereas databases, the repository and the OLAP server are all installed on a separate computer (remote server). The most sophisticated is the three-tiered structure, where the data warehouse is supported by a thorough database server, which can also manage subject-specific warehouses. The meta data repository and DDS software are installed on a separate application server, and client software is installed on user workstations. The three-tiered infrastructure is the most efficient and the safest configuration, allowing users to perform complex analyses in the background by queuing them on the application server and, thus, reducing the load on the workstation.

In view of the large variability of business surroundings and the results from it become difficult in terms of defining information needs, solutions ensuring large scalability and the possibility of the dynamic definition of analysed indexes are mostly applied in a process of knowledge management. It causes a tendency to prefer the ROLAP structure which gives just such possibilities. In the opinion of the authors, the abovementioned thesis generates confirmation of real applications with regards to data warehouses.

Bibliography

[1] Łakomy, M. (2000). Hurtownie danych dla przyszłości, Computerworld, October 1st. [2] Surma, J. (2009). Business Intelligence – Systemy wspomagania decyzji biznesowych,

Wydawnictwa Naukowe PWN.

[3] Januszewski, A. (2008). Funkcjonalność informatycznych systemów zarządzania, Tom 2 Systemy Business Intelligence, Wydawnictwa Naukowe PWN.

[4] Poe, V., Klauer, P., and Brobst, S. (2000). Tworzenie hurtowni danych, Wydawnictwa Naukowo-Techniczne, Warsaw.

[5] Gorawski, M. and Konopacki, A. (2000). Metodyka projektowania systemów wspomagania decyzji. Computerworld Raport, March.