
Automated Computational Framework for Efficient Knowledge Utilization in Data Mining Process

Wydział Ekonomiczno-Rolniczy
Katedra Ekonometrii i Informatyki

Summary

The article outlines the need for an automated approach to building data mining models, so that information and knowledge can be managed for the purpose of better management and more efficient allocation of resources within the company. Valuable expert knowledge is needed at all stages of the process, and this article presents a study of how to utilize it efficiently in a typical data mining process. Based on this, an architecture for embedding expert knowledge into the process through an automatic computational framework is proposed.

Keywords: Data Mining, Automated Approach, Expert Knowledge, Meta-Learning

1. Introduction

Applying data mining in a company helps to turn the information stored in databases into knowledge, which can be applied to many diverse business problems such as customer valuation, fraud detection, customer retention, risk analysis and bad debt reduction. Thanks to data mining analysis [1][6][8] we can uncover previously unknown patterns that can be exploited as a business advantage. The structure of such problems is so complex that expert knowledge is often needed at every single step of the analysis. It is well known in the industry that the expert's input in the data mining process is very valuable, but at the same time the knowledge the expert possesses is expensive and should not be wasted on irrelevant tasks. This article presents a study of how to cope with the great number of decisions involved and how to utilize expert knowledge efficiently in the data mining process. For this purpose, an architecture that to some extent incorporates a meta-learning approach is discussed broadly. Additionally, some numerical experiments are presented to confirm the validity of the approach.

2. Problem statement

After the business problem is chosen and the data are collected and prepared, we come to the modelling stage. This part tends to be the most difficult and time-consuming to prepare. The very first step in modelling is to select the most adequate technique. Whereas a tool has already been selected during business understanding, this task refers to the specific modelling technique, e.g. decision tree building, logistic regression or a neural network. With any analytical tool, there is often a large number of parameters that can be adjusted. We also need to rank the models according to evaluation criteria. The complexity of the problem is due to the number of decisions to undertake. Some of the most important are deciding about the learning process, selecting the model and adjusting its parameters (Table 1).


Table 1. Example decisions to undertake when building a neural network model.

Architecture             | Activation function | Optimization technique | Parameter to decide
Multilayer perceptron    | Tanh                | Conj. Gradient         | Number of layers
Generalized Linear Model | Logistic            | Double Dogleg          | Number of neurons in a layer (from 1 to infinity; in practice limited only by software)
Ordinary RBF             | Square              | Newton-Raphson         | Optimisation technique
Normalized RBF           | Softmax             | Levenberg-Marquardt    | Model selection criteria (profit/loss, misclassification rate, average error)
                         | Exponential         | Quasi-Newton           | Initial weights
                         | Arctan              | Trust-Region           | Number of iterations
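To give a sense of scale, the short sketch below enumerates an illustrative configuration grid over the dimensions of Table 1; the option counts, in particular the cap of 20 neurons per layer, are assumptions made for the example, not figures from the study.

```python
from itertools import product

# Illustrative option lists based on Table 1 (counts assumed for the sketch).
architectures = ["MLP", "GLM", "ordinary RBF", "normalized RBF"]
activations = ["tanh", "logistic", "square", "softmax", "exponential", "arctan"]
optimizers = ["conjugate gradient", "double dogleg", "Newton-Raphson",
              "Levenberg-Marquardt", "quasi-Newton", "trust-region"]
hidden_neurons = range(1, 21)          # in practice limited only by software
selection_criteria = ["profit/loss", "misclassification rate", "average error"]

grid = list(product(architectures, activations, optimizers,
                    hidden_neurons, selection_criteria))
print(len(grid))  # 4 * 6 * 6 * 20 * 3 = 8640 candidate configurations
```

Even under these modest assumptions the grid already contains thousands of candidate configurations, which is consistent with the scale reported in section 5.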

Besides the number of decisions involved, artificial neural networks are capable of performing a wide variety of tasks, yet in practice they sometimes deliver only marginal performance. Inappropriate topology selection and learning algorithms are frequently blamed. In fact, there is no reason to expect that one can find a uniformly best algorithm for selecting the optimal structure with the necessary parameters in a neural network model. This is in accordance with the so-called "no free lunch" theorem [10], which states that, for any algorithm, any elevated performance over one class of problems is exactly paid for in performance over another class. As the complexity of the problem increases, manual design becomes more difficult and unmanageable. Nowadays, a promising strategy is applied very frequently in data mining applications: finding the system that works for the given data by trying many different approaches. It is quite advisable to use such a search engine in the company's analytical environment to perform the analysis more efficiently, while saving the expert's time for more elaborate problems.

3. Architecture for efficient workflow

The figure below presents an overview of the data mining problem [8]. At the same time, the diagram shows the methodology for utilizing the expert's knowledge through an automatic computational framework.


Figure 1. Proposed automatic computational framework in a data mining process.

The first three steps, data collecting, data preparation and variable selection, do not consume much of the expert's time [7]. The expertise is needed to decide crucial things such as what data to use, which variables will be useful, what criterion to use for variable selection, etc. The next fundamental task in data analysis is building a model to deal with the business problem. In this case, working with neural network models involves a number of challenges [2][5]. As the complexity of methods and techniques increases, manual design becomes very difficult and unmanageable for a single expert. The field of automatic frameworks and meta-learning has seen continuous growth in the past years, with interesting new developments in the construction of practical model-selection assistants, task-adaptive learners and solid tools for automatic pattern recognition. In this context, automatic decision-making methods that support practitioners in model selection or method combination should be developed. The solution strongly recommended in this paper is an automated computational framework for the purpose of efficient knowledge utilisation. It is desirable to have a meta-learning architecture that meets two key requirements. First, such a system must produce an accurate final classification or prediction; this means that the proposed architecture must produce a final outcome that is at least as accurate as a conventional algorithm applied to all available data. Second, it must be fast and operate in a reasonable amount of time.

4. Automatic computational framework

Over the last years, interest in automation and meta-learning in data mining [3][4][9], especially in the area of prediction and classification, has been growing rapidly, both in business and in science. As a result, the techniques developed by scientists have found plenty of real-life applications. Many commercial data mining tools are available, giving technology and business great advantages. However, such tools remain of limited use to end users who are not able to use all of the features and techniques that are available. This is due mainly to the fact that data mining systems are not trivial and their content keeps growing with newly developed features. It is worth mentioning that current data mining tools are only as powerful as their users' ability to manage all these functionalities. Multiple algorithms and techniques are provided within a single, integrated system, and the selection and combination of these techniques is left to the user. This is clearly an unsatisfactory solution.

If the business problem is complex and needs to be solved with more complicated techniques such as neural networks, then it is recommended to use an automatic computational framework for the optimization of artificial neural networks, where the network architecture, activation function, connection weights, learning algorithm and its parameters are adapted to the problem. Such algorithms are nowadays readily available in some statistical and data mining software, e.g. the SAS system.

The general idea of automatic computational methods is based on a tree search in the space of all models for the simplest and most accurate one. Obviously, the system requires a reference model, which is placed at the root of the tree. The reference model should be as simple as possible.

Optimization should be done using validation sets to improve the generalization abilities of the model. Starting from the simplest model available, such as a simple MLP 2-2-1 structure, a new property is added step by step from the space of parameters or procedures that the expert has chosen to search over. The new model may be more or less complex than the previous one. If several models give similar results, the one with the lowest complexity is selected. The search in the space of all similar models is stopped when no significant improvements are achieved by new extensions.

The evaluation function E of a model, E(Mi), returns the classification accuracy of the model Mi calculated on a validation set. Let n be the number of possible extensions of the reference model (Table 1), as limited by an expert.

The model selection algorithm is as follows:

1. Limit the space of the parameters to a set with n elements.
2. Take the initial model as the best one, Mo.
3. Repeat the following steps until the set of elements is empty:
   3.1. Create a set of n models, M = {Mi}, i = 1...n, by applying all elements to the best model Mo.
   3.2. Optimize all models in the current set.
   3.3. Evaluate all the models E(Mi) and arrange them in decreasing order of accuracy.
   3.4. Select the best model Mb from the set M as the new reference.
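A minimal sketch of this greedy search is given below, assuming generic `apply`, `optimize` and `evaluate` helpers (the evaluation returning validation-set accuracy) and an assumed improvement threshold `min_gain`; none of these names come from the original system.

```python
# Minimal sketch of the greedy model-selection search described above.
# `extensions` is the expert-limited set of n candidate modifications;
# `apply`, `optimize` and `evaluate` are assumed helpers, where
# evaluate() returns classification accuracy on a validation set.

def select_model(initial_model, extensions, apply, optimize, evaluate,
                 min_gain=1e-3):
    best = optimize(initial_model)
    best_score = evaluate(best)
    remaining = list(extensions)
    while remaining:
        # Steps 3.1-3.2: apply every remaining extension to the current
        # best model and optimize each candidate.
        candidates = [optimize(apply(best, ext)) for ext in remaining]
        scores = [evaluate(c) for c in candidates]
        # Step 3.3: order candidates by decreasing validation accuracy.
        ranked = sorted(zip(scores, candidates, remaining),
                        key=lambda t: t[0], reverse=True)
        top_score, top_model, used_ext = ranked[0]
        # Stop when no extension yields a significant improvement.
        if top_score - best_score < min_gain:
            break
        # Step 3.4: the best candidate becomes the new reference model.
        best, best_score = top_model, top_score
        remaining.remove(used_ext)
    return best, best_score
```

A fuller version would also prefer the lower-complexity model when several candidates score similarly, as described earlier in this section.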

The procedure conducts limited searches in order to find better network configurations. The simplest models are created at the beginning, and new types of parameters and procedures are added afterwards, allowing more complex models to be explored. Neural networks increase model complexity by adding parameters of the same type, generating different models within a single method. There are several options to control the algorithm. These options share the following common features:


2. For each training iteration, one estimate vector and one fit vector are retained according to error criteria.
3. The training is initialized with the current estimates. A new node is initialized with output weights equal to 0; therefore, the beginning error is the same as the final error of the previous level.
4. An adjustable setting for the maximum number of iterations is used. The maximum training iterations are adjusted upwards if the selected iteration equals the maximum, and downwards if the selected iteration is significantly lower than the maximum iterations setting.
5. An adjustable setting for the amount of time after which training stops is used. This property sets the maximum amount of time that can be spent on training.
6. The default combination functions, error functions and model selection criteria are used (Table 1).
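As an illustration of options 4 and 5, the sketch below shows one way such adaptive limits could work; the doubling/halving factors and the `train_one_level` helper are assumptions for the example and do not reflect the actual SAS implementation.

```python
import time

# Illustrative training-control loop for options 4 and 5 above.
# `train_one_level` is an assumed helper returning (model, best_iteration).

def controlled_training(levels, train_one_level, max_iters=100,
                        max_seconds=5400.0):
    start = time.monotonic()
    model = None
    for level in range(levels):
        model, best_iter = train_one_level(model, max_iters)
        # Option 4: enlarge the iteration budget if training used it all,
        # shrink it if the selected iteration was far below the cap.
        if best_iter == max_iters:
            max_iters *= 2
        elif best_iter < max_iters // 4:
            max_iters = max(best_iter * 2, 1)
        # Option 5: stop when the overall time budget is exhausted
        # (5400 s = the 1.5-hour limit mentioned in section 5).
        if time.monotonic() - start > max_seconds:
            break
    return model
```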

The algorithm described above is not resistant to local minima, like any so-called "best-first" search algorithm. Therefore, re-optimization of the models in the set may be desirable, but it would increase the computational cost. However, the algorithm accelerates the analyst's work many times over by eliminating much of the decision problem, and it is quite advisable to use such a search engine to perform the analysis in a more efficient way.

5. The validation of the study

For the purpose of this study, and to validate the concept presented in this article, some numerical experiments with neural networks were undertaken. For complex problems where neural networks are applied, the pool of all possible parameters to adjust is in fact unimaginably huge. Even with the number of possible combinations limited to a few thousand, the modelling task is extremely challenging for a single expert to perform. A data mining problem similar to the one explained in [12] was examined. An automated tool available in SAS software was used to build neural network models on a data set of 200 000 cases. The problem was to predict the customers' risk of not paying the invoice (a binomial target variable). The variables used in the numerical experiments are presented in Table 2.

Table 2. Variables and descriptions.

Variable                   | Description
Average payment delay      | Number of days after the due date, on average, by which the customer pays the invoice.
Indicator of blocked calls | Indicates customers who did not pay the invoice and, as a result, have blocked calls.
Segment number             | The higher the number, the more valuable the client is for the company.
Open invoice amount        | The difference between the invoice amount and the paid amount.
Frequency                  | The number of times the client was late with the monthly payment.
Recency                    | The last time the customer paid the invoice late.
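As a hedged illustration of how such variables could be derived from a raw invoice table (the column names and the pandas-based approach are assumptions for the example; the original study does not describe its feature construction):

```python
import pandas as pd

# Assumed raw invoice table: one row per customer invoice (illustrative).
invoices = pd.DataFrame({
    "customer_id": [1, 1, 2, 2],
    "due_date":  pd.to_datetime(["2004-01-15", "2004-02-15",
                                 "2004-01-20", "2004-02-20"]),
    "paid_date": pd.to_datetime(["2004-01-20", "2004-02-14",
                                 "2004-02-05", "2004-02-25"]),
})
invoices["delay_days"] = (invoices["paid_date"] - invoices["due_date"]).dt.days
late = invoices[invoices["delay_days"] > 0]

features = pd.DataFrame({
    # Average payment delay: mean days relative to the due date.
    "avg_payment_delay": invoices.groupby("customer_id")["delay_days"].mean(),
    # Frequency: number of times the client paid late.
    "frequency": late.groupby("customer_id").size(),
    # Recency: the last time the customer paid an invoice late.
    "recency": late.groupby("customer_id")["paid_date"].max(),
}).fillna({"frequency": 0})
print(features)
```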

The best model was supposed to achieve the highest value of the evaluation criterion, which was the percentage of correctly classified cases (PCC), while not exceeding 1.5 hours of computer calculations. The space of models with different parameters was searched according to the algorithm presented in section 4. After the given calculation time, the best model was returned by the analytical system. Based on some assumptions, a system was developed for searching within the models' space, which could do the modelling job surprisingly fast. As a result, the time needed for the modelling stage was reduced by 30%. At the same time, the expert's skills were utilized in other areas, namely model verification and comparison of the results.
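For clarity, the PCC criterion itself reduces to a one-line computation; the sketch below is an illustrative definition assuming 0/1 class labels.

```python
import numpy as np

def pcc(y_true: np.ndarray, y_pred: np.ndarray) -> float:
    """Percentage of correctly classified cases (PCC)."""
    return 100.0 * np.mean(y_true == y_pred)

# Example: 3 of 4 predictions correct -> PCC = 75.0
print(pcc(np.array([1, 0, 1, 1]), np.array([1, 0, 0, 1])))
```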

As was presented, automation systems help to solve important problems in the application of machine learning and data mining tools. The successful use of these tools outside the scientific world is conditioned on the appropriate selection of a suitable predictive model, according to the business context. Without some kind of assistance, model selection and combination can turn into solid obstacles for users who wish to access the technology more directly and cost-effectively. End users often lack not only the expertise necessary to select a suitable model, but also the availability of many models to proceed on an error-free pattern. A solution to this problem is attainable through the construction of automated and meta-learning systems. These systems can provide automatic and systematic user guidance by mapping a particular task to a suitable model. The area of automation systems and meta-learning is deeply researched, and many authors expect this field to be one of the most promising in terms of possible applications. Duch, Adamczak and Diercksen, for example, propose a "meta-learning" strategy with a framework for Similarity-Based Methods (SBM) [3][4] to automatically construct the best model for the given data. Others, like Vilalta, Giraud-Carrier, Brazdil and Soares, studied a number of meta-learning techniques to support data mining [11]. All of these aim at developing meta-learning and automated computational frameworks for the purpose of efficient knowledge utilisation and better management.

6. Conclusions

The presented approach is intended to shed some new light on efficient knowledge utilisation within the data mining process in general. It is recommended to use database software for pre-modelling activities and an accessible analytical tool for efficient modelling and prediction. The aim of the paper is also to underline the important role of meta-learning as an automatic assistant in the task of model selection.

Classification and regression tasks are common in daily business practice across a number of sectors. Hence, the decision support offered by meta-learning and automation has the potential to bear a strong impact on future applications. Since expert knowledge is often expensive and not always readily available, a fully automated and, to some extent, meta-learning system can serve as a useful tool in data mining applications. In conclusion, such a solution helps to perform the analysis more precisely and more efficiently. As a result, we can focus on the post-modelling stage, which is the checking and verification of the results.


Bibliography

1. Agresti A., 1990. Categorical Data Analysis. John Wiley & Sons, Inc., New York.
2. Bishop C., 1995. Neural Networks for Pattern Recognition. Oxford University Press.
3. Duch W., Adamczak R., Diercksen G.H.F., 2000. Classification, Association and Pattern Completion using Neural Similarity Based Methods. Applied Mathematics and Computer Science 10, 101-120.
4. Duch W., Grudziński K., 2001. Meta-learning: searching in the model space. Proceedings of the International Conference on Neural Information Processing (I), Shanghai, 235-240.
5. Duch W., Korbicz J., Rutkowski L., Tadeusiewicz R., 2000. Biocybernetics and Biomedical Engineering, Vol. 6: Neural Networks. EXIT Publisher, Warsaw.
6. Groth R., 2000. Data Mining. Building Competitive Advantage. Prentice Hall Inc., Upper Saddle River, New Jersey.
7. Karanta I., 2000. Expert systems in forecast model building. Publications of the Finnish Artificial Intelligence Society, 9th Finnish Artificial Intelligence Conference, 77-85.
8. Kennedy R.L., 1997. Solving Data Mining Problems through Pattern Recognition. Edited by Y. Lee, B. Van Roy, C. Reed and R.P. Lippmann, Prentice Hall.
9. Mitchell T., 1997. Machine Learning. McGraw-Hill, Boston.
10. Wolpert D., Macready W.G., 1997. No Free Lunch Theorems for Optimization. IEEE Transactions on Evolutionary Computation 1(1), 67-82.
11. Vilalta R., Giraud-Carrier C., et al., 2004. Using Meta-Learning to Support Data Mining. International Journal of Computer Science Applications 1(1), 31-45.
12. Ząbkowski T., Szupiluk R., Wojewnik P., 2004. Efficient knowledge utilization in Data Mining modeling. Informatyka ekonomiczna - aspekty naukowe i dydaktyczne, Częstochowa, 121-124.

TOMASZ ZĄBKOWSKI
e-mail: tzabkowski@poczta.fm

Szkoła Główna Gospodarstwa Wiejskiego
Wydział Ekonomiczno-Rolniczy
Katedra Ekonometrii i Informatyki
ul. Nowoursynowska 159, bud. 34
02-787 Warszawa
