Data Mining

(1)

Data Mining

Piotr Paszek

piotr.paszek@us.edu.pl

Introduction

(Piotr Paszek) Data Mining DM – KDD 1 / 43

(2)

Recommended Reference Books

1 J. Han, M. Kamber, and J. Pei, Data Mining: Concepts and Techniques. Morgan Kaufmann, 3rd ed. 2011.

2 I. Witten, E. Frank, and M. Hall. Data Mining: Practical Machine Learning Tools and Techniques. Morgan Kaufmann, 3rd ed. 2011.

3 P. Tan, M. Steinbach, and V. Kumar, Introduction to Data Mining, Wiley, 2nd ed. 2016.

4 X. Wu, V. Kumar, The Top Ten Algorithms in Data Mining. Chapman & Hall, 2009.

(3)

Data Mining – Etymology

In the 1960s, statisticians used terms like data fishing or data dredging to refer to what they considered the bad practice of analysing data without an a-priori hypothesis.

The term data mining appeared around 1990 in the database community.

Other terms used include data archaeology, information harvesting, information discovery, knowledge extraction, etc.

Gregory Piatetsky-Shapiro coined the term knowledge discovery in databases, which became more popular in the AI and machine learning communities. The term data mining, however, became more popular in the business and press communities.

Currently, the terms data mining and knowledge discovery are often used interchangeably.

(4)

Data mining (definition?)

Data mining is the computing process of discovering patterns in large data sets involving methods at the intersection of machine learning, statistics, and database systems.

Data mining is an essential process in which intelligent methods are applied to extract data patterns. It is an interdisciplinary subfield of computer science.

The overall goal of the data mining process is to extract information from a data set and transform it into an understandable structure for further use.

https://en.wikipedia.org/wiki/Data_mining

(5)

Data Mining: Confluence of Multiple Disciplines

(6)

Data Mining

Data mining is the extraction of implicit, previously unknown, and potentially useful information from data. The idea is to build computer programs that sift through databases automatically, seeking regularities or patterns. Strong patterns, if found, will likely generalize to make accurate predictions on future data. . . .

Machine learning provides the technical basis for data mining. It is used to extract information from the raw data in databases . . . Data mining is defined as the process of discovering patterns in data.

The process must be automatic or semiautomatic.

The patterns discovered must be meaningful in that they lead to some advantage, usually an economic one.

Ian Witten, Eibe Frank, Mark Hall. Data Mining: Practical Machine Learning Tools and Techniques. Third Edition. Morgan Kaufmann Publishers, 2011.

(7)

Data Mining

Data mining, also popularly referred to as Knowledge Discovery in Databases (KDD), is the automated or convenient extraction of patterns representing knowledge implicitly stored or captured in large databases, data warehouses, the Web, other massive information repositories or data streams.

Jiawei Han, Micheline Kamber. Data Mining: Concepts and Techniques. 2nd Edition. Morgan Kaufmann Publishers, 2006.

(8)

Knowledge Discovery in Databases (KDD)

The KDD field is concerned with the development of methods and techniques for making sense of data.

. . .

At the core of the process is the application of specific data-mining methods for pattern discovery and extraction.

. . .

KDD refers to the overall process of discovering useful knowledge from data, and data mining refers to a particular step in this process.

Data mining is the application of specific algorithms for extracting patterns from data.

Usama Fayyad, Gregory Piatetsky-Shapiro, Padhraic Smyth. From Data Mining to Knowledge Discovery in Databases. AI Magazine, 17(3): 37–54, 1996.

(9)

KDD process

1. Understand the application domain and the goal of the process.

2. Create a target dataset as a subset of all the data that is available.

3. Clean and preprocess the data to remove noise and to handle missing data and outliers.

4. Reduce and project the data in order to focus on the features that are relevant to the problem.

5. Match the goals of the process to a data mining method. Decide the purpose of the model, such as summarization or classification.

6. Choose the data mining algorithms to match the purpose of the model (from step 5).

7. Data mining, that is, run the chosen algorithms on the data.

8. Interpret the mined patterns to make them understandable by the user, for example through summarization and visualization.

9. Act on the discovered knowledge, for example by reporting or making decisions.

U. Fayyad, G. Piatetsky-Shapiro, P. Smyth. The KDD Process for Extracting Useful Knowledge from Volumes of Data. Communications of the ACM 39, 11, 1996, 27-34.

(10)

KDD Process

1. Data cleaning to remove noise and inconsistent data.

2. Data integration, where multiple data sources may be combined.

3. Data selection, where data relevant to the analysis task are retrieved from the database.

4. Data transformation, where data are transformed and consolidated into forms appropriate for mining by performing summary or aggregation operations.

5. Data mining, which is an essential process where intelligent methods are applied to extract data patterns.

6. Pattern evaluation to identify the truly interesting patterns representing knowledge, based on interestingness measures.

7. Knowledge presentation, where visualization and knowledge representation techniques are used to present mined knowledge to users.

Jiawei Han, Micheline Kamber. Data Mining: Concepts and Techniques. Second Edition. Morgan Kaufmann Publishers, 2006.
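The seven steps above can be sketched as a chain of small functions. This is only an illustrative toy, not the procedure from the cited book: the two store datasets, the function names, and the 0.6 support threshold are all assumptions made for the example.

```python
from collections import Counter

# Two hypothetical sources of purchase records: (customer, item, price).
store_a = [("ann", "milk", 2.0), ("bob", "bread", 1.5), ("ann", "bread", 1.5)]
store_b = [("cat", "milk", 2.0), ("bob", "milk", None), ("cat", "bread", 1.5)]

def clean(records):
    """Step 1: drop records that contain a missing value."""
    return [r for r in records if None not in r]

def integrate(*sources):
    """Step 2: combine several data sources into one collection."""
    return [r for source in sources for r in source]

def select(records):
    """Step 3: keep only the fields relevant to the task (customer, item)."""
    return [(customer, item) for customer, item, _price in records]

def transform(pairs):
    """Step 4: aggregate into one basket (set of items) per customer."""
    baskets = {}
    for customer, item in pairs:
        baskets.setdefault(customer, set()).add(item)
    return baskets

def mine(baskets):
    """Step 5: count how many baskets contain each item."""
    return Counter(item for items in baskets.values() for item in items)

def evaluate(counts, n_baskets, min_support=0.6):
    """Step 6: keep only items whose support reaches the (assumed) threshold."""
    return {item: c / n_baskets for item, c in counts.items()
            if c / n_baskets >= min_support}

def present(patterns):
    """Step 7: render the surviving patterns as readable text."""
    return [f"{item}: support {s:.2f}" for item, s in sorted(patterns.items())]

baskets = transform(select(integrate(clean(store_a), clean(store_b))))
report = present(evaluate(mine(baskets), len(baskets)))
print(report)  # ['bread: support 1.00', 'milk: support 0.67']
```

In practice the steps are rarely run once in a straight line; the chained call is only meant to show how the output of each step feeds the next.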

(11)

KDD Process

Jiawei Han, Micheline Kamber. Data Mining: Concepts and Techniques. Second Edition. Morgan Kaufmann Publishers, 2006.

(12)

Data Mining in Business Intelligence

(13)

KDD Process

This is the view typical of the machine learning and statistics communities.

(14)

CRISP-DM

Cross-Industry Standard Process for Data Mining (CRISP-DM) is a data mining process model that describes commonly used approaches that data mining experts use to tackle problems.

(15)

Phases of the CRISP-DM reference model

P. Chapman, J. Clinton et al. (2000). CRISP-DM 1.0: Step-by-step data mining guide.

(16)

CRISP-DM major phases

1. Business Understanding

2. Data Understanding

3. Data Preparation

4. Modeling

5. Evaluation

6. Deployment

P. Chapman, J. Clinton et al. (2000). CRISP-DM 1.0: Step-by-step data mining guide.

(17)

CRISP-DM

Business Understanding

This initial phase focuses on understanding the project objectives and requirements from a business perspective, and then converting this knowledge into a data mining problem definition, and a preliminary plan designed to achieve the objectives. A decision model, especially one built using the Decision Model and Notation standard, can be used.

(18)

CRISP-DM

Data Understanding

The data understanding phase starts with an initial data collection and proceeds with activities in order to get familiar with the data, to identify data quality problems, to discover first insights into the data, or to detect interesting subsets to form hypotheses for hidden information.

(19)

CRISP-DM

Data Preparation

The data preparation phase covers all activities to construct the final dataset (data that will be fed into the modelling tool(s)) from the initial raw data. Data preparation tasks are likely to be performed multiple times, and not in any prescribed order. Tasks include table, record, and attribute selection as well as transformation and cleaning of data for modelling tools.

(20)

CRISP-DM

Modeling

In this phase, various modeling techniques are selected and applied, and their parameters are calibrated to optimal values.

Typically, there are several techniques for the same data mining problem type. Some techniques have specific requirements on the form of data. Therefore, stepping back to the data preparation phase is often needed.
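As a rough illustration of what "calibrating parameters" can mean, the sketch below picks a decision threshold for a one-feature classifier by maximizing accuracy on held-out validation data. The data, the candidate thresholds, and the function names are hypothetical; CRISP-DM itself does not prescribe any particular calibration procedure.

```python
def accuracy(threshold, data):
    """Fraction of (value, label) pairs classified correctly by 'value >= threshold'."""
    correct = sum((value >= threshold) == label for value, label in data)
    return correct / len(data)

def calibrate(candidates, validation):
    """Return the candidate threshold with the highest validation accuracy."""
    return max(candidates, key=lambda t: accuracy(t, validation))

# Hypothetical validation set: (feature value, true class).
validation = [(0.1, False), (0.3, False), (0.45, False),
              (0.6, True), (0.8, True), (0.9, True)]
candidates = [0.2, 0.4, 0.5, 0.7]

best = calibrate(candidates, validation)
print(best, accuracy(best, validation))  # 0.5 1.0
```

If no candidate performs acceptably, that is typically a signal to step back to data preparation rather than to keep tuning.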

(21)

CRISP-DM

Evaluation

At this stage in the project you have built a model (or models) that appears to have high quality, from a data analysis perspective. Before proceeding to final deployment of the model, it is important to more thoroughly evaluate the model, and review the steps executed to construct the model, to be certain it properly achieves the business objectives. A key objective is to determine if there is some important business issue that has not been sufficiently considered. At the end of this phase, a decision on the use of the data mining results should be reached.

(22)

CRISP-DM

Deployment

Creation of the model is generally not the end of the project.

Even if the purpose of the model is to increase knowledge of the data, the knowledge gained will need to be organized and presented in a way that is useful to the customer. Depending on the requirements, the deployment phase can be as simple as generating a report or as complex as implementing a repeatable data scoring (e.g. segment allocation) or data mining process.

In many cases it will be the customer, not the data analyst, who will carry out the deployment steps. Even if the analyst deploys the model it is important for the customer to understand up front the actions which will need to be carried out in order to actually make use of the created models.

(23)

DM Software

Best free DM software (in alphabetical order):

KNIME Analytics Platform

Orange Data mining

R Software Environment, Rattle GUI

RapidMiner Studio

Rough Set Exploration System

Weka Data Mining

(24)

KNIME Analytics Platform

The Konstanz Information Miner (KNIME) is an open source data analytics, reporting and integration platform.

KNIME integrates various components for machine learning and data mining through its modular data pipelining concept, and provides a graphical user interface that allows the assembly of nodes for data preprocessing, modelling, data analysis and visualization.

KNIME Analytics Platform provides over 1000 data analytic routines, either natively or through R and Weka.

KNIME is written in Java and based on Eclipse and makes use of its extension mechanism to add plugins providing additional functionality.

(25)

Orange Data mining

Orange is an open source data visualization and analysis tool. Orange is developed at the University of Ljubljana, Slovenia, together with the open source community.

Data mining is done through visual programming or Python scripting.

Orange is a Python library.

Orange consists of a canvas interface onto which the user places widgets and creates a data analysis workflow. In Orange, the data analysis process can be designed through visual programming.

Orange runs on many platforms (Windows, Mac OS X, Linux).

Orange can read files in native and other data formats.

Orange is devoted to machine learning methods for classification, or supervised data mining.

(26)

R Software Environment

R is a free software environment for statistical computing and graphics. It compiles and runs on a wide variety of UNIX platforms, Windows and MacOS.

R is an integrated suite of software facilities for data manipulation, calculation and graphical display.

The R language is widely used among statisticians and data miners for developing statistical software and data analysis.

R provides a wide variety of statistical and graphical techniques, including linear and nonlinear modeling, classical statistical tests, time-series analysis, classification, clustering, and others.

(27)

Rattle GUI

The R Analytical Tool To Learn Easily (Rattle) is a popular GUI for data mining using R. It is Free Open Source Software.

Rattle runs on many platforms (Windows, Mac OS X, Linux).

It presents statistical and visual summaries of data, transforms data so that it can be readily modeled, builds both unsupervised and supervised models from the data, presents the performance of models graphically, and scores new datasets.

One of the most important features is that all of the user’s interactions through the graphical user interface are captured as an R script that can be readily executed in R independently of the Rattle interface.

Through a simple and logical graphical user interface based on Gnome, Rattle can be used by itself to deliver data mining projects.

(28)

RapidMiner Studio

RapidMiner Studio is a visual design environment for machine learning, data mining, text mining, predictive analytics and business analytics.

It provides a deep library of machine learning algorithms, data preparation and exploration functions, and model validation tools to support all your data science projects and use cases.

Data science teams can easily re-use existing R and Python code, and add new functionality via a large marketplace of pre-built extensions.

RapidMiner supports all steps of the data mining process including results visualization, validation and optimization.

RapidMiner is written in the Java programming language.

RapidMiner provides learning schemes, models and algorithms from Weka and R scripts that can be used through extensions.

(29)

Rough Set Exploration System

Rough Set Exploration System (RSES) is a tool set for analysing data with the use of methods coming from Rough Set Theory. It is a graphical, user-friendly front-end running under Windows and providing access to methods from the RSESlib library.

RSESlib is the core of the RSES computational kernel.

Both the library and the GUI are designed and implemented at Warsaw University.

RSESlib is a library of functions for performing various data exploration tasks such as: calculation of reducts, generation of decision rules, classification, discretization, decomposition, search for patterns in data, data manipulation.

The library is implemented in Java. The first version of the library was included in the computational kernel of the ROSETTA system.

(30)

Weka Data Mining

Weka is a collection of machine learning algorithms for data mining tasks. The algorithms can either be applied directly to a dataset or called from your own Java code.

Weka features include machine learning, data mining, preprocessing, classification, regression, clustering, association rules, attribute selection, experiments, workflow and visualization.

Weka is written in Java, developed at the University of Waikato, New Zealand.

It runs on many platforms (Windows, Mac OS X, Linux).

Weka is open source software issued under the GNU General Public License.

(31)

Data Mining - tasks I

Anomaly detection (outlier/change/deviation detection)

The identification of unusual data records that might be interesting, or of data errors that require further investigation.
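One simple way to flag unusual records is the z-score rule, sketched below. This is only one of many possible techniques and is not named in the slides; the threshold of 2 standard deviations and the sample readings are assumptions chosen for illustration.

```python
from statistics import mean, pstdev

def zscore_outliers(values, threshold=2.0):
    """Return the values lying more than `threshold` standard deviations from the mean."""
    mu = mean(values)
    sigma = pstdev(values)  # population standard deviation
    if sigma == 0:
        return []  # all values identical, so nothing can stand out
    return [v for v in values if abs(v - mu) / sigma > threshold]

# Hypothetical sensor readings with one suspicious value.
readings = [10, 11, 9, 10, 12, 10, 50]
print(zscore_outliers(readings))  # [50]
```

Whether the flagged record is an interesting event or a data error still has to be decided by inspecting it; the rule only points to candidates.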

Association rule learning (dependency modelling)

Searches for relationships between variables. For example, a supermarket might gather data on customer purchasing habits in order to determine which products are frequently bought together.
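Relationships of this kind are commonly quantified with two standard measures, support and confidence, which the slides do not define. The sketch below computes both for a handful of hypothetical shopping baskets; the data and function names are made up for the example, and the antecedent is assumed to occur in at least one basket.

```python
# Hypothetical shopping baskets, one set of items per transaction.
transactions = [
    {"bread", "butter", "milk"},
    {"bread", "butter"},
    {"bread", "jam"},
    {"butter", "milk"},
    {"bread", "butter", "jam"},
]

def count(itemset, transactions):
    """Number of transactions that contain every item of the itemset."""
    itemset = set(itemset)
    return sum(itemset <= t for t in transactions)

def support(itemset, transactions):
    """Fraction of transactions that contain the itemset."""
    return count(itemset, transactions) / len(transactions)

def confidence(antecedent, consequent, transactions):
    """Estimate of P(consequent | antecedent) for the rule antecedent -> consequent."""
    both = set(antecedent) | set(consequent)
    return count(both, transactions) / count(antecedent, transactions)

print(support({"bread", "butter"}, transactions))       # 0.6
print(confidence({"bread"}, {"butter"}, transactions))  # 0.75
```

Here the rule "bread -> butter" holds in 3 of the 4 baskets that contain bread, which is where the confidence of 0.75 comes from.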

Clustering

Discovering groups and structures in the data that are in some way or another similar, without using known structures in the data.
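As a minimal sketch of clustering, the code below runs k-means (Lloyd's algorithm) on one-dimensional data. K-means is just one common choice, not a method the slides single out; the data points, the starting centers, and the fixed iteration count are assumptions for the example.

```python
def kmeans_1d(values, centers, iterations=10):
    """Lloyd's algorithm on 1-D data, starting from the given centers."""
    centers = list(centers)
    clusters = [[] for _ in centers]
    for _ in range(iterations):
        # Assignment step: put each value into the cluster of its nearest center.
        clusters = [[] for _ in centers]
        for v in values:
            nearest = min(range(len(centers)), key=lambda i: abs(v - centers[i]))
            clusters[nearest].append(v)
        # Update step: move each center to the mean of its cluster
        # (an empty cluster keeps its previous center).
        centers = [sum(c) / len(c) if c else centers[i]
                   for i, c in enumerate(clusters)]
    return centers, clusters

values = [1.0, 1.5, 2.0, 9.0, 10.0, 11.0]
centers, clusters = kmeans_1d(values, centers=[0.0, 5.0])
print(centers)   # [1.5, 10.0]
print(clusters)  # [[1.0, 1.5, 2.0], [9.0, 10.0, 11.0]]
```

No class labels are supplied anywhere: the two groups emerge only from the distances between the values.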

(32)

Data Mining - tasks II

Classification

Building a model that describes how to classify (assign) data items into one of a set of predefined classes. For example, an e-mail program might attempt to classify an e-mail as legitimate or as spam.
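The spam example can be sketched with a tiny naive Bayes text classifier. This is one possible technique among many, chosen here for brevity; the four training messages, the function names, and the use of add-one smoothing are all assumptions of the sketch.

```python
import math
from collections import Counter

def train(examples):
    """Build per-class word counts and class counts from (text, label) pairs."""
    word_counts = {}
    label_counts = Counter()
    for text, label in examples:
        label_counts[label] += 1
        word_counts.setdefault(label, Counter()).update(text.lower().split())
    return word_counts, label_counts

def classify(text, model):
    """Return the class with the highest naive Bayes log-score for the text."""
    word_counts, label_counts = model
    vocabulary = {w for counts in word_counts.values() for w in counts}
    total = sum(label_counts.values())
    best_label, best_score = None, -math.inf
    for label, counts in word_counts.items():
        # Log prior plus log likelihood of each word, with add-one smoothing.
        score = math.log(label_counts[label] / total)
        denominator = sum(counts.values()) + len(vocabulary)
        for word in text.lower().split():
            score += math.log((counts[word] + 1) / denominator)
        if score > best_score:
            best_label, best_score = label, score
    return best_label

# Hypothetical labelled training messages.
training = [
    ("win money now", "spam"),
    ("cheap money offer", "spam"),
    ("meeting agenda for monday", "legitimate"),
    ("lunch on monday", "legitimate"),
]
model = train(training)
print(classify("win cheap money", model))  # spam
print(classify("monday meeting", model))   # legitimate
```

The predefined classes come from the labels in the training data, which is what distinguishes classification from clustering.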

Regression

Predicting the value of a given (continuous) feature based on the values of other features in the data, assuming a linear or non-linear model of dependency.
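For the linear case, a minimal sketch is ordinary least squares with a single predictor. The closed-form slope and intercept used below are the standard ones; the data points are hypothetical and lie exactly on the line y = 2x + 1 so the result is easy to check.

```python
def fit_line(xs, ys):
    """Least-squares fit of y = slope * x + intercept for one predictor."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)  # assumed non-zero (xs not all equal)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept

# Hypothetical observations generated from y = 2x + 1.
xs = [1.0, 2.0, 3.0, 4.0]
ys = [3.0, 5.0, 7.0, 9.0]

slope, intercept = fit_line(xs, ys)
prediction = slope * 5.0 + intercept
print(slope, intercept, prediction)  # 2.0 1.0 11.0
```

With noisy real data the fitted line would not pass through every point, and a non-linear model may be needed when the dependency is not a straight line.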

Summarization

Providing a more compact representation of the data set, including visualization and report generation.
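A very small example of summarization is reducing a column of values to a few descriptive statistics. The choice of statistics, the function name, and the sample ages below are assumptions made for illustration.

```python
from statistics import mean, median

def summarize(values):
    """Reduce a list of numbers to a handful of descriptive statistics."""
    return {
        "count": len(values),
        "min": min(values),
        "max": max(values),
        "mean": mean(values),
        "median": median(values),
    }

# Hypothetical customer ages.
ages = [23, 35, 31, 44, 27]
summary = summarize(ages)
print(summary)  # {'count': 5, 'min': 23, 'max': 44, 'mean': 32, 'median': 31}
```

Five numbers now stand in for the whole column, which is the kind of compact representation a report or chart would be built from.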