These are the main factors driving the growth of data. The ongoing information technology revolution adds to it by converting data that used to disappear once an action was completed into electronic records for distribution and storage. It is not surprising that more and more people are drowning in the flood of data.
Some researchers have started to think about how they might better find meaning in these new mountains of data and, if possible, plan future actions based on the data available today. This is data mining: digging valuable information out of large and messy data. But reality is never that easy (Dunham HM, 2003).
Data mining is a knowledge discovery process. It integrates business knowledge, people, information, statistics and computing technology. Many definitions of data mining can be found on the web, in books, and in other sources.
2.0 Data Mining
Data mining, sometimes also called data or knowledge discovery, is the process of analyzing data from different perspectives and summarizing it into useful information that can be used to increase revenue, reduce costs, or both. Data mining software is one of a number of analytical tools for analyzing data. It allows users to analyze data from many different angles, classify it, and summarize the relationships identified. Technically, data mining is the process of finding correlations or patterns among dozens of fields in large relational databases. Data mining and knowledge discovery draw on a collection of methods from a branch of Artificial Intelligence that has grown explosively in recent years. These methods make it possible to extract from data previously hidden knowledge about relations and patterns in the behavior of the investigated object (Kohavi R, 2001).
Data mining uses technologies such as neural networks, decision trees or standard statistical techniques to search large volumes of data. In doing so, it builds models of patterns that aim to predict customer behavior accurately. Data mining is a process designed to explore data analytically in search of consistent patterns or systematic relationships between variables, and then to validate the findings by applying the detected patterns to new subsets of data.
The ultimate goal of data mining is prediction. Predictive data mining is the most common type of data mining and has the most direct business applications. Is data mining a part of knowledge discovery, or is knowledge discovery the same as data mining? Data mining is often treated as a synonym for knowledge discovery in databases, but some researchers view data mining as one essential step of knowledge discovery. This is a debatable topic in itself; in this assignment I will treat the two as the same. In general, a data mining or knowledge discovery process consists of an iterative sequence of the following steps:
- Data cleaning
- Data integration
- Data selection
- Data transformation
- Knowledge presentation
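Three of the steps listed here (cleaning, selection and transformation) can be illustrated with a minimal Python sketch. The records, the field names and the helper functions `clean`, `select` and `transform` are hypothetical and chosen only for illustration; a real project would usually rely on a database or a data preparation tool.

```python
# Hypothetical raw records; None marks a missing value.
raw_records = [
    {"customer": "A", "age": 34, "spend": 120.0},
    {"customer": "B", "age": None, "spend": 80.0},
    {"customer": "C", "age": 51, "spend": 300.0},
    {"customer": "D", "age": 27, "spend": 40.0},
]

def clean(records):
    """Data cleaning: drop records that contain missing values."""
    return [r for r in records if all(v is not None for v in r.values())]

def select(records, fields):
    """Data selection: keep only the fields relevant to the analysis."""
    return [{f: r[f] for f in fields} for r in records]

def transform(records, field):
    """Data transformation: rescale one numeric field to the range [0, 1]."""
    values = [r[field] for r in records]
    low, high = min(values), max(values)
    span = (high - low) or 1.0  # avoid division by zero for constant fields
    return [{**r, field: (r[field] - low) / span} for r in records]

prepared = transform(select(clean(raw_records), ["age", "spend"]), "spend")
for row in prepared:
    print(row)
```

Because the process is iterative, the output of such a pipeline would normally be inspected and the steps repeated with different choices before any mining is done.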
2.1 Data Mining
Data mining combines multiple disciplines. The figure below shows the disciplines that take part in data mining.
Figure 1 – The multiple disciplines that take part in data mining
2.2 Basic Data Mining Techniques
The most commonly used techniques in data mining are:
- Artificial Neural Networks
Non-linear predictive models that learn through training and resemble biological neural networks in structure.
- Decision Trees
Tree-shaped structures that represent sets of decisions. These decisions generate rules for the classification of a dataset. Specific decision tree methods include Classification and Regression Trees (CART) and Chi Square Automatic Interaction Detection (CHAID).
- Genetic Algorithms
Optimization techniques that use processes such as genetic combination, mutation, and natural selection in a design based on the concepts of evolution.
- Nearest Neighbor Method
A technique that classifies each record in a dataset based on a combination of the classes of the k records most similar to it in a historical dataset.
- Rule induction
The extraction of useful if-then rules from data based on statistical significance.
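As an illustration of the nearest neighbor method described above, the following is a minimal sketch in Python. The function `knn_classify`, the historical records and the class labels are hypothetical; Euclidean distance and majority voting are assumed as the similarity measure and the way of combining classes, which is only one of several possible choices.

```python
from collections import Counter
import math

def knn_classify(query, history, k=3):
    """Return the majority class among the k records closest to query.

    history is a list of (features, label) pairs; features are numeric tuples.
    """
    by_distance = sorted(history, key=lambda rec: math.dist(query, rec[0]))
    votes = Counter(label for _, label in by_distance[:k])
    return votes.most_common(1)[0][0]

# Hypothetical historical dataset: (age, yearly purchases) -> customer class.
history = [
    ((25, 2), "occasional"),
    ((30, 3), "occasional"),
    ((28, 1), "occasional"),
    ((45, 20), "frequent"),
    ((50, 25), "frequent"),
    ((48, 18), "frequent"),
]

print(knn_classify((47, 22), history, k=3))
print(knn_classify((27, 2), history, k=3))
```

In practice the features would usually be rescaled first, so that no single attribute dominates the distance.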
2.3 Data Mining Tasks
Data mining encompasses many different techniques and algorithms. Some of them are described below.
Predictive
Predictive tasks predict the values of data using known results found from different data.
- Classification
Classification is perhaps the most commonly applied data mining technique. It maps data into predefined classes, which are defined by data attribute values. A special type of classification is segmentation, in which the database is divided into small segments.
- Regression
Regression assumes that the target data fit some known type of function and then finds the best function of this type to model the given data. In the simplest case, regression uses standard statistical techniques such as linear regression. Unfortunately, many real-world problems are not simply linear projections of previous values.
- Correlation
Correlation is statistical in nature: it measures how strongly two variables vary together. It is a fast-growing data mining technique.
- Time Series Analysis
In this technique, the value of an attribute is examined as it varies over time. The values are usually taken at evenly spaced time points.
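The simplest case of regression mentioned above, a straight line fitted by least squares, can be sketched as follows. The function `fit_line` and the monthly sales figures are hypothetical; the data are deliberately chosen to be exactly linear, which real data rarely are.

```python
def fit_line(xs, ys):
    """Least-squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept

# Hypothetical monthly unit sales at evenly spaced time points.
months = [1, 2, 3, 4, 5]
sales = [10.0, 12.0, 14.0, 16.0, 18.0]

slope, intercept = fit_line(months, sales)
forecast = slope * 6 + intercept  # projected value for month 6
print(slope, intercept, forecast)
```

Since the values are taken at evenly spaced time points, the same sketch also serves as a very basic form of time series projection.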
Descriptive
Descriptive tasks identify patterns or relationships in data.
- Association
Association identifies relationships between events that occur at one time; it determines which things go together.
- Clustering
Clustering identifies groups of items that share a particular characteristic, segmenting a diverse group into a number of more similar subgroups or clusters. Clustering differs from classification in that it does not rely on predefined classes or characteristics for each group.
- Sequence Discovery
Sequence discovery is used to determine sequential patterns in data.
- Summarization
Summarization maps data into subsets with associated simple descriptions. It represents information about the database.
The tasks mentioned above may be combined to build more sophisticated data mining applications.
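To make the association task more concrete, the sketch below counts which items occur together in a set of transactions and derives two commonly used measures, support and confidence. The transactions and the helper names `support` and `confidence` are hypothetical, and only pairs of items are considered.

```python
from itertools import combinations
from collections import Counter

# Hypothetical market-basket transactions.
transactions = [
    {"bread", "milk"},
    {"bread", "butter"},
    {"bread", "milk", "butter"},
    {"milk"},
]

item_counts = Counter()
pair_counts = Counter()
for basket in transactions:
    item_counts.update(basket)
    # Sort each basket so a pair is always stored in the same order.
    pair_counts.update(combinations(sorted(basket), 2))

def support(pair):
    """Fraction of transactions that contain both items of the pair."""
    return pair_counts[tuple(sorted(pair))] / len(transactions)

def confidence(antecedent, consequent):
    """Estimated P(consequent | antecedent) from co-occurrence counts."""
    both = pair_counts[tuple(sorted((antecedent, consequent)))]
    return both / item_counts[antecedent]

print(support(("bread", "milk")))
print(confidence("bread", "milk"))
print(confidence("butter", "bread"))
```

A pair with high support and high confidence is a candidate for a rule of the kind "customers who buy butter also buy bread".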
2.4 Data Mining Process
2.5 Factors that Affect Data Mining
- Scientific computing trends
- Business trends
- Network trends
- Data trends
- Hardware trends
- Wireless communication
- Privacy and Security
2.6 Issues in Data Mining
- Privacy
- Noisy data
- Missing values
- Static data
- Sparse data
- Dynamic data
- Relevance
- Interestingness
- Heterogeneity
- Algorithm efficiency
- Size and complexity of data
3.0 The Evolution of Data Mining
“The current evolution of Data mining functions and products is the result of years of influence from many disciplines, including databases, information retrieval, statistics, algorithms, and machine learning. Another computer science area that has had a major impact on the Data mining process is multimedia and graphics” [Dunham H M, 2003]. In the evolution from business data to business information, each new step has built upon the previous one. For example, dynamic data access is critical for drill-through in data navigation applications, and the ability to store large databases is critical to data mining. From the user's point of view, the steps listed in Table 1 were revolutionary because they allowed new business questions to be answered accurately, efficiently and quickly.
| Evolutionary Step | Business Question | Enabling Technologies | Product Providers | Characteristics |
|---|---|---|---|---|
| Data Collection (1960s) | What was my total revenue in the last five years? | Computers, tapes, disks | IBM, CDC | Retrospective, static data delivery |
| Data Access (1980s) | What were unit sales in New England last March? | Relational databases (RDBMS), Structured Query Language (SQL), ODBC | Oracle, Sybase, Informix, IBM, Microsoft | Retrospective, dynamic data delivery at record level |
| Data Warehousing & Decision Support (1990s) | What were unit sales in New England last March? Drill down to Boston. | On-line analytic processing (OLAP), multidimensional databases, data warehouses | Pilot, Comshare, Arbor, Cognos, MicroStrategy | Retrospective, dynamic data delivery at multiple levels |
| Data Mining (at present) | What’s likely to happen to Boston unit sales next month? Why? | Advanced algorithms, multiprocessor computers, massive databases | Pilot, Lockheed, IBM, SGI, numerous startups | Prospective, proactive information delivery |

Table 1. Steps in the Evolution of Data Mining
4.0 Data Mining Algorithm
4.1 Data Mining Algorithm Components
- Model
Function of the model (e.g., classification, clustering, rule generation) and its representational form.
- Preference Criterion
The basis for preferring one model or set of parameters over another.
- Search Algorithm
Specification of an algorithm for finding particular patterns of interest, given the data, family of models, and preference criterion.
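The three components can be seen side by side in a deliberately small sketch. Everything in it is an assumption made for illustration: the model is a one-parameter threshold rule, the preference criterion is classification accuracy, and the search algorithm is an exhaustive scan over candidate thresholds. The data are hypothetical.

```python
# Hypothetical labelled data: (account balance, defaulted?).
data = [(100, True), (250, True), (400, False), (550, False), (700, False)]

def model(threshold, balance):
    """Model: a one-parameter rule that predicts default below a threshold."""
    return balance < threshold

def preference(threshold):
    """Preference criterion: fraction of records the rule classifies correctly."""
    correct = sum(model(threshold, x) == label for x, label in data)
    return correct / len(data)

def search(candidates):
    """Search algorithm: exhaustively score every candidate threshold."""
    return max(candidates, key=preference)

best = search(range(0, 801, 50))
print(best, preference(best))
```

Real data mining algorithms differ mainly in how rich the model family is and in how cleverly the search avoids scoring every candidate.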
5.0 Three Advanced Topics
I will not go into the details of these topics; I only give the main idea of each term.
- Web Mining
Web mining is the mining of data related to the World Wide Web.
- Spatial Mining
Spatial mining is the mining of spatial data, that is, data about instances located in a physical space.
- Temporal Mining
Temporal data mining concerns the analysis of events ordered by one or more dimensions of time.
6.0 Data Mining and Data Warehousing
A data warehouse is not a prerequisite for data mining. Typically, the data to be mined is first extracted from an enterprise data warehouse into a data mining database. There is a real benefit if your data is already part of a data warehouse, because the problems of cleansing data for a data warehouse and for data mining are very similar. If the data has already been cleansed for a data warehouse, it most likely needs no further cleaning before it is mined. The data mining database may be a logical rather than a physical subset of your data warehouse.
7.0 Data Mining and OLAP
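One of the most common questions from data processing professionals concerns the difference between data mining and On-Line Analytical Processing (OLAP). OLAP is part of the spectrum of decision support tools. Traditional query and report tools describe what is in a database. OLAP goes further by letting the analyst verify a hypothesis through interactive queries, whereas data mining uses the data itself to uncover patterns that the analyst did not specify in advance.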
8.0 Comparison between Data Mining and DBMS
| DBMS (queries based on the data held), e.g. | Data Mining (infers knowledge from the data held to answer queries), e.g. |
|---|---|
| Last month's sales for each product | What characteristics do customers share who lapsed their policies and how do they differ from those who renewed their policies? |
| Sales grouped by customer age etc. | Why is the Cleveland division so profitable? |
| List of customers who lapsed their policy | |

Table 2. Comparison of Data Mining with DBMS
8.0 Knowledge Based System (KBS)
A KBS is a computer system that represents knowledge about a specific problem domain and applies this knowledge to solve problems from that domain. These systems can be divided into the following categories.
- Expert System
- Case Based Reasoning System
- Neural Networks
- Data Mining Systems
- Intelligent Agent
8.1 Intelligent Agents
An intelligent agent is a software or hardware system that can operate independently. An agent may exist only as a concept or be implemented, in which case it is oriented toward the real world. No single algorithm defines an intelligent agent; the algorithms are application dependent. For example, a robot arm uses sensor-driven algorithms to control its movement and consults its protocol to determine the output signal.
8.2 Neural Networks
Neural networks are modeled on the brain. Each processing element has numerous inputs and one output, and its output is in turn linked to the inputs of various other elements. Each connection carries a weight that represents its strength. A preprocessing function, typically a weighted sum, combines the inputs, and its result is passed through an activation function to produce the final output.
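A single processing element of this kind can be sketched in a few lines of Python. The function name `neuron`, the input values and the weights are hypothetical; a weighted sum is assumed as the preprocessing function and the logistic sigmoid as the activation function, which is only one common choice.

```python
import math

def neuron(inputs, weights, bias=0.0):
    """One processing element: weighted sum followed by a sigmoid activation."""
    pre_activation = sum(x * w for x, w in zip(inputs, weights)) + bias
    return 1.0 / (1.0 + math.exp(-pre_activation))

# Hypothetical inputs and connection weights.
out = neuron([1.0, 0.5, -1.0], [0.4, 0.2, 0.5])
print(out)

# The single output can serve as one of the inputs of another element.
next_out = neuron([out, 1.0], [2.0, -1.0])
print(next_out)
```

Training a network means adjusting the weights so that the outputs match known examples; that step is not shown here.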
8.3 Case Based Reasoning Systems
Case based reasoning systems are similar to case based expert systems, in which past cases are taken into consideration. Each case contains a description of a problem and its solution. The knowledge of the expert is implicit in the system.
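The retrieval step of such a system can be sketched as follows. The case base, the feature names and the functions `similarity` and `retrieve` are hypothetical; Jaccard similarity between sets of problem features is assumed as the matching measure, and the later steps of adapting and storing a solved case are left out.

```python
# Hypothetical case base: each case pairs a problem description with its solution.
case_base = [
    ({"no power", "no lights"}, "check the power cable"),
    ({"no display", "beeps on start"}, "reseat the memory modules"),
    ({"slow", "disk noise"}, "run a disk check"),
]

def similarity(a, b):
    """Jaccard similarity between two sets of problem features."""
    return len(a & b) / len(a | b)

def retrieve(problem):
    """Return the stored case whose description is most similar to the problem."""
    return max(case_base, key=lambda case: similarity(problem, case[0]))

description, solution = retrieve({"no display", "beeps on start", "slow"})
print(solution)
```

The expert's knowledge never appears as explicit rules in this sketch; it is carried entirely by the stored cases.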
9.0 Limitation of Data Mining Tools
Data mining tools are not self-sufficient applications, although they are very powerful products. To be successful, data mining requires skilled technical and analytical specialists who can structure the analysis and understand and interpret the output that is created. The limitations of data mining are primarily data or personnel related, rather than technology related. Although data mining can help reveal patterns and relationships, it does not tell the user the value of these patterns [Jeffrey W. Seifert, 2004]. Another limitation of data mining is that while it can identify connections between behaviors and variables, it does not necessarily identify a causal relationship.
10.0 Data Mining and Knowledge Based System
The cornerstone of an effective Knowledge-Based System is data mining. Data mining uses statistical analysis to develop better business decisions than could be made using conventional methods. Data mining improves your decision making by giving you insight into what is happening in your business today and by helping you predict what will happen tomorrow. Many data mining tools on the market today can help you build powerful Knowledge-Based Systems.
The common results of data mining are the construction of Knowledge-Based Systems in the following areas:
- Customer Profiling
Understanding your customers by analyzing their behaviors and preferences, resulting in the development of profitable, customized solutions.
- Target Marketing
Developing better intelligence on who is responding to your marketing campaigns so you can focus your offers more precisely, resulting in lower cost and higher response.
- Risk Management
Identifying the customers most likely to be business risks by building models that anticipate risk events such as early detection of delinquency or fraud.
- Valuation and Loyalty Analysis
Identifying valuable and loyal customers in order to recognize and reward them and so ensure retention.
11.0 Influence of Knowledge Based System on Data Mining
The following data mining applications (Kurser, 2000) were developed with the features of knowledge based systems:
- Association Discovery
- Bayesian Statistics
- Bayesian Networks
- Classification, Classification trees
- Classification and Regression trees
- Conceptual Clustering
- Decision Trees
- Fuzzy Logic
- Genetic Algorithms
- Identification trees
- Induction trees
- K Nearest Neighbor and Nearest Neighbor
- Neural Networks
- Prediction
- Predictive Modeling
- Regression
- Rule Induction
- Rule Sets
- Semantic Query Optimization
- Sequential Pattern Discovery
- Similar Time Sequence Discovery
- Statistics
- Visualization
In fact, data mining theory was developed with knowledge based systems in mind from the start.
Without the knowledge base concept, it would not have been possible to extract knowledge from data warehouses. Data mining can be regarded as an evolution of the knowledge based system concept for tackling large database systems.
Conclusion
The widespread use of data warehouses to integrate operational data with customer, supplier, and market information has resulted in an explosion of information. Competition requires timely and sophisticated analysis of an integrated view of the data. However, there is a growing gap between powerful storage and retrieval systems and the users’ ability to effectively analyze and act on the information they contain. Data mining techniques were developed on the basis of knowledge based concepts to close this gap; their purpose is to extract relevant knowledge from data warehouses. Some, though not all, data mining techniques have failed because of the immense size of the data. New techniques will have to be developed to store data of this size, and any algorithm proposed for mining data will have to account for out-of-core data structures. Most existing algorithms have not addressed this issue, although some newly proposed ones, such as parallel algorithms, are beginning to look into it. Data mining has a lot of potential, demand for it is increasing day by day, and it is applied in a wide variety of fields. We can therefore say that the future of this technology is bright and that it will flourish even more rapidly with time.