Data mining is the development of computational algorithms for identifying or extracting structure from data, in order to help reduce, model, understand, or analyze the data. Tasks supported by data mining include prediction, segmentation, dependency modeling, summarization, and change and deviation detection. Database systems have brought digital data capture and storage to the mainstream of data processing, leading to the creation of large data warehouses. These are databases whose primary purpose is to provide access to data for analysis and decision support. Traditional manual data analysis and exploration require highly trained data analysts and are ineffective for data sets of high dimensionality (large numbers of variables) or massive size. See Database management system.
A data set can be viewed abstractly as a set of records, each consisting of values for a set of dimensions (variables). While data records may exist physically in a database system in a schema that spans many tables, the logical view is of concern here. Databases with many dimensions pose fundamental problems that transcend query execution and optimization. A fundamental problem is query formulation: How is it possible to provide data access when a user cannot specify the target set exactly, as is required by a conventional database query language such as SQL (Structured Query Language)? Decision support queries are difficult to state. For example, which records are likely to represent fraud in credit card, banking, or telecommunications transactions? Which records are most similar to records in table A but dissimilar to those in table B? How many clusters (segments) are in a database and how are they characterized? Data mining techniques allow for computer-driven exploration of the data, hence admitting a more abstract model of interaction than SQL permits.
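One of the questions above, finding records similar to those in table A but dissimilar to those in table B, can be illustrated with a minimal sketch. It assumes purely numeric records and Euclidean distance. The scoring rule (distance to the nearest B record minus distance to the nearest A record), the function names, and the sample data are all hypothetical choices made for illustration, not a method prescribed by this article.

```python
import math

def distance(p, q):
    """Euclidean distance between two numeric records of equal length."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

def contrast_score(record, table_a, table_b):
    """Illustrative score: higher means closer to A and farther from B.

    Computed as (distance to nearest record in B) minus
    (distance to nearest record in A). This rule is an assumption.
    """
    nearest_a = min(distance(record, r) for r in table_a)
    nearest_b = min(distance(record, r) for r in table_b)
    return nearest_b - nearest_a

def rank_by_contrast(candidates, table_a, table_b):
    """Return candidates ordered from most A-like to most B-like."""
    return sorted(
        candidates,
        key=lambda r: contrast_score(r, table_a, table_b),
        reverse=True,
    )

# Hypothetical two-dimensional records.
table_a = [(1.0, 1.0), (1.2, 0.9)]
table_b = [(5.0, 5.0), (4.8, 5.1)]
candidates = [(1.1, 1.0), (3.0, 3.0), (5.0, 4.9)]

ranked = rank_by_contrast(candidates, table_a, table_b)
for record in ranked:
    print(record, round(contrast_score(record, table_a, table_b), 3))
```

The user supplies example records rather than an exact predicate, which is the kind of abstract interaction that a conventional SQL query does not express directly.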
Data mining techniques are fundamentally data reduction and visualization techniques. As the number of dimensions grows, the number of possible combinations of choices for dimensionality reduction explodes. For an analyst exploring models, it is infeasible to go through the various ways of projecting the dimensions or selecting the right subsamples (reduction along columns and rows). Data mining is based on machine-based exploration of many of the possibilities before a selected reduced set is presented to the analyst for feedback.
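The growth in the number of possible projections can be made concrete with a short calculation. As a sketch, assume a projection is simply a choice of a non-empty subset of the d dimensions; there are then 2^d - 1 such subsets, and C(d, k) of them contain exactly k dimensions. The function names below are hypothetical.

```python
import math

def count_projections(d):
    """Number of non-empty subsets of d dimensions: 2**d - 1."""
    return 2 ** d - 1

def count_projections_of_size(d, k):
    """Number of ways to choose exactly k of d dimensions: C(d, k)."""
    return math.comb(d, k)

# Print the counts for a few dimensionalities.
for d in (10, 20, 50, 100):
    print(
        f"d={d}: all projections={count_projections(d)}, "
        f"3-dimensional projections={count_projections_of_size(d, 3)}"
    )
```

Even when restricted to three-dimensional views, the count grows roughly as the cube of d, and this does not yet include the choice of row subsamples, which is why manual enumeration by an analyst quickly becomes infeasible.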
Doing It Automatically. This BusinessMiner analysis determined that the most influential factor common to non-profitable customers was their credit limit. (Image courtesy of SAP.)