What is Data Mining? Depends Who You Ask ...

Everyone has their own definition of data mining. My favorite is one I heard a few weeks ago at the ACM SIGKDD conference on knowledge discovery and data mining:

Data Mining, noun 1. Torturing the data until it confesses … and if you torture it enough, you can get it to confess to anything.

Here are some far less humorous definitions:

The Government Accountability Office produced the following definition for data mining:

"The application of database technology and techniques—such as statistical analysis and modeling—to uncover hidden patterns and subtle relationships in data and to infer rules that allow for the prediction of future results."

The Congressional Research Service has defined data mining as:

"Data mining involves the use of sophisticated data analysis tools to discover previously unknown, valid patterns and relationships in large data sets. These tools can include statistical models, mathematical algorithms, and machine learning methods (algorithms that improve their performance automatically through experience, such as neural networks or decision trees). Consequently, data mining consists of more than collecting and managing data, it also includes analysis and prediction."

The Internet’s popular Wikipedia site defines data mining as:

"Data mining (DM), also called Knowledge-Discovery in Databases (KDD) or Knowledge-Discovery and Data Mining, is the process of automatically searching large volumes of data for patterns such as association rules. It is a fairly recent topic in computer science but applies many older computational techniques from statistics, information retrieval, machine learning and pattern recognition."

Mary DeRosa at the Center for Strategic and International Studies (CSIS) published a report on data mining that cites a presentation by David Jensen at the CSIS Data Mining Roundtable on July 23, 2003. In "Data Mining in Networks," Jensen defined data mining as follows:

"'Data mining' ... has a relatively narrow meaning: it is a process that uses algorithms to discover predictive patterns in data sets."

Kim Taipale at the Center for Advanced Studies in Science and Technology Policy has defined data mining this way:

"The combination of mathematics, statistics, economics, political science, cultural anthropology, sociology, psychology, psychiatry, neuroscience, and other social sciences with computer science techniques such as federated search and retrieval, visualization, knowledge extraction, modeling, and simulation — together referred to expansively for policy purposes as "data mining" — enable the development and application of nonlinear, nondeterministic theories and models of complex human phenomena at all scales to social governance and control problems, including law enforcement and national security."

In a soon-to-be-published paper, Jim Harper and I have opted for the following definition:

"Data mining is the process of searching data for previously unknown patterns and often using these patterns to predict future outcomes."

The report of the Department of Defense's Technology and Privacy Advisory Committee (TAPAC) defined data mining as:

We define 'data mining' to mean "searches of one or more electronic databases of information concerning U.S. persons, by or on behalf of an agency or employee of the government."

The TAPAC definition is certainly the broadest. Under it, a doctor at the Veterans Administration who searches for a specific patient record (e.g., by name and date of birth) would be engaged in data mining.

And new definitions of data mining will surely continue to appear – some more rational than others. For example, here is a definition pending on Capitol Hill, in an amendment submitted by Senator Feingold to H.R. 5441:

DATA-MINING.-The term "data-mining" means a query or search or other analysis of 1 or more electronic databases, where-

(A) at least 1 of the databases was obtained from or remains under the control of a non-Federal entity, or the information was acquired initially by another department or agency of the Federal Government for purposes other than intelligence or law enforcement;

(B) a department or agency of the Federal Government or a non-Federal entity acting on behalf of the Federal Government is conducting the query or search or other analysis to find a predictive pattern indicating terrorist or criminal activity; and

(C) the search does not use a specific individual's personal identifiers to acquire information concerning that individual.

Maybe I’m confused, but under this definition, analysis to find predictive patterns unrelated to terrorist or criminal activity would not be considered data mining. Also, if the data is owned entirely by the intelligence or law enforcement community, no form of analysis could be construed as data mining.
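To make the three-part test concrete, here is a minimal Python sketch of how clauses (A) through (C) combine. It is only an illustration of the logic as I read it, not legal analysis: the `Search` class and all of its field names are hypothetical, invented here to mirror the wording of the amendment.

```python
from dataclasses import dataclass


@dataclass
class Search:
    """A query or analysis over one or more electronic databases.

    All field names are hypothetical; they paraphrase clauses (A)-(C).
    """
    has_non_federal_database: bool          # (A), first half
    acquired_for_other_purposes: bool       # (A), second half: collected by another
                                            # agency, not for intelligence or law enforcement
    run_by_or_for_federal_government: bool  # (B): who conducts the search
    seeks_terror_or_crime_pattern: bool     # (B): what the search is looking for
    uses_personal_identifiers: bool         # (C): keyed to a specific individual


def is_data_mining(s: Search) -> bool:
    """Return True only if all three clauses are satisfied (they are joined by 'and')."""
    clause_a = s.has_non_federal_database or s.acquired_for_other_purposes
    clause_b = s.run_by_or_for_federal_government and s.seeks_terror_or_crime_pattern
    clause_c = not s.uses_personal_identifiers
    return clause_a and clause_b and clause_c


# Pattern analysis over commercial data, looking for terrorist activity.
commercial_terror_search = Search(True, False, True, True, False)

# Same analysis, but the predictive pattern is unrelated to terrorism or crime.
unrelated_pattern_search = Search(True, False, True, False, False)

# Terrorism pattern analysis over data held entirely by the intelligence community.
intel_owned_search = Search(False, False, True, True, False)

print(is_data_mining(commercial_terror_search))
print(is_data_mining(unrelated_pattern_search))
print(is_data_mining(intel_owned_search))
```

Under this reading, only the first search qualifies. The second fails clause (B) and the third fails clause (A), which are the two gaps noted above.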

Postings on this site don’t necessarily represent IBM’s positions, strategies or opinions.

Jeff Jonas is the chief scientist of IBM Software Group’s Threat and Fraud Intelligence unit and works on technologies designed to maximize enterprise awareness. Jeff also spends a large chunk of his time working on privacy and civil liberty protections. He will be writing a series of guest posts for Netcrime Blog.
