110th Congress Debates Data Mining

On Wednesday, January 10, 2007, the Senate Judiciary Committee held a hearing entitled "Balancing Privacy and Security: The Privacy Implications of Government Data Mining Programs."

This session again proved that what data mining means depends on whom you ask. This poses a real problem for those trying to have a rational conversation on the subject, and I worry that if lawmakers get this wrong, poor laws will follow.

Jim Harper of the Cato Institute submitted written testimony, which referenced the paper he and I recently released titled "Effective Counterterrorism and the Limited Role of Predictive Data Mining." Our paper was not intended to describe data mining at large; rather, we selected the term "Predictive Data Mining" to describe a certain kind of data mining, specifically "…the process of searching data for previously unknown patterns and often using these patterns to predict future outcomes." As our paper posits, using machines to find hidden patterns based on historical data is not useful in the context of terrorism, where there are so few terrorist incidents from which to draw. We could just as easily have called this "Data Mining for Predictive Patterns."

Kim Taipale, the executive director of the Center for Advanced Studies in Science and Technology Policy, submitted this written testimony. Kim argues that, broadly speaking, data mining is any automated analysis of information that reveals output that otherwise would "remain unnoticed using traditional manual means of investigation." Therefore, data mining is "simply a productivity tool that when properly employed can increase human analytic capacity and make better use of limited security resources." This definition includes link-based analysis (e.g., who’s talking to whom, who’s financing whom, etc.), pattern-based analysis (e.g., anticipated signatures of terrorist planning) and predicate-based analysis (e.g., higher interest in those who graduated from Afghanistan terror training camps). He also goes on to say "… patterns can be inferred from lower-level precursor activity – for example, illegal immigration, identity theft … attendance in training camps, targeting and surveillance activity…."

Leslie Harris, the executive director of the Center for Democracy and Technology, submitted this written testimony. Leslie chose this definition for data mining: "use of computer tools to extract useful knowledge from large sets of data." Leslie differentiates data mining into two categories: pattern-based data mining and subject-based data mining. Pattern-based is then described as data mining "which seeks to find a pattern, anomaly or signature among oceans of personal transactional data." Subject-based data mining is described as a form "which seeks information about a particular individual who is already under suspicion." Her testimony goes on to say, "As a general matter, the value of subject-based approaches is more readily apparent, and there are fewer privacy concerns associated with data searches that begin with particularized suspicion."

Robert Barr, the executive director of Liberty Strategies, submitted this written testimony in which he expresses concern over various government programs and notes that "Data mining presents many serious threats to the First, Second, Fourth and Fifth Amendments to the Constitution." Although data mining is not defined by his testimony, it appears (based on the programs he mentions) that he uses the term "data mining" to mean any effort by the government to access and/or collect data.

Dr. James Carafano, a senior research fellow at the Heritage Foundation specializing in national security, defense and counter-terrorism, submitted this written testimony. In part he writes, "Because technology is going to be an important part of any set of counterterrorism tools, and because our lives in the information age are so dependent on many of the systems and databases in which these technologies will look for information about terrorists, we also need a set of rules to guide how we implement the basic principles of long-war fighting in the electronic world." This testimony does not attempt to define data mining, and it implies neither a broad nor a narrow definition.

Although there is no agreement on what data mining means, I cannot help but notice a high degree of consensus on some points (e.g., that watch listing, link analysis and predicate-based analytics can be useful and are less invasive). In any case, when the government starts writing data mining laws, these things come to my mind:

1. We should be talking about authorization, oversight and accountability only for programs involving U.S. persons. There is much less concern with respect to analytics and information collected abroad (unrelated to U.S. persons). New data mining policy related to program disclosure that does not differentiate between U.S. and non-U.S. persons would be a huge mistake.

2. Data mining has many valuable uses at both the aggregate and person-centric level in areas outside of counter-terrorism. Examples include healthcare research, bio-surveillance, benchmarking the efficacy of various educational programs, and so on. Any government policy stating that data mining should only be used in support of counter-terrorism would also be a huge mistake.

3. And finally, any policy that emerges to regulate data mining or mandate reporting had better define it, because under one definition of data mining even something as simple as using a computer to look up your name on a reservation list (e.g., at the hotel during check-in) is considered data mining. If this type of activity gets added to the data mining reporting requirements, those in charge of monitoring data mining programs will have to sift through so many reports (i.e., false positives) that they may never find, or have time to appropriately respond to, the programs that are more problematic.

By the way, the debate about which data sets the government can peer into is another debate, and an important one, but (in my opinion) not a data mining debate.

And for the record, in my opinion, at least for programs designed to target specific people, predictions about which people should be targeted for additional scrutiny or action should not be based on machine-discovered patterns when so little historical training data exists. However, such targeting does become useful when it starts from qualified predicates (e.g., subjects who attended terrorist training camps). This can materially help organizations and governments focus their finite investigatory resources.
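A rough back-of-the-envelope sketch may help show why the starting population matters so much. Everything in it is an assumption made up for illustration: the population sizes, the number of true targets, and the detector's error rates are hypothetical, and the function name `expected_flags` is mine. It also generously assumes a pattern detector with a fixed error rate exists at all, which is exactly what scarce training data makes doubtful.

```python
def expected_flags(population, true_targets, sensitivity, false_positive_rate):
    """Expected true positives, false positives, and precision of a detector.

    All inputs are hypothetical; this is simple base-rate arithmetic,
    not a model of any real program.
    """
    non_targets = population - true_targets
    true_positives = true_targets * sensitivity
    false_positives = non_targets * false_positive_rate
    flagged = true_positives + false_positives
    precision = true_positives / flagged if flagged else 0.0
    return true_positives, false_positives, precision


# Hypothetical scenario A: scan an entire population for a hidden pattern.
broad = expected_flags(
    population=300_000_000, true_targets=100,
    sensitivity=0.9, false_positive_rate=0.001,
)

# Hypothetical scenario B: start from a qualified predicate, so the
# population under review is small and far richer in true targets.
narrow = expected_flags(
    population=5_000, true_targets=50,
    sensitivity=0.9, false_positive_rate=0.001,
)

for label, (tp, fp, precision) in (("broad", broad), ("narrow", narrow)):
    print(f"{label}: expected true hits={tp:.1f}, "
          f"expected false alarms={fp:.1f}, precision={precision:.4f}")
```

Under these made-up numbers, the broad scan buries a handful of true hits under roughly three hundred thousand false alarms, while the predicate-first review flags mostly real subjects. The specific figures mean nothing; the point is only that the same error rate behaves very differently depending on where you start.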

Furthermore, I don’t know of a single federal program that is attempting to detect hidden terrorist patterns using historical terrorist event data. Either such programs are hidden from sight or, more likely, these organizations already recognize that there are better ways to attack the counter-terrorism mission.

Postings on this site don't necessarily represent IBM's positions, strategies or opinions. Jeff Jonas is the chief scientist of IBM Software Group's Threat and Fraud Intelligence unit and works on technologies designed to maximize enterprise awareness; Jeff also spends a large chunk of his time working on privacy and civil liberty protections. He will be writing a series of guest posts for Security Blog.
