When a key piece of data changes in the enterprise, one must first treat this new data like a query (i.e., what does this new data mean in relation to what the enterprise already knows). And if new data is not treated first like a query, one will never know if this new information matters unless someone asks. I often refer to this notion as Perpetual Analytics – a world where the "data finds the data and the relevance finds the user."
So exactly how would such a system be constructed? Many folks have suggested that this can be solved using "federated queries." Federated queries are solutions that interact with all of the islands of operational, reference and historical data scattered across the enterprise (often leveraging very smart middleware). This approach uses a query to interrogate enterprise data stores in order to gather related records. Think of federated queries as an example of "just-in-time-context."
If you want to evaluate new information against what the enterprise already knows, it so happens that federated queries don’t cut it for most missions. And the greater the number of data silos and queries, the less workable federated query systems become.
To explain why federated search breaks down with any scale I’ll need to get a bit technical here … so if you are not technical, the balance of this post is not for you.
TWO PRIMARY REASONS FEDERATED QUERY SYSTEMS DON'T SCALE
1. Operational systems and their underlying silos were originally designed to handle a specific operational mission. And the larger these systems, the more constrained their computational cycles. In other words, they do not have the free processing (or disk I/O) cycles to answer hundreds, thousands or millions of additional inquiries a day. Additionally, because operational systems were designed only to handle queries necessary to deliver specific business functionality, they cannot efficiently answer queries that they were not designed to support. This is because they do not have the indexes needed for fast lookup on every relevant field, which in turn necessitates the use of database table scans for record location. (If you are not technical and are still reading this, table scan=very slow.) Let’s take a payroll system for example. Payroll systems are designed to locate employee records based on mission specific fields, e.g., employee ID, name, tax ID, etc. A payroll system will not generally have an index to enable the efficient search on such fields as phone number or address. And if it did have an index to support a search on employee phone number, it would not likely have an index on the phone number of the employee’s emergency contact! This incomplete index problem holds true for most operational systems – from reservation systems, to sales and order entry systems, to accounts payable systems, and so on. In short, most operational systems cannot answer the necessary queries, or in any case, not quickly.
2. Even if all of the operational systems could answer all the queries quickly, there is a secondary scalability problem that necessitates recursive processing. This is easiest to explain by example. If one performs a federated query to discover enterprise records related to a specific person – say starting with a specific person’s name and date of birth – and the federated query returns some new attributes for this person, e.g., a few addresses and phone numbers, one has just learned something. To be thorough one must take what one has learned about this person and perform another enterprise-wide federated query in case there are some additional records that can now be located based on the new data points. Now, what if during this second federated query another address, a few more ways to spell the name, and an aka or two are discovered? To be thorough, each time something is learned that might enable the discovery of a previously missed record, the process must perform another federated query. I have seen this at scale where the organization had something like 2,000 internal data sets, all tethered together with very smart middleware. Their recursive process had an artificial time limit at which point it would abandon additional attempts to locate the remaining records even though there were possibly more records in the enterprise for the same person!
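For the technically inclined, the recursion in the second point can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the three silos, their records, and the names `federated_query` and `gather_context` are all made up for this example, and a simple query budget (`max_queries`) stands in for the artificial time limit described above.

```python
from collections import deque

# Hypothetical silos: each one is just a list of records (attribute -> value).
SILOS = {
    "payroll": [{"name": "Pat Smith", "dob": "1970-01-01", "phone": "555-0100"}],
    "reservations": [{"phone": "555-0100", "address": "12 Elm St"}],
    "orders": [{"address": "12 Elm St", "name": "Patricia Smith"}],
}


def federated_query(attribute, value):
    """One enterprise-wide pass: ask every silo for records with this value."""
    return [
        (silo, position)
        for silo, records in SILOS.items()
        for position, record in enumerate(records)
        if record.get(attribute) == value
    ]


def gather_context(seed, max_queries=None):
    """Re-query with every newly learned attribute until nothing new turns up."""
    pending = deque(seed.items())   # attribute values still to be searched on
    seen = set(seed.items())        # attribute values already known
    found = set()                   # (silo, position) of every record located
    queries = 0
    while pending:
        if max_queries is not None and queries >= max_queries:
            break  # assumed stand-in for a time limit: the rest is never found
        attribute, value = pending.popleft()
        queries += 1
        for silo, position in federated_query(attribute, value):
            found.add((silo, position))
            # Anything new on a located record triggers another federated query.
            for pair in SILOS[silo][position].items():
                if pair not in seen:
                    seen.add(pair)
                    pending.append(pair)
    return found, queries


seed = {"name": "Pat Smith", "dob": "1970-01-01"}
all_records, query_count = gather_context(seed)
capped_records, _ = gather_context(seed, max_queries=2)
print(len(all_records), query_count, len(capped_records))
```

Even in this toy, a two-attribute starting point turns into five enterprise-wide passes before the third record surfaces, and the capped run stops with only the payroll record in hand.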
These two points make federated query systems challenging at scale.
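The missing-index problem from the first point is easy to see with a small experiment. The sketch below uses Python's built-in `sqlite3` module and a made-up `employee` table (the schema and index name are assumptions for illustration, not any real payroll system). It asks SQLite how it would run a lookup on an indexed field versus a field nobody planned to search on.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical payroll table: indexed only on mission-specific fields.
conn.execute(
    "CREATE TABLE employee ("
    "employee_id INTEGER PRIMARY KEY, name TEXT, tax_id TEXT, "
    "phone TEXT, emergency_phone TEXT)"
)
conn.execute("CREATE INDEX idx_employee_tax_id ON employee (tax_id)")


def query_plan(column):
    """Return SQLite's plan for an equality lookup on the given column."""
    # The column name is a trusted, hard-coded identifier in this sketch.
    rows = conn.execute(
        f"EXPLAIN QUERY PLAN SELECT * FROM employee WHERE {column} = ?", ("x",)
    ).fetchall()
    return " ".join(row[3] for row in rows)  # the fourth column is the plan text


indexed_plan = query_plan("tax_id")
unindexed_plan = query_plan("emergency_phone")
print(indexed_plan)    # a SEARCH that uses idx_employee_tax_id
print(unindexed_plan)  # a SCAN of the whole table
```

The exact wording of the plan varies between SQLite versions, but the distinction holds: the tax ID lookup is an index search, while the emergency contact phone lookup has to read every row.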
Now imagine perpetual analytics where every new piece of key enterprise data is first treated like a query. How exactly is one going to use this federated approach at the scale of hundreds or thousands of queries a second? Thus, I say after observing the behavior of such systems up close and personal, scalable intelligent systems cannot be achieved via federated query. Those attempting to enable enterprise discovery or enterprise intelligence through the use of a federated query solution will very likely come up short despite Herculean investment.
So if federated search does not answer the mail, what does? You guessed it: Persistent Context. Persistent context solves the scalability and accuracy challenges associated with trying to assemble context just-in-time using federated queries.
Perpetual Analytics requires persistent context. And persistent context is all about the librarian and the central index (catalog, directory or whatever you want to call this thing).
Persistent context enables instant, enterprise-wide discovery. And discovery enables the essential federating activity – "federated fetch." Simply speaking, once one finds related records, Source Attribution is used to determine where the records are physically located. One then fetches specific records from specific data stores in a federated manner. This form of federation scales.
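To make the librarian idea a little more concrete, here is a minimal Python sketch of a central index with source attribution and a federated fetch. It is a toy under stated assumptions, not a description of any actual product: the silos, the record IDs, the `CentralIndex` class and the `federated_fetch` function are hypothetical, and matching is on exact attribute values only.

```python
from collections import defaultdict

# Hypothetical operational silos, each keyed by its own record ID.
SILOS = {
    "payroll": {"E17": {"name": "Pat Smith", "phone": "555-0100"}},
    "reservations": {"R42": {"phone": "555-0100", "address": "12 Elm St"}},
}


class CentralIndex:
    """The 'librarian': remembers which records hold which attribute values."""

    def __init__(self):
        # (attribute, value) -> set of (silo, record_id) pointers
        self._catalog = defaultdict(set)

    def observe(self, silo, record_id, record):
        """Treat the new record as a query first, then file it in the catalog."""
        related = set()
        for pair in record.items():
            related |= self._catalog.get(pair, set())
        for pair in record.items():
            self._catalog[pair].add((silo, record_id))  # source attribution
        return related


def federated_fetch(pointers):
    """Fetch only the specific records the index pointed to, from their silos."""
    return {(silo, record_id): SILOS[silo][record_id] for silo, record_id in pointers}


index = CentralIndex()
first = index.observe("payroll", "E17", SILOS["payroll"]["E17"])
second = index.observe("reservations", "R42", SILOS["reservations"]["R42"])
fetched = federated_fetch(second)
print(first, second, fetched)
```

The point of the design is that discovery happens once, against the index, at the moment data arrives. The operational systems are only touched afterwards, by record ID, which is the kind of lookup they were built to answer.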
Whether an enterprise is interested in improving its use of disparate information assets to improve health care outcomes, better serve customers, fight fraud or protect the country, I think solutions involving persistent context are how it will have to be done at the end of the day.
[One final technical point: Even if the operational systems expose their metadata in a fully cross-referenced index (e.g., a specialized search/discovery "appliance") to solve the missing index problem, the recursive costs to construct just-in-time context (each time new information is discovered) still make federated queries an unattractive approach. To boot, there are a few other incremental risks associated with using externalized indexes conjoined to each operational system. If you care to discuss this point, drop me an email.]
Postings on this site don't necessarily represent IBM's positions, strategies or opinions. Jeff Jonas is the chief scientist of IBM Software Group's Threat and Fraud Intelligence unit and works on technologies designed to maximize enterprise awareness; Jeff also spends a large chunk of his time working on privacy and civil liberty protections. He will be writing a series of guest posts for Security Blog.