OPINION | The Data Abyss by Michael Best

Originally posted at NatSecGeek.com on 16th April 2015
By Michael Best
When it comes to intelligence work, and presumably any data-intensive field outside of the hard sciences, there’s an unstated assumption that more is better. The more information we have, the more analyses can be performed and the more detailed those analyses can be. On the surface, this makes sense. There are constant complaints in every aspect of human life about not having enough information to make a well-informed decision, and about how just one more piece of information could provide the missing clue. The problem is that it’s never quite enough, yet it quickly becomes too much.
Many people even outside the Intelligence Community are aware of the data center maintained by the National Security Agency in Utah, and of similar facilities run by foreign intelligence agencies around the world. To some it’s a frightening prospect which has some surface similarities to George Orwell’s vision of surveillance and control in 1984 (despite the critical differences), while to others it’s a way of keeping up with the world and making sure that there’s “enough” information. There are a few problems with this thinking.*
First, there’s the unstated assumption that somewhere within the mountain of data is the clue that’s needed, and that this information can be recognized and retrieved. This relies on the AI algorithms that have been developed, which are beyond useful when used correctly: applied to very specific tasks, with complex problems divided into single-step tasks that can be tackled individually. As a result, straightforward tasks which require only mechanical thinking become quite possible, while anything more remains quite elusive.
Second, there’s the assumption that more information is better. On the surface, it seems obvious and logical that this would be true. However, even granting that the accuracy rate of the information is consistent throughout, more information quickly becomes too much information, or useless information. Earlier this year, I was asked to put together a timeline covering the major technical and social developments relating to a particular piece of technology, as part of a demonstration that it was not, as used, part of a secure platform. As constructed, the timeline covered 10-15 years and stopped several years short of the present day, since the software’s architecture had solidified by that point and the changes made to it since then were either secondary or only of interest to technicians and engineers, none of whom were part of the audience the brief was designed for.
Unfortunately, Management wanted the timeline to include information coming up to the present day because it looked better and gave the appearance of being more complete, while in reality it only diluted the meaningful information in the rest of the brief. While I argued for bringing the pitcher to the mound, Management wanted to bring the mountain of data to Mohammed. The problem was that what looked like a mountain from a distance would more accurately be described as an abyss of data.
“It’s not as if nothing happened in the last ten years; update the timeline to include it.”
“By my count, more than ten thousand things happened in the last ten years, and none of them are meaningful to [the intelligence consumer]. Each of those things is so tiny that it’s meaningless, and since they’re not directly related there’s nothing meaningful when we aggregate those instances. It’s only meaningful to engineers and specialists.”
“We can’t have the brief appear to be incomplete.”
“We can’t dilute the intelligence with meaningless data that looks impressive from a distance.”
Eventually I convinced Management to allow me to add only a single line to the timeline, informing the audience that once the software became stable it remained largely unchanged on the macro scale. At first glance this struck me as nothing more than a simple problem, but on reflection it became clear that there is more to it. In some parts of the IC there’s an attitude that every tool should be used and consulted, and that all available data sources should be tapped. While having an arsenal of data is no doubt useful in many situations, there are many more where it becomes a burden for both the teams producing intelligence and the ultimate consumer of the intelligence.
The problem comes down to confusing “enough information” with the “right information.” Chess players may understand this point better than anyone, especially when playing against a machine that has been loaded with every possible combination of pieces on the board. There’s an anecdotal chess story that boils down to this:
  • An average chess player thinks two or three moves ahead.
  • A good chess player thinks five moves ahead.
  • A great chess player thinks ten moves ahead.
  • A master chess player thinks one move ahead – but it’s the right move.
The moral of the story isn’t about forgoing planning, or about ego, but about not getting caught up in the data of future possibilities and future moves — about avoiding paralysis by analysis. It’s the same reason smart analysts insist on getting the bottom line up front: they want the information, not to drown in the data.
* The data centers are still quite useful, when used properly.
