Thursday, October 18, 2012

11 January 2011 – The impossibility of data management


Empirical science has four steps: observation, hypothesis, experimentation and synthesis.  Two of those - observation, which requires noting a previously unexplained phenomenon, and experimentation, which is designed to produce evidence - are based, respectively, on collecting and generating data for further analysis.  Data is the bedrock of the scientific method, because it is only through data that established, often cherished, assumptions about how the world works may legitimately be challenged.  So you’d think there couldn’t be any such thing as ‘too much’ data, right?

“Wrong,” says David Weinberger.  In a new book with the wonderfully expressive title Too Big to Know: Rethinking Knowledge Now That the Facts Aren’t the Facts, Experts are Everywhere, and the Smartest Person in the Room is the Room, Weinberger argues that our ability to generate data is outstripping not only our ability to organize it in such a way that we can draw useful conclusions from it, but even our cognitive ability to grasp the complexities of the phenomena we are trying to understand.

This isn’t a new phenomenon.  As Weinberger notes, in 1963 - right around the same time that Edward Lorenz formulated chaos theory to explain the inherent impossibility of predicting the long-term behaviour of non-linear systems like weather - Bernard Forscher of the Mayo Clinic published a letter in Science entitled “Chaos in the Brickyard”, in which he complained that scientists were generating - of all things - too many facts.  According to Weinberger,

...the letter warned that the new generation of scientists was too busy churning out bricks — facts — without regard to how they go together. Brickmaking, Forscher feared, had become an end in itself. “And so it happened that the land became flooded with bricks. … It became difficult to find the proper bricks for a task because one had to hunt among so many. … It became difficult to complete a useful edifice because, as soon as the foundations were discernible, they were buried under an avalanche of random bricks.” [Note A]

The situation today, Weinberger argues, is astronomically - and I use the word “astronomically” quite deliberately, in the sense of “several orders of magnitude” - worse than it has ever been.  In an article introducing his book, he puts the problem thus:

There are three basic reasons scientific data has increased to the point that the brickyard metaphor now looks 19th century. First, the economics of deletion have changed. We used to throw out most of the photos we took with our pathetic old film cameras because, even though they were far more expensive to create than today’s digital images, photo albums were expensive, took up space, and required us to invest considerable time in deciding which photos would make the cut. Now, it’s often less expensive to store them all on our hard drive (or at some website) than it is to weed through them.

Second, the economics of sharing have changed. The Library of Congress has tens of millions of items in storage because physics makes it hard to display and preserve, much less to share, physical objects. The Internet makes it far easier to share what’s in our digital basements. When the datasets are so large that they become unwieldy even for the Internet, innovators are spurred to invent new forms of sharing.  The ability to access and share over the Net further enhances the new economics of deletion; data that otherwise would not have been worth storing have new potential value because people can find and share them.

Third, computers have become exponentially smarter. John Wilbanks, vice president for Science at Creative Commons (formerly called Science Commons), notes that “[i]t used to take a year to map a gene. Now you can do thirty thousand on your desktop computer in a day.” [Note A]

The result, Weinberger argues, is actually worse than Forscher predicted.  We are not merely awash in “bricks/facts” and suffering from a severe shortage of “theory-edifices” to organize them with; the profusion of data has revealed to us systems of such massive intricacy and interdependence that we are incapable of visualizing, let alone formulating, comprehensive theories to reduce them to manageable - by which is meant, predictable - rules of behaviour.  Scientists facing “data galaxies” are being forced down one of two paths.  The first path - reductionism - attempts to derive overarching, general principles that seem to work well enough to account for the bulk of the data.  Historically, searching for “universals” amid the vast sea of “particulars” has been the preferred approach of empirical science, largely because there tend to be many fewer universals than there are particulars in the physical world, and also because if you know the universals - for example, Newton’s laws of motion and the fact that gravitational attraction between two objects varies inversely as the square of the distance separating them - then you can often deduce the particulars, for example, where you should aim your Saturn V rocket in order to ensure that the command and service module achieves Lunar orbit instead of ending up headed for the heliopause.
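That deduce-the-particulars-from-the-universals move can be made concrete with a back-of-envelope calculation.  The sketch below uses only standard textbook constants (nothing from Weinberger): knowing the inverse-square law of gravitation, you can deduce the particular speed at which the Moon must orbit the Earth, and it matches what astronomers observe.

```python
import math

# Universal: Newton's law of gravitation, F = G*M*m / r**2 (inverse-square).
# Particular deduced from it: the speed of a body in a circular orbit,
# found by equating gravitational force with centripetal force:
#   G*M*m / r**2 = m*v**2 / r   =>   v = sqrt(G*M / r)

G = 6.674e-11        # gravitational constant, m^3 kg^-1 s^-2
M_EARTH = 5.972e24   # mass of the Earth, kg

def circular_orbit_speed(r_metres: float) -> float:
    """Speed required for a circular orbit of radius r around the Earth."""
    return math.sqrt(G * M_EARTH / r_metres)

# The Moon orbits at roughly 384,400 km; the deduced speed, about 1.02 km/s,
# agrees with the Moon's observed orbital speed.
v_moon = circular_orbit_speed(3.844e8)
print(f"{v_moon:.0f} m/s")
```

Two constants and one universal law pin down a particular fact about the sky - which is precisely why reductionism has historically paid off so handsomely.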

The second path - modelling - involves the construction of tunable mathematical models that can be tweaked to produce outputs that mimic, as closely as possible, observational data without attempting to derive an overarching theory of how the system that produced the data actually functions.  This has been the preferred approach for dealing with complex phenomena that do not easily lend themselves to reduction to “universals”.  When it comes to complex systems, however, both paths exhibit crippling flaws.  Overarching theories that explain some data, but not all of it, can help us to approach a solution, but they cannot produce “settled science” because unexplained data are by definition the Achilles’ heel of any scientific theory.  Meanwhile, modelling, even if it can come close to reproducing observed data, cannot replace observed data, and is never more than a simulation - often a grossly inaccurate one - of how the real world works, because if we cannot visualize all of the complexities and interdependencies of a given problem set, we certainly cannot design mathematical formulae to simulate them.

As if these two unacceptable paths were not bad enough, perhaps the most alarming deduction that Weinberger draws from his analysis is that these data-driven trends are forcing us towards a “new way of knowing” that is, from a human perspective, almost entirely “virtual.”  According to Weinberger, only computers can generate the vast quantities of data necessary to investigate a given phenomenon; only computers have the capacity to organize and store so much data; and only computers can perform the unthinkable number of calculations necessary to process the data and create modelled outputs.  Our role, such as it is, has been to create the mathematical models for the computers - but even this last redoubt of human involvement is collapsing as we become increasingly incapable of visualizing the scope and interrelationships of the problems we are trying to solve.  In other words, the future of research into complex interdependent phenomena is gradually departing the human domain, simply because we lack the cognitive wherewithal needed to cope with it.

Does this mean that scientific inquiry is doomed to leave the human sphere entirely?  Or that it’s destined to grind to a halt unless we can come up with an AI that mimics human cognition to the point of being able to visualize and craft investigative solutions to huge, complex problems?  I certainly don’t dispute that science has developed the ability to drown itself in data; but there are ways of dealing with preposterous quantities of information.  As I believe I’ve pointed out before, historians are accustomed to being drowned in data.  From the point of view of the science of historiography, this is not a new problem.  Take, for example, a relatively straightforward historical event - the Battle of Waterloo.  Working from basic numbers, Napoleon had about 72,000 troops, and the Allies under Wellington had about 118,000.  Reducing the affair to nothing more than a series of iterative diarchic interactions (which I am the first to admit is a wholly ludicrous proposition, but which frankly is no more ridiculously inappropriate than some of the assumptions made in climate modelling - for example, the assumption that clouds warm the Earth, when observed data suggest that they cool it) suggests that there were about 8.5 billion potential diarchic interactions (72,000 × 118,000) in the first iteration alone.  And that’s only the individuals.  Nothing is too insignificant to be eliminated from your model, and each increase in the fidelity of the data you input ought to help refine the accuracy of your output (unless you’re trying to model a non-linear system, in which case the fidelity of your input doesn’t matter because inherent instabilities will quickly overwhelm the system (see “Nonlinearity and the indispensability of data”, 15 June 2011)).
In the quest for greater fidelity/realism, you would need to model the characteristics and behaviour not just of each individual but also of their weapons, their clothing, their boots or shoes, their health and physical condition, their horses, their guns, their limbers, their ammunition, the terrain, the obstacles, the weather, the psychological vagaries of individual soldiers - and on, and on.  How many details are we talking about here?  Are we getting to the level of a “non-visualizable” problem yet?
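The arithmetic behind that 8.5 billion figure is simple enough to check for yourself.  A quick sketch, using the troop strengths quoted above (and, for comparison, the even larger count you get if any soldier can interact with any other, friend or foe):

```python
# Troop strengths at Waterloo, as quoted above.
french = 72_000
allies = 118_000

# Diarchic (one-on-one) interactions across the two sides:
# each of Napoleon's soldiers paired with each of Wellington's.
cross_side_pairs = french * allies                            # 8,496,000,000

# If any soldier can interact with any other, the count roughly doubles:
# n * (n - 1) / 2 pairs among all 190,000 combatants.
n = french + allies
all_pairs = n * (n - 1) // 2                                  # ~18 billion

print(f"cross-side pairs: {cross_side_pairs:,}")
print(f"all pairs:        {all_pairs:,}")
```

And that is just the first iteration of the crudest possible pairwise model, before a single boot, horse or musket enters the picture.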

Is it even possible to model something this complex?  Well, it’s sort of already been done.  I’m sure many of you have seen The Lord of the Rings: The Two Towers.  The final battle at Helm’s Deep comprised hundreds of Rohirrim and Elves, and ten thousand Uruk-Hai - and yet there were no more than a hundred live actors in any of the shots.  The final digital effects were created by Weta Digital using a programme called “Massive”.  According to Stephen Regelous, the chap who programmed the battle sequences, he didn’t really “program” them at all, or at least, not in the sense of crafting specific movements for every digital effect in the scenes.  That would have been an impossibly daunting task, given the number of individual digital actors involved.

Instead, he made the digital actors into people.  “The most important thing about making realistic crowds”, Regelous explains, “is making realistic individuals.”  To do that, the programme creates “agents”, each of which is in essence an individual with individual characteristics, traits, and most important of all, volition:

In Massive, agents’ brains - which look like intricate flow charts - define how they see and hear, how fast they run and how slowly they die. For the films, stunt actors’ movements were recorded in the studio to enable the agents to wield weapons realistically, duck to avoid a sword, charge an enemy and fall off tower walls, flailing.

Like real people, agents’ body types, clothing and the weather influence their capabilities. Agents aren’t robots, though. Each makes subtle responses to its surroundings with fuzzy logic rather than yes-no, on-off decisions. And every agent has thousands of brain nodes, such as their combat setting, which has rules for their level of aggression. When an animator places agents into a simulation, they are released to do what they will. It’s not crowd control, but anarchy. Each agent makes decisions from its point of view. [Note B]

In other words, outcomes aren’t predetermined; the Agents are designed with a range of options built into their makeup, and a degree of choice about what to do in response to given stimuli.  Kind of like people.  While it’s possible to predict likely responses to certain stimuli, it’s not possible to be certain what a given Agent will do in response to a given event.  “It’s possible to rig fights, but it hasn’t been done,” Regelous says. “In the first test fight we had 1,000 silver guys and 1,000 golden guys. We set off the simulation, and in the distance you could see several guys running for the hills.”
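The “fuzzy logic rather than yes-no, on-off decisions” point is worth unpacking.  The toy sketch below is emphatically not Massive’s actual brain-node model - the function and its weights are invented for illustration - but it shows the basic idea: instead of flipping a boolean fight/flee switch, an agent blends continuous inputs into a graded response, so two agents facing the identical threat can behave differently.

```python
def flee_urge(threat: float, aggression: float, allies_nearby: float) -> float:
    """Blend continuous inputs (each scaled 0..1) into a graded urge to flee.

    A fuzzy-style agent weighs *degrees* of threat against *degrees* of
    aggression and support, rather than making an on-off decision.
    (Toy formula for illustration only - not Massive's actual rules.)
    """
    urge = threat * (1.0 - aggression) * (1.0 - 0.5 * allies_nearby)
    return max(0.0, min(1.0, urge))

# A timid, isolated agent feels a far stronger urge to run than an
# aggressive, well-supported one facing exactly the same threat.
print(flee_urge(threat=0.8, aggression=0.2, allies_nearby=0.1))
print(flee_urge(threat=0.8, aggression=0.9, allies_nearby=0.9))
```

Layer a few thousand of these graded nodes together, give each agent slightly different weights, and no two agents will respond to the battlefield in quite the same way - which is the whole trick.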

If you happen to own the full-up Uber-Geek/Pathetic Basement-Dwelling Fan-Boy LOTR collection (I admit nothing!), you can watch the “Making Of” DVDs for Two Towers, and listen to a much more in-depth explanation of how chaotic, complex and unpredictable the behaviour of the Massive-generated battle sequences was.  The programmers ran the Helm’s Deep sequence many, many times, using the same starting conditions, and always got different results.  If that sounds familiar, it’s because it’s exactly what Lorenz described as the behaviour of a non-linear system.  It’s chaos in a nutshell.
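Lorenz’s point is easy to reproduce numerically.  The sketch below (mine, not Weta’s - step size and parameters are the standard textbook choices for Lorenz’s 1963 convection equations) integrates the system twice with starting points that differ by one part in a billion.  The two runs shadow each other for a while, then diverge until they are in completely different states:

```python
def lorenz_step(state, dt=0.001, sigma=10.0, rho=28.0, beta=8.0 / 3.0):
    """Advance the Lorenz system by one simple Euler step."""
    x, y, z = state
    return (x + sigma * (y - x) * dt,
            y + (x * (rho - z) - y) * dt,
            z + (x * y - beta * z) * dt)

a = (1.0, 1.0, 1.0)
b = (1.0 + 1e-9, 1.0, 1.0)   # initial conditions differ by one part in a billion

max_sep = 0.0
for _ in range(40_000):      # 40 time units of simulated "weather"
    a, b = lorenz_step(a), lorenz_step(b)
    max_sep = max(max_sep, abs(a[0] - b[0]))

# The infinitesimal initial difference is amplified to the full size of the
# attractor: the two runs end up in entirely different states.
print(f"max separation: {max_sep:.2f}")
```

That exponential amplification of unmeasurably small differences is why “same starting conditions” is a fiction for non-linear systems: any difference below your measurement precision will do.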

According to the DVD explanation, the results of individual volition were potentially so unpredictable that the Agents’ artificial intelligence needed a little bit of tweaking to ensure that the battle scenes unfolded more or less in line with the script.  Early on, for example, the designers had given each Agent a very small probability of panicking and fleeing the battle, along with a slight increase to that probability if a neighbouring agent went down hard, and a slightly greater probability if a neighbouring agent panicked and fled.  Reasonable, right?  We all know that panic on the battlefield is contagious.  The problem is that, depending on how the traits of the Agents were programmed, it could be too contagious.  In one of the simulation runs for the battle, one of the Uruk-Hai Agents near the front lines apparently panicked and fled at exactly the same time as a couple of nearby Agents were shot and killed.  This sparked a massive, rapidly propagating wave of panic that resulted in Saruman’s elite, genetically-engineered army dropping their weapons and heading for the hills.  I’m sure King Theoden, Aragorn, and hundreds of unnamed horse-vikings and elf-archers would’ve preferred it if Jackson had used that particular model outcome for the film, but it probably wouldn’t have lent itself to dramatic tension. And it would've been something of a disappointment when Gandalf, Eomer and the Rohirrim showed up as dawn broke across the Riddermark on the fifth day, only to discover that there was no one left for them to fight because Uruk Spearman #9551 had gone wobbly a couple of hours before.
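That failure mode - a tiny per-agent panic probability amplified by contagion until the whole army routs - is easy to reproduce in miniature.  The following is a toy sketch of panic contagion on a ring of agents; the probabilities and structure are invented for illustration and bear no relation to Massive’s actual rules.

```python
import random

def simulate_rout(n_agents=1000, base_panic=0.001, contagion=0.4,
                  neighbours=8, rounds=50, seed=1):
    """Toy panic-contagion model on a ring of agents.

    Each round an agent panics with a small base probability, plus an extra
    chance proportional to the fraction of its neighbours already panicking.
    Returns the fraction of the army that has fled by the end.
    """
    rng = random.Random(seed)
    panicked = [False] * n_agents
    for _ in range(rounds):
        snapshot = panicked[:]              # decide from last round's state
        for i in range(n_agents):
            if snapshot[i]:
                continue
            scared = sum(
                snapshot[(i + d) % n_agents]
                for d in range(-neighbours // 2, neighbours // 2 + 1) if d != 0
            )
            p = base_panic + contagion * scared / neighbours
            if rng.random() < p:
                panicked[i] = True
    return sum(panicked) / n_agents

# With contagion, a handful of early panickers can cascade into a mass rout;
# with contagion switched off, panic stays rare and isolated.
print(f"with contagion:    {simulate_rout(contagion=0.4):.1%}")
print(f"without contagion: {simulate_rout(contagion=0.0):.1%}")
```

Tune `contagion` down and the line holds; tune it up and Uruk Spearman #9551 takes the whole army with him.  The designers’ fix was exactly that sort of tuning.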

What’s the point of all this?  Well, the link between Forscher’s complaint about the profusion of data-bricks and Weinberger’s conclusion that computers are taking over the process of generating, storing, and figuring out what to make of data is obvious.  The problem of deriving “universals” from the colossal mass of “particulars” that make up modern scientific (and historical) inquiry ought to be obvious, too, and it’s demonstrated by one of the key weaknesses of history: the impossibility of experimental repeatability.  Steve Regelous could re-run the Battle of Helm’s Deep as many times as he liked until he got the result that Peter Jackson wanted - but that’s fiction, and when you’re writing fiction you can do whatever you like because you’re not tied to any objective standard other than some degree of plausibility (which in turn depends on the expectations, experience and gullibility of your audience).  But we can’t re-run the Battle of Waterloo, changing one variable here or there to see what might have happened differently, because there’s no way to account for all possible variables, there’s no way to find out whether some seemingly insignificant variable might have been overwhelmingly important (butterfly wings, or “for want of a nail”, and all that), and finally there’s no way to scientifically validate model outputs other than to compare them to observed data, because historical events are one-time things.

What we can do, though, is try to develop rules that help us winnow the “universals” from the chaff that makes up most of history.  There were 190,000 “Agents” at Waterloo - but do we really need to know everything about them, from their shoe size to how many were suffering from dysentery on the big day?  Do we need all of the details that had to be programmed into Weta’s “Massive” Agents?  Or are there some general principles that we can use to guide our understanding of history so that we can pull out what’s important from the gigantic heap of what isn’t?

This is where a good reading - or hopefully re-reading - of E.H. Carr’s What is History? comes in handy.  In Chapter One, “The Historian and His Facts”, Carr disputes the ages-old adage that “the facts speak for themselves”, arguing instead that the facts speak only when the historian calls on them.  Figuring out which facts to call on is the historian’s métier.  Separating the gold - the “significant” as opposed to the “accidental” facts of a given phenomenon - from the dross of historical data is the historian’s challenge, and Carr’s criterion for judging what is “gold” is generalizability.  The key word here, of course, is “judgement” - informed judgement as a function of human cognition.  This is something that can’t be done by computers.
Not yet, anyway. [Notes C, D] 

So for the time being, at least, there’s still a need for a meat plug somewhere in the analysis chain, if only to provide the informed judgement that our eventual silicon replacements haven’t quite learned to mimic.  That’s something of a relief.


//Don// (Meat plug)


C) Historians looking for a lesson in humility need only read Heinlein’s The Moon is a Harsh Mistress, in which a computer that accidentally becomes artificially intelligent finds itself having to run a Lunar uprising - and in order to design the best possible plan for doing so, reads every published history book in a matter of minutes, analyzes them, and then not only plots a revolution from soup to nuts, but actually calculates the odds of victory, and recalculates them periodically in response to changing events.  Now that’s what I call “future security analysis.”  If only.

D) Some computer programmes are getting pretty good at matching human cognition.  Fortunately, we still hold an edge in important areas.  As long as we can still beat our electronic overlords at “Rock-Paper-Scissors”, we might, as the following chart from the genius webcartoon suggests, be able to hold the Singularity off for a little longer.