Empirical science has four steps: observation,
hypothesis, experimentation and synthesis. Two of those - observation,
which requires noting a previously unexplained phenomenon, and
experimentation, which is designed to produce evidence - are based,
respectively, on collecting and generating data for further analysis.
Data is the bedrock of the scientific method, because it is only through data
that established, often cherished, assumptions about how the world works
may legitimately be challenged. So you’d think there couldn’t be any such
thing as ‘too much’ data, right?
“Wrong,” says David Weinberger. In a new
book with the wonderfully expressive title Too Big to Know: Rethinking Knowledge Now That the Facts
Aren’t the Facts, Experts Are Everywhere, and the Smartest Person in the Room
Is the Room, Weinberger argues that our ability to generate data
is outstripping not only our ability to organize it in such a way that we can
draw useful conclusions from it, but even our cognitive ability to grasp the
complexities of the phenomena we are trying to understand.
This isn’t a new phenomenon. As
Weinberger notes, in 1963 - right around the same time that Edward Lorenz
formulated chaos theory to explain the inherent impossibility
of predicting the long-term behaviour of non-linear systems like weather - Bernard
Forscher of the Mayo Clinic published a letter in Science entitled “Chaos in
the Brickyard”, in which he complained that scientists were generating -
of all things - too many facts. According to Weinberger,
...the letter warned that the new
generation of scientists was too busy churning out bricks — facts — without
regard to how they go together. Brickmaking, Forscher feared, had become an end
in itself. “And so it happened that the land became flooded with bricks. … It
became difficult to find the proper bricks for a task because one had to hunt
among so many. … It became difficult to complete a useful edifice because, as
soon as the foundations were discernible, they were buried under an avalanche
of random bricks.” [Note A]
The situation today, Weinberger argues, is
astronomically - and I use the word “astronomically” quite deliberately, in the
sense of “several orders of magnitude” - worse than it has ever been. In
an article introducing his book, he puts the problem thus:
There are three basic reasons
scientific data has increased to the point that the brickyard metaphor now
looks 19th century. First, the economics of deletion have changed. We used to
throw out most of the photos we took with our pathetic old film cameras
because, even though they were far more expensive to create than today’s
digital images, photo albums were expensive, took up space, and required us to
invest considerable time in deciding which photos would make the cut. Now, it’s
often less expensive to store them all on our hard drive (or at some website)
than it is to weed through them.
Second, the economics of sharing have
changed. The Library of Congress has tens of millions of items in storage
because physics makes it hard to display and preserve, much less to share,
physical objects. The Internet makes it far easier to share what’s in our
digital basements. When the datasets are so large that they become unwieldy
even for the Internet, innovators are spurred to invent new forms of sharing.
The ability to access and share over the Net further enhances the new
economics of deletion; data that otherwise would not have been worth storing
have new potential value because people can find and share them.
Third, computers have become
exponentially smarter. John Wilbanks, vice president for Science at Creative
Commons (formerly called Science Commons), notes that “[i]t used to take a year
to map a gene. Now you can do thirty thousand on your desktop computer in a
day.” [Note A]
The result, Weinberger argues, is actually
worse than Forscher predicted. We are not merely awash in “bricks/facts”
and suffering from a severe shortage of “theory-edifices” to organize them
with; the profusion of data has revealed to us systems of such massive
intricacy and interdependence that we are incapable of visualizing, let
alone formulating, comprehensive theories to reduce them to manageable - by
which is meant, predictable - rules of behaviour. Scientists facing “data
galaxies” are being forced down one of two paths. The first path -
reductionism - attempts to derive overarching, general principles that seem to
work well enough to account for the bulk of the data. Historically,
searching for “universals” amid the vast sea of “particulars” has been the
preferred approach of empirical science, largely because there tend to
be many fewer universals than there are particulars in the physical world,
and also because if you know the universals - for example, Newton’s laws of
motion and the fact that gravitational attraction between two objects
varies inversely as the square of the distance separating them - then you can
often deduce the particulars, for example, where you should aim your
Saturn V rocket in order to ensure that the command and service module achieves Lunar
orbit instead of ending up headed for the heliopause.
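To see just how much work a single universal does, here is a toy sketch - plain Python, with rough, illustrative masses and distances of my own choosing, nothing from Weinberger - that uses nothing but the inverse-square law to hand back particulars on demand:

```python
# Deducing "particulars" from a "universal": Newton's law of gravitation,
# F = G * m1 * m2 / r**2. The masses and distances below are rough,
# illustrative values only.

G = 6.674e-11          # gravitational constant, m^3 kg^-1 s^-2
M_EARTH = 5.972e24     # mass of the Earth, kg
M_CSM = 2.9e4          # approximate mass of an Apollo command/service module, kg

def gravitational_force(m1, m2, r):
    """Force in newtons between two point masses r metres apart."""
    return G * m1 * m2 / r**2

# One universal, as many particulars as you like: the pull on the spacecraft
# in low Earth orbit versus at ten times that distance.
for r in (6.7e6, 6.7e7):
    print(f"r = {r:.1e} m  ->  F = {gravitational_force(M_EARTH, M_CSM, r):.2e} N")
```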
The second path - modelling - involves the
construction of tunable mathematical models that can be tweaked to produce
outputs that mimic, as closely as possible, observational data without
attempting to derive an overarching theory of how the system that produced the
data actually functions. This has been the preferred approach for
dealing with complex phenomena that do not easily lend themselves to reduction
to “universals”. When it comes to complex systems, however, both
paths exhibit crippling flaws. Overarching theories that explain
some data, but not all of it, can help us to approach a solution, but they
cannot produce “settled science” because unexplained data are by definition the
Achilles Heel of any scientific theory. Meanwhile, modelling, even
if it can come close to reproducing observed data, cannot replace
observed data, and is never more than a simulation - often a grossly inaccurate
one - of how the real world works, because if we cannot visualize all of the
complexities and interdependencies of a given problem set, we
certainly cannot design mathematical formulae to simulate them.
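For what it’s worth, the modelling path can be caricatured in a few lines. The following is a deliberately crude Python toy of my own - made-up observations, a made-up linear model - but it captures the essence of the approach: twiddle the free parameters until the output mimics the data, and say nothing whatsoever about mechanism:

```python
# A crude sketch of the "modelling" path: tune a model's free parameters
# until its output mimics the observations, without any claim about the
# mechanism that produced them. Data and model are both made up.

observed = [2.1, 3.9, 6.2, 7.8, 10.1]      # hypothetical measurements at t = 1..5

def model(t, a, b):
    """A tunable stand-in: output = a * t + b."""
    return a * t + b

def mismatch(a, b):
    """Sum of squared differences between model output and observations."""
    return sum((model(t, a, b) - y) ** 2 for t, y in enumerate(observed, start=1))

# Brute-force parameter sweep: keep whichever (a, b) best mimics the data.
candidates = ((a / 10, b / 10) for a in range(0, 50) for b in range(-20, 21))
best = min(candidates, key=lambda p: mismatch(*p))
print("best-fit parameters:", best, " residual:", round(mismatch(*best), 3))
```

The point, of course, is that a “good fit” at the end of this exercise tells you nothing about why the numbers came out the way they did - which is precisely the weakness described above.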
As if these two unacceptable paths were not bad
enough, perhaps the most alarming deduction that Weinberger draws from his
analysis is that these data-driven trends are forcing us towards a “new
way of knowing” that is, from a human perspective, almost entirely “virtual.” According
to Weinberger, only computers can generate the vast quantities of data
necessary to investigate a given phenomenon; only computers have the
capacity to organize and store so much data; and only computers
can perform the unthinkable number of calculations necessary to process the
data and create modelled outputs. Our role, such as it is, has been to
create the mathematical models for the computers - but even this last
redoubt of human involvement is collapsing as we become
increasingly incapable of visualizing the scope and interrelationships
of the problems we are trying to solve. In other words, the future of
research into complex interdependent phenomena is gradually departing the human
domain, simply because we lack the cognitive wherewithal needed to cope with
it.
Does this mean that scientific inquiry is
doomed to leave the human sphere entirely? Or that it’s destined to grind
to a halt unless we can come up with an AI that mimics human cognition to the
point of being able to visualize and craft investigative solutions to huge,
complex problems? I certainly don’t dispute that science has
developed the ability to drown itself in data; but there are ways of dealing
with preposterous quantities of information. As I believe I’ve pointed
out before, historians are accustomed
to being drowned in data. From the point of view of the science of
historiography, this is not a new problem. Take, for example, a
relatively straightforward historical event - the Battle of Waterloo.
Working from basic numbers, Napoleon had about 72,000 troops, and the Allies
under Wellington had about 118,000. Reducing the affair to nothing more
than a series of iterative dyadic interactions (which I am the first to
admit is a wholly ludicrous proposition, but which frankly is no more ridiculously
inappropriate than some of the assumptions made in climate modelling - for
example, the assumption that clouds warm the Earth, when observed data suggest
that they cool it) suggests that there were about 8.5 billion potential
dyadic interactions in the first iteration alone. And that’s only the
individuals. Nothing is so insignificant that it can safely be eliminated from your
model, and each increase in the fidelity of the data you input ought to help
refine the accuracy of your output (unless you’re trying to model a non-linear
system, in which case the fidelity of your input doesn’t matter because
inherent instabilities will quickly overwhelm the system (see “Nonlinearity and
the indispensability of data”, 15 June 2011)). In the quest for greater
fidelity/realism, you would need to model the characteristics and
behaviour not just of each individual but also of their weapons, their
clothing, their boots or shoes, their health and physical condition, their
horses, their guns, their limbers, their ammunition, the terrain, the
obstacles, the weather, the psychological vagaries of individual soldiers and
commanders...how many details are we talking about here? Are we getting
to the level of a “non-visualizable” problem yet?
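(For the arithmetically inclined, the 8.5 billion figure above is nothing more exotic than pairing every soldier on one side with every soldier on the other for a single iteration - a back-of-the-envelope check in Python:)

```python
# Back-of-the-envelope check on the "8.5 billion" figure: pair every soldier
# on one side with every soldier on the other, for one iteration only.
french = 72_000
allies = 118_000
print(f"{french * allies:,} potential dyadic interactions per iteration")
# -> 8,496,000,000, i.e. roughly 8.5 billion
```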
Is it even possible to model something
this complex? Well, it’s sort of already been done. I’m sure many
of you have seen The
Lord of the Rings: The Two Towers. The final battle at
Helm’s Deep comprised hundreds of Rohirrim and Elves, and ten thousand Uruk-Hai
- and yet there were no more than a hundred live actors in any of the
shots. The final digital effects were created by Weta Digital using a
programme called “Massive”. According to Stephen Regelous, the chap who
programmed the battle sequences, he didn’t really “program” them at all, or at
least, not in the sense of crafting specific movements for every digital effect
in the scenes. That would have been an impossibly daunting task, given the number of individual digital actors involved.
Instead, he made the digital actors into people. “The
most important thing about making realistic crowds”, Regelous explains, “is
making realistic individuals.” To do that, the programme creates “agents”,
each of which is in essence an individual with individual characteristics,
traits, and most important of all, volition:
In Massive, agents’ brains - which look
like intricate flow charts - define how they see and hear, how fast they run
and how slowly they die. For the films, stunt actors’ movements were recorded
in the studio to enable the agents to wield weapons realistically, duck to
avoid a sword, charge an enemy and fall off tower walls, flailing.
Like real people, agents’ body types,
clothing and the weather influence their capabilities. Agents aren’t robots,
though. Each makes subtle responses to its surroundings with fuzzy logic rather
than yes-no, on-off decisions. And every agent has thousands of brain nodes,
such as their combat setting, which has rules for their level of aggression.
When an animator places agents into a simulation, they are released to do what
they will. It’s not crowd control, but anarchy. Each agent makes decisions from
its point of view. [Note B]
In other words, outcomes aren’t predetermined;
the Agents are designed with a range of options built into their makeup,
and a degree of choice about what to do in response to given stimuli.
Kind of like people. While it’s possible to predict likely responses to
certain stimuli, it’s not possible to be certain what a given Agent will do in
response to a given event. “It’s possible to rig fights, but it hasn’t
been done,” Regelous says. “In the first test fight we had 1,000 silver guys
and 1,000 golden guys. We set off the simulation, and in the distance you could
see several guys running for the hills.”
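Massive itself is proprietary, so what follows is emphatically not Regelous’s code - just a toy Python sketch of the general idea as described above: give each agent its own traits, let it blend those traits with what it sees into graded (fuzzy) urges rather than yes/no switches, and leave it a little room for volition. The trait names and numbers are invented for illustration:

```python
import random

# A toy sketch (not Massive's actual architecture) of an "agent" with
# individual traits, fuzzy rather than yes/no decisions, and a degree
# of volition. Trait names and numbers are invented for illustration.

class Agent:
    def __init__(self, name, aggression, nerve):
        self.name = name
        self.aggression = aggression   # 0.0 (timid) .. 1.0 (berserk)
        self.nerve = nerve             # 0.0 (skittish) .. 1.0 (unshakeable)

    def react(self, threat_level):
        """Blend traits and stimulus into graded urges, then choose."""
        urge_to_charge = self.aggression * (1.0 - threat_level)
        urge_to_flee = (1.0 - self.nerve) * threat_level
        # Volition: a small random nudge, so two identical agents facing
        # the same stimulus will not always do the same thing.
        urge_to_charge += random.uniform(-0.1, 0.1)
        return "charge" if urge_to_charge >= urge_to_flee else "flee"

uruk = Agent("Uruk-hai #9551",
             aggression=random.uniform(0.6, 1.0),
             nerve=random.uniform(0.3, 0.9))
print(uruk.name, "->", uruk.react(threat_level=0.7))
```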
If you happen to own the full-up Uber-Geek Pathetic Basement-Dwelling Fan-Boy LOTR collection (I admit nothing!), you
can watch the “Making Of” DVDs for Two Towers, and listen to a much
more in-depth explanation of how chaotic, complex and
unpredictable the behaviour of the Massive-generated battle sequences
was. The programmers ran the Helm’s Deep sequence many, many times,
using the same starting conditions, and always getting different results.
If that sounds familiar, it’s because it’s essentially what Lorenz described as the
behaviour of a non-linear system: tiny differences at the outset blowing up into
wildly different outcomes. It’s chaos in a nutshell.
According to the DVD explanation, the results of individual volition were potentially so unpredictable that the Agents’
artificial intelligence needed a little bit of tweaking to ensure
that the battle scenes unfolded more or less in line with the script.
Early on, for example, the designers had given each Agent a very small
probability of panicking and fleeing the battle, along with a slight increase
to that probability if a neighbouring agent went down hard, and a slightly
greater probability if a neighbouring agent panicked and fled.
Reasonable, right? We all know that panic on the battlefield is
contagious. The problem is that, depending on how the traits of the
Agents were programmed, it could be too contagious. In one of
the simulation runs for the battle, one of the Uruk-Hai Agents near the front
lines apparently panicked and fled at exactly the same time as a couple of
nearby Agents were shot and killed. This sparked a massive, rapidly propagating wave of panic
that resulted in Saruman’s elite, genetically-engineered army dropping
their weapons and heading for the hills. I’m sure King Theoden, Aragorn,
and hundreds of unnamed horse-vikings and elf-archers would’ve preferred it if Jackson had used that particular
model outcome for the film, but it probably wouldn’t have leant itself to
dramatic tension. And it would've been something of a disappointment when Gandalf, Eomer and the Rohirrim showed up as dawn
broke across the Riddermark on the fifth day, only to discover that there was no one left for them to
fight because Uruk Spearman #9551 had gone wobbly a couple of hours before.
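Just to make the mechanism concrete, here is a toy Python simulation of my own - not Weta’s code, and with invented probabilities - of the panic-contagion rule described above: a tiny base chance of fleeing, bumped upward whenever a neighbour goes down or bolts. Every run produces a different outcome from identical starting conditions, and pushing the contagion bonuses up turns a local wobble into a wholesale rout:

```python
import random

# Toy sketch of the panic-contagion rule described above (invented numbers,
# not Weta's code): each fighting agent has a small base chance of fleeing,
# which rises if a neighbour has been killed or has already fled.

def simulate(n_agents=1000, base_panic=0.0002, casualty_rate=0.002,
             killed_bonus=0.03, fled_bonus=0.12, steps=150):
    state = ["fighting"] * n_agents            # other states: "down", "fled"
    for _ in range(steps):
        nxt = state[:]
        for i, s in enumerate(state):
            if s != "fighting":
                continue
            if random.random() < casualty_rate:            # hit by the defenders
                nxt[i] = "down"
                continue
            panic = base_panic
            for j in (i - 1, i + 1):                       # look left and right
                if 0 <= j < n_agents:
                    panic += {"down": killed_bonus,
                              "fled": fled_bonus}.get(state[j], 0.0)
            if random.random() < panic:
                nxt[i] = "fled"
        state = nxt
    return state.count("fled"), state.count("down")

for run in range(1, 4):
    fled, down = simulate()
    print(f"run {run}: {fled} fled, {down} down")
```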
What’s the point of all this? Well, the
link between Forscher’s complaint about the profusion of data-bricks
and Weinberger’s conclusion that computers are taking over the process of
generating, storing, and figuring out what to make of data is obvious.
The problem of deriving “universals” from the colossal mass of “particulars”
that make up modern scientific (and historical) inquiry ought to be obvious,
too, and it’s demonstrated by one of the key weaknesses of history: the impossibility of
experimental repeatability. Steve Regelous could re-run the Battle of
Helm’s Deep as many times as he liked until he got the result that Peter
Jackson wanted - but that’s fiction, and when you’re writing fiction
you can do whatever you like because you’re not tied to any objective standard
other than some degree of plausibility (which in turn depends on the
expectations, experience and gullibility of your audience). But we can’t
re-run the Battle of Waterloo, changing one variable here or there to see
what might have happened differently, because there’s no way to account for all
possible variables, there’s no way to find out whether some seemingly
insignificant variable might have been overwhelmingly important (butterfly
wings, or “for want of a nail”, and all that), and finally there’s no way
to scientifically validate model outputs other than to compare them to observed
data, because historical events are one-time things.
What we can do, though, is try to develop
rules that help us winnow the “universals” from the chaff that makes up
most of history. There were 190,000 “Agents” at Waterloo - but do we
really need to know everything
about them, from their shoe size to how many were suffering from dysentery on
the big day? Do we need all of the details that had to be programmed
into Weta’s “Massive” Agents? Or are there some general principles
that we can use to guide our understanding of history so that we can pull out
what’s important from the gigantic heap of what isn’t?
This is where a good reading - or hopefully
re-reading - of E.H. Carr’s What
is History? comes in handy. In Chapter One, “The Historian
and His Facts”, Carr disputes the age-old adage that “the facts speak for
themselves”, arguing instead that the facts speak only when the historian calls
on them. Figuring out which facts to call on is the historian’s
métier. Separating the gold - the “significant” as opposed to the “accidental” facts
of a given phenomenon - from the dross of historical data is the
historian’s challenge, and Carr’s criterion for judgement of what is “gold”
is generalizability. The key word here, of course, is “judgement” - informed
judgement as a function of human cognition. This is something that can’t
be done by computers.
Not yet, anyway. [Notes C, D]
So for the time being, at least, there’s still
a need for a meat plug somewhere in the analysis chain, if only to provide the
informed judgement that our eventual silicon replacements haven’t quite learned
to mimic. That’s something of a relief.
Cheers,
//Don// (Meat plug)
Notes:
A) http://m.theatlantic.com/technology/archive/2012/01/to-know-but-not-understand-david-weinberger-on-science-and-big-data/250820/
B) http://archives.theonering.net/perl/newsview/8/1043616073
C) Historians looking for a lesson in humility
need only read Heinlein’s The
Moon is a Harsh Mistress, in which a computer that accidentally
becomes artificially intelligent finds itself having to run a Lunar
uprising - and in order to design the best possible plan for doing so,
reads every published history book in a matter of minutes, analyzes them, and
then not only plots a revolution from soup to nuts, but actually calculates the
odds of victory, and recalculates them periodically in response to changing
events. Now that’s what I call “future security analysis.” If only.
D) Some computer programmes are getting pretty
good at matching human cognition. Fortunately, we still hold an edge in
important areas. As long as we can still beat our electronic
overlords at “Rock-Paper-Scissors”, we might, as the following chart
from the genius webcomic xkcd.com suggests, be able to hold the Singularity off for a little longer.
(Source: http://xkcd.com/1002/)