Drinking the Big Data Kool-Aid

Electrical conduits in a server room in New York City. (AP Photo/Mark Lennihan)

One of the terms that has gotten a lot of play in the media’s coverage of the NSA surveillance program is “big data.” It’s a relatively new term for data sets that are so large they become hard to process and analyze. The data encompassed by the term is the digital trail of every keystroke we make: in emails, cellphone calls, credit card purchases, Google searches, tweets, Facebook status updates, etc. The list goes on and on.

In Big Data: A Revolution That Will Transform How We Live, Work, and Think, published earlier this year, authors Viktor Mayer-Schönberger and Kenneth Cukier try to explain just how much data there is in big data. They write that “in 2013 the amount of stored information in the world is estimated to be around 1,200 exabytes, of which less than 2 percent is non-digital.”

What exactly is an exabyte, you might ask? It is a billion gigabytes. To give a sense of that scale, the authors continue:

There is no good way to think about what this size of data means. If it were all printed in books, they would cover the entire surface of the United States some 52 layers thick. If it were placed on CD-ROMs and stacked up, they would stretch to the moon in five separate piles. In the third century B.C., as Ptolemy II of Egypt strove to store a copy of every written work, the great Library of Alexandria represented the sum of all knowledge in the world. The digital deluge now sweeping the globe is the equivalent of giving every person living on Earth today 320 times as much information as is estimated to have been stored in the Library of Alexandria.

Hundreds, probably thousands, of projects that use this data to improve the way the world works are already underway. A recent New York Times article profiled Mayor Bloomberg’s geek squad and the ways they are using data to solve problems around New York City.

An article in The New Yorker by Gary Marcus, “Steamrolled by Big Data,” notes that the enthusiasm surrounding big data in tech circles is “kind of a new religion.”

In your own life, you’re probably most familiar with the benefits of big data through the recommendation engines on shopping or movie sites, which suggest what you might like based on what people similar to you have liked. Lawrence Lessig told Bill this week that he doesn’t mind seeing ads that are curated for him: “The purpose of that profiling is to narrow the information … pushed into my sphere to that information which I want.”

The New Republic’s Leon Wieseltier agrees with Lessig, with one rather big caveat. He writes, “[T]he study of the consumer is one of capitalism’s oldest techniques. But it is not fine that the consumer is mistaken for the entirety of the person.”

The biggest complaint about big data is that while it’s great at finding correlations, it’s not so great at establishing causality. That concerns many experts, who worry about how the government is vetting the data it has collected and whether it is using that data to predict future criminal behavior, in a sort of Minority Report nightmare scenario.

Regardless of how the government is making use of big data, this week’s revelations have already begun a debate about personal digital data, privacy and policy that should have happened years ago. That, at least, is good news. As Chris Hughes writes in The New Republic:

Technology may continue to grow and become more complex, but that need not preclude debate — and potentially legislation — about how it can and should be used.

The security and privacy crises that have unfolded over the past week are the perfect moment for us to ask ourselves what public policy we should adopt not only to limit the government’s ability to mine data, but the ability of technological systems to store and process this data in the first place.
