I am not sure where the boundary between Data and Big Data lies, but I must be getting close. As the amount of data I deal with grows, my tools and processes have had to adapt significantly to many new challenges. This got me thinking about how the qualities of working with Big Data compare to physical laws.

In this post I will explore the challenges of working with Big Data using analogies with scale, malleability and gravity. As with all analogies, these are not exact; they are just a different way of thinking about a problem.

Scale

“Are the physical laws symmetrical under a change of scale? Suppose we build a certain piece of apparatus, and then build another apparatus five times bigger in every part, will it work exactly the same way? The answer is, in this case, no!” — Richard Feynman, Symmetry in Physical Laws

The first and most obvious property of Big Data is its size. As the amount of data increases, the processing time and the complexity of the tools and processes used increase faster than the data itself. In this way, Big Data is not symmetric under scale.

For example, a tool that analyses 1GB of data can load it entirely into RAM. A tool that analyses 1,000 times more data (1TB) will be more than 1,000 times slower, because it has to use the HDD, which increases complexity and reduces speed and/or accuracy. A tool that analyses 1,000,000 times more data (1PB) will have to use network storage, increasing complexity even more.
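To put rough numbers on this, here is a back-of-the-envelope sketch in Python. The throughput figures are my own order-of-magnitude assumptions, not measurements, and the function name `scan_seconds` is made up for illustration. It only estimates how long a single sequential read of the data would take and ignores every other cost.

```python
# Assumed, order-of-magnitude sequential read rates in GB/s.
# These are illustrative guesses, not benchmarks of any real system.
ASSUMED_THROUGHPUT_GB_PER_S = {
    "ram": 10.0,
    "hdd": 0.15,
    "network": 0.1,
}


def scan_seconds(size_gb: float, medium: str) -> float:
    """Seconds needed to read size_gb once from the given medium."""
    return size_gb / ASSUMED_THROUGHPUT_GB_PER_S[medium]


# The three cases from the text, using decimal units (1TB = 1,000GB).
workloads = [
    ("1GB in RAM", 1, "ram"),
    ("1TB on HDD", 1_000, "hdd"),
    ("1PB on network storage", 1_000_000, "network"),
]

baseline = scan_seconds(1, "ram")
for label, size_gb, medium in workloads:
    seconds = scan_seconds(size_gb, medium)
    print(f"{label}: {seconds:,.1f} s ({seconds / baseline:,.0f}x the 1GB case)")
```

Under these assumptions the 1TB case is far more than 1,000 times slower than the 1GB case, because both the size and the speed of the storage medium changed.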

This asymmetry of data’s scale exists because there are limits in the two dimensions of processing power: getting bigger processors (vertical scale) and getting more processors (horizontal scale).

Vertical scale is the performance of an individual computer. To scale vertically you can buy a bigger, faster computer to process more data. The limit on vertical scale is reached when you have the biggest, fastest computer available and it is still not big or fast enough to process the data. This is a big problem because the amount of data is growing faster than even Moore’s law (which roughly states that the number of transistors on a chip, and with it computing power, doubles about every 24 months) can keep up with.

An interesting implication of the accelerated rate of data growth is:

In the future, we will need more hardware resources just to make the same decision!
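A small sketch of the arithmetic behind that claim. The 24-month hardware doubling comes from the Moore’s law figure above; the 12-month data doubling period is purely my assumption for illustration, as are the function names. It also assumes the decision requires processing all of the data.

```python
HARDWARE_DOUBLING_MONTHS = 24  # Moore's law figure quoted in the text
DATA_DOUBLING_MONTHS = 12      # hypothetical: assumed only for this example


def growth_factor(months: float, doubling_months: float) -> float:
    """How many times larger something is after `months` of steady doubling."""
    return 2 ** (months / doubling_months)


def relative_hardware_needed(months: float) -> float:
    """Machines of that era needed, relative to today, to process all the data."""
    data = growth_factor(months, DATA_DOUBLING_MONTHS)
    hardware = growth_factor(months, HARDWARE_DOUBLING_MONTHS)
    return data / hardware


for years in (2, 4, 8):
    ratio = relative_hardware_needed(years * 12)
    print(f"after {years} years: {ratio:.0f}x the hardware for the same decision")
```

With these assumed rates, each machine gets faster but the data outruns it, so the number of machines needed keeps climbing.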

Horizontal scale is the performance of a group of computers. To scale horizontally you purchase more computers and add them to a cluster to process your data. Horizontal scaling is limited by the bandwidth between the computers and by the way the data processing is coordinated and distributed. The limit is reached when the bandwidth is saturated. At that point, additional computers cannot receive data fast enough to contribute any more processing.
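The saturation point can be sketched with a deliberately simple model: the cluster processes data at the combined rate of its nodes, capped by the shared bandwidth that feeds them. The numbers and function names below are hypothetical, and the model ignores coordination overhead entirely, which in practice makes things worse.

```python
import math


def cluster_throughput(nodes: int, per_node_mb_s: float,
                       shared_bandwidth_mb_s: float) -> float:
    """Data rate the cluster can sustain, capped by the shared bandwidth."""
    return min(nodes * per_node_mb_s, shared_bandwidth_mb_s)


def saturation_point(per_node_mb_s: float, shared_bandwidth_mb_s: float) -> int:
    """Smallest number of nodes at which the shared bandwidth is fully used."""
    return math.ceil(shared_bandwidth_mb_s / per_node_mb_s)


PER_NODE_MB_S = 100.0            # assumed processing rate of one computer
SHARED_BANDWIDTH_MB_S = 1_000.0  # assumed bandwidth feeding the cluster

for nodes in (1, 5, 10, 20, 40):
    rate = cluster_throughput(nodes, PER_NODE_MB_S, SHARED_BANDWIDTH_MB_S)
    print(f"{nodes:>2} nodes: {rate:,.0f} MB/s")

print("saturated at", saturation_point(PER_NODE_MB_S, SHARED_BANDWIDTH_MB_S), "nodes")
```

In this model, going from 10 to 40 nodes quadruples the hardware without processing a single extra byte per second.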

Vertical and horizontal scale cause asymmetries in Big Data processing. Understanding where the limits are in relation to your problem is necessary to avoid running headlong into a large wall.

Malleability

“The deployment of millions cannot be improvised” — Moltke, the first German commander during WWI

Data resists change as its size increases. For example, a simple task like fixing a spelling mistake that occurs in a tenth of the data is easily accomplished in a small data-set. As the data-set grows, questions must be answered, such as: what if the process fails before it finishes? Or what if someone tries to use the data while it is being altered?

As Moltke says above, the bigger the numbers, the more you must plan. Changing the shape of a large amount of data must be well thought out, or you risk fracturing its internal structure.
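One common way to plan for the two questions above, at least for data that lives in a single file, is to write the corrected copy somewhere else and swap it in only once it is complete. The sketch below is a minimal illustration of that idea, not a recipe for any particular data store. The function name `rewrite_atomically` is made up, and it assumes a UTF-8 text file on a filesystem where Python’s `os.replace` is atomic (this is a POSIX requirement).

```python
import os
import tempfile


def rewrite_atomically(path: str, old: str, new: str) -> None:
    """Replace `old` with `new` in a text file without exposing a half-written file."""
    directory = os.path.dirname(os.path.abspath(path))
    # Create the temporary copy in the same directory so the final rename
    # stays on one filesystem.
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    try:
        with os.fdopen(fd, "w", encoding="utf-8", newline="") as dst, \
                open(path, encoding="utf-8", newline="") as src:
            # Stream line by line so the whole file is never held in memory.
            for line in src:
                dst.write(line.replace(old, new))
            dst.flush()
            os.fsync(dst.fileno())  # push the copy to disk before the swap
        # Readers see either the old file or the new one, never a mixture.
        os.replace(tmp_path, path)
    except BaseException:
        # A failure part-way through leaves the original file untouched.
        if os.path.exists(tmp_path):
            os.remove(tmp_path)
        raise
```

Calling `rewrite_atomically("data.txt", "teh", "the")` would fix that misspelling throughout a hypothetical `data.txt`. The cost is a second full copy of the data on disk while the change runs, which is exactly the kind of thing that needs planning at scale. A mistake that spans a line break would not be caught by this line-by-line version.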

Gravity

“It is all driven by a need to answer a question, what is the question?” — Evan Stubbs, Big Data, Big Innovation

  1. Collect data to answer questions
  2. Answers raise more questions
  3. Go to 1

Data attracts data like a gravitational force. The price of successfully using your data to answer questions will be the requirement to collect more data. This is not a problem in and of itself, but it makes the problems of scale and malleability worse. Given that data attracts data, and that the more data you have the more problems you have to overcome, good luck!

Conclusion

I am reasonably new to the field of Big Data and I am still working things out. Thinking about a problem from a different perspective often helps me ground the problems I am facing, and look for solutions from other places.