Some Thoughts about Big Data

“Big Data” gets more and more attention in general, and very specifically in software development. But there are some misunderstandings and misinterpretations around the term.

The first point of debate is always: what does “Big Data” mean? There are a lot of definitions around, but the one most useful for my own understanding, and also for discussions, is:

“As soon as we operate on an amount of data which does not fit easily within one single server, we speak about Big Data.”

I do not know anymore where I found this definition or whether I quote it correctly, but the implication is clear: If the data can no longer be put onto a single server (node), it needs to be partitioned.
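The most common way to partition data over nodes is by hashing a record key; the sketch below illustrates the idea (the node count and keys are made up for illustration):

```python
# Minimal sketch of hash partitioning: each record is assigned to one of
# several nodes based on a stable hash of its key.
import hashlib

NUM_NODES = 4

def node_for_key(key: str) -> int:
    # Use a stable hash (Python's built-in hash() is seeded per process).
    digest = hashlib.sha256(key.encode("utf-8")).hexdigest()
    return int(digest, 16) % NUM_NODES

partitions = {n: [] for n in range(NUM_NODES)}
for key in ["user:1", "user:2", "order:17", "order:18"]:
    partitions[node_for_key(key)].append(key)

# Every key lands deterministically on exactly one node.
print(partitions)
```

Real systems refine this (e.g. consistent hashing, so that adding a node does not reshuffle all keys), but the principle is the same.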

At the time of writing, Big Data can be assumed to start at roughly 50 TB. With today’s 12 TB disks and assuming a high-performance RAID0 with four disks, we get 48 TB of total disk space per node. Availability and reliability are assured by multiple nodes holding copies of each data partition (replication), so a local RAID1 or similar is not needed.
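A quick back-of-the-envelope check of these numbers (the replication factor of 3 is my assumption here, borrowed from the HDFS default):

```python
# Rough cluster sizing for the numbers above (all values are assumptions).
disk_tb = 12          # single disk size
disks_per_node = 4    # RAID0 stripe across four disks
replication = 3       # assumed replication factor (HDFS default)
dataset_tb = 50       # the "Big Data" threshold mentioned above

raw_per_node = disk_tb * disks_per_node            # 48 TB raw per node
nodes_needed = -(-dataset_tb * replication // raw_per_node)  # ceiling division

print(raw_per_node, nodes_needed)  # 48 TB/node, 4 nodes minimum
```

So even the smallest “Big Data” dataset, stored with threefold replication, already needs a handful of such nodes.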

Partitioning (with additional replication) is, in my understanding, the most crucial, most underestimated and most interesting aspect of Big Data.

As soon as this point is understood and taken as fact, we can discuss further what it means for applications dealing with Big Data. As soon as we have a look at the CAP theorem, we find the biggest limitation of all: together with partitioning, we can only have either availability or consistency (or a compromise of both). A lot of software engineers are still not convinced that we cannot have all three, but it is simply not possible. We have to decide what we need in the first place.
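One common way to make this compromise explicit is quorum replication, as used by several distributed stores: with N replicas, a write waits for W acknowledgements and a read queries R replicas. Reads are guaranteed to see the latest write only if R + W > N; smaller quorums favour availability at the price of possibly stale reads. A minimal sketch of the rule (not tied to any particular database):

```python
# Quorum rule: N replicas, W write acknowledgements, R replicas per read.
# Strong consistency requires the read and write sets to overlap: R + W > N.
def is_strongly_consistent(n: int, w: int, r: int) -> bool:
    return r + w > n

# N=3 with majorities on both sides: every read overlaps every write.
print(is_strongly_consistent(3, 2, 2))  # True

# N=3 with W=1, R=1: fast and available, but reads may miss the latest write.
print(is_strongly_consistent(3, 1, 1))  # False
```

Tuning W and R is exactly the “compromise of both” mentioned above, made concrete as two configuration parameters.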

Additionally, there is also a big opportunity: If we partition all data over multiple nodes, all nodes can be utilized to search and transform data. This holds true as long as the partitioning is not virtualized onto a single storage node. I briefly described this in my article “From Virtualization to Big Data: A Change of Paradigm”.
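The effect can be sketched in a few lines: each partition is scanned independently and in parallel, and only the matching records leave their node. Here the “nodes” are simulated by threads over made-up data:

```python
# Sketch: with data partitioned over nodes, a search runs on all partitions
# in parallel; each "node" is simulated by a thread scanning its partition.
from concurrent.futures import ThreadPoolExecutor

partitions = {  # made-up data, one list per node
    "node-0": [3, 14, 15, 92],
    "node-1": [65, 35, 89, 79],
    "node-2": [32, 38, 46, 26],
}

def local_search(records, predicate):
    # Runs "on the node": only matching records leave the partition.
    return [r for r in records if predicate(r)]

with ThreadPoolExecutor() as pool:
    futures = [pool.submit(local_search, recs, lambda r: r > 50)
               for recs in partitions.values()]
    hits = [r for f in futures for r in f.result()]

print(sorted(hits))  # [65, 79, 89, 92]
```

With real nodes instead of threads, the scan time shrinks roughly with the number of partitions, which is exactly the opportunity described above.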

Additionally, there are architecture concepts where computation frameworks like Apache Spark run directly on these storage nodes. The big advantage is that these frameworks can access the local data very fast on the local hardware and perform some calculations locally, so that only a fraction of the data needs to be sent over the network. The CPU on pure storage nodes is mostly not heavily used, so there is room left for at least simple search, filter and aggregation functionality.
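The “send only a fraction of the data” idea is the classic combiner pattern: each node reduces its records to a tiny partial result, and only those partials cross the network. A plain-Python sketch with made-up numbers (real frameworks like Spark apply the same idea via map-side aggregation):

```python
# Sketch of local pre-aggregation: each storage node reduces its own records
# to a small (count, sum) tuple; only these tuples are "sent" over the
# network to compute the global average.
partitions = [
    [10.0, 20.0, 30.0],   # records on node 0
    [40.0, 50.0],         # records on node 1
    [60.0],               # records on node 2
]

def local_aggregate(records):
    # Runs on the storage node: reduces N records to a 2-tuple.
    return (len(records), sum(records))

partials = [local_aggregate(p) for p in partitions]   # three tiny tuples
total_count = sum(c for c, _ in partials)
total_sum = sum(s for _, s in partials)

print(total_sum / total_count)  # 35.0 -- global average from partials only
```

However many records each node holds, the network traffic stays at two numbers per node.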

Overall, distributing all the data over multiple nodes brings a lot of challenges and possibilities…

From Virtualization to Big Data: A Change of Paradigm

I had some discussions lately with people about virtualization and Big Data in regard to I/O performance. After years of work to bring servers into a virtual environment on VMware, Citrix and other products, the change to Big Data techniques brings a new paradigm which does not fit the old one…

In recent years the mantra was to bring everything into a virtual environment. Bare-metal machines were used only to run the virtualization software that hosted the virtual servers. All servers ran on virtual disks provided by a fast, large, central storage system, which allowed the virtual servers to be migrated from one piece of hardware to another in case of hardware issues.

For Big Data, virtualization is not optimal anymore. The paradigm is to use standard hardware (but good and reliable) and to bring the software for storage and computation onto these bare-metal machines. Multiple machines share data and computation to provide the services needed. The main advantage is that all disk I/O stays local on each machine. Together with partitioned storage like HDFS, the main I/O bottleneck to central storage is eliminated. This is also one of the main reasons for Apache Spark’s architecture: “Data is expensive to move, so Spark focuses on performing computations over the data, no matter where it resides.” (from “Spark: The Definitive Guide” by Bill Chambers & Matei Zaharia)