“Big Data” gets more and more attention, both in general and specifically in software development. But there are also some misunderstandings and misinterpretations around the term.
The first point of debate is always: what does “Big Data” actually mean? There are a lot of definitions around, but the one most useful for my own understanding and also for discussions is:
“As soon as we operate on an amount of data which does not fit easily within one single server, we speak about Big Data.”
I no longer remember where I found this definition or whether I quote it correctly, but the implication is clear: if the data does not fit into a single server (node) anymore, it needs to be partitioned.
At the time of writing, Big Data can be assumed to start at roughly 50 TB. With today’s 12 TB disks and a high-performance RAID 0 over four disks, we get 48 TB of total disk space per node. Availability and reliability are assured by multiple nodes holding copies of each data partition (replication), so a local RAID 1 or similar is not needed.
Partitioning (with additional replication) is, in my understanding, the most crucial, most underestimated, and most interesting aspect of Big Data.
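To make partitioning with replication a bit more tangible, here is a minimal sketch in plain Python. The node names, the replication factor, and the hashing scheme are all my own invented illustration, not a real system’s implementation:

```python
import hashlib

NODES = ["node-0", "node-1", "node-2", "node-3"]  # hypothetical cluster
REPLICATION_FACTOR = 2  # each record is stored on two nodes

def nodes_for_key(key: str) -> list[str]:
    """Hash the key to pick a primary node deterministically, then take
    the next node(s) in the ring as replicas."""
    digest = hashlib.sha256(key.encode()).digest()
    primary = int.from_bytes(digest[:8], "big") % len(NODES)
    return [NODES[(primary + i) % len(NODES)]
            for i in range(REPLICATION_FACTOR)]

# Every record is deterministically assigned to a small subset of nodes,
# so no single node has to hold (or fit) the whole data set.
placement = {key: nodes_for_key(key) for key in ["user:1", "user:2", "user:3"]}
```

The point of the sketch is only the principle: the key alone decides where data lives, and the replicas give us the availability and reliability mentioned above without local RAID 1.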
As soon as this point is understood and accepted, we can discuss what it means for an application dealing with Big Data. A look at the CAP theorem reveals the biggest limitation of all: together with partitioning, we can only have either availability or consistency (or a compromise between the two). Many software engineers are still not convinced that we cannot have all three, but it is simply not possible. We have to decide first what we actually need.
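The trade-off can be made concrete with the well-known quorum idea: with N replicas, requiring W acknowledged writes and R read replies guarantees consistent reads whenever W + R > N, at the price of being unavailable when too few replicas respond. The following toy model (all data and function names invented for illustration) shows both settings:

```python
# Toy model of quorum replication across N replicas.
# With W + R > N every read quorum overlaps every write quorum, so a read
# always sees the latest acknowledged write (consistency). Lowering R or W
# keeps the system available when replicas are down, but reads may be stale.

N = 3
replicas = [(0, None)] * N  # each replica holds (version, value)

def write(value, version, w, up):
    """Write to the reachable replicas `up`; succeed only if at least `w` ack."""
    if len(up) < w:
        return False  # not enough replicas reachable: unavailable for writes
    for i in up:
        replicas[i] = (version, value)
    return True

def read(r, up):
    """Read from `r` reachable replicas and return the newest value seen."""
    answers = [replicas[i] for i in up[:r]]
    return max(answers)[1]

# Strict quorum (W=2, R=2, W+R>N): reads always see the latest write.
write("v1", 1, w=2, up=[0, 1, 2])
write("v2", 2, w=2, up=[0, 1])       # replica 2 missed this write
fresh = read(r=2, up=[1, 2])         # quorum overlap guarantees "v2"

# Relaxed quorum (R=1): still answers while partitioned, but may be stale.
stale = read(r=1, up=[2])            # replica 2 only ever saw "v1"
```

This is exactly the compromise the CAP theorem forces on us: the strict quorum is consistent but refuses to operate with too few replicas, the relaxed one always answers but may serve old data.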
On the other hand, there is also a big opportunity: if we partition the data over multiple nodes, all nodes can be utilized to search and transform the data in parallel. This holds true as long as the partitioning is not virtualized away onto a single storage node. I briefly described this in my article about “From Virtualization to Big Data a Change of Paradigm”.
Additionally, there are architecture concepts where computation frameworks like Apache Spark are co-located with these storage nodes. The big advantage is that the computation framework can access the local data very fast on the local hardware and perform calculations locally, so that only a fraction of the data needs to be sent over the network. The CPUs of pure storage nodes are mostly not heavily used, so there is room left for at least simple search, filter, and aggregation functionality.
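As a rough illustration of this data-locality idea in plain Python (the example data and helper names are invented; in Spark this pattern roughly corresponds to map-side combining before a shuffle), each node aggregates its own partition locally and only the small per-node results cross the network:

```python
# Each node holds its own partition; only tiny aggregates leave the node.
partitions = {
    "node-0": [("clicks", 3), ("views", 10)],
    "node-1": [("clicks", 5), ("views", 7)],
    "node-2": [("views", 4)],
}

def local_aggregate(records):
    """Runs on the storage node itself: sum the values per key locally."""
    totals = {}
    for key, value in records:
        totals[key] = totals.get(key, 0) + value
    return totals

def merge(partials):
    """Runs on the coordinator: combine the small per-node partials."""
    result = {}
    for partial in partials:
        for key, value in partial.items():
            result[key] = result.get(key, 0) + value
    return result

totals = merge(local_aggregate(p) for p in partitions.values())
# Only three small dictionaries were "sent", not the raw records.
```

The interesting number here is not the result but the traffic: the coordinator receives one small dictionary per node instead of every raw record.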
Overall, the distribution of the data over multiple nodes brings a lot of challenges, but also a lot of possibilities…