I recently had some discussions with people about virtualization and Big Data with regard to I/O performance. After years of work bringing servers into virtual environments on VMware, Citrix and other products, the move to Big Data techniques brings a new paradigm which does not fit the old one…
In recent years the mantra was to bring everything into a virtual environment. Bare-metal machines were used only to run the virtualization software that hosted the virtual servers. All servers ran on virtual disks provided by a fast, large, central storage system, so that virtual servers could be migrated from one host to another in case of hardware issues.
For Big Data, virtualization is no longer optimal. The paradigm is to use standard (but good and reliable) hardware and to put the software for storage and computation directly onto these bare-metal machines. Multiple machines share data and computation to provide the services needed. The main advantage is that all disk I/O stays local to the hardware. Together with partitioned storage like HDFS, the main I/O bottleneck to central storage is eliminated. This is also one of the main reasons for Apache Spark’s architecture: “Data is expensive to move, so Spark focuses on performing computations over the data, no matter where it resides.” (from “Spark: The Definitive Guide” by Bill Chambers & Matei Zaharia)
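The “move the computation to the data” idea can be sketched in a few lines of plain Python. This is a toy illustration, not Spark’s actual API: the node names and the `local_sum` function are made up for the example, and the dictionary stands in for data blocks that would really live on separate machines.

```python
# Toy sketch of data locality: instead of shipping large data blocks to a
# central machine, ship a small function to each "node" and collect only
# the tiny partial results (a map/reduce over local partitions).

partitions = {            # hypothetical cluster: node name -> local data block
    "node-1": [1, 2, 3, 4],
    "node-2": [5, 6, 7],
    "node-3": [8, 9, 10],
}

def local_sum(block):
    # Runs where the data lives; only one number leaves each node.
    return sum(block)

# "Map" step: each node computes its partial result locally.
partial_results = {node: local_sum(block) for node, block in partitions.items()}

# "Reduce" step: only the small partials cross the network.
total = sum(partial_results.values())
print(total)  # 55
```

The network traffic here is three integers instead of ten list elements; with real HDFS block sizes (128 MB and up), that difference is exactly the I/O bottleneck the bare-metal, data-local architecture avoids.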