Accelerating big data analytics with flash caching

This article was originally published on Technology.Info.
As part of our continuing strategy for growth, ITProPortal has joined forces with Technology.Info to help us bring you the very best coverage we possibly can.

The global volume, velocity and variety of data are all increasing, and these three dimensions of the data deluge — the massive growth of digital information — are what makes Hadoop software ideal for big data analytics. Hadoop is purpose-built for analysing a variety of data, whether structured, semi-structured or unstructured, without the need to define a schema or otherwise anticipate results in advance. Enhancements in the open source Hadoop ecosystem now make it possible for IT managers to keep pace with the increased rate of data generation by streaming live data into the cluster for near-real-time analysis. But Hadoop’s biggest advantage is its scalability: Hadoop enables an unprecedented volume of data to be analysed quickly and cost-effectively on clusters of commodity servers.

Hadoop’s ability to distribute MapReduce jobs across a cluster to process data in parallel enables linear, scalable performance through the addition of more servers, and more processor cores, more memory or both in those servers. But there is now a more cost-effective option for scaling performance in Hadoop clusters: high-performance read/write flash cache acceleration cards. The performance improvements possible with flash cache acceleration are best understood with some background on the traditional means of scaling performance in Hadoop clusters.

Scaling Hadoop performance: A historical perspective

On the surface, using a storage area network (SAN) or network attached storage (NAS) for big data (datasets measured in terabytes, petabytes and even exabytes) might seem to make economic sense. After all, SAN and NAS configurations are very efficient, enabling large datasets to be shared readily and cost-effectively among any number of servers and applications.

The problem with storage networks, however, is the distance (and, therefore, the increased latency) they place between the processor and the data. The closer the data to the processor, the faster the performance, and big data analytics benefits from this proximity. This fundamental principle of data proximity is what has guided the Hadoop architecture, and is the main reason for Hadoop’s success as a high-performance big data analytics solution.

To keep the data close to the processor, Hadoop uses servers with direct-attached storage (DAS). And to get the data even closer to the processor, the servers are usually equipped with significant amounts of random access memory (RAM).

With the use of DAS and ample RAM maximising the performance of each server, Hadoop is able to scale both processing performance and dataset size by adding more servers, or nodes, in a cluster. Some or all of the nodes in the cluster are then used to process large datasets in parallel and in more manageable steps via the Hadoop Distributed File System (HDFS) and the JobTracker, which coordinates the concurrent processing of MapReduce jobs.

The Map function uses a master node to read input files, partition the dataset into smaller subsets, and distribute the processing of these subsets to worker nodes. The worker nodes can, in turn, further distribute the processing, creating a hierarchical structure coordinated by the JobTracker. As a result, multiple nodes in any cluster are able to work on much smaller portions of the analysis in parallel, giving Hadoop its linear scalability.

During the Reduce function, the master node accepts the processed results from all worker nodes, and then shuffles, sorts and merges them to an output file, again under control of the JobTracker. The output can, optionally, become the input to additional MapReduce jobs that further process the data.

Depending on the nature of the MapReduce jobs run by the application, bottlenecks can form either in the network or in the individual server nodes. These bottlenecks can often be eliminated by adding more servers (for more parallel processing power), more processor cores (when servers are CPU-bound), or more RAM (when servers are memory-bound).

With MapReduce jobs, a server’s maximum performance is usually determined by its maximum RAM capacity. This is particularly true during the Reduce function, when intermediate data shuffles, sorts and merges exceed the server RAM size, forcing the processing to be performed with input/output (I/O) to disk.

As the need for I/O to disk increases, performance degrades considerably. The reason is the five orders of magnitude difference between I/O latency for memory (at 100 nanoseconds) and hard disk drives (at 10 milliseconds). Slow storage I/O is rooted in the mechanics of traditional hard disk drive (HDD) platters and actuator arms. I/O latency to a SAN or NAS is substantially higher because of the intervening network — the very reason Hadoop uses DAS. But even with the use of DAS and fast-spinning, short-stroked HDDs, the increased latency of I/O to disk imposes a severe performance penalty.

One of the most cost-effective ways to break through the disk-to-I/O bottleneck and further scale the performance of the Hadoop cluster is to use solid state flash memory for caching.


Scaling Hadoop performance with flash caching

Data has been cached from slower to faster media since the advent of the mainframe computer, and it remains an essential function in every computer today. Data is also cached at multiple levels and in different locations throughout a data centre — from the L1 and L2 cache built into server processors to the dynamic RAM (DRAM) caching in the controllers used for SANs and NAS.

The long, widespread use of caching demonstrates its enduring ability to deliver substantial and cost-effective performance improvements. For example, PCs constantly cache data and software from the HDD to main memory to improve I/O throughput and application performance. The operating system’s file subsystem uses special algorithms to continually, automatically and transparently identify and move frequently accessed, or hot data, to the cache to improve the hit rate.

Servers also cache data and software to main memory whenever possible. When a server is equipped with its full complement of DRAM memory and that memory is fully utilised by applications, the only way to increase caching capacity is to add a different type of memory. One option is NAND flash. With an I/O latency of 50 to 100 microseconds, NAND flash is up to 200 times faster than a high-performance HDD.

The principle of data proximity also applies to caching. This is why flash cache acceleration typically delivers the highest performance gains when the card is placed directly in the server on the PCI Express (PCIe) bus. Some flash cache cards are now available with multiple terabytes of solid state storage, which substantially increases the hit rate. And a new class of solution offers both internal flash and serial-attached SCSI (SAS) interfaces to create high-performance DAS configurations consisting of solid state and hard disk drive storage, coupling the performance benefits of flash with the capacity and cost advantages of HDDs.

Testing cluster performance with and without caching

To compare cluster performance with and without caching, LSI used the widely accepted TeraSort benchmark. TeraSort tests performance in applications that sort large numbers of 100-byte records using gigabytes or terabytes of data, and that also require a considerable amount of computation, networking and storage I/O — all characteristics of real-world Hadoop workloads.

The TeraSort benchmark consists of generation, sorting and validation. A TeraSort test can take minutes or hours to complete, with the time required being inversely proportional to the number of servers in the cluster and the processing power (CPU cores and RAM, and in this test, the flash caching) of the servers.

LSI used an eight-node cluster for its 100-gigabyte (GB) TeraSort test. The test configuration consisted of a 10 gigabit/second Ethernet switch connecting the eight servers; each server was equipped with 12 CPU cores, 64GB of RAM and eight 1-terabyte HDDs. Each server was also outfitted with an LSI model NMR-8100-4i Nytro MegaRAID acceleration card containing 100GB of NAND flash memory. The acceleration card’s flash memory was deactivated for the test without caching.


The 100GB of NAND flash memory is beneath the shield with the LSI logo on this Nytro MegaRAID card used in the TeraSort benchmark test.

No software change was required because the flash caching is transparent to the server applications, operating system, file subsystem and device drivers. Notably, RAID (redundant arrays of independent disks) storage is not normally used in Hadoop clusters because of the way the Hadoop distributed file system replicates data among nodes. So while the RAID capability of the Nytro MegaRAID acceleration card would not be used in Hadoop clusters (and was not used in this test), its implementation in a RAID-on-Chip (RoC) integrated circuit adds little to the cost of the card.
When flash caching was activated, the TeraSort test consistently completed approximately 33 per cent faster. Specifically, the job normally completed in about three minutes with caching and about four minutes without caching. The actual times for one typical caching and non-caching test were 3:02 and 4:08, respectively.
Saving cash with cache
The 33 per cent performance improvement from caching scales in proportion to the size of the cluster needed to complete a specific MapReduce or other job within a required run time. Based on results from the TeraSort benchmark performance test, the table below compares total cost of ownership (TCO) of two cluster configurations — with and without caching — that are both capable of completing the same job in the same amount of time.

Without caching

With caching

Number of servers



Servers (MSRP of $6,280 (£3,895))

$6.28m (£3.9m)

$4.71m (£2.9m)

Nytro MegaRAID Cards (MSRP of $1,799 (£1,115))


$1,349,250 (£0.8m)

Total hardware costs

$6.28m (£3.9m)

$6,059,250 (£3.8m)

Costs for rack space, power, cooling and administration over three years *

$19.61m (£12.2m)

$14,707,500 (£9.1m)

Three-year total cost of ownership

$25.89m (£16.1m)

$20,766,750 (£12.9m)

* Computed using data from the Uptime Institute, an independent division of The 451 Group

The tests showed that using fewer servers to accommodate the same processing time requirement can reduce TCO by 20 per cent, or $5.1 million (£3.2 million), over three years.


Organisations using big data analytics now have another option for scaling performance: flash cache acceleration cards. While these tests centered on Hadoop, LSI’s extensive testing with various databases and other popular applications consistently demonstrates performance improvement gains ranging from a factor of three (for DAS configurations) to a factor of 30 (for SAN and NAS configurations).

The big data performance gains with flash caching are also possible with smaller datasets. In virtualised environments, for example, maximising server utilisation requires maximising performance. The higher the application-level performance, the more work the servers can perform; and the more work they perform, the higher their utilisation. The same applies to dedicated servers, where caching improves performance and reduces response times.

Big data is only as useful as the analytics that organisations use to unlock its full value, making Hadoop a powerful tool for analysing data to gain deeper insights in science, research, government and business. And big data analytics will see increasing adoption as more organisations discover that the bigger the data sets, the smarter the analytics. Servers need to be smarter and more efficient too. Flash caching enables fewer servers (with fewer software licenses) to perform more work more cost-effectively for data sets large and small — a perfect fit for IT managers working to do more with less under the growing pressure of the data deluge.