If you typically follow GPU performance as it relates to gaming but have become curious about Bitcoin mining, you’ve probably noticed and been surprised by the fact that AMD GPUs are the uncontested performance leaders in the market. This is in stark contrast to the PC graphics business, where AMD’s HD 7000 series has been playing a defensive game against Nvidia’s GK104 / GeForce 600 family of products. In Bitcoin (BTC) mining, the situation is almost completely reversed – the Radeon 7970 is capable of 550MHash/second, while the GTX 680 is roughly a fifth as fast.
There’s an article at the Bitcoin Wiki that attempts to explain the difference, but the original piece was written in 2010-2011 and hasn’t been updated since. It refers to Fermi and AMD’s VLIW architectures and implies that AMD’s better performance is due to having far more shader cores than the equivalent Nvidia cards.
This isn’t quite accurate, and it doesn’t explain why the GTX 680 is actually slower than the GTX 580 at BTC mining, despite having far more cores.
This article is going to explain the difference, address whether or not better CUDA miners would dramatically shift the performance delta between AMD and Nvidia, and touch on whether or not Nvidia’s GPGPU performance is generally comparable to AMD’s these days.
Topics not discussed here include:
- Investment opportunity.
- Whether or not ASICs (when they arrive next month/this summer/at some point in the future) will destroy the GPU mining market.
These are important questions, but they’re not the focus of this article. We will discuss power efficiency and Mhash/Watt to an extent, because these factors have an impact on comparing the mining performance of AMD vs. Nvidia.
The mechanics of mining
Bitcoin mining is a brute-force application of the SHA-256 algorithm (the 256-bit member of the SHA-2 family). One of the reasons AMD cards excel at mining is that the company's GPUs have a number of features that enhance their integer performance. This is actually something of an oddity; GPU workloads have historically been floating-point heavy, because textures are stored in half (FP16) or full (FP32) precision.
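Concretely, mining means hashing a candidate block header twice with SHA-256, over and over with a different nonce each time, until the digest falls below a difficulty target. A minimal Python sketch of the idea (the function names and the simplified fixed-size header are illustrative, not Bitcoin's exact serialisation):

```python
import hashlib
import struct

def double_sha256(data: bytes) -> bytes:
    # Bitcoin applies SHA-256 twice to every candidate block header
    return hashlib.sha256(hashlib.sha256(data).digest()).digest()

def mine(header_prefix: bytes, target: int):
    # Iterate the 32-bit nonce until the double hash, read as a
    # little-endian integer, falls below the difficulty target
    for nonce in range(2**32):
        header = header_prefix + struct.pack("<I", nonce)
        if int.from_bytes(double_sha256(header), "little") < target:
            return nonce
    return None
```

A card's MHash/second figure is simply how many of these double-hash attempts it completes per second, which is why raw SHA-256 throughput is the only thing that matters here.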
The issue is made more confusing by the fact that when Nvidia started pushing CUDA, it emphasised password cracking as a major strength of its cards. It’s true that GeForce GPUs, starting with G80, offered significantly higher cryptographic performance than CPUs – but AMD’s hardware now blows Nvidia’s out of the water.
The first reason AMD cards outperform their Nvidia counterparts in BTC mining (and the current Bitcoin Wiki entry does cover this) is that the SHA-256 algorithm utilises a 32-bit integer right rotate operation.
In a rotation, the integer value is shifted, but the bits that fall off one end are re-attached at the other; in a right rotation, bits that fall off the right reappear at the left. AMD GPUs can perform this operation in a single step. Prior to the launch of the GTX Titan, Nvidia GPUs required three steps – two shifts and an add.
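In code, the contrast looks like this – a Python model of a 32-bit right rotation, plus one of the Σ functions from the SHA-256 specification that uses it (the GPU does this in registers, of course; the Python is purely illustrative):

```python
MASK32 = 0xFFFFFFFF

def rotr32(x, n):
    # One native instruction on AMD GCN; pre-Titan Nvidia hardware
    # emulates it as two shifts plus a combine, exactly as written here
    return ((x >> n) | (x << (32 - n))) & MASK32

def big_sigma0(x):
    # SHA-256's Sigma-0 function: three rotates XORed together,
    # so the rotate penalty is paid many times per hash
    return rotr32(x, 2) ^ rotr32(x, 13) ^ rotr32(x, 22)
```

Because every round of SHA-256 leans on functions like this, a three-step rotate instead of a one-step rotate compounds into a large throughput gap.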
We say “prior to Titan,” because one of the features Nvidia introduced with Compute Capability 3.5 (only supported on the GTX Titan and the Tesla K20/K20X) is a funnel shifter. The funnel shifter can combine operations, shrinking Nvidia's three-step penalty significantly. We'll look at how much performance improves momentarily, because this isn't GK110's only improvement over GK104. GK110 is also capable of up to 64 32-bit integer shifts per SMX (Titan has 14 SMXs). GK104, in contrast, can only handle 32 integer shifts per SMX, and has just eight SMX blocks.
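The funnel shifter's trick can be modelled in a few lines: it shifts a 64-bit concatenation of two 32-bit registers and keeps 32 bits of the result, and a rotate is just the special case where both inputs hold the same value. This is a sketch of the semantics, not Nvidia's actual instruction encoding:

```python
MASK32 = 0xFFFFFFFF

def funnel_shift_r(hi, lo, n):
    # Concatenate hi:lo into a 64-bit value, shift right by n (0-31),
    # and keep the low 32 bits of the result
    return (((hi << 32) | lo) >> (n & 31)) & MASK32

def rotr32(x, n):
    # A right rotate is a funnel shift with the same value in both halves,
    # which is why one CC 3.5 instruction replaces the shift/shift/combine
    # sequence older Nvidia hardware needed
    return funnel_shift_r(x, x, n)
```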
We’ve highlighted the 32-bit integer shift capability difference between CC 3.0 and CC 3.5.
AMD plays things close to the chest when it comes to Graphics Core Next's (GCN) 32-bit integer capabilities, but the company has confirmed that GCN executes INT32 code at the same rate as double-precision floating point. This implies a theoretical peak INT32 dispatch rate of 64 per clock per CU – double GK104's base rate. AMD's other advantage, however, is the sheer number of Compute Units (CUs) that make up one GPU. The Titan, as we've said, has 14 SMXs, compared to the HD 7970's 32 CUs. Compute Units / SMXs may be far more important than the total number of cores in these contexts.
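Plugging the dispatch figures above into each card's reference clock gives a rough sense of the theoretical gap. The clocks used here are the reference 925MHz, 837MHz, and 1,006MHz for the HD 7970, GTX Titan, and GTX 680 respectively; real-world utilisation will be lower than these peaks:

```python
def peak_int32_gops(units, ops_per_clock, clock_ghz):
    # units x ops/clock x GHz = billions of 32-bit integer ops per second
    return units * ops_per_clock * clock_ghz

hd7970 = peak_int32_gops(32, 64, 0.925)   # 32 CUs  -> ~1,894 Gops/s
titan  = peak_int32_gops(14, 64, 0.837)   # 14 SMXs -> ~750 Gops/s
gtx680 = peak_int32_gops(8, 32, 1.006)    # 8 SMXs  -> ~258 Gops/s
```

On these back-of-the-envelope numbers, the HD 7970's theoretical peak integer throughput is roughly 2.5 times the Titan's, before any software optimisation enters the picture.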
First, we’ll look at the Titan’s performance against the GTX 680 in an unoptimised OpenCL kernel (using cgminer 2.11) and a more recent CUDA-optimised kernel based on rpcminer. Rpcminer and cgminer share a common code base – performance between the two is identical when using OpenCL. For the unoptimised test, we opted for the poclbm kernel. The optimised test used a modified CUDA-capable kernel. This kernel was allowed to auto-configure for the Nvidia GeForce cards, but we also tested various manual settings for the number of threads and grid size. Hand-tuning these options failed to meaningfully improve performance.
The baseline testbed was an Intel Core i7-3770K with 8GB of RAM and an Asus P8Z77V-Deluxe motherboard with a Thermaltake 1275W 80 Plus Platinum power supply. The AMD Radeon cards were all configured to use the diakgcn kernel. Performance and power consumption were logged over two hours, which gave the erratic CUDA miner’s performance time to stabilise.
Typically in Bitcoin mining, the hash rate of a given card remains stable. The GTX 680 and Titan both “bounced” when running the CUDA miner, though the cause of the fluctuation is unclear. The performance figures for these cards reflect their average hash rate over time.
The first thing to notice is that the Titan is much faster than increased core counts or clock speed alone would account for. Optimising for CUDA improves performance on both cards by roughly 20 per cent. Nvidia promised that GK110 would deliver significantly improved performance in mathematical workloads, and that promise is borne out here as well. The size of the improvement is nearly identical for the two cards, at ~17 per cent, which implies that Nvidia's driver can auto-optimise code to run on the GTX Titan.
GK110 is significantly faster than GK104, but look at what happens when we add Radeon performance data…
Ouch. The Radeon 7790, a near-£100 GPU, offers 80 per cent of the GTX Titan's performance for a fraction of its price (the Titan is over £800). The Radeon 7970 is twice as fast at less than half the price. Even the CUDA-accelerated kernel doesn't bring Nvidia hardware into the same league as AMD's – a point hammered home if we compare system power consumption. Keep in mind that the Titan is a 7.1 billion transistor GPU with a 561mm² die. The fact that the Radeon 7790 nearly matches its performance at 112mm² and two billion transistors points to a fundamental bottleneck within Titan's architecture as the source of the problem.
The situation is just as lopsided if we consider GPU efficiency based on power consumption (MHash/Watt) or initial purchase price vs. hashrate, as shown below.
A full discussion of GPGPU performance between AMD and Nvidia is beyond the scope of this article, but some performance checking is in order. The OpenCL-based Luxmark 2.0 benchmark now runs on the Titan (when we first reviewed the card, the program crashed at launch), so let's see how performance compares there. The GTX 680, GTX Titan, HD 7970, and HD 7790 are all shown below.
The Titan is a huge improvement on the GTX 680, but it still delivers only half the performance of the HD 7970.
SiSoft Sandra now includes a number of financial transaction tests, some of which are designed to leverage the floating-point calculations where the Titan, theoretically, should excel.
The new financial tests in SiSoft Sandra 2013 are designed to measure “the metrics of a financial entity, be it a business, asset, option, etc. Here, various models are used to determine the future worth of ‘options’ in organised option trading. An ‘option’ is a contract to buy/sell an asset at a specified price (‘strike price’) at (or before) an expiration date… Mathematical models are employed to estimate option worth and are implemented in most financial or trading software; some are compute intensive, which is where GPGPU acceleration comes in.”
The GTX Titan isn’t much faster than the GTX 680 in this test, possibly due to a need for further optimisations in these workloads. When we flip to 64-bit performance, the match-up changes.
Everyone takes a performance hit, but the GTX Titan goes from less than half AMD's performance to only about 25 per cent behind it. Other data sets, such as encryption performance estimates and the downloadable CLBenchmark GPGPU tests, show the HD 7970 generally ahead of the GTX Titan in raw performance. Factor in die size or card price, and the HD 7970 is nearly always the better value.
Bitcoin: a worst-case example of a general trend
There are several reasons why this lopsided performance trend hasn’t gotten more play: GPGPU performance is still in its infancy; games are still the go-to metric for consumer GPU comparisons; workstation applications fill a similar role in the professional space. Then there’s the fact that Nvidia still owns the high-performance GPU computing space.
The relative performance differences between AMD's GCN architecture and Titan are interesting because they echo the marked workload-dependent differences we commonly see in the CPU market. When games were the only metric of interest, GPU performance depended solely on how well the graphics card's features mapped to DX standards and game engine demands. Our Bitcoin performance and OpenCL tests demonstrate that while the Titan crushes the HD 7970's performance in gaming, it can lag by up to 50 per cent in other tests, despite a far larger transistor budget, more cores, and a price more than double the Radeon's.
Can the gap be closed?
Earlier, we noted that the GTX 680 and GTX Titan had a tendency to “bounce” when benchmarked using a CUDA-optimised mining program. Even if we assume that the miner could be further improved to deliver peak performance at a constant rate, the GTX 680 would only reach 180MHash, while the GTX Titan topped out at 427MHash. £830 for 427MHash/second is never going to be a good deal when two Radeon 7970s can be bought for less than that, with 2.2 times the performance.
For now, we're betting that the high number of cores per SMX (192 for Kepler, against 64 per CU for GCN) is part of the problem. Each SMX has to work harder to extract sufficient parallelism to keep the entire processor block fed, which makes peak utilisation problematic. Further CUDA optimisations might improve the overall performance picture slightly, but there's no miracle kernel with 100 per cent higher performance waiting in the wings. Even if there were, it would scarcely matter – GK104's performance would need to quintuple for the GTX 680 to even be competitive.
Should you mine if you have an Nvidia card? You can, but be aware that power costs make this a losing proposition if Bitcoin prices decline to historic values. Even at $90 (£60) per BTC, and even with a Titan, mining efficiency barely breaks 1.2MHash/watt. Modern AMD cards backed up by efficient power supplies are much better, in the 2.2 to 2.5MHash/watt range.
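Those efficiency figures translate directly into running costs. A back-of-the-envelope model of daily mining economics – the difficulty, electricity price, and BTC price you feed in are illustrative assumptions, though the 25 BTC block reward and the difficulty formula are how the network actually worked in 2013:

```python
def daily_btc(hashrate_mhs, difficulty):
    # Expected BTC per day: on average one block is found every
    # difficulty * 2^32 hashes, and each block pays 25 BTC (2013 reward)
    hashes_per_day = hashrate_mhs * 1e6 * 86400
    return hashes_per_day / (difficulty * 2**32) * 25

def daily_profit(hashrate_mhs, mhash_per_watt, difficulty,
                 btc_price, power_price_per_kwh):
    # Revenue from expected coins, minus 24 hours of electricity
    revenue = daily_btc(hashrate_mhs, difficulty) * btc_price
    watts = hashrate_mhs / mhash_per_watt
    return revenue - (watts / 1000) * 24 * power_price_per_kwh
```

At a given hash rate, doubling MHash/watt halves the electricity bill – which is the whole AMD-versus-Nvidia story once the hardware is paid off.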
Special thanks go to Adrian Silasi of SiSoft (makers of SiSoft Sandra), who helped extensively with the analysis of this data and contributed some of the benchmark results we discussed.