We know PhysX runs faster with GPU acceleration; that's the main reason Nvidia bought the technology from Ageia in the first place. However, the size of the gulf between CPU-calculated PhysX and GPU-accelerated PhysX may have been engineered deliberately by Nvidia.
Real World Technologies has performed a thorough investigation of PhysX running on a 3.2GHz Core i7 920, using Intel's VTune tool to analyse how it uses CPU resources. Interestingly, the site found that CPU PhysX was not only failing to spread its work across multiple threads automatically, with one thread handling 80 to 90 per cent of the work, but was also mainly using legacy parts of the CPU architecture.
In particular, it turns out many DLLs, including PhysXCore.dll, only use x87 floating point instructions (remember the days of math co-processors?), rather than SSE. Plus, while some PhysX components did use SSE, it was mostly in threads that account for an insignificant share of the workload.
The site's detailed results list a number of threads and which resources they use. When running the Dark Basic PhysX Soft Body Demo, for example, the site notes "PhysXCore.dll is the culprit responsible for 91 per cent of all x87 instructions retired in the entire process."
The question is why Nvidia chooses to use x87 rather than SSE. The site points out that it's certainly not because of legacy hardware support. Even a decade-old 1.4GHz Pentium 4 CPU supports SSE2, after all.
With access to twice as many registers, and the ability of a packed SSE instruction to perform four single-precision operations at once (GPU PhysX runs on GeForce 8-series GPUs, so it clearly doesn't require double-precision), the site reckons there's the potential for CPU PhysX to run at least twice as fast as it does currently.
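To make the scalar-versus-packed distinction concrete, here is a minimal sketch in C. It is not PhysX code, and the function names `scale_add_scalar` and `scale_add_packed` are hypothetical. It assumes a compiler that provides the standard SSE intrinsics in `<xmmintrin.h>`; where SSE isn't available, the packed version simply falls back to the scalar loop. The point is only to show what "packed" means: a 128-bit SSE register holds four single-precision floats, so each multiply or add instruction handles four values rather than one.

```c
#include <assert.h>
#include <stddef.h>

/* Use SSE intrinsics only where the compiler says they are available. */
#if defined(__SSE__) || defined(_M_X64) || (defined(_M_IX86_FP) && _M_IX86_FP >= 1)
#include <xmmintrin.h> /* __m128, _mm_set1_ps, _mm_loadu_ps, _mm_mul_ps, ... */
#define HAVE_SSE 1
#else
#define HAVE_SSE 0
#endif

/* Scalar version: one multiply and one add per element, one element at a time.
 * This is the shape of work that x87 code is limited to. */
void scale_add_scalar(const float *a, const float *b, float k,
                      float *out, size_t n)
{
    assert(a && b && out);
    for (size_t i = 0; i < n; ++i)
        out[i] = a[i] * k + b[i];
}

/* Packed version: each SSE multiply or add works on four floats at once. */
void scale_add_packed(const float *a, const float *b, float k,
                      float *out, size_t n)
{
    assert(a && b && out);
    size_t i = 0;
#if HAVE_SSE
    const __m128 vk = _mm_set1_ps(k);        /* k copied into all four lanes */
    for (; i + 4 <= n; i += 4) {
        __m128 va = _mm_loadu_ps(a + i);     /* load four floats (unaligned OK) */
        __m128 vb = _mm_loadu_ps(b + i);
        __m128 vr = _mm_add_ps(_mm_mul_ps(va, vk), vb);
        _mm_storeu_ps(out + i, vr);          /* store four results */
    }
#endif
    /* Leftover elements (or everything, without SSE) go through scalar code. */
    for (; i < n; ++i)
        out[i] = a[i] * k + b[i];
}
```

Both functions compute the same result. Whether the packed loop delivers the kind of speed-up the site estimates depends on the data layout, memory bandwidth and the rest of the workload, so treat this as an illustration of the instruction-level difference rather than a benchmark.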
"The truth is that there is no technical reason for PhysX to be using x87 code," says Real World Technologies following its analysis. "PhysX uses x87 because Ageia and now Nvidia want it that way," it adds. "Nvidia already has PhysX running on consoles using the AltiVec extensions for PPC, which are very similar to SSE. It would probably take about a day or two to get PhysX to emit modern packed SSE2 code, and several weeks for compatibility testing."
Cynically, the site suggests "the sole purpose of PhysX is a competitive differentiator to make Nvidia’s hardware look good and sell more GPUs. Part of that is making sure that Nvidia GPUs looks a lot better than the CPU, since that is what they claim in their marketing. Using x87 definitely makes the GPU look better, since the CPU will perform worse than if the code were properly generated to use packed SSE instructions."
We've asked Nvidia why PhysX is so dependent on x87 instructions, rather than SSE, and we'll update you if and when we get any more information.