My relevant experience is in porting numerical code between CPUs and GPUs. Some of the issues that have caused problems are:
- Different precision of approximate math (transcendental functions, reciprocals, etc.)
- Different rounding of the intermediate product in fused multiply-add instructions.
- Different handling of exception cases (inf, nan, etc.).
- Aside from correctness differences, some optimization strategies that make things faster on one processor make them slower on another. This happens even within different generations of x86 hardware.
> Which is why you don't see such artificial crippling in open source implementations of LAPACK/BLAS/sundials/etc
Are they as fast as MKL? If so, just use them?
If not, why not? Maybe the reason is you can do better if you optimize for specific CPUs, with different latencies of various instructions?
Porting between different instruction sets is a very different thing from this situation.
> Aside from correctness differences, some optimization strategies that make things faster on one processor make them slower on another. This happens even within different generations of x86 hardware.
This is the one notably relevant part and, yeah, that's fine. Follow the CPUID features. Nobody expects it to be absolutely optimal on AMD. But let it use the code that was optimized for Intel chips with the same features.
Then it truly is irrelevant!
You're not even talking about CPUs vs CPUs.
The differences you are quoting come from differences in libraries (sin, exp, etc. will give different results depending on the libm implementation; that's normal, and it has nothing to do with CPU instructions!), not from the implementation of IEEE instructions (assuming you're talking about IEEE floats; otherwise, you shouldn't expect them to behave the same in the first place!).
> Are they as fast as MKL? If so, just use them?
I (and a lot of other people) do use them when I have a choice. Sometimes they are faster, sometimes they aren't. When there is a significant disparity, however, it is usually because of GenuineIntel checks.
> If not, why not?
Because scientific software geared toward applications is usually closed-source proprietary, or too complicated to modify to add new alternative backends (remember that users aren't interested in becoming software engineers on top of their own jobs as researchers), so you don't get to choose.
"Sometimes they are faster, sometimes they aren't. When there is a significant disparity, however, it is usually because of GenuineIntel checks."
so you are saying that your open-source BLAS/LAPACK shows performance differences (and worse performance compared to MKL) because of "something Intel". Seems like a lot of people here (including the ones not being able to compile numpy against another BLAS) are a little bit short on actual experience with, and knowledge of, the problem...
"scientific software geared toward applications is usually closed-source proprietary or too complicated to be modified"
If it's geared towards applications, it's usually opaque engineering stuff, and the results of people claiming to do science with this software are mediocre at best...
In my domain (quantum chemistry) nearly all software is delivered as a source distribution. Because modifications of methods are part of science...
That's funny, because I was actually talking about your field! (whose math is borrowed from one of the subfields of physics) I haven't so far met even a single chemist or materials scientist who actually knows what they're running, even when they have access to the millions of lines of code they're using. And I don't blame them (or call them "mediocre", as you do), because they only have 24 hours in a day and only one life!
I have met only one computational physicist so far doing DFT who used to write his own code back in the 70s, but he admits he has no idea what VASP and the others are doing nowadays.
If you're claiming that you actually know in depth how VASP or Quantum Espresso (or any other similarly significant piece of software) works, and that you can tweak/replace any part as you like (which I'd find very, very hard to believe; millions of work hours go into the development of those), you'd nevertheless be the exception in chemistry, not the norm.
The most common high-level tools theoretical physicists like me use (such as Mathematica) don't give access to source code, on the other hand, so we can't make Mathematica not use MKL and not suck on AMD.
> Then it truly is irrelevant! You're not even talking about CPUs vs CPUs.
Why does that matter? The bulk of the issues come from implementation defined behavior, of which there is plenty within x86 itself to cause issues.
In general, the IEEE-compliant parts of x86 are also IEEE-compliant on other processors, at least the ones I've dealt with. It's the operations that aren't specified by IEEE that cause problems.