> Clang has aged: it's both slowed down and reached parity in both compile and execution speed
A serious question: Is it correct to say that any compiler with state-of-the-art optimization cannot be fast, and that clang's previous performance advantage was actually due to the lack of various optimization techniques and other features? And is there any research on the tradeoff between compiler speed/complexity and execution performance? Many minimalist purists claim that modern compilers are evil and that it's possible to write a small and fast compiler. Yes, but at what cost? Is that cost too large to be justified, or are computers actually fast enough?
Sure, there are some language features that can make optimizations pathologically complex, but for the most part, if compilation speed isn't continuously tracked and strictly enforced against new changes, you inevitably regress into a slower compiler.
For instance, the Go community is highly sensitive to compilation speed. Although the Go compiler is SSA-based like LLVM, it's very selective about which optimizations it applies. The Go team tracks and benchmarks each phase of the compiler from release to release to catch major regressions.
Exactly: it comes down to priorities. And these priorities are not only down to the compiler implementation, but also to the language design. I'd argue that Rust being much slower to compile than Go has causes on both sides, and hence Rust will probably never be as fast to compile as Go.
It would be weird if that was not the case, as Go and all its bells and whistles are open source. I found this after a quick search (clicking around will send you to the source code): https://godoc.org/golang.org/x/tools/cmd/compilebench
edit: Ooh, that's nice, I did not know about that, thanks for sharing (especially useful as I'm transitioning to Rust after 10yrs of Go).
> and clang's previous performance advantages were actually due to the lack of various optimization techniques and other features?
Not a lack of optimization techniques; LLVM (the middle and back end) is relatively fast compared to Clang (the front end) for builds of the Linux kernel. Is the time going to compiler diagnostics and semantic analysis, or just to generating unoptimized code? Work is in progress to find out.
(I literally just went over a perf record of a Linux kernel build with a teammate who's going to look into this more; I organized the LLVM miniconf at Plumbers, which is the video linked in the Phoronix article in the grandparent post; my summer intern did the initial profiling work.)
For compilation from source to object file, the vast majority of the time for most translation units when building the Linux kernel is spent in the front end of the pipeline, not the middle or the back end.
It's possible, although it's worth saying that compilers are still relatively dumb when it comes to actually using their arsenal of optimizations. You can have too much of a good thing, and it's quite difficult to strike a usable performance/compilation-speed tradeoff.
This isn't a particularly scientific test, but if you compare the assembly for the generated factorial function, the compilers, when asked for absolutely breakneck speed, correctly identify parallelism in the code but fail to recognize that the loop overflows almost immediately. The overflow is defined behaviour, so without human intervention the compiler can't guess where we want the performance. The end result is a pretty enormous SIMD-galore function that is not much faster, or even slower, than the very simple optimize-for-size output.
You may be thinking: well, it's faster sometimes, and that's good. It often is, but code density is very important too. The SIMD-gone-mad version is several times bigger than the simple loop. Modern x86 CPUs have a μop cache of roughly 1-4K entries, and a function like that can burn through a large chunk of it. It's a significant dent in the instruction cache as well.
If you look down the software stack, we use programs like compilers to find invariants in our data and algorithms, act on them, and then promptly throw them away. When Moore's law ends, this is where we can find a lot of performance, I reckon. In this case we can bully the compiler into assuming the overflow won't happen, or is highly unlikely, which lets it schedule the code more effectively for our use case. But this requires relatively intimate knowledge of a given compiler and of the microarchitecture one is using, and that's not really good enough.
I think with SIMD we have a language problem: the for construct is often not what we want; a map is. The lack of a distinction between pure and impure functions in most C-based languages also forecloses a lot of opportunities for proper SIMD.
Maybe that depends on what "state of the art" means? It sounds like it might mean https://en.wikipedia.org/wiki/Superoptimization, but that's so slow-to-compile that I don't think any production compiler does anything like it. It might be that what "state of the art" really means in practice is "the slowest-to-compile level of optimization that regular folks are willing to put up with on a day-to-day basis." In which case, it cannot be fast kind of by definition :)
No, not superoptimization. When I say "state-of-the-art", I mean "up-to-date techniques currently practiced in the industry", not theoretically optimal. For example, "state-of-the-art" cryptography is TLS 1.3, AES-GCM, or the Signal protocol, while a theoretically optimal scheme might be a "one-time pad" (or superoptimization, in your example).
In other words: "We all know standard compilers are slow. Does reaching ICC/GCC/clang's -O2 performance in benchmarks require inherently expensive optimizations? Or are ICC/GCC/clang's optimizers simply implemented poorly in terms of compile-time performance?"
When I asked this question, I was drawing a distinction between the optimization techniques and algorithms currently in use and their implementations in GCC/clang. How large is the gap? If the gap is small, the algorithms currently used for optimization are inherently expensive (i.e. a "good" compiler "must" be slow). If the gap is large, a much faster implementation is theoretically possible (i.e. a "good" compiler "can" be fast).
LLVM does indeed use known-suboptimal algorithms for the sake of speed. Its register allocator is an example: optimal register allocation is NP-complete, and LLVM instead uses a comparatively simple greedy algorithm.
I don’t understand why the combination of a fast debug build mode and a really slow but super-optimizing (not necessarily in the sense you mentioned in your link) release compiler isn’t used more (I think Zig does that). It would be the best of both worlds: a really fast write-compile-test loop, and a production-ready, fast binary on release. (I know -O0 through -O3 is similar, but on the whole it's not as popular as it could be.)
It's a combination of factors. Optimization certainly takes a lot of time, but it's not the only significant factor; the modular multi-pass architecture also takes a toll. If you look at fast C compilers like TCC, you can see that they get their speed primarily from doing fewer passes over the source code. TCC does only a single pass, which severely limits the kinds of optimizations it can do, but if you compare TCC's performance to that of gcc or clang with optimizations disabled (-O0), TCC is still significantly faster. So the architecture used to build large, extensible optimizing compilers like GCC also inherently costs some speed.
I believe it is theoretically possible to implement many of the optimizations that larger compilers have in fewer passes, but that comes at the cost of complexity and of difficulty debugging the different combinations.
The architectural problem can sort of be resolved by holding one's nose and maintaining a parallel (fast) algorithm; the problem is that this is expensive and bug-prone. The algorithms that lead to optimal behaviour often don't have a smooth sliding scale of performance vs. quality. And if TCC had to support VLIW, it would be a lot more complicated and slower.
It is a trade-off, but you can control it via the optimization switches: compare builds with no optimization against builds with full optimization.
It's more complicated than that. You can see which passes clang runs with the flags -mllvm -debug-pass=Arguments. It's certainly true that -O0 involves fewer passes than -O1, which involves fewer passes than -O2. But the speed limit is set by the architecture. A compiler engineered for compilation speed with a goal of -O1-tier code quality would have a very different architecture from day one (e.g. unified IR, fewer passes, integrated register allocation and code generation). There have been attempts at "punching through the layers" over the years with things like FastISel, but there are limits.
That said, all the major compilers can deliver adequate compilation speed on the order of 50-100 kloc/sec for well-structured C code bases. Speed is much more of a serious concern if you're working on large C++ code bases. Or if you're using LLVM as a backend from a language like Julia which does the moral equivalent of C++ template instantiation at runtime and feeds it to LLVM.
The Utrecht Haskell Compiler[1] tried to break through the barrier by using attribute grammars[2] and attempts whole-program optimization. This also has the benefit of being more modular. However, overall GHC still feels faster; I'm unsure whether this is due to more costly optimisations or something else.
Looks more like reliability vs. speed. Small, fast compilers skimp on various safety checks and edge cases. That might mean segfaults and crashes on weird architectures and in odd situations, but it's fast on well-behaved machines.
Rust’s cargo check appears to be quite fast, and AFAIK it performs no fewer safety checks than a complete build. I’ve never verified this, but there's clearly a substantial amount of time spent compiling Rust after all the safety checks are done.