> Clang has aged: it's both slowed down and reached parity in both compile and execution speed
A serious question: Is it correct to say that any compiler with state-of-the-art optimization cannot be fast, and that clang's previous performance advantage was actually due to the lack of various optimization techniques and other features? And is there any research on the tradeoff between compiler speed/complexity and execution performance? Many minimalist purists claim that modern compilers are evil and that it's possible to write a small and fast compiler. Yes, but at what cost? Is that cost too large to be justified, or are computers actually fast enough?
Sure, there are some language features that can make optimizations pathologically complex, but for the most part, if compilation speed isn't continuously tracked and strictly enforced against new changes, you inevitably regress into a slower compiler.
For instance, the Go community is highly sensitive to compilation speed. Although the Go compiler is SSA-based like LLVM, it's very selective about which optimizations it applies. The Go team tracks and benchmarks each phase of the compiler from release to release to catch major regressions.
Exactly: it comes down to priorities. And these priorities are not only down to the compiler implementation, but also to the language design. I'd argue that Rust being much slower to compile than Go has causes on both sides, and hence Rust will probably never be as fast to compile as Go.
It would be weird if that was not the case, as Go and all its bells and whistles are open source. I found this after a quick search (clicking around will send you to the source code): https://godoc.org/golang.org/x/tools/cmd/compilebench
edit: Ooh, that's nice, I did not know about that, thanks for sharing (especially useful as I'm transitioning to Rust after 10yrs of Go).
> and clang's previous performance advantages were actually due to the lack of various optimization techniques and other features?
Not a lack of optimization techniques; LLVM (the middle and back end) is relatively fast compared to Clang (the front end) for builds of the Linux kernel. Is the time going to compiler diagnostics and semantic analysis, or just to generating unoptimized code? Work is in progress to find out.
(I literally just went over a perf record of a Linux kernel build with a teammate who's going to look into this more; I organized the LLVM miniconf at Plumbers, which is the video linked in the Phoronix article in the grandparent post; my summer intern did the initial profiling work.)
For compilation from source to object file, the vast majority of the time for most translation units when building the Linux kernel is spent in the front end of the pipeline, not the middle or the back end.
It's possible, although it's worth saying that compilers are still relatively dumb when it comes to actually using their arsenal of optimizations. You can have too much of a good thing, and it's quite difficult to strike a usable performance/compilation-speed tradeoff.
This isn't a particularly scientific test, but if you compare the assembly for the generated factorial function, the compilers, when asked for absolutely breakneck speed, correctly identify parallelism in the code but fail to recognize that the loop overflows almost immediately. The overflow is defined behaviour, so without human intervention the compiler can't guess where we want the performance. The end result is a pretty enormous SIMD-galore function that is not much faster, or even slower, than the very simple optimize-for-size output.
You may be thinking: well, it's faster sometimes, and that's good. It often is, but code density is very important too. The SIMD-gone-mad version is several times bigger than the simple loop. Modern x86 CPUs have a μop cache of roughly 1-4K entries, and a function like that can burn through a large chunk of it. It's a significant dent in the instruction cache as well.
If you look down the software stack, we use programs like compilers to find invariants in our data and algorithms, act on them, and then promptly throw them away. When Moore's law ends, this is where we can find a lot of performance, I reckon. In this case we can bully the compiler into assuming the overflow won't happen, or is highly unlikely, which lets it schedule the code more effectively for our use case. But this requires relatively intimate knowledge of a given compiler and of the microarchitecture one is using, and that's not really good enough.
I think with SIMD we have a language problem: the for construct is often not what we want; a map is. The lack of a distinction between pure and impure functions in most C-based languages also forecloses a lot of opportunities for proper SIMD.
Maybe that depends on what "state of the art" means? It sounds like it might mean https://en.wikipedia.org/wiki/Superoptimization, but that's so slow-to-compile that I don't think any production compiler does anything like it. It might be that what "state of the art" really means in practice is "the slowest-to-compile level of optimization that regular folks are willing to put up with on a day-to-day basis." In which case, it cannot be fast kind of by definition :)
No, not superoptimization. When I say "state-of-the-art", I mean "up-to-date techniques currently practiced in the industry", not theoretically optimal. For example, "state-of-the-art" cryptography is TLS 1.3, AES-GCM, or the Signal protocol, while a theoretically optimal scheme might be a "one-time pad" (or superoptimization, in your example).
In other words: "We all know standard compilers are slow. Does reaching ICC/GCC/clang's -O2 performance in benchmarks require inherently expensive optimizations? Or are ICC/GCC/clang's optimizers simply implemented poorly in terms of compile-time performance?"
When I asked this question, I was drawing a distinction between the optimization techniques and algorithms currently in use and their implementations in GCC/clang. How large is the gap? If the gap is small, the algorithms currently used for optimization are inherently expensive (i.e. a "good" compiler "must" be slow). If the gap is large, a much faster implementation is theoretically possible (i.e. a "good" compiler "can" be fast).
LLVM does indeed use known-suboptimal algorithms for the sake of speed. Its register allocator is an example: optimal register allocation is NP-complete, and LLVM instead uses a comparatively simple greedy algorithm.
I don’t understand why the combination of a fast debug build mode and a really slow but super-optimizing (not necessarily in the sense you mentioned in your link) release compiler isn’t used more (I think Zig does that). It would be the best of both worlds: a really fast write-compile-test loop, and a production-ready, fast binary on release. (I know -O0 through -O3 is similar, but on the whole it's not as popular as it could be.)
It's a combination of factors. Optimization certainly takes a lot of time, but it's not the only significant factor; the modular multi-pass architecture also takes a toll. If you look at fast C compilers like TCC, you can see that they get their speed primarily from doing fewer passes over the source code. TCC does only a single pass, which severely limits the kinds of optimizations it can do, but if you compare TCC's performance to that of gcc or clang with optimizations disabled (-O0), TCC is still significantly faster. So the architecture used to build large, extensible optimizing compilers like GCC also inherently costs some speed.
I believe it is theoretically possible to implement many of the optimizations that larger compilers have in fewer passes, but that comes at the cost of complexity and of difficulty debugging the different combinations.
The architectural problem can sort of be resolved by holding one's nose and maintaining a parallel (fast) algorithm; the problem is that this is expensive and bug-prone. The algorithms that lead to optimal behaviour often don't have a smooth sliding scale of performance vs. quality. And if TCC had to support VLIW, it would be a lot more complicated and slower.
It is a trade-off, but you can control it via the optimization switches: compare builds with no optimization against builds with full optimization.
It's more complicated than that. You can see which passes clang runs with the flags -mllvm -debug-pass=Arguments. It's certainly true that -O0 involves fewer passes than -O1, which involves fewer passes than -O2. But the speed limit is set by the architecture. A compiler engineered for compilation speed with a goal of -O1-tier code quality would have a very different architecture from day one (e.g. unified IR, fewer passes, integrated register allocation and code generation). There have been attempts at "punching through the layers" over the years with things like FastISel, but there are limits.
That said, all the major compilers can deliver adequate compilation speed on the order of 50-100 kloc/sec for well-structured C code bases. Speed is much more of a serious concern if you're working on large C++ code bases. Or if you're using LLVM as a backend from a language like Julia which does the moral equivalent of C++ template instantiation at runtime and feeds it to LLVM.
The Utrecht Haskell Compiler[1] tried to break through the barrier by using attribute grammars[2] and attempts whole-program optimization. This also has the benefit of being more modular. However, overall GHC still feels faster; I'm unsure whether this is due to more costly optimisations or something else.
Looks more like reliability vs. speed. Small, fast compilers skimp on various safety checks and edge cases. That might mean segfaults and crashes on weird architectures and in odd situations, but it's fast on well-behaved machines.
Rust’s cargo check appears to be quite fast, and AFAIK it performs no fewer safety checks than a complete build. I’ve never verified this, but there's clearly a substantial amount of time spent compiling Rust after all the safety checks are done.