Jumps/calls are actually pretty cheap with modern branch predictors. Even indirect calls through vtables, which is contrary to most programmers' intuition.
And if the devirtualisation leads to inlining, that results in code bloat, which can lower performance through more instruction cache misses, which are not cheap.
Inlining is actually pretty evil. It almost always speeds things up for microbenchmarks, as such benchmarks easily fit in icache. So programmers and modern compilers often go out of their way to do more inlining. But when you apply too much inlining to a whole program, things start to slow down.
But it's not like inlining is universally bad in larger programs; inlining can enable further optimisations, mostly because it allows constant propagation to travel across function boundaries.
Basically, compilers need better heuristics about when they should be inlining. If it's just saving the overhead of a lightweight call, then they shouldn't be inlining.
No it's not. Except if you __force_inline__ everything, of course.
Inlining reduces the number of instructions in a lot of cases, especially when things are abstracted and factored, with lots of indirection, into small functions that call other small functions and so on. Consider an 'isEmpty' function, which dissolves to 1 CPU instruction once inlined, compared with a call/save reg/compare/return. Highly dynamic code (with most functions being virtual) tends to result in a fest of chained calls, jumping into functions doing very little work. Yes, the stack is usually hot and fast, but spending 80% of the instructions doing stack management is still a big waste.
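A minimal sketch of what I mean (hypothetical names, assuming GCC/Clang at -O2):

```cpp
#include <cstddef>

struct SmallVec {
    std::size_t count = 0;
    // Out of line this is a call, a register save, a compare, and a return;
    // inlined it typically collapses to a single compare against `count`.
    bool isEmpty() const { return count == 0; }
};

// With isEmpty() visible to the compiler, the loop body usually compiles
// down to one compare-and-branch per element, with no call overhead at all.
std::size_t countEmpty(const SmallVec* v, std::size_t n) {
    std::size_t empties = 0;
    for (std::size_t i = 0; i < n; ++i)
        if (v[i].isEmpty())
            ++empties;
    return empties;
}
```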
Compilers already have good heuristics about when they should be inlining; chances are they're a lot better at it than you are. They don't always inline, and that's not possible anyway.
My experience is that compilers do marvels with inlining decisions when there are lots of small functions they _can_ inline if they want to. It gives the compiler a lot of freedom. Lambdas are great for that as well.
Make sure you make as much compile-time information as possible available to the compiler, factor your code, don't have huge functions, and let the compiler do its magic. As a plus, you can have high-level abstractions, deep hierarchies, and still get excellent performance.
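A rough sketch of the lambda point (hypothetical names): a predicate passed as a template parameter is a concrete type whose body the compiler can see and inline, whereas a raw function pointer is often opaque at the call site.

```cpp
#include <vector>
#include <cstdio>

// The predicate's concrete type is a template parameter, so the compiler
// sees the lambda's body at the call site and is free to inline it.
template <typename Pred>
int countIf(const std::vector<int>& xs, Pred pred) {
    int n = 0;
    for (int x : xs)
        if (pred(x))
            ++n;
    return n;
}

// Function-pointer version: unless the optimizer can prove which function
// is passed, the callee stays an indirect call and inlining is harder.
int countIfPtr(const std::vector<int>& xs, bool (*pred)(int)) {
    int n = 0;
    for (int x : xs)
        if (pred(x))
            ++n;
    return n;
}

int main() {
    std::vector<int> xs = {1, 2, 3, 4, 5};
    std::printf("%d\n", countIf(xs, [](int x) { return x % 2 == 0; }));
}
```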
Doesn't the compiler usually do well enough that you really only need to worry about time-critical sections of code? Even then you could go in and look at the assembly and see whether it's being inlined, no?
I find the Unreal Engine source to be a reasonable reference for C++ discussions, because it runs just unbelievably well for what it does, and on a huge array of hardware (and software). And it's explicit with inlining, other hints, and even a million things that could be easily called micro-optimizations, to a somewhat absurd degree. So I'd take away two conclusions from this.
The first is that when building a code base you don't necessarily know what it will be compiled with. And so even if there were a super-amazing compiler, there's no guarantee that's what will be compiling your code. Making things explicit, so long as you have a reasonably good idea of what you're doing, is generally just a good idea. It also conveys intent to some degree, especially things like final.
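For example, final is exactly the kind of explicit hint that lets the compiler devirtualize (and potentially inline) a call regardless of how clever it otherwise is. A hedged sketch with hypothetical types:

```cpp
struct Shape {
    virtual ~Shape() = default;
    virtual double area() const = 0;
};

// `final` tells the compiler no further overrides of area() can exist,
// so calls through a Circle reference can be devirtualized.
struct Circle final : Shape {
    double r = 1.0;
    double area() const override { return 3.14159265358979 * r * r; }
};

double tenAreas(const Circle& c) {
    double sum = 0.0;
    // No vtable lookup needed: Circle is final, so area() resolves
    // statically and becomes a good candidate for inlining.
    for (int i = 0; i < 10; ++i)
        sum += c.area();
    return sum;
}
```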
The second is that I think the saying 'premature optimization is the root of all evil' is the root of all evil. That mindset has gradually transitioned into being against optimization in general, outside of the most primitive things like not running critical sections in O(N^2) when they could be O(N). And I think it's this mindset that has gradually brought us to where we are today, where we need what would have been a literal supercomputer not that long ago to run a word processor. It's death by a thousand cuts, and quite ridiculous.
> The second is that I think the saying 'premature optimization is the root of all evil' is the root of all evil.
The greater evil is putting a one-sentence quote out of context:
"""
There is no doubt that the grail of efficiency leads to abuse. Programmers waste enormous amounts of time thinking about, or worrying about, the speed of noncritical parts of their programs, and these attempts at efficiency actually have a strong negative impact when debugging and maintenance are considered. We should forget about small efficiencies, say about 97% of the time: premature optimization is the root of all evil.
Yet we should not pass up our opportunities in that critical 3%. A good programmer will not be lulled into complacency by such reasoning, he will be wise to look carefully at the critical code; but only after that code has been identified. It is often a mistake to make a priori judgments about what parts of a program are really critical, since the universal experience of programmers who have been using measurement tools has been that their intuitive guesses fail. After working with such tools for seven years, I've become convinced that all compilers written from now on should be designed to provide all programmers with feedback indicating what parts of their programs are costing the most; indeed, this feedback should be supplied automatically unless it has been specifically turned off.
"""
Indeed, but I think even that advice, with context, is pretty debatable. Obviously one should prioritize critical sections, but completely ignoring those "small efficiencies" is certainly a big part of how we got to where we are today in software performance. A 10% jump in performance is huge; whether that comes from a single 10% jump, or a hundred 0.1% jumps - it's exactly the same!
So referencing something in particular from Unreal Engine, they actually created a caching system for converting between a quaternion and a rotator (Euler rotation)! Obviously that sort of conversion isn't going to, in a million years, be even close to a bottleneck. That conversion is quite cheap on modern hardware, and so that caching system probably only gives the engine one of those 0.1% boosts in performance. But there are literally thousands of these "small efficiencies" spread all throughout the code. And it yields a final product that runs dramatically better than comparable engines.
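I'm not reproducing the engine's actual code here, but the idea is roughly this kind of memoization, with hypothetical types and names standing in for the engine's:

```cpp
#include <cmath>

// Stand-ins for the engine's quaternion and Euler-angle types.
struct Quat { double x = 0, y = 0, z = 0, w = 1; };
struct Rotator { double pitch = 0, yaw = 0, roll = 0; };

// A standard quaternion-to-Euler conversion: a handful of trig calls,
// cheap in isolation but worth skipping when nothing has changed.
Rotator toRotator(const Quat& q) {
    Rotator r;
    r.roll  = std::atan2(2.0 * (q.w * q.x + q.y * q.z),
                         1.0 - 2.0 * (q.x * q.x + q.y * q.y));
    r.pitch = std::asin(2.0 * (q.w * q.y - q.z * q.x));
    r.yaw   = std::atan2(2.0 * (q.w * q.z + q.x * q.y),
                         1.0 - 2.0 * (q.y * q.y + q.z * q.z));
    return r;
}

// Cache the last conversion so repeated reads of an unchanged rotation
// pay the trig cost only once.
class CachedRotation {
public:
    void set(const Quat& q) { quat_ = q; dirty_ = true; }

    const Rotator& rotator() const {
        if (dirty_) {
            cached_ = toRotator(quat_);
            dirty_ = false;
        }
        return cached_;
    }

private:
    Quat quat_;
    mutable Rotator cached_;
    mutable bool dirty_ = true;
};
```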
I find that gcc and clang are so aggressive about inlining that it's usually more effective to tell them what not to inline.
In a moderately-sized codebase I regularly work on, I use __attribute__((noinline)) nearly ten times as often as __attribute__((always_inline)). And I use __attribute__((cold)) even more than noinline.
So yeah, I can kind of see why someone would say inlining is 'evil', though I think it's more accurate to say that it's just not possible for compilers to figure out these kinds of details without copious hints (like PGO).
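For reference, the hints in question look like this in GCC/Clang (the function bodies are just illustrative):

```cpp
#include <cstdio>

// Keep this rare, bulky path out of its callers so it doesn't bloat
// them or evict hotter instructions from the i-cache.
__attribute__((noinline))
void dumpDiagnostics(int code) {
    std::fprintf(stderr, "diagnostic: code=%d\n", code);
}

// Force inlining for a tiny helper that sits on the hot path.
__attribute__((always_inline)) inline
int clampToByte(int v) {
    return v < 0 ? 0 : (v > 255 ? 255 : v);
}

int process(int v) {
    if (v < -1000) {          // rare in practice
        dumpDiagnostics(v);   // stays a compact out-of-line call
        return 0;
    }
    return clampToByte(v);    // expands to a couple of instructions inline
}
```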
+1 on the __attribute__((cold)). Compilers so aggressively optimize based on their heuristics that you spend more time telling them that an apparent optimization opportunity is not actually an optimization.
When writing ultra-robust code that has to survive every vaguely plausible contingency in a graceful way, the code is littered with code paths that only exist for astronomically improbable situations. The branch predictor can figure this out but the compiler frequently cannot without explicit instructions to not pollute the i-cache.
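A minimal sketch of that pattern, assuming a GCC/Clang toolchain: mark the contingency handler cold, and tell the compiler which way the branch is expected to go so the fast path stays straight-line code.

```cpp
#include <cstdio>
#include <cstdlib>

// cold: optimize for size and place away from hot code; branches leading
// here are treated as unlikely. noinline keeps it out of callers entirely.
__attribute__((cold, noinline))
void handleCorruptedState(const char* what) {
    std::fprintf(stderr, "fatal: %s\n", what);
    std::abort();
}

int consume(const int* buf, int len) {
    // __builtin_expect documents that the failure branch is the rare one,
    // so the common case doesn't pollute the i-cache with recovery code.
    if (__builtin_expect(buf == nullptr || len < 0, 0)) {
        handleCorruptedState("bad buffer");
    }
    int sum = 0;
    for (int i = 0; i < len; ++i)
        sum += buf[i];
    return sum;
}
```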
Another for the pro side: inlining can allow for better branch prediction if the different call sites would tend to drive different code paths in the function.
This was true 15 years ago, but not so much today.
Branch predictors actually hash the history of the last few branches taken into the branch prediction query. So the exact same branch within a child function will map to different branch predictor entries depending on which parent function it was called from, and there is no benefit to inlining.
It also means the branch predictor can learn correlations between branches within a function, like when branches at the top and bottom of a function share conditions, or have inverted conditions.