That's not a benefit of async/await, as the same could be done with user-mode threads; in fact, that's exactly what we do with virtual threads. But it might be a benefit of async/await in some particular languages.
> But it might be a benefit of async/await in some particular languages.
Rather than saying it's a benefit for particular languages, I'd say it's a benefit in particular contexts, e.g. in contexts where you don't have a heap. Of course it's true that some (most) languages don't support such contexts at all (for a host of good reasons), but the languages that do are shaped by that decision.
The use case of interest here is having many concurrent operations (hundreds of thousands or millions). If you don't have a heap, where do you store the (unbounded number of) async/await frames? There are other use cases where stackless coroutines are useful without being plentiful (e.g. generators), but that's not the use case we're targeting here, and it's probably a use case of lower importance in general.
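To make that trade-off concrete, here's a small Rust sketch (the function names are my own, purely for illustration): an `async fn` compiles to a state machine whose size is fixed at compile time, which is why the frame can live in a static or on a stack rather than on the heap. The catch, for the many-concurrent-operations use case, is that you still need somewhere to put an unbounded number of such frames.

```rust
use std::mem;

// A hypothetical leaf operation standing in for "await an I/O event".
async fn step() {}

// A hypothetical task with some local state held across suspension points.
async fn task() {
    let buf = [0u8; 32];
    step().await;
    step().await;
    let _ = buf;
}

fn main() {
    // The compiler turns `task` into an anonymous state machine; its size
    // is fixed and known at compile time, so the frame can be placed in a
    // static, an array, or a caller's stack -- no heap required.
    let fut = task();
    println!("frame size: {} bytes", mem::size_of_val(&fut));
}
```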
Many languages/runtimes want just a single coroutine/continuation construct to cover both concurrency and generators, which is a good idea in principle, but then some of them, especially low-level languages, optimise for the less useful of the two. I've seen some very cool demos of C++ coroutines that are useful in very narrow domains, and yet they offer a single construct that sacrifices the more common, more useful usage for the less common one.
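For a rough picture of what the generator side of such a unified construct amounts to, here's a hand-written Rust equivalent (a toy example of my own, not from any of the demos mentioned): a generator is just a small state machine that its consumer resumes one step at a time, which a language-level `yield` construct would write for you.

```rust
// What a hypothetical generator yielding 0, 1, 2 desugars to: a state
// machine whose `next` call resumes it from the saved state.
struct Counter {
    state: u32,
}

impl Iterator for Counter {
    type Item = u32;

    fn next(&mut self) -> Option<u32> {
        // Resume: check where we left off, yield the next value, save state.
        if self.state < 3 {
            let v = self.state;
            self.state += 1;
            Some(v)
        } else {
            None // the "coroutine" has run to completion
        }
    }
}

fn main() {
    let items: Vec<u32> = Counter { state: 0 }.collect();
    assert_eq!(items, vec![0, 1, 2]);
    println!("{items:?}");
}
```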
There was one particular presentation about context-switching coroutines in the shadow of cache misses. It was extremely impressive, yet amounted to little more than a party trick. For one, it was extremely sensitive to the precise sizing of the coroutine frames, which goes against the point of having a simple, transparent language construct; for another, it only simplifies small pieces of code that still have to be very carefully written and optimised at the instruction level even after the simplification.
Yes, I am (perhaps a bit sloppily) using "particular contexts" to refer to particular use cases. And while your use case is the C5M problem, since we're bringing up other languages (which optimize for different contexts) I think it's worth emphasizing that these features also lend themselves to other use cases. Here's an example of using Rust's async/await on embedded devices, for reasons other than serving millions of concurrent connections: https://ferrous-systems.com/blog/async-on-embedded/
> Many languages/runtimes want just a single coroutine/continuation construct to cover both concurrency and generators — which is a good idea in principle — but then they, especially low-level languages, optimise for the less useful of the two.
Notably, Rust appears to be the opposite here: it focused first on providing higher-level async/await support rather than general coroutine support, but its async/await is implemented atop a coroutine abstraction that it does hope to expose directly someday.
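A quick sketch of what that layering means in practice (my own example, not from the thread): the stable surface of that compiler-generated state machine is the `Future` trait, and nothing stops you from driving it by hand with a do-nothing waker, with no runtime and no heap allocation involved.

```rust
use std::future::Future;
use std::pin::pin;
use std::task::{Context, Poll, RawWaker, RawWakerVTable, Waker};

// A do-nothing waker: enough to poll a future that completes without
// ever needing to be woken.
fn noop_waker() -> Waker {
    fn clone(_: *const ()) -> RawWaker {
        RawWaker::new(std::ptr::null(), &VTABLE)
    }
    fn noop(_: *const ()) {}
    static VTABLE: RawWakerVTable = RawWakerVTable::new(clone, noop, noop, noop);
    // SAFETY: every vtable function is a no-op, so the contract holds trivially.
    unsafe { Waker::from_raw(RawWaker::new(std::ptr::null(), &VTABLE)) }
}

// A hypothetical async fn; the compiler turns it into a state machine
// implementing `Future`.
async fn answer() -> u32 {
    42
}

fn main() {
    let waker = noop_waker();
    let mut cx = Context::from_waker(&waker);
    // Pin the frame on this function's stack and poll it by hand:
    // no executor, no reactor, no heap allocation.
    let mut fut = pin!(answer());
    assert_eq!(fut.as_mut().poll(&mut cx), Poll::Ready(42));
    println!("polled to completion without a runtime");
}
```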
I'm sure you don't need to be told most of this, but I bring all this up to help answer the more general question of why not every language builds in a green thread runtime, and why one approach is not necessarily strictly superior to another.
If generators, or embedded devices that don't have threads, are indeed the reason for picking one design over the other, the question then becomes: why did some languages prioritise those domains over domains that are more common even for them?
Indeed, to which the answer is: it's a dirty job, but somebody's got to do it. :) As long as C exists, it's worth trying to improve on what C does without giving up on C's use cases. Of course, that doesn't mean that all use cases are equally common, nor does it mean that a language like Rust will ever be as widely used as Java, nor does it mean that Java was wrong for integrating virtual threads (I think they're probably the right solution for a language in Java's domain).
A common theme in Rust development is the notion that no one should be able to produce more optimal code by hand. This is a great feature, but in the case of async/await we are sacrificing a lot to get it, to the extent that a user trying to make their first HTTP request with reqwest will now get conflicting documentation and guidance on whether they need tokio and other packages to pull in async support.
Can you explain how this is done? Is the current stack copied onto the heap (to the size it currently is)? How are new frames allocated once a thread is suspended?
A portion of the stack is copied to the heap when the virtual thread is suspended, and on subsequent yields those "stack chunks" are either reused or new ones are allocated, forming a linked list. When resuming a virtual thread, however, we don't copy its entire stack back from the heap to the stack; we do it lazily, installing a "return barrier" by patching the return address, so that as you return from a method, its caller (or several callers) is lazily "thawed" back from the heap. This copying of small chunks of memory into a region that's likely already in the cache is very efficient.
The entire mechanism is rather efficient because in Java we don't have pointers into the stack, so we don't need to pin anything to a specific address, and stacks can be freely moved around.