Update: I got this working, and dude, it's awesome in every way. This is the most substantial improvement I've seen yet. Most importantly, it massively reduces memory requirements. Thank you so much. I'll commit within a day or so and make sure to mention you on Twitter.
Hey, thanks a lot, that's awesome!
One more thing I thought about recently, but didn't get around to mentioning: you can probably reduce the input of your net to just the Y (luminance) channel (with UV-only output) to trim it further ;)
But that might already be what you are doing, for all I know. I am just really glad I could be of any help! And this feels like a "free-lunch" improvement.
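In case it helps, here is a rough sketch of what I mean by Y-in, UV-out. It is only an illustration, not your actual pipeline: I am assuming float RGB images in [0, 1] and the BT.601 analog YUV matrix, and the helper names (`make_training_pair`, `recombine`) are made up for this example. The net would only ever see `y` and only predict the two chroma channels; the original luminance is reattached afterwards.

```python
import numpy as np

# BT.601 RGB -> YUV matrix (an assumption; swap in whatever your pipeline uses).
RGB_TO_YUV = np.array([
    [ 0.299,    0.587,    0.114  ],
    [-0.14713, -0.28886,  0.436  ],
    [ 0.615,   -0.51499, -0.10001],
])
YUV_TO_RGB = np.linalg.inv(RGB_TO_YUV)


def rgb_to_yuv(rgb):
    """Convert an (..., 3) float RGB array to YUV."""
    return rgb @ RGB_TO_YUV.T


def yuv_to_rgb(yuv):
    """Convert an (..., 3) float YUV array back to RGB."""
    return yuv @ YUV_TO_RGB.T


def make_training_pair(rgb):
    """Hypothetical helper: split an image into net input (Y) and target (UV)."""
    yuv = rgb_to_yuv(rgb)
    y = yuv[..., :1]   # 1-channel input for the net
    uv = yuv[..., 1:]  # 2-channel target the net has to predict
    return y, uv


def recombine(y, uv_pred):
    """Hypothetical helper: attach predicted UV to the untouched Y, return RGB."""
    yuv = np.concatenate([y, uv_pred], axis=-1)
    return np.clip(yuv_to_rgb(yuv), 0.0, 1.0)
```

With this split the input layer takes 1 channel instead of 3 and the output layer produces 2 instead of 3, and the luminance detail passes through untouched because it never goes through the net.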