A natural, but wrong, assumption is that Reddit's engineers know what they're doing.
The founding Reddit team was non-technical (even Spez. I've been to UVA; it's not an engineering school, and Spez had never done any real engineering before coming to Reddit). They ran a skeleton crew of roughly ten people for a long time, none from exceptional backgrounds (one was from Sun & Oracle, after their prime).
The same group started with a Node front-end and a Python back-end monolith on top of a document-oriented data model (i.e., two un-normalized tables in Postgres held everything). Later they switched to Cassandra and kept that monstrosity, partly because back then no one knew anything about databases except sysadmins.
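For context, that two-table layout is the classic entity-attribute-value ("thing/data") pattern. A minimal sketch of the idea, using SQLite for self-containment; the table and column names are illustrative, not Reddit's actual schema:

    import sqlite3

    # Sketch of an entity-attribute-value ("thing/data") layout, similar in
    # spirit to Reddit's two-table Postgres schema. Names are illustrative.
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE thing (
            id   INTEGER PRIMARY KEY,  -- every object (link, comment, vote) is a "thing"
            kind TEXT NOT NULL         -- e.g. 'link', 'comment'
        );
        CREATE TABLE data (
            thing_id INTEGER REFERENCES thing(id),
            key      TEXT NOT NULL,    -- attribute name, e.g. 'title', 'url'
            value    TEXT              -- everything stored as an opaque string
        );
    """)

    # Inserting a link means scattering its fields across rows: no per-field
    # columns, no type checking, effectively no relational constraints.
    conn.execute("INSERT INTO thing (id, kind) VALUES (1, 'link')")
    conn.executemany(
        "INSERT INTO data (thing_id, key, value) VALUES (?, ?, ?)",
        [(1, "title", "Reddit is down"), (1, "url", "https://example.com")],
    )

The upside is that adding a new attribute never requires a migration; the downside is that the database can no longer enforce anything about your data.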
Back then they ran a cache for listing retrieval. Every "reddit" listing (subreddit plus the various pages: front page, hot, top, controversial, etc.) is kept in memcache, holding the top-level listing information (title, link, comments, id, etc.). The asinine thing is that cache invalidation has always been a problem. They originally handled it with RabbitMQ queues: votes come in, get processed, and the cache is updated. Those queues always got backed up, because no one thought to batch updates on a timer (or to use lock-free structures), and no one knew how to do vertical scaling; when they tried, it made things even harder to reason about. You know what genius plan they had next to solve this? Make more queues. Fix throughput issues by making more queues, instead of fixing the root cause of the back-pressure. Later, they did shard/partition things more cleanly (and tried lock-free approaches), but they never did any real research into the underlying problem: how to handle millions of simple "events" a day, which is laughable thinking back on it now.
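The fix no one reached for is plain time-based batching: accumulate vote events per listing and write the cache once per interval instead of once per message. A minimal sketch, assuming a memcached-style get/set client (the stand-in class, keys, and interval below are mine, not Reddit's):

    import time
    from collections import defaultdict

    class DictCache:
        """Stand-in for a memcached client (get/set interface)."""
        def __init__(self):
            self._d = {}
        def get(self, key):
            return self._d.get(key)
        def set(self, key, value):
            self._d[key] = value

    FLUSH_INTERVAL = 1.0           # seconds; one cache write per listing per interval
    pending = defaultdict(int)     # listing key -> net vote delta since last flush
    last_flush = time.monotonic()

    def on_vote(listing_key, delta):
        """Consume a vote event from the queue: just accumulate, don't touch the cache."""
        pending[listing_key] += delta

    def maybe_flush(cache):
        """Called on a timer: thousands of vote events collapse into a single
        cache update per listing, so the queue can't back up behind the cache."""
        global last_flush
        if time.monotonic() - last_flush < FLUSH_INTERVAL:
            return
        for key, delta in pending.items():
            listing = cache.get(key) or {"score": 0}
            listing["score"] += delta
            cache.set(key, listing)
        pending.clear()
        last_flush = time.monotonic()

With this shape, queue depth is bounded by event arrival rate rather than by cache write latency, which is exactly the back-pressure problem described above.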
That's just the listings. The comment trees are another big, bad piece. Again, stored un-normalized, but this time actually batched (not properly, though it's a step up). One "great" thing about un-normalized storage of trees is that there are no constraints on vertices. So a common failure was that the queue for computing comment trees would back up (again), because erroneous messages would never get processed properly, and the whole site would slow to a crawl while the message broker wasted its time on them.
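Concretely, the cure for a queue wedged on bad vertices is to validate up front and dead-letter anything unprocessable, rather than letting the broker retry it forever. A sketch of that validation step (the row shape here is hypothetical):

    from collections import defaultdict

    def build_comment_tree(rows):
        """Build children-lists from (comment_id, parent_id) rows.

        Un-normalized storage means nothing stops a row from referencing a
        parent that doesn't exist. Instead of letting such rows poison the
        queue with endless reprocessing, reject them up front.
        """
        ids = {comment_id for comment_id, _ in rows}
        children = defaultdict(list)
        rejected = []
        for comment_id, parent_id in rows:
            if parent_id is None:                 # top-level comment
                children[None].append(comment_id)
            elif parent_id in ids:
                children[parent_id].append(comment_id)
            else:                                 # dangling vertex: dead-letter it
                rejected.append((comment_id, parent_id))
        return children, rejected

    # Example: comment 4 points at a parent that was never written.
    tree, bad = build_comment_tree([(1, None), (2, 1), (3, 1), (4, 99)])
    assert bad == [(4, 99)]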
Later, they had the bright idea to move from a data center to AWS and break everything up into microservices. Autoscaling there has been bungled ever since.
There was no strong tech talent, and no strong engineering culture -- ever.
-------
My 2 cents: it's the listing caches. The architecture around them was never designed to handle checking so many "isHidden" subreddits (which still receive updates to their internal listings), and it's coming undone.
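To make that failure mode concrete: if visibility is checked per item at read time instead of being folded into the cached listing, every request re-pays for every newly hidden subreddit. A hedged sketch of the pattern (function and field names are mine, not Reddit's):

    def render_front_page(cached_listing, hidden_subreddits):
        """Filter a cached listing against the current hidden set at read time.

        Fine when the hidden set is small and stable; when thousands of
        subreddits flip to private at once, every single request re-pays
        this scan, while the cached listings themselves are never
        invalidated and keep receiving updates no one can see.
        """
        return [item for item in cached_listing
                if item["subreddit"] not in hidden_subreddits]

    listing = [{"id": 1, "subreddit": "pics"}, {"id": 2, "subreddit": "funny"}]
    print(render_front_page(listing, hidden_subreddits={"pics"}))
    # -> [{'id': 2, 'subreddit': 'funny'}]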
I read this as a pretty scathing dressing-down of incompetent engineering at Reddit. But after having breakfast, what I'm realizing (again) is that perfect code and engineering are not required to make something hugely successful.
I've been part of two different engineering teams that wrote crap I would cuss out but were super successful, and most recently I joined a team that was super anal about every little detail. I think business success only gets hindered at the extremes; if you're somewhere in the middle, it's fine. I'd rather have a buggy product that people use than a perfect codebase that only exists on GitHub.
Agreed. In fact, I believe success stories skew the other way: those who build something that actually gets off the ground and succeeds will, in many cases, not have had the time to write perfect code.