Interesting, what do you mean by "inductive indexing"? 10 million inserts per second sounds pretty exciting.


The 10 million/sec is largely a consequence of the insert path being highly pipelined all the way through storage, without much secondary structure. An EC2 instance like an i3en clearly has the end-to-end hardware bandwidth to support that write rate, and the usual idioms (thread-per-core, user-space I/O, etc.) will get you there with a good design. Of course, you still need to insert into that pipeline an index that captures every searchable column without slowing things down too much, or query performance will be nonexistent.
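
A minimal sketch of that kind of insert path, assuming hash-partitioned, append-only segments with one writer per core (the names, record layout, and Python threading stand-in are all hypothetical; a real engine would pin threads to cores and use user-space I/O):

    import os, queue, struct, threading

    # Hypothetical fixed-size record: (timestamp, entity_id, value)
    RECORD = struct.Struct("<qqd")

    class Shard(threading.Thread):
        """One writer per core: a private queue and a private append-only
        segment, so shards never contend on shared structure."""
        def __init__(self, path):
            super().__init__(daemon=True)
            self.q = queue.SimpleQueue()
            self.fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_APPEND, 0o644)

        def run(self):
            batch = []
            while True:
                batch.append(self.q.get())          # block for the first record
                while not self.q.empty() and len(batch) < 4096:
                    batch.append(self.q.get())      # drain opportunistically to amortize syscalls
                os.write(self.fd, b"".join(RECORD.pack(*r) for r in batch))
                batch.clear()

    shards = [Shard(f"segment-{i}.log") for i in range(os.cpu_count() or 1)]
    for s in shards:
        s.start()

    def insert(ts, entity_id, value):
        # Partition on entity_id so each segment stays a pure sequential append.
        shards[hash(entity_id) % len(shards)].q.put((ts, entity_id, value))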

Inserting an index into that pipeline means the data must be organized around a single primary index covering every key column you want to query (no secondary indexes), and that you generally cannot use a classic multidimensional spatial index to cover multiple columns: when the columns have fundamentally different distributions of data and load, query selectivity degrades materially. A canonical example is trying to index time, space, and perhaps an entity id in the same index -- e.g. GPS probe data models, which can be both extremely large and extremely fast.
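
To make the selectivity failure concrete, here is a toy illustration (the key layout is hypothetical, not anything from the post): a Morton / Z-order key interleaving time and a spatial cell. A query that is narrow in time but wide in space does not map to a tight key range, because the contiguous Z-order range between the query's corner keys also covers keys whose time lies far outside the window:

    def morton2(a, b, bits=32):
        """Interleave the bits of a and b into a single Z-order key."""
        key = 0
        for i in range(bits):
            key |= ((a >> i) & 1) << (2 * i)        # time in even bits
            key |= ((b >> i) & 1) << (2 * i + 1)    # space in odd bits
        return key

    # Narrow in time (one hour), wide in space (every cell); numbers illustrative.
    t0, t1 = 1_700_000_000, 1_700_003_600
    lo = morton2(t0, 0)
    hi = morton2(t1, 2**20 - 1)
    # Every matching row falls inside [lo, hi], but so do vast numbers of
    # non-matching keys, so the scan degrades toward a full scan.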

The general strategy is to build an indexing structure that adaptively compresses away the non-selective parts of the feature space across all dimensions, regardless of the characteristic distribution of each data type. The idea is fairly old and follows from the theoretical equivalence between indexing and AI: whereas most succinct indexing structures effect compression via coding theory, this is compression via lossy pattern induction. Storage engines have to be specifically designed for this (ab)use case, and you cannot do any kind of short-circuit query with these indexes -- you always have to look at the underlying data the index points to. Generalized induction is pathologically intractable, so these are efficient, practical approximations. The write (and query) throughput is extremely high because the index structure often fits entirely in CPU cache even for very large data models. AFAIK these architectures have been used in production for around a decade, but current versions are qualitatively better than the early ones.
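
I can only guess at the actual structures, but a toy analogue of a lossy, verify-always index is a per-block summary that admits false positives and no false negatives: it is small enough to stay cache-resident, it prunes the non-selective key space by construction, and every hit still has to be checked against the underlying rows (a sketch under those assumptions, not the production design):

    class BlockSummaryIndex:
        """Lossy index: (min, max) per column per block. It may return blocks
        containing no match (false positives) but never misses a block that
        does, so every probe ends with a scan of the surviving blocks."""

        def __init__(self, blocks):
            # blocks: lists of (time, entity_id) rows; the layout is hypothetical
            self.blocks = blocks
            self.summaries = [
                ((min(r[0] for r in b), max(r[0] for r in b)),
                 (min(r[1] for r in b), max(r[1] for r in b)))
                for b in blocks
            ]

        def query(self, t_lo, t_hi, e_lo, e_hi):
            out = []
            for (trange, erange), block in zip(self.summaries, self.blocks):
                # Cache-resident pruning: skip blocks the summary rules out.
                if trange[1] < t_lo or trange[0] > t_hi:
                    continue
                if erange[1] < e_lo or erange[0] > e_hi:
                    continue
                # No short-circuit: always verify against the underlying data.
                out.extend(r for r in block
                           if t_lo <= r[0] <= t_hi and e_lo <= r[1] <= e_hi)
            return out

The real designs presumably induce much richer per-block patterns than fixed min/max ranges, but the always-verify contract is the same.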

That said, these indexes still suffer from the write sparsity problem that all cache-mediated, index-organized storage has; they just push that performance wall out a couple of orders of magnitude. Those limits are already visible on the ultra-dense storage available today. There is a radically different research database architecture that appears to qualitatively breach this limit, but I don't expect it to show up in production software for at least a few years. It isn't fully baked, but people are working on it.
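
A back-of-envelope illustration of why that limit exists (numbers hypothetical): once the set of hot index pages outgrows cache, each random insert dirties a mostly-cold page, and the device ends up writing a whole page per record:

    page_size   = 4096    # bytes per index page
    record_size = 24      # bytes of payload per insert
    # Cache-resident: hundreds of inserts coalesce into each page write (~1x).
    # Cache-overflowed: one cold page dirtied per insert.
    print(page_size / record_size)   # ~170x write amplification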

Virtually all of the research in this space happens completely outside of academia, so it rarely gets published, and research groups stopped filing patents over a decade ago because they were effectively unenforceable. It has turned into a boutique business.


How do I learn more about all of this? Where do I go to work?



