Databases, Types, and the Relational Model: The Third Manifesto [pdf] (warwick.ac.uk)
178 points by fipar on Oct 4, 2021 | 93 comments


This is purely about database language semantics, not other aspects of database systems. In other words, it's a manifesto about languages for databases, and a criticism of SQL in particular.

The challenges of a database language are different from those of a general-purpose language. Queries live a long time, spread across all kinds of applications (and versions of them), reports, and scripts. And the data is always growing and changing. Constraints and declarative queries help bridge the concerns of the readers and writers of the data.

I read quite a few books by Date & Darwen, in particular, Temporal Databases and the Relational Model. Out of that came my work on Exclusion Constraints and Range Types for PostgreSQL, which allow you to deal with ranges of time (with beginning and end) much more effectively. So I believe the authors have good insights.

But they are definitely in an ivory tower. They make strong claims about superiority of one approach over another without connecting with practice. You can feel their frustration that their ideas aren't put into practice, but they underestimate the difficulty of doing so. And I suspect that to bridge that gap enough to see a net benefit over SQL would require some revisions and amendments to their theories.

I believe Haskell has some really nice answers to problems in database languages, and that database queries would be a great application area for a Haskell-like language. It's too bad traditional PL theory people and DB people don't mix a little more.


Hello, author of a Haskell library for Postgres[1] here!

I agree with you. Darwen and Date have some very weird requirements, notably:

1. Requirement that "relations" should be sets not bags. Having one means you automatically have the other[2]. There's no point going out of one's way to forbid bags.

2. Requirement that the contents of "relations" should be "tuples". The contents of a "relation" should be allowed to be any data type, whether compound or not.

3. Completely ignoring sum types.

4. Forbidding NULLs even though there's an easy way to permit them sensibly: have nullability be part of the type.
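
(A minimal sketch of that idea in SQL terms — hypothetical table; the point is that the query language's type system would track the declared nullability, treating a nullable column like an Option type rather than via three-valued logic:)

    CREATE TABLE users (
        id  integer NOT NULL PRIMARY KEY,  -- non-nullable: plain integer
        age integer NULL                   -- nullable: conceptually Option<integer>
    );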

Overall, Darwen and Date's language design is poor from the point of view of both compatibility with real-world implementations and current practice in programming language theory.

I think my package Opaleye is a good candidate for what Darwen and Date's proposed "D" language should have been.

[1] https://hackage.haskell.org/package/opaleye

[2] http://h2.jaguarpaw.co.uk/posts/set-bag-irrelevance/


Date & Darwen didn't actually design a language, just a set of prescriptions and proscriptions.

I agree with you on points #3 and #4. Of course they'd argue that NULLs and sum types are possibly conforming, but to not discuss them at all is a major omission.

They do barely hint at the idea of sum types, in An Introduction to Database Systems, 18.6, "Special Values". That's 3 paragraphs, with no examples, and a reference to another work. I already have enough of their books and didn't bother to get that one, in part because the three paragraphs describe the approach as "not very elegant". It was also published in 1998, well after ML-like type systems were in practice, so it's very strange that Date & Darwen tried to go about it on their own.

Regarding your points #1 & #2, I'm not sure I agree. When it comes to computing, any type can be used to represent any other type as long as the domains are equally large. But that's not a very useful discussion when we're talking about languages. Changing from bags to sets or vice versa changes the kinds of operations available and their meaning. If we're talking about the relational model, then let's use relations and tuples. Maybe you don't want to always use relations, and if the system has good support for bags, then great, but it's not quite the relational model and may lose some of the benefits.

Regarding point #2 specifically, you may find Appendix B in The Third Manifesto interesting (page 401 in the linked PDF).

Overall, I think they had some good insights, and they influenced the way I look at databases in a positive way. I also had a very positive experience the one time I interacted with Hugh Darwen, where he showed some genuine curiosity about my question and took it seriously. I'm glad I followed their work, but I'm also glad I didn't join them in their ivory tower.


Regarding #1, sets and bags are both widely used in practice so restricting to just one or the other seems unnecessarily limiting. If there are benefits to restricting oneself to sets then one can restrict oneself to sets and get those benefits. The only possible drawback I can see is making the API surface area too large and unwieldy. But I don't think that applies here.

Regarding #2, that's interesting, they've done the same sort of "if you have one you have the other" analysis that I did for sets vs. bags, and concluded that you only need the more general one. I have to say I don't follow their rationale. It still seems to me that relations should have a single field. If you want to simulate "multiple fields" then make the "single field" a tuple. That's what Opaleye does and it works very naturally.

As an aside, thanks for your work on Postgres. It's a great system!


> It's too bad traditional PL theory people and DB people don't mix a little more.

The only ones who do mix seem to be the ones who work with type theory or category theory! I predict that we’ll see an absolute revolution in about 20 years, after successive cohorts of undergrads are exposed to this stuff and slowly carry it out into academia and industry.


I really wish Date had any understanding of modern type theory, since that's the basis of his desired reform of SQL. His typing model is 50 years out of date.


Heck, I wish he even had any understanding of the use of NULL in real-world databases. He spends books arguing, very compellingly I might add, against the use of NULL in SQL DBs, but when asked to come up with an alternative to NULL, his suggestions are sentinel values and splitting out every nullable attribute into its own table. Both of these solutions are trivially rejectable.


Splitting every nullable attribute into its own table is not "trivially rejectable"; you absolutely can do this to represent sum types (and it doesn't mean that those should actually be stored separately: the relational model is a logical model, so it doesn't matter if a table is implemented with one B-tree, with two, or somehow else entirely).

The main problem is that you have to manually type JOINs for all such sum-typed properties to do any useful queries/updates — and there really is no other way than to write them anew by hand every time, because relational theory is just the intuitionistic first-order predicate calculus. You'd really want some second-order tools to group those tables together... but then you essentially end up with tables with sum-typed columns, and since the main idea of the relational model was to never step out of the first-order predicate calculus, the reasonable solution is probably to just introduce sum types as actual types.


Maybe trivial was the wrong word, but your second paragraph is exactly why I reject the approach of splitting out all nullable values into their own relation. Yes, from a theoretical data-modeling perspective it works, but from a pragmatic perspective, having to join everything back together again is a worse situation than just thinking through the behavior of null in that situation.

But even then things are awful. What do I do when I want to include the value from the nullable table in my result, but I want all of my records, not just the ones with non-null values? So even though you've purged nulls from your data model, you're still bringing them back with an outer join.
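
(Something like this hypothetical sketch, where age was split into its own table:)

    -- NULL reappears via the outer join, for users with no user_age row
    SELECT u.id, a.age
    FROM users u
    LEFT OUTER JOIN user_age a ON a.user_id = u.id;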


Joining the thing back together again is not so bad with recent multi-way join algorithms.

But then again, the resulting model is factually a graph database and far far away from resembling something like SQL or Relational Calculus.

(Which might not be a bad thing though.

"You either die NOT NULL or live long enough to see an instance where you really wish that column was nullable.")


It's the implicit semantics of the tables. We start with modelling the world itself: "user with id ID is AGE years old and lives at ADDRESS", and end with modelling our particular instance of knowledge of the world: "user with id ID exists", "user with id ID is known to be AGE years old", "user with id ID is known to live at ADDRESS"; and with the "closed world principle" you can explore more of the limits of the knowledge: "which users do we know to exist, and know their addresses, but don't know their ages?". But the query "what is the id of the oldest user?" is simply unanswerable unless you know all of their ages; the best you can generally do is answer "what is the id of the oldest of the users whose ages we know?".

It's when we start confusing the map and the territory that a lot of miscalculations start to happen.


I think you meant to write "open world principle", as every non-stated fact is implicitly false under the closed world assumption.

But yeah, I agree that databases should always be thought of as a model of the world (crisp), and not as the world itself (fuzzy).

But it's also noteworthy that this is a leaky abstraction, and that any kind of database that has to operate in the real world (in contrast to say an entity component game engine) will face this leakiness.

The only way I see to resolve this is to turn the database into something that remains crisp in the face of fuzziness, e.g. by storing only observations or interpretations of the real-world processes. Since an observation is a "subjective" statement made from the database's perspective, it holds true regardless of the actual "objective" condition.

It's just not easy to program and model in such a system.


Sounds like a restatement of the open world / closed world debate.


Well, what's the semantics of your result? Say you have

    TABLE UsersWithKnownAge(id ID, age INTEGER)
    TABLE UsersWithUnknownAge(id ID)

    CONSTRAINT UsersWithKnownAge JOIN UsersWithUnknownAge IS EMPTY
with obvious predicates: "User with id ID is known to be AGE years old" and "User with id ID is unknown". You join them and get the relation with the predicate "User with id ID..." How do you continue?

"We know AGE_KNOWLEDGE about the age of the user with id ID", where AGE_KNOWLEDGE is Option<Integer> that has semantics "number is unknown" or "number is known to be NUM". Okay. What information about the whole dataset can you extract from this join?

If your query needs to know the users' ages, you can't execute it unless you restrict yourself to querying about UsersWithKnownAge, in which case, just run it on UsersWithKnownAge.

If your query doesn't need to know users' ages, you can run it on "UsersWithKnownAge(id) UNION UsersWithUnknownAge(id)", not on an OUTER JOIN.
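
In SQL terms, using the tables from above, that's something like:

    SELECT id FROM UsersWithKnownAge
    UNION
    SELECT id FROM UsersWithUnknownAge;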


Nice toy example.

> What information about the whole dataset can you extract from this join?

You could count the number of users.


> splitting out every nullable attribute into its own table

One property of 6th normal form is that null ceases to exist by way of optional joins. This is a good thing.

If you ignore the physical data model for a few seconds, you might see how this can become a deterministic path to modeling success in any problem domain of arbitrary complexity. You aren't able to make the kinds of assumptions that would typically screw you over in the future if you follow 6NF properly. De-normalization for performance reasons should be undertaken only after you have perfectly modeled the domain in logical terms.
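
(A minimal sketch of that kind of decomposition — hypothetical tables; each optional attribute gets its own relation keyed by the entity:)

    CREATE TABLE users        (id integer PRIMARY KEY);
    CREATE TABLE user_age     (user_id integer PRIMARY KEY REFERENCES users(id),
                               age integer NOT NULL);
    CREATE TABLE user_address (user_id integer PRIMARY KEY REFERENCES users(id),
                               address text NOT NULL);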


And at this point you may as well be running Datomic. I remember the first time looking at Datomic the whole thing felt like it was coming out of left field. A few years later of data modeling conversations like this and you start to see Datomic is really every fix to the relational model taken to the extreme.

It also explains how we got column databases. If the problem with 6th normal form is the underlying storage, then change the underlying storage to optimize for 6NF.

In a way, the very existence of the normal forms can be taken as a sign that something is seriously wrong with the relational algebra given how easy it is to create a data model that will produce rubbish for certain queries.


> In a way, the very existence of the normal forms can be taken as a sign that something is seriously wrong with the relational algebra given how easy it is to create a data model that will produce rubbish for certain queries.

I am of a growing opinion that normal forms below 6th (i.e. 1-5) are simple mistakes taken as performance optimizations. The mathematical side of me says to keep going, too - I suspect that there is a "normal form" beyond the 6th that says something along the lines of how all keys must be synthetic and that there is no such thing as a business or domain key. I strongly believe the relational algebra/calculus as applied to domain modeling hasn't quite reached a fundamental conclusion yet.

> If the problem with 6th normal form is the underlying storage, then change the underlying storage to optimize for 6NF.

You're god damn right. There is zero reason we cannot build storage layers that can handle this kind of logical layout. My current favorite solution is to simply event source all the facts and keep an in-memory copy of the active business state. You can store pointers to strings and other BLOBs so you don't bloat RAM over something you could stream from NVMe on demand. Most of the pain in higher normal forms is with the join, so having a working set that can fit into RAM is one way to partially address this problem.


Can you give examples of things this misses?


Always an interesting read. A ton of insights, but I always find it a bit hard to make sense of, and it feels so disconnected from a modern distributed systems context.

In retrospect, I think a lot of the Spark SQL Dataframe workflow comes pretty close to what D/Tutorial D aspire to - static typing, functions on relations, imperative style but fundamentally unordered, etc.; however, it's only a processing system, not a storage system.

I have kept my distance from the "data lake" buzzword circles, but maybe a transactional, Spark-based data lake does approximate what Darwen/Date are going for? The only thing really missing might be nested relations?


I haven't read it but I understand where you're coming from here.

Does this doc talk about the problems with nullability / ternary logic? What about algebraic sum types? Those have always been some of the most difficult aspects of relational data modeling, at least with respect to SQL.


Again, what's the difference between a data warehouse and a data lake? Honest question.


As I understand it, a data lake is a storage space for unstructured data. A data warehouse is a storage + compute layer, usually with data sourced from a data lake, that is ready for querying. This understanding comes from the description in this paper[1]

> To solve these problems, the second generation data analytics platforms started offloading all the raw data into data lakes: low-cost storage systems with a file API that hold data in generic and usually open file formats, such as Apache Parquet and ORC [8, 9]. This approach started with the Apache Hadoop movement [5], using the Hadoop File System (HDFS) for cheap storage. The data lake was a schema-on-read architecture that enabled the agility of storing any data at low cost, but on the other hand, punted the problem of data quality and governance downstream. In this architecture, a small subset of data in the lake would later be ETLed to a downstream data warehouse (such as Teradata) for the most important decision support and BI applications.

[1] http://cidrdb.org/cidr2021/papers/cidr2021_paper17.pdf


Reasonable enough I guess. Thanks!


In my experience, a data warehouse usually has an ETL process at the beginning. Data comes in from disparate sources and on a regular basis, it is ETLd into a shape that is ready to use by the business. On the other hand, a data lake slurps in all the data as soon as it is available, in whatever form it is in. You have to process it into the business-consumable form when you query/egress it, but you don't have to know your dream schema up front.


My experience is similar: extract process -> raw data -> clean/merge -> model

Normally you extract from source, then load to destination. There is no business logic in this process.

From raw you do all of your transforms to clean up and merge, and then get it into a usable model. With big data sets I've done this with Hadoop and then moved the cleaned/merged data to a standard or MPP DB for analysts. For normal-sized sets this can all be done in a standard DB.

The other part is that all the data, from raw through clean/merge, is kept and available for analysts to use, the thinking being that storage costs are extremely low and heading to zero. Whereas in a traditional DW, analysts use only the modeled sets, and depending on the data, earlier sets are deleted since they are for operational purposes only; storage is considered expensive and limiting.

The move to ELT and using a declarative dataops tool has been mind bending and has been a multiplier in terms of speed to get to something usable. I don't want to see another DW again.


Data lake: you own your files. Data warehouse: database owns your files.


I'm 2 years into my career. Is there a benefit in reading a database book cover to cover? The book in question is Database Systems by Garcia-Molina.

I’ve found the material so far interesting. But I’m not sure how much value and how much information there is for me to possibly retain.


I often see people on HN recommend the book, Designing Data-Intensive Applications (2017). I've personally been chewing on the material for a while now, gaining new insights.

Here's the table of contents: https://www.oreilly.com/library/view/designing-data-intensiv...

It seems to cover roughly the same areas and range as the book you mentioned, Database Systems: The Complete Book (2008). http://infolab.stanford.edu/~ullman/dscb.html


Yes, yes, yes, yes, yes! DDIA is amazing. It is _the_ premier reference manual, as far as my brain and I are concerned. It's the only "computer book" I keep next to my computer. (Not that I reference it frequently enough to justify its proximity; it's mostly so I can tell people "DDIA is the only 'computer book' I keep next to my computer." But still... it's great.)


Turns out I am a fool who let his enthusiasm get in the way of being right. I have DDIA on my Kindle. Next to my computer is a book that obviously makes much more sense to be next to my computer, Systems Performance: Enterprise and the Cloud (SPEC) by Brendan Gregg.

Still, DDIA is an incredible resource. But the Kindle is actually better for my usage pattern. I wish I had SPEC on Kindle.


My advice: yes. Get a good book and an interactive SQL editor. Learn the concepts and try them out. Play, ask questions and try to answer them. Skim over stuff that doesn't interest you at the moment. Write stuff down.

Then after a while you'll encounter new concepts (a new book, an online comment, an interesting tech talk etc.), have new ideas or forget/get rusty on old ones, then you repeat the above.

Continuous learning is part of the job IMO. I did the above several times and will do them several times again in some fashion or another.

For me it has had real value - repeatedly - that I didn't even know existed before I accidentally came across a situation where I could apply some concept I learned this way.

Sure, the first time you apply something "in anger" you will get into the nitty-gritty details, which is work. But you cannot even recognize an opportunity if you haven't prepared your mind for it and played around with the ideas and concepts beforehand - unless you have a significant time budget for this kind of thing on the project, or a colleague who can introduce it, but my recommendation is not to rely on that.


You don't need to master it, but it serves as a map in your head. In the worst case, it will give you an intuition for how things work. The "retain" part isn't essential if you have at least the hints. Even if they are few, they are far more helpful than not having a clue at all.

I always recall two fundamental programming books and one DB design book I read years ago. And still, for today, my mind comes back to them, and it makes me faster when looking for documentation or just googling things more properly :)

Knowing a complex subject, even not deeply, is always a benefit.


you and i see i to i on that <g>


I know two DB bibles which explain how a DB engine actually works. It's much more interesting reading than another SQL manual. But it's not awfully practical if you have no particular interest in DB technology. And both are pretty dense.

https://www.amazon.com/Database-Systems-Complete-Book-2nd/dp

https://www.amazon.com/Database-System-Concepts-Abraham-Silb...

There's a very cool not-quite-alternative: https://leanpub.com/how-query-engines-work. It covers a fair chunk of DB technology, but not storage. Definitely check out their repository at https://github.com/andygrove/how-query-engines-work/tree/mai... .

A companion to DDIA would be https://www.amazon.com/Database-Internals-Deep-Distributed-S... (especially its treatment of LSM trees which is harder to come by).


I would suggest something more directly applicable. Textbooks are useful, but if you're looking for something to get the most out of today:

SQL Performance Explained

Art of SQL

Database Design for Mere Mortals


Huge +1 to Database Design for Mere Mortals. This was the second database book I read (the first was SQL Antipatterns, not a bad book but also not a good first book) and far and away the most helpful. It took me 3 days to read, I had a miserable flu after flying home from a vacation (pre-COVID), and I was still able to follow the author without any issue. I highly suggest a physical copy & writing in the margins, underlining etc, though that's something I do in general so YMMV if you don't get anything out of that.

As far as "technical" books go, it's very nontechnical, and if you're already very experienced & good at normalizing it maybe won't help you THAT much, but for someone who's at all junior, I can't recommend it enough. The biggest takeaway (beyond, "what's a normalized database look like") is to keep questioning the real-life system you're working in & to ask questions of stakeholders who understand it better than you, but me saying that isn't at all doing it enough justice compared to how well the author elaborates on the point. (btw, Data And Reality is also good for this, though a lot more theoretical in approach)

If you're really new to databases you'll want to supplement this with something that does normalization more formally, I don't have a great single suggestion for that. I read two of C.J. Date's books, honestly I found them mind-numbingly tedious to get through but I can't say I didn't learn from them. But...I also can't really recommend them because I was so bored by them that it made me stop reading altogether for like 4 months because I was just putting off finishing these books (LPT: always be in the middle of 2 books at the same time, unfortunately for me, both of my books were these lol). I did also read Art of SQL, it was both enjoyable & interesting, but maybe not what you're looking for if you want something super structured that will take you from point A to point B.


If you weren't aware, Database Design for Mere Mortals 25th Anniversary Edition was released last December!


I had seen that! Do you think it's worth buying & reading the updated version? As much as I loved the original, I'm not sure how much I'd get out of reading a book that's 90% the same content at this point. Though tbf I guess I wouldn't mind just supporting the author so maybe I'll pick up a copy and skim it at least.


The first thing I try to do upon encountering any technology/stack people recommend everywhere is try to figure out what’s wrong with it.

I then skim through anything I can find that’s relevant, and dig into the parts I didn’t fully understand (or skimming didn’t fully answer), until I can provide a comprehensive answer as to where it shouldn’t be used.

In general, the implementation details from this skimming are only relevant as per the black-box abstraction — I only care about what workloads and inputs the stack falls over on. The implementation itself I only keep around because:

1. I trust the looks of this implementation, so if I need it I know where to start

2. I can give a very superficial explanation of the implementation, just enough to defend my answer as to why it falls over in the workloads it falls over on.

So for databases, as an example, I care about the fact that

1. it stores by row (because this is the key differentiator from column-stores, and is the key as to why OLAP and OLTP have different preferred workloads — and why they can make different optimizations)

2. the indexes and roughly why they work/excel/fail (again, their existence changes preferred workloads)

3. The SQL compiler produces a "plan", and this plan is based on heuristics and estimates — so it shouldn't be blindly trusted. You can view the plan and review it (see the sketch after this list).

4. ACID, its guarantees, and implementation only as far as why can’t MongoDB and friends have it too (and why you might want to drop it — where does it only add overhead)

5. Probably other stuff not on the top of my head
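
For point 3, a minimal sketch (PostgreSQL syntax; the table and column names are made up):

    EXPLAIN SELECT * FROM orders WHERE customer_id = 42;
    -- prints the chosen plan (e.g. index scan vs. sequential scan)
    -- with cost estimates derived from table statistics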

Everything is a trade-off. The goal is to know what trade-offs you’ve actually made.


Yes.

I started programming long before the internet and read cover to cover every computer book I could get my hands on. Subsequent years had me prepared with a huge background of ideas I could apply to problems I encountered.

However the depth of understanding I sought differed depending upon the book.

Some books I dug into intensely. When the SICP came out I wrote my own Scheme interpreter (using the information in SICP) just so I could play with the examples.

Other books I just skimmed and indexed (in my memory) for possible future reference. The 80s 12 volume Systems Programming series by IBM is an extreme example.

Before the web I had accumulated 50 bankers boxes of computer books that I had "read" cover to cover. When they were on my shelves it often happened that when I had a problem to solve I could remember which book(s) had relevant information. I was my own search engine!

So, yes. If you are interested in databases start by skimming the paper. You might decide to read for details or you might not. Regardless, the more you know the more you can synthesize solutions when needed.


While that would probably be very valuable, it will probably come with lots of implementation details that won't really matter to you unless you plan to write a database yourself.

I'd suggest this one: https://dataintensive.net/ It's intended for _users_ of databases, that is, developers incorporating databases (and other kinds of data systems) into their applications. It gives just enough explanation of how those systems work and what they can (and can't) provide, and also gives a more general overview of what's available beyond relational SQL databases.


I would also recommend "Database Internals: A Deep Dive into How Distributed Data Systems Work", which gives a good overview of the implementation of a database. I felt it was closer to an engineer's perspective.


Not really, I've learned a lot from the PostgreSQL documentation and various Stack Overflow answers.

Like with everything, getting experience and your hands dirty will make you far more knowledgeable. There is no silver bullet.


> What makes this book different? We feel there are two key features that make it stand out: its strong focus on the relational model, and its thorough treatment of type theory (and together, of course, the relational model and type theory are precisely the foundations referred to in the previous paragraph). To amplify:

> * The relational model: We must stress that what we are not doing is proposing some kind of “new” or “extended” relational model. Rather, we are concerned with what might be called the “classical” version of that model; we have tried to provide as careful and accurate a description of that classical model as we possibly can. It is true that we have taken the opportunity to dot a few i’s and cross a few t’s (i.e., to perform a few minor tidying activities here and there); however, the model as we describe it departs in no essential respects from Codd’s original vision as documented in references [20-22]. (Indeed, we even toyed at one point with the idea of calling the book A Relational Model of Data for Large Shared Data Banks.)

> * Type theory: Relations in the relational model have attributes, and those attributes in turn have types (also called domains). Thus, the relational model certainly assumes that types exist. However, it nowhere says just what those types must be, nor does it have much to say about what properties those types must have; in other words, the relational model and type theory are almost completely independent of each other (they are orthogonal, to use the jargon). In this book, we present a detailed theory of types that complements the relational model, and we explore some of its implications. In particular, we show that our type theory serves as a basis for a comprehensive model of type inheritance.


On the subject of types, I’ve often wondered why relational databases don’t have sum types. Lots of real-world data models have these kinds of relationships, and certainly “modeling data” is well-within the purview of a database. I know that ORMs exist, but they tend to be very bad and there’s no compelling reason to have to break off part of the database responsibility, implement it terribly (it’s divorced from the query planner, so even the best ORMs are subpar) for each application language, and put it on the client side.

Yes, I understand that sum types affect the way the data is arranged, but the same approaches that are used by programming languages to model sum types in memory can be used by databases to model sum types on disk and in memory (it’s all just byte vectors at the end of the day).


I'm making a relational lang (https://tablam.org) and this was one of the question marks for me.

The #1 reason sum types were not present is that "nobody" thought (seriously) about them for a long time. See how long "null" has been with us, and you still find defenders of it!

The second and broader problem is that the relational model was NOT implemented in full by the RDBMS makers. To make sum types work, you NEED to support embedding relations inside relations:

    rel Invoice
    rel DetailInvoice

    Invoice[invoice_id:1 customer:123 lines: DetailInvoice[
     [invoice_id:1 line_id:1 price:10 qty: 1],
     [invoice_id:1 line_id:2 price:10 qty: 2]
    ]]
Then, with this, sum types come for free:

    type Option
      Some ?
      None

    [customer_id:1  phone: [tag:Some 123-23456]]  
    [customer_id:2  phone: [tag:None]]  
This will make things like storing Logs, metrics and event sourcing so much nicer!

How to store that is another matter, but I think it's not that complicated; it's just that none of the major RDBMSs have put effort into this.

This is one of my dreams: making an RDBMS with this and other stuff that pushes DBs toward the modern way of making apps...


Frankly, understanding of what the relational model actually is has been missing from the industry even among many people who have made data management their primary focus. IMHO a lot of the penetration of the NoSQL stuff in the last decade was based on misunderstandings of the capabilities and potentials of the relational model.

I think the early history of SQL was too influenced by ISAM-type stores and COBOL flatfiles and the like. Hence the vocabulary around "tables" and "columns" instead of the relations, relvars, tuples, etc. described by the actual relational algebraic model. To many of those with an OO fetish, "relational model" became synonymous with "tabular model", and the industry's early-2000s obsession with object polymorphism, OO/RM middleware, etc. continued the stall in relational database innovation.

I feel like there is now renewed interest in this stuff, though. Your project looks really neat, for example.


I think your analysis is correct. I'd also add that it is related to the languages too: you point to COBOL, and I'd also add C, Pascal, Java, etc.

When they have nulls, and little to zero query, relationship, and data-modeling capabilities, it's not hard to see how the way they are is reflected in how the DBs are too. I honestly didn't know about sum types until a few years ago, nor had a clue how useful they are, so surely even today many are in the same blind spots...


> To make sum types work, you NEED to support embed relations inside relations

If that is the case, sum types would be in violation of first normal form.


Even considering that normal forms are tools...

Sum types, how they are stored, and how they are used are not the same thing, i.e.: how sum types are represented in languages like Rust/OCaml is tangential to how we experience them.

So, support for embedding relations is the step that provides the storage and, partially, helps with querying, but from the POV of the user they are "normal sum types".

But with that, you also get the ability to store relations, which is useful as-is :)


Oddly enough, 1NF is defined differently depending on which person or database you're asking/using.

Is it

  . just to be rectangular?
  . values are atomic? (and what does that mean)
  . no such thing as nulls? good luck with that
  . no duplicate rows (good luck with enforcing that in intermediate tables)


E.F.Codd defines first normal form as eliminating nested relations (in "A Relational Model of Data for Large Shared Data Banks" which introduces the relational model). "Atomic" in this context just means any value which is not a relation.

A relation by definition cannot have duplicate rows, since relations are sets. If "rectangular" means that all rows in a table have the same columns (attributes), this also follows from the definition of relations. So these constraints have to be fulfilled before we can even talk about normalization, since normalization operates on relations. (Nulls are their own can of worms, but don't really have anything to do with 1NF.)

Codd introduces 1NF because he realized supporting nested relations would complicate the model and query language for no benefit. It would need special operators to navigate in and out of nested relations, e.g. if you wanted to join tables nested at different levels in other tables. You would need something like XPath on top of SQL. But you can express exactly the same information using foreign keys between tables and keep the language simpler.

The sum-types suggestion shows the problem. If we have sum types as values, presumably they would be composed of other types, right? So sum types could contain other sum types, arbitrarily deep. Since you surely would want to query these individual values, you would need operators in SQL to navigate down into sum types. You would need facilities to index and apply constraints, again arbitrarily deep down the structure. So the database model and query language get significantly more complex for dubious benefit.


> So the database model and query language gets significantly more complex for dubious benefit.

Sum types are definitively not a "dubious benefit". Neither is nesting relations (which is not that different from joining them, and provides better semantics than flattening them into a LOT of null columns).

Will it complicate matters and require extensions? Sure. But it's clear that, today in this century, a lot of use cases exist where people resort to pseudo-NoSQL (JSON embedding, to name one of the biggest anti-patterns), and that creates more inconsistent features, extensions, weird extra syntax, etc. that are required for it anyway (and could have been cleaner under this idea).

With this, we can get the same benefits and much more.

P.S. I'm not against having a very simple core and trying to model stuff that way, but the major point of a model, like the relational one, is to support how to MODEL.

And we need to model nested things, and we need to model variants and get rid of nulls and all that. So, sometimes, you need to add complexity +1 to get complications -10.


> To make sum types work, you NEED to support embed relations inside relations

You don't need to embed relations. You don't even need to embed tuples or other composite types, you can just model sum types using a pair of fields. The latter is what Opaleye and Rel8 (Postgres libraries for Haskell) do:

https://hackage.haskell.org/package/opaleye-0.7.1.0/docs/src...


To reply to both my respondents, the link goes exactly to the source for Opaleye's version of the Maybe data type definition. Neatened up, it is

    data MaybeFields fields =
      MaybeFields {
        mfPresent :: Field SqlBool
      , mfFields  :: fields
      }
That is, an Opaleye Maybe (MaybeFields) is a pair. The first component is an SQL bool indicating whether it's a Just or a Nothing. The second component is the payload of the Just. If it's a Nothing the fields of the payload will be set to NULL.


Ok, this is for a Maybe/Option, but I can't see how it can work for any algebraic type.

I have this for mine:

    struct Ag {
        tag: String,     // like "Option.None"
        data: Vec<Value> // the data inside, if any
    }


Arbitrary sum types are just a generalisation of what I wrote. Opaleye doesn't have them (yet?) but rel8 does:

https://hackage.haskell.org/package/rel8-1.1.0.0/docs/src/Re...

You have a bool which specifies whether it's left or right, and then two payloads, one for left and one for right.


You linked to a file with pages of code. Care to summarize how they model sum types using a pair of fields?


Can you show how it's done, with data instead of algorithms? (aka "...show me your tables and I won't usually need your flowcharts; they'll be obvious." ~ Fred Brooks)

Haskell is hard to decipher for me...


Weird that this pops up today as I was trying to find something like this yesterday. Good luck with it!


I am interested in opportunities to cooperate on this.


Great!, I have the project at https://tablam.org


You might want to check TypeDB, which is a database with a strongly-typed query language: https://github.com/vaticle/typedb.

It doesn't do union types yet, but it's got type inheritance, type inference, and type-safe joins (with its "relation" feature). It's quite cool what you can do with it when you can express a complex domain with the language.

Disclaimer: I work there.


> I’ve often wondered why relational databases don’t have sum types.

It seems unlikely that relational databases would make the effort to add sum types whilst mainstream imperative languages don't have sum types. In turn that is probably because of a mistaken belief that subclassing was a sufficient means to achieve the same end. All is not lost, however. It's easy to simulate sum types using a field for the tag and multiple fields for the payload.

The syntax for that in raw SQL doesn't look nice but in a higher level query DSL it can look lovely!
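
(A rough sketch of the raw-SQL encoding — hypothetical table, with CHECK constraints doing the work a type checker would otherwise do:)

    CREATE TABLE shapes (
        id     integer PRIMARY KEY,
        tag    text NOT NULL CHECK (tag IN ('circle', 'rect')),
        radius integer,  -- payload for 'circle'
        w      integer,  -- payload for 'rect'
        h      integer,
        CHECK ((tag = 'circle' AND radius IS NOT NULL AND w IS NULL AND h IS NULL)
            OR (tag = 'rect'   AND w IS NOT NULL AND h IS NOT NULL AND radius IS NULL))
    );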

https://hackage.haskell.org/package/opaleye-0.7.1.0/docs/Opa...

https://hackage.haskell.org/package/rel8-1.0.0.0/docs/Rel8.h...

(Disclaimer: I'm the author of Opaleye)


Not a part of the SQL standard, but would PostgreSQL enumerated types be what you're thinking of?

https://www.postgresql.org/docs/current/datatype-enum.html

Although I guess the question is are you thinking sum types at the attribute level or sum types at the relation level? At the relation level I'm not familiar with an answer, but I can see where that would be a nice database-level solution for polymorphic associations.


Postgres enum types are like C enums--a bounded set of identifiers. Sum types require a bounded set of arbitrarily complex types (not just identifiers). For example:

    type Event enum {
        KeyPress(rune)
        MouseClick{ x int64, y int64 }
    }


Probably because the same can already be expressed using relations and foreign keys.


I don't think it can. I don't believe SQL allows you to create a fkey column that points into many different tables. You can create your own pseudo fkey column that the database doesn't know is a pointer into other tables ("type erasure") but then you have to recreate referential integrity and so on yourself, and you still run into issues querying against the table that has that fkey column in a useful way.
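
(For example, the classic workaround — a hypothetical "polymorphic association" that the database cannot enforce:)

    CREATE TABLE comments (
        id          integer PRIMARY KEY,
        parent_kind text    NOT NULL,  -- e.g. 'post' or 'photo'
        parent_id   integer NOT NULL   -- no REFERENCES clause possible:
                                       -- the target table depends on parent_kind
    );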


You don't need a FK to point to multiple tables. You would have the reference in the other direction, from multiple different tables to one table.


How do you ensure that only one row of only one table references the row of that one table?


Most databases have weak support for cross-table constraints. I believe in SQL Server you can define an indexed view on the union of the tables and then apply a unique constraint on the foreign key.


Theoretically, this could also be achieved by allowing FKs to views. But I know of no system that allows this.


The relational model is a logical model; that it's backed up by a B-tree or something else should bear no consequence except in query/update performance.

And you kinda can simulate the sum types, you just can't express it in the first-order way: instead of

    CREATE TYPE Color (Red, Green, Blue)
    
    CREATE TABLE MyTable(x Color, ...)
you do

    CREATE TABLE MyTableRed(...)
    CREATE TABLE MyTableGreen(...)
    CREATE TABLE MyTableBlue(...)
There is just no syntax to group these three tables together in all relevant queries/updates, and such skolemization is really quite painful to maintain manually.


> The relational model is a logical model; that it's backed up by a B-tree or something else should bear no consequence except in query/update performance.

Agreed, but some people have incorrectly argued in the past that introducing sum types would harm performance due to the underlying data structures, so I was trying to head off that line of argumentation.

> And you kinda can simulate the sum types,

I'm not sure I follow your example. Your `Color` type is a product type, not a sum type. I've explored every workaround I could think of for sum types (and asked many colleagues, etc), and they all failed in one important respect or another (typically because the type system doesn't understand your "sum type").


No, it's a sum type. You have to add the constraints

    CONSTRAINT MyTableRed INTERSECT MyTableGreen IS EMPTY
    CONSTRAINT MyTableGreen INTERSECT MyTableBlue IS EMPTY
    CONSTRAINT MyTableBlue INTERSECT MyTableRed IS EMPTY
but after that, I believe it works. It essentially replaces the original "... and the color is 'x'" predicate with three other predicates: "... and the color is red", "... and the color is green", "... and the color is blue", with the constraint that only one (at most) of those can be true for any particular value of "...". As you can see, you have to do that for every would-be COLOR property, and adding more variants blows up the number of tables and constraints even more drastically.

It kinda reminds me of 3SAT and other logical exercises in reducing everything down to the first-order logic: it's doable, but the terms grow exponentially large.


The problem IIRC is actually consuming that sum type in SQL. Unfortunately it's been several years since I played with that approach, and I've forgotten the specific problems I encountered.


The issue I see is that now you have mixed data and schema. And unfortunately, altering the schema is much more difficult in current systems. There is no homoiconic SQL DBMS. This lack of homoiconicity is also reflected in the fact that only values can be parametrized; identifiers cannot.

Also you face issues if there is some other entity depending on MyTable. You can no longer have a FK because you no longer have a table to which to point the FK, you have 3 tables.

You could create a table containing just the key and have FKs pointing from the three tables (I think this pattern does have a name), but you run into more issues because there is usually no way to force mandatory participation on both sides using FKs. (A FK can be 0..1 - 0..M, 1..1 - 0..M, 0..1 - 0..1 or 1..1 - 0..1. But it can never be 0..1 - 1..M, 1..1 - 1..M or 1..1 - 1..1. In this case 0..1 - 1..1 is not an issue because it can be reversed as 1..1 - 0..1.)

Now you have two options:

- Either create FKs in both directions, but then you run into the issue that most DBMSs cannot update two tables at the same time in the same statement. You have to use transactions and defer FK checking until the end of the transaction (see the sketch after these options). And by this point you have created a DSL over SQL.

- Or you use extra flag columns in order to achieve a three-step process (insert the key in the key table, insert the extra attributes in the other table, update the key table to mark it as active) and enforce everything using constraints and triggers, but similarly, by this point you have created a DSL over SQL.
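
(For the first option, the deferred checking looks roughly like this in PostgreSQL — table and constraint names are made up:)

    ALTER TABLE my_table_red
        ADD CONSTRAINT fk_red_key FOREIGN KEY (id) REFERENCES my_keys (id)
        DEFERRABLE INITIALLY DEFERRED;

    BEGIN;
    INSERT INTO my_keys (id) VALUES (1);
    INSERT INTO my_table_red (id) VALUES (1);
    COMMIT;  -- the FK is checked here, not at each INSERT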


In what situation do you believe your solution would be better than creating a table Colors, inserting the desired list of colors, and having a FK from MyTable to Colors?

The way I see it, a single-attribute relation is a sum type, implemented in current imperfect systems as a single-column table (maybe a two-column table if you insist on using surrogate keys for various reasons).
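
(Concretely, something like this sketch:)

    CREATE TABLE colors (name text PRIMARY KEY);
    INSERT INTO colors VALUES ('red'), ('green'), ('blue');

    CREATE TABLE my_table (
        id    integer PRIMARY KEY,
        color text NOT NULL REFERENCES colors (name)
    );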


Probably in the situation where the sum types have actual payloads, instead of being simple enums. For example, if it was

    type SHAPE = NONE | SQUARE(w, h INTEGER) | CIRCLE(r INTEGER) | TRIANGLE(a, b, c INTEGER) | BLOB(e SVG_STRING)
In my solution it entails having

    TABLE WithShapeNone(...)
    TABLE WithShapeSquare(..., w, h INTEGER)
    TABLE WithShapeCircle(..., r INTEGER)
    TABLE WithShapeTriangle(..., a, b, c INTEGER)
    TABLE WithShapeBlob(..., e SVG_STRING)
It seems I can't easily generalize your solution, so I won't strawman it.

Then again, I think the proper solution would just be adding the sum types into the system. We could have "+" implemented as a 3-column table too, but why would we?

Your observation that "only values can be parametrized; identifiers cannot" in the sibling comment is exactly what I was trying to express in these threads: the relational model is first-order, not second-order, so while you can theoretically model everything with it (logicians have done it), it's ugly and impractical without some extensions. Better types for "ground" values are one such extension.


For one, I do also believe that adding sum types would be great. Even if the only benefit were eliminating three-valued logic, it would be a huge benefit. I also agree that while it can be modeled, it is very ugly, though I wouldn't necessarily blame the relational model. The relational model is a modelling tool. The physical implementation doesn't need to be 1-to-1 with the model. Same as you rarely find pure OO or pure FP languages.

Current DBMSs also have plenty of solutions for the problem of non-homogeneous data. Some allow XML and JSON columns on which you can also enforce schemas. SQL Server offers sparse tables. PostgreSQL has arrays, enums and structs (and arrays of structs) as data types. But I do agree that sum types, among other things, would really be nice.
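
(For instance, a minimal PostgreSQL sketch of the struct/array-of-structs case — names are made up:)

    CREATE TYPE line_item AS (price numeric, qty integer);
    CREATE TABLE invoices (
        id    integer PRIMARY KEY,
        lines line_item[]  -- an array of "structs"
    );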

As for your example I still believe it is incomplete, but I would like to discuss that in a separate thread.


You make a fair point about actual payloads. My example was illustrating a sum type of unit types, an enum.

Do you care to conjure a more complete example of your solution? As @throwaway894345 said, I believe your solution is difficult to consume. I would like an example with something more concrete than MyTable.

A theoretically elegant but currently impossible way to consume your solution would require FKs pointing to views: a view called Shape that is a UNION ALL (we know there are no common elements) over the 5 WithShape... tables. Other tables would point FKs to the PK of this view. Unfortunately, AFAIK there is no DBMS that supports FKs to views.

I do believe this is not really what you have in mind and I have a hunch about what you are trying to get at, but I believe the example is incomplete.


I suspect part of the problem is that true sum types are poorly represented in many of the languages that use these databases. In a sense, because of the layer these sit at in the stack, they have to appeal to a lowest common denominator. If they supported tables where the rows were true sum types, and polymorphic as a result, many languages would find it quite difficult to interact with that database.


I don't understand why that would be a problem--programming languages interact with databases at runtime via SQL. I regularly use Go which lacks sum types, but there's nothing technically difficult about generating SQL that leverages sum types--except that SQL doesn't support sum types :).


In Go returning a list of items that are one of a set of different types would be poorly handled. It's possible but very awkward. You would probably drop down to just a slice of interface{} and then have to use reflection to figure out what each item was.

Dynamic languages like Python or Ruby would probably be fine but there are more languages than those out there to worry about.


There are ways of modeling sum types in Go (a struct with a few fields and a tag, an interface that's only implemented by the relevant types, etc), but that's unrelated to the database/client interface.


Ah yes, the database textbook from my old uni. I think I managed to get a good grade on databases without ever buying it. But I remember the lecture content based on it being quite interesting.


Another ex Warwick person here!


I'm sure this article will attract the attention of those of us who were taught by Hugh Darwen.


I have a Darwen-signed copy - one of my proudest possessions!


(2014) - according to the copyright notice.


Might be worthwhile to note this is copyright 2014 in the title.


Interesting read!



