Always an interesting read. A ton of insights, but I always find it a bit hard to make sense of, and it feels so disconnected from a modern distributed systems context.
In retrospect, I think a lot of the Spark SQL Dataframe workflow comes pretty close to what D/Tutorial D aspire to - static typing, functions on relations, imperative style but fundamentally unordered, etc.; however, it's only a processing system, not a storage system.
I have kept my distance from the "data lake" buzzword circles, but maybe a transactional, Spark-based data lake does approximate what Darwen/Date are going for? The only thing really missing might be nested relations?
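To make the "functions on unordered relations" point concrete, here's a toy sketch of my own (not from Tutorial D or Spark, and all names are made up): a relation modeled as a frozenset of attribute/value tuples, with restrict and project as plain functions, so duplicates and ordering simply don't exist.

```python
def relation(rows):
    """A relation: an unordered set of tuples (no duplicates, no order)."""
    return frozenset(frozenset(r.items()) for r in rows)

def restrict(rel, pred):
    """Keep only the tuples satisfying the predicate."""
    return frozenset(t for t in rel if pred(dict(t)))

def project(rel, attrs):
    """Keep only the named attributes of each tuple."""
    return frozenset(frozenset((k, v) for k, v in t if k in attrs) for t in rel)

emp = relation([
    {"name": "ada", "dept": "eng"},
    {"name": "bob", "dept": "ops"},
])

eng = project(restrict(emp, lambda r: r["dept"] == "eng"), {"name"})
print(eng == relation([{"name": "ada"}]))  # True: pure set semantics, order-free
```

The composition reads imperatively but the result is order-independent, which is roughly the property the Dataframe workflow shares with D.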
I haven't read it but I understand where you're coming from here.
Does this doc talk about the problems with nullability / ternary logic? What about algebraic sum types? Those have always been some of the most difficult aspects of relational data modeling, at least with respect to SQL.
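For anyone who hasn't hit the ternary-logic problem in practice, here's a quick demo (toy table, runnable against sqlite3 from the Python stdlib) of why NULLs are so painful: any comparison with NULL yields UNKNOWN, which WHERE treats as false.

```python
# Demo of SQL's three-valued logic using an in-memory sqlite3 database.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t (x INTEGER)")
conn.executemany("INSERT INTO t VALUES (?)", [(1,), (2,), (None,)])

# The NULL row matches neither predicate: NULL = 1 and NULL <> 1 are
# both UNKNOWN, not true, so it silently falls out of both results.
eq = conn.execute("SELECT COUNT(*) FROM t WHERE x = 1").fetchone()[0]
ne = conn.execute("SELECT COUNT(*) FROM t WHERE x <> 1").fetchone()[0]
print(eq, ne)  # 1 1 -- the NULL row is in neither answer

# The classic NOT IN trap: a single NULL in the subquery makes the whole
# predicate UNKNOWN for every row, so this returns zero rows.
trap = conn.execute(
    "SELECT COUNT(*) FROM t "
    "WHERE x NOT IN (SELECT x FROM t WHERE x IS NULL OR x = 2)"
).fetchone()[0]
print(trap)  # 0
```

Date and Darwen's answer, as I understand it, is to ban NULL outright rather than patch the logic.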
As I understand it, a data lake is a storage space for unstructured data. A data warehouse is a storage + compute layer, usually with data sourced from a data lake, that is ready for querying. This understanding comes from the description in this paper[1]:

> To solve these problems, the second generation data analytics platforms started offloading all the raw data into data lakes: low-cost storage systems with a file API that hold data in generic and usually open file formats, such as Apache Parquet and ORC [8, 9]. This approach started with the Apache Hadoop movement [5], using the Hadoop File System (HDFS) for cheap storage. The data lake was a schema-on-read architecture that enabled the agility of storing any data at low cost, but on the other hand, punted the problem of data quality and governance downstream. In this architecture, a small subset of data in the lake would later be ETLed to a downstream data warehouse (such as Teradata) for the most important decision support and BI applications.
In my experience, a data warehouse usually has an ETL process at the beginning. Data comes in from disparate sources and, on a regular basis, is ETL'd into a shape that is ready for the business to use.
On the other hand, a data lake slurps in all the data as soon as it is available, in whatever form it is in. You have to process it into the business-consumable form when you query/egress it, but you don't have to know your dream schema up front.
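That schema-on-read idea can be shown in a few lines. This is a rough, hypothetical illustration (stdlib only, invented fields): the "lake" just accumulates raw JSON lines as they arrive, and a schema, with its coercions, is imposed only when a consumer reads.

```python
# Schema-on-read sketch: ingest anything, shape it at query time.
import io
import json

# Ingest: append whatever arrives, no upfront schema enforcement.
lake = io.StringIO()
for raw in ['{"id": 1, "amt": "9.50", "note": "ok"}',
            '{"id": 2, "amt": 3}',   # different type, different fields
            '{"id": 3}']:            # missing field entirely
    lake.write(raw + "\n")

# Read: the consumer decides the schema and the coercions, at egress.
def read_with_schema(f):
    for line in f:
        rec = json.loads(line)
        yield {"id": int(rec["id"]),
               "amt": float(rec.get("amt") or 0.0)}

lake.seek(0)
rows = list(read_with_schema(lake))
print(rows)
# [{'id': 1, 'amt': 9.5}, {'id': 2, 'amt': 3.0}, {'id': 3, 'amt': 0.0}]
```

The flip side, as the quoted paper notes, is that data-quality problems are punted to every reader instead of being handled once at write time.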
My experience is similar:
extract process -> raw data -> clean/merge -> model
Normally you extract from source, then load to destination. There is no business logic in this process.
From raw you do all of your transforms to clean and merge the data, then get it into a usable model. Big data sets I've done with Hadoop, then moved the cleaned/merged data to a standard or MPP DB for analysts. For normal-sized sets this can all be done in a standard DB.
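The extract -> raw -> clean/merge -> model flow above can be sketched in miniature like this (plain Python standing in for Hadoop or an MPP DB, with invented field names):

```python
# Raw zone: data exactly as extracted, kept untouched. No business logic.
raw_orders = [{"order_id": "1", "cust": " ada ", "total": "10.00"},
              {"order_id": "2", "cust": "BOB", "total": "5.50"}]
raw_customers = [{"cust": "ada", "region": "eu"},
                 {"cust": "bob", "region": "us"}]

# Clean: normalize types and keys; the raw copies above stay available.
clean = [{"order_id": int(o["order_id"]),
          "cust": o["cust"].strip().lower(),
          "total": float(o["total"])} for o in raw_orders]

# Merge: join cleaned orders with the customer dimension.
regions = {c["cust"]: c["region"] for c in raw_customers}
merged = [dict(o, region=regions[o["cust"]]) for o in clean]

# Model: an analyst-facing aggregate, e.g. revenue by region.
model = {}
for o in merged:
    model[o["region"]] = model.get(o["region"], 0.0) + o["total"]
print(model)  # {'eu': 10.0, 'us': 5.5}
```

Each stage only reads from the one before it, which is what lets analysts reach back to raw or clean/merge when the model doesn't answer their question.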
The other part is that all the data, from raw through clean/merge, is kept and available for analysts to use, the thinking being that storage costs are extremely low and heading toward zero. In a traditional DW, by contrast, analysts use only the modeled sets, and depending on the data, the earlier sets are deleted since they serve operational purposes only; storage is considered expensive and limiting.
The move to ELT and using a declarative dataops tool has been mind-bending and has been a multiplier in terms of speed to get to something usable. I don't want to see another DW again.