Check out the Vortex file format (https://vortex.dev/). If you are interested in a distributed SQL engine, you can look at SpiralDB (https://spiraldb.com/); I haven’t used it personally, but they created Vortex.
If you can drop the “distributed” part, then plug in DuckDB (https://duckdb.org/) and query Parquet (out of the box) or Vortex (https://duckdb.org/docs/stable/core_extensions/vortex.html) with it.
Yeah, this is a hard problem, especially because standard SQL databases only partially implement the relational model, have no good recourse for dealing with relations-in-relations, and lack ways to build your own storage in user space (all stuff that I dream of tackling).
I think one possible answer is to try to "compress" columns with custom datatypes. It may require touching some of the database's innards (in PostgreSQL, for example, you need to do it in C), but it is a viable option in the many cases where you notice that what you would express in JSON, for example, is in fact a custom type that could be stored efficiently if there is a way to translate it to more primitive types. Once that is solved, the indexes will work.
The second option is to hide part of the join complexity with views.
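A minimal sketch of the view idea, using Python's built-in sqlite3 only because it needs no setup. The table and column names (sample_expr, sample_meth, sample_wide, gene_a, site_y and so on) are made up for illustration and are not from the original post:

```python
import sqlite3

# In-memory database; all names below are hypothetical.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- The wide table is split into narrower ones sharing a key.
    CREATE TABLE sample_expr (sample_id INTEGER PRIMARY KEY, gene_a REAL, gene_b REAL);
    CREATE TABLE sample_meth (sample_id INTEGER PRIMARY KEY, site_x REAL, site_y REAL);
    INSERT INTO sample_expr VALUES (1, 0.5, 1.5), (2, 0.7, 2.5);
    INSERT INTO sample_meth VALUES (1, 0.1, 0.9), (2, 0.2, 0.8);

    -- The view hides the join, so callers see one wide relation.
    CREATE VIEW sample_wide AS
        SELECT e.sample_id, e.gene_a, e.gene_b, m.site_x, m.site_y
        FROM sample_expr AS e
        JOIN sample_meth AS m ON e.sample_id = m.sample_id;
""")

# Callers query the view as if it were a single table.
rows = conn.execute(
    "SELECT sample_id, gene_a, site_y FROM sample_wide ORDER BY sample_id"
).fetchall()
print(rows)
```

This only hides the join from the person writing the query; whether the engine avoids touching the tables you did not ask for depends on its planner.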
ClickHouse and Scuba address this. The core idea is that the on-disk data layout only requires a scan to open files, or otherwise access data, for the columns referenced in the query.
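To make that concrete, here is a toy sketch of the idea, not how ClickHouse or Scuba actually store data. It assumes one JSON file per column, and the helper names (write_columns, scan) are invented for the example:

```python
import json
import tempfile
from pathlib import Path


def write_columns(root: Path, table: dict[str, list]) -> None:
    """Store each column in its own file (toy stand-in for a columnar layout)."""
    root.mkdir(parents=True, exist_ok=True)
    for name, values in table.items():
        (root / f"{name}.json").write_text(json.dumps(values))


def scan(root: Path, columns: list[str]) -> tuple[list[tuple], list[str]]:
    """Read only the requested columns; return the rows and the files opened."""
    opened = []
    data = []
    for name in columns:
        path = root / f"{name}.json"
        opened.append(path.name)
        data.append(json.loads(path.read_text()))
    # Zip the column vectors back into row tuples.
    return list(zip(*data)), opened


with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp) / "wide_table"
    # A table with 1000 columns and 3 rows.
    table = {f"col_{i}": [i, i + 1, i + 2] for i in range(1000)}
    write_columns(root, table)
    # Only two of the 1000 column files are touched by this "query".
    rows, opened = scan(root, ["col_3", "col_42"])
    print(rows, opened)
```

The cost of the scan follows the number of columns referenced, not the width of the table, which is the property being described.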
What engine and data format were you using for your experiment?
You mention Parquet and Spark, but I’m wondering if you tried any of the “Lakehouse” formats that are basically Parquet plus a metadata layer (e.g. Iceberg). I’d probably at least give Trino or Presto a shot, although I suspect that you’ll have similar metadata issues with those engines.
> With this design, it’s possible to run native SQL selects on tables with hundreds of thousands to millions of columns, with predictable (sub-second) latency when accessing a subset of columns.
What is the design?
What are the columns and why are there so many of them? The standard approach is to explode into many tables and introduce joins as you said. Why don’t you want joins?
I am speculating here, but as it is genomics data I assume it's information such as gene counts and epigenetic information (methylation, histones, etc.). Once you multiply 20k by a few post-translational modifications, you can get to quite a few columns quickly.
Usually this would be stored in a sparse long form though. So I might be wrong.
If you want to do that, why not just use an EAV (entity-attribute-value) pattern or something else that can translate rows to columns?
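For what it's worth, a rough sketch of what that could look like, again with Python's built-in sqlite3. The eav table, its contents, and the pivot helper are all hypothetical, and this pivots with one MAX(CASE ...) per requested attribute, which is only one of several ways to turn rows into columns:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Long form: one row per (entity, attribute) pair that has a value.
    CREATE TABLE eav (
        entity INTEGER,
        attribute TEXT,
        value REAL,
        PRIMARY KEY (entity, attribute)
    );
    INSERT INTO eav VALUES (1, 'gene_a', 0.5), (1, 'gene_b', 1.5), (2, 'gene_a', 0.7);
""")


def pivot(conn: sqlite3.Connection, attributes: list[str]) -> list[tuple]:
    """Return one row per entity with one column per requested attribute."""
    if not attributes:
        raise ValueError("need at least one attribute")
    # One aggregate per attribute; names are bound as parameters, not inlined.
    cols = ", ".join(
        "MAX(CASE WHEN attribute = ? THEN value END)" for _ in attributes
    )
    sql = f"SELECT entity, {cols} FROM eav GROUP BY entity ORDER BY entity"
    return conn.execute(sql, attributes).fetchall()


wide = pivot(conn, ["gene_a", "gene_b"])
print(wide)
```

Missing pairs simply come back as NULL (None in Python), so sparse data costs nothing to store; the trade-off is that every wide read pays for the pivot.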