At some point, the problem stops being “how many rows” and becomes “how many columns”. Thousands, then tens of thousands, sometimes more.
What I observed in practice:
- Standard SQL databases usually cap out around ~1,000–1,600 columns.
- Columnar formats like Parquet can handle width, but typically require Spark or Python pipelines.
- OLAP engines are fast, but tend to assume relatively narrow schemas.
- Feature stores often work around this by exploding data into joins or multiple tables.
At extreme width, metadata handling, query planning, and even SQL parsing become bottlenecks.
I experimented with a different approach:
- no joins
- no transactions
- columns distributed instead of rows
- SELECT as the primary operation
With this design, it’s possible to run native SQL selects on tables with hundreds of thousands to millions of columns, with predictable (sub-second) latency when accessing a subset of columns.
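To make "columns distributed instead of rows" a bit more concrete, here is a deliberately simplified Python sketch, not the actual implementation; the node names, the `node_for` helper, and the hash choice are all made up for illustration. The point is that a SELECT only touches the nodes that own the requested columns, so cost scales with how many columns you ask for, not with the total width of the table:

```python
import hashlib

# Toy model: every column is a named array, and each column is assigned
# to a node by hashing its name. Rows are never split across nodes.
NODES = ["node-a", "node-b"]

def node_for(column_name: str) -> str:
    """Pick the node that owns a column (hash of the column name)."""
    h = int(hashlib.md5(column_name.encode()).hexdigest(), 16)
    return NODES[h % len(NODES)]

# Per-node storage: {node: {column_name: list_of_values}}
storage = {n: {} for n in NODES}

def insert_column(name, values):
    """Inserting a column touches exactly one node."""
    storage[node_for(name)][name] = values

def select(columns, row_ids):
    """A SELECT only contacts the nodes that own the requested columns."""
    result = {}
    for name in columns:
        col = storage[node_for(name)][name]
        result[name] = [col[i] for i in row_ids]
    return result

# A very wide table where a query only needs a couple of columns.
for i in range(1000):
    insert_column(f"feature_{i}", list(range(5)))
print(select(["feature_3", "feature_42"], [0, 2, 4]))
```

In the real system the per-node storage is persistent and the coordination layer is more involved, of course; the sketch is only meant to show why total width stops mattering for a narrow SELECT.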
On a small cluster (2 servers, AMD EPYC, 128 GB RAM each), rough numbers look like:
- creating a 1M-column table: ~6 minutes
- inserting a single column with 1M values: ~2 seconds
- selecting ~60 columns over ~5,000 rows: ~1 second
I’m curious how others here approach ultra-wide datasets. Have you seen architectures that work cleanly at this width without resorting to heavy ETL or complex joins?
I think one possible answer is to "compress" columns into custom datatypes. It can mean touching the database internals (in PostgreSQL, for example, you end up writing the type in C), but it's viable in many cases: what you would otherwise express as JSON is often really a custom type that can be stored efficiently if there is a way to translate it into more primitive types. Once that's solved, indexes work on it as well.
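To make the "translate it to more primitive types" part concrete, here is a toy Python sketch (not the PostgreSQL-in-C route, and the field names are invented): a JSON-looking value that is really a fixed-shape record can be packed into a few primitive fields instead of being stored as text, which is what makes compact storage and ordinary indexing possible.

```python
import json
import struct

# A "JSON" column that is really a fixed-shape record:
# a timestamp, a status code, and a score.
raw = '{"ts": 1717171717, "status": 3, "score": 0.87}'

# Packed into primitive types (int64, uint8, float32) it costs 13 bytes,
# and the individual fields become directly comparable/indexable.
FMT = "<qBf"  # little-endian: int64, uint8, float32

def pack(value: str) -> bytes:
    d = json.loads(value)
    return struct.pack(FMT, d["ts"], d["status"], d["score"])

def unpack(blob: bytes) -> dict:
    ts, status, score = struct.unpack(FMT, blob)
    return {"ts": ts, "status": status, "score": score}

blob = pack(raw)
print(len(raw), "bytes as JSON text ->", len(blob), "bytes packed")
print(unpack(blob))
```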
The second option is to hide part of the join complexity with views.
You mention Parquet and Spark, but I'm wondering if you've tried any of the "Lakehouse" formats that are basically Parquet plus a metadata layer (e.g. Iceberg). I'd probably at least give Trino or Presto a shot, although I suspect you'll hit similar metadata issues with those engines.
What is the design?
Usually this kind of data would be stored in a sparse long form, though, so I might be wrong.
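By "sparse long form" I mean keeping one (row, column, value) triple per non-null cell instead of one physical column per feature; a toy Python example with made-up feature names:

```python
# Wide form: one physical column per feature, mostly NULLs when sparse.
wide = [
    {"row_id": 1, "feat_000007": 0.4, "feat_000123": 2},
    {"row_id": 2, "feat_000007": 0.9},
]

# Sparse long form: one (row, feature, value) triple per non-null cell.
# Extreme width just becomes more rows, at the cost of pivoting at query time.
long_form = [
    (r["row_id"], feature, value)
    for r in wide
    for feature, value in r.items()
    if feature != "row_id"
]
print(long_form)
# [(1, 'feat_000007', 0.4), (1, 'feat_000123', 2), (2, 'feat_000007', 0.9)]
```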