We just create mini data "ponds" on the fly by copying tenant isolated gold tier data to parquet in s3. The users/agent queries are executed with duckdb. We run this process when the user start a session and generate an STS token scoped to their tenant bucket path. Its extremely simple and works well (at least with our data volumes).
This is cool. I think for our use case this wouldn’t work. We’re dealing with billions of rows for some tenants.
We’re about to introduce alerts where users can write their own TRQL queries and then define alerts from them. Which requires evaluating them regularly so effectively the data needs to be continuously up to date.
Small. We're dealing with financial accounts, holdings and transactions. So a user might have 10 accounts, thousands of holdings, 10s of thousands of transactions. Plus a handful of supplemental data tables. Then there is market data that is shared across tenants and updated on interval. This data is maybe 10-20M rows.
Just to clarify, the data is prepared when the user (agent) analytics session starts. Right now it takes 5-10s, which means it's typically ready well before the agent has actually determined it needs to run any queries. I think for larger volumes, pg_duckdb would allow this to scale to 10s of millions rows pretty efficiently.
The DuckDB website has the following to say about the name:
> Why call it DuckDB?
> Ducks are amazing animals. They can fly, walk and swim. They can also live off pretty much everything. They are quite resilient to environmental challenges. A duck's song will bring people back from the dead and inspires database research. They are thus the perfect mascot for a versatile and resilient data management system.
We have various data sources (which is another benefit of this approach). Data from the application DB is currently pulled using the FE apis which handle tenant isolation and allow the application database to deal with the load. I think pg_duckdb could be a good solution here as well, but haven't gotten around to testing it. Other data come from analytics DB. Most of this is landed on an interval via pipeline scripts.
We do the same thing, every employee can access our main financial/back office SQL database, but we just use PostgreSQL with row level security[0]. We never bothered to complicate it like the post does.
I want to build a shared postgres db with hundreds of small apps (OLTP) accessing shared tables using a RLS model model against well defined tables.
What are other limitations and mitigations folks have used or encountered to support stability and security? Things like
- Query timeouts to prevent noisy neighbors
- connection pooling (e.g. pgbouncer) also for noisy neighbors
- client schema compatibility (e.g. some applications running older versions, have certain assumptions about the schema that may change over time)
I'd be so uncomfortable with this. It sounds like you're placing the full burden of access on a single boundary. I mean, maybe there's more to it that you haven't spoken about here, but "everything rests on this one postgres feature" is an unacceptably unsafe state to me.
row level security is not a feature specific to Postgres, but more a pretty standard and acceptable way to control access in a multitenant or multicontext environment that pretty much every data provider supports/implements. When it comes to answering a single specific question (like the one RLS targets) I believe you DO want a single, simple answer, vs. something like "it uses these n independent things working in conjunction..."
Conceptually that's no different to any security measures that prevent you from accessing data you're not supposed to? At the end of the day with all data that is colocated you're trusting that some permission feature somewhere is preventing you from accessing data you're not supposed to.
We trust that Amazon or Google or Microsoft are successful in protecting customer data for example. We trust that when you log into your bank account the money you see is yours, and when you deposit it we trust that the money goes into your account. But it's all just mostly logical separation.
> At the end of the day with all data that is colocated you're trusting that some permission feature somewhere is preventing you from accessing data you're not supposed to.
Right but ideally more than one.
> But it's all just mostly logical separation.
Yes, ideally multiple layers of this. You don't all share one RDS instance and then get row level security.
Can you give an example of more than one layer of logical separation at the data layer?
We all know that authentication should have multiple factors. But that's a different problem. Fundamentally at the point you're reading or writing data you're asking the question "does X has permission to read/write Y".
I don't know their use case enough to understand what would or would not be an appropriate mitigation. For example, with regards to financial data, you could have client side encryption on values where those keys are brokered separately. I can't exactly design their system for them, but they're describing a system in which every employee has direct database access and the database holds financial information.
Right, encryption would protect the data. But still, at the end of the day you're trusting the permission model of the database. Encryption won't prevent you updating a row or deleting a row if the database permission model failed.
Well, I think we basically agree? My suggestion is merely that a database holding financial data should have more than a single layer of security. Granting direct access to a database is a pretty scary thing. A simple example would be that any vulnerability in the database is directly accessible, even just by placing a broker in between users and the database I'd likely start to feel a lot better, and now I'd have a primitive for layering on additional security measures.
Encryption is an extremely powerful measure for this use case. If the data does not need to be indexed, you could literally take over the database process entirely and still not have access, it definitely doesn't rely on the permission model of the db because the keys would be brokered elsewhere.
It's not like RLS is just some random feature they are misusing. It's specifically for security and is absolutely reliable. Maybe you should do a bit more research before making comments like this.
You can (and in some cases should) combine this with other boundaries, like access control or specific query logic. RLS moves the core checks closer to the data being controlled (i.e. the database) specifically to prevent the errors like forgetting to add the "where user_id = xxx" clause. It is super-valuable in compliance scenarios where someone like your DB Admin has permission to control access but not see any data, and consumers (both devs and end users) can write queries to see different levels of access but are not allowed to control data.
Obviously it's not a silver bullet and the isolation can be confusing when debugging, but generally a single point for your applying RBAC is a feature not a shortcoming. The next level of security might be how you define your roles.
I actually believe the simplest, most secure client scenario is physical isolation, where you give the user/consumer only the data they are allowed to use and then don't try to control it (someone mentioned this above, using parquet & duckdb). There's downsides here too: doesn't work for write scenarios, can be resource intensive or time delayed, doesn't handle chain of custody well, etc. You typically have two strategies:
1. pick the best approach for the specific situation.
2. pick your one tool as your hammer and be a d!ck about it.
This is the real world not everybody can get a perfectly isolated database instance. Also you do realize that is not necessarily any more secure than RLS right? Something still has to control what database the code connects to. That could have a flaw just as much as you could have a flaw when setting up RLS.
RLS is one of the simplest things to set up properly, if you can't figure it out I don't think I'm the one who doesn't know what they're talking about.
Reasons 1-3 could very well be done with ClickHouse policies (RLS) and good data warehouse design. In fact, that’s more secure than a compiler adding a where to a query ran by an all mighty user.
Reason 4 is probably an improvement, but could probably be done with CH functions.
The problem with custom DSLs like this is that tradeoff a massive ecosystem for very little benefit.
You’re right RLS can go a long way here. With complex RBAC rules it can get tricky though.
The main advantages of a DSL are you can expose a nicer interface to users (table names, columns, virtual columns, automatic joins, query optimization).
We very intentionally kept the syntax as close to regular ClickHouse as possible but added some functions.
As long as you don't deviate too much from ANSI, I think the 'light sql DSL' approach has a lot of pros when you control the UX. (so UIs, in particular, are fantastic for this approach - what they seem to be targeting with queryies and dashboards). It's more of a product experience; tables are a terrible product surface to manage.
Agreed with the ecosystem cons getting much heavier as you move outside the product surface area.
Personally I think that's worse. SQL - which is almost ubiqutous - already suffers from a fragmentation problem because of the complex and dated standardization setup. When I learn a new DBMS the two questions I ask at the very start are: 1. what common but non-standard features are supported? 2. what new anchor-features (often cool but also often intended to lock me to the vendor) am I going to pick up?
First I need to learn a new (even easy & familiar) language, second I need to be aware of what's proprietary & locks me to the vendor platform. I'd suspect they see the second as a benefit they get IF they can convince people to accept the first.
> How do you let users write arbitrary SQL against a shared multi-tenant analytical database without exposing other tenants' data or letting a rogue query take down the cluster?
For query operations I would try to find a way to solve this with tools like S3 and SQLite. There are a few VFS implementations for S3 and other CDNs.
Open Table Formats (Iceberg, Delta Lake, Hudi, etc) are the approach we've taken. That let's us offer a query engine but also let's the tenant bring their preferred engine (Snowflake, Spark, DuckDB, etc). It also addresses dirty reads and some other state problems that come from trying to use the file system. It scales as much as the bucket does, so we haven't found a use case we couldn't scale to yet.
We (https://prequel.co) recently started offering this as a white labeled capability so anyone can offer it without building it yourself. Its a newer capability to our export product where instead of sending the data to the tenant's data warehouse, we enable you to provision an S3/GCS/ABS/etc bucket with the data formatted. Credential management, analytics, etc is all batteries included so you don't have to do that either. The initial interest from our customers was around BI integrations but agent use is starting to pick up which is kinda interesting to see.
We did this with MotherDuck, and without introducing a new language. Every tenant has their own isolated storage and compute, so it’s trivial to grant internal users access to specific tenants as needed. DuckDB’s SQL dialect is mostly just Postgres’ with some nice ergonomic additions and a host of extra functionality.
This is explicitly not the problem they are trying to solve. In a single tenant database you don’t have to by definition worry about multi tenant databases
In a system with organizations, projects and advanced user access permissions having separate databases doesn’t full solve the problem. You still need access control inside each tenanted database. It also makes cross-cutting queries impossible which means users can’t query across all their orgs for example.
The DSL approach has other advantages too: like rewriting queries to not expose underlying tables, doing automatic performance optimizations…
I guess the question then becomes, what problem does a multi-tenancy setup solve that an isolated database setup doesn't? Are they really not solving the same problem for a user perspective, or is it only from their own engineering perspective? And how do those decisions ultimately impact the product they can surface to users?
Off the top of my head, managing 100 different database instances takes a lot more work from the business standpoint than managing 1 database with 100 users.
The article also mentioned that they isolate by project_id. That implies one customer (assume a business) can isolate permissions more granulary.
With multi-tenant vs multi-database decision one driver would be the level of legal/compliance/risk/cost/resource drivers around how segregated users really are.
Multi-database is more expensive generally but is a more brain dead guaranteed way to ensure the users are properly segregated, resilient across cloud/database/etc software releases that may regress something in a multi-tenant setup.
Multi-tenant you always run the risk of a software update, misconfiguration or operational error exposing existence of other users / their metadata / their data / their usage / etc. You also have a lot more of a challenge engineering for resource contention.
We’re about to introduce alerts where users can write their own TRQL queries and then define alerts from them. Which requires evaluating them regularly so effectively the data needs to be continuously up to date.
Just to clarify, the data is prepared when the user (agent) analytics session starts. Right now it takes 5-10s, which means it's typically ready well before the agent has actually determined it needs to run any queries. I think for larger volumes, pg_duckdb would allow this to scale to 10s of millions rows pretty efficiently.
> Why call it DuckDB?
> Ducks are amazing animals. They can fly, walk and swim. They can also live off pretty much everything. They are quite resilient to environmental challenges. A duck's song will bring people back from the dead and inspires database research. They are thus the perfect mascot for a versatile and resilient data management system.
https://duckdb.org/faq#why-call-it-duckdb
0: https://www.postgresql.org/docs/18/ddl-rowsecurity.html
What are other limitations and mitigations folks have used or encountered to support stability and security? Things like
How do you enforce tenant isolation with that method, or prevent unbounded table reads?
We do something similar for our backoffice - just with the difference that it is Claude that has full freedom to write queries.
> every employee can access our main financial/back office SQL database
This means that there is no access gate other than RLS, which includes financial data. That is a lot of pressure on one control.
We trust that Amazon or Google or Microsoft are successful in protecting customer data for example. We trust that when you log into your bank account the money you see is yours, and when you deposit it we trust that the money goes into your account. But it's all just mostly logical separation.
Right but ideally more than one.
> But it's all just mostly logical separation.
Yes, ideally multiple layers of this. You don't all share one RDS instance and then get row level security.
We all know that authentication should have multiple factors. But that's a different problem. Fundamentally at the point you're reading or writing data you're asking the question "does X has permission to read/write Y".
I don't see what you're getting at.
Encryption is an extremely powerful measure for this use case. If the data does not need to be indexed, you could literally take over the database process entirely and still not have access, it definitely doesn't rely on the permission model of the db because the keys would be brokered elsewhere.
Obviously it's not a silver bullet and the isolation can be confusing when debugging, but generally a single point for your applying RBAC is a feature not a shortcoming. The next level of security might be how you define your roles.
I actually believe the simplest, most secure client scenario is physical isolation, where you give the user/consumer only the data they are allowed to use and then don't try to control it (someone mentioned this above, using parquet & duckdb). There's downsides here too: doesn't work for write scenarios, can be resource intensive or time delayed, doesn't handle chain of custody well, etc. You typically have two strategies:
1. pick the best approach for the specific situation.
2. pick your one tool as your hammer and be a d!ck about it.
Reason 4 is probably an improvement, but could probably be done with CH functions.
The problem with custom DSLs like this is that tradeoff a massive ecosystem for very little benefit.
The main advantages of a DSL are you can expose a nicer interface to users (table names, columns, virtual columns, automatic joins, query optimization).
We very intentionally kept the syntax as close to regular ClickHouse as possible but added some functions.
Agreed with the ecosystem cons getting much heavier as you move outside the product surface area.
First I need to learn a new (even easy & familiar) language, second I need to be aware of what's proprietary & locks me to the vendor platform. I'd suspect they see the second as a benefit they get IF they can convince people to accept the first.
For query operations I would try to find a way to solve this with tools like S3 and SQLite. There are a few VFS implementations for S3 and other CDNs.
We (https://prequel.co) recently started offering this as a white labeled capability so anyone can offer it without building it yourself. Its a newer capability to our export product where instead of sending the data to the tenant's data warehouse, we enable you to provision an S3/GCS/ABS/etc bucket with the data formatted. Credential management, analytics, etc is all batteries included so you don't have to do that either. The initial interest from our customers was around BI integrations but agent use is starting to pick up which is kinda interesting to see.
The DSL approach has other advantages too: like rewriting queries to not expose underlying tables, doing automatic performance optimizations…
The article also mentioned that they isolate by project_id. That implies one customer (assume a business) can isolate permissions more granulary.
Multi-database is more expensive generally but is a more brain dead guaranteed way to ensure the users are properly segregated, resilient across cloud/database/etc software releases that may regress something in a multi-tenant setup.
Multi-tenant you always run the risk of a software update, misconfiguration or operational error exposing existence of other users / their metadata / their data / their usage / etc. You also have a lot more of a challenge engineering for resource contention.
We use it (I’m the author or the article) so users can search every run they do and graph all sorts of metrics.