I am designing a DB in Postgres for an Online Business Management system. Here I have two options:
- Have a single database with all customers' data, adding a customer-ID column to every table so queries can filter each customer's rows.
- Give each customer who registers their own database, containing only their data.
The second option seems tidier to me, since data from different clients would never mix in the same table. It also seems that performance would be better when querying smaller tables, but that's just a guess: with many databases being queried at the same time, I don't know whether the engine's performance degrades.
My question is: assuming the number of clients grows and each one gets their own database,
Is there a limit on the number of databases Postgres can handle? And what about performance?
Would it be better to query several small databases than a single large one?
See, these days it's trivial to tell a Linux machine that it can keep millions of files open, so the number of tables, which themselves exist as files, is not an issue. Additionally, Postgres keeps in memory a finite amount of data and, on top of that, a finite number of visible pages (data segments of a table, index, sequence, etc.).

However, under the `ext4` file system you can only have up to 64,000 subdirectories. This means that you can have 63,995 databases; the other 5 are:
- `.`, a hard link to the current subdirectory
- `..`, a hard link to the parent subdirectory
- `postgres`, the default database
- `template1`, the default template
- `template0`, the alternate template

How to bypass this restriction?
If the `dir_nlink` feature is enabled on the file system, hard links stop counting toward this limit. By the same token, you can keep some of your databases physically outside the postgres directory, in another location that the data directory reaches through a symbolic link.
But you don't need to get that complicated: Postgres itself supports tablespaces, which can physically reside elsewhere on disk.
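A minimal sketch of that, assuming a second disk mounted at `/mnt/disk2/pgdata` (the path and names here are illustrative; the directory must already exist and be owned by the postgres OS user):

```sql
-- Create a tablespace on the second disk and place a client's database on it.
CREATE TABLESPACE disco2 LOCATION '/mnt/disk2/pgdata';
CREATE DATABASE cliente_acme TABLESPACE disco2;
```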
Strategic approach
Although it's not unusual to implement a multitenant structure, it very much depends on the data you want to store, both in volume and in degree of confidentiality; you may not be able to store everything together. What's more, once you implement a multitenant system it becomes almost impossible to pull the data apart if one day you want to split it into different databases.

If you use different databases you will have to implement a kind of instance registry that links each client with its DB instance (it can be done with a `cliente-PGurl` tuple) hosted elsewhere. Under a convention-over-configuration approach, the sanitized name of the client could simply map to the name of a DB, but the day will come when you want different clients on different machines, and that approach will no longer serve you.

As I said before, privacy and security regulations, depending on the type of business, and especially if you want to get certified under ISO 27001 (let's think big), will push you to implement something more robust than mixing everything together. A multitenant schema is perfect for a multi-user system, but not so much for a multi-client one.
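A minimal sketch of such a registry, hosted in a separate control database (the table and column names are assumptions):

```sql
-- One row per client, mapping it to the connection URL of its own instance.
CREATE TABLE registro_instancias (
    cliente_id bigint GENERATED ALWAYS AS IDENTITY PRIMARY KEY,
    cliente    text NOT NULL UNIQUE, -- sanitized client name
    pg_url     text NOT NULL         -- e.g. 'postgresql://db7.example.com:5432/cliente_acme'
);
```

The business layer looks the client up here and then connects to the returned `pg_url`, so moving a client to another machine becomes a single UPDATE on this table.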
Performance
Depending on the size of the tables you are considering, there may be a drop in performance for queries that do not take advantage of indexes if you put everything in the same database; but if the indexes and queries are well designed, that loss would be marginal. Additionally, with declarative partitioning, built into Postgres 10 and refined in Postgres 11, you can attack the problem with hash partitions: any table that carries the `cliente_id` key (as in your multitenant alternative) can be partitioned by hash, as sketched below. By the law of large numbers, the records will be spread across the 3 partitions more or less evenly, and Postgres 11 optimizes queries so that partitions that cannot match are discarded as early as possible.
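A sketch of what that could look like (the `pedidos` table and its columns are illustrative, not from the question):

```sql
-- Parent table, partitioned by hash on the tenant key.
CREATE TABLE pedidos (
    pedido_id  bigint  NOT NULL,
    cliente_id bigint  NOT NULL,
    total      numeric
) PARTITION BY HASH (cliente_id);

-- Three partitions; each row lands in hash(cliente_id) mod 3.
CREATE TABLE pedidos_p0 PARTITION OF pedidos FOR VALUES WITH (MODULUS 3, REMAINDER 0);
CREATE TABLE pedidos_p1 PARTITION OF pedidos FOR VALUES WITH (MODULUS 3, REMAINDER 1);
CREATE TABLE pedidos_p2 PARTITION OF pedidos FOR VALUES WITH (MODULUS 3, REMAINDER 2);
```

Running `EXPLAIN SELECT * FROM pedidos WHERE cliente_id = 42` should then show a single partition being scanned.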
Whether you use one large database or many, having many connected clients means keeping a lot of data and a lot of visible pages in memory, and that will happen in both cases.
The difference lies in strategy: if you separate the data and implement the instance registry up front, you can later grow horizontally by adding more machines instead of fattening up the main one.
That, and finally the old adage about keeping all your eggs in one basket...
Bonus Track: mapping via business layer
I once implemented a system where the stumbling block was not the relational structure, which was 3NF-compliant, but the fact that a client could upload N data tables with arbitrary columns, his "personal collections", like someone uploading spreadsheets to Google Drive.
Since there is no way to maintain relational integrity between a client and its tables using Postgres catalog entities like `pg_tables`, you can't rely on database logic alone so that, when a client is deleted, the tables he uploaded get dropped. You have to use the business layer to relate a customer to its tables, and drop them before deleting the customer.

As a result, the business layer ends up absolutely coupled to the persistence layer, when the idea is that the layers of an application should know no more than the minimum about the implementation of adjacent layers.
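A minimal sketch of that bookkeeping (all names here are hypothetical):

```sql
-- Mapping the business layer must maintain by hand,
-- since a foreign key cannot reference pg_tables.
CREATE TABLE tablas_de_cliente (
    cliente_id bigint NOT NULL,
    tabla      text   NOT NULL,
    PRIMARY KEY (cliente_id, tabla)
);

-- Drop a client's uploaded tables before removing the client itself.
CREATE FUNCTION borrar_cliente(p_cliente bigint) RETURNS void AS $$
DECLARE
    t text;
BEGIN
    FOR t IN SELECT tabla FROM tablas_de_cliente WHERE cliente_id = p_cliente LOOP
        EXECUTE format('DROP TABLE IF EXISTS %I', t);
    END LOOP;
    DELETE FROM tablas_de_cliente WHERE cliente_id = p_cliente;
END;
$$ LANGUAGE plpgsql;
```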
(Nowadays someone would tell me that instead of N tables per customer I should have had a single table where each record carries a JSONB field holding that owner's collection... which would be almost fine, except that these were geospatial tables that needed filtering with PostGIS, and Postgres doesn't have spatial indexes for GeoJSON structures inside JSONB the way MongoDB does.)