No Comments

Fine tuning Postgres to achieve 5,000 Queries per second!

Nikhil Shetty | Database Consultant, Ashnik
Singapore, 15 Jan 2021
Fine-tuning-Postgres-Nikhil-FImg

by , , No Comments

15-Jan-2021

I am sharing my recent experience at one of the customers of Ashnik. This customer is a global ISV/SI who has developed business critical application that is used by millions of end customers. The functionality demands very high number of transaction processing during peak periods. The customer decided to use EDB Postgres instead of proprietary database technology and hence there was a lot of pressure on us to deliver desired performance. We had done consulting work for design and sizing phase. After lot of reviews and approvals, we deployed EDB Postgres on target servers.

Pre-solution status and setup

We had deployed a three node EDB Postgres cluster with EDB EFM for HA and fault tolerance. We had also setup Pgpool for connection pooling and load balancing. We were expecting a lot of transactions (SELECT,INSERT and UPDATES) and concurrency in the database server. We enabled serialize_accept on the pgpool to prevent the “thundering herd problem” more of which can be found in the link serialize_accept

Additionally, below parameters were set on the pgpool for maintaining the pool of connections:
1. client_idle_limit – 120 (Disconnects database sessions after 120sec have passed since the last query)
2. child_life_time – 0 (We disabled this parameter because we had enabled serialize_accept)
3. connection_life_time – 60 (Terminates a database cached connection 60 seconds after the client disconnects)
4. child_max_connections – 10 (Each pgpool child process can handle 10 connections before it spawns a new child process)

Database Architecture for Performance Test environment:

Postgres

We started with recommended settings for database parameters like shared_buffers, work_mem, temp_file_limit, etc. and after a few weeks of testing we finetuned some parameters from their original values. Other configuration settings at the database and OS level were done for the expected workload and we were ready for a PT run (Performance Test).

The Challenge

The business functionality of this application demands very high number of business transactions which in turn means high number of concurrent connections coming from multiple microservices to database. For this purpose, we suggested use of pgpool for connections pooling as well as for load balancing of connections at database end coming from these microservices. Initially, ISV’s application architecture team estimated that to achieve the desired business transactions, it would need upto 10,000 concurrent connections at database level. This number (10,000 concurrent connections) is really high but we started working with application and infra team to achieve the performance.

The joint team planned for rounds of rigorous Performance Testing ( PT) and Load Testing ( LT).

When the Performance Testing run started, everything looked ok with 500-600 concurrent connections to the database. Queries were also running within reasonable time(~<100ms). As the run progressed, we noticed that the database connections were piling up. Within 10-15 seconds, connections count moved from 1,000 to 2,500 and soon there were around 7,000+ connections to the database (5000+ active, 1000+ idle connections, and 500+ idle in transaction connections). Load average on the server shot off the roof and read 3,000, 2,800, 2,700 for the last 1, 5 and 15 minutes respectively. It was evident that the database server was not able to handle these many connections. Once it reached beyond “num_init_children”parameter on pgpool we also started getting error “Sorry, too many clients already”.

For each run, we analysed the behaviour of the connections (how many connections came to the database, how long each connection was held in the database, and what happened to the connections after client disconnected). From this analysis we found that, after the client disconnects, the connections go into idle state and never terminate ( As per the “connection_life_time” parameter in pgpool, sixty seconds after the client disconnects, that particular cached database connections should be terminated. But this was not happening). This led to a pileup of the database connections and soon there were no connections left in the database to serve application requests.

One more observation was that there were too many connections coming into the database in a short span which led to exponential increase in the number of concurrent connections.
At this stage it is important to understand the nuances of concurrent connections.

What exactly is ‘concurrent connections’ ?

When connections, coming from application get connected to database through pgpool , there are 4 types of connections state ,

1. active: This indicates that the connection is working.
2. idle: This indicates that the connection is idle, and we need to track these connections based on the time that they have been idle.
3. idle in transaction: This indicates that backend is in a transaction, but it is currently not doing anything and could be waiting for an input from the end user such as “begin “ transactions
4. idle in transaction (aborted): This state is similar to idle in transaction, except one of the statements in the transaction caused an error. This also needs to be monitored based on the time since it has been idle.

Remember, in this case, active state connections are important because those connections are currently running and consuming resources that results in shooting-up of load average.

Solution to achieve desired results

Our ultimate goal is to achieve desired business transactions and to reach that goal, we have to walk through improving concurrent connections step.

Community to the Rescue!

We noticed that there were too many “idle” connections laying idle for more than few seconds which were not required and to overcome those we had set “connection_life_time” parameters. But we observed during several rounds of PT that this parameter was not working as desired.

We raised the concern with the Pgpool community and as you would expect from the community, we received a quick response from one of the committers who tried the same scenario at his end. After few trials we got the response, “somehow connection_life_time parameter does not work if serialize_accept is enabled”. Community immediately released a patch for all new and older versions of pgpool to fix the issue.

We took the patch to EDB and in no time EDB came up with a point fix and released the patch as a new minor version of pgpool. We installed pgpool with the fix on the client machine and tested with a few connections and voila! the issue was resolved. We then moved towards the Performance Test run. After applying the fix we noticed that the cached connections were terminated after sixty seconds of no reuse and that reduced the pressure on the database server to a great extent.

Since, for the primary database user, we could see less reuse of cached connections, we reduced connection_life_time to 30 seconds and database server could easily handle more concurrent connections to the database. For other users we noticed that connections were being used within the 30 seconds period and pooling helped in mitigating the overhead of creating a new connection everytime.

Fine tunning the microservices architecture also

As mentioned earlier, there were multiple sources of connections coming to databases including connections from multiple microservices as well as from data stream pipeline through kafka.

In microservices application architecture where each API / application running within container connects to database through jdbc, it is important to tune this connections paraments. It is because there is no stateful application connections size when you are using microservices in containers.

Taking these into consideration, application team did some parameter tuning at the application level to minimize the number of connections to the database without affecting the desired performance of processing huge amount of data

Conclusion

With the combination of Postgres fix and application fine tuning, so we saw a peak of 900-1,100 concurrent connections (which was way below 7,000 connections we had seen earlier) to the database and the patch helped us to easily manage those connections as it did not allow idle connections to be in the database forever if they were not reused.

With these fixes applied, we noticed that the database server could handle upto 1,000+ concurrent connections during peak load and gave us the performance that we had expected.

In terms of business transactions, each business transactions is around 30-35 queries hitting the database. We are able to achieve ~ 150 business transactions with 4,500-5,000 QPS ( query per second ). This is what the ISV’s application team was aiming for.

It was a great experience to troubleshoot various scenarios with postgres and I am really amazed at the performance. I am more amazed by the speed of the community support! I truly experienced the power of open source during this project.

1
0

  • Nikhil has joined Ashnik as a database consultant for postgres having more than three years of experience working as a database administrator at TCS and as a senior analyst at Allianz Technology on different technologies such as Postgres, Oracle and SAP HANA.

More From Nikhil Shetty | Database Consultant, Ashnik :
15-Jan-2021