It was 2 AM when the alerts started firing. Database connections exhausted. Services timing out. Users unable to log in. The dashboard showed our MySQL instance drowning—800+ connections when we'd configured for 500, and climbing.
The instinct was to increase max_connections. We bumped it to 1000. Relief for a few hours. Then the same cascade started again. That's when we knew: something was fundamentally broken, and throwing resources at it wasn't the answer.
What followed was a three-week journey through the depths of MySQL internals, connection pooling mechanics, and prepared statement behavior. This is that story—the debugging process, the band-aids that kept us alive, and the industry-standard fixes that finally solved the problem.
The Architecture That Worked Until It Didn't
Our production setup was fairly standard for a growing platform:
- 30+ Spring Boot microservices connecting to MySQL 8.0 on AWS RDS
- HikariCP managing connection pools at the application level
- ~8,000 requests per second at peak traffic
- JPA/Hibernate for ORM, with some raw JDBC for performance-critical paths
This architecture had served us well for two years. Load tests passed. Staging environments behaved. But production has a way of exposing edge cases that controlled environments never reveal.
The pattern was always the same: traffic would spike, connections would climb, and suddenly we'd hit a wall. Not a gradual degradation—a cliff. One moment things were fine, the next moment nothing worked.
Understanding the Enemy: How MySQL Prepared Statements Actually Work
Before I explain what went wrong, you need to understand how MySQL handles prepared statements. This knowledge was crucial to diagnosing our problem, and it's something most developers never think about.
The Two Flavors of Prepared Statements
MySQL JDBC drivers support two completely different modes of prepared statement handling, and confusing them is a common source of production issues.
Client-Side Prepared Statements (The Default)
When you use MySQL Connector/J with default settings, prepared statements are emulated entirely on the client side. Here's what actually happens:
Your application creates a PreparedStatement with placeholders. When you set parameters and execute, the JDBC driver substitutes the values directly into the SQL string and sends a complete query to MySQL. The server receives SELECT * FROM users WHERE id = 42, not a parameterized query.
This means:
- Every execution sends a fresh SQL string to the server
- The server parses and optimizes the query each time
- Nothing accumulates on the server—there's no state to leak
Server-Side Prepared Statements (The Performance Optimization)
When you enable useServerPrepStmts=true in your connection string, behavior changes dramatically. Now the driver uses MySQL's binary protocol for true prepared statements.
The first time you execute a query, the driver sends a COM_STMT_PREPARE command with your SQL template. MySQL parses the query, builds an execution plan, and stores everything in memory—tied to your specific connection. It returns a statement handle (just a number) that your driver caches.
Subsequent executions send COM_STMT_EXECUTE with just the statement handle and parameter values. MySQL skips parsing entirely and uses the cached plan. This is faster, especially for complex queries executed repeatedly.
But here's the critical detail: that prepared statement lives on the server until the connection closes or you explicitly deallocate it. And MySQL has a global limit on how many can exist across all connections.
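Here's roughly what that looks like from JDBC. The Java code is identical in both modes; only the connection properties decide whether MySQL ever sees COM_STMT_PREPARE and COM_STMT_EXECUTE or just plain query text (host, schema, and credentials below are placeholders):
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.sql.ResultSet;

public class PrepModesDemo {
    public static void main(String[] args) throws Exception {
        // With no extra properties ("jdbc:mysql://db-host:3306/app") the driver emulates
        // prepared statements client-side: it splices parameter values into the SQL text
        // and sends a complete query every time.
        // With useServerPrepStmts=true the first execution sends COM_STMT_PREPARE, MySQL
        // keeps the parsed statement on this connection, and later executions send only
        // the statement handle plus parameter values.
        String url = "jdbc:mysql://db-host:3306/app?useServerPrepStmts=true";

        try (Connection conn = DriverManager.getConnection(url, "app", "secret");
             PreparedStatement ps = conn.prepareStatement(
                     "SELECT name FROM users WHERE id = ?")) {
            for (long id = 1; id <= 3; id++) {
                ps.setLong(1, id);                    // COM_STMT_EXECUTE with the handle
                try (ResultSet rs = ps.executeQuery()) {
                    while (rs.next()) {
                        System.out.println(rs.getString("name"));
                    }
                }
            }
        } // closing the connection deallocates its server-side statements
    }
}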
The Global Limit Nobody Thinks About
MySQL enforces a server-wide limit on prepared statements through the max_prepared_stmt_count variable. The default is 16,382. That might sound like a lot, but consider:
- 500 connections × 50 prepared statements each = 25,000 statements
- You've exceeded the limit, and new prepares start failing
When you hit this limit, MySQL returns:
ERROR 1461 (42000): Can't create more than max_prepared_stmt_count statements
This error doesn't tell you which connection is the problem or what queries are accumulating. It just fails—and often it fails for your healthiest services because the misbehaving ones already consumed the quota.
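From the application side, Connector/J surfaces this as a SQLException whose vendor error code is 1461. A small helper along these lines (a sketch; the names are ours, not a library API) makes the failure mode easy to spot in logs:
import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.sql.SQLException;

public final class PreparedStatementErrors {
    // MySQL error code for ER_MAX_PREPARED_STMT_COUNT_REACHED
    private static final int MAX_PREPARED_STMT_COUNT_REACHED = 1461;

    public static boolean isServerQuotaExhausted(SQLException e) {
        return e.getErrorCode() == MAX_PREPARED_STMT_COUNT_REACHED;
    }

    public static long countRecentOrders(Connection conn, long sinceId) throws SQLException {
        try (PreparedStatement ps = conn.prepareStatement(
                "SELECT COUNT(*) FROM orders WHERE id > ?")) {
            ps.setLong(1, sinceId);
            try (ResultSet rs = ps.executeQuery()) {
                rs.next();
                return rs.getLong(1);
            }
        } catch (SQLException e) {
            if (isServerQuotaExhausted(e)) {
                // The server-wide quota is gone; retrying the same prepare won't help
                // until statements are deallocated or connections are recycled.
                throw new SQLException("max_prepared_stmt_count exhausted on the server", e);
            }
            throw e;
        }
    }
}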
Memory Implications
Each server-side prepared statement consumes memory on MySQL:
- Statement metadata: 1-2 KB for the SQL text and parsed structure
- Execution plan cache: 2-10 KB depending on query complexity (joins, subqueries, etc.)
- Parameter buffers: Variable based on parameter types and counts
With 500 connections each holding 100 prepared statements, you're looking at:
- 50,000 statements server-wide
- 250-500 MB of dedicated memory
- Significant pressure on MySQL's internal caches
This memory doesn't show up in your application metrics. It's invisible until it causes problems.
Phase 1: Discovery — Finding the Smoking Gun
When our incident started, we had no visibility into prepared statement counts. We could see connections climbing in CloudWatch, but we didn't know why connections weren't being released or reused properly.
The performance_schema database became our forensic toolkit. It's MySQL's built-in instrumentation layer, and it contains tables that track everything from query execution times to—crucially—prepared statement instances.
First Question: What Are All These Connections Doing?
We started with the basics. MySQL tracks every client connection as a "thread," and the threads table tells you exactly what each one is doing:
SELECT
PROCESSLIST_ID AS connection_id,
PROCESSLIST_USER AS user,
SUBSTRING_INDEX(PROCESSLIST_HOST, ':', 1) AS client_ip,
PROCESSLIST_COMMAND AS state,
PROCESSLIST_TIME AS idle_seconds,
LEFT(PROCESSLIST_INFO, 50) AS current_query
FROM performance_schema.threads
WHERE TYPE = 'FOREGROUND'
AND PROCESSLIST_USER IS NOT NULL
ORDER BY PROCESSLIST_TIME DESC
LIMIT 30;
The results were immediately suspicious. Hundreds of connections showed Sleep as their state, with idle times measured in hours. These connections had finished their work long ago but were still sitting there, holding resources.
In a healthy connection pool, connections return to the pool after each transaction and get reused quickly. Seeing connections idle for hours meant either our pool was misconfigured, connections were leaking, or something was preventing proper reuse.
The Real Question: What Are These Connections Holding?
Next, we needed to see what was accumulating on these zombie connections. The prepared_statements_instances table tracks every prepared statement currently allocated on the server:
SELECT
t.PROCESSLIST_ID AS connection_id,
t.PROCESSLIST_USER AS user,
SUBSTRING_INDEX(t.PROCESSLIST_HOST, ':', 1) AS client_ip,
t.PROCESSLIST_TIME AS idle_seconds,
COUNT(ps.STATEMENT_ID) AS prepared_statements,
ROUND(SUM(ps.SUM_TIMER_EXECUTE)/1000000000, 2) AS total_exec_time_ms
FROM performance_schema.threads t
LEFT JOIN performance_schema.prepared_statements_instances ps
ON t.THREAD_ID = ps.OWNER_THREAD_ID
WHERE t.TYPE = 'FOREGROUND'
AND t.PROCESSLIST_USER IS NOT NULL
GROUP BY t.THREAD_ID, t.PROCESSLIST_ID, t.PROCESSLIST_USER,
t.PROCESSLIST_HOST, t.PROCESSLIST_TIME
HAVING prepared_statements > 50
ORDER BY prepared_statements DESC;
This query joined thread information with prepared statement counts. The results were staggering: some connections had over 2,000 prepared statements each. Connections that had been idle for hours were holding onto thousands of server-side statements.
We checked the global limit usage:
SELECT
@@max_prepared_stmt_count AS max_allowed,
(SELECT VARIABLE_VALUE FROM performance_schema.global_status
WHERE VARIABLE_NAME = 'Prepared_stmt_count') AS current_count;
We were at 15,847 out of 16,382. The system was 97% exhausted, and any new service trying to prepare a statement would fail.
Identifying the Culprits
We needed to know which services were causing the problem. Since each microservice connected from a different IP (or set of IPs in Kubernetes), we could aggregate by client address:
SELECT
SUBSTRING_INDEX(t.PROCESSLIST_HOST, ':', 1) AS client_ip,
COUNT(DISTINCT t.PROCESSLIST_ID) AS connections,
SUM(CASE WHEN t.PROCESSLIST_COMMAND = 'Sleep'
AND t.PROCESSLIST_TIME > 300 THEN 1 ELSE 0 END) AS zombies,
SUM(prep.stmt_count) AS total_prepared_statements,
ROUND(AVG(prep.stmt_count), 1) AS avg_per_connection
FROM performance_schema.threads t
LEFT JOIN (
SELECT OWNER_THREAD_ID, COUNT(*) AS stmt_count
FROM performance_schema.prepared_statements_instances
GROUP BY OWNER_THREAD_ID
) prep ON t.THREAD_ID = prep.OWNER_THREAD_ID
WHERE t.TYPE = 'FOREGROUND'
AND t.PROCESSLIST_USER IS NOT NULL
GROUP BY client_ip
ORDER BY total_prepared_statements DESC;
Three services stood out with dramatically higher statement counts. These became our targets for investigation—but first, we needed to stop the bleeding.
Phase 2: The Band-Aid — Killing Zombie Connections
With production unstable and users affected, we couldn't wait for a proper fix. We needed immediate relief while we investigated the root cause.
The solution was simple but effective: automatically kill connections that had been idle too long. When a connection dies, MySQL deallocates all its prepared statements immediately. It's aggressive—applications will see connection errors—but it's better than total system failure.
Why MySQL Events Instead of Application Changes
We could have tried to fix the application immediately, but that approach had problems:
- We didn't yet know which code was causing the issue
- Deploying changes to 30+ services takes time
- We needed relief in minutes, not hours
MySQL's event scheduler let us implement a server-side solution that worked immediately, regardless of application behavior.
Building the Zombie Killer
First, we created audit tables to track what we were killing. Visibility into automated actions is crucial—you need to know what's happening and prove the fix is working:
CREATE DATABASE IF NOT EXISTS ops_tools;
CREATE TABLE ops_tools.connection_kill_log (
id BIGINT AUTO_INCREMENT PRIMARY KEY,
killed_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP,
connection_id BIGINT NOT NULL,
user VARCHAR(32),
host VARCHAR(255),
idle_seconds INT,
prepared_stmt_count INT,
kill_reason VARCHAR(100),
INDEX idx_killed_at (killed_at),
INDEX idx_host (host(50))
) ENGINE=InnoDB;
CREATE TABLE ops_tools.zombie_killer_runs (
id INT AUTO_INCREMENT PRIMARY KEY,
run_time TIMESTAMP DEFAULT CURRENT_TIMESTAMP,
connections_killed INT,
statements_freed INT,
INDEX idx_run_time (run_time)
) ENGINE=InnoDB;
The stored procedure does the heavy lifting. It scans for idle connections, logs what it finds, and terminates them:
DELIMITER //
CREATE PROCEDURE ops_tools.kill_zombie_connections(
IN idle_threshold_seconds INT
)
BEGIN
DECLARE done INT DEFAULT FALSE;
DECLARE v_conn_id BIGINT;
DECLARE v_user VARCHAR(32);
DECLARE v_host VARCHAR(255);
DECLARE v_idle INT;
DECLARE v_prep_count INT;
DECLARE v_killed INT DEFAULT 0;
DECLARE v_freed INT DEFAULT 0;
DECLARE zombie_cursor CURSOR FOR
SELECT
t.PROCESSLIST_ID,
t.PROCESSLIST_USER,
t.PROCESSLIST_HOST,
t.PROCESSLIST_TIME,
COALESCE(p.cnt, 0)
FROM performance_schema.threads t
LEFT JOIN (
SELECT OWNER_THREAD_ID, COUNT(*) AS cnt
FROM performance_schema.prepared_statements_instances
GROUP BY OWNER_THREAD_ID
) p ON t.THREAD_ID = p.OWNER_THREAD_ID
WHERE t.TYPE = 'FOREGROUND'
AND t.PROCESSLIST_COMMAND = 'Sleep'
AND t.PROCESSLIST_TIME > idle_threshold_seconds
AND t.PROCESSLIST_USER NOT IN ('rdsadmin', 'mysql.sys', 'root')
AND t.PROCESSLIST_ID != CONNECTION_ID();
DECLARE CONTINUE HANDLER FOR NOT FOUND SET done = TRUE;
DECLARE CONTINUE HANDLER FOR SQLEXCEPTION BEGIN END;
OPEN zombie_cursor;
kill_loop: LOOP
FETCH zombie_cursor INTO v_conn_id, v_user, v_host, v_idle, v_prep_count;
IF done THEN LEAVE kill_loop; END IF;
INSERT INTO ops_tools.connection_kill_log
(connection_id, user, host, idle_seconds, prepared_stmt_count, kill_reason)
VALUES
(v_conn_id, v_user, v_host, v_idle, v_prep_count,
CONCAT('Idle ', v_idle, 's with ', v_prep_count, ' prepared stmts'));
KILL CONNECTION v_conn_id;
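-- Note: on RDS, if KILL is not permitted for your user, use CALL mysql.rds_kill(v_conn_id) instead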
SET v_killed = v_killed + 1;
SET v_freed = v_freed + v_prep_count;
END LOOP;
CLOSE zombie_cursor;
IF v_killed > 0 THEN
INSERT INTO ops_tools.zombie_killer_runs (connections_killed, statements_freed)
VALUES (v_killed, v_freed);
END IF;
END //
DELIMITER ;
Finally, we scheduled it to run every 5 minutes:
SET GLOBAL event_scheduler = ON;
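-- (On RDS, event_scheduler may need to be enabled via the DB parameter group rather than SET GLOBAL)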
CREATE EVENT ops_tools.zombie_killer_event
ON SCHEDULE EVERY 5 MINUTE
STARTS CURRENT_TIMESTAMP
ENABLE
DO CALL ops_tools.kill_zombie_connections(600);
The 600 parameter means we kill connections idle for more than 10 minutes. This is aggressive but necessary during a crisis. We could tune it later.
Monitoring the Fix
Within the first hour, we could see the impact:
SELECT
DATE(killed_at) AS date,
COUNT(*) AS connections_killed,
SUM(prepared_stmt_count) AS statements_freed,
ROUND(AVG(idle_seconds)/60, 1) AS avg_idle_minutes
FROM ops_tools.connection_kill_log
GROUP BY DATE(killed_at)
ORDER BY date DESC;
The zombie killer was terminating 50-100 connections per hour and freeing thousands of prepared statements. More importantly, the global prepared statement count dropped from 97% to around 30%, and the system stabilized.
This band-aid kept us alive for three weeks while we investigated the root cause and planned proper fixes.
Phase 3: Root Cause Analysis — Why Were Statements Accumulating?
With immediate pressure relieved, we could dig into the actual problem. The investigation revealed multiple contributing factors, each making the others worse.
Root Cause #1: Missing Client-Side Statement Cache
Our JDBC connection strings had useServerPrepStmts=true to enable the performance benefits of server-side prepared statements. But we were missing a critical companion setting: cachePrepStmts=true.
Here's what was happening without client-side caching:
1. Application gets a connection from the pool
2. Application prepares SELECT * FROM users WHERE id = ?
3. Driver sends COM_STMT_PREPARE to MySQL
4. MySQL creates prepared statement #1, returns handle
5. Application executes the statement
6. Connection returns to the pool
7. Later, the same connection runs the same query
8. Driver sends COM_STMT_PREPARE again
9. MySQL creates prepared statement #2 (a duplicate!)
10. This repeats forever
The MySQL Connector/J driver has a built-in LRU cache that remembers which statements have already been prepared on each connection. With cachePrepStmts=true, step 8 becomes "driver finds cached handle, reuses it" instead of preparing again.
Without this cache, every time your code called prepareStatement(), a new server-side statement was created—even for identical queries on the same connection.
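You can see this for yourself with a quick experiment (a sketch, with placeholder connection details): run the same query repeatedly on one connection and watch the session's Com_stmt_prepare counter. With cachePrepStmts=false it climbs on every prepareStatement() call; with cachePrepStmts=true it stops after the first prepare of a given SQL string:
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.sql.Statement;

public class PrepareCounterCheck {
    public static void main(String[] args) throws Exception {
        // Flip cachePrepStmts between false and true and compare the counters
        String url = "jdbc:mysql://db-host:3306/app"
                + "?useServerPrepStmts=true&cachePrepStmts=false";
        try (Connection conn = DriverManager.getConnection(url, "app", "secret")) {
            System.out.println("before: " + comStmtPrepare(conn));
            for (int i = 0; i < 10; i++) {
                try (PreparedStatement ps =
                         conn.prepareStatement("SELECT * FROM users WHERE id = ?")) {
                    ps.setLong(1, i);
                    ps.executeQuery().close();
                }
            }
            System.out.println("after: " + comStmtPrepare(conn));
        }
    }

    // Reads the session counter of COM_STMT_PREPARE commands sent so far
    private static long comStmtPrepare(Connection conn) throws Exception {
        try (Statement st = conn.createStatement();
             ResultSet rs = st.executeQuery("SHOW SESSION STATUS LIKE 'Com_stmt_prepare'")) {
            rs.next();
            return Long.parseLong(rs.getString(2));
        }
    }
}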
Root Cause #2: Dynamic Query Generation
Even with proper caching, we had another problem: some services were generating queries dynamically in ways that defeated caching entirely.
The worst offender was IN clause construction:
// This code creates a NEW prepared statement for every unique list size
public List<Order> findByIds(List<Long> ids) {
String placeholders = ids.stream()
.map(id -> "?")
.collect(Collectors.joining(","));
String sql = "SELECT * FROM orders WHERE id IN (" + placeholders + ")";
return jdbcTemplate.query(sql, ids.toArray(), orderMapper);
}
When called with 5 IDs, this generates: SELECT * FROM orders WHERE id IN (?,?,?,?,?)
When called with 3 IDs: SELECT * FROM orders WHERE id IN (?,?,?)
When called with 100 IDs: SELECT * FROM orders WHERE id IN (?,?,?...100 times...)
Each unique SQL string is a different prepared statement. If your application calls this method with varying list sizes—which is normal—you'll create dozens or hundreds of distinct statements.
Similarly, some code was concatenating values directly into SQL:
// NEVER do this—each unique value creates a new statement
String sql = "SELECT * FROM products WHERE category = '" + category + "'";
This not only defeats prepared statement reuse but also opens SQL injection vulnerabilities.
Root Cause #3: ORM-Generated Dynamic Queries
Hibernate's Criteria API and JPA specifications can generate dynamic SQL based on which parameters are non-null:
// Each combination of non-null parameters generates different SQL
CriteriaBuilder cb = entityManager.getCriteriaBuilder();
CriteriaQuery<User> query = cb.createQuery(User.class);
Root<User> root = query.from(User.class);
List<Predicate> predicates = new ArrayList<>();
if (name != null) predicates.add(cb.equal(root.get("name"), name));
if (email != null) predicates.add(cb.equal(root.get("email"), email));
if (status != null) predicates.add(cb.equal(root.get("status"), status));
query.where(predicates.toArray(new Predicate[0]));
List<User> users = entityManager.createQuery(query).getResultList();
With three optional parameters, you have 8 possible SQL combinations (2³). With ten parameters, you have 1,024 combinations. Each becomes a separate prepared statement.
Root Cause #4: Connection Lifetime Too Long
Our HikariCP configuration allowed connections to live for 30 minutes:
hikari:
max-lifetime: 1800000 # 30 minutes
A connection living for 30 minutes handles thousands of requests. Even with a small rate of statement leakage, the accumulation becomes significant. Reducing connection lifetime limits how much can accumulate before the connection is recycled.
Phase 4: The Proper Fix — Industry-Standard Configuration
With root causes identified, we implemented proper fixes. These are the configurations and patterns that eliminated our prepared statement problems permanently.
Fix #1: Complete JDBC Connection String
The MySQL Connector/J driver has several parameters that work together for proper prepared statement handling:
jdbc:mysql://hostname:3306/database?
useServerPrepStmts=true&
cachePrepStmts=true&
prepStmtCacheSize=250&
prepStmtCacheSqlLimit=2048&
cacheResultSetMetadata=true&
cacheServerConfiguration=true&
useLocalSessionState=true
Let me explain each parameter:
useServerPrepStmts=true — Enables true server-side prepared statements using MySQL's binary protocol. Provides performance benefits for repeated queries.
cachePrepStmts=true — This is the critical one we were missing. Enables the driver's client-side LRU cache that maps SQL strings to server-side statement handles. Without this, every prepareStatement() call creates a new server-side statement.
prepStmtCacheSize=250 — Number of prepared statements to cache per connection. The default (25) is too low for most applications. Set this to match your application's unique query count. Too low means cache evictions and re-preparation; too high wastes memory.
prepStmtCacheSqlLimit=2048 — Maximum SQL string length that will be cached. The default (256) is far too short—many real queries exceed 256 characters. Queries longer than this limit are prepared fresh every time.
cacheResultSetMetadata=true — Caches result set metadata (column names, types) to avoid re-fetching it on each execution.
cacheServerConfiguration=true — Caches server variables and configuration to reduce round-trips.
useLocalSessionState=true — Tracks session state locally to avoid unnecessary SET commands.
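If a service builds its pool in code rather than through application.yml, the same properties can be applied on the HikariCP config (a sketch; the URL, credentials, and pool sizing are placeholders):
import com.zaxxer.hikari.HikariConfig;
import com.zaxxer.hikari.HikariDataSource;

public final class DataSources {
    public static HikariDataSource mysqlPool(String jdbcUrl, String user, String password) {
        HikariConfig config = new HikariConfig();
        config.setJdbcUrl(jdbcUrl);
        config.setUsername(user);
        config.setPassword(password);
        config.setMaximumPoolSize(20);
        config.setMaxLifetime(300_000);        // 5 minutes, matching Fix #2
        // Prepared statement handling: server-side statements plus the client-side cache
        config.addDataSourceProperty("useServerPrepStmts", "true");
        config.addDataSourceProperty("cachePrepStmts", "true");
        config.addDataSourceProperty("prepStmtCacheSize", "250");
        config.addDataSourceProperty("prepStmtCacheSqlLimit", "2048");
        config.addDataSourceProperty("cacheResultSetMetadata", "true");
        config.addDataSourceProperty("cacheServerConfiguration", "true");
        config.addDataSourceProperty("useLocalSessionState", "true");
        return new HikariDataSource(config);
    }
}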
Fix #2: Proper HikariCP Configuration
spring:
datasource:
hikari:
pool-name: "MyService-HikariPool"
# Pool sizing - smaller is often better
minimum-idle: 5
maximum-pool-size: 20
# Connection lifecycle - shorter prevents accumulation
max-lifetime: 300000 # 5 minutes (was 30)
idle-timeout: 60000 # 1 minute
# Validation
connection-test-query: "SELECT 1"
validation-timeout: 5000
# Leak detection - catches application bugs
leak-detection-threshold: 30000 # 30 seconds
# Monitoring
register-mbeans: true
The key insight here is that shorter connection lifetimes are safer. Yes, creating new connections has overhead. But that overhead is tiny compared to the cost of accumulated statements bringing down your database.
The leak-detection-threshold is invaluable during development. If your application holds a connection for more than 30 seconds without closing it, HikariCP logs a stack trace showing exactly where the leak occurs.
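For illustration, this is the kind of bug that setting catches: a connection checked out and never returned (the method and table names here are made up). After 30 seconds HikariCP logs a warning with the stack trace of the getConnection() call that leaked:
import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.SQLException;
import javax.sql.DataSource;

public class ReportExporter {
    public void exportReport(DataSource dataSource) throws SQLException {
        Connection conn = dataSource.getConnection();   // checked out...
        try (PreparedStatement ps = conn.prepareStatement("SELECT * FROM reports")) {
            ps.executeQuery();
        }
        // ...but never closed: the connection (and its server-side statements) stays
        // out of the pool, and leak detection logs where it was taken.
    }
}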
Fix #3: Fixing Dynamic Queries
For IN clauses, we implemented padding to maintain a consistent query structure:
// Fixed bucket sizes keep the number of distinct SQL strings small
private static final int[] IN_CLAUSE_SIZES = {1, 5, 10, 25, 50, 100};

public List<Order> findByIds(List<Long> ids) {
    if (ids.isEmpty()) {
        return Collections.emptyList();
    }
    // Find the smallest bucket that fits our IDs
    // (lists larger than the biggest bucket should be chunked before calling this)
    int bucketSize = Arrays.stream(IN_CLAUSE_SIZES)
        .filter(size -> size >= ids.size())
        .findFirst()
        .orElse(IN_CLAUSE_SIZES[IN_CLAUSE_SIZES.length - 1]);
    // Pad the list to bucket size by repeating the first ID
    List<Long> paddedIds = new ArrayList<>(ids);
    while (paddedIds.size() < bucketSize) {
        paddedIds.add(ids.get(0));
    }
    String placeholders = String.join(",", Collections.nCopies(bucketSize, "?"));
    String sql = "SELECT * FROM orders WHERE id IN (" + placeholders + ")";
    return jdbcTemplate.query(sql, paddedIds.toArray(), orderMapper);
}
Now instead of unique queries for every list size from 1 to 100, we have only 6 distinct queries. Each gets prepared once and reused.
For optional filter parameters, we changed from dynamic predicates to inclusive SQL:
// Instead of building dynamic WHERE clauses:
String jpql = """
SELECT u FROM User u
WHERE (:name IS NULL OR u.name = :name)
AND (:email IS NULL OR u.email = :email)
AND (:status IS NULL OR u.status = :status)
""";
This single query handles all parameter combinations. The database optimizer is smart enough to optimize away the IS NULL checks for provided parameters.
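Binding looks the same no matter which filters the caller actually supplied: every parameter is always set, possibly to null, so Hibernate produces a single SQL string. A sketch, assuming a standard injected EntityManager:
public List<User> searchUsers(EntityManager em, String name, String email, String status) {
    String jpql = """
            SELECT u FROM User u
            WHERE (:name IS NULL OR u.name = :name)
              AND (:email IS NULL OR u.email = :email)
              AND (:status IS NULL OR u.status = :status)
            """;
    return em.createQuery(jpql, User.class)
            .setParameter("name", name)       // may be null
            .setParameter("email", email)     // may be null
            .setParameter("status", status)   // may be null
            .getResultList();
}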
Fix #4: Hibernate-Specific Settings
spring:
jpa:
properties:
hibernate:
query.plan_cache_max_size: 2048
query.in_clause_parameter_padding: true
The in_clause_parameter_padding setting tells Hibernate to automatically pad IN clause parameters to power-of-2 sizes (1, 2, 4, 8, 16...). This dramatically reduces unique query counts when using JPA Criteria or JPQL with collection parameters.
Fix #5: MySQL Server Configuration
For additional safety margin, we adjusted MySQL settings:
-- Increase the global limit (calculate based on your connection count)
SET GLOBAL max_prepared_stmt_count = 50000;
-- Reduce wait_timeout as a safety net
SET GLOBAL wait_timeout = 300;
SET GLOBAL interactive_timeout = 300;
For RDS, these are configured in the parameter group rather than via SQL.
Phase 5: Building the Monitoring Stack
After fixing the immediate problems, we built monitoring to catch any regression before it became an incident.
Custom CloudWatch Metrics
We created a Lambda function that runs every minute, queries performance_schema, and pushes custom metrics to CloudWatch:
Metrics collected:
- TotalPreparedStatements — Global count across all connections
- PreparedStatementLimitUsage — Percentage of max_prepared_stmt_count used
- MaxStatementsPerConnection — Highest count on any single connection
- ZombieConnections — Connections idle more than 5 minutes
- ConnectionsByState — Breakdown of Sleep, Query, etc.
These metrics give us visibility that standard RDS metrics don't provide. We can see statement accumulation trends before they become critical.
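The collector itself is small. A minimal sketch of its core, assuming the AWS SDK v2 and Connector/J are on the classpath and credentials come from environment variables (the namespace and variable names are illustrative):
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;
import software.amazon.awssdk.services.cloudwatch.CloudWatchClient;
import software.amazon.awssdk.services.cloudwatch.model.MetricDatum;
import software.amazon.awssdk.services.cloudwatch.model.PutMetricDataRequest;
import software.amazon.awssdk.services.cloudwatch.model.StandardUnit;

public class PreparedStatementMetricsCollector {
    public static void main(String[] args) throws Exception {
        String url = System.getenv("DB_URL");       // e.g. jdbc:mysql://host:3306/performance_schema
        String user = System.getenv("DB_USER");
        String pass = System.getenv("DB_PASSWORD");

        double current;
        double max;
        // Pull the global prepared statement count and the configured limit
        try (Connection conn = DriverManager.getConnection(url, user, pass);
             Statement stmt = conn.createStatement();
             ResultSet rs = stmt.executeQuery(
                 "SELECT @@max_prepared_stmt_count AS max_allowed, " +
                 "(SELECT VARIABLE_VALUE FROM performance_schema.global_status " +
                 " WHERE VARIABLE_NAME = 'Prepared_stmt_count') AS current_count")) {
            rs.next();
            max = rs.getDouble("max_allowed");
            current = rs.getDouble("current_count");
        }

        // Publish both the raw count and the percentage of the limit in use
        try (CloudWatchClient cw = CloudWatchClient.create()) {
            cw.putMetricData(PutMetricDataRequest.builder()
                .namespace("Custom/MySQL")
                .metricData(
                    MetricDatum.builder()
                        .metricName("TotalPreparedStatements")
                        .value(current)
                        .unit(StandardUnit.COUNT)
                        .build(),
                    MetricDatum.builder()
                        .metricName("PreparedStatementLimitUsage")
                        .value(100.0 * current / max)
                        .unit(StandardUnit.PERCENT)
                        .build())
                .build());
        }
    }
}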
CloudWatch Alarms
We configured tiered alerting:
Warning (Slack notification):
- Prepared statement usage exceeds 60%
- Zombie connections exceed 50
- Any single connection has more than 100 statements
Critical (PagerDuty):
- Prepared statement usage exceeds 80%
- Total connections exceed 80% of max
- Database connections growing faster than 10/minute for 5 minutes
CloudWatch Dashboard
Our dashboard shows:
- Connection count over time with warning/critical thresholds
- Prepared statement count with annotations at limit percentages
- Zombie connection count
- Stacked chart of connections by state
- Prepared statements per connection distribution
This gives on-call engineers immediate visibility into database health without needing to run diagnostic queries.
Automated Response
For additional safety, we deployed a Lambda function that runs every 5 minutes:
- Checks for connections with excessive prepared statements (>200)
- Checks for connections idle more than 15 minutes
- Terminates connections that meet either criterion
- Sends alert with details to Slack
This automated killer is a last line of defense. With proper application configuration, it should rarely trigger. But it's there as a safety net.
Scaling Beyond the Fix: Read Replicas and RDS Proxy
Once we stabilized the system, we looked at architectural improvements to handle future growth. Two AWS features became essential parts of our strategy: Read Replicas and RDS Proxy.
Read Replicas: Distributing the Load
Our analysis revealed that 70% of our database traffic was read queries—reports, dashboards, search results. All of this was hitting the primary instance, competing for the same connection pool.
We deployed two RDS Read Replicas and restructured our application to route read traffic appropriately:
@Configuration
public class DataSourceConfig {
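    // createHikariDataSource(url, poolSize) is a small helper (not shown) that builds a
    // HikariDataSource with the JDBC properties from Fix #1 and the pool settings from Fix #2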
@Bean
@Primary
public DataSource writeDataSource() {
return createHikariDataSource(
"jdbc:mysql://primary.cluster-xxx.rds.amazonaws.com:3306/mydb",
10 // Smaller pool for writes
);
}
@Bean
public DataSource readDataSource() {
return createHikariDataSource(
"jdbc:mysql://replica.cluster-xxx.rds.amazonaws.com:3306/mydb",
20 // Larger pool for reads
);
}
}
The benefits were immediate:
- Primary instance connections dropped by 60% — Only write operations and transactions requiring read-your-writes consistency hit the primary
- Each replica has its own max_prepared_stmt_count — Prepared statements don't replicate, so each instance has independent limits
- Horizontal read scaling — Adding more replicas is trivial when read traffic grows
- Failure isolation — A misbehaving read query can't exhaust connections needed for writes
For Spring applications, we used @Transactional(readOnly = true) annotations to automatically route to replicas:
@Service
public class ReportService {
@Transactional(readOnly = true) // Routes to read replica
public List<SalesReport> generateMonthlyReport(YearMonth month) {
return reportRepository.findByMonth(month);
}
@Transactional // Routes to primary
public void saveReport(SalesReport report) {
reportRepository.save(report);
}
}
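Spring doesn't route on the readOnly flag by itself. The usual way to wire this up (and roughly what we did) is an AbstractRoutingDataSource keyed on the current transaction's read-only flag, wrapped in a LazyConnectionDataSourceProxy so the physical connection is only fetched after @Transactional has set that flag. A minimal sketch using the two data sources defined above:
import java.util.HashMap;
import java.util.Map;
import javax.sql.DataSource;
import org.springframework.jdbc.datasource.LazyConnectionDataSourceProxy;
import org.springframework.jdbc.datasource.lookup.AbstractRoutingDataSource;
import org.springframework.transaction.support.TransactionSynchronizationManager;

public class ReadWriteRoutingDataSource extends AbstractRoutingDataSource {

    @Override
    protected Object determineCurrentLookupKey() {
        // "read" inside @Transactional(readOnly = true), otherwise "write"
        return TransactionSynchronizationManager.isCurrentTransactionReadOnly()
                ? "read" : "write";
    }

    // Wires the router from the write/read data sources defined in DataSourceConfig
    public static DataSource build(DataSource writeDataSource, DataSource readDataSource) {
        ReadWriteRoutingDataSource router = new ReadWriteRoutingDataSource();
        Map<Object, Object> targets = new HashMap<>();
        targets.put("write", writeDataSource);
        targets.put("read", readDataSource);
        router.setTargetDataSources(targets);
        router.setDefaultTargetDataSource(writeDataSource);
        router.afterPropertiesSet();
        return new LazyConnectionDataSourceProxy(router);
    }
}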
RDS Proxy: Connection Pooling at the Infrastructure Level
Even with application-level pooling via HikariCP, we faced challenges:
- 30+ microservices each maintaining their own connection pools
- Serverless functions (Lambda) creating connections on every invocation
- Connection storms during deployments when all pods restart simultaneously
RDS Proxy sits between your applications and the database, providing a managed connection pool at the infrastructure level:
┌─────────────────────────────────────────────────────────────┐
│ Without RDS Proxy │
├─────────────────────────────────────────────────────────────┤
│ Service A (20 conns) ─┐ │
│ Service B (20 conns) ─┼──→ MySQL (max 500 connections) │
│ Service C (20 conns) ─┤ │
│ ... 30 services ... │ │
│ Lambda (100s of conns)┘ │
│ │
│ Total: 700+ connections competing for 500 slots │
└─────────────────────────────────────────────────────────────┘
┌─────────────────────────────────────────────────────────────┐
│ With RDS Proxy │
├─────────────────────────────────────────────────────────────┤
│ Service A ─┐ │
│ Service B ─┼──→ RDS Proxy ──→ MySQL │
│ Service C ─┤ (manages (sees only ~50 │
│ ... │ pooling) connections) │
│ Lambda ─┘ │
│ │
│ Applications open 1000s of connections to Proxy │
│ Proxy maintains ~50 real connections to MySQL │
└─────────────────────────────────────────────────────────────┘
Key benefits we observed:
- Connection multiplexing — RDS Proxy maintains a small pool of connections to MySQL and multiplexes thousands of application connections through them. Our database went from seeing 500+ connections to under 100.
- Graceful handling of connection storms — During deployments, new pods request connections simultaneously. Proxy queues these requests instead of overwhelming the database.
- Lambda compatibility — Serverless functions no longer create and destroy connections on each invocation. Proxy reuses connections across invocations.
- Prepared statement handling — RDS Proxy uses transaction-level connection borrowing by default, which means prepared statements work correctly without the session-level issues that plague other poolers.
- Failover handling — During RDS failover events, Proxy maintains application connections and transparently reconnects to the new primary.
Configuration considerations:
# Terraform example for RDS Proxy
resource "aws_db_proxy" "mysql_proxy" {
name = "mysql-proxy"
engine_family = "MYSQL"
idle_client_timeout = 1800 # 30 minutes
require_tls = true
# role_arn is required: the IAM role the proxy uses to read the secret (role name below is illustrative)
role_arn = aws_iam_role.rds_proxy_role.arn
auth {
auth_scheme = "SECRETS"
secret_arn = aws_secretsmanager_secret.db_credentials.arn
}
vpc_subnet_ids = var.private_subnet_ids
vpc_security_group_ids = [aws_security_group.proxy_sg.id]
}
resource "aws_db_proxy_default_target_group" "mysql_proxy" {
db_proxy_name = aws_db_proxy.mysql_proxy.name
connection_pool_config {
max_connections_percent = 50 # Use 50% of max_connections
max_idle_connections_percent = 10
connection_borrow_timeout = 120 # seconds
}
}
The prepared statement caveat with RDS Proxy:
RDS Proxy works in "transaction" mode by default, meaning it can reassign connections between transactions. This is similar to PgBouncer's transaction mode. Server-side prepared statements created in one transaction might not be available in the next if you get a different backend connection.
However, with our JDBC configuration using cachePrepStmts=true, the driver handles this gracefully—it re-prepares statements when needed because it tracks the SQL-to-handle mapping client-side.
The Combined Architecture
Our final architecture uses both strategies together:
┌──────────────────┐
│ Application │
└────────┬─────────┘
│
┌──────────────┴──────────────┐
│ │
▼ ▼
┌───────────────┐ ┌───────────────┐
│ RDS Proxy │ │ RDS Proxy │
│ (writes) │ │ (reads) │
└───────┬───────┘ └───────┬───────┘
│ │
▼ ▼
┌───────────────┐ ┌───────────────┐
│ Primary │───────────▶│ Replica 1 │
│ (writes) │ repl │ (reads) │
└───────────────┘ ├───────────────┤
│ Replica 2 │
│ (reads) │
└───────────────┘
This architecture provides:
- Write isolation — Write traffic has dedicated connections through its own proxy
- Read scalability — Reads distribute across replicas with their own proxy
- Connection efficiency — Proxies multiplex connections, reducing database load
- Prepared statement isolation — Each database instance has independent limits
Results
After implementing all fixes and monitoring:
| Metric | During Crisis | With Band-Aid | After Full Fix |
|---|---|---|---|
| Prepared stmt count | 15,000+ | ~5,000 | ~400 |
| Max stmts per connection | 2,000+ | ~100 | ~20 |
| Zombie connections | 200+ | ~20 | ~0 |
| Connection lifetime | 30 min | 30 min | 5 min |
| Statement limit usage | 95%+ | 30% | 3% |
| Incidents per month | Multiple | Rare | Zero |
The system has been stable for six months now. The monitoring has caught a few regressions during development—new services with missing cache configuration—and we fixed them before they reached production.
Key Takeaways
1. Understand client-side vs server-side prepared statements. They behave completely differently. Server-side statements can accumulate indefinitely; client-side statements don't persist at all.
2. cachePrepStmts=true is essential. Without this, the driver creates new server-side statements for every prepareStatement() call, even for identical queries on the same connection.
3. Dynamic queries defeat caching. Every unique SQL string is a separate prepared statement. Normalize your queries using parameter padding, inclusive WHERE clauses, or query templating.
4. Shorter connection lifetimes are safer. Five minutes is enough. The connection creation overhead is tiny compared to the cost of statement accumulation.
5. Monitor what matters. Standard RDS metrics don't show prepared statement counts. Build custom metrics that give you visibility into this failure mode.
6. Band-aids buy time. Automated zombie killers kept us alive while we implemented proper fixes. Don't be afraid to use aggressive temporary measures during a crisis.
7. The global limit is shared. max_prepared_stmt_count applies across all connections. One misbehaving service can exhaust it for everyone.
Quick Reference
When you're debugging a similar issue, start with these queries:
Check global statement usage:
SELECT @@max_prepared_stmt_count AS max_limit,
(SELECT VARIABLE_VALUE FROM performance_schema.global_status
WHERE VARIABLE_NAME = 'Prepared_stmt_count') AS current_count;
Find connections with most statements:
SELECT t.PROCESSLIST_ID, t.PROCESSLIST_USER,
SUBSTRING_INDEX(t.PROCESSLIST_HOST, ':', 1) AS ip,
t.PROCESSLIST_TIME AS idle_sec,
COUNT(ps.STATEMENT_ID) AS stmt_count
FROM performance_schema.threads t
LEFT JOIN performance_schema.prepared_statements_instances ps
ON t.THREAD_ID = ps.OWNER_THREAD_ID
WHERE t.TYPE = 'FOREGROUND'
GROUP BY t.THREAD_ID
HAVING stmt_count > 10
ORDER BY stmt_count DESC
LIMIT 20;
Identify problem services by IP:
SELECT SUBSTRING_INDEX(PROCESSLIST_HOST, ':', 1) AS client_ip,
COUNT(*) AS connections,
SUM(CASE WHEN PROCESSLIST_COMMAND = 'Sleep'
AND PROCESSLIST_TIME > 300 THEN 1 ELSE 0 END) AS zombies
FROM performance_schema.threads
WHERE TYPE = 'FOREGROUND' AND PROCESSLIST_USER IS NOT NULL
GROUP BY client_ip
ORDER BY connections DESC;
The Journey Matters
This wasn't a clean, linear debugging process. Real incidents rarely are. We started at 2 AM with a dying database, implemented ugly band-aids to survive the night, and spent weeks understanding the actual problem.
The key was maintaining stability while investigating. The zombie killer wasn't elegant, but it kept us running. The metrics didn't exist at first—we built them because we needed visibility. Each fix addressed one piece of the puzzle.
If you're facing a similar situation, remember: it's okay to implement temporary workarounds. Just make sure you also invest in understanding and fixing the root cause. The temporary solutions buy you time; the permanent fixes prevent recurrence.
Understanding the full stack—how your ORM generates queries, how the JDBC driver handles prepared statements, how connection pools manage lifecycle, and how MySQL tracks everything internally—is what transforms a mystery into a solvable problem.
Dealing with database scaling challenges? Connect on Twitter or LinkedIn to discuss solutions.
