
The Silent Killer: How Prepared Statement Leaks Crash Production Databases


It was 2 AM when the alerts started firing. Database connections exhausted. Services timing out. Users unable to log in. The dashboard showed our MySQL instance drowning—800+ connections when we'd configured for 500, and climbing.

The instinct was to increase max_connections. We bumped it to 1000. Relief for a few hours. Then the same cascade started again. That's when we knew: something was fundamentally broken, and throwing resources at it wasn't the answer.

What followed was a three-week journey through the depths of MySQL internals, connection pooling mechanics, and prepared statement behavior. This is that story—the debugging process, the band-aids that kept us alive, and the industry-standard fixes that finally solved the problem.


The Architecture That Worked Until It Didn't

Our production setup was fairly standard for a growing platform: a few dozen Spring Boot microservices running on Kubernetes, each with its own HikariCP connection pool, plus a handful of Lambda functions, all talking to a single MySQL instance on AWS RDS configured for 500 connections.

This architecture had served us well for two years. Load tests passed. Staging environments behaved. But production has a way of exposing edge cases that controlled environments never reveal.

The pattern was always the same: traffic would spike, connections would climb, and suddenly we'd hit a wall. Not a gradual degradation—a cliff. One moment things were fine, the next moment nothing worked.


Understanding the Enemy: How MySQL Prepared Statements Actually Work

Before I explain what went wrong, you need to understand how MySQL handles prepared statements. This knowledge was crucial to diagnosing our problem, and it's something most developers never think about.

The Two Flavors of Prepared Statements

MySQL JDBC drivers support two completely different modes of prepared statement handling, and confusing them is a common source of production issues.

Client-Side Prepared Statements (The Default)

When you use MySQL Connector/J with default settings, prepared statements are emulated entirely on the client side. Here's what actually happens:

Your application creates a PreparedStatement with placeholders. When you set parameters and execute, the JDBC driver substitutes the values directly into the SQL string and sends a complete query to MySQL. The server receives SELECT * FROM users WHERE id = 42, not a parameterized query.

This means the server never sees a parameterized query: nothing is allocated server-side, nothing accumulates, and the global prepared statement limit is never touched. The trade-off is that MySQL parses every query from scratch on every execution.

Server-Side Prepared Statements (The Performance Optimization)

When you enable useServerPrepStmts=true in your connection string, behavior changes dramatically. Now the driver uses MySQL's binary protocol for true prepared statements.

When a statement is prepared, the driver sends a COM_STMT_PREPARE command with your SQL template. MySQL parses the query, builds an execution plan, and stores everything in memory—tied to your specific connection. It returns a statement handle (just a number) that the driver uses for every subsequent execution of that statement.

Subsequent executions send COM_STMT_EXECUTE with just the statement handle and parameter values. MySQL skips parsing entirely and uses the cached plan. This is faster, especially for complex queries executed repeatedly.
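
To make the round trip concrete, here is a minimal JDBC sketch (the dataSource and userIds variables are placeholders): reusing a single PreparedStatement object means the SQL template is prepared on the server once, and later executions send only the handle and the bound parameter values.

// Minimal sketch: with useServerPrepStmts=true, preparing this statement sends
// COM_STMT_PREPARE once; each loop iteration then sends COM_STMT_EXECUTE with
// the statement handle and the bound id.
try (Connection conn = dataSource.getConnection();
     PreparedStatement ps = conn.prepareStatement("SELECT * FROM users WHERE id = ?")) {
    for (long id : userIds) {
        ps.setLong(1, id);
        try (ResultSet rs = ps.executeQuery()) {
            while (rs.next()) {
                // process the row
            }
        }
    }
}  // closing the statement sends COM_STMT_CLOSE, deallocating it on the server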

But here's the critical detail: that prepared statement lives on the server until the connection closes or you explicitly deallocate it. And MySQL has a global limit on how many can exist across all connections.

The Global Limit Nobody Thinks About

MySQL enforces a server-wide limit on prepared statements through the max_prepared_stmt_count variable. The default is 16,382. That might sound like a lot, but the limit is shared across every connection: with 500 connections, an average of just 33 statements per connection is enough to exhaust it.

When you hit this limit, MySQL returns:

ERROR 1461 (42000): Can't create more than max_prepared_stmt_count statements

This error doesn't tell you which connection is the problem or what queries are accumulating. It just fails—and often it fails for your healthiest services because the misbehaving ones already consumed the quota.

Memory Implications

Each server-side prepared statement consumes memory on MySQL for the parsed statement and its cached execution plan. With 500 connections each holding 100 prepared statements, you're looking at 50,000 statements' worth of parse trees and plans held in server memory.

This memory doesn't show up in your application metrics. It's invisible until it causes problems.


Phase 1: Discovery — Finding the Smoking Gun

When our incident started, we had no visibility into prepared statement counts. We could see connections climbing in CloudWatch, but we didn't know why connections weren't being released or reused properly.

The performance_schema database became our forensic toolkit. It's MySQL's built-in instrumentation layer, and it contains tables that track everything from query execution times to—crucially—prepared statement instances.

First Question: What Are All These Connections Doing?

We started with the basics. MySQL tracks every client connection as a "thread," and the threads table tells you exactly what each one is doing:

SELECT
    PROCESSLIST_ID AS connection_id,
    PROCESSLIST_USER AS user,
    SUBSTRING_INDEX(PROCESSLIST_HOST, ':', 1) AS client_ip,
    PROCESSLIST_COMMAND AS state,
    PROCESSLIST_TIME AS idle_seconds,
    LEFT(PROCESSLIST_INFO, 50) AS current_query
FROM performance_schema.threads
WHERE TYPE = 'FOREGROUND'
  AND PROCESSLIST_USER IS NOT NULL
ORDER BY PROCESSLIST_TIME DESC
LIMIT 30;

The results were immediately suspicious. Hundreds of connections showed Sleep as their state, with idle times measured in hours. These connections had finished their work long ago but were still sitting there, holding resources.

In a healthy connection pool, connections return to the pool after each transaction and get reused quickly. Seeing connections idle for hours meant either our pool was misconfigured, connections were leaking, or something was preventing proper reuse.

The Real Question: What Are These Connections Holding?

Next, we needed to see what was accumulating on these zombie connections. The prepared_statements_instances table tracks every prepared statement currently allocated on the server:

SELECT
    t.PROCESSLIST_ID AS connection_id,
    t.PROCESSLIST_USER AS user,
    SUBSTRING_INDEX(t.PROCESSLIST_HOST, ':', 1) AS client_ip,
    t.PROCESSLIST_TIME AS idle_seconds,
    COUNT(ps.STATEMENT_ID) AS prepared_statements,
    ROUND(SUM(ps.SUM_TIMER_EXECUTE)/1000000000, 2) AS total_exec_time_ms
FROM performance_schema.threads t
LEFT JOIN performance_schema.prepared_statements_instances ps
    ON t.THREAD_ID = ps.OWNER_THREAD_ID
WHERE t.TYPE = 'FOREGROUND'
  AND t.PROCESSLIST_USER IS NOT NULL
GROUP BY t.THREAD_ID, t.PROCESSLIST_ID, t.PROCESSLIST_USER,
         t.PROCESSLIST_HOST, t.PROCESSLIST_TIME
HAVING prepared_statements > 50
ORDER BY prepared_statements DESC;

This query joined thread information with prepared statement counts. The results were staggering: some connections had over 2,000 prepared statements each. Connections that had been idle for hours were holding onto thousands of server-side statements.

We checked the global limit usage:

SELECT
    @@max_prepared_stmt_count AS max_allowed,
    (SELECT VARIABLE_VALUE FROM performance_schema.global_status
     WHERE VARIABLE_NAME = 'Prepared_stmt_count') AS current_count;

We were at 15,847 out of 16,382. The system was 97% exhausted, and any new service trying to prepare a statement would fail.

Identifying the Culprits

We needed to know which services were causing the problem. Since each microservice connected from a different IP (or set of IPs in Kubernetes), we could aggregate by client address:

SELECT
    SUBSTRING_INDEX(t.PROCESSLIST_HOST, ':', 1) AS client_ip,
    COUNT(DISTINCT t.PROCESSLIST_ID) AS connections,
    SUM(CASE WHEN t.PROCESSLIST_COMMAND = 'Sleep'
             AND t.PROCESSLIST_TIME > 300 THEN 1 ELSE 0 END) AS zombies,
    SUM(prep.stmt_count) AS total_prepared_statements,
    ROUND(AVG(prep.stmt_count), 1) AS avg_per_connection
FROM performance_schema.threads t
LEFT JOIN (
    SELECT OWNER_THREAD_ID, COUNT(*) AS stmt_count
    FROM performance_schema.prepared_statements_instances
    GROUP BY OWNER_THREAD_ID
) prep ON t.THREAD_ID = prep.OWNER_THREAD_ID
WHERE t.TYPE = 'FOREGROUND'
  AND t.PROCESSLIST_USER IS NOT NULL
GROUP BY client_ip
ORDER BY total_prepared_statements DESC;

Three services stood out with dramatically higher statement counts. These became our targets for investigation—but first, we needed to stop the bleeding.


Phase 2: The Band-Aid — Killing Zombie Connections

With production unstable and users affected, we couldn't wait for a proper fix. We needed immediate relief while we investigated the root cause.

The solution was simple but effective: automatically kill connections that had been idle too long. When a connection dies, MySQL deallocates all its prepared statements immediately. It's aggressive—applications will see connection errors—but it's better than total system failure.

Why MySQL Events Instead of Application Changes

We could have tried to fix the application immediately, but that approach had problems: we didn't yet know which services or code paths were leaking, the fix would touch many services, and every change meant a build-and-deploy cycle while production was still unstable.

MySQL's event scheduler let us implement a server-side solution that worked immediately, regardless of application behavior.

Building the Zombie Killer

First, we created audit tables to track what we were killing. Visibility into automated actions is crucial—you need to know what's happening and prove the fix is working:

CREATE DATABASE IF NOT EXISTS ops_tools;

CREATE TABLE ops_tools.connection_kill_log (
    id BIGINT AUTO_INCREMENT PRIMARY KEY,
    killed_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP,
    connection_id BIGINT NOT NULL,
    user VARCHAR(32),
    host VARCHAR(255),
    idle_seconds INT,
    prepared_stmt_count INT,
    kill_reason VARCHAR(100),
    INDEX idx_killed_at (killed_at),
    INDEX idx_host (host(50))
) ENGINE=InnoDB;

CREATE TABLE ops_tools.zombie_killer_runs (
    id INT AUTO_INCREMENT PRIMARY KEY,
    run_time TIMESTAMP DEFAULT CURRENT_TIMESTAMP,
    connections_killed INT,
    statements_freed INT,
    INDEX idx_run_time (run_time)
) ENGINE=InnoDB;

The stored procedure does the heavy lifting. It scans for idle connections, logs what it finds, and terminates them. (On RDS, where the master user lacks SUPER, the KILL CONNECTION call may need to be replaced with CALL mysql.rds_kill().)

DELIMITER //

CREATE PROCEDURE ops_tools.kill_zombie_connections(
    IN idle_threshold_seconds INT
)
BEGIN
    DECLARE done INT DEFAULT FALSE;
    DECLARE v_conn_id BIGINT;
    DECLARE v_user VARCHAR(32);
    DECLARE v_host VARCHAR(255);
    DECLARE v_idle INT;
    DECLARE v_prep_count INT;
    DECLARE v_killed INT DEFAULT 0;
    DECLARE v_freed INT DEFAULT 0;

    DECLARE zombie_cursor CURSOR FOR
        SELECT
            t.PROCESSLIST_ID,
            t.PROCESSLIST_USER,
            t.PROCESSLIST_HOST,
            t.PROCESSLIST_TIME,
            COALESCE(p.cnt, 0)
        FROM performance_schema.threads t
        LEFT JOIN (
            SELECT OWNER_THREAD_ID, COUNT(*) AS cnt
            FROM performance_schema.prepared_statements_instances
            GROUP BY OWNER_THREAD_ID
        ) p ON t.THREAD_ID = p.OWNER_THREAD_ID
        WHERE t.TYPE = 'FOREGROUND'
          AND t.PROCESSLIST_COMMAND = 'Sleep'
          AND t.PROCESSLIST_TIME > idle_threshold_seconds
          AND t.PROCESSLIST_USER NOT IN ('rdsadmin', 'mysql.sys', 'root')
          AND t.PROCESSLIST_ID != CONNECTION_ID();

    DECLARE CONTINUE HANDLER FOR NOT FOUND SET done = TRUE;
    DECLARE CONTINUE HANDLER FOR SQLEXCEPTION BEGIN END;

    OPEN zombie_cursor;

    kill_loop: LOOP
        FETCH zombie_cursor INTO v_conn_id, v_user, v_host, v_idle, v_prep_count;
        IF done THEN LEAVE kill_loop; END IF;

        INSERT INTO ops_tools.connection_kill_log
            (connection_id, user, host, idle_seconds, prepared_stmt_count, kill_reason)
        VALUES
            (v_conn_id, v_user, v_host, v_idle, v_prep_count,
             CONCAT('Idle ', v_idle, 's with ', v_prep_count, ' prepared stmts'));

        KILL CONNECTION v_conn_id;
        SET v_killed = v_killed + 1;
        SET v_freed = v_freed + v_prep_count;
    END LOOP;

    CLOSE zombie_cursor;

    IF v_killed > 0 THEN
        INSERT INTO ops_tools.zombie_killer_runs (connections_killed, statements_freed)
        VALUES (v_killed, v_freed);
    END IF;
END //

DELIMITER ;

Finally, we scheduled it to run every 5 minutes (note that on RDS, the event scheduler itself is enabled through the DB parameter group rather than SET GLOBAL):

SET GLOBAL event_scheduler = ON;

CREATE EVENT ops_tools.zombie_killer_event
ON SCHEDULE EVERY 5 MINUTE
STARTS CURRENT_TIMESTAMP
ENABLE
DO CALL ops_tools.kill_zombie_connections(600);

The 600 parameter means we kill connections idle for more than 10 minutes. This is aggressive but necessary during a crisis. We could tune it later.

Monitoring the Fix

Within the first hour, we could see the impact:

SELECT
    DATE(killed_at) AS date,
    COUNT(*) AS connections_killed,
    SUM(prepared_stmt_count) AS statements_freed,
    ROUND(AVG(idle_seconds)/60, 1) AS avg_idle_minutes
FROM ops_tools.connection_kill_log
GROUP BY DATE(killed_at)
ORDER BY date DESC;

The zombie killer was terminating 50-100 connections per hour and freeing thousands of prepared statements. More importantly, the global prepared statement count dropped from 97% to around 30%, and the system stabilized.

This band-aid kept us alive for three weeks while we investigated the root cause and planned proper fixes.


Phase 3: Root Cause Analysis — Why Were Statements Accumulating?

With immediate pressure relieved, we could dig into the actual problem. The investigation revealed multiple contributing factors, each making the others worse.

Root Cause #1: Missing Client-Side Statement Cache

Our JDBC connection strings had useServerPrepStmts=true to enable the performance benefits of server-side prepared statements. But we were missing a critical companion setting: cachePrepStmts=true.

Here's what was happening without client-side caching:

  1. Application gets a connection from the pool
  2. Application prepares: SELECT * FROM users WHERE id = ?
  3. Driver sends COM_STMT_PREPARE to MySQL
  4. MySQL creates prepared statement #1, returns handle
  5. Application executes the statement
  6. Connection returns to the pool
  7. Later, same connection runs the same query
  8. Driver sends COM_STMT_PREPARE again
  9. MySQL creates prepared statement #2 (duplicate!)
  10. This repeats forever

The MySQL Connector/J driver has a built-in LRU cache that remembers which statements have already been prepared on each connection. With cachePrepStmts=true, step 8 becomes "driver finds cached handle, reuses it" instead of preparing again.

Without this cache, every time your code called prepareStatement(), a new server-side statement was created—even for identical queries on the same connection.

Root Cause #2: Dynamic Query Generation

Even with proper caching, we had another problem: some services were generating queries dynamically in ways that defeated caching entirely.

The worst offender was IN clause construction:

// This code creates a NEW prepared statement for every unique list size
public List<Order> findByIds(List<Long> ids) {
    String placeholders = ids.stream()
        .map(id -> "?")
        .collect(Collectors.joining(","));

    String sql = "SELECT * FROM orders WHERE id IN (" + placeholders + ")";
    return jdbcTemplate.query(sql, ids.toArray(), orderMapper);
}

When called with 5 IDs, this generates: SELECT * FROM orders WHERE id IN (?,?,?,?,?)

When called with 3 IDs: SELECT * FROM orders WHERE id IN (?,?,?)

When called with 100 IDs: SELECT * FROM orders WHERE id IN (?,?,?...100 times...)

Each unique SQL string is a different prepared statement. If your application calls this method with varying list sizes—which is normal—you'll create dozens or hundreds of distinct statements.

Similarly, some code was concatenating values directly into SQL:

// NEVER do this—each unique value creates a new statement
String sql = "SELECT * FROM products WHERE category = '" + category + "'";

This not only defeats prepared statement reuse but also opens SQL injection vulnerabilities.

Root Cause #3: ORM-Generated Dynamic Queries

Hibernate's Criteria API and JPA specifications can generate dynamic SQL based on which parameters are non-null:

// Each combination of non-null parameters generates different SQL
CriteriaBuilder cb = entityManager.getCriteriaBuilder();
CriteriaQuery<User> query = cb.createQuery(User.class);
Root<User> root = query.from(User.class);

List<Predicate> predicates = new ArrayList<>();
if (name != null) predicates.add(cb.equal(root.get("name"), name));
if (email != null) predicates.add(cb.equal(root.get("email"), email));
if (status != null) predicates.add(cb.equal(root.get("status"), status));

query.where(predicates.toArray(new Predicate[0]));

With three optional parameters, you have 8 possible SQL combinations (2³). With ten parameters, you have 1,024 combinations. Each becomes a separate prepared statement.

Root Cause #4: Connection Lifetime Too Long

Our HikariCP configuration allowed connections to live for 30 minutes:

hikari:
  max-lifetime: 1800000  # 30 minutes

A connection living for 30 minutes handles thousands of requests. Even with a small rate of statement leakage, the accumulation becomes significant. Reducing connection lifetime limits how much can accumulate before the connection is recycled.


Phase 4: The Proper Fix — Industry-Standard Configuration

With root causes identified, we implemented proper fixes. These are the configurations and patterns that eliminated our prepared statement problems permanently.

Fix #1: Complete JDBC Connection String

The MySQL Connector/J driver has several parameters that work together for proper prepared statement handling:

jdbc:mysql://hostname:3306/database?
  useServerPrepStmts=true&
  cachePrepStmts=true&
  prepStmtCacheSize=250&
  prepStmtCacheSqlLimit=2048&
  cacheResultSetMetadata=true&
  cacheServerConfiguration=true&
  useLocalSessionState=true

Let me explain each parameter:

useServerPrepStmts=true — Enables true server-side prepared statements using MySQL's binary protocol. Provides performance benefits for repeated queries.

cachePrepStmts=true — This is the critical one we were missing. Enables the driver's client-side LRU cache that maps SQL strings to server-side statement handles. Without this, every prepareStatement() call creates a new server-side statement.

prepStmtCacheSize=250 — Number of prepared statements to cache per connection. The default (25) is too low for most applications. Set this to match your application's unique query count. Too low means cache evictions and re-preparation; too high wastes memory.

prepStmtCacheSqlLimit=2048 — Maximum SQL string length that will be cached. The default (256) is far too short—many real queries exceed 256 characters. Queries longer than this limit are prepared fresh every time.

cacheResultSetMetadata=true — Caches result set metadata (column names, types) to avoid re-fetching it on each execution.

cacheServerConfiguration=true — Caches server variables and configuration to reduce round-trips.

useLocalSessionState=true — Tracks session state locally to avoid unnecessary SET commands.
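
If you build the pool in code rather than relying on the URL, the same driver settings can be passed as data source properties. A minimal HikariCP sketch (the URL and credentials are placeholders):

HikariConfig config = new HikariConfig();
config.setJdbcUrl("jdbc:mysql://hostname:3306/database");
config.setUsername("app_user");      // placeholder credentials
config.setPassword("app_password");

// Same prepared statement settings as the connection string above
config.addDataSourceProperty("useServerPrepStmts", "true");
config.addDataSourceProperty("cachePrepStmts", "true");
config.addDataSourceProperty("prepStmtCacheSize", "250");
config.addDataSourceProperty("prepStmtCacheSqlLimit", "2048");
config.addDataSourceProperty("cacheResultSetMetadata", "true");
config.addDataSourceProperty("cacheServerConfiguration", "true");
config.addDataSourceProperty("useLocalSessionState", "true");

HikariDataSource dataSource = new HikariDataSource(config);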

Fix #2: Proper HikariCP Configuration

spring:
  datasource:
    hikari:
      pool-name: "MyService-HikariPool"

      # Pool sizing - smaller is often better
      minimum-idle: 5
      maximum-pool-size: 20

      # Connection lifecycle - shorter prevents accumulation
      max-lifetime: 300000      # 5 minutes (was 30)
      idle-timeout: 60000       # 1 minute

      # Validation
      connection-test-query: "SELECT 1"
      validation-timeout: 5000

      # Leak detection - catches application bugs
      leak-detection-threshold: 30000  # 30 seconds

      # Monitoring
      register-mbeans: true

The key insight here is that shorter connection lifetimes are safer. Yes, creating new connections has overhead. But that overhead is tiny compared to the cost of accumulated statements bringing down your database.

The leak-detection-threshold is invaluable during development. If your application holds a connection for more than 30 seconds without closing it, HikariCP logs a stack trace showing exactly where the leak occurs.
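
For context, this is the kind of bug it catches. A hypothetical example: the first fragment borrows a connection and can return without closing it, while the try-with-resources version returns everything to the pool on every code path.

// Leaky: if an early return or exception happens before close(), the connection
// never goes back to the pool, and HikariCP logs a leak warning after 30 seconds
Connection conn = dataSource.getConnection();
PreparedStatement ps = conn.prepareStatement("SELECT status FROM orders WHERE id = ?");
// ... business logic that might return or throw before anything is closed ...

// Safe: try-with-resources closes the statement and returns the connection
// to the pool no matter how the block exits
try (Connection c = dataSource.getConnection();
     PreparedStatement stmt = c.prepareStatement("SELECT status FROM orders WHERE id = ?")) {
    stmt.setLong(1, orderId);
    try (ResultSet rs = stmt.executeQuery()) {
        // ... read the result ...
    }
}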

Fix #3: Fixing Dynamic Queries

For IN clauses, we implemented padding to maintain a consistent query structure:

private static final int[] IN_CLAUSE_SIZES = {1, 5, 10, 25, 50, 100};

public List<Order> findByIds(List<Long> ids) {
    if (ids.isEmpty()) {
        return Collections.emptyList();
    }

    // Find the smallest bucket that fits our IDs; lists larger than the biggest
    // bucket fall back to their exact size so placeholders and parameters match
    int bucketSize = Arrays.stream(IN_CLAUSE_SIZES)
        .filter(size -> size >= ids.size())
        .findFirst()
        .orElse(ids.size());

    // Pad the list to bucket size by repeating the first ID;
    // duplicates in an IN list don't change the result
    List<Long> paddedIds = new ArrayList<>(ids);
    while (paddedIds.size() < bucketSize) {
        paddedIds.add(ids.get(0));
    }

    String placeholders = String.join(",", Collections.nCopies(bucketSize, "?"));
    String sql = "SELECT * FROM orders WHERE id IN (" + placeholders + ")";

    return jdbcTemplate.query(sql, paddedIds.toArray(), orderMapper);
}

Now, for lists of up to 100 IDs, there are only 6 distinct queries instead of one per list size. Each gets prepared once and reused.

For optional filter parameters, we changed from dynamic predicates to inclusive SQL:

// Instead of building dynamic WHERE clauses:
String jpql = """
    SELECT u FROM User u
    WHERE (:name IS NULL OR u.name = :name)
      AND (:email IS NULL OR u.email = :email)
      AND (:status IS NULL OR u.status = :status)
    """;

This single query handles all parameter combinations. MySQL can generally short-circuit the :param IS NULL checks once the parameter values are bound, but it's worth checking the execution plan for your hottest queries to confirm the indexes you expect are still used.
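
Usage stays the same for every caller: all three parameters are always bound, with null for the ones not supplied, so the SQL string (and therefore the prepared statement) never changes. A sketch assuming a standard JPA EntityManager (depending on your Hibernate version, null parameters may need explicit typing):

List<User> users = entityManager.createQuery(jpql, User.class)
    .setParameter("name", name)       // may be null
    .setParameter("email", email)     // may be null
    .setParameter("status", status)   // may be null
    .getResultList();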

Fix #4: Hibernate-Specific Settings

spring:
  jpa:
    properties:
      hibernate:
        query.plan_cache_max_size: 2048
        query.in_clause_parameter_padding: true

The in_clause_parameter_padding setting tells Hibernate to automatically pad IN clause parameters to power-of-2 sizes (1, 2, 4, 8, 16...). This dramatically reduces unique query counts when using JPA Criteria or JPQL with collection parameters.

Fix #5: MySQL Server Configuration

For additional safety margin, we adjusted MySQL settings:

-- Increase the global limit (calculate based on your connection count)
SET GLOBAL max_prepared_stmt_count = 50000;

-- Reduce wait_timeout as a safety net
SET GLOBAL wait_timeout = 300;
SET GLOBAL interactive_timeout = 300;

For RDS, these are configured in the parameter group rather than via SQL.


Phase 5: Building the Monitoring Stack

After fixing the immediate problems, we built monitoring to catch any regression before it became an incident.

Custom CloudWatch Metrics

We created a Lambda function that runs every minute, queries performance_schema, and pushes custom metrics to CloudWatch:

Metrics collected every minute include the global prepared statement count (and its percentage of max_prepared_stmt_count), per-connection statement counts, and the number of long-idle connections.

These metrics give us visibility that standard RDS metrics don't provide. We can see statement accumulation trends before they become critical.
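
The collector itself is short. A condensed sketch of the approach (class name, namespace, and metric name are illustrative; it assumes the AWS SDK for Java v2 CloudWatch client and MySQL Connector/J are packaged with the function, with credentials supplied via environment variables):

public class PreparedStmtMetricsCollector implements RequestHandler<ScheduledEvent, Void> {

    private final CloudWatchClient cloudWatch = CloudWatchClient.create();

    @Override
    public Void handleRequest(ScheduledEvent event, Context context) {
        String sql = "SELECT VARIABLE_VALUE FROM performance_schema.global_status "
                   + "WHERE VARIABLE_NAME = 'Prepared_stmt_count'";

        try (Connection conn = DriverManager.getConnection(
                 System.getenv("DB_URL"), System.getenv("DB_USER"), System.getenv("DB_PASSWORD"));
             Statement stmt = conn.createStatement();
             ResultSet rs = stmt.executeQuery(sql)) {

            rs.next();
            double preparedStmtCount = rs.getDouble(1);

            // Publish as a custom metric; alarms and dashboards read from this namespace
            cloudWatch.putMetricData(PutMetricDataRequest.builder()
                .namespace("Custom/MySQL")
                .metricData(MetricDatum.builder()
                    .metricName("PreparedStmtCount")
                    .value(preparedStmtCount)
                    .unit(StandardUnit.COUNT)
                    .build())
                .build());
        } catch (SQLException e) {
            throw new RuntimeException("Failed to collect prepared statement metrics", e);
        }
        return null;
    }
}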

CloudWatch Alarms

We configured tiered alerting:

Warning (Slack notification):

Critical (PagerDuty):

CloudWatch Dashboard

Our dashboard shows:

This gives on-call engineers immediate visibility into database health without needing to run diagnostic queries.

Automated Response

For additional safety, we deployed a Lambda function that runs every 5 minutes:

  1. Checks for connections with excessive prepared statements (>200)
  2. Checks for connections idle more than 15 minutes
  3. Terminates connections meeting either criteria
  4. Sends alert with details to Slack

This automated killer is a last line of defense. With proper application configuration, it should rarely trigger. But it's there as a safety net.


Scaling Beyond the Fix: Read Replicas and RDS Proxy

Once we stabilized the system, we looked at architectural improvements to handle future growth. Two AWS features became essential parts of our strategy: Read Replicas and RDS Proxy.

Read Replicas: Distributing the Load

Our analysis revealed that 70% of our database traffic was read queries—reports, dashboards, search results. All of this was hitting the primary instance, competing for the same connection pool.

We deployed two RDS Read Replicas and restructured our application to route read traffic appropriately:

@Configuration
public class DataSourceConfig {

    @Bean
    @Primary
    public DataSource writeDataSource() {
        return createHikariDataSource(
            "jdbc:mysql://primary.cluster-xxx.rds.amazonaws.com:3306/mydb",
            10  // Smaller pool for writes
        );
    }

    @Bean
    public DataSource readDataSource() {
        return createHikariDataSource(
            "jdbc:mysql://replica.cluster-xxx.rds.amazonaws.com:3306/mydb",
            20  // Larger pool for reads
        );
    }
}

The benefits were immediate: report and dashboard queries moved off the primary, connection pressure on the primary dropped, and write transactions stopped competing with read-heavy workloads for the same pool.

For Spring applications, we used @Transactional(readOnly = true) annotations to route reads to replicas automatically (the wiring that makes this work is sketched after the example below):

@Service
public class ReportService {

    @Transactional(readOnly = true)  // Routes to read replica
    public List<SalesReport> generateMonthlyReport(YearMonth month) {
        return reportRepository.findByMonth(month);
    }

    @Transactional  // Routes to primary
    public void saveReport(SalesReport report) {
        reportRepository.save(report);
    }
}
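
Spring doesn't route on the readOnly flag by itself; the usual pattern is a routing DataSource keyed on the current transaction's read-only status. A minimal sketch assuming the two data sources above and Spring's AbstractRoutingDataSource (the LazyConnectionDataSourceProxy wrapper matters, because it defers fetching a physical connection until the read-only flag has been set):

public class ReadWriteRoutingDataSource extends AbstractRoutingDataSource {

    @Override
    protected Object determineCurrentLookupKey() {
        // Routes based on the @Transactional(readOnly = ...) flag of the active transaction
        return TransactionSynchronizationManager.isCurrentTransactionReadOnly()
            ? "read" : "write";
    }
}

@Bean
public DataSource routingDataSource(DataSource writeDataSource, DataSource readDataSource) {
    ReadWriteRoutingDataSource routing = new ReadWriteRoutingDataSource();

    Map<Object, Object> targets = new HashMap<>();
    targets.put("write", writeDataSource);
    targets.put("read", readDataSource);
    routing.setTargetDataSources(targets);
    routing.setDefaultTargetDataSource(writeDataSource);

    // Defer connection acquisition until the transaction (and its readOnly flag) is active
    return new LazyConnectionDataSourceProxy(routing);
}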

RDS Proxy: Connection Pooling at the Infrastructure Level

Even with application-level pooling via HikariCP, we faced challenges: every service maintained its own pool, so the total across 30+ services (plus Lambda functions opening connections per invocation) could exceed what MySQL allowed, and deployments triggered connection storms as new pods connected at once.

RDS Proxy sits between your applications and the database, providing a managed connection pool at the infrastructure level:

┌─────────────────────────────────────────────────────────────┐
│                     Without RDS Proxy                        │
├─────────────────────────────────────────────────────────────┤
│  Service A (20 conns) ─┐                                    │
│  Service B (20 conns) ─┼──→ MySQL (max 500 connections)     │
│  Service C (20 conns) ─┤                                    │
│  ... 30 services ...   │                                    │
│  Lambda (100s of conns)┘                                    │
│                                                             │
│  Total: 700+ connections competing for 500 slots            │
└─────────────────────────────────────────────────────────────┘

┌─────────────────────────────────────────────────────────────┐
│                      With RDS Proxy                          │
├─────────────────────────────────────────────────────────────┤
│  Service A ─┐                                               │
│  Service B ─┼──→ RDS Proxy ──→ MySQL                        │
│  Service C ─┤    (manages     (sees only ~50               │
│  ...        │     pooling)     connections)                 │
│  Lambda    ─┘                                               │
│                                                             │
│  Applications open 1000s of connections to Proxy            │
│  Proxy maintains ~50 real connections to MySQL              │
└─────────────────────────────────────────────────────────────┘

Key benefits we observed:

  1. Connection multiplexing — RDS Proxy maintains a small pool of connections to MySQL and multiplexes thousands of application connections through them. Our database went from seeing 500+ connections to under 100.

  2. Graceful handling of connection storms — During deployments, new pods request connections simultaneously. Proxy queues these requests instead of overwhelming the database.

  3. Lambda compatibility — Serverless functions no longer create and destroy connections on each invocation. Proxy reuses connections across invocations.

  4. Prepared statement handling — when a session uses server-side prepared statements, RDS Proxy detects that session state and pins the client to a backend connection, so the statements keep working correctly; the trade-off is reduced multiplexing for those sessions.

  5. Failover handling — During RDS failover events, Proxy maintains application connections and transparently reconnects to the new primary.

Configuration considerations:

# Terraform example for RDS Proxy
resource "aws_db_proxy" "mysql_proxy" {
  name                   = "mysql-proxy"
  engine_family          = "MYSQL"
  idle_client_timeout    = 1800  # 30 minutes
  require_tls            = true

  auth {
    auth_scheme = "SECRETS"
    secret_arn  = aws_secretsmanager_secret.db_credentials.arn
  }

  vpc_subnet_ids         = var.private_subnet_ids
  vpc_security_group_ids = [aws_security_group.proxy_sg.id]
}

resource "aws_db_proxy_default_target_group" "mysql_proxy" {
  db_proxy_name = aws_db_proxy.mysql_proxy.name

  connection_pool_config {
    max_connections_percent      = 50  # Use 50% of max_connections
    max_idle_connections_percent = 10
    connection_borrow_timeout    = 120 # seconds
  }
}

The prepared statement caveat with RDS Proxy:

RDS Proxy works in "transaction" mode by default, meaning it can reassign connections between transactions. This is similar to PgBouncer's transaction mode. Server-side prepared statements created in one transaction might not be available in the next if you get a different backend connection.

However, with our JDBC configuration using cachePrepStmts=true, the driver handles this gracefully—it re-prepares statements when needed because it tracks the SQL-to-handle mapping client-side.

The Combined Architecture

Our final architecture uses both strategies together:

                    ┌──────────────────┐
                    │   Application    │
                    └────────┬─────────┘
                             │
              ┌──────────────┴──────────────┐
              │                             │
              ▼                             ▼
      ┌───────────────┐            ┌───────────────┐
      │  RDS Proxy    │            │  RDS Proxy    │
      │  (writes)     │            │  (reads)      │
      └───────┬───────┘            └───────┬───────┘
              │                             │
              ▼                             ▼
      ┌───────────────┐            ┌───────────────┐
      │    Primary    │───────────▶│   Replica 1   │
      │    (writes)   │    repl    │   (reads)     │
      └───────────────┘            ├───────────────┤
                                   │   Replica 2   │
                                   │   (reads)     │
                                   └───────────────┘

This architecture provides connection multiplexing through RDS Proxy, read/write separation across the primary and replicas, and graceful failover handling, with room to add replicas as read traffic grows.


Results

After implementing all fixes and monitoring:

Metric                   | During Crisis | With Band-Aid | After Full Fix
Prepared stmt count      | 15,000+       | ~5,000        | ~400
Max stmts per connection | 2,000+        | ~100          | ~20
Zombie connections       | 200+          | ~20           | ~0
Connection lifetime      | 30 min        | 30 min        | 5 min
Statement limit usage    | 95%+          | 30%           | 3%
Incidents per month      | Multiple      | Rare          | Zero

The system has been stable for six months now. The monitoring has caught a few regressions during development—new services with missing cache configuration—and we fixed them before they reached production.


Key Takeaways

1. Understand client-side vs server-side prepared statements. They behave completely differently. Server-side statements live on the MySQL server and can accumulate indefinitely; client-side (emulated) statements never exist on the server at all.

2. cachePrepStmts=true is essential. Without this, the driver creates new server-side statements for every prepareStatement() call, even for identical queries on the same connection.

3. Dynamic queries defeat caching. Every unique SQL string is a separate prepared statement. Normalize your queries using parameter padding, inclusive WHERE clauses, or query templating.

4. Shorter connection lifetimes are safer. Five minutes is enough. The connection creation overhead is tiny compared to the cost of statement accumulation.

5. Monitor what matters. Standard RDS metrics don't show prepared statement counts. Build custom metrics that give you visibility into this failure mode.

6. Band-aids buy time. Automated zombie killers kept us alive while we implemented proper fixes. Don't be afraid to use aggressive temporary measures during a crisis.

7. The global limit is shared. max_prepared_stmt_count applies across all connections. One misbehaving service can exhaust it for everyone.


Quick Reference

When you're debugging a similar issue, start with these queries:

Check global statement usage:

SELECT @@max_prepared_stmt_count AS max_limit,
       (SELECT VARIABLE_VALUE FROM performance_schema.global_status
        WHERE VARIABLE_NAME = 'Prepared_stmt_count') AS current_count;

Find connections with most statements:

SELECT t.PROCESSLIST_ID, t.PROCESSLIST_USER,
       SUBSTRING_INDEX(t.PROCESSLIST_HOST, ':', 1) AS ip,
       t.PROCESSLIST_TIME AS idle_sec,
       COUNT(ps.STATEMENT_ID) AS stmt_count
FROM performance_schema.threads t
LEFT JOIN performance_schema.prepared_statements_instances ps
    ON t.THREAD_ID = ps.OWNER_THREAD_ID
WHERE t.TYPE = 'FOREGROUND'
GROUP BY t.THREAD_ID
HAVING stmt_count > 10
ORDER BY stmt_count DESC
LIMIT 20;

Identify problem services by IP:

SELECT SUBSTRING_INDEX(PROCESSLIST_HOST, ':', 1) AS client_ip,
       COUNT(*) AS connections,
       SUM(CASE WHEN PROCESSLIST_COMMAND = 'Sleep'
                AND PROCESSLIST_TIME > 300 THEN 1 ELSE 0 END) AS zombies
FROM performance_schema.threads
WHERE TYPE = 'FOREGROUND' AND PROCESSLIST_USER IS NOT NULL
GROUP BY client_ip
ORDER BY connections DESC;

The Journey Matters

This wasn't a clean, linear debugging process. Real incidents rarely are. We started at 2 AM with a dying database, implemented ugly band-aids to survive the night, and spent weeks understanding the actual problem.

The key was maintaining stability while investigating. The zombie killer wasn't elegant, but it kept us running. The metrics didn't exist at first—we built them because we needed visibility. Each fix addressed one piece of the puzzle.

If you're facing a similar situation, remember: it's okay to implement temporary workarounds. Just make sure you also invest in understanding and fixing the root cause. The temporary solutions buy you time; the permanent fixes prevent recurrence.

Understanding the full stack—how your ORM generates queries, how the JDBC driver handles prepared statements, how connection pools manage lifecycle, and how MySQL tracks everything internally—is what transforms a mystery into a solvable problem.


Dealing with database scaling challenges? Connect on Twitter or LinkedIn to discuss solutions.