
Engineering for Chaos: Processing 100,000 Payments in 60 Seconds

Payments · Fintech · Scale · Architecture · Performance

Picture this: It's salary day. Thousands of companies are processing payroll simultaneously. A payment system needs to handle 100,000 transactions in under 60 seconds—and every single one must be accurate, traceable, and compliant.

This guide covers the architecture, patterns, and optimizations required to build payment systems capable of handling extreme transaction bursts while maintaining sub-second latency and zero data loss.

The Challenge: Payments at Scale

Understanding the requirements is critical:

  1. Throughput - sustain 100,000+ transactions per minute during salary-day bursts
  2. Latency - sub-second end-to-end processing, even at peak load
  3. Correctness - zero data loss and zero duplicates; every payment accurate and traceable
  4. Compliance - a complete, auditable trail for every transaction

Traditional architectures crumble under these requirements. Here's how to build one that doesn't.

The Architecture: Designed for Chaos

Layer 1: The Ingestion Layer

The first challenge is accepting 100k requests without dropping any:

                    ┌─────────────────┐
                    │   Load Balancer │
                    │   (HAProxy)     │
                    └────────┬────────┘
                             │
        ┌────────────────────┼────────────────────┐
        │                    │                    │
   ┌────▼────┐         ┌────▼────┐         ┌────▼────┐
   │ API GW  │         │ API GW  │         │ API GW  │
   │  Pod 1  │         │  Pod 2  │         │  Pod N  │
   └────┬────┘         └────┬────┘         └────┬────┘
        │                    │                    │
        └────────────────────┼────────────────────┘
                             │
                    ┌────────▼────────┐
                    │   Kafka Cluster │
                    │  (Partitioned)  │
                    └─────────────────┘

Key Design Decisions:

  1. Stateless API Gateways - Horizontal scaling without coordination
  2. Immediate Kafka Write - Accept payment intent, respond immediately
  3. Idempotency Keys - Every request carries a unique key to prevent duplicate processing

A simplified ingestion endpoint (Spring Boot) illustrating these decisions:
@PostMapping("/payments")
public ResponseEntity<PaymentResponse> initiatePayment(
    @RequestHeader("Idempotency-Key") String idempotencyKey,
    @RequestBody PaymentRequest request
) {
    // Check idempotency cache first (Redis)
    Optional<PaymentResponse> cached = idempotencyCache.get(idempotencyKey);
    if (cached.isPresent()) {
        return ResponseEntity.ok(cached.get());
    }

    // Create payment intent
    PaymentIntent intent = PaymentIntent.builder()
        .id(UUID.randomUUID().toString())
        .idempotencyKey(idempotencyKey)
        .amount(request.getAmount())
        .merchantId(request.getMerchantId())
        .status(PaymentStatus.INITIATED)
        .createdAt(Instant.now())
        .build();

    // Persist to Kafka (async; durability comes from producer acks=all and retries)
    kafkaTemplate.send("payment-intents", intent.getMerchantId(), intent);

    // Cache the response
    PaymentResponse response = new PaymentResponse(intent.getId(), "INITIATED");
    idempotencyCache.put(idempotencyKey, response, Duration.ofHours(24));

    return ResponseEntity.accepted().body(response);
}

Layer 2: The Processing Engine

This is where the complexity lives. A staged event-driven architecture (SEDA) provides isolation and scalability:

┌──────────────┐    ┌──────────────┐    ┌──────────────┐    ┌──────────────┐
│   Validate   │───▶│   Enrich     │───▶│   Execute    │───▶│   Settle     │
│    Stage     │    │    Stage     │    │    Stage     │    │    Stage     │
└──────────────┘    └──────────────┘    └──────────────┘    └──────────────┘
       │                   │                   │                   │
       ▼                   ▼                   ▼                   ▼
  ┌─────────┐        ┌─────────┐        ┌─────────┐        ┌─────────┐
  │ Kafka   │        │ Kafka   │        │ Kafka   │        │ Kafka   │
  │ Topic 1 │        │ Topic 2 │        │ Topic 3 │        │ Topic 4 │
  └─────────┘        └─────────┘        └─────────┘        └─────────┘

Each stage is independently scalable:

@Component
public class PaymentProcessor {

    private static final String CONCURRENCY = "32";

    // Stage services, injected via constructor (omitted for brevity)
    private final ValidationService validationService;
    private final EnrichmentService enrichmentService;
    private final SettlementService settlementService;

    @KafkaListener(
        topics = "payment-intents",
        groupId = "payment-processor",
        concurrency = CONCURRENCY
    )
    public void processPayment(PaymentIntent intent) {
        try {
            // Stage 1: Validate
            ValidationResult validation = validationService.validate(intent);
            if (!validation.isValid()) {
                handleValidationFailure(intent, validation);
                return;
            }

            // Stage 2: Enrich with merchant data
            EnrichedPayment enriched = enrichmentService.enrich(intent);

            // Stage 3: Execute via payment gateway
            ExecutionResult result = executeWithRetry(enriched);

            // Stage 4: Settle and notify
            settlementService.settle(result);

        } catch (Exception e) {
            handleFailure(intent, e);
        }
    }
}

Layer 3: The Database Layer

For 100k writes per minute, traditional RDBMS patterns fail. A multi-database strategy is essential:

Write Path (Hot):

Payments ──▶ Kafka ──▶ ClickHouse (Analytics)
                  └──▶ PostgreSQL (Source of Truth)
                  └──▶ Redis (Real-time Status)

Read Path (Hot):

Status Queries ──▶ Redis Cache (99% hit rate)
                       │
                       ▼ (cache miss)
                  PostgreSQL
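
A minimal sketch of that read path, assuming a hypothetical PaymentStatusService, Redis keys of the form payment:status:<id>, and PostgreSQL as the fallback; on a miss the status is read from the source of truth and back-filled into the cache:

@Service
public class PaymentStatusService {

    private static final Duration CACHE_TTL = Duration.ofMinutes(10);

    private final RedisTemplate<String, String> redis;
    private final JdbcTemplate jdbcTemplate;

    public PaymentStatusService(RedisTemplate<String, String> redis,
                                JdbcTemplate jdbcTemplate) {
        this.redis = redis;
        this.jdbcTemplate = jdbcTemplate;
    }

    public String getStatus(String paymentId) {
        String cacheKey = "payment:status:" + paymentId;

        // Hot path: the vast majority of status queries are answered from Redis
        String cached = redis.opsForValue().get(cacheKey);
        if (cached != null) {
            return cached;
        }

        // Cache miss: fall back to the source of truth
        String status = jdbcTemplate.queryForObject(
            "SELECT status FROM payments WHERE id = ?",
            String.class,
            UUID.fromString(paymentId)
        );

        // Back-fill the cache so subsequent reads stay hot
        redis.opsForValue().set(cacheKey, status, CACHE_TTL);
        return status;
    }
}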

PostgreSQL Optimizations:

-- Partitioned table by date for faster writes and queries.
-- PostgreSQL requires the partition key in the primary key, hence (id, created_at).
CREATE TABLE payments (
    id UUID NOT NULL,
    merchant_id VARCHAR(50) NOT NULL,
    amount DECIMAL(15, 2) NOT NULL,
    status VARCHAR(20) NOT NULL,
    created_at TIMESTAMPTZ NOT NULL DEFAULT NOW(),
    updated_at TIMESTAMPTZ NOT NULL DEFAULT NOW(),
    PRIMARY KEY (id, created_at)
) PARTITION BY RANGE (created_at);

-- Create monthly partitions
CREATE TABLE payments_2024_12 PARTITION OF payments
    FOR VALUES FROM ('2024-12-01') TO ('2025-01-01');

-- Optimized indexes for common query patterns
CREATE INDEX idx_payments_merchant_status
    ON payments (merchant_id, status)
    WHERE status IN ('PENDING', 'PROCESSING');

CREATE INDEX idx_payments_created_at
    ON payments (created_at DESC);
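
Monthly partitions must exist before rows arrive for them. A small scheduled job can create the next one ahead of time; a sketch, assuming a JdbcTemplate and the payments table above (class name and cron schedule are illustrative):

@Component
public class PartitionMaintenanceJob {

    private final JdbcTemplate jdbcTemplate;

    public PartitionMaintenanceJob(JdbcTemplate jdbcTemplate) {
        this.jdbcTemplate = jdbcTemplate;
    }

    // Runs daily; creating next month's partition early is harmless thanks to IF NOT EXISTS
    @Scheduled(cron = "0 0 2 * * *")
    public void createNextMonthPartition() {
        LocalDate from = LocalDate.now().plusMonths(1).withDayOfMonth(1);
        LocalDate to = from.plusMonths(1);
        String partitionName = "payments_" + from.format(DateTimeFormatter.ofPattern("yyyy_MM"));

        jdbcTemplate.execute(String.format(
            "CREATE TABLE IF NOT EXISTS %s PARTITION OF payments FOR VALUES FROM ('%s') TO ('%s')",
            partitionName, from, to
        ));
    }
}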

Batch Inserts for Performance:

@Service
public class PaymentBatchWriter {

    private static final int BATCH_SIZE = 1000;
    private final BlockingQueue<Payment> buffer = new LinkedBlockingQueue<>(10000);
    private final ScheduledExecutorService scheduler = Executors.newScheduledThreadPool(2);
    private final JdbcTemplate jdbcTemplate;   // injected via constructor (omitted for brevity)

    @PostConstruct
    public void init() {
        // Flush every 100ms or when batch is full
        scheduler.scheduleAtFixedRate(this::flush, 100, 100, TimeUnit.MILLISECONDS);
    }

    public void write(Payment payment) {
        // Block on a full buffer instead of silently dropping the payment
        try {
            buffer.put(payment);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("Interrupted while buffering payment", e);
        }
        if (buffer.size() >= BATCH_SIZE) {
            flush();
        }
    }

    private synchronized void flush() {
        List<Payment> batch = new ArrayList<>(BATCH_SIZE);
        buffer.drainTo(batch, BATCH_SIZE);

        if (!batch.isEmpty()) {
            jdbcTemplate.batchUpdate(
                "INSERT INTO payments (id, merchant_id, amount, status, created_at) " +
                "VALUES (?, ?, ?, ?, ?)",
                batch,
                BATCH_SIZE,
                (ps, payment) -> {
                    ps.setObject(1, payment.getId());
                    ps.setString(2, payment.getMerchantId());
                    ps.setBigDecimal(3, payment.getAmount());
                    ps.setString(4, payment.getStatus().name());
                    ps.setTimestamp(5, Timestamp.from(payment.getCreatedAt()));
                }
            );
        }
    }
}

Critical Patterns for Reliability

Pattern 1: The Outbox Pattern

Never lose a payment, even if downstream services fail. The outbox pattern ensures atomicity between database updates and event publishing:

@Transactional
public void processPayment(Payment payment) {
    // 1. Update payment status
    paymentRepository.updateStatus(payment.getId(), PaymentStatus.COMPLETED);

    // 2. Write to outbox (same transaction!)
    OutboxEvent event = OutboxEvent.builder()
        .aggregateId(payment.getId())
        .eventType("PAYMENT_COMPLETED")
        .payload(toJson(payment))
        .build();
    outboxRepository.save(event);

    // Transaction commits atomically
}

private String toJson(Payment payment) {
    try {
        return objectMapper.writeValueAsString(payment);
    } catch (JsonProcessingException e) {
        // Unchecked rethrow so the surrounding transaction rolls back
        throw new IllegalStateException("Failed to serialize payment " + payment.getId(), e);
    }
}

// Separate process reads outbox and publishes to Kafka
@Scheduled(fixedDelay = 100)
public void publishOutboxEvents() {
    List<OutboxEvent> events = outboxRepository.findUnpublished(100);
    for (OutboxEvent event : events) {
        try {
            // Wait for the broker acknowledgement before marking the event published;
            // a failed send leaves it unpublished and it is retried on the next run.
            kafkaTemplate.send("payment-events", event.getAggregateId(), event)
                .get(5, TimeUnit.SECONDS);
            outboxRepository.markPublished(event.getId());
        } catch (Exception e) {
            log.error("Failed to publish outbox event {}", event.getId(), e);
        }
    }
}

Why this matters: Without the outbox pattern, a database commit could succeed while the Kafka publish fails, leaving the system in an inconsistent state. The outbox guarantees at-least-once delivery; combined with idempotent consumers, that yields effectively exactly-once processing.

Pattern 2: Circuit Breaker for External Services

Payment gateways fail. Plan for it with circuit breakers that prevent cascade failures:

@Service
public class PaymentGatewayService {

    // Pull the instance from the registry so the resilience4j configuration below applies
    private final CircuitBreaker circuitBreaker;

    public PaymentGatewayService(CircuitBreakerRegistry registry) {
        this.circuitBreaker = registry.circuitBreaker("payment-gateway");
    }

    public ExecutionResult execute(EnrichedPayment payment) {
        return circuitBreaker.executeSupplier(() -> {
            // Try primary gateway
            try {
                return primaryGateway.process(payment);
            } catch (GatewayException e) {
                // Fallback to secondary
                log.warn("Primary gateway failed, trying secondary", e);
                return secondaryGateway.process(payment);
            }
        });
    }
}

Circuit Breaker Configuration:

resilience4j:
  circuitbreaker:
    instances:
      payment-gateway:
        sliding-window-size: 100
        failure-rate-threshold: 50
        wait-duration-in-open-state: 30s
        permitted-number-of-calls-in-half-open-state: 10
        slow-call-rate-threshold: 80
        slow-call-duration-threshold: 2s

How it works:

  1. Closed - calls flow normally while outcomes are tracked over a sliding window of 100 calls
  2. Open - once 50% of windowed calls fail (or 80% run slower than 2s), the breaker opens and fails fast for 30 seconds
  3. Half-open - after the wait, 10 trial calls are permitted; if they succeed the breaker closes, otherwise it reopens

Pattern 3: Distributed Locking for Idempotency

Prevent double-processing across multiple instances with Redis-based distributed locks:

@Service
public class DistributedLockService {

    private final RedisTemplate<String, String> redis;

    public boolean tryLock(String key, Duration timeout) {
        String lockKey = "lock:" + key;
        String lockValue = UUID.randomUUID().toString();

        Boolean acquired = redis.opsForValue().setIfAbsent(
            lockKey,
            lockValue,
            timeout
        );

        return Boolean.TRUE.equals(acquired);
    }

    public void processWithLock(String paymentId, Runnable task) {
        String lockKey = "payment:" + paymentId;

        if (!tryLock(lockKey, Duration.ofMinutes(5))) {
            log.info("Payment {} already being processed", paymentId);
            return;
        }

        try {
            task.run();
        } finally {
            // Caveat: an unconditional delete can release a lock that another instance
            // acquired after ours expired; see the ownership-checked release below.
            redis.delete("lock:" + lockKey);
        }
    }
}

Important considerations:

  1. Choose a lock TTL longer than the worst-case processing time, or the lock can expire while work is still in flight
  2. Release only locks you own - compare the stored lock value before deleting (see the sketch below)
  3. A single Redis node is a single point of failure for locking; for stronger guarantees consider Redisson or the Redlock algorithm
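
A minimal sketch of an ownership-checked release, assuming tryLock above is extended to return its generated lockValue to the caller; the Lua script makes the compare-and-delete atomic (class name illustrative):

@Service
public class SafeLockRelease {

    // Delete the key only if it still holds the value set by the acquiring instance
    private static final String RELEASE_SCRIPT =
        "if redis.call('get', KEYS[1]) == ARGV[1] then " +
        "  return redis.call('del', KEYS[1]) " +
        "else return 0 end";

    private final RedisTemplate<String, String> redis;

    public SafeLockRelease(RedisTemplate<String, String> redis) {
        this.redis = redis;
    }

    public boolean release(String lockKey, String lockValue) {
        Long deleted = redis.execute(
            new DefaultRedisScript<>(RELEASE_SCRIPT, Long.class),
            Collections.singletonList(lockKey),
            lockValue
        );
        return Long.valueOf(1L).equals(deleted);
    }
}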

Pattern 4: Dead Letter Queue for Failed Payments

Never silently drop a payment. Dead letter queues ensure manual intervention for edge cases:

@KafkaListener(topics = "payment-intents")
public void processPayment(
    @Payload PaymentIntent intent,
    @Header(KafkaHeaders.RECEIVED_TOPIC) String topic,
    Acknowledgment ack
) {
    try {
        doProcess(intent);
        ack.acknowledge();
    } catch (RetryableException e) {
        // Don't ack - Kafka will retry
        throw e;
    } catch (NonRetryableException e) {
        // Send to DLQ for manual intervention
        kafkaTemplate.send("payment-intents-dlq", intent);
        ack.acknowledge();

        // Alert on-call
        alertService.sendAlert(
            AlertLevel.HIGH,
            "Payment sent to DLQ",
            Map.of("paymentId", intent.getId(), "error", e.getMessage())
        );
    }
}

DLQ processing workflow:

  1. Payment fails with non-retryable error
  2. Message moves to DLQ topic
  3. Alert triggers for operations team
  4. Manual review determines fix (retry, refund, or escalate)
  5. Fixed payment reprocessed through main pipeline
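
Once the underlying issue is resolved, step 5 can be a small replay listener that stays stopped until operations starts it (via the KafkaListenerEndpointRegistry) and then drains reviewed messages back into the main topic. A sketch under the same topic names as above:

@Component
public class DlqReprocessor {

    private final KafkaTemplate<String, PaymentIntent> kafkaTemplate;

    public DlqReprocessor(KafkaTemplate<String, PaymentIntent> kafkaTemplate) {
        this.kafkaTemplate = kafkaTemplate;
    }

    // Disabled by default; only started once the DLQ contents have been reviewed
    @KafkaListener(
        id = "dlq-reprocessor",
        topics = "payment-intents-dlq",
        autoStartup = "false"
    )
    public void reprocess(PaymentIntent intent, Acknowledgment ack) {
        // Route the payment back through the main pipeline; the idempotency key on the
        // intent prevents double-processing if it was already partially handled.
        kafkaTemplate.send("payment-intents", intent.getMerchantId(), intent);
        ack.acknowledge();
    }
}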

Performance Optimizations

Optimization 1: Connection Pooling Done Right

spring:
  datasource:
    hikari:
      maximum-pool-size: 20        # sized for a ~10-core host (see formula below)
      minimum-idle: 10
      connection-timeout: 10000
      idle-timeout: 300000
      max-lifetime: 900000
      leak-detection-threshold: 60000

The formula: pool_size = CPU_cores * 2 + effective_spindle_count

For SSDs, the effective spindle count is typically 1, so a 4-core machine with an SSD needs only ~9 connections; the pool of 20 above assumes a larger (roughly 10-core) host.

Optimization 2: Async Everywhere

Non-critical operations should never block the payment path:

@Service
public class NotificationService {

    private final ExecutorService executor =
        Executors.newFixedThreadPool(10);

    // Offloaded to the dedicated notification pool via runAsync below, so webhook,
    // SMS, and email latency never blocks the payment processing path
    public CompletableFuture<Void> notifyMerchant(Payment payment) {
        return CompletableFuture.runAsync(() -> {
            // Send webhook
            webhookClient.send(payment.getMerchantWebhook(), payment);

            // Send SMS
            smsService.send(payment.getMerchantPhone(), formatMessage(payment));

            // Send email
            emailService.send(payment.getMerchantEmail(), formatEmail(payment));
        }, executor);
    }
}

Optimization 3: Compression for Kafka

Reduce network bandwidth and improve throughput:

spring:
  kafka:
    producer:
      compression-type: lz4
      batch-size: 32768
      linger-ms: 5
      buffer-memory: 67108864

Compression comparison: lz4 and snappy are fast with moderate compression ratios, gzip compresses harder at a noticeably higher CPU cost, and zstd (Kafka 2.1+) offers near-gzip ratios for modest CPU. For latency-sensitive payment traffic, lz4 is a safe default.

Optimization 4: JVM Tuning

For payment systems with strict latency requirements:

JAVA_OPTS="-Xms4g -Xmx4g \
  -XX:+UseG1GC \
  -XX:MaxGCPauseMillis=100 \
  -XX:+ParallelRefProcEnabled \
  -XX:+UseStringDeduplication \
  -XX:+HeapDumpOnOutOfMemoryError"

Why G1GC:

  1. Pause-time goals - MaxGCPauseMillis lets the collector aim for short, predictable pauses instead of long stop-the-world collections
  2. Region-based heap - the heap is collected incrementally in regions, which scales well to the 4 GB heap configured above
  3. String deduplication - UseStringDeduplication reclaims memory wasted on repeated IDs and status strings

Monitoring & Observability

Key Metrics Dashboard

@Component
public class PaymentMetrics {

    private final MeterRegistry registry;

    private final Counter paymentsInitiated;
    private final Counter paymentsCompleted;
    private final Counter paymentsFailed;
    private final Timer processingTime;

    public PaymentMetrics(MeterRegistry registry) {
        this.registry = registry;

        this.paymentsInitiated = Counter.builder("payments.initiated")
            .description("Total payments initiated")
            .register(registry);

        this.paymentsCompleted = Counter.builder("payments.completed")
            .description("Total payments completed")
            .register(registry);

        this.paymentsFailed = Counter.builder("payments.failed")
            .description("Total payments failed")
            .tag("reason", "unknown")
            .register(registry);

        this.processingTime = Timer.builder("payments.processing.time")
            .description("Payment processing duration")
            .publishPercentiles(0.5, 0.95, 0.99)
            .publishPercentileHistogram()   // exposes _bucket series for the PromQL alerts below
            .register(registry);
    }

    public void recordPaymentProcessed(Payment payment, Duration duration) {
        processingTime.record(duration);
        if (payment.getStatus() == PaymentStatus.COMPLETED) {
            paymentsCompleted.increment();
        } else {
            paymentsFailed.increment();
        }
    }
}

Alert Rules

groups:
  - name: payment_alerts
    rules:
      - alert: HighPaymentFailureRate
        expr: |
          rate(payments_failed_total[5m]) /
          rate(payments_initiated_total[5m]) > 0.05
        for: 2m
        labels:
          severity: critical
        annotations:
          summary: "Payment failure rate above 5%"

      - alert: PaymentProcessingLatency
        expr: |
          histogram_quantile(0.99,
            sum(rate(payments_processing_time_seconds_bucket[5m])) by (le)) > 2
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "P99 payment latency above 2 seconds"

      - alert: KafkaConsumerLag
        expr: kafka_consumer_lag > 10000
        for: 3m
        labels:
          severity: critical
        annotations:
          summary: "Kafka consumer lag exceeds 10k messages"

Expected Results

When this architecture is properly implemented:

Metric              Target      Achievable
Throughput          100k/min    150k+/min
P99 Latency         < 2s        ~800ms
Success Rate        99.9%       99.95%+
Data Loss           0           0
Monthly Downtime    < 5 min     < 3 min

Key Takeaways

  1. Accept fast, process async - Decouple ingestion from processing to handle bursts

  2. Partition everything - Databases, queues, services—parallelism is essential for scale

  3. Design for failure - Circuit breakers, retries, dead letter queues are not optional

  4. Batch aggressively - Individual operations don't scale; batch writes, batch reads

  5. Monitor obsessively - Problems invisible to metrics are problems waiting to explode

  6. Test at scale - Load test with realistic data volumes before production (see the sketch below)
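
As a starting point for those load tests, here is a minimal sketch using Java's built-in HttpClient that fires concurrent requests with unique idempotency keys; the endpoint URL, payload, and request counts are placeholders for the real environment:

public class PaymentLoadTest {

    private static final String ENDPOINT = "http://localhost:8080/payments"; // placeholder
    private static final int TOTAL_REQUESTS = 100_000;
    private static final int CONCURRENCY = 200;

    public static void main(String[] args) throws Exception {
        HttpClient client = HttpClient.newHttpClient();
        ExecutorService pool = Executors.newFixedThreadPool(CONCURRENCY);
        CountDownLatch done = new CountDownLatch(TOTAL_REQUESTS);
        AtomicInteger failures = new AtomicInteger();

        long start = System.nanoTime();
        for (int i = 0; i < TOTAL_REQUESTS; i++) {
            pool.submit(() -> {
                try {
                    HttpRequest request = HttpRequest.newBuilder(URI.create(ENDPOINT))
                        .header("Idempotency-Key", UUID.randomUUID().toString())
                        .header("Content-Type", "application/json")
                        .POST(HttpRequest.BodyPublishers.ofString(
                            "{\"amount\": 100.00, \"merchantId\": \"merchant-1\"}"))
                        .build();
                    HttpResponse<String> response =
                        client.send(request, HttpResponse.BodyHandlers.ofString());
                    if (response.statusCode() >= 400) {
                        failures.incrementAndGet();
                    }
                } catch (Exception e) {
                    failures.incrementAndGet();
                } finally {
                    done.countDown();
                }
            });
        }
        done.await();
        pool.shutdown();

        long elapsedMs = (System.nanoTime() - start) / 1_000_000;
        System.out.printf("Sent %d requests in %d ms (%d failures)%n",
            TOTAL_REQUESTS, elapsedMs, failures.get());
    }
}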

Conclusion

Building systems that handle 100k payments per minute isn't about finding a silver bullet—it's about making hundreds of small, correct architectural decisions. Each optimization, each pattern, each configuration choice compounds.

The architecture described here isn't theoretical. These patterns power real payment systems processing billions in transactions, serving millions of merchants, and standing firm during the most chaotic payment spikes.

The key insight? Embrace the chaos. Design for it. Test against it. And when it arrives, the system will be ready.


Building payment systems at scale? Connect on Twitter or LinkedIn to discuss architecture patterns.