Picture this: It's salary day. Thousands of companies are processing payroll simultaneously. A payment system needs to handle 100,000 transactions in under 60 seconds—and every single one must be accurate, traceable, and compliant.
This guide covers the architecture, patterns, and optimizations required to build payment systems capable of handling extreme transaction bursts while maintaining sub-second latency and zero data loss.
The Challenge: Payments at Scale
Understanding the requirements is critical:
- 100,000+ transactions in peak minutes
- Millions in currency flowing through the system
- Tens of thousands of merchants depending on reliability
- 99.99% uptime requirement (4.3 minutes downtime per month max)
- Zero tolerance for duplicate payments or data loss
Traditional architectures crumble under these requirements. Here's how to build one that doesn't.
The Architecture: Designed for Chaos
Layer 1: The Ingestion Layer
The first challenge is accepting 100k requests per minute without dropping any:
┌─────────────────┐
│ Load Balancer │
│ (HAProxy) │
└────────┬────────┘
│
┌────────────────────┼────────────────────┐
│ │ │
┌────▼────┐ ┌────▼────┐ ┌────▼────┐
│ API GW │ │ API GW │ │ API GW │
│ Pod 1 │ │ Pod 2 │ │ Pod N │
└────┬────┘ └────┬────┘ └────┬────┘
│ │ │
└────────────────────┼────────────────────┘
│
┌────────▼────────┐
│ Kafka Cluster │
│ (Partitioned) │
└─────────────────┘
Key Design Decisions:
- Stateless API Gateways - Horizontal scaling without coordination
- Immediate Kafka Write - Accept payment intent, respond immediately
- Idempotency Keys - Every request has a unique key to prevent duplicates
@PostMapping("/payments")
public ResponseEntity<PaymentResponse> initiatePayment(
@RequestHeader("Idempotency-Key") String idempotencyKey,
@RequestBody PaymentRequest request
) {
// Check idempotency cache first (Redis)
Optional<PaymentResponse> cached = idempotencyCache.get(idempotencyKey);
if (cached.isPresent()) {
return ResponseEntity.ok(cached.get());
}
// Create payment intent
PaymentIntent intent = PaymentIntent.builder()
.id(UUID.randomUUID().toString())
.idempotencyKey(idempotencyKey)
.amount(request.getAmount())
.merchantId(request.getMerchantId())
.status(PaymentStatus.INITIATED)
.createdAt(Instant.now())
.build();
// Persist to Kafka (async, guaranteed delivery)
kafkaTemplate.send("payment-intents", intent.getMerchantId(), intent);
// Cache the response
PaymentResponse response = new PaymentResponse(intent.getId(), "INITIATED");
idempotencyCache.put(idempotencyKey, response, Duration.ofHours(24));
return ResponseEntity.accepted().body(response);
}
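The "guaranteed delivery" comment above only holds if the producer itself is configured for durability. A minimal sketch of that configuration with Spring Kafka (the broker address, bean name, and serializer choices are assumptions, not from the original):

import java.util.HashMap;
import java.util.Map;
import org.apache.kafka.clients.producer.ProducerConfig;
import org.apache.kafka.common.serialization.StringSerializer;
import org.springframework.context.annotation.Bean;
import org.springframework.context.annotation.Configuration;
import org.springframework.kafka.core.DefaultKafkaProducerFactory;
import org.springframework.kafka.core.ProducerFactory;
import org.springframework.kafka.support.serializer.JsonSerializer;

@Configuration
public class PaymentProducerConfig {

    @Bean
    public ProducerFactory<String, PaymentIntent> paymentProducerFactory() {
        Map<String, Object> props = new HashMap<>();
        props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "kafka:9092");   // assumed address
        // Durability: wait for all in-sync replicas, retry indefinitely, dedupe broker-side
        props.put(ProducerConfig.ACKS_CONFIG, "all");
        props.put(ProducerConfig.ENABLE_IDEMPOTENCE_CONFIG, true);
        props.put(ProducerConfig.RETRIES_CONFIG, Integer.MAX_VALUE);
        props.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG, StringSerializer.class);
        props.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG, JsonSerializer.class);
        return new DefaultKafkaProducerFactory<>(props);
    }
}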
Layer 2: The Processing Engine
This is where the complexity lives. A staged event-driven architecture (SEDA) provides isolation and scalability:
┌──────────────┐ ┌──────────────┐ ┌──────────────┐ ┌──────────────┐
│ Validate │───▶│ Enrich │───▶│ Execute │───▶│ Settle │
│ Stage │ │ Stage │ │ Stage │ │ Stage │
└──────────────┘ └──────────────┘ └──────────────┘ └──────────────┘
│ │ │ │
▼ ▼ ▼ ▼
┌─────────┐ ┌─────────┐ ┌─────────┐ ┌─────────┐
│ Kafka │ │ Kafka │ │ Kafka │ │ Kafka │
│ Topic 1 │ │ Topic 2 │ │ Topic 3 │ │ Topic 4 │
└─────────┘ └─────────┘ └─────────┘ └─────────┘
Each stage is independently scalable. For brevity, the snippet below collapses all four stages into a single consumer; in a production deployment each stage would consume from its own topic (as in the diagram above) and scale on its own:
@Component
public class PaymentProcessor {
    private static final String PARALLELISM = "32";

    @KafkaListener(
        topics = "payment-intents",
        groupId = "payment-processor",
        concurrency = PARALLELISM   // reuse the constant instead of repeating the literal "32"
)
public void processPayment(PaymentIntent intent) {
try {
// Stage 1: Validate
ValidationResult validation = validationService.validate(intent);
if (!validation.isValid()) {
handleValidationFailure(intent, validation);
return;
}
// Stage 2: Enrich with merchant data
EnrichedPayment enriched = enrichmentService.enrich(intent);
// Stage 3: Execute via payment gateway
ExecutionResult result = executeWithRetry(enriched);
// Stage 4: Settle and notify
settlementService.settle(result);
} catch (Exception e) {
handleFailure(intent, e);
}
}
}
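The executeWithRetry call above is referenced but never shown. One plausible shape for it, using Resilience4j's Retry and delegating to the PaymentGatewayService shown later under Pattern 2 (the attempt count and wait duration are illustrative assumptions):

import io.github.resilience4j.retry.Retry;
import io.github.resilience4j.retry.RetryConfig;
import java.time.Duration;

// Inside PaymentProcessor: a hypothetical executeWithRetry(...) helper
private final Retry executionRetry = Retry.of("payment-execution", RetryConfig.custom()
        .maxAttempts(3)                        // illustrative: 1 initial call + 2 retries
        .waitDuration(Duration.ofMillis(200))  // illustrative pause between attempts
        .build());

private ExecutionResult executeWithRetry(EnrichedPayment enriched) {
    // Retry transient gateway failures before the payment is treated as failed
    return Retry.decorateSupplier(executionRetry, () -> paymentGatewayService.execute(enriched)).get();
}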
Layer 3: The Database Layer
For 100k writes per minute, traditional RDBMS patterns fail. A multi-database strategy is essential:
Write Path (Hot):
Payments ──▶ Kafka ──▶ ClickHouse (Analytics)
└──▶ PostgreSQL (Source of Truth)
└──▶ Redis (Real-time Status)
Read Path (Hot):
Status Queries ──▶ Redis Cache (99% hit rate)
│
▼ (cache miss)
PostgreSQL
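A sketch of what the read path looks like in code; the class, cache-key format, TTL, and repository method here are assumptions, not from the original:

import java.time.Duration;
import org.springframework.data.redis.core.RedisTemplate;
import org.springframework.stereotype.Service;

@Service
public class PaymentStatusService {

    private final RedisTemplate<String, String> redis;
    private final PaymentRepository paymentRepository;

    public PaymentStatusService(RedisTemplate<String, String> redis, PaymentRepository paymentRepository) {
        this.redis = redis;
        this.paymentRepository = paymentRepository;
    }

    public String getStatus(String paymentId) {
        String cacheKey = "payment:status:" + paymentId;
        // Hot path: serve from Redis
        String cached = redis.opsForValue().get(cacheKey);
        if (cached != null) {
            return cached;
        }
        // Cache miss: fall back to the source of truth and repopulate the cache
        String status = paymentRepository.findStatusById(paymentId);   // hypothetical repository method
        redis.opsForValue().set(cacheKey, status, Duration.ofMinutes(10));
        return status;
    }
}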
PostgreSQL Optimizations:
-- Partitioned table by date for faster writes and queries
CREATE TABLE payments (
    id UUID NOT NULL,
    merchant_id VARCHAR(50) NOT NULL,
    amount DECIMAL(15, 2) NOT NULL,
    status VARCHAR(20) NOT NULL,
    created_at TIMESTAMPTZ NOT NULL DEFAULT NOW(),
    updated_at TIMESTAMPTZ NOT NULL DEFAULT NOW(),
    -- On a partitioned table the primary key must include the partition key
    PRIMARY KEY (id, created_at)
) PARTITION BY RANGE (created_at);
-- Create monthly partitions
CREATE TABLE payments_2024_12 PARTITION OF payments
FOR VALUES FROM ('2024-12-01') TO ('2025-01-01');
-- Optimized indexes for common query patterns
CREATE INDEX idx_payments_merchant_status
ON payments (merchant_id, status)
WHERE status IN ('PENDING', 'PROCESSING');
CREATE INDEX idx_payments_created_at
ON payments (created_at DESC);
Batch Inserts for Performance:
@Service
public class PaymentBatchWriter {
private static final int BATCH_SIZE = 1000;
private final BlockingQueue<Payment> buffer = new LinkedBlockingQueue<>(10000);
private final ScheduledExecutorService scheduler = Executors.newScheduledThreadPool(2);
@PostConstruct
public void init() {
// Flush every 100ms or when batch is full
scheduler.scheduleAtFixedRate(this::flush, 100, 100, TimeUnit.MILLISECONDS);
}
    public void write(Payment payment) {
        try {
            // put() blocks when the buffer is full, applying backpressure instead of silently dropping payments
            buffer.put(payment);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("Interrupted while buffering payment", e);
        }
        if (buffer.size() >= BATCH_SIZE) {
            flush();
        }
    }
private synchronized void flush() {
List<Payment> batch = new ArrayList<>(BATCH_SIZE);
buffer.drainTo(batch, BATCH_SIZE);
if (!batch.isEmpty()) {
jdbcTemplate.batchUpdate(
"INSERT INTO payments (id, merchant_id, amount, status, created_at) " +
"VALUES (?, ?, ?, ?, ?)",
batch,
BATCH_SIZE,
(ps, payment) -> {
ps.setObject(1, payment.getId());
ps.setString(2, payment.getMerchantId());
ps.setBigDecimal(3, payment.getAmount());
ps.setString(4, payment.getStatus().name());
ps.setTimestamp(5, Timestamp.from(payment.getCreatedAt()));
}
);
}
}
}
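One gap worth closing in the writer above: payments still sitting in the buffer would be lost on an orderly shutdown between flush ticks. A small addition (an assumption, not part of the original class) drains the buffer first:

// Inside PaymentBatchWriter: flush whatever remains before the application stops
@PreDestroy
public void shutdown() throws InterruptedException {
    scheduler.shutdown();
    scheduler.awaitTermination(1, TimeUnit.SECONDS);  // let any in-flight scheduled flush finish
    flush();                                          // drain the remaining buffered payments
}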
Critical Patterns for Reliability
Pattern 1: The Outbox Pattern
Never lose a payment, even if downstream services fail. The outbox pattern ensures atomicity between database updates and event publishing:
@Transactional
public void processPayment(Payment payment) {
// 1. Update payment status
paymentRepository.updateStatus(payment.getId(), PaymentStatus.COMPLETED);
// 2. Write to outbox (same transaction!)
OutboxEvent event = OutboxEvent.builder()
.aggregateId(payment.getId())
.eventType("PAYMENT_COMPLETED")
.payload(objectMapper.writeValueAsString(payment))
.build();
outboxRepository.save(event);
// Transaction commits atomically
}
// Separate process reads outbox and publishes to Kafka
@Scheduled(fixedDelay = 100)
public void publishOutboxEvents() {
List<OutboxEvent> events = outboxRepository.findUnpublished(100);
    for (OutboxEvent event : events) {
        // Block until the broker acknowledges the send before marking the event published;
        // a fire-and-forget send could fail silently and the event would never be delivered.
        // (.join() assumes Spring Kafka 3.x, where send() returns a CompletableFuture.)
        kafkaTemplate.send("payment-events", event.getAggregateId(), event).join();
        outboxRepository.markPublished(event.getId());
}
}
Why this matters: Without the outbox pattern, a database commit could succeed while the Kafka publish fails, leaving the system in an inconsistent state. The outbox guarantees at-least-once delivery of every committed change; paired with the idempotency keys described earlier on the consumer side, the end-to-end behavior is effectively exactly-once.
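For completeness, one possible shape of the outbox row that the transactional code writes and the scheduled publisher reads. This is a sketch: Lombok's @Builder is assumed to match the OutboxEvent.builder() call, and the column choices are illustrative.

import jakarta.persistence.Column;
import jakarta.persistence.Entity;
import jakarta.persistence.Id;
import jakarta.persistence.Table;
import java.time.Instant;
import java.util.UUID;
import lombok.AllArgsConstructor;
import lombok.Builder;
import lombok.Getter;
import lombok.NoArgsConstructor;

@Entity
@Table(name = "outbox_events")
@Getter
@Builder
@NoArgsConstructor
@AllArgsConstructor
public class OutboxEvent {

    @Id
    @Builder.Default
    private UUID id = UUID.randomUUID();      // event id, independent of the payment id

    private String aggregateId;               // payment id, also used as the Kafka partition key

    private String eventType;                 // e.g. PAYMENT_COMPLETED

    @Column(columnDefinition = "text")
    private String payload;                   // serialized payment, written in the same transaction

    @Builder.Default
    private boolean published = false;        // flipped by the scheduled publisher after the Kafka ack

    @Builder.Default
    private Instant createdAt = Instant.now();
}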
Pattern 2: Circuit Breaker for External Services
Payment gateways fail. Plan for it with circuit breakers that prevent cascade failures:
@Service
public class PaymentGatewayService {
    private final CircuitBreaker circuitBreaker;

    public PaymentGatewayService(CircuitBreakerRegistry registry) {
        // Pull the instance configured in application.yml (below); CircuitBreaker.ofDefaults()
        // would silently ignore that configuration
        this.circuitBreaker = registry.circuitBreaker("payment-gateway");
    }
public ExecutionResult execute(EnrichedPayment payment) {
return circuitBreaker.executeSupplier(() -> {
// Try primary gateway
try {
return primaryGateway.process(payment);
} catch (GatewayException e) {
// Fallback to secondary
log.warn("Primary gateway failed, trying secondary", e);
return secondaryGateway.process(payment);
}
});
}
}
Circuit Breaker Configuration:
resilience4j:
circuitbreaker:
instances:
payment-gateway:
sliding-window-size: 100
failure-rate-threshold: 50
wait-duration-in-open-state: 30s
permitted-number-of-calls-in-half-open-state: 10
slow-call-rate-threshold: 80
slow-call-duration-threshold: 2s
How it works:
- Closed state: Normal operation, requests pass through
- Open state: After 50% failures in 100 requests, circuit opens and fails fast
- Half-open state: After 30s, allows 10 test requests to check recovery
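These state transitions are also worth surfacing to on-call. Resilience4j exposes them through an event publisher; the logging shown here is a suggestion, not part of the original:

// Inside PaymentGatewayService, e.g. in the constructor: log every state change of the breaker
circuitBreaker.getEventPublisher()
    .onStateTransition(event -> log.warn("Circuit breaker '{}' transitioned {} -> {}",
        event.getCircuitBreakerName(),
        event.getStateTransition().getFromState(),
        event.getStateTransition().getToState()));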
Pattern 3: Distributed Locking for Idempotency
Prevent double-processing across multiple instances with Redis-based distributed locks:
@Service
public class DistributedLockService {
private final RedisTemplate<String, String> redis;
public boolean tryLock(String key, Duration timeout) {
String lockKey = "lock:" + key;
String lockValue = UUID.randomUUID().toString();
Boolean acquired = redis.opsForValue().setIfAbsent(
lockKey,
lockValue,
timeout
);
return Boolean.TRUE.equals(acquired);
}
public void processWithLock(String paymentId, Runnable task) {
String lockKey = "payment:" + paymentId;
if (!tryLock(lockKey, Duration.ofMinutes(5))) {
log.info("Payment {} already being processed", paymentId);
return;
}
try {
task.run();
} finally {
redis.delete("lock:" + lockKey);
}
}
}
Important considerations:
- Lock timeouts must exceed maximum processing time
- Release locks with a compare-and-delete on the lock token, so an instance whose lock has already expired cannot delete a lock now held by another instance (see the sketch below)
- Consider using Redlock for multi-datacenter deployments
- Always release locks in finally blocks
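A sketch of that compare-and-delete release, assuming tryLock is extended to return or store the token it generated; the Lua script is the standard Redis pattern for a safe unlock:

import java.util.Collections;
import org.springframework.data.redis.core.script.DefaultRedisScript;

// Inside DistributedLockService: delete the lock only if it still holds our token
private static final String UNLOCK_SCRIPT =
    "if redis.call('get', KEYS[1]) == ARGV[1] then return redis.call('del', KEYS[1]) else return 0 end";

public void unlock(String key, String lockValue) {
    String lockKey = "lock:" + key;
    // Executed atomically in Redis, so an expired holder cannot remove someone else's lock
    redis.execute(new DefaultRedisScript<>(UNLOCK_SCRIPT, Long.class),
        Collections.singletonList(lockKey), lockValue);
}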
Pattern 4: Dead Letter Queue for Failed Payments
Never silently drop a payment. Dead letter queues ensure manual intervention for edge cases:
@KafkaListener(topics = "payment-intents")
public void processPayment(
@Payload PaymentIntent intent,
@Header(KafkaHeaders.RECEIVED_TOPIC) String topic,
Acknowledgment ack
) {
try {
doProcess(intent);
ack.acknowledge();
} catch (RetryableException e) {
// Don't ack - Kafka will retry
throw e;
} catch (NonRetryableException e) {
// Send to DLQ for manual intervention
kafkaTemplate.send("payment-intents-dlq", intent);
ack.acknowledge();
// Alert on-call
alertService.sendAlert(
AlertLevel.HIGH,
"Payment sent to DLQ",
Map.of("paymentId", intent.getId(), "error", e.getMessage())
);
}
}
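An alternative to hand-rolling the DLQ publish: Spring Kafka's DefaultErrorHandler can retry retryable errors with backoff and route the rest to a dead letter topic automatically. This is a configuration sketch; the topic choice and backoff values are assumptions:

import org.apache.kafka.common.TopicPartition;
import org.springframework.context.annotation.Bean;
import org.springframework.context.annotation.Configuration;
import org.springframework.kafka.core.KafkaTemplate;
import org.springframework.kafka.listener.DeadLetterPublishingRecoverer;
import org.springframework.kafka.listener.DefaultErrorHandler;
import org.springframework.util.backoff.FixedBackOff;

@Configuration
public class PaymentErrorHandlingConfig {

    @Bean
    public DefaultErrorHandler paymentErrorHandler(KafkaTemplate<Object, Object> template) {
        // Failed records are republished to the DLQ topic, keeping the original partition
        DeadLetterPublishingRecoverer recoverer = new DeadLetterPublishingRecoverer(template,
                (record, ex) -> new TopicPartition("payment-intents-dlq", record.partition()));
        // Retry up to 3 times with a 1s pause before recovering the record to the DLQ
        DefaultErrorHandler handler = new DefaultErrorHandler(recoverer, new FixedBackOff(1000L, 3));
        // Errors that will never succeed skip the retries and go straight to the DLQ
        handler.addNotRetryableExceptions(NonRetryableException.class);
        return handler;
    }
}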
DLQ processing workflow:
- Payment fails with non-retryable error
- Message moves to DLQ topic
- Alert triggers for operations team
- Manual review determines fix (retry, refund, or escalate)
- Fixed payment reprocessed through main pipeline
Performance Optimizations
Optimization 1: Connection Pooling Done Right
hikari:
maximum-pool-size: 20 # CPU cores * 2 + disk spindles
minimum-idle: 10
connection-timeout: 10000
idle-timeout: 300000
max-lifetime: 900000
leak-detection-threshold: 60000
The formula: pool_size = CPU_cores * 2 + effective_spindle_count
For SSDs, the effective spindle count is typically 1, so a 4-core machine should use roughly 9-10 connections. The maximum-pool-size of 20 above is sized for a larger host (around 9-10 cores); resist the urge to raise it further, since oversized pools add contention at the database rather than throughput.
Optimization 2: Async Everywhere
Non-critical operations should never block the payment path:
@Service
public class NotificationService {

    // Dedicated, bounded pool: notification work runs here so it can never block the payment path
    private final ExecutorService executor =
        Executors.newFixedThreadPool(10);

    public CompletableFuture<Void> notifyMerchant(Payment payment) {
        return CompletableFuture.runAsync(() -> {
            // Send webhook
            webhookClient.send(payment.getMerchantWebhook(), payment);
            // Send SMS
            smsService.send(payment.getMerchantPhone(), formatMessage(payment));
            // Send email
            emailService.send(payment.getMerchantEmail(), formatEmail(payment));
        }, executor);
    }
}
Optimization 3: Compression for Kafka
Reduce network bandwidth and improve throughput:
spring:
kafka:
producer:
compression-type: lz4
batch-size: 32768
linger-ms: 5
buffer-memory: 67108864
Compression comparison:
- None: Fastest CPU, highest bandwidth
- LZ4: Best balance of speed and compression
- Snappy: Good compression, moderate CPU
- GZIP: Best compression, highest CPU
Optimization 4: JVM Tuning
For payment systems with strict latency requirements:
JAVA_OPTS="-Xms4g -Xmx4g \
-XX:+UseG1GC \
-XX:MaxGCPauseMillis=100 \
-XX:+ParallelRefProcEnabled \
-XX:+UseStringDeduplication \
-XX:+HeapDumpOnOutOfMemoryError"
Why G1GC:
- Predictable pause times (MaxGCPauseMillis=100ms)
- Good throughput for heap sizes 4GB+
- Concurrent marking reduces stop-the-world pauses
Monitoring & Observability
Key Metrics Dashboard
@Component
public class PaymentMetrics {
private final MeterRegistry registry;
private final Counter paymentsInitiated;
private final Counter paymentsCompleted;
private final Counter paymentsFailed;
private final Timer processingTime;
public PaymentMetrics(MeterRegistry registry) {
this.registry = registry;
this.paymentsInitiated = Counter.builder("payments.initiated")
.description("Total payments initiated")
.register(registry);
this.paymentsCompleted = Counter.builder("payments.completed")
.description("Total payments completed")
.register(registry);
this.paymentsFailed = Counter.builder("payments.failed")
.description("Total payments failed")
.tag("reason", "unknown")
.register(registry);
this.processingTime = Timer.builder("payments.processing.time")
.description("Payment processing duration")
.publishPercentiles(0.5, 0.95, 0.99)
.register(registry);
}
public void recordPaymentProcessed(Payment payment, Duration duration) {
processingTime.record(duration);
if (payment.getStatus() == PaymentStatus.COMPLETED) {
paymentsCompleted.increment();
} else {
paymentsFailed.increment();
}
}
}
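Queue depth belongs on the same dashboard. Micrometer gauges sample a live object on each scrape; a sketch, assuming the batch writer exposes its backlog through a hypothetical bufferSize() method:

// Inside PaymentMetrics (or any configuration class): sample the writer's backlog on scrape
Gauge.builder("payments.queue.depth", paymentBatchWriter, writer -> writer.bufferSize())
    .description("Payments buffered and awaiting batch insert")
    .register(registry);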
Alert Rules
groups:
- name: payment_alerts
rules:
- alert: HighPaymentFailureRate
expr: |
rate(payments_failed_total[5m]) /
rate(payments_initiated_total[5m]) > 0.05
for: 2m
labels:
severity: critical
annotations:
summary: "Payment failure rate above 5%"
- alert: PaymentProcessingLatency
expr: |
          histogram_quantile(0.99, sum(rate(payments_processing_time_seconds_bucket[5m])) by (le)) > 2
for: 5m
labels:
severity: warning
annotations:
summary: "P99 payment latency above 2 seconds"
- alert: KafkaConsumerLag
expr: kafka_consumer_lag > 10000
for: 3m
labels:
severity: critical
annotations:
summary: "Kafka consumer lag exceeds 10k messages"
Expected Results
When this architecture is properly implemented:
| Metric | Target | Achievable |
|---|---|---|
| Throughput | 100k/min | 150k+/min |
| P99 Latency | < 2s | ~800ms |
| Success Rate | 99.9% | 99.95%+ |
| Data Loss | 0 | 0 |
| Monthly Downtime | < 5 min | < 3 min |
Key Takeaways
- Accept fast, process async - Decouple ingestion from processing to handle bursts
- Partition everything - Databases, queues, services; parallelism is essential for scale
- Design for failure - Circuit breakers, retries, and dead letter queues are not optional
- Batch aggressively - Individual operations don't scale; batch writes, batch reads
- Monitor obsessively - Problems invisible to metrics are problems waiting to explode
- Test at scale - Load test with realistic data volumes before production (see the sketch below)
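The load-test takeaway deserves a concrete starting point. Below is a minimal burst-test harness sketch using java.net.http; the endpoint URL, payload, and thread count are assumptions, and a dedicated tool such as Gatling or k6 is the better choice for sustained, realistic load:

import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.time.Duration;
import java.util.UUID;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

public class PaymentBurstTest {

    public static void main(String[] args) throws Exception {
        HttpClient client = HttpClient.newBuilder().connectTimeout(Duration.ofSeconds(2)).build();
        ExecutorService pool = Executors.newFixedThreadPool(200);   // assumed client-side concurrency
        int total = 100_000;
        CountDownLatch done = new CountDownLatch(total);
        long start = System.nanoTime();

        for (int i = 0; i < total; i++) {
            pool.submit(() -> {
                HttpRequest request = HttpRequest.newBuilder()
                        .uri(URI.create("http://localhost:8080/payments"))   // assumed endpoint
                        .header("Idempotency-Key", UUID.randomUUID().toString())
                        .header("Content-Type", "application/json")
                        .POST(HttpRequest.BodyPublishers.ofString(
                                "{\"amount\":100.00,\"merchantId\":\"m-123\"}"))
                        .build();
                try {
                    client.send(request, HttpResponse.BodyHandlers.discarding());
                } catch (Exception e) {
                    // a real harness would count and classify failures here
                } finally {
                    done.countDown();
                }
            });
        }

        done.await();
        System.out.printf("Sent %d requests in %d ms%n", total, (System.nanoTime() - start) / 1_000_000);
        pool.shutdown();
    }
}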
Conclusion
Building systems that handle 100k payments per minute isn't about finding a silver bullet—it's about making hundreds of small, correct architectural decisions. Each optimization, each pattern, each configuration choice compounds.
The architecture described here isn't theoretical. These patterns power real payment systems processing billions in transactions, serving millions of merchants, and standing firm during the most chaotic payment spikes.
The key insight? Embrace the chaos. Design for it. Test against it. And when it arrives, the system will be ready.
Building payment systems at scale? Connect on Twitter or LinkedIn to discuss architecture patterns.
