Picture this: It's salary day. Thousands of companies are processing payroll simultaneously. A payment system needs to handle 100,000 transactions in under 60 seconds—and every single one must be accurate, traceable, and compliant.
This guide covers the architecture, patterns, and optimizations required to build payment systems capable of handling extreme transaction bursts while maintaining sub-second latency and zero data loss.
The Challenge: Payments at Scale
Understanding the requirements is critical:
- 100,000+ transactions in peak minutes
- Millions in currency flowing through the system
- Tens of thousands of merchants depending on reliability
- 99.99% uptime requirement (4.3 minutes downtime per month max)
- Zero tolerance for duplicate payments or data loss
Traditional architectures crumble under these requirements. Here's how to build one that doesn't.
The Architecture: Designed for Chaos
Layer 1: The Ingestion Layer
The first challenge is accepting 100k requests per minute without dropping any:
┌─────────────────┐
│ Load Balancer │
│ (HAProxy) │
└────────┬────────┘
│
┌────────────────────┼────────────────────┐
│ │ │
┌────▼────┐ ┌────▼────┐ ┌────▼────┐
│ API GW │ │ API GW │ │ API GW │
│ Pod 1 │ │ Pod 2 │ │ Pod N │
└────┬────┘ └────┬────┘ └────┬────┘
│ │ │
└────────────────────┼────────────────────┘
│
┌────────▼────────┐
│ Kafka Cluster │
│ (Partitioned) │
└─────────────────┘
Key Design Decisions:
- Stateless API Gateways - Horizontal scaling without coordination
- Immediate Kafka Write - Accept payment intent, respond immediately
- Idempotency Keys - Every request has a unique key to prevent duplicates
@PostMapping("/payments")
public ResponseEntity<PaymentResponse> initiatePayment(
@RequestHeader("Idempotency-Key") String idempotencyKey,
@RequestBody PaymentRequest request
) {
// Check idempotency cache first (Redis)
Optional<PaymentResponse> cached = idempotencyCache.get(idempotencyKey);
if (cached.isPresent()) {
return ResponseEntity.ok(cached.get());
}
// Create payment intent
PaymentIntent intent = PaymentIntent.builder()
.id(UUID.randomUUID().toString())
.idempotencyKey(idempotencyKey)
.amount(request.getAmount())
.merchantId(request.getMerchantId())
.status(PaymentStatus.INITIATED)
.createdAt(Instant.now())
.build();
// Persist to Kafka (async, guaranteed delivery)
kafkaTemplate.send("payment-intents", intent.getMerchantId(), intent);
// Cache the response
PaymentResponse response = new PaymentResponse(intent.getId(), "INITIATED");
idempotencyCache.put(idempotencyKey, response, Duration.ofHours(24));
return ResponseEntity.accepted().body(response);
}
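The "guaranteed delivery" comment above only holds if the producer itself is configured for durability. A minimal sketch of that configuration with Spring Kafka (the broker address, bean name, and serializer choices are assumptions, not from the original):

import java.util.HashMap;
import java.util.Map;
import org.apache.kafka.clients.producer.ProducerConfig;
import org.apache.kafka.common.serialization.StringSerializer;
import org.springframework.context.annotation.Bean;
import org.springframework.context.annotation.Configuration;
import org.springframework.kafka.core.DefaultKafkaProducerFactory;
import org.springframework.kafka.core.ProducerFactory;
import org.springframework.kafka.support.serializer.JsonSerializer;

@Configuration
public class PaymentProducerConfig {

    @Bean
    public ProducerFactory<String, PaymentIntent> paymentProducerFactory() {
        Map<String, Object> props = new HashMap<>();
        props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "kafka:9092");   // assumed address
        // Durability: wait for all in-sync replicas, retry indefinitely, dedupe broker-side
        props.put(ProducerConfig.ACKS_CONFIG, "all");
        props.put(ProducerConfig.ENABLE_IDEMPOTENCE_CONFIG, true);
        props.put(ProducerConfig.RETRIES_CONFIG, Integer.MAX_VALUE);
        props.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG, StringSerializer.class);
        props.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG, JsonSerializer.class);
        return new DefaultKafkaProducerFactory<>(props);
    }
}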
Layer 2: The Processing Engine
This is where the complexity lives. A staged event-driven architecture (SEDA) provides isolation and scalability:
┌──────────────┐ ┌──────────────┐ ┌──────────────┐ ┌──────────────┐
│ Validate │───▶│ Enrich │───▶│ Execute │───▶│ Settle │
│ Stage │ │ Stage │ │ Stage │ │ Stage │
└──────────────┘ └──────────────┘ └──────────────┘ └──────────────┘
│ │ │ │
▼ ▼ ▼ ▼
┌─────────┐ ┌─────────┐ ┌─────────┐ ┌─────────┐
│ Kafka │ │ Kafka │ │ Kafka │ │ Kafka │
│ Topic 1 │ │ Topic 2 │ │ Topic 3 │ │ Topic 4 │
└─────────┘ └─────────┘ └─────────┘ └─────────┘
Each stage is independently scalable. For brevity, the snippet below collapses all four stages into a single consumer; in a production deployment each stage would consume from its own topic (as in the diagram above) and scale on its own:
@Component
public class PaymentProcessor {
    private static final String PARALLELISM = "32";

    @KafkaListener(
        topics = "payment-intents",
        groupId = "payment-processor",
        concurrency = PARALLELISM   // reuse the constant instead of repeating the literal "32"
)
public void processPayment(PaymentIntent intent) {
try {
// Stage 1: Validate
ValidationResult validation = validationService.validate(intent);
if (!validation.isValid()) {
handleValidationFailure(intent, validation);
return;
}
// Stage 2: Enrich with merchant data
EnrichedPayment enriched = enrichmentService.enrich(intent);
// Stage 3: Execute via payment gateway
ExecutionResult result = executeWithRetry(enriched);
// Stage 4: Settle and notify
settlementService.settle(result);
} catch (Exception e) {
handleFailure(intent, e);
}
}
}
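The executeWithRetry call above is referenced but never shown. One plausible shape for it, using Resilience4j's Retry and delegating to the PaymentGatewayService shown later under Pattern 2 (the attempt count and wait duration are illustrative assumptions):

import io.github.resilience4j.retry.Retry;
import io.github.resilience4j.retry.RetryConfig;
import java.time.Duration;

// Inside PaymentProcessor: a hypothetical executeWithRetry(...) helper
private final Retry executionRetry = Retry.of("payment-execution", RetryConfig.custom()
        .maxAttempts(3)                        // illustrative: 1 initial call + 2 retries
        .waitDuration(Duration.ofMillis(200))  // illustrative pause between attempts
        .build());

private ExecutionResult executeWithRetry(EnrichedPayment enriched) {
    // Retry transient gateway failures before the payment is treated as failed
    return Retry.decorateSupplier(executionRetry, () -> paymentGatewayService.execute(enriched)).get();
}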
Layer 3: The Database Layer
For 100k writes per minute, traditional RDBMS patterns fail. A multi-database strategy is essential:
Write Path (Hot):
Payments ──▶ Kafka ──▶ ClickHouse (Analytics)
└──▶ PostgreSQL (Source of Truth)
└──▶ Redis (Real-time Status)
Read Path (Hot):
Status Queries ──▶ Redis Cache (99% hit rate)
│
▼ (cache miss)
PostgreSQL
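A sketch of what the read path looks like in code; the class, cache-key format, TTL, and repository method here are assumptions, not from the original:

import java.time.Duration;
import org.springframework.data.redis.core.RedisTemplate;
import org.springframework.stereotype.Service;

@Service
public class PaymentStatusService {

    private final RedisTemplate<String, String> redis;
    private final PaymentRepository paymentRepository;

    public PaymentStatusService(RedisTemplate<String, String> redis, PaymentRepository paymentRepository) {
        this.redis = redis;
        this.paymentRepository = paymentRepository;
    }

    public String getStatus(String paymentId) {
        String cacheKey = "payment:status:" + paymentId;
        // Hot path: serve from Redis
        String cached = redis.opsForValue().get(cacheKey);
        if (cached != null) {
            return cached;
        }
        // Cache miss: fall back to the source of truth and repopulate the cache
        String status = paymentRepository.findStatusById(paymentId);   // hypothetical repository method
        redis.opsForValue().set(cacheKey, status, Duration.ofMinutes(10));
        return status;
    }
}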
PostgreSQL Optimizations:
-- Partitioned table by date for faster writes and queries
CREATE TABLE payments (
    id UUID NOT NULL,
    merchant_id VARCHAR(50) NOT NULL,
    amount DECIMAL(15, 2) NOT NULL,
    status VARCHAR(20) NOT NULL,
    created_at TIMESTAMPTZ NOT NULL DEFAULT NOW(),
    updated_at TIMESTAMPTZ NOT NULL DEFAULT NOW(),
    -- On a partitioned table the primary key must include the partition key
    PRIMARY KEY (id, created_at)
) PARTITION BY RANGE (created_at);
-- Create monthly partitions
CREATE TABLE payments_2024_12 PARTITION OF payments
FOR VALUES FROM ('2024-12-01') TO ('2025-01-01');
-- Optimized indexes for common query patterns
CREATE INDEX idx_payments_merchant_status
ON payments (merchant_id, status)
WHERE status IN ('PENDING', 'PROCESSING');
CREATE INDEX idx_payments_created_at
ON payments (created_at DESC);
Batch Inserts for Performance:
@Service
public class PaymentBatchWriter {
private static final int BATCH_SIZE = 1000;
private final BlockingQueue<Payment> buffer = new LinkedBlockingQueue<>(10000);
private final ScheduledExecutorService scheduler = Executors.newScheduledThreadPool(2);
@PostConstruct
public void init() {
// Flush every 100ms or when batch is full
scheduler.scheduleAtFixedRate(this::flush, 100, 100, TimeUnit.MILLISECONDS);
}
    public void write(Payment payment) {
        try {
            // put() blocks when the buffer is full, applying backpressure instead of silently dropping payments
            buffer.put(payment);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("Interrupted while buffering payment", e);
        }
        if (buffer.size() >= BATCH_SIZE) {
            flush();
        }
    }
private synchronized void flush() {
List<Payment> batch = new ArrayList<>(BATCH_SIZE);
buffer.drainTo(batch, BATCH_SIZE);
if (!batch.isEmpty()) {
jdbcTemplate.batchUpdate(
"INSERT INTO payments (id, merchant_id, amount, status, created_at) " +
"VALUES (?, ?, ?, ?, ?)",
batch,
BATCH_SIZE,
(ps, payment) -> {
ps.setObject(1, payment.getId());
ps.setString(2, payment.getMerchantId());
ps.setBigDecimal(3, payment.getAmount());
ps.setString(4, payment.getStatus().name());
ps.setTimestamp(5, Timestamp.from(payment.getCreatedAt()));
}
);
}
}
}
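One gap worth closing in the writer above: payments still sitting in the buffer would be lost on an orderly shutdown between flush ticks. A small addition (an assumption, not part of the original class) drains the buffer first:

// Inside PaymentBatchWriter: flush whatever remains before the application stops
@PreDestroy
public void shutdown() throws InterruptedException {
    scheduler.shutdown();
    scheduler.awaitTermination(1, TimeUnit.SECONDS);  // let any in-flight scheduled flush finish
    flush();                                          // drain the remaining buffered payments
}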
Critical Patterns for Reliability
Pattern 1: The Outbox Pattern
Never lose a payment, even if downstream services fail. The outbox pattern ensures atomicity between database updates and event publishing:
@Transactional
public void processPayment(Payment payment) {
// 1. Update payment status
paymentRepository.updateStatus(payment.getId(), PaymentStatus.COMPLETED);
// 2. Write to outbox (same transaction!)
OutboxEvent event = OutboxEvent.builder()
.aggregateId(payment.getId())
.eventType("PAYMENT_COMPLETED")
.payload(objectMapper.writeValueAsString(payment))
.build();
outboxRepository.save(event);
// Transaction commits atomically
}
// Separate process reads outbox and publishes to Kafka
@Scheduled(fixedDelay = 100)
public void publishOutboxEvents() {
List<OutboxEvent> events = outboxRepository.findUnpublished(100);
    for (OutboxEvent event : events) {
        // Block until the broker acknowledges the send before marking the event published;
        // a fire-and-forget send could fail silently and the event would never be delivered.
        // (.join() assumes Spring Kafka 3.x, where send() returns a CompletableFuture.)
        kafkaTemplate.send("payment-events", event.getAggregateId(), event).join();
        outboxRepository.markPublished(event.getId());
}
}
Why this matters: Without the outbox pattern, a database commit could succeed while the Kafka publish fails, leaving the system in an inconsistent state. The outbox guarantees at-least-once delivery of every committed change; paired with the idempotency keys described earlier on the consumer side, the end-to-end behavior is effectively exactly-once.
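For completeness, one possible shape of the outbox row that the transactional code writes and the scheduled publisher reads. This is a sketch: Lombok's @Builder is assumed to match the OutboxEvent.builder() call, and the column choices are illustrative.

import jakarta.persistence.Column;
import jakarta.persistence.Entity;
import jakarta.persistence.Id;
import jakarta.persistence.Table;
import java.time.Instant;
import java.util.UUID;
import lombok.AllArgsConstructor;
import lombok.Builder;
import lombok.Getter;
import lombok.NoArgsConstructor;

@Entity
@Table(name = "outbox_events")
@Getter
@Builder
@NoArgsConstructor
@AllArgsConstructor
public class OutboxEvent {

    @Id
    @Builder.Default
    private UUID id = UUID.randomUUID();      // event id, independent of the payment id

    private String aggregateId;               // payment id, also used as the Kafka partition key

    private String eventType;                 // e.g. PAYMENT_COMPLETED

    @Column(columnDefinition = "text")
    private String payload;                   // serialized payment, written in the same transaction

    @Builder.Default
    private boolean published = false;        // flipped by the scheduled publisher after the Kafka ack

    @Builder.Default
    private Instant createdAt = Instant.now();
}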
Pattern 2: Circuit Breaker for External Services
Payment gateways fail. Plan for it with circuit breakers that prevent cascade failures:
@Service
public class PaymentGatewayService {
    private final CircuitBreaker circuitBreaker;

    public PaymentGatewayService(CircuitBreakerRegistry registry) {
        // Pull the instance configured in application.yml (below); CircuitBreaker.ofDefaults()
        // would silently ignore that configuration
        this.circuitBreaker = registry.circuitBreaker("payment-gateway");
    }
public ExecutionResult execute(EnrichedPayment payment) {
return circuitBreaker.executeSupplier(() -> {
// Try primary gateway
try {
return primaryGateway.process(payment);
} catch (GatewayException e) {
// Fallback to secondary
log.warn("Primary gateway failed, trying secondary", e);
return secondaryGateway.process(payment);
}
});
}
}
Circuit Breaker Configuration:
resilience4j:
circuitbreaker:
instances:
payment-gateway:
sliding-window-size: 100
failure-rate-threshold: 50
wait-duration-in-open-state: 30s
permitted-number-of-calls-in-half-open-state: 10
slow-call-rate-threshold: 80
slow-call-duration-threshold: 2s
How it works:
- Closed state: Normal operation, requests pass through
- Open state: After 50% failures in 100 requests, circuit opens and fails fast
- Half-open state: After 30s, allows 10 test requests to check recovery
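These state transitions are also worth surfacing to on-call. Resilience4j exposes them through an event publisher; the logging shown here is a suggestion, not part of the original:

// Inside PaymentGatewayService, e.g. in the constructor: log every state change of the breaker
circuitBreaker.getEventPublisher()
    .onStateTransition(event -> log.warn("Circuit breaker '{}' transitioned {} -> {}",
        event.getCircuitBreakerName(),
        event.getStateTransition().getFromState(),
        event.getStateTransition().getToState()));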
Pattern 3: Distributed Locking for Idempotency
Prevent double-processing across multiple instances with Redis-based distributed locks:
@Service
public class DistributedLockService {
private final RedisTemplate<String, String> redis;
public boolean tryLock(String key, Duration timeout) {
String lockKey = "lock:" + key;
String lockValue = UUID.randomUUID().toString();
Boolean acquired = redis.opsForValue().setIfAbsent(
lockKey,
lockValue,
timeout
);
return Boolean.TRUE.equals(acquired);
}
public void processWithLock(String paymentId, Runnable task) {
String lockKey = "payment:" + paymentId;
if (!tryLock(lockKey, Duration.ofMinutes(5))) {
log.info("Payment {} already being processed", paymentId);
return;
}
try {
task.run();
} finally {
redis.delete("lock:" + lockKey);
}
}
}
Important considerations:
- Lock timeouts must exceed maximum processing time
- Release locks with a compare-and-delete on the lock token, so an instance whose lock has already expired cannot delete a lock now held by another instance (see the sketch below)
- Consider using Redlock for multi-datacenter deployments
- Always release locks in finally blocks
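A sketch of that compare-and-delete release, assuming tryLock is extended to return or store the token it generated; the Lua script is the standard Redis pattern for a safe unlock:

import java.util.Collections;
import org.springframework.data.redis.core.script.DefaultRedisScript;

// Inside DistributedLockService: delete the lock only if it still holds our token
private static final String UNLOCK_SCRIPT =
    "if redis.call('get', KEYS[1]) == ARGV[1] then return redis.call('del', KEYS[1]) else return 0 end";

public void unlock(String key, String lockValue) {
    String lockKey = "lock:" + key;
    // Executed atomically in Redis, so an expired holder cannot remove someone else's lock
    redis.execute(new DefaultRedisScript<>(UNLOCK_SCRIPT, Long.class),
        Collections.singletonList(lockKey), lockValue);
}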
Pattern 4: Dead Letter Queue for Failed Payments
Never silently drop a payment. Dead letter queues ensure manual intervention for edge cases:
@KafkaListener(topics = "payment-intents")
public void processPayment(
@Payload PaymentIntent intent,
@Header(KafkaHeaders.RECEIVED_TOPIC) String topic,
Acknowledgment ack
) {
try {
doProcess(intent);
ack.acknowledge();
} catch (RetryableException e) {
// Don't ack - Kafka will retry
throw e;
} catch (NonRetryableException e) {
// Send to DLQ for manual intervention
kafkaTemplate.send("payment-intents-dlq", intent);
ack.acknowledge();
// Alert on-call
alertService.sendAlert(
AlertLevel.HIGH,
"Payment sent to DLQ",
Map.of("paymentId", intent.getId(), "error", e.getMessage())
);
}
}
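An alternative to hand-rolling the DLQ publish: Spring Kafka's DefaultErrorHandler can retry retryable errors with backoff and route the rest to a dead letter topic automatically. This is a configuration sketch; the topic choice and backoff values are assumptions:

import org.apache.kafka.common.TopicPartition;
import org.springframework.context.annotation.Bean;
import org.springframework.context.annotation.Configuration;
import org.springframework.kafka.core.KafkaTemplate;
import org.springframework.kafka.listener.DeadLetterPublishingRecoverer;
import org.springframework.kafka.listener.DefaultErrorHandler;
import org.springframework.util.backoff.FixedBackOff;

@Configuration
public class PaymentErrorHandlingConfig {

    @Bean
    public DefaultErrorHandler paymentErrorHandler(KafkaTemplate<Object, Object> template) {
        // Failed records are republished to the DLQ topic, keeping the original partition
        DeadLetterPublishingRecoverer recoverer = new DeadLetterPublishingRecoverer(template,
                (record, ex) -> new TopicPartition("payment-intents-dlq", record.partition()));
        // Retry up to 3 times with a 1s pause before recovering the record to the DLQ
        DefaultErrorHandler handler = new DefaultErrorHandler(recoverer, new FixedBackOff(1000L, 3));
        // Errors that will never succeed skip the retries and go straight to the DLQ
        handler.addNotRetryableExceptions(NonRetryableException.class);
        return handler;
    }
}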
DLQ processing workflow:
- Payment fails with non-retryable error
- Message moves to DLQ topic
- Alert triggers for operations team
- Manual review determines fix (retry, refund, or escalate)
- Fixed payment reprocessed through main pipeline
Performance Optimizations
Optimization 1: Connection Pooling Done Right
hikari:
maximum-pool-size: 20 # CPU cores * 2 + disk spindles
minimum-idle: 10
connection-timeout: 10000
idle-timeout: 300000
max-lifetime: 900000
leak-detection-threshold: 60000
The formula: pool_size = CPU_cores * 2 + effective_spindle_count
For SSDs, the effective spindle count is typically 1, so a 4-core machine should use roughly 9-10 connections. The maximum-pool-size of 20 above is sized for a larger host (around 9-10 cores); resist the urge to raise it further, since oversized pools add contention at the database rather than throughput.
Optimization 2: Async Everywhere
Non-critical operations should never block the payment path:
@Service
public class NotificationService {

    // Dedicated, bounded pool: notification work runs here so it can never block the payment path
    private final ExecutorService executor =
        Executors.newFixedThreadPool(10);

    public CompletableFuture<Void> notifyMerchant(Payment payment) {
        return CompletableFuture.runAsync(() -> {
            // Send webhook
            webhookClient.send(payment.getMerchantWebhook(), payment);
            // Send SMS
            smsService.send(payment.getMerchantPhone(), formatMessage(payment));
            // Send email
            emailService.send(payment.getMerchantEmail(), formatEmail(payment));
        }, executor);
    }
}
Optimization 3: Compression for Kafka
Reduce network bandwidth and improve throughput:
spring:
kafka:
producer:
compression-type: lz4
batch-size: 32768
linger-ms: 5
buffer-memory: 67108864
Compression comparison:
- None: Fastest CPU, highest bandwidth
- LZ4: Best balance of speed and compression
- Snappy: Good compression, moderate CPU
- GZIP: Best compression, highest CPU
Optimization 4: JVM Tuning
For payment systems with strict latency requirements:
JAVA_OPTS="-Xms4g -Xmx4g \
-XX:+UseG1GC \
-XX:MaxGCPauseMillis=100 \
-XX:+ParallelRefProcEnabled \
-XX:+UseStringDeduplication \
-XX:+HeapDumpOnOutOfMemoryError"
Why G1GC:
- Predictable pause times (MaxGCPauseMillis=100ms)
- Good throughput for heap sizes 4GB+
- Concurrent marking reduces stop-the-world pauses
Monitoring & Observability
Key Metrics Dashboard
@Component
public class PaymentMetrics {
private final MeterRegistry registry;
private final Counter paymentsInitiated;
private final Counter paymentsCompleted;
private final Counter paymentsFailed;
private final Timer processingTime;
public PaymentMetrics(MeterRegistry registry) {
this.registry = registry;
this.paymentsInitiated = Counter.builder("payments.initiated")
.description("Total payments initiated")
.register(registry);
this.paymentsCompleted = Counter.builder("payments.completed")
.description("Total payments completed")
.register(registry);
this.paymentsFailed = Counter.builder("payments.failed")
.description("Total payments failed")
.tag("reason", "unknown")
.register(registry);
this.processingTime = Timer.builder("payments.processing.time")
.description("Payment processing duration")
.publishPercentiles(0.5, 0.95, 0.99)
.register(registry);
}
public void recordPaymentProcessed(Payment payment, Duration duration) {
processingTime.record(duration);
if (payment.getStatus() == PaymentStatus.COMPLETED) {
paymentsCompleted.increment();
} else {
paymentsFailed.increment();
}
}
}
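Queue depth belongs on the same dashboard. Micrometer gauges sample a live object on each scrape; a sketch, assuming the batch writer exposes its backlog through a hypothetical bufferSize() method:

// Inside PaymentMetrics (or any configuration class): sample the writer's backlog on scrape
Gauge.builder("payments.queue.depth", paymentBatchWriter, writer -> writer.bufferSize())
    .description("Payments buffered and awaiting batch insert")
    .register(registry);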
Alert Rules
groups:
- name: payment_alerts
rules:
- alert: HighPaymentFailureRate
expr: |
rate(payments_failed_total[5m]) /
rate(payments_initiated_total[5m]) > 0.05
for: 2m
labels:
severity: critical
annotations:
summary: "Payment failure rate above 5%"
- alert: PaymentProcessingLatency
expr: |
          histogram_quantile(0.99, sum(rate(payments_processing_time_seconds_bucket[5m])) by (le)) > 2
for: 5m
labels:
severity: warning
annotations:
summary: "P99 payment latency above 2 seconds"
- alert: KafkaConsumerLag
expr: kafka_consumer_lag > 10000
for: 3m
labels:
severity: critical
annotations:
summary: "Kafka consumer lag exceeds 10k messages"
Expected Results
When this architecture is properly implemented:
| Metric | Target | Achievable |
|---|---|---|
| Throughput | 100k/min | 150k+/min |
| P99 Latency | < 2s | ~800ms |
| Success Rate | 99.9% | 99.95%+ |
| Data Loss | 0 | 0 |
| Monthly Downtime | < 5 min | < 3 min |
Key Takeaways
- Accept fast, process async - Decouple ingestion from processing to handle bursts
- Partition everything - Databases, queues, services; parallelism is essential for scale
- Design for failure - Circuit breakers, retries, and dead letter queues are not optional
- Batch aggressively - Individual operations don't scale; batch writes, batch reads
- Monitor obsessively - Problems invisible to metrics are problems waiting to explode
- Test at scale - Load test with realistic data volumes before production (see the sketch below)
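The load-test takeaway deserves a concrete starting point. Below is a minimal burst-test harness sketch using java.net.http; the endpoint URL, payload, and thread count are assumptions, and a dedicated tool such as Gatling or k6 is the better choice for sustained, realistic load:

import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.time.Duration;
import java.util.UUID;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

public class PaymentBurstTest {

    public static void main(String[] args) throws Exception {
        HttpClient client = HttpClient.newBuilder().connectTimeout(Duration.ofSeconds(2)).build();
        ExecutorService pool = Executors.newFixedThreadPool(200);   // assumed client-side concurrency
        int total = 100_000;
        CountDownLatch done = new CountDownLatch(total);
        long start = System.nanoTime();

        for (int i = 0; i < total; i++) {
            pool.submit(() -> {
                HttpRequest request = HttpRequest.newBuilder()
                        .uri(URI.create("http://localhost:8080/payments"))   // assumed endpoint
                        .header("Idempotency-Key", UUID.randomUUID().toString())
                        .header("Content-Type", "application/json")
                        .POST(HttpRequest.BodyPublishers.ofString(
                                "{\"amount\":100.00,\"merchantId\":\"m-123\"}"))
                        .build();
                try {
                    client.send(request, HttpResponse.BodyHandlers.discarding());
                } catch (Exception e) {
                    // a real harness would count and classify failures here
                } finally {
                    done.countDown();
                }
            });
        }

        done.await();
        System.out.printf("Sent %d requests in %d ms%n", total, (System.nanoTime() - start) / 1_000_000);
        pool.shutdown();
    }
}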
Conclusion
Building systems that handle 100k payments per minute isn't about finding a silver bullet—it's about making hundreds of small, correct architectural decisions. Each optimization, each pattern, each configuration choice compounds.
The architecture described here isn't theoretical. These patterns power real payment systems processing billions in transactions, serving millions of merchants, and standing firm during the most chaotic payment spikes.
The key insight? Embrace the chaos. Design for it. Test against it. And when it arrives, the system will be ready.
Building payment systems at scale? Connect on Twitter or LinkedIn to discuss architecture patterns.
