Architecture of a Fraud Detection Platform: How BigShield Works Under the Hood
A system design deep-dive into building a real-time fraud detection platform: tiered signal processing, scoring pipelines, Redis caching, BullMQ job queues, and database design for sub-200ms email validation.
Building a fraud detection system that responds in under 200ms while running 20+ validation signals is an engineering challenge that touches on distributed systems, caching strategies, async processing, and scoring algorithms. In this article, we are pulling back the curtain on BigShield's architecture to show how it all fits together.
Whether you are building your own email validation service, designing a scoring pipeline for another domain, or just curious about how real-time fraud detection works at the system level, this deep-dive covers the design decisions, trade-offs, and lessons we have learned. If you are looking for more hands-on integration guidance, check out our developer's guide to building email validation.
The High-Level Architecture
BigShield's architecture follows a pattern we call "tiered signal processing." The core idea is simple: not all validation signals need to (or can) run at the same speed. Some are fast enough for synchronous execution. Others require network calls, third-party lookups, or complex analysis that takes seconds. The architecture separates these into two tiers and merges their results through a unified scoring pipeline.
Here is the high-level data flow:
```text
Client Request
      |
      v
API Gateway (Next.js API Routes)
      |
      +---> Authentication + Rate Limiting (Upstash Redis)
      |
      v
Tier 1 Signal Processing (synchronous, <100ms)
      |
      +---> Immediate Score Calculation
      |          |
      |          +---> If score <30 or >85: return immediately (decisive)
      |          |
      |          +---> If score 30-85: enqueue Tier 2
      |
      v
Response to Client (Tier 1 score + signals)
      |
      v  (async, via webhook or polling)
Tier 2 Signal Processing (BullMQ Worker on Fly.io)
      |
      +---> Deep signals (SMTP probing, domain analysis, etc.)
      |
      v
Final Score Calculation + Storage (Supabase PostgreSQL)
      |
      v
Webhook Notification to Client (optional)
```

This tiered approach is the foundation of the entire system. It lets us return a useful score in under 100ms for the majority of requests while still running deep analysis for ambiguous cases.
Tier 1: Synchronous Signal Processing
Tier 1 signals must complete within a strict budget of 100ms total. They run synchronously in the request path, and their results are returned directly to the client. Every Tier 1 signal relies on local computation or cached data, never on external network calls during the request.
What Runs in Tier 1
- Email syntax validation: RFC 5322 compliance, local part analysis, domain extraction. Pure computation, sub-millisecond.
- Burner domain check: Compare the email domain against a local database of 945+ known disposable email providers. We store these in a Set in memory for O(1) lookup.
- Domain age lookup (cached): Domain registration date from WHOIS, pre-cached in Redis. A cache miss does not block the response; it queues a background refresh.
- Email pattern analysis: Statistical analysis of the local part (entropy, character distribution, common patterns like "test123" or keyboard walks). Pure computation.
- IP reputation (cached): Lookup against a cached IP reputation database. Identifies known datacenter IPs, VPNs, proxies, and Tor exit nodes.
- Device fingerprint match (cached): Compare the submitted device fingerprint hash against known fraud fingerprints stored in Redis.
- Rate pattern analysis: Check recent signup velocity from the same IP, email domain, or device fingerprint using Redis counters.
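As an illustration of the burner domain check above, the in-memory Set lookup can be sketched like this (the example domains stand in for the real 945+ entry list):

```typescript
// In-memory set of disposable email domains, loaded at startup.
// Entries here are illustrative; the real list has 945+ domains.
const burnerDomains = new Set<string>([
  'mailinator.com',
  'guerrillamail.com',
  '10minutemail.com',
]);

// Extract the domain part of an email address, normalized to lowercase.
function extractDomain(email: string): string {
  return email.slice(email.lastIndexOf('@') + 1).toLowerCase();
}

// O(1) membership test against the in-memory set.
function isBurnerDomain(email: string): boolean {
  return burnerDomains.has(extractDomain(email));
}
```

Because the list changes only daily, loading it into process memory at startup is cheaper than a Redis round-trip per request.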
The Speed Budget
With 7+ signals running in Tier 1, managing the time budget is critical. We run all signals concurrently with Promise.allSettled and per-signal timeouts, not sequentially:
```typescript
async function runTier1Signals(
  request: ValidationRequest
): Promise<SignalResult[]> {
  const signals = [
    checkSyntax(request.email),
    checkBurnerDomain(request.email),
    checkDomainAge(request.email),
    analyzeEmailPattern(request.email),
    checkIPReputation(request.ip),
    checkDeviceFingerprint(request.fingerprint),
    analyzeRatePatterns(request.ip, request.email),
  ];

  // Race all signals against a timeout
  const results = await Promise.allSettled(
    signals.map(signal =>
      Promise.race([
        signal,
        new Promise<SignalResult>((_, reject) =>
          setTimeout(() => reject(new Error('timeout')), 90)
        ),
      ])
    )
  );

  return results
    .filter((r): r is PromiseFulfilledResult<SignalResult> =>
      r.status === 'fulfilled'
    )
    .map(r => r.value);
}
```

Notice the 90ms timeout per signal. If any single signal is slow (a rare Redis hiccup, for example), it gets dropped from the score calculation rather than delaying the response. The scoring algorithm handles missing signals gracefully by adjusting confidence levels.
The Scoring Pipeline
Raw signals are not useful to API consumers. They need a single, actionable score. BigShield's scoring pipeline takes individual signal results and produces a score from 0 (definitely fraudulent) to 100 (definitely legitimate).
How Scoring Works
Every email starts with a base score of 50 (neutral). Each signal contributes a score_impact (positive or negative) weighted by its confidence level:
```typescript
interface SignalResult {
  name: string;
  score_impact: number; // -30 to +20 typically
  confidence: number;   // 0.0 to 1.0
  details: Record<string, unknown>;
}

function calculateScore(signals: SignalResult[]): number {
  const BASE_SCORE = 50;
  const totalImpact = signals.reduce((sum, signal) => {
    return sum + signal.score_impact * signal.confidence;
  }, 0);
  // Clamp to 0-100
  return Math.max(0, Math.min(100, BASE_SCORE + totalImpact));
}
```

The confidence field is critical. A burner domain match has near-perfect confidence (0.95+). An email pattern that "looks suspicious" might only have a confidence of 0.4. By weighting impact by confidence, we avoid over-penalizing ambiguous signals while still letting strong signals drive the score decisively.
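To make the weighting concrete, here is a worked example with two hypothetical signals (the impact and confidence values are illustrative): a high-confidence burner match and a weak pattern signal pull the base score of 50 well into decisive-fraud territory.

```typescript
// Two hypothetical signal results (values illustrative).
const signals = [
  { score_impact: -30, confidence: 0.95 }, // burner domain match
  { score_impact: -10, confidence: 0.4 },  // mildly suspicious pattern
];

// Same math as the scoring function: base 50, impact weighted by confidence.
const score = Math.max(
  0,
  Math.min(100, 50 + signals.reduce((sum, s) => sum + s.score_impact * s.confidence, 0))
);
// 50 - (30 * 0.95) - (10 * 0.4) = 50 - 28.5 - 4 = 17.5
```

A score of 17.5 falls below the fraud threshold of 30, so this request would return immediately without Tier 2 processing.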
The Decisive Score Optimization
This is one of our most impactful architectural decisions. After Tier 1 scoring, if the score is below 30 (very likely fraudulent) or above 85 (very likely legitimate), we skip Tier 2 entirely and return immediately. We call these "decisive scores."
In practice, about 70% of requests produce decisive scores from Tier 1 alone. This means 70% of API calls never touch the async pipeline, dramatically reducing infrastructure costs and latency. The remaining 30% of ambiguous cases get the full Tier 2 treatment.
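The check itself is trivial; a sketch using the thresholds above (the constant and function names are ours):

```typescript
const FRAUD_THRESHOLD = 30; // below this: very likely fraudulent
const LEGIT_THRESHOLD = 85; // above this: very likely legitimate

// Decisive scores skip Tier 2 entirely and return immediately;
// everything in between is enqueued for deep analysis.
function isDecisive(score: number): boolean {
  return score < FRAUD_THRESHOLD || score > LEGIT_THRESHOLD;
}
```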
Tier 2: Asynchronous Deep Signals
Tier 2 signals require network calls, longer computation, or third-party API lookups. They run asynchronously on a dedicated worker process.
Job Delivery with QStash
When the API determines that Tier 2 processing is needed, it publishes a job via Upstash QStash. We chose QStash over direct BullMQ queue publishing for one key reason: the API runs on Vercel's serverless functions, which cannot maintain persistent Redis connections for BullMQ. QStash provides HTTP-based job delivery that works perfectly in serverless environments.
```typescript
async function enqueueTier2(
  validationId: string,
  request: ValidationRequest,
  tier1Score: number,
  tier1Signals: SignalResult[]
): Promise<void> {
  await qstash.publishJSON({
    url: `${WORKER_URL}/api/process-tier2`,
    body: {
      validationId,
      email: request.email,
      ip: request.ip,
      fingerprint: request.fingerprint,
      tier1Score,
      tier1Signals,
    },
    retries: 3,
    timeout: '30s',
  });
}
```

BullMQ Worker on Fly.io
The QStash message lands on our worker service running on Fly.io. The worker uses BullMQ for internal job management, retry logic, and concurrency control. Why a separate worker on Fly.io instead of more serverless functions? Three reasons:
- Persistent connections. SMTP probing and certain DNS lookups require long-lived TCP connections that do not fit the serverless model.
- Concurrency control. BullMQ gives us precise control over how many jobs run simultaneously, which matters when we are making external network calls that could be rate-limited.
- Cost. For sustained background processing, a dedicated VM on Fly.io is significantly cheaper than thousands of serverless function invocations.
What Runs in Tier 2
- SMTP verification: Connect to the mail server and verify the mailbox exists without sending an email. This requires actual TCP connections and can take 2-10 seconds.
- Deep domain analysis: Full WHOIS lookup, SSL certificate analysis, web presence check, catch-all domain detection.
- Social profile correlation: Check if the email address has associated social media profiles or public web presence (Gravatar, GitHub, LinkedIn via public APIs).
- Campaign attribution: Compare the signup against recently detected fraud campaigns to see if it matches a known attack pattern. For more on how this works, see our article on pattern detection at scale.
- Historical pattern matching: Query the database for previous validations with similar characteristics.
The Caching Layer: Upstash Redis
Redis is the backbone of BigShield's performance. We use Upstash Redis (serverless Redis with HTTP API) for three distinct purposes:
1. Signal Caching
Expensive lookups (domain WHOIS, IP reputation, DNS records) are cached with TTLs appropriate to their volatility:
```typescript
const CACHE_TTLS = {
  domainAge: 86400 * 7,      // 7 days - domain age rarely changes
  ipReputation: 3600,        // 1 hour - IP reputation can shift
  dnsRecords: 3600,          // 1 hour - DNS changes propagate slowly
  burnerDomainList: 86400,   // 1 day - we refresh the list daily
  smtpVerification: 86400,   // 1 day - mailbox status is fairly stable
  socialProfiles: 86400 * 3, // 3 days - social presence changes slowly
} as const;
```

Cache keys are structured hierarchically: `bs:cache:domain_age:example.com`, `bs:cache:ip_rep:203.0.113.42`. This makes it easy to invalidate entire signal categories when needed.
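A small helper capturing that key scheme might look like this (a sketch; the helper name is ours):

```typescript
// Build hierarchical cache keys like "bs:cache:domain_age:example.com".
// A shared prefix per signal category makes it possible to match and
// invalidate a whole category (e.g. "bs:cache:ip_rep:*") at once.
function cacheKey(signal: string, subject: string): string {
  return `bs:cache:${signal}:${subject.toLowerCase()}`;
}
```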
2. Rate Limiting
API rate limiting uses Redis sliding window counters. Each API key has a per-minute and per-day limit. We also track per-IP and per-email-domain rates for fraud detection signals:
```typescript
async function checkRateLimit(
  key: string,
  limit: number,
  windowSeconds: number
): Promise<{ allowed: boolean; remaining: number }> {
  const now = Date.now();
  const windowKey = `bs:rate:${key}`;

  const pipe = redis.pipeline();
  pipe.zremrangebyscore(windowKey, 0, now - windowSeconds * 1000);
  pipe.zadd(windowKey, { score: now, member: `${now}-${Math.random()}` });
  pipe.zcard(windowKey);
  pipe.expire(windowKey, windowSeconds);
  const results = await pipe.exec();

  const count = results[2] as number;
  return {
    allowed: count <= limit,
    remaining: Math.max(0, limit - count),
  };
}
```

3. Real-Time Counters for Fraud Signals
Beyond rate limiting, Redis powers the "signup velocity" signals. We maintain sliding window counters for signups per IP subnet, per email domain, and per device fingerprint. A sudden spike in signups from a single subnet within a 5-minute window is a strong fraud signal, and Redis makes this check a single fast operation.
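In production this uses the same Redis sorted-set pipeline shown earlier; as a sketch of the underlying logic only, here is an in-memory analogue of a sliding-window counter:

```typescript
// Sliding-window counter: count events seen within the last windowMs.
// Illustrative only; the production version lives in Redis sorted sets.
class SlidingWindowCounter {
  private events: number[] = [];
  constructor(private windowMs: number) {}

  record(now: number): void {
    this.events.push(now);
  }

  count(now: number): number {
    // Drop events that have fallen out of the window, then count the rest.
    this.events = this.events.filter(t => t > now - this.windowMs);
    return this.events.length;
  }
}

// e.g. track signups per IP subnet over a 5-minute window
const subnetSignups = new SlidingWindowCounter(5 * 60 * 1000);
```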
Database Design: Supabase PostgreSQL
All persistent data lives in Supabase PostgreSQL. The schema uses a consistent email_validator_ prefix for all tables. Here are the core tables and their purposes:
Core Tables
```text
-- API keys with hashed secrets
email_validator_api_keys
  id, user_id, key_hash (SHA-256), key_prefix,
  tier, rate_limit, created_at, last_used_at

-- Validation results
email_validator_validations
  id, api_key_id, email_hash, domain,
  tier1_score, final_score, is_decisive,
  signals (JSONB), created_at

-- Known burner domains
email_validator_burner_domains
  id, domain, source, added_at, confidence

-- Fraud campaigns
email_validator_campaigns
  id, fingerprint_hash, first_seen, last_seen,
  signup_count, signals (JSONB), status

-- Device fingerprints linked to accounts
email_validator_fingerprints
  id, fingerprint_hash, associated_emails (JSONB),
  first_seen, last_seen, fraud_score

-- Domain intelligence cache
email_validator_domains
  id, domain, mx_records, spf_record, dkim_present,
  domain_age_days, catch_all, last_checked

-- User accounts and billing
email_validator_users
  id, email, plan_tier, stripe_customer_id,
  created_at, usage_this_month
```

Why JSONB for Signals
Validation signals are stored as JSONB rather than in normalized tables. This is a deliberate choice. The signal schema evolves frequently as we add new signals, adjust weights, or change confidence calculations. JSONB lets us change the signal format without database migrations. It also makes it simple to store the complete signal snapshot for each validation, which is invaluable for debugging and for training our scoring model.
Indexing Strategy
The validations table gets queried heavily for analytics and pattern detection. Key indexes include:
- Composite index on (api_key_id, created_at) for customer dashboards
- Index on domain for domain-level analytics
- Index on email_hash for repeat-validation lookups
- GIN index on the signals JSONB column for signal-specific queries
- Partial index on (created_at) WHERE final_score < 30 for fraud analysis
API Key Authentication
API keys follow a specific format: ev_live_<random> for production and ev_test_<random> for testing. We store only the SHA-256 hash of the key in the database, not the raw key. The key prefix (first 8 characters) is stored separately for identification in logs and customer support without exposing the full key.
```typescript
import { createHash } from 'crypto';

function hashApiKey(key: string): string {
  return createHash('sha256').update(key).digest('hex');
}

async function authenticateRequest(
  authHeader: string
): Promise<ApiKey | null> {
  const key = authHeader.replace('Bearer ', '');
  if (!key.startsWith('ev_live_') && !key.startsWith('ev_test_')) {
    return null;
  }

  const hash = hashApiKey(key);
  const apiKey = await db
    .from('email_validator_api_keys')
    .select('*')
    .eq('key_hash', hash)
    .single();

  return apiKey.data;
}
```

Deployment Topology
The complete deployment topology looks like this:
```text
Vercel (Edge + Serverless)
  +-- Next.js App (Dashboard + API Routes)
  +-- Middleware (auth, rate limiting via Upstash)

Upstash
  +-- Redis (caching, rate limiting, counters)
  +-- QStash (job delivery to worker)

Fly.io
  +-- BullMQ Worker (Tier 2 signal processing)

Supabase
  +-- PostgreSQL (persistent storage)
  +-- Auth (user authentication for dashboard)
```

This topology is optimized for cost at our current scale. The API and dashboard run on Vercel's free/pro tier. Redis on Upstash scales per-request. The worker on Fly.io runs on a single small VM. And Supabase provides a generous free tier for PostgreSQL. The entire platform runs for under $50/month until you hit significant traffic.
Lessons Learned
A few hard-won lessons from building this architecture:
- Start with decisive scores. The optimization of skipping Tier 2 for clear-cut cases saved us more than any caching improvement. Most signups are obviously good or obviously bad.
- Cache aggressively, invalidate carefully. A stale domain age is fine. A stale IP reputation could miss a newly compromised range. Match your TTLs to the data's real volatility.
- Use JSONB early. We started with normalized signal tables and migrated to JSONB within a month. The flexibility is worth the trade-off of slightly harder querying.
- Separate the worker. Running async processing on the same serverless platform as the API seemed simpler at first, but the connection management and concurrency requirements pushed us to a dedicated worker quickly.
- Measure everything. We track p50, p95, and p99 latencies for every signal individually. When a signal gets slow, we know immediately and can adjust its timeout or move it between tiers.
What Is Next
We are actively working on a few architectural improvements: a dedicated ML scoring model trained on our historical data, geographic distribution of the worker for lower-latency SMTP probing, and a streaming API for real-time score updates as Tier 2 signals complete.
If you are building something similar, or if you would rather use a fraud detection platform than build one, BigShield's API is ready to use today. The architecture described in this article powers every request, and you get the benefit of it with a single API call. Check it out at bigshield.app.