Developer · 12 min read · June 9, 2026

How Pattern Detection Works at Scale: Inside BigShield's Email Analysis Engine

A deep technical look at email pattern detection: entropy analysis, n-gram frequency, Markov chain scoring, dictionary attacks, homoglyph detection, and normalization at scale.

The Problem With Simple Email Validation

Most email validation stops at format checking and DNS lookups. Is the syntax valid? Does the domain have MX records? Can we deliver to it? These checks are necessary but nowhere near sufficient. An email like xk7mq92z@gmail.com passes every basic validation check. It has valid syntax, Gmail's MX records resolve fine, and the address is technically deliverable. But it is almost certainly not a real person's primary email address.

At BigShield, we go much deeper. Our pattern detection engine analyzes the structure, entropy, and statistical properties of the email local part (everything before the @) to determine how likely it is to belong to a real human. This article explains exactly how that works.

Email Entropy Analysis

Shannon entropy measures the randomness of a string. Real human email addresses tend to have moderate entropy because they are derived from names, words, or meaningful combinations. Random strings have high entropy. Patterned strings (like "aaaaaa") have low entropy.

The formula for Shannon entropy of a string is:

H(X) = -SUM(p(x) * log2(p(x))) for each unique character x

where p(x) = frequency of character x / total length

Here is a practical implementation:

function shannonEntropy(str: string): number {
  const len = str.length;
  if (len === 0) return 0;

  const freq = new Map<string, number>();
  for (const ch of str) {
    freq.set(ch, (freq.get(ch) || 0) + 1);
  }

  let entropy = 0;
  for (const count of freq.values()) {
    const p = count / len;
    entropy -= p * Math.log2(p);
  }

  return entropy;
}

// Examples:
// "john.smith"      -> ~3.12 (moderate, typical for real names)
// "xk7mq92z"        -> ~3.00 (similar raw entropy but different distribution)
// "aaaaaaa"          -> 0.00 (zero entropy, repetitive)
// "j8k2m5n9p3q7"    -> ~3.58 (high entropy, likely random)

Raw Shannon entropy alone is not enough to distinguish random from human-generated strings because the ranges overlap. A name like "alexander" and a random string like "xpqmztlkr" can have similar entropy values. That is why we need additional analysis layers.
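To make the overlap concrete, here is a quick check (shannonEntropy is restated from above so the snippet runs on its own):

```typescript
// Restated compactly so this snippet is self-contained; identical in
// behavior to the shannonEntropy function defined earlier.
function shannonEntropy(str: string): number {
  const len = str.length;
  if (len === 0) return 0;
  const freq = new Map<string, number>();
  for (const ch of str) freq.set(ch, (freq.get(ch) ?? 0) + 1);
  let entropy = 0;
  for (const count of freq.values()) {
    const p = count / len;
    entropy -= p * Math.log2(p);
  }
  return entropy;
}

// Both strings are 9 characters. "alexander" repeats 'a' and 'e';
// "xpqmztlkr" has no repeats. The raw values land in the same band,
// which is why entropy alone cannot separate them.
const real = shannonEntropy('alexander');   // ~2.73
const random = shannonEntropy('xpqmztlkr'); // ~3.17
```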

Normalized Entropy

We normalize entropy by the theoretical maximum for the string's length and character set. This gives us a ratio between 0 and 1 that is more useful for comparison:

function normalizedEntropy(str: string): number {
  const entropy = shannonEntropy(str);
  const uniqueChars = new Set(str).size;
  const maxEntropy = Math.log2(uniqueChars);
  if (maxEntropy === 0) return 0;
  return entropy / maxEntropy;
}

// "john.smith"    -> ~0.98 (one repeated char pulls it just below the max)
// "xk7mq92z"      -> 1.00 (no repeated chars, so the ratio hits the ceiling)
// "j8k2m5n9p3q7"  -> 1.00 (no repeats across a wide char set)

Note that any string with no repeated characters scores exactly 1.0 under this normalization, so the ratio is only meaningful in combination with the character-set check. Normalized entropy above 0.95 over a set that mixes digits and lowercase letters is a moderate fraud signal; real email addresses occasionally hit this threshold, but it is uncommon.
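As an illustration of how this threshold can be combined with the character-set condition, here is a minimal sketch of a boolean signal. highEntropySignal is a hypothetical helper (not part of BigShield's published API), and the entropy computation is restated compactly so the snippet runs standalone:

```typescript
// Compact restatement of normalizedEntropy from the article, so this
// snippet is self-contained.
function normalizedEntropy(str: string): number {
  const len = str.length;
  if (len === 0) return 0;
  const freq = new Map<string, number>();
  for (const ch of str) freq.set(ch, (freq.get(ch) ?? 0) + 1);
  let entropy = 0;
  for (const count of freq.values()) {
    const p = count / len;
    entropy -= p * Math.log2(p);
  }
  const maxEntropy = Math.log2(freq.size);
  return maxEntropy === 0 ? 0 : entropy / maxEntropy;
}

// Hypothetical weak signal: very high normalized entropy over a local
// part that mixes digits and lowercase letters.
function highEntropySignal(localPart: string): boolean {
  const mixed = /[0-9]/.test(localPart) && /[a-z]/.test(localPart);
  return mixed && normalizedEntropy(localPart) > 0.95;
}
```

On its own this remains a weak signal; in the pipeline described below it is one weighted input among several, never a sole decision-maker.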

N-Gram Frequency Analysis

N-gram analysis examines the frequency of character sequences in the email local part and compares them against expected frequencies in natural language. English text, for example, has very predictable bigram (2-character) and trigram (3-character) frequencies. The pair "th" is extremely common, while "qx" is virtually nonexistent.

Our approach:

  1. Extract all bigrams and trigrams from the email local part
  2. Look up each n-gram's frequency in a pre-computed language model
  3. Calculate the average "naturalness" score

// Pre-computed from a corpus of 10M+ real email addresses
// Values represent relative frequency (0-1 scale)
const BIGRAM_FREQ: Record<string, number> = {
  'th': 0.89, 'he': 0.87, 'in': 0.84, 'er': 0.83,
  'an': 0.82, 'on': 0.78, 'en': 0.76, 'at': 0.74,
  'es': 0.72, 'al': 0.71, 'st': 0.70, 'ar': 0.69,
  // ... thousands more entries
  'qx': 0.01, 'zq': 0.01, 'xk': 0.02, 'jq': 0.01,
};

function ngramScore(localPart: string): number {
  // Remove dots, underscores, hyphens, and digits for analysis
  const cleaned = localPart.replace(/[._\-0-9]/g, '').toLowerCase();
  if (cleaned.length < 3) return 0.5; // too short to analyze

  let totalScore = 0;
  let count = 0;

  // Bigram analysis
  for (let i = 0; i < cleaned.length - 1; i++) {
    const bigram = cleaned.slice(i, i + 2);
    const freq = BIGRAM_FREQ[bigram] ?? 0.15; // default for unknown
    totalScore += freq;
    count++;
  }

  return count > 0 ? totalScore / count : 0.5;
}

// "johnsmith"   -> ~0.72 (natural English bigrams)
// "xkmqzplw"   -> ~0.11 (unnatural character combinations)
// "alexander"   -> ~0.68 (natural)
// "j8k2m5n9"   -> ~0.15 (after stripping digits, barely analyzable)

Emails with an n-gram score below 0.25 are almost never legitimate. The false positive rate at this threshold is extremely low because even unusual real names (think Slavic or East Asian romanizations) tend to have higher bigram naturalness than truly random strings.

Markov Chain Scoring

N-gram frequency tells us about local character patterns, but Markov chain scoring captures the transition probabilities, the likelihood that one character follows another in a natural context. We build a first-order Markov model from a large corpus of real email local parts.

// Transition probability matrix
// transitionProb[a][b] = P(next char is b | current char is a)
type TransitionMatrix = Record<string, Record<string, number>>;

function buildTransitionMatrix(
  corpus: string[]
): TransitionMatrix {
  const counts: Record<string, Record<string, number>> = {};
  const totals: Record<string, number> = {};

  for (const email of corpus) {
    const local = email.split('@')[0].toLowerCase();
    for (let i = 0; i < local.length - 1; i++) {
      const curr = local[i];
      const next = local[i + 1];

      if (!counts[curr]) counts[curr] = {};
      counts[curr][next] = (counts[curr][next] || 0) + 1;
      totals[curr] = (totals[curr] || 0) + 1;
    }
  }

  // Normalize to probabilities
  const matrix: TransitionMatrix = {};
  for (const [curr, nextCounts] of Object.entries(counts)) {
    matrix[curr] = {};
    for (const [next, count] of Object.entries(nextCounts)) {
      matrix[curr][next] = count / totals[curr];
    }
  }

  return matrix;
}

function markovScore(
  localPart: string,
  matrix: TransitionMatrix
): number {
  const cleaned = localPart.toLowerCase();
  if (cleaned.length < 2) return 0.5;

  let logProb = 0;
  let transitions = 0;

  for (let i = 0; i < cleaned.length - 1; i++) {
    const curr = cleaned[i];
    const next = cleaned[i + 1];

    const prob = matrix[curr]?.[next] ?? 0.001; // smoothing
    logProb += Math.log(prob);
    transitions++;
  }

  // Average log probability per transition
  const avgLogProb = logProb / transitions;

  // Map to 0-1 scale (typical range: -7 to -1)
  // Higher is more natural
  const normalized = Math.max(0, Math.min(1,
    (avgLogProb + 7) / 6
  ));

  return normalized;
}

The pseudocode for the full scoring pipeline:

// Pseudocode: Email pattern scoring pipeline
//
// INPUT: email local part (string before @)
// OUTPUT: pattern_score (0 to 1, where 1 = most natural)
//
// 1. NORMALIZE the input (lowercase, strip separators)
// 2. COMPUTE Shannon entropy and normalized entropy
// 3. EXTRACT bigrams and trigrams
// 4. COMPUTE n-gram naturalness score against language model
// 5. COMPUTE Markov chain transition probability score
// 6. CHECK against known dictionary attack patterns
// 7. CHECK for homoglyphs
// 8. COMBINE scores with learned weights
//
// Time complexity: O(n) where n = length of local part
// Space complexity: O(1) (models are pre-loaded, constant size)
// Latency: < 2ms for pattern analysis alone

Dictionary Attack Detection

A dictionary attack on your signup form looks like this: the attacker takes a list of common first names and last names, combines them with predictable patterns, and generates thousands of realistic-looking email addresses.

Common patterns include:

  • firstname.lastname
  • firstnamelastname
  • firstinitiallastname
  • firstname.lastname## (with 1-3 trailing digits)
  • lastname.firstname

Individually, each of these looks perfectly legitimate. The detection happens at the cohort level, by analyzing signup batches:

interface SignupBatch {
  emails: string[];
  window: { start: Date; end: Date };
}

interface DictionaryAttackResult {
  isAttack: boolean;
  confidence: number;
  patternType: string;
  matchingEmails: string[];
}

function detectDictionaryAttack(
  batch: SignupBatch,
  threshold: number = 0.7
): DictionaryAttackResult {
  const locals = batch.emails.map(e =>
    e.split('@')[0].toLowerCase()
  );

  // Check for common structural patterns
  const patterns = {
    'first.last': /^[a-z]{2,15}\.[a-z]{2,15}$/,
    'firstlast': /^[a-z]{4,20}$/,
    'first.last##': /^[a-z]{2,15}\.[a-z]{2,15}\d{1,3}$/,
    'f.last': /^[a-z]\.[a-z]{2,15}$/,
    'flast##': /^[a-z][a-z]{2,15}\d{1,4}$/,
  };

  let bestMatch = { pattern: '', count: 0, matches: [] as string[] };

  for (const [name, regex] of Object.entries(patterns)) {
    const matching = locals.filter(l => regex.test(l));
    if (matching.length > bestMatch.count) {
      bestMatch = { pattern: name, count: matching.length, matches: matching };
    }
  }

  const ratio = locals.length > 0 ? bestMatch.count / locals.length : 0;

  // Also check: do the names come from a common name list?
  // (checkAgainstNameCorpus is an internal lookup against a first/last
  // name corpus; it returns how many local parts contain a known name.)
  const nameListHits = checkAgainstNameCorpus(bestMatch.matches);
  const nameListRatio = bestMatch.matches.length > 0
    ? nameListHits / bestMatch.matches.length
    : 0;

  // High structural pattern match + high name list overlap = dictionary attack
  const confidence = (ratio * 0.6) + (nameListRatio * 0.4);

  return {
    isAttack: confidence > threshold,
    confidence,
    patternType: bestMatch.pattern,
    matchingEmails: bestMatch.matches,
  };
}

The time complexity here is O(n * p) where n is the number of emails in the batch and p is the number of patterns. For practical batch sizes (up to 10,000) and our current pattern set (~20 patterns), this runs in under 50ms.

Homoglyph Detection

Homoglyphs are characters from different scripts that look identical or nearly identical to ASCII characters. The Cyrillic "a" (U+0430) looks exactly like the Latin "a" (U+0061) in most fonts. Fraudsters use this to create email addresses that appear to be duplicates but are technically different strings.

For example, john@gmail.com with a Cyrillic "o" (U+043E) looks identical to the legitimate john@gmail.com but is a different email address entirely.

// Common homoglyph mappings
const HOMOGLYPHS: Record<string, string> = {
  // Cyrillic
  '\u0430': 'a', // а -> a
  '\u0435': 'e', // е -> e
  '\u043E': 'o', // о -> o
  '\u0440': 'p', // р -> p
  '\u0441': 'c', // с -> c
  '\u0443': 'y', // у -> y
  '\u0445': 'x', // х -> x
  '\u04BB': 'h', // һ -> h
  // Greek
  '\u03B1': 'a', // α -> a
  '\u03B5': 'e', // ε -> e
  '\u03BF': 'o', // ο -> o
  '\u03C1': 'p', // ρ -> p
  // Latin extended
  '\u0101': 'a', // ā -> a
  '\u0113': 'e', // ē -> e
  '\u012B': 'i', // ī -> i
  '\u014D': 'o', // ō -> o
  // ... hundreds more
};

interface HomoglyphResult {
  hasHomoglyphs: boolean;
  normalizedForm: string;
  suspiciousChars: Array<{
    char: string;
    codePoint: string;
    position: number;
    looksLike: string;
  }>;
}

function detectHomoglyphs(email: string): HomoglyphResult {
  const suspicious: HomoglyphResult['suspiciousChars'] = [];
  let normalized = '';

  for (let i = 0; i < email.length; i++) {
    const char = email[i];
    const replacement = HOMOGLYPHS[char];

    if (replacement) {
      suspicious.push({
        char,
        codePoint: `U+${char.codePointAt(0)!.toString(16).toUpperCase().padStart(4, '0')}`,
        position: i,
        looksLike: replacement,
      });
      normalized += replacement;
    } else {
      normalized += char;
    }
  }

  return {
    hasHomoglyphs: suspicious.length > 0,
    normalizedForm: normalized,
    suspiciousChars: suspicious,
  };
}

Any email containing homoglyphs should receive a significant score penalty. In our dataset, 97.3% of emails containing non-ASCII characters that map to ASCII homoglyphs are fraudulent. The false positive rate is extremely low because legitimate email providers do not allow mixed-script addresses.
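Because legitimate providers reject mixed-script addresses, a cheap complementary check is to flag any local part that mixes scripts at all, whether or not the specific character appears in the mapping table. The sketch below uses JavaScript's Unicode property escapes (ES2018+); it is an assumed simplification, not BigShield's exact implementation:

```typescript
// Hypothetical helper: flag strings that mix Latin letters with Cyrillic
// or Greek letters. Catches homoglyph abuse even for characters missing
// from a curated mapping table.
function hasMixedScript(localPart: string): boolean {
  const hasLatin = /\p{Script=Latin}/u.test(localPart);
  const hasCyrillic = /\p{Script=Cyrillic}/u.test(localPart);
  const hasGreek = /\p{Script=Greek}/u.test(localPart);
  return hasLatin && (hasCyrillic || hasGreek);
}

// 'j' + Cyrillic о (U+043E) + 'hn' mixes Latin and Cyrillic:
hasMixedScript('j\u043Ehn'); // true
hasMixedScript('john');      // false
```

A script-mixing check like this trades precision for recall: it needs no table maintenance, but an all-Cyrillic local part would pass it, so the explicit homoglyph map above is still needed for cross-provider duplicate detection.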

Email Normalization

Before any analysis, emails must be normalized to account for provider-specific addressing rules. Different email providers handle dots, plus signs, and case differently:

interface NormalizationResult {
  original: string;
  normalized: string;
  provider: string;
  aliases: string[]; // other forms that deliver to the same inbox
}

function normalizeEmail(email: string): NormalizationResult {
  const [rawLocal, domain] = email.toLowerCase().split('@');
  let normalized = rawLocal;
  let provider = 'generic';
  const aliases: string[] = [];

  // Gmail: dots are ignored, plus addressing is stripped
  if (domain === 'gmail.com' || domain === 'googlemail.com') {
    provider = 'gmail';
    // Remove dots
    normalized = normalized.replace(/\./g, '');
    // Remove plus addressing
    normalized = normalized.replace(/\+.*$/, '');

    // Generate common alias forms
    aliases.push(`${normalized}@gmail.com`);
    aliases.push(`${normalized}@googlemail.com`);
  }

  // Outlook/Hotmail: plus addressing is stripped, dots are significant
  else if (['outlook.com', 'hotmail.com', 'live.com'].includes(domain)) {
    provider = 'microsoft';
    normalized = normalized.replace(/\+.*$/, '');
  }

  // Yahoo: hyphen addressing (disposable addresses)
  else if (domain === 'yahoo.com') {
    provider = 'yahoo';
    normalized = normalized.replace(/-.*$/, '');
  }

  // Proton Mail: plus addressing
  else if (domain === 'protonmail.com' || domain === 'proton.me') {
    provider = 'proton';
    normalized = normalized.replace(/\+.*$/, '');
  }

  // Generic: strip plus addressing as a best guess
  else {
    normalized = normalized.replace(/\+.*$/, '');
  }

  return {
    original: email.toLowerCase(),
    normalized: `${normalized}@${domain}`,
    provider,
    aliases,
  };
}

// Key insight: john.smith+promo@gmail.com, johnsmith@gmail.com,
// and j.o.h.n.s.m.i.t.h@gmail.com all deliver to the same inbox.
// Normalization catches multi-account attempts using address variants.

Normalization is the foundation that all other pattern detection builds on. Without it, a fraudster could create dozens of accounts using Gmail dot variations of the same address, and each one would look unique to a naive system.
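As a usage sketch, multi-account detection then reduces to grouping signups by their canonical form. canonicalKey below is a hypothetical condensed stand-in for normalizeEmail above, covering Gmail semantics only for brevity:

```typescript
// Hypothetical condensed canonicalizer: Gmail ignores dots and plus
// suffixes; for other domains we only strip plus addressing.
function canonicalKey(email: string): string {
  const [local, domain] = email.toLowerCase().split('@');
  if (domain === 'gmail.com' || domain === 'googlemail.com') {
    return local.replace(/\./g, '').replace(/\+.*$/, '') + '@gmail.com';
  }
  return local.replace(/\+.*$/, '') + '@' + domain;
}

// Group signups by canonical key and keep only keys that collect more
// than one submitted variant -- likely multi-account attempts.
function findDuplicateAccounts(emails: string[]): Map<string, string[]> {
  const groups = new Map<string, string[]>();
  for (const e of emails) {
    const key = canonicalKey(e);
    groups.set(key, [...(groups.get(key) ?? []), e]);
  }
  return new Map([...groups].filter(([, variants]) => variants.length > 1));
}
```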

Combining Signals: The Scoring Pipeline

Each of these analysis techniques produces a score. The final pattern score is a weighted combination:

interface PatternScore {
  entropy: number;        // 0-1
  ngram: number;          // 0-1
  markov: number;         // 0-1
  homoglyph: number;      // 0 or 1 (binary)
  dictionaryAttack: number; // 0-1 (cohort-level)
  final: number;          // 0-1 weighted combination
}

function calculatePatternScore(
  email: string,
  cohort: string[] | null, // other recent signups for batch analysis
  matrix: TransitionMatrix,
): PatternScore {
  const { normalized } = normalizeEmail(email);
  const localPart = normalized.split('@')[0];

  const entropy = 1 - normalizedEntropy(localPart); // invert: lower entropy = more natural
  const ngram = ngramScore(localPart);
  const markov = markovScore(localPart, matrix);
  const homoglyphResult = detectHomoglyphs(email);
  const homoglyph = homoglyphResult.hasHomoglyphs ? 0 : 1;

  let dictionaryAttack = 1; // assume no attack
  if (cohort && cohort.length > 10) {
    const attackResult = detectDictionaryAttack({
      emails: cohort,
      window: { start: new Date(), end: new Date() },
    });
    dictionaryAttack = attackResult.isAttack ? 0 : 1;
  }

  // Weighted combination (weights tuned via logistic regression)
  const weights = {
    entropy: 0.15,
    ngram: 0.25,
    markov: 0.25,
    homoglyph: 0.20,
    dictionaryAttack: 0.15,
  };

  const final =
    entropy * weights.entropy +
    ngram * weights.ngram +
    markov * weights.markov +
    homoglyph * weights.homoglyph +
    dictionaryAttack * weights.dictionaryAttack;

  return { entropy, ngram, markov, homoglyph, dictionaryAttack, final };
}

The weights above are simplified. In production, we use logistic regression trained on a labeled dataset of 2M+ email addresses (both confirmed legitimate and confirmed fraudulent) to optimize these weights, and the model is retrained monthly as fraud patterns evolve. We cover how we built and open-sourced portions of this system in a separate article.

Performance at Scale

Pattern detection needs to be fast. Our SLA is sub-200ms for the full validation pipeline, and pattern analysis is just one component. Here is how we achieve that:

  • Pre-loaded models: The Markov transition matrix, n-gram frequency table, and homoglyph map are loaded into memory at startup. No disk or network I/O during analysis.
  • O(n) algorithms: Every analysis function runs in linear time relative to the email length. Since email local parts are capped at 64 characters, this is effectively O(1).
  • Batch-aware but not batch-dependent: Individual emails can be scored in isolation. Cohort analysis (dictionary attack detection) runs asynchronously and updates scores retroactively.
  • Memory footprint: The full model set requires approximately 12MB of memory. The Markov matrix is the largest component at ~8MB (26^2 transition probabilities plus special characters).

End-to-end latency for pattern analysis is typically 1-3ms per email. The rest of BigShield's sub-200ms budget is spent on DNS lookups, SMTP verification, and IP reputation checks, which involve network I/O.

Limitations and Edge Cases

No pattern detection system is perfect. Here are the known limitations:

  • Non-Latin names: Our n-gram and Markov models are primarily trained on Latin-script email addresses. Romanized Chinese, Arabic, and Hindi names may score lower than expected. We address this with region-specific model variants.
  • Corporate email conventions: Some companies use employee ID patterns (like e12345@company.com) that look random to our analysis. Whitelisting known corporate domains mitigates this.
  • Privacy-focused users: People who deliberately use random-looking email addresses for privacy reasons (like xj7km@protonmail.com) will trigger false positives. This is why pattern analysis is one signal among many, not a sole decision-maker.

For a broader view of how these pattern detection components fit into the full architecture of our fraud detection platform, we cover the system design end to end in a separate article.

Pattern detection is one of the most powerful tools in the email fraud detection arsenal. If you want to leverage these techniques without building them yourself, BigShield runs this entire pipeline (and 15+ additional signals) on every API call. Try it at bigshield.app.

Ready to stop fake signups?

BigShield validates emails with 20+ signals in under 200ms. Start for free, no credit card required.
