
OPEN SOURCE BREACH CHECK TOOLS: VERIFYING CREDENTIALS WITHOUT THIRD-PARTY RISK

Published: 2026-05-24

THE PRIVACY PROBLEM WITH BREACH CHECKING

Most breach checking services require you to send the email address or password you want to verify. The service logs that query, knows exactly what you searched for, and can correlate it with your IP address, account, and any prior queries. For a security researcher investigating a sensitive target, this creates an audit trail you may not want. For a corporation running credential hygiene checks against employee accounts, it means sending HR-sensitive data to a third-party API you don't control.

The naive implementation of breach checking (hash the credential, send the hash, get a yes/no back) does not solve the problem. If you send the full SHA-1 of a password, the service can run a dictionary attack against every hash it receives. SHA-1 is a fast hash function, so hashing a dictionary of 100 million common passwords takes seconds to minutes on commodity hardware, and any operator can recover weak passwords from the hashes it sees. For an online lookup, the practical privacy-preserving model is k-anonymity: you send a prefix of the hash so ambiguous that the server cannot determine what you were checking.
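
To make the risk concrete, here is a minimal sketch of how an operator could reverse full hashes with a precomputed index. The four-word list is a hypothetical stand-in for a real dictionary, and build_reverse_index is an illustrative name, not part of any library.

```python
import hashlib

def build_reverse_index(candidates):
    """Map each candidate's uppercase SHA-1 hex digest back to its plaintext."""
    return {
        hashlib.sha1(p.encode("utf-8")).hexdigest().upper(): p
        for p in candidates
    }

# Hypothetical sample data; a real attacker would load millions of entries.
wordlist = ["password", "letmein", "hunter2", "qwerty"]
index = build_reverse_index(wordlist)

# A client that submits the full hash of a weak password...
submitted = hashlib.sha1(b"letmein").hexdigest().upper()

# ...lets anyone holding the index recover the plaintext with one lookup.
recovered = index.get(submitted)
print(recovered)
```

Only passwords present in the dictionary are recoverable this way, which is exactly the population a breach check is meant to flag.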

Understanding how k-anonymity works, how to use the HIBP API correctly, and how to build offline lookup infrastructure for high-volume use cases gives you breach intelligence without introducing a new trust dependency.

HOW K-ANONYMITY WORKS FOR PASSWORDS

The Pwned Passwords endpoint uses SHA-1 hashing with a k-anonymity model. The protocol works like this: you hash the password locally, take the first 5 hexadecimal characters of the resulting hash (the prefix), and send only that 5-character prefix to the API. The API returns the suffix (the remaining 35 characters) of every hash in its dataset that begins with that prefix, each paired with the number of times it was seen, typically between 400 and 900 results per query. You then check locally whether the suffix of your own hash appears in the returned set.

The API never sees the full hash. From the server's perspective, your query is one of 16^5 = 1,048,576 possible 5-character hex prefixes. If someone is monitoring the API, they see a prefix like 5BAA6; they cannot determine whether you were checking "password" (whose SHA-1 hash does begin with 5BAA6) or any other string whose hash shares that prefix. That's the k-anonymity guarantee: your query is indistinguishable from the roughly k other hashes that fall under the same prefix.
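
The split between what is sent and what stays local can be sketched in a few lines. This is an illustration of the scheme described above, not HIBP client code; split_hash and PREFIX_LEN are names chosen for this example.

```python
import hashlib

PREFIX_LEN = 5  # number of hex characters sent to the server

def split_hash(password: str) -> tuple[str, str]:
    """Return (prefix sent to the API, suffix kept and compared locally)."""
    digest = hashlib.sha1(password.encode("utf-8")).hexdigest().upper()
    return digest[:PREFIX_LEN], digest[PREFIX_LEN:]

prefix, suffix = split_hash("password")

# Each hex character has 16 possible values, so the server only learns
# which of 16**5 buckets the hash falls into.
prefix_space = 16 ** PREFIX_LEN

print(prefix, len(suffix), prefix_space)
```

A SHA-1 digest is 40 hex characters, so the 35-character suffix never leaves the machine.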

For comparison, the HIBP email endpoint does not use k-anonymity — it requires the full email address and a paid API key. This is a deliberate tradeoff. Email breach lookup has lower practical abuse risk than password lookup (you can't reconstruct which email addresses are "common" the same way you can with passwords), and requiring an API key creates accountability. But the email endpoint does expose the queried address to the HIBP service.

For corporate or sensitive use, consider whether this tradeoff is acceptable for your threat model. Cached local lookups (see below) are the alternative.

USING THE HIBP API IN PYTHON

The k-anonymity password check is straightforward to implement. Here is a minimal Python function that returns the number of times a password appears in the breach dataset:

import hashlib
import requests

def check_password(password: str) -> int:
    """
    Returns number of times this password appears in HIBP breach data.
    Returns 0 if not found. Uses k-anonymity — full hash never sent.
    """
    sha1 = hashlib.sha1(password.encode("utf-8")).hexdigest().upper()
    prefix, suffix = sha1[:5], sha1[5:]
    resp = requests.get(
        f"https://api.pwnedpasswords.com/range/{prefix}",
        headers={"Add-Padding": "true"},  # pads the response to hide its true size
        timeout=10,  # fail instead of hanging indefinitely on network problems
    )
    resp.raise_for_status()
    for line in resp.text.splitlines():
        h, count = line.split(":")
        if h == suffix:
            return int(count)
    return 0

# Usage
count = check_password("hunter2")
if count:
    print(f"Compromised — seen {count} times in breach data")
else:
    print("Not found in breach data")

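The Add-Padding: true header causes the API to pad each response with dummy entries, so that an observer who can see only the size of the encrypted response cannot infer which prefix was queried from how many results it matched. Padded entries carry a count of 0 and can be ignored. This is a defense-in-depth measure for high-sensitivity environments.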
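
If you want to work with the whole range response rather than a single match, padded entries can be dropped while parsing. A minimal sketch, assuming each line has the SUFFIX:COUNT form used in the function above and that padding entries carry a count of 0; parse_range_response is an illustrative name and the sample body is synthetic, not real API output.

```python
def parse_range_response(body: str) -> dict[str, int]:
    """Parse 'SUFFIX:COUNT' lines into a dict, skipping zero-count padding."""
    counts: dict[str, int] = {}
    for line in body.splitlines():
        suffix, _, count = line.strip().partition(":")
        if not suffix or not count:
            continue  # skip blank or malformed lines
        seen = int(count)
        if seen > 0:  # a count of 0 marks a padding entry
            counts[suffix.upper()] = seen
    return counts

# Synthetic sample body (35-character suffixes), for illustration only.
sample = (
    "A" * 35 + ":3\r\n"
    + "B" * 35 + ":0\r\n"   # padding entry
    + "C" * 35 + ":12\r\n"
)

parsed = parse_range_response(sample)
print(len(parsed))
```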

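For email lookup, the HIBP v3 API returns structured breach objects. Each breach includes: Name, Domain, BreachDate, AddedDate, ModifiedDate, PwnCount, a DataClasses array describing what data was exposed (email addresses, passwords, IP addresses, phone numbers, etc.), and a Description field. The endpoint is https://haveibeenpwned.com/api/v3/breachedaccount/{email} and requires a hibp-api-key header from a subscription, plus a user-agent header identifying your application. By default the response is truncated to breach names only; add truncateResponse=false to the query string to receive the full objects. An address that appears in no breach produces a 404 response rather than an empty list.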
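
A hedged sketch of such a request follows. It uses the standard library's urllib instead of requests so it runs without extra packages. The function names and the user-agent value are placeholders invented for this example, and fetch_breaches needs a valid subscription key to actually return data.

```python
import json
import urllib.error
import urllib.parse
import urllib.request

API_ROOT = "https://haveibeenpwned.com/api/v3/breachedaccount/"

def build_request(email: str, api_key: str) -> urllib.request.Request:
    """Build the breachedaccount request without sending it."""
    # truncateResponse=false asks for full breach objects, not just names.
    url = API_ROOT + urllib.parse.quote(email, safe="") + "?truncateResponse=false"
    return urllib.request.Request(
        url,
        headers={
            "hibp-api-key": api_key,
            "user-agent": "breach-check-sketch",  # placeholder application name
        },
    )

def fetch_breaches(email: str, api_key: str) -> list[dict]:
    """Return the list of breach objects for an address, or [] if none."""
    try:
        with urllib.request.urlopen(build_request(email, api_key), timeout=10) as resp:
            return json.load(resp)
    except urllib.error.HTTPError as err:
        if err.code == 404:  # the account appears in no breach
            return []
        raise  # surface rate limiting, auth failures, and so on

req = build_request("user@example.com", "YOUR-API-KEY")
print(req.full_url)
```

Separating build_request from fetch_breaches keeps the URL and header logic testable without network access.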

OpenOSINT's email investigation workflow uses the HIBP endpoint via the search_breach tool, which wraps the v3 API with rate limiting and formats the breach list for terminal output. It integrates with the AI agent loop so Claude can reason about which breaches are most significant based on the investigation context.

OFFLINE BREACH DATABASES FOR HIGH-VOLUME LOOKUP

If you're checking hundreds of thousands of credentials (password hygiene audits, leaked database correlation, bulk investigation pipelines), per-query API calls are slow, expensive, and create a dependency on external uptime. The HIBP Pwned Passwords list is publicly available for download and offline use.

The full SHA-1 ordered file is approximately 40 GB compressed, containing over 1.2 billion compromised passwords hashed and sorted by hash. The NTLM version covers the same dataset with NTLM hashes, which is useful for Active Directory environments. Import options for fast local lookup:
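
One possible option, sketched under assumptions: load the file into a local SQLite table and look hashes up by primary key. The sketch assumes each line of the downloaded file has the form HASH:COUNT; the table name, function names, and sample lines are invented for this example, and a real import of the full file would use an on-disk database and take considerable time.

```python
import hashlib
import sqlite3

def _rows(lines):
    """Yield (raw 20-byte hash, count) pairs from 'HASH:COUNT' lines."""
    for line in lines:
        hex_hash, _, count = line.strip().partition(":")
        if hex_hash and count:
            yield bytes.fromhex(hex_hash), int(count)

def import_hashes(conn: sqlite3.Connection, lines) -> None:
    """Create the lookup table if needed and bulk-insert the hash list."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS pwned ("
        "hash BLOB PRIMARY KEY, count INTEGER NOT NULL) WITHOUT ROWID"
    )
    with conn:  # commit once at the end of the batch
        conn.executemany("INSERT OR REPLACE INTO pwned VALUES (?, ?)", _rows(lines))

def lookup(conn: sqlite3.Connection, password: str) -> int:
    """Return the stored count for a password, or 0 if it is not listed."""
    digest = hashlib.sha1(password.encode("utf-8")).digest()
    row = conn.execute("SELECT count FROM pwned WHERE hash = ?", (digest,)).fetchone()
    return row[0] if row else 0

# Synthetic two-line sample standing in for the real file.
sample_lines = [
    hashlib.sha1(b"password").hexdigest().upper() + ":42\n",
    hashlib.sha1(b"letmein").hexdigest().upper() + ":7\n",
]
conn = sqlite3.connect(":memory:")
import_hashes(conn, sample_lines)
print(lookup(conn, "password"), lookup(conn, "not-in-the-list"))
```

Storing the hash as a 20-byte BLOB rather than 40 hex characters roughly halves the key size, which matters at this row count. With a real file you would pass the open file object as lines so it streams rather than loading into memory.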

For email-based offline lookup, no official dump exists — HIBP does not release email breach data as a downloadable dataset. The practical approach is aggressive API caching: query the API once per email, store results in a local SQLite database keyed by email hash, and never query the same address twice. This reduces API calls to net-new addresses only.
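
The caching approach can be sketched as follows. This is an illustration under assumptions: BreachCache, email_key, and the table layout are invented for this example, and fetch stands for whatever function performs the real API call. Note that a plain SHA-256 of an email address only avoids storing addresses in clear text; it is not strong protection against someone who can guess candidate addresses.

```python
import hashlib
import json
import sqlite3

def email_key(email: str) -> str:
    """Normalize the address and hash it so the cache holds no plaintext emails."""
    return hashlib.sha256(email.strip().lower().encode("utf-8")).hexdigest()

class BreachCache:
    """Query-once cache: each address reaches the API at most one time."""

    def __init__(self, conn: sqlite3.Connection, fetch):
        self.conn = conn
        self.fetch = fetch  # callable(email) -> list of breach dicts
        conn.execute(
            "CREATE TABLE IF NOT EXISTS breach_cache ("
            "email_hash TEXT PRIMARY KEY, breaches TEXT NOT NULL)"
        )

    def get(self, email: str) -> list:
        key = email_key(email)
        row = self.conn.execute(
            "SELECT breaches FROM breach_cache WHERE email_hash = ?", (key,)
        ).fetchone()
        if row:  # cache hit, including cached empty results
            return json.loads(row[0])
        breaches = self.fetch(email)
        with self.conn:
            self.conn.execute(
                "INSERT INTO breach_cache VALUES (?, ?)", (key, json.dumps(breaches))
            )
        return breaches

# Demo with a fake fetcher that records how often it is called.
calls = []
def fake_fetch(email):
    calls.append(email)
    return [{"Name": "ExampleForum"}]  # hand-written sample, not real API output

cache = BreachCache(sqlite3.connect(":memory:"), fake_fetch)
first = cache.get("User@Example.com")
second = cache.get("user@example.com ")  # same address after normalization
print(len(calls))
```

Because breach data grows over time, a production cache would probably also store a fetch timestamp and refresh old entries; that is left out here to keep the sketch short.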

WHAT BREACH DATA ACTUALLY TELLS YOU

Raw breach lookup output gains meaning from context. An email address appearing in the LinkedIn breach (a 2012 intrusion whose data surfaced in 2016; HIBP lists roughly 164 million accounts) confirms the target had a LinkedIn account at that time and means an unsalted SHA-1 hash of their password from that period is in circulation. The 2013 Adobe breach (153 million accounts) exposed passwords that were encrypted rather than hashed, using 3DES in ECB mode, a catastrophically weak scheme. ECB mode encrypts identical passwords to identical ciphertext, and the dump included plaintext password hints, so many of those passwords were recovered and are present in credential stuffing lists used by attackers today.

Older breaches that exposed plaintext or weakly hashed passwords are particularly dangerous. The 2011 Stratfor breach (unsalted MD5), the 2012 LinkedIn breach (unsalted SHA-1), and various forum dumps from that era contain cleartext or easily cracked credentials. Knowing which breach databases a target's email appears in tells you which credential lists attackers would attempt in a stuffing campaign against that account.

For security researchers, the breach list also reveals behavioral patterns. A target appearing in 12 different breaches across 10 years suggests a long-term presence on the internet under a consistent identity. A target appearing only in recent data-aggregator breaches (LexisNexis, Exactis, etc.) suggests their email was harvested from marketing lists rather than from direct service use. These distinctions matter for attribution.
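
A rough sketch of how these signals could be pulled out of a breach list, assuming full breach objects with the Name, BreachDate (ISO date), and DataClasses fields described earlier. The summarize function is illustrative, and the sample objects are hand-written, not real API output.

```python
from datetime import date

def summarize(breaches: list[dict]) -> dict:
    """Condense a breach list into count, date range, and password exposures."""
    dates = sorted(date.fromisoformat(b["BreachDate"]) for b in breaches)
    with_passwords = sorted(
        b["Name"] for b in breaches if "Passwords" in b.get("DataClasses", [])
    )
    return {
        "count": len(breaches),
        "first": dates[0].isoformat() if dates else None,
        "last": dates[-1].isoformat() if dates else None,
        "span_years": dates[-1].year - dates[0].year if dates else 0,
        "password_exposures": with_passwords,
    }

# Hand-written sample objects using hypothetical breach names.
sample = [
    {"Name": "ExampleForum", "BreachDate": "2012-06-01",
     "DataClasses": ["Email addresses", "Passwords"]},
    {"Name": "ExampleAggregator", "BreachDate": "2021-03-15",
     "DataClasses": ["Email addresses", "Phone numbers"]},
]

summary = summarize(sample)
print(summary["span_years"], summary["password_exposures"])
```

A long span with several password exposures points toward the long-lived identity pattern; a short, recent span with no password data classes is more consistent with list harvesting. Both readings are heuristics, not proof.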

OSINT investigations rarely rely on a single signal. Breach data is most valuable when correlated with username enumeration, email verification, and social footprint analysis — the full chain that OpenOSINT's tool suite automates through the AI agent loop.
