How Do They Spy on Websites? A Deep Dive into Web Tracking

Aug 11, 2025

“How do they spy on websites?” is a two-sided question. On the one hand, organizations monitor and protect their websites from abuse (fraud, scraping, account takeovers). On the other, analysts and operators want to understand how third parties identify and track visitors. The truth is that modern web tracking and bot detection rely on dozens of signals—not a single “magic” flag. From browser headers and TLS signatures to canvas entropy, WebRTC leaks, and behavior analytics, today’s systems combine multiple indicators to decide whether a visit looks human, automated, benign, or risky.

This guide explains the key techniques—what gets observed, why it matters, and how these signals interact—so you can design reliable, compliant automation and protect user privacy at the same time.

The layers of “who sees what”

There are several stakeholders observing traffic:

  • Website owners and their server-side logs (IP, user agent, cookies, response codes, timing).
  • CDNs and bot management vendors (Cloudflare, Akamai, Fastly, Imperva, etc.).
  • Analytics, advertising, and A/B testing scripts running in the browser.
  • Embedded SDKs (maps, chat widgets, captchas) that add their own signals.

Each layer collects pieces of the puzzle—headers, TLS handshakes, JavaScript feature support, device/graphics properties, interaction patterns—and fuses them into a score.

User-Agent and the new User-Agent Client Hints (UA-CH)

For years, the User-Agent string was the “identity card” of browsers. It is still widely used, but it’s noisy and often spoofed. Modern Chromium-based browsers now emit User-Agent Client Hints (UA-CH), a structured set of headers that convey brand and version in a more privacy-preserving way.

Common signals:

  • User-Agent: high-level browser + OS string (often padded for legacy reasons).
  • Sec-CH-UA, Sec-CH-UA-Platform, Sec-CH-UA-Model, Sec-CH-UA-Full-Version-List: granular brand/version info; low-entropy hints are sent by default, while high-entropy hints (model, full version list) require server opt-in via Accept-CH.
  • Accept-Language, Accept, Content-Type, Sec-Fetch-*: request intent and locale hints.

A mismatch between the User-Agent and UA-CH, or among language, time zone, and IP geolocation, can raise suspicion. Perfect alignment is not mandatory, but egregious conflicts matter.

Example header set for a realistic desktop visit:

GET /pricing HTTP/2
Host: example.com
User-Agent: Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/127.0.0.0 Safari/537.36
Accept: text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,*/*;q=0.8
Accept-Language: en-US,en;q=0.9
Sec-Fetch-Site: none
Sec-Fetch-Mode: navigate
Sec-Fetch-Dest: document
Upgrade-Insecure-Requests: 1

If the server has opted in via an Accept-CH response header, the browser also sends the requested high-entropy Sec-CH-UA headers (such as Sec-CH-UA-Full-Version-List) on subsequent requests, describing brand and version more precisely.

Beyond headers: TLS and network fingerprints

Even if headers look perfect, lower-level characteristics can reveal the client stack:

  • TLS ClientHello fingerprint (cipher suites, extensions, ALPN): produces a JA3-like signature.
  • HTTP/2 settings, prioritization, and connection reuse patterns.
  • IP reputation, ASN, and geolocation (data center vs. residential/mobile ranges).
  • Connection churn and concurrency (too many short-lived sessions vs. typical user behavior).

Vendors maintain reputation databases and train models on these patterns. A bot using a generic library TLS stack may differ subtly from a real browser. Modern automation should rely on real browsers or indistinguishable stacks—ideally via an execution service that matches production browsers.
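The JA3 technique mentioned above works by concatenating five fields of the TLS ClientHello and hashing them with MD5. The sketch below shows the mechanics; the field values are illustrative placeholders, not a real browser's handshake:

```javascript
import { createHash } from 'node:crypto';

// JA3 folds five ClientHello fields into a 32-character MD5 hash:
// version, cipher suites, extensions, elliptic curves, point formats.
function ja3(version, ciphers, extensions, curves, pointFormats) {
  const s = [
    version,
    ciphers.join('-'),
    extensions.join('-'),
    curves.join('-'),
    pointFormats.join('-'),
  ].join(',');
  return createHash('md5').update(s).digest('hex');
}

// Two clients sending identical headers but offering ciphers in a
// different order still produce different JA3 hashes:
const a = ja3(771, [4865, 4866, 4867], [0, 23, 65281], [29, 23, 24], [0]);
const b = ja3(771, [4866, 4865, 4867], [0, 23, 65281], [29, 23, 24], [0]);
// a !== b
```

This is why ordering matters: a scripted HTTP client can copy a browser's headers exactly and still stand out at the TLS layer, because its library negotiates the handshake differently.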

Device and canvas fingerprinting

JavaScript-accessible properties can be combined into a relatively stable device fingerprint:

  • Navigator properties: platform, deviceMemory, hardwareConcurrency, webdriver.
  • Screen/Window: resolution, color depth, pixel ratio, viewport sizing.
  • Time zone, locale, Intl formats, and calendar data.
  • Fonts, audio, and canvas/WebGL rendering (GPU, drivers, shader precision).
  • Installed media capabilities and codecs.

Small inconsistencies become large when taken together. For instance, an exotic font list with a common GPU driver, or a 4K viewport paired with low device memory, may be improbable. Systems aren’t necessarily deterministic here; they compute a likelihood score.
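Conceptually, fingerprinting scripts fold these properties into a single stable identifier. A minimal sketch, with hard-coded stand-ins for values that would come from navigator and screen in a real browser:

```javascript
import { createHash } from 'node:crypto';

// Fold device properties into one short, stable identifier.
function fingerprint(props) {
  // Sort keys so the same property set always hashes identically.
  const canonical = Object.keys(props)
    .sort()
    .map((k) => `${k}=${props[k]}`)
    .join('|');
  return createHash('sha256').update(canonical).digest('hex').slice(0, 16);
}

const device = {
  platform: 'Win32',
  hardwareConcurrency: 8,
  deviceMemory: 8,
  screen: '1920x1080x24',
  timeZone: 'America/New_York',
};
fingerprint(device); // 16-hex-char identifier, stable across calls
```

Changing any single property changes the whole hash, which is why partial spoofing (say, faking the platform but not the GPU string) tends to produce a fingerprint that matches nothing in the wild.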

Behavior analytics and human dynamics

Tracking code can analyze how users interact with pages:

  • Mouse movement granularity, velocity changes, idle vs. burst patterns.
  • Scrolling cadence, keypress timing, and focus/blur events.
  • Page dwell time, bounce patterns, and navigation depth.
  • DOM interactions: which elements get hovered, clicked, or avoided.

Automated flows often exhibit unrealistic regularity or “perfect precision.” Randomization helps, but realism beats randomness: actions should be sequenced, delayed, and occasionally corrected like a person would.
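One concrete example of "realism beats randomness": human inter-action delays tend to follow a log-normal distribution (mostly short, occasionally long), not a uniform one. A minimal sampler, with illustrative parameters:

```javascript
// Sample a human-like delay in milliseconds from a log-normal
// distribution. medianMs and sigma are illustrative tuning knobs.
function humanDelayMs(medianMs = 350, sigma = 0.6) {
  // Box-Muller transform: turn two uniform samples into a standard normal.
  const u = 1 - Math.random(); // in (0, 1], so Math.log never sees 0
  const v = Math.random();
  const z = Math.sqrt(-2 * Math.log(u)) * Math.cos(2 * Math.PI * v);
  // exp() of a normal sample yields a log-normal delay around medianMs.
  return Math.round(medianMs * Math.exp(sigma * z));
}
```

Drawing each pause from a distribution like this avoids the two classic giveaways: fixed delays (perfect regularity) and uniform jitter (equally likely 10 ms and 990 ms pauses, which humans never produce).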

Headless and automation markers

Older headless modes exposed giveaways (e.g., navigator.webdriver = true, missing plugins, reduced canvas entropy). Modern browsers have narrowed these gaps, but there are still patterns to watch:

  • WebDriver protocol flags and injected properties.
  • Missing or stubbed APIs (speech, Bluetooth, WebUSB) in environments that otherwise claim full support.
  • Uniform window sizes, default fonts, or disabled GPU across all sessions.

Good automation treats headless as a mode, not a disguise. Real browsers, realistic environments, and selective headful runs often improve reliability.
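From the detection side, these markers are typically combined into a weighted score rather than treated as binary verdicts. A toy scorer over a navigator-like snapshot, with weights that are purely illustrative:

```javascript
// Toy detection-side scorer: weigh common automation markers found in
// a snapshot of navigator-like properties. Weights are illustrative.
function automationScore(nav) {
  let score = 0;
  if (nav.webdriver === true) score += 5;                       // explicit WebDriver flag
  if ((nav.plugins ?? []).length === 0) score += 1;             // empty plugin list
  if (nav.languages && nav.languages.length === 0) score += 2;  // stripped locale list
  if (nav.hardwareConcurrency === 1) score += 1;                // minimal VM-like hardware
  return score;
}
```

No single check is decisive (plenty of real browsers report no plugins), which is exactly why real systems accumulate evidence across many weak signals before acting.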

Cookies, storage, and supercookies

State allows correlation across visits:

  • HTTP cookies and SameSite behaviors.
  • LocalStorage/SessionStorage and IndexedDB artifacts.
  • ETags, HSTS, and cache-based “supercookies.”

Blocking all state can look suspicious (new device every page), while uncontrolled state can leak identity. The middle path is managed persistence: reuse sessions where appropriate, clear or rotate state when crossing tenants or use cases.
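The "managed persistence" middle path can be sketched as a small per-tenant store: reuse state within a tenant, never across tenants, and rotate explicitly when crossing use cases. The Map-based store below is an illustrative stand-in for persisted storage-state files:

```javascript
// Managed persistence sketch: session state is keyed by tenant so
// identity is reused where appropriate and never leaks across tenants.
class SessionStore {
  constructor() {
    this.byTenant = new Map();
  }

  // Returns the tenant's saved state, or null to start a fresh identity.
  acquire(tenant) {
    return this.byTenant.get(tenant) ?? null;
  }

  save(tenant, state) {
    this.byTenant.set(tenant, state);
  }

  // Rotate when crossing use cases: drop state instead of carrying it over.
  rotate(tenant) {
    this.byTenant.delete(tenant);
  }
}
```

In practice the saved state would be a serialized cookie jar or a framework's storage-state snapshot, but the boundary rule is the point: state reuse follows the tenant, not the infrastructure.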

Challenges and bot management

When risk scores cross thresholds, websites deploy challenges:

  • CAPTCHAs (reCAPTCHA, hCaptcha, Turnstile) and invisible/behavioral variants.
  • JavaScript integrity checks and proof-of-work or proof-of-humanity.
  • Progressive throttling, honeypots, and deceptive flows.

There is no universal bypass—and that's by design. The aim is to filter out risky traffic with minimal friction for legitimate users.
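To make the proof-of-work idea concrete: a hashcash-style challenge asks the client to find a nonce whose hash starts with a required prefix, forcing it to spend CPU time proportional to the difficulty. The sketch below uses a tiny difficulty so it finishes instantly; real challenges tune it per risk score:

```javascript
import { createHash } from 'node:crypto';

// Hashcash-style proof-of-work: brute-force a nonce until the hash of
// challenge:nonce begins with the required number of zero hex digits.
function solve(challenge, zeroHexDigits = 3) {
  const target = '0'.repeat(zeroHexDigits);
  for (let nonce = 0; ; nonce++) {
    const h = createHash('sha256')
      .update(`${challenge}:${nonce}`)
      .digest('hex');
    if (h.startsWith(target)) return { nonce, hash: h };
  }
}

solve('example-challenge', 2); // cheap for one visit, costly at scale
```

The asymmetry is the point: verification costs the server one hash, while solving costs the client many, which makes mass automated traffic expensive without inconveniencing a single human visit.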

Practical guidelines for compliant, reliable automation

  1. Prefer real browsers with stable, pinned versions. Align DevTools Protocol (CDP) versions with your framework.
  2. Shape headers and UA-CH coherently: user agent, language, and time zone should make sense together.
  3. Use realistic environments: GPU, fonts, window sizes, and media capabilities that match the claimed platform.
  4. Pace traffic: limit parallelism per domain, reuse sessions when logical, and introduce human-like delays.
  5. Implement early assertions and short-circuit errors to avoid “thrashing” a site.
  6. Separate authentication from business actions; persist session cookies securely.
  7. Rotate IPs responsibly via reputable proxy networks; avoid erratic churn.
  8. Respect robots.txt and terms of service; document lawful bases for data collection.
  9. Capture evidence (screenshots, HTML, logs) for audits and debugging.
  10. Add human-in-the-loop for ambiguous or high-risk steps.
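Guideline 4 (pace traffic per domain) is commonly implemented as a token bucket: each host gets a small burst allowance that refills at a steady rate. A minimal sketch with illustrative parameters:

```javascript
// Per-domain pacing via a token bucket: allow short bursts per host,
// then throttle to a steady rate. Parameters are illustrative.
class DomainLimiter {
  constructor(ratePerSec = 1, burst = 3) {
    this.rate = ratePerSec;
    this.burst = burst;
    this.buckets = new Map(); // host -> { tokens, last }
  }

  // Returns true if a request to `host` may proceed now.
  tryAcquire(host, now = Date.now()) {
    const b = this.buckets.get(host) ?? { tokens: this.burst, last: now };
    // Refill proportionally to elapsed time, capped at the burst size.
    b.tokens = Math.min(this.burst, b.tokens + ((now - b.last) / 1000) * this.rate);
    b.last = now;
    const allowed = b.tokens >= 1;
    if (allowed) b.tokens -= 1;
    this.buckets.set(host, b);
    return allowed;
  }
}
```

Combined with human-like delays between granted requests, this keeps concurrency per domain bounded even when many flows run in parallel.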

Example: setting coherent headers in an automation context

Below is a minimal example of setting a realistic user agent and language when launching a context. The exact API varies by framework, but the goal is consistent:

// Example using Playwright (Node.js)
import { chromium } from 'playwright';

const browser = await chromium.launch({ headless: true });

// Keep user agent, locale, time zone, and viewport mutually consistent:
// a Windows Chrome build with a US locale and a common desktop size.
const context = await browser.newContext({
  userAgent: 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/127.0.0.0 Safari/537.36',
  locale: 'en-US',
  timezoneId: 'America/New_York',
  viewport: { width: 1366, height: 824 },
});
const page = await context.newPage();
await page.goto('https://example.com');
await browser.close();

Many modern browsers will manage UA-CH automatically; focus on consistency rather than spoofing every possible hint.

Ethics and compliance

Responsible operators distinguish between legitimate automation (testing, accessibility, compliance, interoperability) and prohibited uses (circumventing paywalls, violating terms, harvesting personal data without a lawful basis). Build with clear policies, rate limits, and opt-outs where possible, and consult legal guidance for your jurisdiction.

Bottom line

Web “spying” or tracking is not a single technology, but a layered system of signals spanning the network, browser, device, and behavior. Reliability and privacy come from using realistic browsers, coherent identities, measured traffic, and clear compliance standards—not from a single header tweak.

CloudBrowser AI: private, reliable automation with proxies and stealth

CloudBrowser AI provides a production-grade browser automation engine that aligns with modern detection landscapes:

  • Real Chrome/Chromium execution with pinned versions and evidence (logs, screenshots, HTML).
  • Built-in proxy configuration (residential, mobile, or data center) to route traffic responsibly.
  • Stealth profiles that keep headers, UA-CH, time zone, and rendering characteristics coherent.
  • Session and cookie management to reuse identity where appropriate and rotate safely when required.
  • Orchestration integrations (n8n, MCP) so you can scale flows with queues, retries, and monitoring.

If you need to automate complex websites while staying private and compliant, CloudBrowser AI lets you choose the right browser build, configure proxies, and run in a realistic, low-noise environment—without rewriting your flows.