The Kukie.io cookie scanner uses a headless Chromium browser to visit the pages of your website (up to your plan's page limit), record every cookie and storage item that is set, and feed the results into an automatic categorisation engine. This article explains the full pipeline from URL discovery through to the final results.
WordPress users: Cookie scans work the same way whether you use the WordPress plugin or manual installation. Scans are triggered from the Kukie.io dashboard.
Headless Chromium Browser
Each page is loaded in a real Chromium browser instance powered by Playwright. This means the scanner executes JavaScript, renders the DOM, and fires network requests just like a real visitor would. Cookies set via HTTP response headers, JavaScript document.cookie, and third-party script injections are all captured.
The scanner waits for the page to fully load, then scrolls the viewport to trigger lazy-loaded scripts and deferred analytics tags. An additional wait period after scrolling ensures that trackers with delayed initialisation are also detected.
URL Discovery
Before scanning begins, the scanner builds a list of pages to visit, applying three rules in order:
1. Sitemap.xml
The scanner first checks for a sitemap.xml file at the root of your domain. If found, it parses all <url> entries and extracts the page URLs. Sitemap index files (files that reference other sitemaps) are followed recursively.
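The recursive handling of sitemap index files can be sketched as follows. This is a minimal illustration using only the Python standard library, not the scanner's actual code: the function name `collect_sitemap_urls`, the `fetch` callback, and the in-memory `DEMO_FILES` dictionary are all hypothetical and stand in for real HTTP requests.

```python
import xml.etree.ElementTree as ET
from typing import Callable, List, Optional, Set

# XML namespace defined by the sitemaps.org protocol.
NS = "{http://www.sitemaps.org/schemas/sitemap/0.9}"


def collect_sitemap_urls(
    sitemap_url: str,
    fetch: Callable[[str], Optional[str]],
    seen: Optional[Set[str]] = None,
) -> List[str]:
    """Return page URLs from a sitemap, following sitemap index files."""
    seen = set() if seen is None else seen
    if sitemap_url in seen:  # guard against index files that reference each other
        return []
    seen.add(sitemap_url)

    xml_text = fetch(sitemap_url)
    if xml_text is None:  # sitemap not found
        return []

    root = ET.fromstring(xml_text)
    urls: List[str] = []
    if root.tag == NS + "sitemapindex":
        # An index file lists other sitemaps: recurse into each one.
        for loc in root.iterfind(f"{NS}sitemap/{NS}loc"):
            child = (loc.text or "").strip()
            if child:
                urls.extend(collect_sitemap_urls(child, fetch, seen))
    else:
        # A regular sitemap lists pages in <url><loc> entries.
        for loc in root.iterfind(f"{NS}url/{NS}loc"):
            page = (loc.text or "").strip()
            if page:
                urls.append(page)
    return urls


# Hypothetical in-memory "website" so the sketch runs offline.
_XMLNS = 'xmlns="http://www.sitemaps.org/schemas/sitemap/0.9"'
DEMO_FILES = {
    "https://example.com/sitemap.xml": (
        f"<sitemapindex {_XMLNS}>"
        "<sitemap><loc>https://example.com/pages.xml</loc></sitemap>"
        "<sitemap><loc>https://example.com/posts.xml</loc></sitemap>"
        "</sitemapindex>"
    ),
    "https://example.com/pages.xml": (
        f"<urlset {_XMLNS}>"
        "<url><loc>https://example.com/</loc></url>"
        "<url><loc>https://example.com/about</loc></url>"
        "</urlset>"
    ),
    "https://example.com/posts.xml": (
        f"<urlset {_XMLNS}>"
        "<url><loc>https://example.com/blog/hello</loc></url>"
        "</urlset>"
    ),
}

demo_urls = collect_sitemap_urls("https://example.com/sitemap.xml", DEMO_FILES.get)
```

Passing `fetch` in as a callback keeps the parsing logic separate from networking, so the same function could be driven by a real HTTP client.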
2. Link Crawl
If no sitemap is available, the scanner loads your homepage and extracts all internal links from the HTML. Only links pointing to the same domain are followed. External links, anchor links, and links to non-page files (PDF, ZIP, images, etc.) are filtered out.
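The filtering rules above might look roughly like this in code. It is a sketch under stated assumptions: the extension list in `SKIPPED_EXTENSIONS` is illustrative (the article only names PDF, ZIP, and images), "same domain" is approximated by an exact hostname match, and `extract_internal_links` is a hypothetical helper name.

```python
import posixpath
from html.parser import HTMLParser
from typing import List
from urllib.parse import urldefrag, urljoin, urlparse

# Illustrative list of non-page file extensions (an assumption).
SKIPPED_EXTENSIONS = {".pdf", ".zip", ".jpg", ".jpeg", ".png", ".gif", ".svg", ".webp"}


class _HrefCollector(HTMLParser):
    """Collects the href attribute of every <a> tag."""

    def __init__(self) -> None:
        super().__init__()
        self.hrefs: List[str] = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.hrefs.append(value)


def extract_internal_links(html: str, page_url: str) -> List[str]:
    """Return unique same-host page links found in html, in document order."""
    parser = _HrefCollector()
    parser.feed(html)

    base, _ = urldefrag(page_url)
    site_host = urlparse(base).hostname
    links: List[str] = []
    for href in parser.hrefs:
        # Resolve relative links and drop any #fragment.
        absolute, _ = urldefrag(urljoin(base, href))
        parts = urlparse(absolute)
        if parts.scheme not in ("http", "https"):
            continue  # mailto:, tel:, javascript:, ...
        if parts.hostname != site_host:
            continue  # external link
        if absolute == base:
            continue  # pure anchor link back to the same page
        if posixpath.splitext(parts.path)[1].lower() in SKIPPED_EXTENSIONS:
            continue  # non-page file
        if absolute not in links:
            links.append(absolute)
    return links


DEMO_HTML = """
<a href="/about">About</a>
<a href="#top">Back to top</a>
<a href="https://other.example.org/">Partner</a>
<a href="/files/report.PDF">Report</a>
<a href="contact?x=1">Contact</a>
<a href="mailto:hi@example.com">Email</a>
<a href="/about#team">Team</a>
"""

demo_links = extract_internal_links(DEMO_HTML, "https://example.com/")
```

In this demo only `/about` and `contact?x=1` survive; the anchor, external, PDF, and `mailto:` links are dropped, and `/about#team` collapses into the already-seen `/about`.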
3. Homepage Always Included
Regardless of the discovery method, your homepage is always included in the scan. This ensures that cookies set on the landing page are never missed.
Tip: For the most thorough scan results, make sure your website has a sitemap.xml file. Most CMS platforms (WordPress, Shopify, Squarespace) generate one automatically.
What the Scanner Detects
The scanner collects four types of browser storage:
- HTTP cookies - set via Set-Cookie response headers, including HttpOnly cookies that cannot be read by JavaScript. These are captured using the browser context API.
- JavaScript cookies - set via document.cookie by client-side scripts such as analytics libraries and tag managers.
- localStorage - persistent key-value storage used by many modern analytics and advertising tools.
- sessionStorage - per-tab storage that persists until the browser tab is closed.
For each item, the scanner records the name, value, domain, path, expiry, and whether it is first-party or third-party relative to your domain.
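As an illustration of the recorded fields and the first-party/third-party distinction, here is a simplified sketch. The `StorageItem` record and `is_first_party` helper are hypothetical names, and the matching rule (same domain or a subdomain of it) is an assumption; a production classifier would typically consult the Public Suffix List to find the registrable domain.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class StorageItem:
    """One recorded cookie or storage entry (field names are illustrative)."""

    name: str
    value: str
    domain: str
    path: str
    expires: Optional[float]  # Unix timestamp, or None for session-only items
    first_party: bool


def is_first_party(item_domain: str, site_domain: str) -> bool:
    """True if item_domain is site_domain itself or one of its subdomains.

    site_domain is assumed to be the site's registrable domain,
    for example "example.com" rather than "www.example.com".
    """
    item = item_domain.lstrip(".").lower()
    site = site_domain.lstrip(".").lower()
    return item == site or item.endswith("." + site)


def record_cookie(name: str, value: str, domain: str, path: str,
                  expires: Optional[float], site_domain: str) -> StorageItem:
    """Build a StorageItem and classify it relative to site_domain."""
    return StorageItem(name, value, domain, path, expires,
                       first_party=is_first_party(domain, site_domain))


session_cookie = record_cookie("session_id", "abc123", ".example.com", "/", None, "example.com")
ad_cookie = record_cookie("IDE", "xyz", ".doubleclick.net", "/", 1767225600.0, "example.com")
```

The `"." + site` suffix check matters: without the leading dot, a domain such as `notexample.com` would be misclassified as first-party.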
Consent Bypass
Many websites use a consent management tool that blocks cookies until the visitor accepts. To discover all cookies your site can potentially set, the Kukie.io scanner simulates full consent before loading each page.
This consent simulation works with at least eight common consent management platforms. The scanner pre-sets consent cookies and fires Google Consent Mode v2 grant signals so that third-party scripts behave as if the visitor has accepted all categories.
If your site uses the Kukie.io banner itself, the scanner sets a special __KUKIE_SCAN_MODE__ flag that tells the banner script to skip rendering entirely. This prevents the scan from interfering with its own consent flow.
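To make the idea concrete, the sketch below builds a small JavaScript snippet that a scanner could inject before any page script runs. Several details are assumptions made for illustration only: that `__KUKIE_SCAN_MODE__` is exposed as a global on `window`, that the grant is sent as a `default` consent command, and the helper name `build_consent_init_script`. The four signal names are the standard Google Consent Mode v2 consent types.

```python
import json

# The four consent types used by Google Consent Mode v2.
CONSENT_MODE_V2_SIGNALS = (
    "ad_storage",
    "ad_user_data",
    "ad_personalization",
    "analytics_storage",
)


def build_consent_init_script(scan_flag: str = "__KUKIE_SCAN_MODE__") -> str:
    """Return JavaScript that marks scan mode and grants all consent signals.

    Assumption: the scan flag is read from a window global; the article
    does not say where the flag actually lives.
    """
    granted = {signal: "granted" for signal in CONSENT_MODE_V2_SIGNALS}
    return "\n".join([
        f"window[{json.dumps(scan_flag)}] = true;",
        "window.dataLayer = window.dataLayer || [];",
        "function gtag() { window.dataLayer.push(arguments); }",
        f"gtag('consent', 'default', {json.dumps(granted)});",
    ])


init_script = build_consent_init_script()
```

With Playwright, a script like this could be registered through the browser context's `add_init_script` method so that it runs before the page's own scripts on every navigation.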
Important: Cookies that are set only in response to user interaction (clicking a button, submitting a form) cannot be detected by the automated scanner. You may need to add these cookies manually.
Scan Limits Per Plan
The number of pages the scanner visits in a single run is capped by your plan:
- Free - up to 100 pages per scan.
- Pro - up to 600 pages per scan.
- Business - up to 2,000 pages per scan (or unlimited, depending on your plan configuration).
If your site has more pages than your plan allows, the scanner processes pages in the order they were discovered (sitemap order or crawl order) and stops when the limit is reached. Upgrade to a higher plan to increase your scan coverage. See Account & Billing for plan details.
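The capping behaviour can be sketched as a simple truncation of the discovered list. The limits come from the plan list above; the function name `select_pages`, the use of `None` for an unlimited configuration, and the choice to place the homepage first are assumptions made for this illustration.

```python
from typing import Dict, List, Optional

# Page limits per plan, as listed above. None would mean "unlimited".
PLAN_PAGE_LIMITS: Dict[str, Optional[int]] = {
    "free": 100,
    "pro": 600,
    "business": 2000,
}


def select_pages(discovered: List[str], homepage: str,
                 limit: Optional[int]) -> List[str]:
    """Return the pages to scan: homepage first, then discovery order, capped.

    Putting the homepage first is an assumption; it guarantees the homepage
    is never cut off by the limit.
    """
    ordered = [homepage] + list(discovered)
    unique = list(dict.fromkeys(ordered))  # drop duplicates, keep order
    return unique[:limit]  # slicing with None keeps everything


HOME = "https://example.com/"
DISCOVERED = [f"https://example.com/page-{i}" for i in range(150)]

free_plan_pages = select_pages(DISCOVERED, HOME, PLAN_PAGE_LIMITS["free"])
```

Here a site with 151 pages (150 discovered plus the homepage) is cut to the first 100 on the Free plan, and the remaining 51 are not scanned.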
Fan-Out Architecture
Kukie.io uses a fan-out scanning architecture for speed and reliability. An orchestrator job discovers URLs and creates one worker job per page. All page jobs run in parallel (within rate limits), so a 500-page site is scanned significantly faster than a sequential crawler would manage.
If an individual page fails (network timeout, server error), the rest of the scan continues. Failed pages are recorded and the scan completes with a "Completed with errors" status. You can review which pages failed in the scan results.
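The fan-out pattern with per-page failure isolation can be sketched with `asyncio`. This is an illustration, not the scanner's implementation: `fake_scan_page` is a hypothetical stand-in for a real browser worker, and the plain "Completed" status for a fully successful run is an assumption (the article only names "Completed with errors").

```python
import asyncio
from typing import Any, Awaitable, Callable, Dict, List


async def fake_scan_page(url: str) -> Dict[str, Any]:
    """Hypothetical page worker: fails for URLs containing 'broken'."""
    await asyncio.sleep(0)  # yield control, as real network I/O would
    if "broken" in url:
        raise TimeoutError(f"timed out loading {url}")
    return {"url": url, "cookies": ["session_id"]}


async def run_scan(
    urls: List[str],
    worker: Callable[[str], Awaitable[Dict[str, Any]]],
) -> Dict[str, Any]:
    """Orchestrator: start one job per page and collect results and failures."""
    # return_exceptions=True stops one failing page from cancelling the rest.
    outcomes = await asyncio.gather(
        *(worker(url) for url in urls), return_exceptions=True
    )
    pages: List[Dict[str, Any]] = []
    failed: Dict[str, str] = {}
    for url, outcome in zip(urls, outcomes):
        if isinstance(outcome, BaseException):
            failed[url] = repr(outcome)
        else:
            pages.append(outcome)
    status = "Completed with errors" if failed else "Completed"
    return {"status": status, "pages": pages, "failed": failed}


demo_scan = asyncio.run(run_scan(
    [
        "https://example.com/",
        "https://example.com/broken",
        "https://example.com/about",
    ],
    fake_scan_page,
))
```

The failed page is reported alongside its error, while the other two pages still contribute their results.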
Rate Limiting
To avoid overloading your web server, the scanner enforces a per-domain rate limit. No more than three concurrent requests are made to the same domain at any given time. This keeps the scan fast without impacting your site's performance for real visitors.
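A per-domain concurrency cap like this is commonly implemented with one semaphore per domain. The sketch below assumes that approach; the `DomainLimiter` class and the simulated request are hypothetical, and only the limit of three comes from the description above.

```python
import asyncio
from typing import Dict
from urllib.parse import urlparse

MAX_CONCURRENT_PER_DOMAIN = 3  # limit stated above


class DomainLimiter:
    """Hands out one semaphore per domain, created on first use."""

    def __init__(self, limit: int = MAX_CONCURRENT_PER_DOMAIN) -> None:
        self._limit = limit
        self._semaphores: Dict[str, asyncio.Semaphore] = {}

    def for_url(self, url: str) -> asyncio.Semaphore:
        domain = urlparse(url).hostname or ""
        if domain not in self._semaphores:
            self._semaphores[domain] = asyncio.Semaphore(self._limit)
        return self._semaphores[domain]


async def measure_peak_concurrency(page_count: int) -> int:
    """Run page_count simulated requests and return the peak number in flight."""
    limiter = DomainLimiter()
    active = 0
    peak = 0

    async def simulated_request(url: str) -> None:
        nonlocal active, peak
        async with limiter.for_url(url):  # waits while 3 requests are in flight
            active += 1
            peak = max(peak, active)
            await asyncio.sleep(0.01)  # stands in for loading the page
            active -= 1

    urls = [f"https://example.com/page-{i}" for i in range(page_count)]
    await asyncio.gather(*(simulated_request(u) for u in urls))
    return peak


peak_in_flight = asyncio.run(measure_peak_concurrency(10))
```

Even though all ten jobs are started at once, the semaphore lets only three proceed at a time, so the measured peak never exceeds the limit.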