
tezcatl: a 2MB alternative to Puppeteer for scraping on macOS

~1,000 words • 6 minute read

I've been working on a small project that uses whereami and nearme to do some local scraping and experimentation with LLMs to build enriched datasets. The details of that project are for a future post, but the short version is I needed to fetch web pages, extract some data and move on.

A quick cut to the chase:

If you're on a Mac and find yourself needing to do light scraping work on websites as they are rendered, not as they are sent over the wire, please consider trying tezcatl.

Some selling points:

  • A little like Puppeteer but only ~2MB instead of nearly 300MB.
  • Built around WebKit, which is already on your Mac, and only cares about returning an accurate snapshot of the DOM.
  • About as lightweight as a tool like this can be; has no dependencies or build steps you'll ever have to think about.
  • Can become a tool in your scraping + parsing toolkit right alongside jq, curl and other CLI affordances you reach for daily.

For those that want more context and story, read on!

The problem

curl handles about 70% of what I need. Most pages serve a usable version of their content in the initial HTML response and you can pipe curl into whatever you want—namely Simon Willison's llm tool in this case, which I adore.

But some sites render everything client-side with JavaScript and come back as soulless husks. A <div id="root"></div> and a pile of <script> tags—an illegible affront to a medium inherently centered around reading and the written word. Oh, the horror. The horror.

The standard answer to this reality is Puppeteer or Playwright. Spin up a headless Chromium, wait for the page to render, grab the DOM. It works, but it's a lot of overhead for what I actually needed, which was just to load a URL, wait until the page loaded and grab the HTML.

I didn't need cross-platform, cross-browser QA automations, screenshots or HAR dumps. I just needed the rendered DOM so I could strip the tags and ask an LLM to do some things with the actual words.

Every Mac ships with WebKit. It's the same engine Safari uses. Apple exposes it through WKWebView, which is how native apps embed web content. You can use it from a CLI tool.

So I made tezcatl. It's not for automating browsers or running tests. It's just for scraping web pages that don't render without JavaScript.

$ tezcatl https://example.com
<html lang="en"><head><title>Example Domain</title>...

$ tezcatl https://spa-site.com --wait=2000
# waits 2s after load for JS to render

It creates an offscreen WKWebView, loads the URL, waits for the navigation delegate to fire, optionally pauses for additional JS settling time, then evaluates JavaScript against the page (if specified) and writes the result to stdout. By default it returns the full rendered DOM. With --eval you can run arbitrary JS instead:

$ tezcatl https://example.com --eval="document.title"
Example Domain

$ tezcatl https://example.com --eval="document.querySelectorAll('a').length"
1
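
Because --eval prints whatever the expression evaluates to, the tag stripping I mentioned earlier can happen inside the page instead of afterwards. Here is a rough sketch, assuming document.body.innerText gives you text that is good enough for your purposes. The helper names page_text and summarize_page are made up for illustration, and the llm prompt is only an example:

```shell
# Hypothetical helpers, not part of tezcatl itself.

# Print the page's visible text instead of its markup.
# document.body.innerText is a standard DOM property; whether it captures
# everything you care about depends on the site.
page_text() {
  tezcatl "$1" --eval="document.body.innerText"
}

# Pipe that text into Simon Willison's llm tool, passing the instructions
# as a system prompt with -s. The prompt wording is just a placeholder.
summarize_page() {
  page_text "$1" | llm -s "Summarize this page in three bullet points."
}
```

With those defined, summarize_page https://example.com renders the page, hands the text to llm and prints whatever the model returns.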

A real example

Apple's own developer documentation (at the time of this writing) renders in the client and seems to be a Vue app. curl gives you this:

This page requires JavaScript.

With tezcatl you can render the page and pull structured data out of it:

$ tezcatl https://developer.apple.com/documentation/ --wait=3000 \
    --eval="JSON.stringify([...document.querySelectorAll('a.card')].slice(0,5).map(c => ({
      title: c.querySelector('.title')?.textContent?.trim(),
      description: c.querySelector('.card-content .content')?.textContent?.trim(),
      url: c.href
    })), null, 2)"

You'll get something like this:

[
    {
        "title": "Explore the new design principles",
        "description": "Learn how to design and develop beautiful interfaces that leverage Liquid Glass.",
        "url": "https://developer.apple.com/documentation/TechnologyOverviews/liquid-glass"
    },
    {
        "title": "Adopting Liquid Glass",
        "description": "Find out how to bring the new material to your app.",
        "url": "https://developer.apple.com/documentation/TechnologyOverviews/adopting-liquid-glass"
    }
]

It also plays well with pipes, which was another reason I rolled my own solution. I wanted something that fit into the CLI workflows I was already building with whereami, nearme, lingua and loupe.

# Get the rendered DOM and extract text with lingua
tezcatl https://example.com | lingua detect

# Scrape a title for use in a script
TITLE=$(tezcatl https://example.com --eval="document.title")

# Find the nearest pizza place, grab its website, render it
whereami --json | nearme "pizza" --json | jq -r '.[0].url' | xargs tezcatl
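
Since curl already covers most pages, one way to wire the two together is to try curl first and only render when the response looks like an empty shell. This is a sketch, not something tezcatl ships. visible_chars and fetch_rendered are made-up names, and the 200-character threshold is an arbitrary guess you would want to tune. The tag stripping is deliberately crude: it counts inline script bodies as text, so some client-rendered pages will slip past it.

```shell
# Hypothetical wrapper: curl first, WebKit render only when needed.

# Rough count of visible characters: join lines, strip anything that looks
# like a tag, drop whitespace, count what is left.
visible_chars() {
  tr '\n' ' ' | sed 's/<[^>]*>//g' | tr -d '[:space:]' | wc -c | tr -d ' '
}

# Usage: fetch_rendered URL [MIN_CHARS]
fetch_rendered() {
  url=$1
  min_chars=${2:-200}   # assumed threshold, tune for your sites

  # A failed request is treated the same as an empty page.
  raw=$(curl -fsSL "$url") || raw=""
  n=$(printf '%s' "$raw" | visible_chars)

  if [ "$n" -ge "$min_chars" ]; then
    # The wire HTML already carries readable content.
    printf '%s\n' "$raw"
  else
    # Looks like a client-rendered husk, so render it properly.
    tezcatl "$url" --wait=2000
  fi
}
```

The design choice here is simply to keep the cheap path cheap: the render only happens for the minority of pages that need it.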

How it's built

Like the rest of my Zig tools, tezcatl talks to the Objective-C runtime directly. No Swift or Objective-C source files. Zig calls objc_msgSend and friends to create a WKWebView, register a navigation delegate class at runtime with objc_allocateClassPair, and wire up the completion handler using the ObjC block ABI.

WebKit's evaluateJavaScript:completionHandler: expects the completion handler to be an Objective-C block—not a function pointer. Blocks have a documented C ABI, so you can build a struct that pretends to be one and WebKit will call your function. In Zig:

const JSBlockLiteral = extern struct {
    isa: *anyopaque,
    flags: c_int,
    reserved: c_int,
    invoke: *const fn (*JSBlockLiteral, ?objc.id, ?objc.id) callconv(.c) void,
    descriptor: *const BlockDescriptor,
};

isa tells the ObjC runtime "I am a block" (you point it at _NSConcreteStackBlock). flags and reserved are bookkeeping the runtime expects; you set them to 0. invoke is the actual function pointer — when WebKit finishes evaluating JS, it calls this with the result and error. descriptor points to a tiny struct that just says how big the block is. The layout has to match what the ObjC compiler would emit for a ^void(id result, id error) block, but as long as the fields are in the right order with the right sizes, the runtime doesn't care what language built it.

Zig is honestly kind of a weird choice for a tool that's designed to be so deeply macOS native. The reason it's written this way is that I've been building a bunch of cross-platform CLI tools in Zig and I have an objc.zig module that I copy between projects — it handles objc_msgSend, class creation, block construction, all the runtime bridging. For tools like whereami and loupe that also build as C-compatible shared libraries, having the bridge in pure Zig means the library is self-contained with no ObjC compilation step. For tezcatl that doesn't matter much, but I'm in the habit and the pattern works.

WebKit is a GUI framework underneath and assumes it's running inside an application with an event loop. In a normal macOS app, NSApplicationMain spins that up and everything works. In a CLI there's no app, no window, no event loop. WebKit will accept your loadRequest: call and then do nothing, because nobody is pumping the events that drive the network and rendering work.

The fix is straightforward: pump it yourself. After telling the WKWebView to load a URL, I call CFRunLoopRunInMode(kCFRunLoopDefaultMode, timeout, false), which hands control to the system run loop until either my navigation delegate callback fires (and calls CFRunLoopStop) or the timeout expires. Same thing after evaluateJavaScript:completionHandler:; pump the loop, wait for the block callback. The Dock icon is suppressed with NSApplicationActivationPolicyAccessory so the whole thing stays invisible.

When it's the wrong tool

If you need to crawl thousands of pages or run on Linux, use Puppeteer or Playwright. They exist for good reason. This is not a replacement!

tezcatl is for when you're on a Mac, you need a handful of pages to render their JS and you want to stay in the terminal. The kind of thing where spinning up a headless Chromium feels like driving a semi truck to the corner store.
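
For that handful-of-pages case, a plain shell loop is probably all the orchestration you need. Here is a minimal sketch, assuming a text file with one URL per line and assuming tezcatl exits non-zero when a load fails (check that before relying on it). render_all is a hypothetical name:

```shell
# Hypothetical batch helper: render each URL in a list to its own file.
# Usage: render_all urls.txt out_dir
render_all() {
  list=$1
  outdir=$2
  mkdir -p "$outdir" || return 1

  i=0
  failed=0
  # The "|| [ -n ... ]" part picks up a last line with no trailing newline.
  while IFS= read -r url || [ -n "$url" ]; do
    # Skip blank lines and lines starting with #.
    case "$url" in ''|'#'*) continue ;; esac
    i=$((i + 1))

    # </dev/null stops the command from consuming the rest of the URL list.
    if ! tezcatl "$url" --wait=2000 </dev/null >"$outdir/page-$i.html"; then
      echo "render failed: $url" >&2
      failed=$((failed + 1))
    fi
  done <"$list"

  # Succeed only if every page rendered.
  [ "$failed" -eq 0 ]
}
```

Pages render one at a time, which is fine for a dozen URLs and is exactly the point where thousands of them would send you back to Puppeteer or Playwright.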

On naming

I wanted a name that suggested "seeing the true form of something," since that's what the tool does: it strips away the loading spinners and empty <div>s and shows you the page left standing after all the little JavaScript cycles have finished. It also wouldn't hurt if it was a little bit metal.

tezcatl is the Nahuatl word for mirror—specifically an obsidian mirror. It's hard to look it up without running into Tezcatlipoca, the Aztec deity associated with an obsidian mirror that could see through illusions. The Getty has a great writeup on the mirror and its history, and, man, if ever there were to be an icon for a CLI tool... Apologies for mild appropriation, but the word is great and checked all the boxes for me.

It's on GitHub and in my Homebrew tap:

brew install georgemandis/tap/tezcatl

Published on Friday, May 29th 2026.