Ferret

A web scraping system that aims to simplify data extraction from the web.

Declarative

Ferret uses a declarative query language, so you focus on the data you need, not the code to get it.

Dynamic pages support

Ferret handles JavaScript-rendered pages, page events, and user interactions out of the box.

Embeddable

Ferret is designed as a library first and embeds cleanly into any Go application.

Declarative web data extraction

Ferret is a declarative language for extracting structured data from the web.

You focus on what data you want (lists, fields, transformations) while Ferret takes care of browsers, async loading, retries, and HTML quirks.

The result is concise, readable queries that scale from small experiments to production pipelines.

    // Open the GitHub Topics page and load it as a document
    LET doc = DOCUMENT('https://github.com/topics')

    // Iterate over topic blocks on the page
    // Each ".py-4.border-bottom" element represents a single topic card
    FOR el IN ELEMENTS(doc, '.py-4.border-bottom')
        // Limit the output to the first 10 topics
        LIMIT 10

        // Extract the main link element for the topic
        LET url = ELEMENT(el, 'a')

        // Extract the topic name element
        LET name = ELEMENT(el, '.f3')

        // Extract the topic description element
        LET description = ELEMENT(el, '.f5')

        // Build and return a structured object for each topic
        RETURN {
            // Topic name text
            name: TRIM(name.innerText),

            // Short topic description
            description: TRIM(description.innerText),

            // Absolute URL to the topic page
            // The link on the page is relative, so we prepend the GitHub domain
            url: 'https://github.com' + url.attributes.href
        }
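The query above returns one object per topic with `name`, `description`, and `url` fields. As a minimal sketch, assuming the host application receives that result as a JSON array, it could be decoded into Go structs using only the standard library. The `Topic` type, the `decodeTopics` helper, and the sample payload are hypothetical and exist only for illustration.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// Topic mirrors the object shape returned by the topics query.
// Hypothetical type: the field names follow the query's RETURN clause.
type Topic struct {
	Name        string `json:"name"`
	Description string `json:"description"`
	URL         string `json:"url"`
}

// decodeTopics parses a JSON array of topic objects.
// Assumption: the query result reaches the host as JSON-encoded bytes.
func decodeTopics(data []byte) ([]Topic, error) {
	var topics []Topic
	if err := json.Unmarshal(data, &topics); err != nil {
		return nil, fmt.Errorf("decode topics: %w", err)
	}
	return topics, nil
}

func main() {
	// Hypothetical sample payload, not real scraped data.
	sample := []byte(`[{"name":"Example","description":"A placeholder topic","url":"https://github.com/topics/example"}]`)

	topics, err := decodeTopics(sample)
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	for _, t := range topics {
		fmt.Printf("%s -> %s\n", t.Name, t.URL)
	}
}
```

Decoding into a typed struct keeps the rest of the application independent of how the data was scraped.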

    // Open the SoundCloud Top Charts page using a browser-based driver (CDP)
    // This allows Ferret to execute JavaScript and work with dynamic content
    LET doc = DOCUMENT('https://soundcloud.com/charts/top', { driver: 'cdp' })

    // Wait until at least one chart tile is present on the page
    // This is important because SoundCloud loads content asynchronously
    WAIT_ELEMENT(doc, '.audibleTile', 5000)

    // Select all track tiles from the page
    LET tracks = ELEMENTS(doc, '.audibleTile')

    // Iterate over each track tile and extract useful data
    FOR track IN tracks
        RETURN {
            // Chart position / description text shown on the tile
            chart: TRIM(INNER_TEXT(track, '.playableTile__descriptionContainer')),

            // Build an absolute URL to the track page
            // The link on the page is relative, so we prepend the SoundCloud domain
            link: "https://soundcloud.com" + TRIM(ELEMENT(track, '.playableTile__artworkLink')?.attributes.href)
        }
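Both queries above build absolute URLs by prepending the site's domain to a relative link. If that step were moved into the Go host application instead, the standard `net/url` package could resolve the link against a base URL. This is only a sketch of that alternative, not part of Ferret; the `absoluteURL` helper is a hypothetical name.

```go
package main

import (
	"fmt"
	"net/url"
)

// absoluteURL resolves href against base.
// Relative links are joined to the base; links that are already
// absolute are returned unchanged.
func absoluteURL(base, href string) (string, error) {
	b, err := url.Parse(base)
	if err != nil {
		return "", fmt.Errorf("parse base: %w", err)
	}
	ref, err := url.Parse(href)
	if err != nil {
		return "", fmt.Errorf("parse href: %w", err)
	}
	return b.ResolveReference(ref).String(), nil
}

func main() {
	// The path below is the chart page used in the query above.
	link, err := absoluteURL("https://soundcloud.com", "/charts/top")
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println(link) // prints "https://soundcloud.com/charts/top"
}
```

Unlike plain concatenation, this approach also copes with links that are already absolute.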

Dynamic pages, no extra work

Modern websites rely heavily on JavaScript. Ferret handles that for you.

By running pages in a real browser via the Chrome DevTools Protocol, Ferret lets you extract data from dynamic sites without custom scripts or brittle workarounds.


No plugins. No macros. Just Go.

Ferret exposes a minimal, explicit extension API that lets you bind Go functions and types into the query runtime.

This makes it easy to keep domain-specific logic in Go while using Ferret for data selection and transformation.

    // Define a Go function that can be called from a Ferret query
    // The function receives a context and a variadic list of Ferret values
    transform := func(ctx context.Context, args ...core.Value) (core.Value, error) {
        // Extract the first argument and assert it to a string value
        str := args[0].(values.String)

        // Convert the string to upper case, append a suffix,
        // and return it as a new Ferret string value
        return values.NewString(strings.ToUpper(str.String() + "_ferret")), nil
    }

    // A simple Ferret query that applies the custom TRANSFORM function
    // to each element in a list
    query := `
        FOR el IN ["foo", "bar", "qaz"]
            RETURN TRANSFORM(el)
    `

    // Create a new Ferret compiler instance
    comp := compiler.New()

    // Register the Go function so it becomes available inside queries
    // It will be exposed as TRANSFORM(...)
    if err := comp.RegisterFunction("transform", transform); err != nil {
        return nil, err
    }

    // Compile the query into an executable program
    program, err := comp.Compile(query)
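One way to keep such domain logic easy to test is to put the string handling in a plain Go function and let the registered function act as a thin wrapper around it. The sketch below uses only the standard library and no Ferret types; `transformString` is a hypothetical helper that mirrors the body of the function above.

```go
package main

import (
	"fmt"
	"strings"
)

// transformString appends the "_ferret" suffix and upper-cases the result,
// mirroring the logic inside the TRANSFORM function shown above.
func transformString(s string) string {
	return strings.ToUpper(s + "_ferret")
}

func main() {
	// The same inputs the example query iterates over.
	for _, el := range []string{"foo", "bar", "qaz"} {
		fmt.Println(transformString(el))
	}
	// Prints:
	// FOO_FERRET
	// BAR_FERRET
	// QAZ_FERRET
}
```

Because the helper works on ordinary strings, it can be unit-tested without compiling or running a query.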