Ferret
a web scraping system
aiming to simplify data extraction
from the web
Declarative
ferret has a declarative query language that makes it easy to focus on a data that you need to get.
Dynamic pages support
ferret has the ability to scrape JS rendered pages, handle all page events and emulate user interactions.
Embeddable
ferret was designed as a library from the ground up. it can be easily embedded into any Go application.
Hello, Ferret!
ferret helps you to focus on the data you need using an easy to learn declarative language
LET doc = DOCUMENT('https://github.com/topics')
FOR el IN ELEMENTS(doc, '.py-4.border-bottom')
LIMIT 10
LET url = ELEMENT(el, 'a')
LET name = ELEMENT(el, '.f3')
LET description = ELEMENT(el, '.f5')
RETURN {
name: TRIM(name.innerText),
description: TRIM(description.innerText),
url: 'https://github.com' + url.attributes.href
}
LET doc = DOCUMENT('https://soundcloud.com/charts/top', {
driver: 'cdp'
})
WAIT_ELEMENT(doc, '.chartTrack__details', 5000)
LET tracks = ELEMENTS(doc, '.chartTrack__details')
FOR track IN tracks
RETURN {
artist: TRIM(INNER_TEXT(track, '.chartTrack__username')),
track: TRIM(INNER_TEXT(track, '.chartTrack__title'))
}
Dynamic pages handled easily
ferret uses Chrome/Chromium via Chrome Devtools Protocol to handle dynamically rendered web pages
Simple extensibility
ferret is extremely extensible - creating custom functions and types is super easy
transform := func(ctx context.Context, args ...core.Value) (core.Value, error) {
str := args[0].(values.String)
return values.NewString(strings.ToUpper(str.String() + "_ferret")), nil
}
query := `
FOR el IN ["foo", "bar", "qaz"]
RETURN TRANSFORM(el)
`
comp := compiler.New()
if err := comp.RegisterFunction("transform", transform); err != nil {
return nil, err
}
program, err := comp.Compile(query)