colly is a powerful crawler framework written in Go. It provides a simple API, strong performance, automatic handling of cookies and sessions, and a flexible extension mechanism.

First, we introduce the basic concepts of colly. Then we demonstrate its usage and features with a few examples: pulling GitHub Trending, pulling the Baidu novel hotlist, and downloading images from Unsplash.

Quick Use

Create the directory and initialize.

$ mkdir colly && cd colly
$ go mod init github.com/darjun/go-daily-lib/colly

Install the colly library.

$ go get -u github.com/gocolly/colly/v2

Use it:

package main

import (
  "fmt"

  "github.com/gocolly/colly/v2"
)

func main() {
  c := colly.NewCollector(
    colly.AllowedDomains("www.baidu.com"),
  )

  c.OnHTML("a[href]", func(e *colly.HTMLElement) {
    link := e.Attr("href")
    fmt.Printf("Link found: %q -> %s\n", e.Text, link)
    c.Visit(e.Request.AbsoluteURL(link))
  })

  c.OnRequest(func(r *colly.Request) {
    fmt.Println("Visiting", r.URL.String())
  })

  c.OnResponse(func(r *colly.Response) {
    fmt.Printf("Response %s: %d bytes\n", r.Request.URL, len(r.Body))
  })

  c.OnError(func(r *colly.Response, err error) {
    fmt.Printf("Error %s: %v\n", r.Request.URL, err)
  })

  c.Visit("http://www.baidu.com/")
}

The use of colly is relatively simple.

First, call colly.NewCollector() to create a crawler object of type *colly.Collector. Since every page contains many links to other pages, an unrestricted crawl may never stop, so the example above restricts crawling to pages on the domain www.baidu.com by passing the option colly.AllowedDomains("www.baidu.com").
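
As an aside, another way to keep a crawl bounded, which we will use in the GitHub Trending example below, is to limit the crawl depth:

c := colly.NewCollector(
  colly.MaxDepth(1), // with depth 1, only the starting page itself is crawled
)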

Then we call the c.OnHTML method to register an HTML callback, which is executed for every a element that has an href attribute. Here we continue to visit the URL pointed to by href; in other words, after parsing the crawled page we follow its links to other pages.

Call the c.OnRequest() method to register a request callback, which is executed every time a request is sent; here it simply prints the request URL.

Call the c.OnResponse() method to register a response callback, which is executed every time a response is received; here it simply prints the URL and the response size.

Call the c.OnError() method to register an error callback, which is executed when a request fails; here it simply prints the URL and the error message.

Finally, we call c.Visit() to start visiting the first page.

Run:

$ go run main.go
Visiting http://www.baidu.com/
Response http://www.baidu.com/: 303317 bytes
Link found: "百度首页" -> /
Link found: "设置" -> javascript:;
Link found: "登录" -> https://passport.baidu.com/v2/?login&tpl=mn&u=http%3A%2F%2Fwww.baidu.com%2F&sms=5
Link found: "新闻" -> http://news.baidu.com
Link found: "hao123" -> https://www.hao123.com
Link found: "地图" -> http://map.baidu.com
Link found: "直播" -> https://live.baidu.com/
Link found: "视频" -> https://haokan.baidu.com/?sfrom=baidu-top
Link found: "贴吧" -> http://tieba.baidu.com
...

After colly crawls a page, it parses the page with goquery. It then looks up the HTML callbacks registered for the matching element selectors, wraps each matching goquery.Selection in a colly.HTMLElement, and executes the callback.

colly.HTMLElement is essentially a thin wrapper around goquery.Selection, and it provides a set of easy-to-use methods (a short example follows the list):

  • Attr(k string): returns the attribute of the current element; in the example above we used e.Attr("href") to get the href attribute.
  • ChildAttr(goquerySelector, attrName string): returns the attrName attribute of the first child element matched by goquerySelector.
  • ChildAttrs(goquerySelector, attrName string): returns the attrName attributes of all child elements matched by goquerySelector, as a []string.
  • ChildText(goquerySelector string): concatenates and returns the text content of the child elements matched by goquerySelector.
  • ChildTexts(goquerySelector string): returns the text content of the child elements matched by goquerySelector, as a []string.
  • ForEach(goquerySelector string, callback func(int, *HTMLElement)): executes callback for each child element matched by goquerySelector.
  • Unmarshal(v interface{}): unmarshals the HTMLElement into a struct instance; each struct field specifies its selector via a tag in goquery-selector format.
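
A quick illustration of a few of these methods, registered on the collector c from the quick-use example; the div.article, h1, img and li.tag selectors are made up purely for demonstration:

c.OnHTML("div.article", func(e *colly.HTMLElement) {
  title := e.ChildText("h1")            // concatenated text of the matching child elements
  imgSrcs := e.ChildAttrs("img", "src") // every matching img's src attribute, as a []string
  e.ForEach("li.tag", func(i int, tag *colly.HTMLElement) {
    fmt.Println("tag", i, tag.Text)     // runs once per matched child element
  })
  fmt.Println(title, imgSrcs)
})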

These methods will be used frequently below. Next, we walk through colly's features and usage with a few examples.

GitHub Trending

package main

import (
  "fmt"
  "strconv"
  "strings"

  "github.com/gocolly/colly/v2"
)

type Repository struct {
  Author  string
  Name    string
  Link    string
  Desc    string
  Lang    string
  Stars   int
  Forks   int
  Add     int
  BuiltBy []string
}

func main() {
  c := colly.NewCollector(
    colly.MaxDepth(1),
  )

  repos := make([]*Repository, 0, 15)
  c.OnHTML(".Box .Box-row", func(e *colly.HTMLElement) {
    repo := &Repository{}

    // author & repository name
    authorRepoName := e.ChildText("h1.h3 > a")
    parts := strings.Split(authorRepoName, "/")
    repo.Author = strings.TrimSpace(parts[0])
    repo.Name = strings.TrimSpace(parts[1])

    // link
    repo.Link = e.Request.AbsoluteURL(e.ChildAttr("h1.h3 > a", "href"))

    // description
    repo.Desc = e.ChildText("p.pr-4")

    // language
    repo.Lang = strings.TrimSpace(e.ChildText("div.mt-2 > span.mr-3 > span[itemprop]"))

    // star & fork
    starForkStr := e.ChildText("div.mt-2 > a.mr-3")
    starForkStr = strings.Replace(strings.TrimSpace(starForkStr), ",", "", -1)
    parts = strings.Split(starForkStr, "\n")
    repo.Stars, _ = strconv.Atoi(strings.TrimSpace(parts[0]))
    repo.Forks, _ = strconv.Atoi(strings.TrimSpace(parts[len(parts)-1]))

    // add
    addStr := e.ChildText("div.mt-2 > span.float-sm-right")
    parts = strings.Split(addStr, " ")
    repo.Add, _ = strconv.Atoi(parts[0])

    // built by
    e.ForEach("div.mt-2 > span.mr-3 img[src]", func(index int, img *colly.HTMLElement) {
      repo.BuiltBy = append(repo.BuiltBy, img.Attr("src"))
    })

    repos = append(repos, repo)
  })

  c.Visit("https://github.com/trending")

  fmt.Printf("%d repositories\n", len(repos))
  fmt.Println("first repository:")
  for _, repo := range repos {
    fmt.Println("Author:", repo.Author)
    fmt.Println("Name:", repo.Name)
    break
  }
}

We use ChildText to get the author, repository name, language, number of stars and forks, today's additions, and so on, and ChildAttr to get the repository link, which is a relative path that we convert to an absolute path by calling e.Request.AbsoluteURL().

Run.

$ go run main.go
25 repositories
first repository:
Author: Shopify
Name: dawn

Baidu Novel Hotlist

The structure of the web page, and of each hotlist entry, is as follows:

  • each hotlist entry is in a div.category-wrap_iQLoo.
  • the div.index_1Ew5p under the a element is the rank.
  • the content is in div.content_1YWBm.
  • the a.title_dIF3B inside the content is the title.
  • the content contains two div.intro_1l0wp elements, the first being the author and the second being the type.
  • the div.desc_3CTjT inside the content is the description.

From this we define the struct:

type Hot struct {
  Rank   string `selector:"a > div.index_1Ew5p"`
  Name   string `selector:"div.content_1YWBm > a.title_dIF3B"`
  Author string `selector:"div.content_1YWBm > div.intro_1l0wp:nth-child(2)"`
  Type   string `selector:"div.content_1YWBm > div.intro_1l0wp:nth-child(3)"`
  Desc   string `selector:"div.desc_3CTjT"`
}

The tags use CSS selector syntax; they are added so that the HTMLElement.Unmarshal() method can be called directly to populate a Hot object.

Then create the Collector object.

c := colly.NewCollector()
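
The callbacks registered next append entries to a slice of hotlist items; its declaration is omitted from the snippets, but it presumably looks something like this:

hots := make([]*Hot, 0, 30)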

Register the callbacks:

c.OnHTML("div.category-wrap_iQLoo", func(e *colly.HTMLElement) {
  hot := &Hot{}

  err := e.Unmarshal(hot)
  if err != nil {
    fmt.Println("error:", err)
    return
  }

  hots = append(hots, hot)
})

c.OnRequest(func(r *colly.Request) {
  fmt.Println("Requesting:", r.URL)
})

c.OnResponse(func(r *colly.Response) {
  fmt.Println("Response:", len(r.Body))
})

OnHTML calls Unmarshal on each entry to build a Hot object and appends it to hots.

OnRequest/OnResponse simply output debugging information.

Then, call c.Visit() to access the URL.

err := c.Visit("https://top.baidu.com/board?tab=novel")
if err != nil {
  fmt.Println("Visit error:", err)
  return
}

Finally, add some debugging prints:

fmt.Printf("%d hots\n", len(hots))
for _, hot := range hots {
  fmt.Println("first hot:")
  fmt.Println("Rank:", hot.Rank)
  fmt.Println("Name:", hot.Name)
  fmt.Println("Author:", hot.Author)
  fmt.Println("Type:", hot.Type)
  fmt.Println("Desc:", hot.Desc)
  break
}

Run output.

Requesting: https://top.baidu.com/board?tab=novel
Response: 118083
30 hots
first hot:
Rank: 1
Name: 逆天邪神
Author: 作者:火星引力
Type: 类型:玄幻
Desc: 掌天毒之珠,承邪神之血,修逆天之力,一代邪神,君临天下!  查看更多>

Unsplash

I get most of the background images for my articles from unsplash, a site that offers a large, rich, and free collection of images. One problem with the site is that it is slow to access. Since we are learning to crawl anyway, we might as well use a program to download the images automatically.

The unsplash home page is shown in the following image.

The web page structure is as follows.

However, the home page only shows smaller-sized images; clicking a link takes us to the page of a particular image.

Because three layers of pages are involved (the img itself must be fetched at the end), setting up the OnHTML callbacks with a single colly.Collector object would require extra care and put a fairly heavy mental burden on the code. colly supports multiple Collectors, so we use that approach instead:

func main() {
  c1 := colly.NewCollector()
  c2 := c1.Clone()
  c3 := c1.Clone()

  c1.OnHTML("figure[itemProp] a[itemProp]", func(e *colly.HTMLElement) {
    href := e.Attr("href")
    if href == "" {
      return
    }

    c2.Visit(e.Request.AbsoluteURL(href))
  })

  c2.OnHTML("div._1g5Lu > img[src]", func(e *colly.HTMLElement) {
    src := e.Attr("src")
    if src == "" {
      return
    }

    c3.Visit(src)
  })

  c1.OnRequest(func(r *colly.Request) {
    fmt.Println("Visiting", r.URL)
  })

  c1.OnError(func(r *colly.Response, err error) {
    fmt.Println("Visiting", r.Request.URL, "failed:", err)
  })
  // c3's download callbacks and the initial c1.Visit call are added below
}

We use 3 Collector objects: the first Collector collects the image-page links from the home page, the second Collector visits those image pages, and the third Collector downloads the actual images. Above we also registered request and error callbacks for the first Collector.

The third Collector downloads the specific image content and saves it locally:

func main() {
  // ... ignore
  var count uint32
  c3.OnResponse(func(r *colly.Response) {
    // the images directory must exist before running
    fileName := fmt.Sprintf("images/img%d.jpg", atomic.AddUint32(&count, 1))
    err := r.Save(fileName)
    if err != nil {
      fmt.Printf("saving %s failed:%v\n", fileName, err)
    } else {
      fmt.Printf("saving %s success\n", fileName)
    }
  })

  c3.OnRequest(func(r *colly.Request) {
    fmt.Println("visiting", r.URL)
  })

  // start crawling from the unsplash home page
  c1.Visit("https://unsplash.com")
}

The above uses atomic.AddUint32() to generate sequence numbers for the images; this stays correct even when we switch to asynchronous crawling below.

Run the program to crawl and download the images.

Asynchronous

By default, colly crawls web pages synchronously, i.e. one after another, as in the unsplash program above. This takes a long time. colly also supports asynchronous crawling; we just need to pass the option colly.Async(true) when constructing the Collector object:

c1 := colly.NewCollector(
  colly.Async(true),
)

However, since the crawl is now asynchronous, the program must wait for the Collectors to finish processing at the end; otherwise main returns early and the program exits (c2 and c3 are created via c1.Clone(), so they inherit the Async option):

c1.Wait()
c2.Wait()
c3.Wait()

Running again, much faster 😀.

Second Version

Scrolling down the unsplash page, we see that the images further down are loaded asynchronously. Scroll down the page and watch the requests in the network tab of the Chrome browser.

The requests go to the path /photos with per_page and page parameters and return a JSON array. So there is an alternative approach.

Define a structure for each item, where we keep only the necessary fields.

type Item struct {
  Id     string
  Width  int
  Height int
  Links  Links
}

type Links struct {
  Download string
}

The JSON is then parsed in the OnResponse callback, and for each Download link we call the Visit() method of the Collector responsible for downloading images (c below fetches the JSON lists, d downloads the images):

c.OnResponse(func(r *colly.Response) {
  var items []*Item
  json.Unmarshal(r.Body, &items)
  for _, item := range items {
    d.Visit(item.Links.Download)
  }
})

To kick off the crawl, we request 3 pages, 12 items per page (matching what the web page itself requests):

for page := 1; page <= 3; page++ {
  c.Visit(fmt.Sprintf("https://unsplash.com/napi/photos?page=%d&per_page=12", page))
}
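
Putting the fragments together, a complete version of this approach might look roughly like the sketch below; it assumes the download Collector d reuses the same Save logic as c3 in the first version:

package main

import (
  "encoding/json"
  "fmt"
  "sync/atomic"

  "github.com/gocolly/colly/v2"
)

// Item and Links as defined above.
type Item struct {
  Id     string
  Width  int
  Height int
  Links  Links
}

type Links struct {
  Download string
}

func main() {
  c := colly.NewCollector()
  d := c.Clone()

  var count uint32
  d.OnResponse(func(r *colly.Response) {
    // save each downloaded image, numbered with an atomic counter
    fileName := fmt.Sprintf("images/img%d.jpg", atomic.AddUint32(&count, 1))
    if err := r.Save(fileName); err != nil {
      fmt.Printf("saving %s failed:%v\n", fileName, err)
    }
  })

  c.OnResponse(func(r *colly.Response) {
    // parse the JSON list and hand each download link to d
    var items []*Item
    if err := json.Unmarshal(r.Body, &items); err != nil {
      return
    }
    for _, item := range items {
      d.Visit(item.Links.Download)
    }
  })

  for page := 1; page <= 3; page++ {
    c.Visit(fmt.Sprintf("https://unsplash.com/napi/photos?page=%d&per_page=12", page))
  }
}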

Run the program and view the downloaded images.

Request Limits

Sometimes too many concurrent requests will cause the site to restrict access. That is where LimitRule comes in. Simply put, a LimitRule limits access speed and concurrency:

type LimitRule struct {
  DomainRegexp string
  DomainGlob   string
  Delay        time.Duration
  RandomDelay  time.Duration
  Parallelism  int
}

The commonly used fields are Delay/RandomDelay/Parallelism, which specify the delay between requests, an additional random delay, and the concurrency, respectively. You must also specify which domains the rule applies to, via DomainRegexp or DomainGlob; if neither field is set, the Limit() method returns an error. Here is how it is used in the unsplash example above:

err := c.Limit(&colly.LimitRule{
  DomainRegexp: `unsplash\.com`,
  RandomDelay:  500 * time.Millisecond,
  Parallelism:  12,
})
if err != nil {
  log.Fatal(err)
}

Here we set a random delay of at most 500ms between requests to the domain unsplash.com, with a maximum of 12 concurrent requests.

Set Timeout

Sometimes the network is slow. The http.Client used by colly has default timeout behavior, which we can override with the Collector's WithTransport() method:

c.WithTransport(&http.Transport{
  Proxy: http.ProxyFromEnvironment,
  DialContext: (&net.Dialer{
    Timeout:   30 * time.Second,
    KeepAlive: 30 * time.Second,
  }).DialContext,
  MaxIdleConns:          100,
  IdleConnTimeout:       90 * time.Second,
  TLSHandshakeTimeout:   10 * time.Second,
  ExpectContinueTimeout: 1 * time.Second,
})
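
The transport above tunes dialing, keep-alive, idle-connection, and handshake timeouts. If you also want a single overall deadline per request, recent colly v2 releases expose a SetRequestTimeout method on the Collector (check your version before relying on it); a minimal sketch:

c.SetRequestTimeout(30 * time.Second) // assumed API: overrides colly's default per-request timeout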

Extensions

colly provides some extensions in the extensions subpackage, the most common one being the random User-Agent. Websites often use the User-Agent header to identify whether a request was sent by a browser, so crawlers usually set this header to disguise themselves as browsers. The extension is simple to use:

import "github.com/gocolly/colly/v2/extensions"

func main() {
  c := colly.NewCollector()
  extensions.RandomUserAgent(c)
}

The random User-Agent implementation is also simple: on each request it picks one entry at random from a predefined array of User-Agent generators and sets it in the header:

func RandomUserAgent(c *colly.Collector) {
  c.OnRequest(func(r *colly.Request) {
    r.Headers.Set("User-Agent", uaGens[rand.Intn(len(uaGens))]())
  })
}

It is not difficult to implement your own extension. For example, if we need to set a specific header on every request, the extension can be written like this:

func MyHeader(c *colly.Collector) {
  c.OnRequest(func(r *colly.Request) {
    r.Headers.Set("My-Header", "dj")
  })
}

Calling the MyHeader() function with the Collector object is all that is needed:

MyHeader(c)

Summary

colly is the most popular crawler framework in the Go ecosystem and supports a rich set of features. This article introduced some of the common features with examples. Due to space constraints, some advanced features, such as queues and storage, are not covered. If you are interested in crawling, it is worth digging deeper.


Reference https://darjun.github.io/2021/06/30/godailylib/colly/