Scraping Zhihu data with goquery

Category: Go · Published: 5 years ago


Using goquery

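I built a Zhihu-like site for my graduation project and needed some data for it, so I decided to scrape some from Zhihu. I had planned to write the scraper in Python, but it turns out Go has a handy library for this as well: goquery. If you have done any front-end work, you can write a scraper with goquery in under half an hour.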

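goquery is modeled on jQuery: it brings jQuery-style syntax and selection methods to Go, which makes it convenient to work with HTML documents.

It can pick out data by HTML element name, by ID selector, by class selector, or by attribute selector.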

GitHub: https://github.com/PuerkitoBio/goquery

Below is the demo code I used to scrape Zhihu data.

package main

import (
    "fmt"
    "log"
    "net/http"
    "strconv"
    "strings"

    "github.com/PuerkitoBio/goquery"
)

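// ExampleScrape fetches a small range of Zhihu question pages and prints
// each question's title and detail, followed by its answers.
func ExampleScrape() {
    for i := 321450693; i > 321450680; i-- {
        res, err := http.Get("https://www.zhihu.com/question/" + strconv.Itoa(i))
        if err != nil {
            continue // request failed; move on to the next question ID
        }
        if res.StatusCode != http.StatusOK {
            res.Body.Close() // release the connection before skipping this ID
            continue
        }

        doc, err := goquery.NewDocumentFromReader(res.Body)
        res.Body.Close() // parsing has consumed the body; close it on every path
        if err != nil {
            log.Fatal(err)
        }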

        doc.Find(".QuestionHeader .QuestionHeader-content .QuestionHeader-main").Each(func(_ int, s *goquery.Selection) {
            questionTitle := s.Find(".QuestionHeader-title").Text()
            questionContent := s.Find(".QuestionHeader-detail").Text()
            // Drop the last 12 bytes of the detail text (presumably a trailing
            // button label); skip this when the text is shorter, to avoid a panic.
            if len(questionContent) >= 12 {
                questionContent = questionContent[:len(questionContent)-12]
            }

            fmt.Println("questionTitle:", questionTitle)
            fmt.Println("questionContent:", questionContent)
        })

        // Each .List-item is one answer.
        doc.Find(".ListShortcut .List .List-item").Each(func(_ int, s *goquery.Selection) {
            headURL, _ := s.Find("a img").Attr("src") // author's avatar URL
            author := s.Find(".AuthorInfo-head").Text()
            fmt.Println("headURL:", headURL)
            fmt.Println("author:", author)

            // The vote count is the first space-separated token of the voters text.
            voters := strings.Split(s.Find(".Voters").Text(), " ")[0]
            content := s.Find(".RichContent-inner").Text() // use Html() instead to keep the markup
            // The date is the second space-separated token; keep the raw text if there is none.
            createTime := s.Find(".ContentItem-time").Text()
            if parts := strings.Split(createTime, " "); len(parts) > 1 {
                createTime = parts[1]
            }

            commentCount := s.Find(".ContentItem-actions span").Text()
            fmt.Println("voters:", voters)
            fmt.Println("content:", content)
            fmt.Println("createTime:", createTime)
            fmt.Println("commentCount:", commentCount)
        })

    }

}

func main() {
    ExampleScrape()
}
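The demo cleans up scraped text by position: it cuts 12 bytes off the end of the question detail and takes the second space-separated token of the timestamp. Both steps depend on the page text having the expected shape. The sketch below shows one way to make them safe against unexpected input, using only the standard library. The helper names `dropLastBytes` and `fieldAt` are invented here, and the sample strings (a trailing "显示全部" label and a "发布于 2019-04-20" timestamp) are assumptions about what the page text looks like, not something confirmed by the original code.

```go
package main

import (
	"fmt"
	"strings"
)

// dropLastBytes removes the final n bytes of s. When s is shorter than n
// (or n is negative) it returns s unchanged instead of panicking.
// Note that n counts bytes, not characters: a Chinese character is
// usually 3 bytes in UTF-8.
func dropLastBytes(s string, n int) string {
	if n < 0 || len(s) < n {
		return s
	}
	return s[:len(s)-n]
}

// fieldAt splits s on runs of whitespace and returns the i-th field.
// The boolean is false when there is no such field.
func fieldAt(s string, i int) (string, bool) {
	fields := strings.Fields(s)
	if i < 0 || i >= len(fields) {
		return "", false
	}
	return fields[i], true
}

func main() {
	// Assumed input: "显示全部" is four characters, 12 bytes in UTF-8.
	fmt.Println(dropLastBytes("如何学习 Go？显示全部", 12))
	fmt.Println(dropLastBytes("short", 12))

	// Assumed timestamp format: a label, a space, then the date.
	date, ok := fieldAt("发布于 2019-04-20", 1)
	fmt.Println(date, ok)

	_, ok = fieldAt("", 1)
	fmt.Println(ok)
}
```

Unlike `strings.Split(s, " ")`, `strings.Fields` treats any run of whitespace as one separator, so stray newlines or double spaces in the scraped text do not shift the token positions.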

