All rights reserved. Please credit the source when reposting: http://www.lenggirl.com/language/go-picture.html
Prerequisites
1. Install Golang
2. Download the crawler packages
    go get -v github.com/hunterhug/marmot/expert
    go get -v github.com/hunterhug/marmot/miner
    go get -v github.com/hunterhug/parrot/util
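Program
This program can only fetch images that the HTML references as src="http...". The value must include the http(s) scheme. Images referenced in other ways, such as through data-src or through URLs obfuscated inside JavaScript, cannot be fetched.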
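To make the limitation concrete, here is a minimal sketch of attribute matching that accepts only absolute http(s) values. The pattern and the FindAbsoluteSrc name are assumptions for illustration. This is not how expert.FindPicture is implemented, and the sketch does not check which tag the attribute belongs to.

```go
package main

import (
	"fmt"
	"regexp"
)

// absoluteSrc matches a src attribute whose quoted value starts with
// http:// or https://. The leading \s keeps data-src from matching.
// Hypothetical pattern, written for this example only.
var absoluteSrc = regexp.MustCompile(`(?i)\ssrc\s*=\s*["'](https?://[^"']+)["']`)

// FindAbsoluteSrc returns every matched src value in document order.
func FindAbsoluteSrc(html string) []string {
	var urls []string
	for _, m := range absoluteSrc.FindAllStringSubmatch(html, -1) {
		urls = append(urls, m[1])
	}
	return urls
}

func main() {
	page := `<img src="http://example.com/a.jpg">
<img src="/relative/b.jpg">
<img data-src="https://example.com/c.jpg">
<img src='https://example.com/d.png'>`

	// Only the first and last tags carry an absolute src value.
	for _, u := range FindAbsoluteSrc(page) {
		fmt.Println(u)
	}
}
```

The relative path and the data-src value are skipped, which mirrors the restriction described above.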
Full source: https://github.com/hunterhug/marmot/blob/master/example/lesson/lesson6.go
    package main

    import (
        "errors"
        "fmt"
        "net/url"
        "strings"

        "github.com/hunterhug/marmot/expert"
        "github.com/hunterhug/marmot/miner"
        "github.com/hunterhug/parrot/util"
    )

    // Number of miners (workers) that crawl at the same time.
    var MinerNum = 5

    // Set this to a proxy address to crawl through a proxy.
    var ProxyAddress interface{}

    func main() {
        // To use a proxy, uncomment:
        // ProxyAddress = "socks5://127.0.0.1:1080"

        fmt.Println(`Welcome: Input "url" and picture keep "dir"`)
        for {
            fmt.Println("---------------------------------------------")
            pageURL := util.Input(`URL(Like: "http://publicdomainarchive.com")`, "http://publicdomainarchive.com")
            dir := util.Input(`DIR(Default: "./picture")`, "./picture")
            fmt.Printf("You will keep %s picture in dir %s\n", pageURL, dir)
            fmt.Println("---------------------------------------------")

            // Start catching.
            err := CatchPicture(pageURL, dir)
            if err != nil {
                fmt.Println("Error:" + err.Error())
            }
        }
    }

    // CatchPicture downloads every picture found on pictureURL into dir.
    func CatchPicture(pictureURL string, dir string) error {
        // Check that the URL is valid.
        _, err := url.Parse(pictureURL)
        if err != nil {
            return err
        }

        // Make the directory.
        err = util.MakeDir(dir)
        if err != nil {
            return err
        }

        // Create a worker to fetch the page.
        worker, _ := miner.New(ProxyAddress)
        result, err := worker.SetUrl(pictureURL).SetUa(miner.RandomUa()).Get()
        if err != nil {
            return err
        }

        // Find all pictures.
        pictures := expert.FindPicture(string(result))

        // Empty, what a pity!
        if len(pictures) == 0 {
            return errors.New("empty")
        }

        // Divide the pictures among several workers.
        chunks, _ := util.DevideStringList(pictures, MinerNum)

        // Channel for reporting results; every picture sends exactly one value.
        chs := make(chan int, len(pictures))

        // Run the workers at the same time.
        for num, imgs := range chunks {
            // Get a miner from the pool.
            pictureWorker, ok := miner.Pool.Get(util.IS(num))
            if !ok {
                // None yet? Create one.
                newWorker, _ := miner.New(ProxyAddress)
                pictureWorker = newWorker
                newWorker.SetUa(miner.RandomUa())
                miner.Pool.Set(util.IS(num), newWorker)
            }

            // Save the pictures.
            go func(imgs []string, worker *miner.Worker, num int) {
                for _, img := range imgs {
                    // Skip invalid URLs, but still report so the wait below cannot hang.
                    _, err := url.Parse(img)
                    if err != nil {
                        chs <- 0
                        continue
                    }

                    // Build the file name for the picture.
                    filename := strings.Replace(util.ValidFileName(img), "#", "_", -1)

                    // Already exists?
                    if util.FileExist(dir + "/" + filename) {
                        fmt.Println("File Exist:" + dir + "/" + filename)
                        chs <- 0
                    } else {
                        // Does not exist: download it.
                        imgsrc, e := worker.SetUrl(img).Get()
                        if e != nil {
                            fmt.Println("Download " + img + " error:" + e.Error())
                            chs <- 0
                            // Move on to the next picture. Returning here would leave
                            // the rest of this chunk unreported and block the wait below.
                            continue
                        }

                        // Save it.
                        e = util.SaveToFile(dir+"/"+filename, imgsrc)
                        if e == nil {
                            fmt.Printf("SP%d: Keep in %s/%s\n", num, dir, filename)
                        }
                        chs <- 1
                    }
                }
            }(imgs, pictureWorker, num)
        }

        // Every picture should report back.
        for i := 0; i < len(pictures); i++ {
            <-chs
        }

        return nil
    }
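The program waits by receiving one channel value per picture, which works only if every code path in the goroutine sends exactly once. A possible alternative, sketched here with the standard library only, is to wait per goroutine with sync.WaitGroup. SplitIntoChunks and ProcessAll are hypothetical names, and util.DevideStringList may split the list differently.

```go
package main

import (
	"fmt"
	"sync"
)

// SplitIntoChunks deals items round-robin into at most n slices.
// Hypothetical helper for this sketch.
func SplitIntoChunks(items []string, n int) [][]string {
	if n <= 0 || len(items) == 0 {
		return nil
	}
	if n > len(items) {
		n = len(items)
	}
	chunks := make([][]string, n)
	for i, item := range items {
		chunks[i%n] = append(chunks[i%n], item)
	}
	return chunks
}

// ProcessAll starts one goroutine per chunk, waits for all of them,
// and returns how many items were handled.
func ProcessAll(items []string, workers int, handle func(worker int, item string)) int {
	var wg sync.WaitGroup
	var mu sync.Mutex
	done := 0

	for num, chunk := range SplitIntoChunks(items, workers) {
		wg.Add(1)
		go func(num int, chunk []string) {
			// Done runs however the goroutine exits, so the wait cannot hang.
			defer wg.Done()
			for _, item := range chunk {
				handle(num, item)
				mu.Lock()
				done++
				mu.Unlock()
			}
		}(num, chunk)
	}

	wg.Wait()
	return done
}

func main() {
	urls := []string{"a.jpg", "b.jpg", "c.jpg", "d.jpg", "e.jpg", "f.jpg", "g.jpg"}
	n := ProcessAll(urls, 3, func(worker int, item string) {
		// A real handler would download and save the picture here.
		fmt.Printf("SP%d: would download %s\n", worker, item)
	})
	fmt.Println("handled:", n)
}
```

With this shape, an early exit inside a worker only shortens that worker's run; the caller still returns once every goroutine has finished.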
The comments in the code explain each step. After running it:
    superpika@superpika-chen-110:~/code/src/github.com/hunterhug/marmot/example/lesson$ go run lesson6.go
    Welcome: Input "url" and picture keep "dir"
    ---------------------------------------------
    URL(Like: "http://publicdomainarchive.com")
    DIR(Default: "./picture")
    You will keep http://publicdomainarchive.com picture in dir ./picture
    ---------------------------------------------
    SP0: Keep in ./picture/http_04__03__03_publicdomainarchive.com_03_wp-content_03_uploads_03_2014_03_02_03_modern.jpg
    SP4: Keep in ./picture/http_04__03__03_publicdomainarchive.com_03_wp-content_03_uploads_03_2014_03_03_03_google_dark.png
    SP2: Keep in ./picture/http_04__03__03_publicdomainarchive.com_03_wp-content_03_uploads_03_2017_03_09_03_free-stock-photos-public-domain-images-003-667x1000-192684_667x675.jpg
    SP4: Keep in ./picture/http_04__03__03_publicdomainarchive.com_03_wp-content_03_uploads_03_2014_03_05_03_powered-by-wp-engine.png
    SP4: Keep in ./picture/http_04__03__03_publicdomainarchive.com_03_wp-content_03_uploads_03_2014_03_05_03_divi.png
    SP2: Keep in ./picture/http_04__03__03_publicdomainarchive.com_03_wp-content_03_uploads_03_2017_03_09_03_free-stock-photos-public-domain-images-002-1000x667.jpg
    SP4: Keep in ./picture/http_04__03__03_publicdomainarchive.com_03_wp-content_03_uploads_03_2014_03_05_03_public-domain-mark.png
    SP3: Keep in ./picture/http_04__03__03_publicdomainarchive.com_03_wp-content_03_uploads_03_2017_03_01_03_public-domain-images-free-stock-photos008-1000x625.jpg
    SP0: Keep in ./picture/http_04__03__03_publicdomainarchive.com_03_wp-content_03_uploads_03_2014_03_09_03_Weekly.jpg
    SP1: Keep in ./picture/http_04__03__03_publicdomainarchive.com_03_wp-content_03_uploads_03_2017_03_11_03_free-stock-photos-public-domain-images-054-1000x667.jpg
    SP3: Keep in ./picture/http_04__03__03_publicdomainarchive.com_03_wp-content_03_uploads_03_2014_03_10_03_instagram_dark.png
    SP0: Keep in ./picture/http_04__03__03_publicdomainarchive.com_03_wp-content_03_uploads_03_2014_03_02_03_vintage.jpg
    SP2: Keep in ./picture/http_04__03__03_publicdomainarchive.com_03_wp-content_03_uploads_03_2017_03_09_03_free-stock-photos-public-domain-images-001-1000x667.jpg
    SP1: Keep in ./picture/http_04__03__03_publicdomainarchive.com_03_wp-content_03_uploads_03_2017_03_11_03_free-stock-photos-public-domain-images-070-1000x667.jpg
    SP3: Keep in ./picture/http_04__03__03_publicdomainarchive.com_03_wp-content_03_uploads_03_2014_03_03_03_twitter02_dark.png
    SP2: Keep in ./picture/http_04__03__03_publicdomainarchive.com_03_wp-content_03_uploads_03_2017_03_01_03_public-domain-images-free-stock-photos001-1000x750-167066_1000x675.jpg
    SP1: Keep in ./picture/http_04__03__03_publicdomainarchive.com_03_wp-content_03_uploads_03_2017_03_09_03_free-stock-photos-public-domain-images-035-1000x667.jpg
    SP3: Keep in ./picture/http_04__03__03_publicdomainarchive.com_03_wp-content_03_uploads_03_2014_03_03_03_facebook_dark.png
    SP0: Keep in ./picture/http_04__03__03_publicdomainarchive.com_03_wp-content_03_uploads_03_2017_03_11_03_free-stock-photos-public-domain-images-060-1000x667.jpg
    SP1: Keep in ./picture/http_04__03__03_publicdomainarchive.com_03_wp-content_03_uploads_03_2017_03_09_03_free-stock-photos-public-domain-images-013-1000x667.jpg
    ---------------------------------------------
    URL(Like: "http://publicdomainarchive.com")
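The file names in the output suggest how each URL is flattened into a single name: ":" appears as _04_ and "/" as _03_. The sketch below reproduces only that observed mapping, plus the "#" to "_" replacement from the program. FlattenURL is a hypothetical helper, and util.ValidFileName may treat other characters in ways that this output does not show.

```go
package main

import (
	"fmt"
	"strings"
)

// flatten encodes the two substitutions visible in the sample output
// and the "#" replacement done by the program. Inferred, not taken
// from the library source.
var flatten = strings.NewReplacer(
	":", "_04_",
	"/", "_03_",
	"#", "_",
)

// FlattenURL turns a picture URL into a single file name.
func FlattenURL(rawURL string) string {
	return flatten.Replace(rawURL)
}

func main() {
	u := "http://publicdomainarchive.com/wp-content/uploads/2014/02/modern.jpg"
	fmt.Println(FlattenURL(u))
}
```

Because the name is derived from the full URL, running the program again on the same page finds the existing files and skips the downloads.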