[Unsolved] In PySpider, adding fetch_type='js' to crawl makes the internal phantomjs calls fail en masse: HTTP 599 Conn...


While working through:

[Unsolved] pyspider run error: FETCH_ERROR HTTP 599 Connection timed out after milliseconds

I searched for:

pyspider HTTP 599 Connection timed out after

Could it be that a user-agent needs to be added?

A reply found in: Question: HTTP 599: Operation timed out… – Google Groups

"The connection to phantomjs timed out. Check whether phantomjs is still alive and whether it still responds."

So is this related to phantomjs after all?

After all, some pages do seem to behave like this:

after adding fetch_type='js' to self.crawl, which pulls in phantomjs, loading did get slower

→ but during debugging the pages loaded fine

→ so could it be that during a real run, large-scale access makes phantomjs fail? Connection timeouts? Page-load timeouts?

"Similar to https://github.com/binux/pyspider/issues/213, this may be related to phantomjs load.

When phantomjs is under high load, a certain proportion of requests may fail.

If you can modify the code, you can try changing

pyspider/fetcher/tornado_fetcher.py

> request_conf['request_timeout'] = fetch['timeout'] + 1

to a longer wait (e.g., + 10 seconds) for phantomjs to return, and see whether the failure rate improves."
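A minimal sketch of that suggested patch (assuming the quoted line still matches the installed pyspider's pyspider/fetcher/tornado_fetcher.py; the +10 seconds is the value suggested above, not a verified fix):

<code># pyspider/fetcher/tornado_fetcher.py (location as quoted above)
# original line:
#   request_conf['request_timeout'] = fetch['timeout'] + 1
# patched: give phantomjs more headroom before tornado gives up
request_conf['request_timeout'] = fetch['timeout'] + 10
</code>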

The most likely culprit feels like phantomjs.

Either find a root-cause fix,

or try lengthening phantomjs's timeout.

no response from phantomjs while task has fetched result · Issue #213 · binux/pyspider

The author binux asked what the crawl rate was

→ so it seems it probably is load-related

win8 pyspider HTTP 599: Resolving timed out after 20000 milliseconds · Issue #552 · binux/pyspider

HTTP 599: Operation timed out – Google Groups

Question: HTTP 599: Operation timed out… – Google Groups

pyspider timeout HTTP 599: Operation timed out after ….. – answer by 足兆叉虫 – SegmentFault 思否

crawl page connection timeout, HTTP 599 – answer by sandiegoxu – SegmentFault 思否

Never mind; just set a larger timeout.

First, check the default timeouts.

self.crawl – pyspider

"connect_timeout: timeout for initial connection in seconds. default: 20

timeout: maximum time in seconds to fetch the page. default: 120"

<code>self.crawl(eachModelDetailDict["url"],
           callback=self.carModelSpecPage,
           fetch_type='js',
           retries=5,
           connect_timeout=50,
           timeout=300,
           save=curSerieDict)
</code>

Result: the problem persists.


While at it, search again for:

pyspider phantomjs HTTP 599 Connection timed out after

Git pyspider version stop from doing any activity at some point of time · Issue #613 · binux/pyspider

pyspider connection timeouts – CSDN blog

Advanced Python crawlers, part 4: how to use PySpider | 静觅

(python) pyspider timeout HTTP 599: Operation timed out after … – programming Q&A

But the errors in the earlier logs presumably had not yet reached the maximum of 5 retries, which is why the webui showed no failures.


So keep it running long enough and see whether anything actually fails.

That said, the results do show one:

1. FETCH_ERROR autohomeBrandData > https://www.autohome.com.cn/spec/10499/#pvareaid=2042128 3 minutes ago 28062.6+1.85ms +0


It also turns out there are a lot of retries.


The all list shows no errors.


Oh well. It ran overnight, but somewhere along the way a Print dialog popped up for no clear reason and paused execution, so it never continued. After resuming, the error logs still appear.

Fine; interrupt it and rerun later.

phantomjs HTTP 599 Connection timed out after

An analysis of a tornado timeout exception – 简书

Then I noticed that the earlier errors were also:

HTTP 599: Connection timed out after 20000 milliseconds

But after increasing the timeout, the error seems to have changed to:

<code>[E 180516 09:21:15 tornado_fetcher:212] [599] autohomeBrandData:4e101be6bc40d1e5a5df146b33ce1f3b https://www.autohome.com.cn/spec/27705/#pvareaid=2042128, HTTP 599: Failed to connect to 127.0.0.1 port 25555: Operation timed out 27.62s
[E 180516 09:21:15 processor:202] process autohomeBrandData:4e101be6bc40d1e5a5df146b33ce1f3b https://www.autohome.com.cn/spec/27705/#pvareaid=2042128 -> [599] len:0 -> result:None fol:0 msg:0 err:ParserError('Document is empty',)
[I 180516 09:21:15 scheduler:959] task retry 1/5 autohomeBrandData:5a42194730c3edc4446015f2dc2a013b https://www.autohome.com.cn/spec/27703/#pvareaid=2042128
</code>

tornado_fetcher HTTP 599  Failed to connect to  port 25555

HTTP 599: Failed to connect to 127.0.0.1 port 25555 · Issue #755 · binux/pyspider

“you use phantomjs but not open phantomjs port”

It seems that, previously, when running:

<code>pyspider -c config.json
</code>

I forgot to add the all argument.

Could that be the problem, causing phantomjs not to be loaded?
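As a quick sanity check, a minimal sketch to verify that something is actually listening on the phantomjs fetcher port (25555 per the logs above; plain stdlib, nothing pyspider-specific):

<code>import socket

def port_is_open(host="127.0.0.1", port=25555, timeout=3):
    """Return True if something accepts TCP connections on host:port."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False

# False here would be consistent with "Failed to connect to 127.0.0.1 port 25555"
print(port_is_open())
</code>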

However, from:

tornado_fetcher:364: curl error 599: couldn't connect to host · Issue #93 · binux/pyspider

it seems to say the phantomjs port must be specified manually?

pyspider phantomjs port

Deployment – pyspider

says nothing about having to set a port for phantomjs.

Presumably that 2015 thread is just too old, and newer versions don't need an explicit phantomjs port?

In all mode, every subcomponent, including phantomjs, runs by default.

A few example crawlers written with pyspider

Errors after starting with --phantomjs-proxy="10.11.12.11:8080" · Issue #448 · binux/pyspider

cannot connect to webui. · Issue #309 · binux/pyspider

"if you are running pyspider with all mode (by command pyspider or pyspider all) and phantomjs is started successfully"

pyspider – how exactly do you set a proxy for phantomjs? – SegmentFault 思否

Run it again:

<code>pyspider -c config.json all
</code>

and see.

Recording the initial output here:

<code>➜  AutocarData pyspider -c config.json all
phantomjs fetcher running on port 25555
。。。
self.mysqlDb=<AutohomeResultWorker.MysqlDb object at 0x105d58cc0>
[I 180517 09:12:14 result_worker:49] result_worker starting...
[I 180517 09:12:14 tornado_fetcher:638] fetcher starting...
[I 180517 09:12:14 processor:211] processor starting...
[I 180517 09:12:14 scheduler:647] scheduler starting...
[I 180517 09:12:15 scheduler:126] project autohomeBrandData updated, status:RUNNING, paused:False, 0 tasks
[I 180517 09:12:15 scheduler:965] select autohomeBrandData:_on_get_info data:,_on_get_info
[I 180517 09:12:15 tornado_fetcher:188] [200] autohomeBrandData:_on_get_info data:,_on_get_info 0s
[I 180517 09:12:15 scheduler:586] in 5m: new:0,success:0,retry:0,failed:0
[D 180517 09:12:15 project_module:145] project: autohomeBrandData updated.
[I 180517 09:12:15 processor:202] process autohomeBrandData:_on_get_info data:,_on_get_info -> [200] len:12 -> result:None fol:0 msg:0 err:None
[I 180517 09:12:15 scheduler:782] scheduler.xmlrpc listening on 127.0.0.1:23333
[I 180517 09:12:15 scheduler:360] autohomeBrandData on_get_info {'min_tick': 0, 'retry_delay': {}, 'crawl_config': {}}
[I 180517 09:12:15 app:76] webui running on 0.0.0.0:5000
</code>

which does include:

phantomjs fetcher running on port 25555

And the output without the all argument:

<code>➜  AutocarData pyspider -c config.json
phantomjs fetcher running on port 25555
[I 180517 09:13:34 result_worker:49] result_worker starting...
[I 180517 09:13:34 processor:211] processor starting...
[I 180517 09:13:34 tornado_fetcher:638] fetcher starting...
[I 180517 09:13:34 scheduler:647] scheduler starting...
[I 180517 09:13:35 scheduler:126] project autohomeBrandData updated, status:STOP, paused:False, 0 tasks
[I 180517 09:13:35 scheduler:782] scheduler.xmlrpc listening on 127.0.0.1:23333
[I 180517 09:13:35 scheduler:586] in 5m: new:0,success:0,retry:0,failed:0
[I 180517 09:13:35 app:76] webui running on 0.0.0.0:5000
</code>

That is:

when no mode is specified, pyspider also defaults to all

→ with or without all, phantomjs runs either way.

Next, switch to the company's faster network and try running again, to see whether it still errors.

Result: the errors persist:

<code>[E 180517 09:32:53 tornado_fetcher:212] [599] autohomeBrandData:29ae096711906a0100db8641da85ee6f https://www.autohome.com.cn/spec/29931/#pvareaid=2042128, HTTP 599: Connection timed out after 61314 milliseconds 62.30s
[E 180517 09:32:53 tornado_fetcher:212] [599] autohomeBrandData:d89d93ca8126d2264393bf31d0b7f60a https://www.autohome.com.cn/spec/29930/#pvareaid=2042128, HTTP 599: Connection timed out after 61276 milliseconds 62.24s
[E 180517 09:32:53 processor:202] process autohomeBrandData:a65ac03901c5832a0c841f3c58d99e69 https://www.autohome.com.cn/spec/18006/#pvareaid=2042128 -> [599] len:0 -> result:None fol:0 msg:0 err:ParserError('Document is empty',)
</code>


As you can see, even with a 60+ second timeout the connection still fails.

So it looks like this cannot be solved by simply increasing the timeout further.

Also, phantomjs does work part of the time, which is why some data gets saved correctly:

the msrp value is non-zero,


which shows this code is indeed being reached:

<code>self.crawl(eachModelDetailDict["url"],
           callback=self.carModelSpecPage,
           fetch_type='js',
           retries=5,
           connect_timeout=50,
           timeout=300,
           save=curSerieDict)

@catch_status_code_error
def carModelSpecPage(self, response):
</code>

So I genuinely cannot figure out why it fails, nor how to solve it once and for all.

On the Mac, check the processes:

<code>➜  ~ lsof -i:25555
COMMAND     PID   USER   FD   TYPE             DEVICE SIZE/OFF NODE NAME
phantomjs 96892 crifan    8u  IPv4 0x4b4cb78e55774f75      0t0  TCP localhost:25555->localhost:51131 (ESTABLISHED)
phantomjs 96892 crifan   12u  IPv4 0x4b4cb78e5679cf75      0t0  TCP *:25555 (LISTEN)
phantomjs 96892 crifan   13u  IPv4 0x4b4cb78e55aec615      0t0  TCP localhost:25555->localhost:49769 (ESTABLISHED)
。。。
localhost:25555->localhost:49517 (ESTABLISHED)
phantomjs 96892 crifan  186u  IPv4 0x4b4cb78e59206235      0t0  TCP localhost:25555->localhost:51333 (ESTABLISHED)
python3.6 96895 crifan   44u  IPv4 0x4b4cb78e58760235      0t0  TCP localhost:64752->localhost:25555 (ESTABLISHED)
。。。
python3.6 96895 crifan  139u  IPv4 0x4b4cb78e569b98d5      0t0  TCP localhost:51235->localhost:25555 (ESTABLISHED)
➜  ~
</code>

So phantomjs is indeed running.

There are 30 python3.6 rows and 31 phantomjs rows (note that lsof prints one row per open socket: judging by the PID column this is a single phantomjs process, 96892, and a single python process, 96895, with many connections between them).

Presumably they are all in use.

After stopping pyspider, the process is gone:

<code>➜  ~ lsof -i:25555
➜  ~
</code>

which is also the expected behavior.

Next, try adding a user-agent.
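A minimal sketch of adding one via self.crawl's headers parameter (pyspider's crawl accepts a headers dict; the UA string below is just an example value):

<code>self.crawl(eachModelDetailDict["url"],
           callback=self.carModelSpecPage,
           fetch_type='js',
           headers={
               # an ordinary desktop browser UA; any realistic string works
               'User-Agent': ('Mozilla/5.0 (Macintosh; Intel Mac OS X 10_13) '
                              'AppleWebKit/537.36 (KHTML, like Gecko) '
                              'Chrome/66.0 Safari/537.36'),
           },
           save=curSerieDict)
</code>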

pyspider phantomjs port

demo.pyspider.org deployment notes | Binuxの杂货铺

Proxy settings for pyspider + PhantomJS – 简书

My copy here is also the latest version, 2.1.1:

<code>➜  ~ phantomjs --version
2.1.1
➜  ~ which phantomjs
/usr/local/bin/phantomjs
➜  ~ ll /usr/local/bin/phantomjs*
lrwxr-xr-x  1 crifan  admin   107B 11 22 23:14 /usr/local/bin/phantomjs -> /Users/crifan/dev/dev_root/projects/亚马逊/自动下单/ms_store_auto_order/libs/webdriver/mac/phantomjs
➜  ~ ll /Users/crifan/dev/dev_root/projects/亚马逊/自动下单/ms_store_auto_order/libs/webdriver/mac/
total 111232
-rwxr-xr-x@ 1 crifan  staff    11M 10  3  2017 chromedriver
-rwxr-xr-x@ 1 crifan  staff    43M  1 24  2016 phantomjs
</code>

Python pyspider installation and development – 简书

Why would phantomjs need a proxy set at all? Unclear.

Download | PhantomJS

The official site's latest version is still 2.1.1.

pyspider phantomjs memory leak

Distributed pyspider deployment | 听云

""phantomjs-proxy": "192.168.218.10:24444",

phantomjs-proxy specifies the address of the rendering tool to use; where necessary, also cap its memory to avoid memory leaks"

demo.pyspider.org deployment notes (repost) – microman – 博客园

These are all about deploying phantomjs.

Binuxの杂货铺

It turns out those online posts are all reposts of:

demo.pyspider.org deployment notes

极客起源 – geekori.com – question details – how do you deal with pyspider phantomjs memory leaks and hangs?

"Hangs: restart it on a schedule…"

phantomjs doesn’t release memory · Issue #207 · binux/pyspider

“(google “phantomjs memory”).

I have a phantomjs instance with crawl rate of 5 pages per minute, it cost ~500MB. And my solution is restart it every hour.”

phantomjs memory leak · Issue #439 · binux/pyspider

“But sometimes phantomjs will some how be blocked instead of killed when memory usage near but not reach the limit.”

So phantomjs can sometimes hang?

phantomjs memory leak

phantomjs set memory limit

Phantomjs Memory Leak Issue · Issue #14308 · ariya/phantomjs

PhantomJS performance optimization

[Unsolved] How to pass extra arguments to phantomjs in pyspider

Command Line Interface | PhantomJS

Surprisingly, there is no memory-related option at all, let alone a memory limit.

pyspider phantomjs  memory limit

Deployment of demo.pyspider.org – pyspider

“phantomjs

phantomjs have memory leak issue, memory limit applied, and it’s recommended to restart it every hour.”
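Following that advice, a minimal watchdog sketch that kills phantomjs once it grows past a memory cap, relying on something like pyspider phantomjs --auto-restart true (or a supervisor) to respawn it; assumes the third-party psutil package, and the 500 MB cap just echoes the ~500MB figure quoted earlier:

<code>import time
import psutil  # third-party: pip install psutil

MEM_LIMIT_MB = 500  # rough cap, echoing the ~500MB figure quoted above

while True:
    for proc in psutil.process_iter(["name", "memory_info"]):
        if proc.info["name"] == "phantomjs":
            rss_mb = proc.info["memory_info"].rss / (1024 * 1024)
            if rss_mb > MEM_LIMIT_MB:
                proc.kill()  # a supervisor / --auto-restart brings it back
    time.sleep(60)  # check once a minute
</code>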

Phantomjs, performance · Issue #458 · binux/pyspider

connect to scheduler rpc error: error(111, ‘Connection refused’) with docker – Google Groups

So my feeling is:

when I get a chance, first run a standalone phantomjs,

then run pyspider with its proxy pointed at that phantomjs?

<code>➜  AutocarData pyspider phantomjs --help
Usage: pyspider phantomjs [OPTIONS] [ARGS]...

  Run phantomjs fetcher if phantomjs is installed.

Options:
  --phantomjs-path TEXT  phantomjs path
  --port INTEGER         phantomjs port
  --auto-restart TEXT    auto restart phantomjs if crashed
  --help                 Show this message and exit.
➜  AutocarData pyspider --help
Usage: pyspider [OPTIONS] COMMAND [ARGS]...

  A powerful spider system in python.

Options:
  -c, --config FILENAME           a json file with default values for
                                  subcommands. {"webui": {"port":5001}}
  --logging-config TEXT           logging config file for built-in python
                                  logging module  [default: /Users/crifan/.loc
                                  al/share/virtualenvs/AutocarData-xI-
                                  iqIq4/lib/python3.6/site-
                                  packages/pyspider/logging.conf]
  --debug                         debug mode
  --queue-maxsize INTEGER         maxsize of queue
  --taskdb TEXT                   database url for taskdb, default: sqlite
  --projectdb TEXT                database url for projectdb, default: sqlite
  --resultdb TEXT                 database url for resultdb, default: sqlite
  --message-queue TEXT            connection url to message queue, default:
                                  builtin multiprocessing.Queue
  --amqp-url TEXT                 [deprecated] amqp url for rabbitmq. please
                                  use --message-queue instead.
  --beanstalk TEXT                [deprecated] beanstalk config for beanstalk
                                  queue. please use --message-queue instead.
  --phantomjs-proxy TEXT          phantomjs proxy ip:port
  --data-path TEXT                data dir path
  --add-sys-path / --not-add-sys-path
                                  add current working directory to python lib
                                  search path
  --version                       Show the version and exit.
  --help                          Show this message and exit.

Commands:
  all            Run all the components in subprocess or...
  bench          Run Benchmark test.
  fetcher        Run Fetcher.
  one            One mode not only means all-in-one, it runs...
  phantomjs      Run phantomjs fetcher if phantomjs is...
  processor      Run Processor.
  result_worker  Run result worker.
  scheduler      Run Scheduler, only one scheduler is allowed.
  send_message   Send Message to project from command line
  webui          Run WebUI
</code>

Open a separate terminal window and run phantomjs:

<code>➜  AutocarData pyspider phantomjs --help
Usage: pyspider phantomjs [OPTIONS] [ARGS]...

  Run phantomjs fetcher if phantomjs is installed.

Options:
  --phantomjs-path TEXT  phantomjs path
  --port INTEGER         phantomjs port
  --auto-restart TEXT    auto restart phantomjs if crashed
  --help                 Show this message and exit.
➜  AutocarData pyspider phantomjs --port 23450 --auto-restart true
phantomjs fetcher running on port 23450

</code>

where, for:

--auto-restart TEXT    auto restart phantomjs if crashed

I used --auto-restart true,

not sure whether that syntax is right.

Also:

<code>➜  AutocarData pyspider all --help
Usage: pyspider all [OPTIONS]

  Run all the components in subprocess or thread

Options:
  --fetcher-num INTEGER         instance num of fetcher
  --processor-num INTEGER       instance num of processor
  --result-worker-num INTEGER   instance num of result worker
  --run-in [subprocess|thread]  run each components in thread or subprocess.
                                always using thread for windows.
  --help                        Show this message and exit.
</code>

Add to config.json:

<code>{
  "resultdb":   "mysql+resultdb://root:crifan_mysql@127.0.0.1:3306/AutohomeResultdb",
  "result_worker":{
      "result_cls": "AutohomeResultWorker.AutohomeResultWorker"
   },
  "phantomjs-proxy": "127.0.0.1:23450"
}
</code>

Then run again:

<code>➜  AutocarData pyspider -c config.json
[I 180517 11:11:15 result_worker:49] result_worker starting...
[I 180517 11:11:15 processor:211] processor starting...
[I 180517 11:11:15 tornado_fetcher:638] fetcher starting...
[I 180517 11:11:15 scheduler:647] scheduler starting...
[I 180517 11:11:15 scheduler:126] project autohomeBrandData updated, status:RUNNING, paused:False, 0 tasks
[I 180517 11:11:15 scheduler:965] select autohomeBrandData:_on_get_info data:,_on_get_info
[I 180517 11:11:15 scheduler:586] in 5m: new:0,success:0,retry:0,failed:0
[I 180517 11:11:15 tornado_fetcher:188] [200] autohomeBrandData:_on_get_info data:,_on_get_info 0s
[D 180517 11:11:15 project_module:145] project: autohomeBrandData updated.
[I 180517 11:11:15 processor:202] process autohomeBrandData:_on_get_info data:,_on_get_info -> [200] len:12 -> result:None fol:0 msg:0 err:None
[I 180517 11:11:16 scheduler:360] autohomeBrandData on_get_info {'min_tick': 0, 'retry_delay': {}, 'crawl_config': {}}
[I 180517 11:11:16 scheduler:782] scheduler.xmlrpc listening on 127.0.0.1:23333
[I 180517 11:11:16 app:76] webui running on 0.0.0.0:5000
</code>

As you can see, this time it did not spawn its own phantomjs.

Keep it running and see how it goes.

Meanwhile, the phantomjs window printed results, all 200 OK:


<code>➜  AutocarData pyspider phantomjs --port 23450 --auto-restart true
phantomjs fetcher running on port 23450
[200] https://www.autohome.com.cn/spec/32172/#pvareaid=2042128 3.661
[200] https://www.autohome.com.cn/spec/32678/#pvareaid=2042128 3.811
。。。
[200] https://www.autohome.com.cn/spec/7955/#pvareaid=2042128 30.801
。。。
[200] https://www.autohome.com.cn/spec/6097/#pvareaid=2042128 35.638
[200] https://www.autohome.com.cn/spec/6096/#pvareaid=2042128 22.73
[200] https://www.autohome.com.cn/spec/8830/#pvareaid=2042128 29.952
[200] https://www.autohome.com.cn/spec/10574/#pvareaid=2042128 31.178
。。。
[200] https://www.autohome.com.cn/spec/10569/#pvareaid=2042128 33.969
。。。
[200] https://www.autohome.com.cn/spec/10658/#pvareaid=2042128 8.827
[200] https://www.autohome.com.cn/spec/12153/#pvareaid=2042128 16.626
</code>

Keep it running and see whether errors appear.

You can see, though, that many pages render slowly; the slow ones take 30+ seconds to finish, which really is slow.

Checking memory usage:

phantomjs is using over 1 GB of RAM.


The same problem still occurred.


And the phantomjs process also shows some exceptionally long fetches, 50+ seconds for example:


<code>[200] https://www.autohome.com.cn/spec/33133/#pvareaid=2042128 33.75
[200] https://www.autohome.com.cn/spec/20057/#pvareaid=2042128 35.36
[200] https://www.autohome.com.cn/spec/12585/#pvareaid=2042128 40.707
[200] https://www.autohome.com.cn/spec/22145/#pvareaid=2042128 41.128
[200] https://www.autohome.com.cn/spec/13913/#pvareaid=2042128 47.393
[200] https://www.autohome.com.cn/spec/19793/#pvareaid=2042128 49.275
[200] https://www.autohome.com.cn/spec/20529/#pvareaid=2042128 49.392
[200] https://www.autohome.com.cn/spec/30554/#pvareaid=2042128 49.745
[200] https://www.autohome.com.cn/spec/1001716/#pvareaid=2042128 50.309
[200] https://www.autohome.com.cn/spec/30685/#pvareaid=2042128 49.213
[200] https://www.autohome.com.cn/spec/13911/#pvareaid=2042128 28.214
[200] https://www.autohome.com.cn/spec/24979/#pvareaid=2042128 30.104
[200] https://www.autohome.com.cn/spec/27899/#pvareaid=2042128 27.286
[200] https://www.autohome.com.cn/spec/14245/#pvareaid=2042128 31.383
[200] https://www.autohome.com.cn/spec/606/#pvareaid=2042128 31.034
[200] https://www.autohome.com.cn/spec/10108/#pvareaid=2042128 34.614
[200] https://www.autohome.com.cn/spec/15468/#pvareaid=2042128 34.815
[200] https://www.autohome.com.cn/spec/31844/#pvareaid=2042128 35.035
[200] https://www.autohome.com.cn/spec/1352/#pvareaid=2042128 35.193
[200] https://www.autohome.com.cn/spec/12631/#pvareaid=2042128 35.258
[200] https://www.autohome.com.cn/spec/23886/#pvareaid=2042128 33.805
[200] https://www.autohome.com.cn/spec/25316/#pvareaid=2042128 29.204
[200] https://www.autohome.com.cn/spec/26384/#pvareaid=2042128 29.158
[200] https://www.autohome.com.cn/spec/20500/#pvareaid=2042128 30.431
[200] https://www.autohome.com.cn/spec/31848/#pvareaid=2042128 31.86
[200] https://www.autohome.com.cn/spec/25690/#pvareaid=2042128 32.249
[200] https://www.autohome.com.cn/spec/24594/#pvareaid=2042128 33.069
[200] https://www.autohome.com.cn/spec/23713/#pvareaid=2042128 32.747
[200] https://www.autohome.com.cn/spec/23885/#pvareaid=2042128 32.561
[200] https://www.autohome.com.cn/spec/24519/#pvareaid=2042128 33.507
[200] https://www.autohome.com.cn/spec/29967/#pvareaid=2042128 28.702
[200] https://www.autohome.com.cn/spec/30073/#pvareaid=2042128 28.299

</code>

Moreover, even after I interrupted and stopped pyspider, phantomjs was still producing output.

So clearly:

some pages exceed the 50-second connect_timeout=50 passed to self.crawl, and time out. (Presumably this also explains the "Failed to connect to 127.0.0.1 port 23450" errors: while phantomjs is tied up rendering for tens of seconds, new connections from the fetcher go unanswered until they hit the timeout.)
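A back-of-envelope sketch of why that happens (the numbers come from the observations above: roughly 30 python <-> phantomjs connections in lsof, typical render times of 30-50 s; the model itself is a simplification):

<code># Rough throughput model, using figures observed above (hypothetical):
concurrent = 30        # ~30 simultaneous python<->phantomjs connections (lsof)
avg_render_s = 35.0    # typical per-page render time from the 200-OK log
throughput = concurrent / avg_render_s  # pages/s phantomjs can sustain
print(round(throughput, 2), "pages/s")  # ~0.86 pages/s

# If the scheduler dispatches work faster than this, requests pile up at
# port 23450 until new connects exceed connect_timeout and fail with 599.
</code>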

To figure that out:

[Solved] What the phantomjs proxy in pyspider means

[Unsolved] How to pass extra arguments to phantomjs in pyspider

Passing the parameters seems solved, so keep running and see.

<code>➜  AutocarData cat config.json
{
  "resultdb":   "mysql+resultdb://root:crifan_mysql@127.0.0.1:3306/AutohomeResultdb",
  "result_worker":{
      "result_cls": "AutohomeResultWorker.AutohomeResultWorker"
   },
  "phantomjs-proxy": "127.0.0.1:23450",
  "phantomjs" : {
    "port": 23450,
    "auto-restart": true,
    "load-images": false,
    "debug": true
  }
}
➜  AutocarData pyspider -c config.json phantomjs
phantomjs fetcher running on port 23450

</code>

And in another terminal:

<code>  AutocarData pyspider -c config.json all
[I 180517 22:05:29 result_worker:49] result_worker starting...
[I 180517 22:05:29 tornado_fetcher:638] fetcher starting...
[I 180517 22:05:29 processor:211] processor starting...
[I 180517 22:05:29 scheduler:647] scheduler starting...
[I 180517 22:05:29 scheduler:782] scheduler.xmlrpc listening on 127.0.0.1:23333
[I 180517 22:05:29 scheduler:126] project autohomeBrandData updated, status:RUNNING, paused:False, 0 tasks
[I 180517 22:05:29 scheduler:965] select autohomeBrandData:_on_get_info data:,_on_get_info
[I 180517 22:05:29 scheduler:586] in 5m: new:0,success:0,retry:0,failed:0
[I 180517 22:05:29 tornado_fetcher:188] [200] autohomeBrandData:_on_get_info data:,_on_get_info 0s
[D 180517 22:05:29 project_module:145] project: autohomeBrandData updated.
[I 180517 22:05:29 processor:202] process autohomeBrandData:_on_get_info data:,_on_get_info -> [200] len:12 -> result:None fol:0 msg:0 err:None
[I 180517 22:05:30 app:76] webui running on 0.0.0.0:5000
[I 180517 22:05:30 scheduler:360] autohomeBrandData on_get_info {'min_tick': 0, 'retry_delay': {}, 'crawl_config': {}}
</code>

Wait and see whether it still errors.

The hope now is that load-images=false got passed through to phantomjs, so pages load without images, hopefully reducing both the error rate and the page-load time; maybe then it won't time out?

Although phantomjs did manage to render some pages,


it still errored out:

<code>[E 180517 22:19:57 tornado_fetcher:212] [599] autohomeBrandData:b0ccf36b6fc5473ee7d3742d33e1fa4f https://www.autohome.com.cn/spec/6726/#pvareaid=2042128, HTTP 599: Failed to connect to 127.0.0.1 port 23450: Operation timed out 27.83s
[E 180517 22:19:57 tornado_fetcher:212] [599] autohomeBrandData:6739df8b7e3b21f5a667923d46efbc80 https://www.autohome.com.cn/spec/8152/#pvareaid=2042128, HTTP 599: Failed to connect to 127.0.0.1 port 23450: Operation timed out 27.82s
[E 180517 22:19:57 processor:202] process autohomeBrandData:b0ccf36b6fc5473ee7d3742d33e1fa4f https://www.autohome.com.cn/spec/6726/#pvareaid=2042128 -> [599] len:0 -> result:None fol:0 msg:0 err:ParserError('Document is empty',)
[E 180517 22:19:57 processor:202] process autohomeBrandData:6739df8b7e3b21f5a667923d46efbc80 https://www.autohome.com.cn/spec/8152/#pvareaid=2042128 -> [599] len:0 -> result:None fol:0 msg:0 err:ParserError('Document is empty',)
</code>


This problem just cannot be fully, thoroughly solved.

[Summary]

The goal here was to extract certain data from the page (for example, a car's price), but that data is loaded asynchronously by the page's JavaScript.

PySpider's solution for this is:

add fetch_type='js' to self.crawl →

which internally uses phantomjs to execute the JS and load the page content.

This approach basically works (at least during debugging).

However, when PySpider runs a large-scale crawl, every page takes a long time to load:

a single page used to take under 1 second; with phantomjs each page takes 2-3 seconds, even tens of seconds,

so that mass crawling fails with errors like:

<code>tornado_fetcher HTTP 599: Connection timed out after 20000 milliseconds 20.00s
</code>

or:

<code>tornado_fetcher HTTP 599: Failed to connect to 127.0.0.1 port 23450: Operation timed out 27.83s
</code>

I tried a series of remedies, one after another:

  • increasing the timeout values

  • switching to a faster network

  • adding the all argument to pyspider

  • trying to pass extra arguments to phantomjs: without success

  • suspecting a memory-leak bug in phantomjs: unconfirmed, and nothing I could fix

  • adding a User-Agent

  • and so on

None of them solved it.

For now, the only option is to give up on phantomjs.

