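Background:
[Unsolved] pyspider fails at run time: FETCH_ERROR HTTP 599 Connection timed out after milliseconds
While working on that, I searched for:
pyspider HTTP 599 Connection timed out after
Could it be that a User-Agent needs to be added?
Reply: Question: HTTP 599: Operation timed out… – Google Groups
"The connection to phantomjs timed out. Check whether phantomjs is still alive and still responding."
So is this related to phantomjs?
After all, this does match what I saw on some pages:
after adding fetch_type='js' to self.crawl, which brings in phantomjs, pages load more slowly
-> but during debugging they still load fine
-> so when the project is actually run and pages are requested at scale, does phantomjs start failing? Connection timeouts? Page-load timeouts?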
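"Similar to https://github.com/binux/pyspider/issues/213, this is probably related to the load on phantomjs.
When phantomjs is under heavy load, a certain proportion of requests may fail.
If you are able to modify the code, you can try changing this line in
pyspider/fetcher/tornado_fetcher.py
> request_conf['request_timeout'] = fetch['timeout'] + 1
so that it adds a longer margin (for example 10 seconds) to wait for phantomjs to return, and see whether the failure rate improves."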
The most likely culprit seems to be phantomjs.
Either find the root-cause fix,
or try lengthening the timeout for phantomjs.
no response from phantomjs while task has fetched result · Issue #213 · binux/pyspider
The author, binux, asked what the rate was
-> so that is probably it
win8 pyspider HTTP 599: Resolving timed out after 20000 milliseconds · Issue #552 · binux/pyspider
HTTP 599: Operation timed out – Google Groups
Question: HTTP 599: Operation timed out… – Google Groups
pyspider timeout HTTP 599: Operation timed out after ….. – answer by 足兆叉虫 – SegmentFault
crawl: connection to the page timed out, HTTP 599 – answer by sandiegoxu – SegmentFault
Never mind, just set a larger timeout.
First check the default timeouts:
"connect_timeout
timeout for initial connection in seconds. default: 20
timeout
maximum time in seconds to fetch the page. default: 120"
<code>self.crawl(eachModelDetailDict["url"], callback=self.carModelSpecPage, fetch_type='js', retries=5, connect_timeout=50, timeout=300, save=curSerieDict) </code>
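To keep the overrides in one place, the sketch below collects the fetch options into a dict that can be unpacked into `self.crawl`. The helper name `js_crawl_options` is hypothetical and not part of pyspider; the keys are the `self.crawl` parameters used in the call above, and the 20 and 120 second defaults are the ones quoted from the documentation.

```python
def js_crawl_options(retries, connect_timeout=20, timeout=120):
    """Return keyword arguments for self.crawl() when rendering through phantomjs.

    Hypothetical helper for illustration. The defaults mirror the documented
    values quoted above: connect_timeout=20 and timeout=120 (seconds).
    """
    return {
        "fetch_type": "js",            # render the page through phantomjs
        "retries": retries,
        "connect_timeout": connect_timeout,
        "timeout": timeout,
    }


# The values tried in this post.
options = js_crawl_options(retries=5, connect_timeout=50, timeout=300)
print(options)

# Inside a handler this would be unpacked as:
#   self.crawl(url, callback=self.carModelSpecPage, save=curSerieDict, **options)
```

Keeping the values in one dict makes it easier to change them for every JS-rendered request at once while experimenting.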
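Result:
The problem persists:
While at it, I searched some more:
pyspider phantomjs HTTP 599 Connection timed out after
(python) pyspider timeout HTTP 599: Operation timed out after _Other languages_Programming Q&A
However, the errors in the earlier log had probably not yet reached the maximum of 5 retries, which matches the webui showing no failures:
So I can keep it running, wait long enough, and see whether any task ends up as failed.
The results list does contain:
1. FETCH_ERROR autohomeBrandData > https://www.autohome.com.cn/spec/10499/#pvareaid=2042128 3 minutes ago 28062.6+1.85ms +0
But these turn out to be mostly retries,
and the "all" view shows no failures:
Well, it ran overnight, but at some point a Print dialog popped up for some unknown reason and paused execution, so the crawl did not continue. After resuming, error lines still show up in the log.
Enough. I interrupted it and will start over next time.
phantomjs HTTP 599 Connection timed out after
Then I noticed that the earlier error was also:
HTTP 599: Connection timed out after 20000 milliseconds
But after increasing the timeout, the error seems to have changed to: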
<code>[E 180516 09:21:15 tornado_fetcher:212] [599] autohomeBrandData:4e101be6bc40d1e5a5df146b33ce1f3b https://www.autohome.com.cn/spec/27705/#pvareaid=2042128, HTTP 599: Failed to connect to 127.0.0.1 port 25555: Operation timed out 27.62s
[E 180516 09:21:15 processor:202] process autohomeBrandData:4e101be6bc40d1e5a5df146b33ce1f3b https://www.autohome.com.cn/spec/27705/#pvareaid=2042128 -> [599] len:0 -> result:None fol:0 msg:0 err:ParserError('Document is empty',)
[I 180516 09:21:15 scheduler:959] task retry 1/5 autohomeBrandData:5a42194730c3edc4446015f2dc2a013b https://www.autohome.com.cn/spec/27703/#pvareaid=2042128
</code>
tornado_fetcher HTTP 599 Failed to connect to port 25555
HTTP 599: Failed to connect to 127.0.0.1 port 25555 · Issue #755 · binux/pyspider
“you use phantomjs but not open phantomjs port”
It looks like the cause may be that I previously ran:
<code>pyspider -c config.json </code>
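and forgot the all argument.
Maybe that is the problem, and phantomjs was never started?
But from:
tornado_fetcher:364: curl error 599: couldn't connect to host · Issue #93 · binux/pyspider
it looks like the phantomjs port has to be specified manually?
pyspider phantomjs port
The results for this search do not say that a port must also be set for phantomjs.
Probably that earlier post, from 2015, is too old, and newer versions no longer need the phantomjs port to be specified?
In all mode every component, including phantomjs, is started by default.
Error after starting with --phantomjs-proxy="10.11.12.11:8080" · Issue #448 · binux/pyspider
cannot connect to webui. · Issue #309 · binux/pyspider
" if you are running pyspider with all mode (by command pyspider or pyspider all) and phantomjs is started successfully"
pyspider – how exactly do you set a proxy for phantomjs? – SegmentFault
Run it again: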
<code>pyspider -c config.json all </code>
to try it.
For the record, here is the initial output:
<code>➜ AutocarData pyspider -c config.json all phantomjs fetcher running on port 25555 。。。 self.mysqlDb=<AutohomeResultWorker.MysqlDb object at 0x105d58cc0> [I 180517 09:12:14 result_worker:49] result_worker starting... [I 180517 09:12:14 tornado_fetcher:638] fetcher starting... [I 180517 09:12:14 processor:211] processor starting... [I 180517 09:12:14 scheduler:647] scheduler starting... [I 180517 09:12:15 scheduler:126] project autohomeBrandData updated, status:RUNNING, paused:False, 0 tasks [I 180517 09:12:15 scheduler:965] select autohomeBrandData:_on_get_info data:,_on_get_info [I 180517 09:12:15 tornado_fetcher:188] [200] autohomeBrandData:_on_get_info data:,_on_get_info 0s [I 180517 09:12:15 scheduler:586] in 5m: new:0,success:0,retry:0,failed:0 [D 180517 09:12:15 project_module:145] project: autohomeBrandData updated. [I 180517 09:12:15 processor:202] process autohomeBrandData:_on_get_info data:,_on_get_info -> [200] len:12 -> result:None fol:0 msg:0 err:None [I 180517 09:12:15 scheduler:782] scheduler.xmlrpc listening on 127.0.0.1:23333 [I 180517 09:12:15 scheduler:360] autohomeBrandData on_get_info {'min_tick': 0, 'retry_delay': {}, 'crawl_config': {}} [I 180517 09:12:15 app:76] webui running on 0.0.0.0:5000 </code>
This does include the line:
phantomjs fetcher running on port 25555
And here is the output without the all argument:
<code>➜ AutocarData pyspider -c config.json phantomjs fetcher running on port 25555 [I 180517 09:13:34 result_worker:49] result_worker starting... [I 180517 09:13:34 processor:211] processor starting... [I 180517 09:13:34 tornado_fetcher:638] fetcher starting... [I 180517 09:13:34 scheduler:647] scheduler starting... [I 180517 09:13:35 scheduler:126] project autohomeBrandData updated, status:STOP, paused:False, 0 tasks [I 180517 09:13:35 scheduler:782] scheduler.xmlrpc listening on 127.0.0.1:23333 [I 180517 09:13:35 scheduler:586] in 5m: new:0,success:0,retry:0,failed:0 [I 180517 09:13:35 app:76] webui running on 0.0.0.0:5000 </code>
In other words:
when no mode is given, pyspider also defaults to all
-> with or without all it makes no difference: phantomjs is started either way
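A quick way to confirm that something is actually listening on the phantomjs port is a plain TCP connect. The sketch below uses only the standard library; the function name `port_is_open` is hypothetical, and 25555 is the port printed in the startup output above.

```python
import socket


def port_is_open(host, port, timeout=3.0):
    """Return True if a TCP connection to host:port succeeds within `timeout` seconds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers connection refused, timeouts and unreachable hosts.
        return False


# 25555 is the port from "phantomjs fetcher running on port 25555".
print("phantomjs port reachable:", port_is_open("127.0.0.1", 25555))
```

This only shows that the port accepts connections. It says nothing about whether phantomjs answers render requests in time, which is the failure mode seen in this post.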
Next, switch to the faster network at the office and run again to see whether the errors still occur.
Result: the same errors:
<code>[E 180517 09:32:53 tornado_fetcher:212] [599] autohomeBrandData:29ae096711906a0100db8641da85ee6f https://www.autohome.com.cn/spec/29931/#pvareaid=2042128, HTTP 599: Connection timed out after 61314 milliseconds 62.30s
[E 180517 09:32:53 tornado_fetcher:212] [599] autohomeBrandData:d89d93ca8126d2264393bf31d0b7f60a https://www.autohome.com.cn/spec/29930/#pvareaid=2042128, HTTP 599: Connection timed out after 61276 milliseconds 62.24s
[E 180517 09:32:53 processor:202] process autohomeBrandData:a65ac03901c5832a0c841f3c58d99e69 https://www.autohome.com.cn/spec/18006/#pvareaid=2042128 -> [599] len:0 -> result:None fol:0 msg:0 err:ParserError('Document is empty',)
</code>
As shown, the timeout here is now over 60 seconds and the connection still fails.
So simply raising the timeout further does not look like a fix.
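To judge whether a change improves things, it helps to count fetch results instead of eyeballing the log. The sketch below is a small parser for `tornado_fetcher` lines like the ones above; it assumes one log entry per line, with the status code in brackets and the elapsed seconds at the end of the line. The names `FETCH_LINE` and `summarize_fetch_log` are made up for this example.

```python
import re
from collections import Counter

# Assumed layout: "... tornado_fetcher:212] [599] <task> <url>, <message> 62.30s"
FETCH_LINE = re.compile(
    r"tornado_fetcher:\d+\] \[(?P<status>\d{3})\] .* (?P<elapsed>\d+(?:\.\d+)?)s\s*$"
)


def summarize_fetch_log(lines):
    """Count fetcher results per status code and collect elapsed times of 599 errors."""
    counts = Counter()
    elapsed_599 = []
    for line in lines:
        match = FETCH_LINE.search(line)
        if match is None:
            continue  # scheduler and processor lines are ignored
        status = int(match.group("status"))
        counts[status] += 1
        if status == 599:
            elapsed_599.append(float(match.group("elapsed")))
    return counts, elapsed_599


sample = [
    "[E 180517 09:32:53 tornado_fetcher:212] [599] autohomeBrandData:29ae096711906a0100db8641da85ee6f https://www.autohome.com.cn/spec/29931/#pvareaid=2042128, HTTP 599: Connection timed out after 61314 milliseconds 62.30s",
    "[E 180517 09:32:53 tornado_fetcher:212] [599] autohomeBrandData:d89d93ca8126d2264393bf31d0b7f60a https://www.autohome.com.cn/spec/29930/#pvareaid=2042128, HTTP 599: Connection timed out after 61276 milliseconds 62.24s",
    "[I 180517 09:12:15 tornado_fetcher:188] [200] autohomeBrandData:_on_get_info data:,_on_get_info 0s",
    "[I 180517 09:12:15 scheduler:586] in 5m: new:0,success:0,retry:0,failed:0",
]
counts, elapsed_599 = summarize_fetch_log(sample)
print(dict(counts), elapsed_599)
```

Running it over the log before and after a configuration change gives a failure ratio that can be compared directly.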
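Also, phantomjs does work some of the time, which is why data gets saved correctly:
the msrp value is not 0:
which shows that this code really is being executed: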
<code>self.crawl(eachModelDetailDict["url"], callback=self.carModelSpecPage, fetch_type='js', retries=5, connect_timeout=50, timeout=300, save=curSerieDict)

@catch_status_code_error
def carModelSpecPage(self, response):
</code>
So I really cannot figure out why this fails, or how to fix it for good.
Check the processes on the Mac:
<code>➜ ~ lsof -i:25555
COMMAND PID USER FD TYPE DEVICE SIZE/OFF NODE NAME
phantomjs 96892 crifan 8u IPv4 0x4b4cb78e55774f75 0t0 TCP localhost:25555->localhost:51131 (ESTABLISHED)
phantomjs 96892 crifan 12u IPv4 0x4b4cb78e5679cf75 0t0 TCP *:25555 (LISTEN)
phantomjs 96892 crifan 13u IPv4 0x4b4cb78e55aec615 0t0 TCP localhost:25555->localhost:49769 (ESTABLISHED)
...
localhost:25555->localhost:49517 (ESTABLISHED)
phantomjs 96892 crifan 186u IPv4 0x4b4cb78e59206235 0t0 TCP localhost:25555->localhost:51333 (ESTABLISHED)
python3.6 96895 crifan 44u IPv4 0x4b4cb78e58760235 0t0 TCP localhost:64752->localhost:25555 (ESTABLISHED)
...
python3.6 96895 crifan 139u IPv4 0x4b4cb78e569b98d5 0t0 TCP localhost:51235->localhost:25555 (ESTABLISHED)
➜ ~
</code>
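So phantomjs is indeed running.
There are about 30 python3.6 entries
and 31 phantomjs entries in the listing. Judging by the PIDs shown (96895 and 96892), these are socket connections held by one python3.6 process and one phantomjs process, not separate processes.
So the connections should all be in use.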
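Counting the entries by hand is error-prone, so here is a minimal sketch that groups `lsof -i:PORT` output by command, PID and connection state. It assumes the nine-column layout shown above plus the trailing state in parentheses; `count_sockets` is a hypothetical name.

```python
from collections import Counter


def count_sockets(lsof_lines):
    """Count socket entries per (command, pid, state) in `lsof -i:PORT` output."""
    counts = Counter()
    for line in lsof_lines:
        fields = line.split()
        # A full record has 10 fields; this also skips the 9-field header row.
        if len(fields) < 10:
            continue
        command, pid = fields[0], int(fields[1])
        state = fields[-1].strip("()")
        counts[(command, pid, state)] += 1
    return counts


sample = [
    "COMMAND PID USER FD TYPE DEVICE SIZE/OFF NODE NAME",
    "phantomjs 96892 crifan 8u IPv4 0x4b4cb78e55774f75 0t0 TCP localhost:25555->localhost:51131 (ESTABLISHED)",
    "phantomjs 96892 crifan 12u IPv4 0x4b4cb78e5679cf75 0t0 TCP *:25555 (LISTEN)",
    "phantomjs 96892 crifan 13u IPv4 0x4b4cb78e55aec615 0t0 TCP localhost:25555->localhost:49769 (ESTABLISHED)",
    "python3.6 96895 crifan 44u IPv4 0x4b4cb78e58760235 0t0 TCP localhost:64752->localhost:25555 (ESTABLISHED)",
]
socket_counts = count_sockets(sample)
for key, value in sorted(socket_counts.items()):
    print(key, value)
```

Grouping by PID makes it clear whether many entries mean many processes or many connections of the same process.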
After stopping pyspider, the processes are gone:
<code>➜ ~ lsof -i:25555 ➜ ~ </code>
which is the expected behavior.
Try adding a User-Agent.
pyspider phantomjs port
demo.pyspider.org deployment notes | Binuxの杂货铺
The phantomjs here is also the latest version, 2.1.1:
<code>➜ ~ phantomjs --version 2.1.1 ➜ ~ which phantomjs /usr/local/bin/phantomjs ➜ ~ ll /usr/local/bin/phantomjs* lrwxr-xr-x 1 crifan admin 107B 11 22 23:14 /usr/local/bin/phantomjs -> /Users/crifan/dev/dev_root/projects/亚马逊/自动下单/ms_store_auto_order/libs/webdriver/mac/phantomjs ➜ ~ ll /Users/crifan/dev/dev_root/projects/亚马逊/自动下单/ms_store_auto_order/libs/webdriver/mac/ total 111232 -rwxr-xr-x@ 1 crifan staff 11M 10 3 2017 chromedriver -rwxr-xr-x@ 1 crifan staff 43M 1 24 2016 phantomjs </code>
Why would phantomjs need a proxy set? Not clear.
The latest version on the official site is still 2.1.1.
pyspider phantomjs memory leak
" "phantomjs-proxy":"192.168.218.10:24444",
1. phantomjs-proxy specifies the address of the rendering tool to use; where necessary, also cap its memory to avoid memory leaks"
demo.pyspider.org deployment notes (repost) – microman – cnblogs (博客园)
They all talk about deploying phantomjs.
It turns out the posts around the web are all reposted from:
极客起源 – geekori.com – Question details – How to deal with pyspider phantomjs memory leaks and hangs?
"Hangs: restart periodically..."
phantomjs doesn't release memory · Issue #207 · binux/pyspider
"(google "phantomjs memory").
I have a phantomjs instance with crawl rate of 5 pages per minute, it cost ~500MB. And my solution is restart it every hour."
phantomjs memory leak · Issue #439 · binux/pyspider
"But sometimes phantomjs will some how be blocked instead of killed when memory usage near but not reach the limit."
So phantomjs may sometimes hang?
phantomjs memory leak
phantomjs set memory limit
Phantomjs Memory Leak Issue · Issue #14308 · ariya/phantomjs
[Unsolved] How to pass extra arguments to phantomjs in pyspider
Command Line Interface | PhantomJS
Surprisingly, there is no memory-related option there, let alone a memory limit.
pyspider phantomjs memory limit
Deployment of demo.pyspider.org – pyspider
"phantomjs
phantomjs have memory leak issue, memory limit applied, and it's recommended to restart it every hour."
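The "restart it every hour" advice can be sketched with a small supervisor loop. This is only an illustration using the standard library, not how demo.pyspider.org does it; `run_with_periodic_restart` and its parameters are hypothetical, and whether terminating the `pyspider phantomjs` wrapper also stops its phantomjs child process has not been verified here.

```python
import subprocess
import time


def run_with_periodic_restart(command, interval, cycles=None, pause=1.0):
    """Start `command`, terminate it after `interval` seconds, then start it again.

    `cycles` limits how many times the process is started (None means forever).
    `pause` is the delay between restarts, to avoid a tight loop if it crashes.
    Returns the number of times the process was started.
    """
    started = 0
    while cycles is None or started < cycles:
        proc = subprocess.Popen(command)
        started += 1
        try:
            # Returns early if the process exits by itself, for example after a crash.
            proc.wait(timeout=interval)
        except subprocess.TimeoutExpired:
            proc.terminate()
            try:
                proc.wait(timeout=10)
            except subprocess.TimeoutExpired:
                proc.kill()  # did not stop after terminate(), force it
                proc.wait()
        time.sleep(pause)
    return started


# Possible use, restarting once per hour (command assumed from this post's setup):
#   run_with_periodic_restart(["pyspider", "phantomjs", "--port", "25555"], interval=3600)
```

A fixed-interval restart is crude, but it bounds memory growth without needing any memory-limit option in phantomjs itself.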
Phantomjs, performance · Issue #458 · binux/pyspider
connect to scheduler rpc error: error(111, ‘Connection refused’) with docker – Google Groups
So my plan is:
when time permits, first run a standalone phantomjs,
then run pyspider with the proxy pointing at that phantomjs?
<code>➜ AutocarData pyspider phantomjs --help Usage: pyspider phantomjs [OPTIONS] [ARGS]... Run phantomjs fetcher if phantomjs is installed. Options: --phantomjs-path TEXT phantomjs path --port INTEGER phantomjs port --auto-restart TEXT auto restart phantomjs if crashed --help Show this message and exit. ➜ AutocarData pyspider --help Usage: pyspider [OPTIONS] COMMAND [ARGS]... A powerful spider system in python. Options: -c, --config FILENAME a json file with default values for subcommands. {"webui": {"port":5001}} --logging-config TEXT logging config file for built-in python logging module [default: /Users/crifan/.loc al/share/virtualenvs/AutocarData-xI- iqIq4/lib/python3.6/site- packages/pyspider/logging.conf] --debug debug mode --queue-maxsize INTEGER maxsize of queue --taskdb TEXT database url for taskdb, default: sqlite --projectdb TEXT database url for projectdb, default: sqlite --resultdb TEXT database url for resultdb, default: sqlite --message-queue TEXT connection url to message queue, default: builtin multiprocessing.Queue --amqp-url TEXT [deprecated] amqp url for rabbitmq. please use --message-queue instead. --beanstalk TEXT [deprecated] beanstalk config for beanstalk queue. please use --message-queue instead. --phantomjs-proxy TEXT phantomjs proxy ip:port --data-path TEXT data dir path --add-sys-path / --not-add-sys-path add current working directory to python lib search path --version Show the version and exit. --help Show this message and exit. Commands: all Run all the components in subprocess or... bench Run Benchmark test. fetcher Run Fetcher. one One mode not only means all-in-one, it runs... phantomjs Run phantomjs fetcher if phantomjs is... processor Run Processor. result_worker Run result worker. scheduler Run Scheduler, only one scheduler is allowed. send_message Send Message to project from command line webui Run WebUI </code>
Run phantomjs on its own in a separate terminal window:
<code>➜ AutocarData pyspider phantomjs --help
Usage: pyspider phantomjs [OPTIONS] [ARGS]...

  Run phantomjs fetcher if phantomjs is installed.

Options:
  --phantomjs-path TEXT  phantomjs path
  --port INTEGER         phantomjs port
  --auto-restart TEXT    auto restart phantomjs if crashed
  --help                 Show this message and exit.
➜ AutocarData pyspider phantomjs --port 23450 --auto-restart true
phantomjs fetcher running on port 23450
</code>
Here, for the option:
--auto-restart TEXT auto restart phantomjs if crashed
I used: --auto-restart true
though I am not sure that syntax is right.
Also:
<code>➜ AutocarData pyspider all --help Usage: pyspider all [OPTIONS] Run all the components in subprocess or thread Options: --fetcher-num INTEGER instance num of fetcher --processor-num INTEGER instance num of processor --result-worker-num INTEGER instance num of result worker --run-in [subprocess|thread] run each components in thread or subprocess. always using thread for windows. --help Show this message and exit. </code>
In:
config.json
add the phantomjs-proxy entry:
<code>{
  "resultdb": "mysql+resultdb://root:crifan_mysql@127.0.0.1:3306/AutohomeResultdb",
  "result_worker":{
    "result_cls": "AutohomeResultWorker.AutohomeResultWorker"
  },
  "phantomjs-proxy": "127.0.0.1:23450"
}
</code>
Then run again:
<code>➜ AutocarData pyspider -c config.json [I 180517 11:11:15 result_worker:49] result_worker starting... [I 180517 11:11:15 processor:211] processor starting... [I 180517 11:11:15 tornado_fetcher:638] fetcher starting... [I 180517 11:11:15 scheduler:647] scheduler starting... [I 180517 11:11:15 scheduler:126] project autohomeBrandData updated, status:RUNNING, paused:False, 0 tasks [I 180517 11:11:15 scheduler:965] select autohomeBrandData:_on_get_info data:,_on_get_info [I 180517 11:11:15 scheduler:586] in 5m: new:0,success:0,retry:0,failed:0 [I 180517 11:11:15 tornado_fetcher:188] [200] autohomeBrandData:_on_get_info data:,_on_get_info 0s [D 180517 11:11:15 project_module:145] project: autohomeBrandData updated. [I 180517 11:11:15 processor:202] process autohomeBrandData:_on_get_info data:,_on_get_info -> [200] len:12 -> result:None fol:0 msg:0 err:None [I 180517 11:11:16 scheduler:360] autohomeBrandData on_get_info {'min_tick': 0, 'retry_delay': {}, 'crawl_config': {}} [I 180517 11:11:16 scheduler:782] scheduler.xmlrpc listening on 127.0.0.1:23333 [I 180517 11:11:16 app:76] webui running on 0.0.0.0:5000 </code>
As shown, pyspider no longer starts its own phantomjs this time.
Keep it running and see how it goes.
Meanwhile the phantomjs window printed results, all 200 OK:
<code>➜ AutocarData pyspider phantomjs --port 23450 --auto-restart true phantomjs fetcher running on port 23450 [200] https://www.autohome.com.cn/spec/32172/#pvareaid=2042128 3.661 [200] https://www.autohome.com.cn/spec/32678/#pvareaid=2042128 3.811 。。。 [200] https://www.autohome.com.cn/spec/7955/#pvareaid=2042128 30.801 。。。 [200] https://www.autohome.com.cn/spec/6097/#pvareaid=2042128 35.638 [200] https://www.autohome.com.cn/spec/6096/#pvareaid=2042128 22.73 [200] https://www.autohome.com.cn/spec/8830/#pvareaid=2042128 29.952 [200] https://www.autohome.com.cn/spec/10574/#pvareaid=2042128 31.178 。。。 [200] https://www.autohome.com.cn/spec/10569/#pvareaid=2042128 33.969 。。。 [200] https://www.autohome.com.cn/spec/10658/#pvareaid=2042128 8.827 [200] https://www.autohome.com.cn/spec/12153/#pvareaid=2042128 16.626 </code>
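Keep it running to see whether errors appear.
Note, though, that many pages are very slow to render; the slow ones take more than 30 seconds to finish.
Checking memory shows:
phantomjs is using more than 1 GB of memory.
The same problem still occurred:
And the phantomjs process also shows some requests that take especially long, for example around 50 seconds: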
<code>[200] https://www.autohome.com.cn/spec/33133/#pvareaid=2042128 33.75
[200] https://www.autohome.com.cn/spec/20057/#pvareaid=2042128 35.36
[200] https://www.autohome.com.cn/spec/12585/#pvareaid=2042128 40.707
[200] https://www.autohome.com.cn/spec/22145/#pvareaid=2042128 41.128
[200] https://www.autohome.com.cn/spec/13913/#pvareaid=2042128 47.393
[200] https://www.autohome.com.cn/spec/19793/#pvareaid=2042128 49.275
[200] https://www.autohome.com.cn/spec/20529/#pvareaid=2042128 49.392
[200] https://www.autohome.com.cn/spec/30554/#pvareaid=2042128 49.745
[200] https://www.autohome.com.cn/spec/1001716/#pvareaid=2042128 50.309
[200] https://www.autohome.com.cn/spec/30685/#pvareaid=2042128 49.213
[200] https://www.autohome.com.cn/spec/13911/#pvareaid=2042128 28.214
[200] https://www.autohome.com.cn/spec/24979/#pvareaid=2042128 30.104
[200] https://www.autohome.com.cn/spec/27899/#pvareaid=2042128 27.286
[200] https://www.autohome.com.cn/spec/14245/#pvareaid=2042128 31.383
[200] https://www.autohome.com.cn/spec/606/#pvareaid=2042128 31.034
[200] https://www.autohome.com.cn/spec/10108/#pvareaid=2042128 34.614
[200] https://www.autohome.com.cn/spec/15468/#pvareaid=2042128 34.815
[200] https://www.autohome.com.cn/spec/31844/#pvareaid=2042128 35.035
[200] https://www.autohome.com.cn/spec/1352/#pvareaid=2042128 35.193
[200] https://www.autohome.com.cn/spec/12631/#pvareaid=2042128 35.258
[200] https://www.autohome.com.cn/spec/23886/#pvareaid=2042128 33.805
[200] https://www.autohome.com.cn/spec/25316/#pvareaid=2042128 29.204
[200] https://www.autohome.com.cn/spec/26384/#pvareaid=2042128 29.158
[200] https://www.autohome.com.cn/spec/20500/#pvareaid=2042128 30.431
[200] https://www.autohome.com.cn/spec/31848/#pvareaid=2042128 31.86
[200] https://www.autohome.com.cn/spec/25690/#pvareaid=2042128 32.249
[200] https://www.autohome.com.cn/spec/24594/#pvareaid=2042128 33.069
[200] https://www.autohome.com.cn/spec/23713/#pvareaid=2042128 32.747
[200] https://www.autohome.com.cn/spec/23885/#pvareaid=2042128 32.561
[200] https://www.autohome.com.cn/spec/24519/#pvareaid=2042128 33.507
[200] https://www.autohome.com.cn/spec/29967/#pvareaid=2042128 28.702
[200] https://www.autohome.com.cn/spec/30073/#pvareaid=2042128 28.299
</code>
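Moreover, I had already interrupted and stopped pyspider, yet phantomjs was still printing output.
So apparently:
some pages exceeded the 50-second limit set by connect_timeout=50 passed to self.crawl, and timed out.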
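To check how many renders actually cross a given limit, the phantomjs output above can be filtered by render time. The sketch below assumes each entry has the form `[status] url seconds`; `slow_renders` is a hypothetical helper, and the sample uses three entries copied from the output.

```python
import re

# Assumed entry layout in the phantomjs window: "[200] <url> <seconds>"
RENDER_ENTRY = re.compile(r"\[(\d{3})\]\s+(\S+)\s+(\d+(?:\.\d+)?)")


def slow_renders(text, threshold):
    """Return (url, seconds) pairs whose render time is at least `threshold` seconds."""
    result = []
    for _status, url, seconds in RENDER_ENTRY.findall(text):
        if float(seconds) >= threshold:
            result.append((url, float(seconds)))
    return result


sample = """
[200] https://www.autohome.com.cn/spec/33133/#pvareaid=2042128 33.75
[200] https://www.autohome.com.cn/spec/1001716/#pvareaid=2042128 50.309
[200] https://www.autohome.com.cn/spec/13911/#pvareaid=2042128 28.214
"""
over_limit = slow_renders(sample, threshold=50)
print(over_limit)
```

Whether the time printed by phantomjs is measured the same way as the fetcher's timeout is an open question, so a comparison against `connect_timeout` is only indicative.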
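Things to figure out:
[Solved] What proxy means for phantomjs in pyspider
[Unsolved] How to pass extra arguments to phantomjs in pyspider
The argument passing seems to be sorted out, so run again and see.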
<code>➜ AutocarData cat config.json
{
  "resultdb": "mysql+resultdb://root:crifan_mysql@127.0.0.1:3306/AutohomeResultdb",
  "result_worker":{
    "result_cls": "AutohomeResultWorker.AutohomeResultWorker"
  },
  "phantomjs-proxy": "127.0.0.1:23450",
  "phantomjs" : {
    "port": 23450,
    "auto-restart": true,
    "load-images": false,
    "debug": true
  }
}%
➜ AutocarData pyspider -c config.json phantomjs
phantomjs fetcher running on port 23450
</code>
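Since the port now appears twice in config.json, a small sanity check can catch a mismatch between the two entries. The sketch below only inspects the JSON; the function name `proxy_matches_phantomjs_port` is hypothetical, and it makes no claim about how pyspider itself reads these keys.

```python
import json


def proxy_matches_phantomjs_port(config):
    """Check that "phantomjs-proxy" points at the port in the "phantomjs" section."""
    proxy = config.get("phantomjs-proxy", "")
    port = config.get("phantomjs", {}).get("port")
    if not proxy or port is None:
        return False
    # "127.0.0.1:23450" -> "23450"
    return proxy.rsplit(":", 1)[-1] == str(port)


config = json.loads("""
{
  "phantomjs-proxy": "127.0.0.1:23450",
  "phantomjs": {"port": 23450, "auto-restart": true, "load-images": false}
}
""")
print(proxy_matches_phantomjs_port(config))
```

In practice the dict would come from `json.load(open("config.json"))` instead of the inline string used here.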
Then, in another terminal:
<code> AutocarData pyspider -c config.json all [I 180517 22:05:29 result_worker:49] result_worker starting... [I 180517 22:05:29 tornado_fetcher:638] fetcher starting... [I 180517 22:05:29 processor:211] processor starting... [I 180517 22:05:29 scheduler:647] scheduler starting... [I 180517 22:05:29 scheduler:782] scheduler.xmlrpc listening on 127.0.0.1:23333 [I 180517 22:05:29 scheduler:126] project autohomeBrandData updated, status:RUNNING, paused:False, 0 tasks [I 180517 22:05:29 scheduler:965] select autohomeBrandData:_on_get_info data:,_on_get_info [I 180517 22:05:29 scheduler:586] in 5m: new:0,success:0,retry:0,failed:0 [I 180517 22:05:29 tornado_fetcher:188] [200] autohomeBrandData:_on_get_info data:,_on_get_info 0s [D 180517 22:05:29 project_module:145] project: autohomeBrandData updated. [I 180517 22:05:29 processor:202] process autohomeBrandData:_on_get_info data:,_on_get_info -> [200] len:12 -> result:None fol:0 msg:0 err:None [I 180517 22:05:30 app:76] webui running on 0.0.0.0:5000 [I 180517 22:05:30 scheduler:360] autohomeBrandData on_get_info {'min_tick': 0, 'retry_delay': {}, 'crawl_config': {}} </code>
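Wait and see whether errors still occur.
The hope now is that load-images=false is actually passed on to phantomjs, so pages are loaded without images, which should shorten page load time, lower the chance of errors, and perhaps avoid the timeouts.
Although phantomjs was able to render some pages:
the errors still came back: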
<code>[E 180517 22:19:57 tornado_fetcher:212] [599] autohomeBrandData:b0ccf36b6fc5473ee7d3742d33e1fa4f https://www.autohome.com.cn/spec/6726/#pvareaid=2042128, HTTP 599: Failed to connect to 127.0.0.1 port 23450: Operation timed out 27.83s
[E 180517 22:19:57 tornado_fetcher:212] [599] autohomeBrandData:6739df8b7e3b21f5a667923d46efbc80 https://www.autohome.com.cn/spec/8152/#pvareaid=2042128, HTTP 599: Failed to connect to 127.0.0.1 port 23450: Operation timed out 27.82s
[E 180517 22:19:57 processor:202] process autohomeBrandData:b0ccf36b6fc5473ee7d3742d33e1fa4f https://www.autohome.com.cn/spec/6726/#pvareaid=2042128 -> [599] len:0 -> result:None fol:0 msg:0 err:ParserError('Document is empty',)
[E 180517 22:19:57 processor:202] process autohomeBrandData:6739df8b7e3b21f5a667923d46efbc80 https://www.autohome.com.cn/spec/8152/#pvareaid=2042128 -> [599] len:0 -> result:None fol:0 msg:0 err:ParserError('Document is empty',)
</code>
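I never managed to fully resolve this problem.
[Summary]
The goal here was to get some data from a page (for example a car's price), but that data is loaded asynchronously by JavaScript.
The solution offered by PySpider is:
add fetch_type='js' to self.crawl ->
which uses phantomjs internally to run the JavaScript and load the page content.
This approach is basically usable (during debugging).
However, once the PySpider project is run and pages are crawled in bulk, page load times become very long:
a single page used to take under 1 second; with phantomjs each page takes 2 to 3 seconds, sometimes tens of seconds,
so that errors occur during bulk crawling: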
<code>tornado_fetcher HTTP 599: Connection timed out after 20000 milliseconds 20.00s </code>
or:
<code>tornado_fetcher HTTP 599: Failed to connect to 127.0.0.1 port 23450: Operation timed out 27.83s </code>
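I tried several approaches in turn:
- increasing the timeout
- switching to a faster network
- adding the all argument to pyspider
- passing extra arguments to phantomjs: no luck
- suspecting a memory-leak bug in phantomjs: not confirmed, and nothing I could fix
- adding a User-Agent
- and so on
None of them solved it.
For now the only option is to give up on phantomjs.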