How to webscrape secured pages in R (https links) using readHTMLTable from the XML package?

Category: Servers · Published: 6 years ago

Source: http://stackoverflow.com/questions/10692066/how-to-webscrape-secured-pages-in-r-https-links-using-readhtmltable-from-xml

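There are good answers about how to use readHTMLTable from the XML package, and I have used it successfully with regular http pages, but I cannot solve my problem with https pages.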

I am trying to read the table on this website (url string):

library(RTidyHTML)
library(XML)
url <- "https://ned.nih.gov/search/ViewDetails.aspx?NIHID=0010121048"
h = htmlParse(url)
tables <- readHTMLTable(url)

But I get this error: File https://ned.nih.gov/search/Vi… does not exist.

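I tried to get past the https problem with the first two lines of the code below, based on a solution I found through Google (such as this one: http://tonybreyal.wordpress.com/2012/01/13/r-a-quick-scrape-of-top-grossing-films-from-boxofficemojo-com/ ).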

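This trick lets me see more of the page, but every attempt to extract the table fails. Any advice is appreciated. I need the table fields such as Organization, Organizational Title, and Manager.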

# Attempt to get past the https problem
library(RCurl)
raw <- getURL(url, followlocation = TRUE, cainfo = system.file("CurlSSL", "cacert.pem", package = "RCurl"))
head(raw)
[1] "\r\n<!DOCTYPE html PUBLIC \"-//W3C//DTD XHTML 1.0 Transitional//EN\" \"http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd\">\n<html xmlns=\"http://www.w3.org/1999/xhtml\" xml:lang=\"en\" lang=\"en\">\n<head>\n<meta http-equiv=\"Content-Type\" content=\"text/html; 
...
 h = htmlParse(raw)
Error in htmlParse(raw) : File ...
tables <- readHTMLTable(raw)
Error in htmlParse(doc) : File ...
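
One possible reason for the "File ..." errors above is that htmlParse() is interpreting the downloaded string as a file name rather than as HTML content. The XML package has an asText argument for this case. The sketch below shows the idea on a small, made-up HTML string (raw_html is a hypothetical stand-in for the result of getURL(), not the real page), so it has not been checked against the NIH site itself.

```r
library(XML)

# Hypothetical stand-in for the string returned by getURL(); not the real page.
raw_html <- paste0(
  "<html><body>",
  "<table id='person'>",
  "<tr><td>Organization:</td><td>Office of the Director (HNA)</td></tr>",
  "<tr><td>Classification:</td><td>Employee</td></tr>",
  "</table>",
  "</body></html>"
)

# asText = TRUE marks the argument as HTML content, not a file path or URL.
doc <- htmlParse(raw_html, asText = TRUE)

# readHTMLTable() also accepts an already-parsed document and returns
# a list with one data frame per <table>.
tables <- readHTMLTable(doc, header = FALSE, stringsAsFactors = FALSE)
person <- tables[[1]]
print(person)
```

If this applies to the real page, the same two calls would be used with raw in place of raw_html.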

The new package httr provides a wrapper around RCurl to make it easier to scrape all kinds of pages.

Still, this page gave me a fair amount of trouble. The following works, but no doubt there are easier ways of doing it.

library("httr")
library("XML")

# Define certificate file
cafile <- system.file("CurlSSL", "cacert.pem", package = "RCurl")

# Read page
page <- GET(
  "https://ned.nih.gov/", 
  path="search/ViewDetails.aspx", 
  query="NIHID=0010121048",
  config(cainfo = cafile)
)

# Use regex to extract the desired table
x <- content(page, as = "text")  # text_content(page) in older httr versions
tab <- sub('.*(<table class="grid".*?>.*</table>).*', '\\1', x)

# Parse the table
readHTMLTable(tab)
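
As an alternative to the regular expression, the table could be selected with an XPath query after parsing the whole page. This is only a sketch under assumptions: extract_grid_table() is a hypothetical helper, not part of the answer above, and sample_html is made-up markup that imitates a table with class "grid" rather than the real page.

```r
library(XML)

# Hypothetical helper: return the first <table class="grid"> found in an
# HTML string as a data frame, or NULL if there is no such table.
extract_grid_table <- function(html) {
  doc <- htmlParse(html, asText = TRUE)
  nodes <- getNodeSet(doc, "//table[@class='grid']")
  if (length(nodes) == 0) {
    return(NULL)
  }
  readHTMLTable(nodes[[1]], header = FALSE, stringsAsFactors = FALSE)
}

# Made-up sample markup standing in for the downloaded page.
sample_html <- paste0(
  '<html><body>',
  '<table class="menu"><tr><td>Home</td></tr></table>',
  '<table class="grid" id="dvPerson">',
  '<tr><td>Phone:</td><td>301-496-2433</td></tr>',
  '<tr><td>Classification:</td><td>Employee</td></tr>',
  '</table>',
  '</body></html>'
)

grid <- extract_grid_table(sample_html)
print(grid)
```

With httr, the string passed to the helper would be the page text, for example the value returned by content(page, as = "text"). Selecting the node by its class avoids relying on greedy regex matching when the page contains more than one table.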

Result:

$ctl00_ContentPlaceHolder_dvPerson
                V1                                      V2
1      Legal Name:                    Dr Francis S Collins
2  Preferred Name:                      Dr Francis Collins
3          E-mail:                 francis.collins@nih.gov
4        Location: BG 1 RM 1261 CENTER DRBETHESDA MD 20814
5       Mail Stop:                                       Â
6           Phone:                            301-496-2433
7             Fax:                                       Â
8              IC:             OD (Office of the Director)
9    Organization:            Office of the Director (HNA)
10 Classification:                                Employee
11            TTY:                                       Â

Get httr here: http://cran.r-project.org/web/packages/httr/index.html

EDIT: A useful page with FAQs about the RCurl package: http://www.omegahat.org/RCurl/FAQ.html
