Problem
AWS lets you enable detailed billing. Once it is enabled, AWS writes the detailed billing data to a designated S3 bucket several times a day [1].
The billing data is a CSV file, for example:
"InvoiceID","PayerAccountId","LinkedAccountId","RecordType","ProductName","RateId","SubscriptionId","PricingPlanId","UsageType","Operation","AvailabilityZone","ReservedInstance","ItemDescription","UsageStartDate","UsageEndDate","UsageQuantity","Rate","Cost"
"Estimated","xxxxxxxxxxxx","xxxxxxxxxxxx","LineItem","Amazon Simple Queue Service","16850885","1846142824","1292565","CNN1-Requests-Tier1","GetQueueAttributes","","N","First 1,000,000 Amazon SQS Requests per month are free","2019-01-01 00:00:00","2019-01-01 01:00:00","60.0000000000","0.0000000000","0.0000000000"
"Estimated","xxxxxxxxxxxx","xxxxxxxxxxxx","LineItem","Amazon Simple Queue Service","16850885","1846142824","1292565","CNN1-Requests-Tier1","GetQueueUrl","","N","First 1,000,000 Amazon SQS Requests per month are free","2019-01-01 00:00:00","2019-01-01 01:00:00","180.0000000000","0.0000000000","0.0000000000"
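The first line holds the field names; the remaining lines hold the corresponding data.
If we want to analyze this data with Hive and create the table as shown below, every field value keeps its double quotes and the columns declared as double come back as NULL, which makes analysis inconvenient.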
hive> CREATE EXTERNAL TABLE IF NOT EXISTS aws_bill (
    InvoiceID string,
    PayerAccountId string,
    LinkedAccountId string,
    RecordType string,
    ProductName string,
    RateId string,
    SubscriptionId string,
    PricingPlanId string,
    UsageType string,
    Operation string,
    AvailabilityZone string,
    ReservedInstance string,
    ItemDescription string,
    UsageStartDate string,
    UsageEndDate string,
    UsageQuantity double,
    Rate double,
    Cost double
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
LOCATION 's3://feichashao-hadoop/bill/';

hive> select * from aws_bill;
"Estimated","xxxxxxxxxxxx","xxxxxxxxxxxx","LineItem","Amazon Simple Queue Service","16850885","1846142824","1292565","CNN1-Requests-Tier1","GetQueueAttributes","","N","First 1,000,000 Amazon SQS Requests per month are free","2019-01-01 00:00:00","2019-01-01 01:00:00",NULL,NULL,NULL
The desired result is field values without the double quotes, for example:
Estimated xxxxxxxxxxxx xxxxxxxxxxxx LineItem Amazon Simple Queue Service 16850885 1846142824 1292565 CNN1-Requests-Tier1 GetQueueAttributes N First 1,000,000 Amazon SQS Requests per month are free 2019-01-01 00:00:00 2019-01-01 01:00:00 60.0000000000 0.0000000000 0.0000000000
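The quotes are not the only thing a comma-delimited table trips over. As a quick check outside Hive (a sketch that is not part of the original post; the variable names are illustrative), the snippet below splits one sample row on every comma, which is how FIELDS TERMINATED BY ',' treats the line:

```shell
# One data row from the sample bill; it holds 18 quoted fields.
row='"Estimated","xxxxxxxxxxxx","xxxxxxxxxxxx","LineItem","Amazon Simple Queue Service","16850885","1846142824","1292565","CNN1-Requests-Tier1","GetQueueAttributes","","N","First 1,000,000 Amazon SQS Requests per month are free","2019-01-01 00:00:00","2019-01-01 01:00:00","60.0000000000","0.0000000000","0.0000000000"'

# Split on every comma without looking at the quotes and count the pieces.
naive_fields=$(printf '%s\n' "$row" | awk -F',' '{ print NF }')

echo "fields after a naive comma split: $naive_fields"
```

The split reports 20 fields instead of 18, because the ItemDescription text "First 1,000,000 Amazon SQS Requests per month are free" contains two commas of its own. A fix therefore has to respect the quoting, not merely hide the quote characters.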
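Method 1: CSV SerDe
Hive's delimited row format offers no way to strip the double quotes around a field. We can, however, specify the CSV SerDe [2] when creating the table.
The table is created as follows: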
hive> CREATE EXTERNAL TABLE IF NOT EXISTS aws_bill_serde (
    InvoiceID string,
    PayerAccountId string,
    [... omitted ...]
    Cost double
)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.OpenCSVSerde'
LOCATION 's3://feichashao-hadoop/bill/';
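With the table created this way, select * from aws_bill_serde returns values without the double quotes.
However, with this SerDe, every column shows up as string in the describe output, no matter whether we declared it as string or double. That makes calculations less convenient; if needed, a column can be converted explicitly, for example with CAST(cost AS double).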
hive> describe aws_bill_serde;
OK
invoiceid           string    from deserializer
payeraccountid      string    from deserializer
linkedaccountid     string    from deserializer
recordtype          string    from deserializer
productname         string    from deserializer
rateid              string    from deserializer
subscriptionid      string    from deserializer
pricingplanid       string    from deserializer
usagetype           string    from deserializer
operation           string    from deserializer
availabilityzone    string    from deserializer
reservedinstance    string    from deserializer
itemdescription     string    from deserializer
usagestartdate      string    from deserializer
usageenddate        string    from deserializer
usagequantity       string    from deserializer
cost                string    from deserializer
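Method 2: Preprocess the CSV file
Since the double quotes are unwanted, another option is to preprocess the CSV file and strip them at the source.
The sed command below does this. It removes every double quote without parsing the CSV, so commas inside a field (as in the ItemDescription values of the sample) remain indistinguishable from field separators: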
$ sed 's/"//g' old.csv > new.csv
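A variant of this preprocessing step changes the delimiter as well, so that commas inside a field (for example "First 1,000,000 Amazon SQS Requests per month are free" in the sample) are not mistaken for separators. The following is a minimal sketch, not the original author's method. The function name csv_to_tsv is hypothetical, and it assumes that every field is wrapped in double quotes (as in the sample), that no field contains a tab, an escaped quote, or the sequence "," itself, and that the first line is the header row, which it drops:

```shell
# Hypothetical helper: read a fully quoted CSV on stdin, write tab-separated text on stdout.
# Assumptions: every field is quoted; no field contains a tab, an escaped quote, or '","'.
csv_to_tsv() {
  tab=$(printf '\t')
  # tail drops the header row so that it is not loaded as a data row.
  # sed strips the outer quotes of each line, then turns each '","' separator into a tab.
  tail -n +2 | sed -e 's/^"//' -e 's/"$//' -e "s/\",\"/${tab}/g"
}

# Small demonstration on inline data: the comma inside the second field is kept.
printf '%s\n' '"Name","Note"' '"sqs","First 1,000,000 requests"' | csv_to_tsv
```

It would be used as csv_to_tsv < old.csv > new.tsv, and the Hive table would then be declared with FIELDS TERMINATED BY '\t' instead of ','. If the billing files ever contain escaped quotes or unquoted fields, a real CSV parser (or the CSV SerDe from Method 1) is the safer choice.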
References
[1] https://docs.aws.amazon.com/awsaccountbilling/latest/aboutv2/billing-getting-started.html
[2] https://cwiki.apache.org/confluence/display/Hive/CSV+Serde