内容简介:Hive并不是简简单单写SQL,因为我们要进行层层调优,如果连Hive的内部运行机制都搞不清,那么hive对我们来说仅仅是一个黑箱,高效率的调优无从谈起,所以我们很有必要了解下Hive是如何将SQL转化为MapReduce任务的呢?我们以下面这个SQL为例整个编译过程分为六个阶段:
Hive并不是简简单单写SQL,因为我们要进行层层调优,如果连Hive的内部运行机制都搞不清,那么hive对我们来说仅仅是一个黑箱,高效率的调优无从谈起,所以我们很有必要了解下Hive是如何将 SQL 转化为MapReduce任务的呢?
Hive 底层执行流程
我们以下面这个SQL为例
FROM src INSERT OVERWRITE TABLE dest_g1 SELECT src.key, sum(substr(src.value,4)) GROUP BY src.key;
整个编译过程分为六个阶段:
1.Antlr定义SQL的语法规则,完成SQL词法,语法解析,将SQL
HiveLexerX , HiveParser 分别是Antlr对SQL编译后自动生成的词法解析和语法解析类,在这两个类中进行复杂的解析。
例子中的AST tree为
ABSTRACT SYNTAX TREE: (TOK_QUERY (TOK_FROM (TOK_TABREF src)) (TOK_INSERT (TOK_DESTINATION (TOK_TAB dest_g1)) (TOK_SELECT (TOK_SELEXPR (TOK_COLREF src key)) (TOK_SELEXPR (TOK_FUNCTION sum (TOK_FUNCTION substr (TOK_COLREF src value) 4)))) (TOK_GROUPBY (TOK_COLREF src key))))
2.遍历AST Tree,抽象出查询的基本组成单元QueryBlock
AST Tree 仍然非常复杂,不够结构化,不方便直接翻译为 MapReduce 程序, AST
Tree 转化为 QueryBlock (QB)就是将 SQL 进一部抽象和结构化。
AST Tree 生成 QueryBlock 的过程是一个递归的过程,先序遍历 AST Tree ,遇到不
同的Token 节点(理解为特殊标记),保存到相应的属性中,主要包含以下几个过程
TOK_QUERY => 创建 QB 对象,循环递归子节点 TOK_FROM => 将表名语法部分保存到 QB 对象的 aliasToTabs 等属性中 TOK_INSERT => 循环递归子节点 TOK_DESTINATION => 将输出目标的语法部分保存在 QBParseInfo 对象的nameToDest 属性中 TOK_SELECT => 分别将查询表达式的语法部分保存在 destToSelExpr 、destToAggregationExprs 、 destToDistinctFuncExprs 三个属性中 TOK_WHERE => 将 Where 部分的语法保存在 QBParseInfo 对象的destToWhereExpr 属性中
3.遍历QueryBlock,翻译为执行操作树OperatorTree
Hive 最终生成的 MapReduce 任务, Map 阶段和 Reduce 阶段均由 Operator Tree
组成。逻辑操作符,就是在 Map 阶段或者 Reduce 阶段完成单一特定的操作。
基本的操作符包括
TableScanOperator、SelectOperator、FilterOperator、JoinOperator、GroupByOperator、ReduceSinkOperator
QueryBlock 生成 Operator Tree 就是遍历上一个过程中生成的 QB 和 QBParseInfo
对象的保存
语法的属性,包含如下几个步骤:
QB#aliasToSubq => 有子查询,递归调用 QB#aliasToTabs => TableScanOperator QBParseInfo#joinExpr => QBJoinTree => ReduceSinkOperator + JoinOperator QBParseInfo#destToWhereExpr => FilterOperator QBParseInfo#destToGroupby => ReduceSinkOperator +GroupByOperator QBParseInfo#destToOrderby => ReduceSinkOperator + ExtractOperator
由于 Join/GroupBy/OrderBy 均需要在 Reduce 阶段完成,所以在生成相应操作的Operator 之前都会先生成一个 ReduceSinkOperator ,将字段组合并序列化为 Reduce Key/value,Partition Key
SQL例子翻译成OperatorTree
STAGE PLANS: Stage: Stage-1 Map Reduce Alias -> Map Operator Tree: src Reduce Output Operator key expressions: expr: key type: string sort order: + Map-reduce partition columns: expr: rand() type: double tag: -1 value expressions: expr: substr(value, 4) type: string Reduce Operator Tree: Group By Operator aggregations: expr: sum(UDFToDouble(VALUE.0)) keys: expr: KEY.0 type: string mode: partial1 File Output Operator compressed: false table: input format: org.apache.hadoop.mapred.SequenceFileInputFormat output format: org.apache.hadoop.mapred.SequenceFileOutputFormat name: binary_table Stage: Stage-2 Map Reduce Alias -> Map Operator Tree: /tmp/hive-zshao/67494501/106593589.10001 Reduce Output Operator key expressions: expr: 0 type: string sort order: + Map-reduce partition columns: expr: 0 type: string tag: -1 value expressions: expr: 1 type: double Reduce Operator Tree: Group By Operator aggregations: expr: sum(VALUE.0) keys: expr: KEY.0 type: string mode: final Select Operator expressions: expr: 0 type: string expr: 1 type: double Select Operator expressions: expr: UDFToInteger(0) type: int expr: 1 type: double File Output Operator compressed: false table: input format: org.apache.hadoop.mapred.TextInputFormat output format: org.apache.hadoop.hive.ql.io.IgnoreKeyTextOutputFormat serde: org.apache.hadoop.hive.serde2.dynamic_type.DynamicSerDe name: dest_g1 Stage: Stage-0 Move Operator tables: replace: true table: input format: org.apache.hadoop.mapred.TextInputFormat output format: org.apache.hadoop.hive.ql.io.IgnoreKeyTextOutputFormat serde: org.apache.hadoop.hive.serde2.dynamic_type.DynamicSerDe name: dest_g1
4.Logical Optimizer进行OperatorTree变换,合并不必要的
使用 ReduceSinkOperator ,减少shuffle数据量。大部分逻辑层优化器通过变换 OperatorTree ,合并操作符,达到减少 MapReduce Job ,减少 shuffle 数据量的目的。
5.遍历OperatorTree,翻译为Task tree
OperatorTree 转化为 Task tree的过程分为下面几个阶段
- 对输出表生成 MoveTask
- 从 OperatorTree 的其中一个根节点向下深度优先遍历
- ReduceSinkOperator 标示 Map/Reduce 的界限,多个 Job 间的界限
- 遍历其他根节点,遇过碰到 JoinOperator 合并 MapReduceTask
- 生成 StatTask 更新元数据
- 剪断 Map 与 Reduce 间的 Operator 的关系
6. PhysicalOptimizer 对Task tree优化,生成最终的执行计划
7、执行
以上就是HiveSQL的底层执行流程
打印SQL运行相关信息
我们在开发中,可以使用下面这个语句来打印SQL语句的相关运行信息
EXPLAIN [EXTENDED|DEPENDENCY|AUTHORIZATION] query
注:我的版本是 hive-1.1.0-cdh5.7.0 ,所以只可用三个可选属性,如果您版本比较高的话,可以去 官网 查阅对应属性
下面我对三种可选属性进行简单介绍
EXTENDED
EXTENDED:打印SQL解析成AST&Operator Tree最全面的信息
hive (g6_hadoop)> explain EXTENDED insert OVERWRITE table g6_access_orc_explain select domain,count(1) num from g6_access_orc where traffic>'99900' group by domain; OK Explain ABSTRACT SYNTAX TREE: TOK_QUERY TOK_FROM TOK_TABREF TOK_TABNAME g6_access_orc TOK_INSERT TOK_DESTINATION TOK_TAB TOK_TABNAME g6_access_orc_explain TOK_SELECT TOK_SELEXPR TOK_TABLE_OR_COL domain TOK_SELEXPR TOK_FUNCTION count 1 num TOK_WHERE > TOK_TABLE_OR_COL traffic '99900' TOK_GROUPBY TOK_TABLE_OR_COL domain STAGE DEPENDENCIES: Stage-1 is a root stage Stage-0 depends on stages: Stage-1 Stage-2 depends on stages: Stage-0 STAGE PLANS: Stage: Stage-1 Map Reduce Map Operator Tree: TableScan alias: g6_access_orc Statistics: Num rows: 260326 Data size: 188215698 Basic stats: COMPLETE Column stats: NONE GatherStats: false Filter Operator isSamplingPred: false predicate: (traffic > 99900) (type: boolean) Statistics: Num rows: 86775 Data size: 62738325 Basic stats: COMPLETE Column stats: NONE Select Operator expressions: domain (type: string) outputColumnNames: domain Statistics: Num rows: 86775 Data size: 62738325 Basic stats: COMPLETE Column stats: NONE Group By Operator aggregations: count(1) keys: domain (type: string) mode: hash outputColumnNames: _col0, _col1 Statistics: Num rows: 86775 Data size: 62738325 Basic stats: COMPLETE Column stats: NONE Reduce Output Operator key expressions: _col0 (type: string) sort order: + Map-reduce partition columns: _col0 (type: string) Statistics: Num rows: 86775 Data size: 62738325 Basic stats: COMPLETE Column stats: NONE tag: -1 value expressions: _col1 (type: bigint) auto parallelism: false Path -> Alias: hdfs://ruozeclusterg6/user/hive/warehouse/g6_hadoop.db/g6_access_orc [g6_access_orc] Path -> Partition: hdfs://ruozeclusterg6/user/hive/warehouse/g6_hadoop.db/g6_access_orc Partition base file name: g6_access_orc input format: org.apache.hadoop.hive.ql.io.orc.OrcInputFormat output format: org.apache.hadoop.hive.ql.io.orc.OrcOutputFormat properties: COLUMN_STATS_ACCURATE true bucket_count -1 columns cdn,region,level,time,ip,domain,url,traffic columns.comments columns.types string:string:string:string:string:string:string:bigint field.delim file.inputformat org.apache.hadoop.hive.ql.io.orc.OrcInputFormat file.outputformat org.apache.hadoop.hive.ql.io.orc.OrcOutputFormat location hdfs://ruozeclusterg6/user/hive/warehouse/g6_hadoop.db/g6_access_orc name g6_hadoop.g6_access_orc numFiles 1 numRows 260326 rawDataSize 188215698 serialization.ddl struct g6_access_orc { string cdn, string region, string level, string time, string ip, string domain, string url, i64 traffic} serialization.format serialization.lib org.apache.hadoop.hive.ql.io.orc.OrcSerde totalSize 8567798 transient_lastDdlTime 1557676635 serde: org.apache.hadoop.hive.ql.io.orc.OrcSerde input format: org.apache.hadoop.hive.ql.io.orc.OrcInputFormat output format: org.apache.hadoop.hive.ql.io.orc.OrcOutputFormat properties: COLUMN_STATS_ACCURATE true bucket_count -1 columns cdn,region,level,time,ip,domain,url,traffic columns.comments columns.types string:string:string:string:string:string:string:bigint field.delim file.inputformat org.apache.hadoop.hive.ql.io.orc.OrcInputFormat file.outputformat org.apache.hadoop.hive.ql.io.orc.OrcOutputFormat location hdfs://ruozeclusterg6/user/hive/warehouse/g6_hadoop.db/g6_access_orc name g6_hadoop.g6_access_orc numFiles 1 numRows 260326 rawDataSize 188215698 serialization.ddl struct g6_access_orc { string cdn, string region, string level, string time, string ip, string domain, string url, i64 traffic} serialization.format serialization.lib org.apache.hadoop.hive.ql.io.orc.OrcSerde totalSize 8567798 transient_lastDdlTime 1557676635 serde: org.apache.hadoop.hive.ql.io.orc.OrcSerde name: g6_hadoop.g6_access_orc name: g6_hadoop.g6_access_orc Truncated Path -> Alias: /g6_hadoop.db/g6_access_orc [g6_access_orc] Needs Tagging: false Reduce Operator Tree: Group By Operator aggregations: count(VALUE._col0) keys: KEY._col0 (type: string) mode: mergepartial outputColumnNames: _col0, _col1 Statistics: Num rows: 43387 Data size: 31368801 Basic stats: COMPLETE Column stats: NONE File Output Operator compressed: false GlobalTableId: 1 directory: hdfs://ruozeclusterg6/user/hive/warehouse/g6_hadoop.db/g6_access_orc_explain/.hive-staging_hive_2019-05-23_23-37-06_889_2210604962026719569-1/-ext-10000 NumFilesPerFileSink: 1 Statistics: Num rows: 43387 Data size: 31368801 Basic stats: COMPLETE Column stats: NONE Stats Publishing Key Prefix: hdfs://ruozeclusterg6/user/hive/warehouse/g6_hadoop.db/g6_access_orc_explain/.hive-staging_hive_2019-05-23_23-37-06_889_2210604962026719569-1/-ext-10000/ table: input format: org.apache.hadoop.mapred.TextInputFormat output format: org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat properties: COLUMN_STATS_ACCURATE true bucket_count -1 columns domain,num columns.comments columns.types string:bigint field.delim | file.inputformat org.apache.hadoop.mapred.TextInputFormat file.outputformat org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat location hdfs://ruozeclusterg6/user/hive/warehouse/g6_hadoop.db/g6_access_orc_explain name g6_hadoop.g6_access_orc_explain numFiles 1 numRows 7 rawDataSize 149 serialization.ddl struct g6_access_orc_explain { string domain, i64 num} serialization.format | serialization.lib org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe totalSize 156 transient_lastDdlTime 1558661108 serde: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe name: g6_hadoop.g6_access_orc_explain TotalFiles: 1 GatherStats: true MultiFileSpray: false Stage: Stage-0 Move Operator tables: replace: true source: hdfs://ruozeclusterg6/user/hive/warehouse/g6_hadoop.db/g6_access_orc_explain/.hive-staging_hive_2019-05-23_23-37-06_889_2210604962026719569-1/-ext-10000 table: input format: org.apache.hadoop.mapred.TextInputFormat output format: org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat properties: COLUMN_STATS_ACCURATE true bucket_count -1 columns domain,num columns.comments columns.types string:bigint field.delim | file.inputformat org.apache.hadoop.mapred.TextInputFormat file.outputformat org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat location hdfs://ruozeclusterg6/user/hive/warehouse/g6_hadoop.db/g6_access_orc_explain name g6_hadoop.g6_access_orc_explain numFiles 1 numRows 7 rawDataSize 149 serialization.ddl struct g6_access_orc_explain { string domain, i64 num} serialization.format | serialization.lib org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe totalSize 156 transient_lastDdlTime 1558661108 serde: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe name: g6_hadoop.g6_access_orc_explain Stage: Stage-2 Stats-Aggr Operator Stats Aggregation Key Prefix: hdfs://ruozeclusterg6/user/hive/warehouse/g6_hadoop.db/g6_access_orc_explain/.hive-staging_hive_2019-05-23_23-37-06_889_2210604962026719569-1/-ext-10000/ Time taken: 1.359 seconds, Fetched: 198 row(s)
AUTHORIZATION
AUTHORIZATION :打印SQL运行相关权限
hive (g6_hadoop)> explain AUTHORIZATION insert OVERWRITE table g6_access_orc_explain select domain,count(1) num from g6_access_orc where traffic>'99900' group by domain; OK Explain INPUTS: g6_hadoop@g6_access_orc OUTPUTS: g6_hadoop@g6_access_orc_explain CURRENT_USER: hadoop OPERATION: QUERY AUTHORIZATION_FAILURES: No privilege 'Update' found for outputs { database:g6_hadoop, table:g6_access_orc_explain} No privilege 'Select' found for inputs { database:g6_hadoop, table:g6_access_orc, columnName:domain} Time taken: 0.599 seconds, Fetched: 11 row(s)
DEPENDENCY
DEPENDENCY:打印SQL输入表的相关信息
hive (g6_hadoop)> explain DEPENDENCY insert OVERWRITE table g6_access_orc_explain select domain,count(1) num from g6_access_orc where traffic>'99900' group by domain; Explain {"input_partitions":[],"input_tables":[{"tablename":"g6_hadoop@g6_access_orc","tabletype":"MANAGED_TABLE"}]} Time taken: 0.135 seconds, Fetched: 1 row(s)
以上所述就是小编给大家介绍的《Hive 底层执行流程》,希望对大家有所帮助,如果大家有任何疑问请给我留言,小编会及时回复大家的。在此也非常感谢大家对 码农网 的支持!
猜你喜欢:- avue 1.5.2 优化大量底层代码,crud 和 form 底层公用
- Synchronized 关键字使用、底层原理、JDK1.6 之后的底层优化以及 和ReenTrantLock 的对比
- Docker 底层原理浅析
- NSDictionary底层实现原理
- PHP 数组底层实现
- NSObject 底层本质
本站部分资源来源于网络,本站转载出于传递更多信息之目的,版权归原作者或者来源机构所有,如转载稿涉及版权问题,请联系我们。
Cypherpunks
Julian Assange、Jacob Appelbaum、Andy Müller-Maguhn、Jérémie Zimmermann / OR Books / 2012-11 / GBP 8.99
Cypherpunks are activists who advocate the widespread use of strong cryptography (writing in code) as a route to progressive change. Julian Assange, the editor-in-chief of and visionary behind WikiLea......一起来看看 《Cypherpunks》 这本书的介绍吧!