Hive is not just a matter of writing SQL. Real workloads need layer upon layer of tuning, and if we don't understand Hive's internal execution model, Hive remains a black box to us and efficient tuning is out of the question. It is therefore well worth learning how Hive turns SQL into MapReduce jobs.
Hive's Underlying Execution Flow
Take the following SQL statement as an example:
FROM src INSERT OVERWRITE TABLE dest_g1 SELECT src.key, sum(substr(src.value,4)) GROUP BY src.key;
The whole compilation process breaks down into six phases:
1. Antlr defines the grammar rules for SQL and performs lexical and syntax analysis, turning the SQL into an abstract syntax tree (AST Tree)
HiveLexerX and HiveParser are the lexer and parser classes that Antlr generates automatically from the SQL grammar; the heavy lifting of parsing happens inside these two classes.
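If you want to watch this phase in isolation, the parser can be driven by hand. Here is a minimal sketch, assuming the hive-exec jar is on the classpath (the classes below live in org.apache.hadoop.hive.ql.parse as of Hive 1.x; names may differ in other versions):

```java
import org.apache.hadoop.hive.ql.parse.ASTNode;
import org.apache.hadoop.hive.ql.parse.ParseDriver;
import org.apache.hadoop.hive.ql.parse.ParseException;

public class ParseDemo {
    public static void main(String[] args) throws ParseException {
        String sql = "FROM src INSERT OVERWRITE TABLE dest_g1 "
                   + "SELECT src.key, sum(substr(src.value,4)) GROUP BY src.key";
        // ParseDriver wraps the Antlr-generated HiveLexerX/HiveParser pair.
        ParseDriver pd = new ParseDriver();
        ASTNode ast = pd.parse(sql);
        // dump() prints the tree of TOK_* nodes, one level of nesting per line.
        System.out.println(ast.dump());
    }
}
```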
For our example, the AST Tree is:
ABSTRACT SYNTAX TREE: (TOK_QUERY (TOK_FROM (TOK_TABREF src)) (TOK_INSERT (TOK_DESTINATION (TOK_TAB dest_g1)) (TOK_SELECT (TOK_SELEXPR (TOK_COLREF src key)) (TOK_SELEXPR (TOK_FUNCTION sum (TOK_FUNCTION substr (TOK_COLREF src value) 4)))) (TOK_GROUPBY (TOK_COLREF src key))))
2. Traverse the AST Tree and abstract out the query's basic building block, the QueryBlock
The AST Tree is still very complex and not structured enough to translate directly into a MapReduce program. Converting the AST Tree into QueryBlocks (QB) takes the SQL one step further toward abstraction and structure.
Generating QueryBlocks from the AST Tree is a recursive process: the AST Tree is traversed in pre-order, and when particular Token nodes (think of them as special markers) are encountered, their contents are saved into the corresponding attributes. The main mappings are listed below (a toy sketch of the walk follows the list):
- TOK_QUERY => create a QB object and recurse into its child nodes
- TOK_FROM => save the table-name part of the syntax into the QB object's aliasToTabs and related attributes
- TOK_INSERT => recurse into its child nodes
- TOK_DESTINATION => save the output-destination part of the syntax into the QBParseInfo object's nameToDest attribute
- TOK_SELECT => save the select-expression parts of the syntax into the destToSelExpr, destToAggregationExprs, and destToDistinctFuncExprs attributes
- TOK_WHERE => save the WHERE part of the syntax into the QBParseInfo object's destToWhereExpr attribute
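To picture the walk itself, here is a toy sketch of this phase-1 dispatch. The classes are simplified stand-ins of my own, not Hive's actual SemanticAnalyzer code; only the shape (pre-order traversal plus a switch on the token type) matches the description above:

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Toy sketch of the phase-1 walk (simplified classes, not Hive's
// SemanticAnalyzer): pre-order traversal, dispatching on the token type.
class Node {
    String token;                              // e.g. "TOK_FROM"
    String text;                               // leaf payload, e.g. a table name
    List<Node> children = new ArrayList<>();
    Node(String token, String text) { this.token = token; this.text = text; }
}

class QueryBlock {
    Map<String, String> aliasToTabs = new HashMap<>();
    String destination;
    List<Node> selectExprs = new ArrayList<>();
    Node whereExpr;
}

public class Phase1Walk {
    static void walk(Node n, QueryBlock qb) {
        switch (n.token) {
            case "TOK_FROM":        qb.aliasToTabs.put(n.children.get(0).text,
                                                       n.children.get(0).text); break;
            case "TOK_DESTINATION": qb.destination = n.children.get(0).text;    break;
            case "TOK_SELEXPR":     qb.selectExprs.add(n);                      break;
            case "TOK_WHERE":       qb.whereExpr = n;                           break;
            default: break;         // TOK_QUERY, TOK_INSERT, ...: just recurse
        }
        for (Node c : n.children) walk(c, qb);  // node first, then children: pre-order
    }

    public static void main(String[] args) {
        Node from = new Node("TOK_FROM", null);
        from.children.add(new Node("TOK_TABREF", "src"));
        Node query = new Node("TOK_QUERY", null);
        query.children.add(from);
        QueryBlock qb = new QueryBlock();
        walk(query, qb);
        System.out.println(qb.aliasToTabs);     // prints {src=src}
    }
}
```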
3. Traverse the QueryBlocks and translate them into an Operator Tree
In the MapReduce tasks Hive ultimately generates, both the Map phase and the Reduce phase are made up of Operator Trees. A logical operator completes one single, specific operation within the Map phase or the Reduce phase.
The basic operators include TableScanOperator, SelectOperator, FilterOperator, JoinOperator, GroupByOperator, and ReduceSinkOperator.
Generating the Operator Tree from the QueryBlocks means traversing the QB and QBParseInfo objects produced in the previous step and reading the syntax saved in their attributes. It involves the following steps:
- QB#aliasToSubq => a subquery exists; recurse into it
- QB#aliasToTabs => TableScanOperator
- QBParseInfo#joinExpr => QBJoinTree => ReduceSinkOperator + JoinOperator
- QBParseInfo#destToWhereExpr => FilterOperator
- QBParseInfo#destToGroupby => ReduceSinkOperator + GroupByOperator
- QBParseInfo#destToOrderby => ReduceSinkOperator + ExtractOperator
Because Join, GroupBy, and OrderBy all have to be carried out in the Reduce phase, a ReduceSinkOperator is generated just before the Operator for each of these operations; it assembles the relevant fields and serializes them into the Reduce key/value and the partition key.
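To make the ReduceSink/GroupBy division concrete, here is a hedged sketch of the hand-written MapReduce job our example SQL roughly boils down to. It is a single-stage simplification (the plan below actually uses two stages, with rand() as the first stage's partition key), and it assumes a hypothetical plain-text input of the form key<TAB>value instead of a Hive table:

```java
import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.DoubleWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

// In spirit, the ReduceSinkOperator is this map-side emit: the group-by column
// becomes the shuffle key, the aggregation input becomes the shuffle value.
class SumMapper extends Mapper<LongWritable, Text, Text, DoubleWritable> {
    @Override
    protected void map(LongWritable offset, Text line, Context ctx)
            throws IOException, InterruptedException {
        String[] cols = line.toString().split("\t");   // assumed layout: key \t value
        // sum(substr(src.value, 4)): HiveQL substr is 1-based, so index 3 in Java.
        double v = Double.parseDouble(cols[1].substring(3));
        ctx.write(new Text(cols[0]), new DoubleWritable(v));
    }
}

// The GroupByOperator's role: fold every shuffled value for a key into sum().
class SumReducer extends Reducer<Text, DoubleWritable, Text, DoubleWritable> {
    @Override
    protected void reduce(Text key, Iterable<DoubleWritable> vals, Context ctx)
            throws IOException, InterruptedException {
        double sum = 0;
        for (DoubleWritable v : vals) sum += v.get();
        ctx.write(key, new DoubleWritable(sum));       // one output row per group
    }
}

public class SqlAsMapReduce {
    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "sum-substr-groupby");
        job.setJarByClass(SqlAsMapReduce.class);
        job.setMapperClass(SumMapper.class);
        job.setReducerClass(SumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(DoubleWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```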
Our example SQL translates into the following Operator Tree:
STAGE PLANS:
  Stage: Stage-1
    Map Reduce
      Alias -> Map Operator Tree:
        src
            Reduce Output Operator
              key expressions:
                    expr: key
                    type: string
              sort order: +
              Map-reduce partition columns:
                    expr: rand()
                    type: double
              tag: -1
              value expressions:
                    expr: substr(value, 4)
                    type: string
      Reduce Operator Tree:
        Group By Operator
          aggregations:
                expr: sum(UDFToDouble(VALUE.0))
          keys:
                expr: KEY.0
                type: string
          mode: partial1
          File Output Operator
            compressed: false
            table:
                input format: org.apache.hadoop.mapred.SequenceFileInputFormat
                output format: org.apache.hadoop.mapred.SequenceFileOutputFormat
                name: binary_table

  Stage: Stage-2
    Map Reduce
      Alias -> Map Operator Tree:
        /tmp/hive-zshao/67494501/106593589.10001
            Reduce Output Operator
              key expressions:
                    expr: 0
                    type: string
              sort order: +
              Map-reduce partition columns:
                    expr: 0
                    type: string
              tag: -1
              value expressions:
                    expr: 1
                    type: double
      Reduce Operator Tree:
        Group By Operator
          aggregations:
                expr: sum(VALUE.0)
          keys:
                expr: KEY.0
                type: string
          mode: final
          Select Operator
            expressions:
                  expr: 0
                  type: string
                  expr: 1
                  type: double
            Select Operator
              expressions:
                    expr: UDFToInteger(0)
                    type: int
                    expr: 1
                    type: double
              File Output Operator
                compressed: false
                table:
                    input format: org.apache.hadoop.mapred.TextInputFormat
                    output format: org.apache.hadoop.hive.ql.io.IgnoreKeyTextOutputFormat
                    serde: org.apache.hadoop.hive.serde2.dynamic_type.DynamicSerDe
                    name: dest_g1

  Stage: Stage-0
    Move Operator
      tables:
            replace: true
            table:
                input format: org.apache.hadoop.mapred.TextInputFormat
                output format: org.apache.hadoop.hive.ql.io.IgnoreKeyTextOutputFormat
                serde: org.apache.hadoop.hive.serde2.dynamic_type.DynamicSerDe
                name: dest_g1
4. The Logical Optimizer transforms the Operator Tree, merging unnecessary ReduceSinkOperators to reduce the amount of shuffled data
Most logical-layer optimizers work by transforming the Operator Tree and merging operators, with the twin goals of needing fewer MapReduce jobs and shuffling less data.
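As a toy illustration of "transform the tree, merge operators" (simplified stand-in classes, not Hive's real Optimizer/Transform API), here is a rule that collapses two back-to-back projection operators into one; removing a redundant ReduceSinkOperator has the same rewrite shape:

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Toy operator-tree rewrite showing the shape of a logical optimization:
// collapse SEL -> SEL chains, the way Hive's optimizer removes redundant
// operators. (Simplified stand-ins, not Hive's Optimizer/Transform API.)
class Op {
    String name;                               // e.g. "TS", "SEL", "RS", "GBY"
    List<Op> children = new ArrayList<>();
    Op(String name, Op... kids) { this.name = name; children.addAll(Arrays.asList(kids)); }
}

public class MergeSelects {
    // Rewrite SEL -> SEL -> X into SEL -> X, then recurse over the whole tree.
    static Op optimize(Op op) {
        while (op.name.equals("SEL") && op.children.size() == 1
                && op.children.get(0).name.equals("SEL")) {
            op = new Op("SEL", op.children.get(0).children.toArray(new Op[0]));
        }
        op.children.replaceAll(MergeSelects::optimize);
        return op;
    }

    public static void main(String[] args) {
        Op tree = new Op("SEL", new Op("SEL", new Op("TS")));
        System.out.println(optimize(tree).children.get(0).name);  // TS: inner SEL gone
    }
}
```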
5. Traverse the Operator Tree and translate it into a Task tree
Converting the Operator Tree into a Task tree involves the following stages (a minimal sketch of the ReduceSink cut follows the list):
- Generate a MoveTask for the output table
- Traverse depth-first downward from one of the Operator Tree's root nodes
- A ReduceSinkOperator marks the boundary between Map and Reduce, and the boundary between successive jobs
- Traverse the remaining root nodes; when a JoinOperator is encountered, merge the MapReduceTasks
- Generate a StatsTask to update the metadata
- Cut the Operator links between the Map side and the Reduce side
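Here is that minimal sketch of the cut at ReduceSinkOperator boundaries, using toy stand-ins rather than Hive's real task compiler: walking down from a root, each ReduceSink closes the current piece (the Map side) and opens the next one (the Reduce side):

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Toy version of the ReduceSink cut: split a linear operator chain into the
// per-task pieces. (Hive's real task compilation is far richer than this.)
public class SplitAtReduceSink {
    static List<List<String>> split(List<String> chain) {
        List<List<String>> tasks = new ArrayList<>();
        List<String> current = new ArrayList<>();
        for (String op : chain) {
            current.add(op);
            if (op.equals("RS")) {            // ReduceSink: the Map side ends here
                tasks.add(current);
                current = new ArrayList<>();  // next piece is the Reduce side
            }
        }
        if (!current.isEmpty()) tasks.add(current);
        return tasks;
    }

    public static void main(String[] args) {
        // TS -> RS -> GBY -> FS splits into Map {TS, RS} and Reduce {GBY, FS}.
        System.out.println(split(Arrays.asList("TS", "RS", "GBY", "FS")));
    }
}
```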
6. The PhysicalOptimizer optimizes the Task tree and produces the final execution plan
7. Execution: once compilation is done, the resulting tasks are submitted and run
That is the complete underlying execution flow of Hive SQL.
Printing a SQL statement's execution details
During development, we can use the following statement to print details about how a SQL statement will run:
EXPLAIN [EXTENDED|DEPENDENCY|AUTHORIZATION] query
Note: my version is hive-1.1.0-cdh5.7.0, so only three optional attributes are available; if your version is newer, consult the official documentation for the attributes it supports.
Below is a brief introduction to each of the three optional attributes.
EXTENDED
EXTENDED: prints the most complete information about how the SQL is parsed into the AST and the Operator Tree.
hive (g6_hadoop)> explain EXTENDED insert OVERWRITE table g6_access_orc_explain select domain,count(1) num from g6_access_orc where traffic>'99900' group by domain;
OK
Explain
ABSTRACT SYNTAX TREE:

TOK_QUERY
   TOK_FROM
      TOK_TABREF
         TOK_TABNAME
            g6_access_orc
   TOK_INSERT
      TOK_DESTINATION
         TOK_TAB
            TOK_TABNAME
               g6_access_orc_explain
      TOK_SELECT
         TOK_SELEXPR
            TOK_TABLE_OR_COL
               domain
         TOK_SELEXPR
            TOK_FUNCTION
               count
               1
            num
      TOK_WHERE
         >
            TOK_TABLE_OR_COL
               traffic
            '99900'
      TOK_GROUPBY
         TOK_TABLE_OR_COL
            domain

STAGE DEPENDENCIES:
  Stage-1 is a root stage
  Stage-0 depends on stages: Stage-1
  Stage-2 depends on stages: Stage-0

STAGE PLANS:
  Stage: Stage-1
    Map Reduce
      Map Operator Tree:
          TableScan
            alias: g6_access_orc
            Statistics: Num rows: 260326 Data size: 188215698 Basic stats: COMPLETE Column stats: NONE
            GatherStats: false
            Filter Operator
              isSamplingPred: false
              predicate: (traffic > 99900) (type: boolean)
              Statistics: Num rows: 86775 Data size: 62738325 Basic stats: COMPLETE Column stats: NONE
              Select Operator
                expressions: domain (type: string)
                outputColumnNames: domain
                Statistics: Num rows: 86775 Data size: 62738325 Basic stats: COMPLETE Column stats: NONE
                Group By Operator
                  aggregations: count(1)
                  keys: domain (type: string)
                  mode: hash
                  outputColumnNames: _col0, _col1
                  Statistics: Num rows: 86775 Data size: 62738325 Basic stats: COMPLETE Column stats: NONE
                  Reduce Output Operator
                    key expressions: _col0 (type: string)
                    sort order: +
                    Map-reduce partition columns: _col0 (type: string)
                    Statistics: Num rows: 86775 Data size: 62738325 Basic stats: COMPLETE Column stats: NONE
                    tag: -1
                    value expressions: _col1 (type: bigint)
                    auto parallelism: false
      Path -> Alias:
        hdfs://ruozeclusterg6/user/hive/warehouse/g6_hadoop.db/g6_access_orc [g6_access_orc]
      Path -> Partition:
        hdfs://ruozeclusterg6/user/hive/warehouse/g6_hadoop.db/g6_access_orc
          Partition
            base file name: g6_access_orc
            input format: org.apache.hadoop.hive.ql.io.orc.OrcInputFormat
            output format: org.apache.hadoop.hive.ql.io.orc.OrcOutputFormat
            properties:
              COLUMN_STATS_ACCURATE true
              bucket_count -1
              columns cdn,region,level,time,ip,domain,url,traffic
              columns.comments
              columns.types string:string:string:string:string:string:string:bigint
              field.delim
              file.inputformat org.apache.hadoop.hive.ql.io.orc.OrcInputFormat
              file.outputformat org.apache.hadoop.hive.ql.io.orc.OrcOutputFormat
              location hdfs://ruozeclusterg6/user/hive/warehouse/g6_hadoop.db/g6_access_orc
              name g6_hadoop.g6_access_orc
              numFiles 1
              numRows 260326
              rawDataSize 188215698
              serialization.ddl struct g6_access_orc { string cdn, string region, string level, string time, string ip, string domain, string url, i64 traffic}
              serialization.format
              serialization.lib org.apache.hadoop.hive.ql.io.orc.OrcSerde
              totalSize 8567798
              transient_lastDdlTime 1557676635
            serde: org.apache.hadoop.hive.ql.io.orc.OrcSerde

              input format: org.apache.hadoop.hive.ql.io.orc.OrcInputFormat
              output format: org.apache.hadoop.hive.ql.io.orc.OrcOutputFormat
              properties:
                COLUMN_STATS_ACCURATE true
                bucket_count -1
                columns cdn,region,level,time,ip,domain,url,traffic
                columns.comments
                columns.types string:string:string:string:string:string:string:bigint
                field.delim
                file.inputformat org.apache.hadoop.hive.ql.io.orc.OrcInputFormat
                file.outputformat org.apache.hadoop.hive.ql.io.orc.OrcOutputFormat
                location hdfs://ruozeclusterg6/user/hive/warehouse/g6_hadoop.db/g6_access_orc
                name g6_hadoop.g6_access_orc
                numFiles 1
                numRows 260326
                rawDataSize 188215698
                serialization.ddl struct g6_access_orc { string cdn, string region, string level, string time, string ip, string domain, string url, i64 traffic}
                serialization.format
                serialization.lib org.apache.hadoop.hive.ql.io.orc.OrcSerde
                totalSize 8567798
                transient_lastDdlTime 1557676635
              serde: org.apache.hadoop.hive.ql.io.orc.OrcSerde
              name: g6_hadoop.g6_access_orc
            name: g6_hadoop.g6_access_orc
      Truncated Path -> Alias:
        /g6_hadoop.db/g6_access_orc [g6_access_orc]
      Needs Tagging: false
      Reduce Operator Tree:
        Group By Operator
          aggregations: count(VALUE._col0)
          keys: KEY._col0 (type: string)
          mode: mergepartial
          outputColumnNames: _col0, _col1
          Statistics: Num rows: 43387 Data size: 31368801 Basic stats: COMPLETE Column stats: NONE
          File Output Operator
            compressed: false
            GlobalTableId: 1
            directory: hdfs://ruozeclusterg6/user/hive/warehouse/g6_hadoop.db/g6_access_orc_explain/.hive-staging_hive_2019-05-23_23-37-06_889_2210604962026719569-1/-ext-10000
            NumFilesPerFileSink: 1
            Statistics: Num rows: 43387 Data size: 31368801 Basic stats: COMPLETE Column stats: NONE
            Stats Publishing Key Prefix: hdfs://ruozeclusterg6/user/hive/warehouse/g6_hadoop.db/g6_access_orc_explain/.hive-staging_hive_2019-05-23_23-37-06_889_2210604962026719569-1/-ext-10000/
            table:
                input format: org.apache.hadoop.mapred.TextInputFormat
                output format: org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat
                properties:
                  COLUMN_STATS_ACCURATE true
                  bucket_count -1
                  columns domain,num
                  columns.comments
                  columns.types string:bigint
                  field.delim |
                  file.inputformat org.apache.hadoop.mapred.TextInputFormat
                  file.outputformat org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat
                  location hdfs://ruozeclusterg6/user/hive/warehouse/g6_hadoop.db/g6_access_orc_explain
                  name g6_hadoop.g6_access_orc_explain
                  numFiles 1
                  numRows 7
                  rawDataSize 149
                  serialization.ddl struct g6_access_orc_explain { string domain, i64 num}
                  serialization.format |
                  serialization.lib org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
                  totalSize 156
                  transient_lastDdlTime 1558661108
                serde: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
                name: g6_hadoop.g6_access_orc_explain
            TotalFiles: 1
            GatherStats: true
            MultiFileSpray: false

  Stage: Stage-0
    Move Operator
      tables:
          replace: true
          source: hdfs://ruozeclusterg6/user/hive/warehouse/g6_hadoop.db/g6_access_orc_explain/.hive-staging_hive_2019-05-23_23-37-06_889_2210604962026719569-1/-ext-10000
          table:
              input format: org.apache.hadoop.mapred.TextInputFormat
              output format: org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat
              properties:
                COLUMN_STATS_ACCURATE true
                bucket_count -1
                columns domain,num
                columns.comments
                columns.types string:bigint
                field.delim |
                file.inputformat org.apache.hadoop.mapred.TextInputFormat
                file.outputformat org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat
                location hdfs://ruozeclusterg6/user/hive/warehouse/g6_hadoop.db/g6_access_orc_explain
                name g6_hadoop.g6_access_orc_explain
                numFiles 1
                numRows 7
                rawDataSize 149
                serialization.ddl struct g6_access_orc_explain { string domain, i64 num}
                serialization.format |
                serialization.lib org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
                totalSize 156
                transient_lastDdlTime 1558661108
              serde: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
              name: g6_hadoop.g6_access_orc_explain

  Stage: Stage-2
    Stats-Aggr Operator
      Stats Aggregation Key Prefix: hdfs://ruozeclusterg6/user/hive/warehouse/g6_hadoop.db/g6_access_orc_explain/.hive-staging_hive_2019-05-23_23-37-06_889_2210604962026719569-1/-ext-10000/
Time taken: 1.359 seconds, Fetched: 198 row(s)
AUTHORIZATION
AUTHORIZATION: prints the privileges the SQL needs to run.
hive (g6_hadoop)> explain AUTHORIZATION insert OVERWRITE table g6_access_orc_explain select domain,count(1) num from g6_access_orc where traffic>'99900' group by domain;
OK
Explain
INPUTS:
  g6_hadoop@g6_access_orc
OUTPUTS:
  g6_hadoop@g6_access_orc_explain
CURRENT_USER:
  hadoop
OPERATION:
  QUERY
AUTHORIZATION_FAILURES:
  No privilege 'Update' found for outputs { database:g6_hadoop, table:g6_access_orc_explain}
  No privilege 'Select' found for inputs { database:g6_hadoop, table:g6_access_orc, columnName:domain}
Time taken: 0.599 seconds, Fetched: 11 row(s)
DEPENDENCY
DEPENDENCY: prints information about the tables and partitions the SQL reads from.
hive (g6_hadoop)> explain DEPENDENCY insert OVERWRITE table g6_access_orc_explain select domain,count(1) num from g6_access_orc where traffic>'99900' group by domain;
Explain
{"input_partitions":[],"input_tables":[{"tablename":"g6_hadoop@g6_access_orc","tabletype":"MANAGED_TABLE"}]}
Time taken: 0.135 seconds, Fetched: 1 row(s)