内容简介:先附上一句SQL,使用tpc-ds的表结构,我们围绕这句SQL讲。其中,extraStrategies是提供给外部人员可以自己添加的策略。调用这些strategies的代码如下:将strategies逐个去应用在逻辑计划上,然后做flat操作,返回一个
先附上一句SQL,使用tpc-ds的表结构,我们围绕这句 SQL 讲。
- SQL:
SQL> select avg(cs_ext_discount_amt) from catalog_sales, date_dim where d_date between ‘1999-02-22’ and cast(‘1999-05-22’ as date) and d_date_sk = cs_sold_date_sk group by cs_sold_date_sk;
- 逻辑计划:
Aggregate [cs_sold_date_sk#24], [cast((avg(UnscaledValue(cs_ext_discount_amt#46)) / 100.0) as decimal(11,6)) AS avg(cs_ext_discount_amt)#149] +- Project [cs_sold_date_sk#24, cs_ext_discount_amt#46] +- Join Inner, (d_date_sk#58 = cs_sold_date_sk#24) :- Project [cs_sold_date_sk#24, cs_ext_discount_amt#46] : +- Filter isnotnull(cs_sold_date_sk#24) : +- Relation[cs_sold_date_sk#24,cs_sold_time_sk#25,cs_ship_date_sk#26,cs_bill_customer_sk#27,cs_bill_cdemo_sk#28,cs_bill_hdemo_sk#29,cs_bill_addr_sk#30,cs_ship_customer_sk#31,cs_ship_cdemo_sk#32,cs_ship_hdemo_sk#33,cs_ship_addr_sk#34,cs_call_center_sk#35,cs_catalog_page_sk#36,cs_ship_mode_sk#37,cs_warehouse_sk#38,cs_item_sk#39,cs_promo_sk#40,cs_order_number#41,cs_quantity#42,cs_wholesale_cost#43,cs_list_price#44,cs_sales_price#45,cs_ext_discount_amt#46,cs_ext_sales_price#47,... 10 more fields] +- Project [d_date_sk#58] +- Filter (((isnotnull(d_date#60) && (cast(d_date#60 as string) >= 1999-02-22)) && (d_date#60 <= 10733)) && isnotnull(d_date_sk#58)) +- Relation[d_date_sk#58,d_date_id#59,d_date#60,d_month_seq#61,d_week_seq#62,d_quarter_seq#63,d_year#64,d_dow#65,d_moy#66,d_dom#67,d_qoy#68,d_fy_year#69,d_fy_quarter_seq#70,d_fy_week_seq#71,d_day_name#72,d_quarter_name#73,d_holiday#74,d_weekend#75,d_following_holiday#76,d_first_dom#77,d_last_dom#78,d_same_day_ly#79,d_same_day_lq#80,d_current_day#81,... 4 more fields]
物理计划源码分析
物理策略
def strategies: Seq[Strategy] = extraStrategies ++ ( FileSourceStrategy :: DataSourceStrategy :: DDLStrategy :: SpecialLimits :: Aggregation :: JoinSelection :: InMemoryScans :: BasicOperators :: Nil)
其中,extraStrategies是提供给外部人员可以自己添加的策略。调用这些strategies的代码如下:
// Collect physical plan candidates. val candidates = strategies.iterator.flatMap(_(plan))
将strategies逐个去应用在逻辑计划上,然后做flat操作,返回一个 PhysicalPlan
的iterator。那么每个策略什么作用?
FileSourceStrategy
一个针对Hadoop文件系统做的策略,当执行计划的底层Relation是 HadoopFsRelation
时会调用到,用来扫描文件。
DataSourceStrategy
Spark针对DataSource预定义了四种scan接口, TableScan
、 PrunedScan
、 PrunedFilteredScan
、 CatalystScan
(其中 CatalystScan
是unstable的,也是不常用的),如果开发者(用户)自己实现的DataSource是实现了这四种接口之一的,在scan到执行计划的底层Relation时,就会调用来扫描文件。
DDLStrategy(2.2中已经消失了,2.1中有)
会在create table的时候调用,因为后续版本不会存在,所以不做解释。
SpecialLimits
在Spark SQL中加limit n时候回调用到(如果不指定,Spark 默认也会limit 20),在源码中,会给每种case的limit节点的子节点使用 PlanLater
,这是个很神奇的东西下文会讲到。
Aggregation
顾名思义,执行聚合函数的策略。
JoinSelection
执行join的策略。Join的执行策略也同样分BroadcastJoin(也就是MapSideJoin),和ShuffledJoin,这个之后的文章会展开讲。
InMemoryScans
当数据在内存中被缓存过,就会用到该策略。
BasicOperators
一些基本操作的执行策略,如flatMap,sort,project等,但是实际上大都是给这些节点的子节点套上一个 PlanLater
。
PlanLater
Spark SQL物理计划里一个非常重要的概念。字面意思很好理解,就是之后再计划。那么经过以上策略逐个去执行以后,原来的逻辑计划会变成什么样呢?
ReturnAnswer +- GlobalLimit 21 +- LocalLimit 21 +- PlanLater Aggregate [cs_sold_date_sk#24], [cast((avg(UnscaledValue(cs_ext_discount_amt#46)) / 100.0) as decimal(11,6)) AS avg(cs_ext_discount_amt)#149] , Aggregate [cs_sold_date_sk#24], [cast((avg(UnscaledValue(cs_ext_discount_amt#46)) / 100.0) as decimal(11,6)) AS avg(cs_ext_discount_amt)#149] +- PlanLater Project [cs_sold_date_sk#24, cs_ext_discount_amt#46] , Project [cs_sold_date_sk#24, cs_ext_discount_amt#46] +- PlanLater Join Inner, (d_date_sk#58 = cs_sold_date_sk#24) :- PlanLater Project [cs_sold_date_sk#24, cs_ext_discount_amt#46] , Project [cs_sold_date_sk#24, cs_ext_discount_amt#46] : +- Filter isnotnull(cs_sold_date_sk#24) : +- Relation[cs_sold_date_sk#24,cs_sold_time_sk#25,cs_ship_date_sk#26,cs_bill_customer_sk#27,cs_bill_cdemo_sk#28,cs_bill_hdemo_sk#29,cs_bill_addr_sk#30,cs_ship_customer_sk#31,cs_ship_cdemo_sk#32,cs_ship_hdemo_sk#33,cs_ship_addr_sk#34,cs_call_center_sk#35,cs_catalog_page_sk#36,cs_ship_mode_sk#37,cs_warehouse_sk#38,cs_item_sk#39,cs_promo_sk#40,cs_order_number#41,cs_quantity#42,cs_wholesale_cost#43,cs_list_price#44,cs_sales_price#45,cs_ext_discount_amt#46,cs_ext_sales_price#47,... 10 more fields] +- PlanLater Project [d_date_sk#58] , Project [d_date_sk#58] +- Filter (((isnotnull(d_date#60) && (cast(d_date#60 as string) >= 1999-02-22)) && (d_date#60 <= 10733)) && isnotnull(d_date_sk#58)) +- Relation[d_date_sk#58,d_date_id#59,d_date#60,d_month_seq#61,d_week_seq#62,d_quarter_seq#63,d_year#64,d_dow#65,d_moy#66,d_dom#67,d_qoy#68,d_fy_year#69,d_fy_quarter_seq#70,d_fy_week_seq#71,d_day_name#72,d_quarter_name#73,d_holiday#74,d_weekend#75,d_following_holiday#76,d_first_dom#77,d_last_dom#78,d_same_day_ly#79,d_same_day_lq#80,d_current_day#81,... 4 more fields]
有什么差别呢?主要有二:
-
-
顶层多了个
ReturnAnswer
和Limit
节点
-
顶层多了个
-
-
Aggregate
、Project
、Join
节点都用了PlanLater
-
(其实 Filter
节点也是可以用 PlanLater
的,但是由于逻辑计划已经将 Filter
下推至底部,所以最底部的Project->Filter->Relation的三层节点是可以直接调用一个策略去执行的,因此只需要三层节点的最上层也就是Project节点使用 PlanLater
即可。)
言归正传,语法树顶部多了 ReturnAnswer
和 Limit
节点,很容易理解, Limit
是Spark SQL默认限制行数, ReturnAnswer
是将结果返回。那么加的PlanLater有什么作用?我的理解是,将物理计划分割成一段段,每一段物理计划会有其对应策略来执行。具体源码如下:
def plan(plan: LogicalPlan): Iterator[PhysicalPlan] = { // Obviously a lot to do here still... // Collect physical plan candidates. val candidates = strategies.iterator.flatMap(_(plan)) // The candidates may contain placeholders marked as [[planLater]], // so try to replace them by their child plans. val plans = candidates.flatMap { candidate => val placeholders = collectPlaceholders(candidate) if (placeholders.isEmpty) { // Take the candidate as is because it does not contain placeholders. Iterator(candidate) } else { // Plan the logical plan marked as [[planLater]] and replace the placeholders. placeholders.iterator.foldLeft(Iterator(candidate)) { case (candidatesWithPlaceholders, (placeholder, logicalPlan)) => // Plan the logical plan for the placeholder. val childPlans = this.plan(logicalPlan) candidatesWithPlaceholders.flatMap { candidateWithPlaceholders => childPlans.map { childPlan => // Replace the placeholder by the child plan candidateWithPlaceholders.transformUp { case p if p == placeholder => childPlan } } } } } } val pruned = prunePlans(plans) assert(pruned.hasNext, s"No plan for $plan") pruned }
可以看到,经过策略迭代器和flat过后的candidates候选计划们(一般来说只有一个,是最顶层的planLater),然后收集placeholder(其实就是planlater),这个时候对placeholders进行迭代,并对每个placeholder的child plan递归调用plan方法。举例文章这句SQL,递归调用plan方法,得到每个placeholder及其child plan节点(也就是 case (candidatesWithPlaceholders, (placeholder, logicalPlan))这句话的placeholder和logicalPlan两个变量)如下:
placeholder: PlanLater Aggregate [cs_sold_date_sk#24], [cast((avg(UnscaledValue(cs_ext_discount_amt#46)) / 100.0) as decimal(11,6)) AS avg(cs_ext_discount_amt)#149] logicalPlan: Aggregate [cs_sold_date_sk#24], [cast((avg(UnscaledValue(cs_ext_discount_amt#46)) / 100.0) as decimal(11,6)) AS avg(cs_ext_discount_amt)#149] +- Project [cs_sold_date_sk#24, cs_ext_discount_amt#46] +- Join Inner, (d_date_sk#58 = cs_sold_date_sk#24) :- Project [cs_sold_date_sk#24, cs_ext_discount_amt#46] : +- Filter isnotnull(cs_sold_date_sk#24) : +- Relation[cs_sold_date_sk#24,cs_sold_time_sk#25,cs_ship_date_sk#26,cs_bill_customer_sk#27,cs_bill_cdemo_sk#28,cs_bill_hdemo_sk#29,cs_bill_addr_sk#30,cs_ship_customer_sk#31,cs_ship_cdemo_sk#32,cs_ship_hdemo_sk#33,cs_ship_addr_sk#34,cs_call_center_sk#35,cs_catalog_page_sk#36,cs_ship_mode_sk#37,cs_warehouse_sk#38,cs_item_sk#39,cs_promo_sk#40,cs_order_number#41,cs_quantity#42,cs_wholesale_cost#43,cs_list_price#44,cs_sales_price#45,cs_ext_discount_amt#46,cs_ext_sales_price#47,... 10 more fields] +- Project [d_date_sk#58] +- Filter (((isnotnull(d_date#60) && (cast(d_date#60 as string) >= 1999-02-22)) && (d_date#60 <= 10733)) && isnotnull(d_date_sk#58)) +- Relation[d_date_sk#58,d_date_id#59,d_date#60,d_month_seq#61,d_week_seq#62,d_quarter_seq#63,d_year#64,d_dow#65,d_moy#66,d_dom#67,d_qoy#68,d_fy_year#69,d_fy_quarter_seq#70,d_fy_week_seq#71,d_day_name#72,d_quarter_name#73,d_holiday#74,d_weekend#75,d_following_holiday#76,d_first_dom#77,d_last_dom#78,d_same_day_ly#79,d_same_day_lq#80,d_current_day#81,... 4 more fields]
placeholder: PlanLater Project [cs_sold_date_sk#24, cs_ext_discount_amt#46] logicalPlan: Project [cs_sold_date_sk#24, cs_ext_discount_amt#46] +- Join Inner, (d_date_sk#58 = cs_sold_date_sk#24) :- Project [cs_sold_date_sk#24, cs_ext_discount_amt#46] : +- Filter isnotnull(cs_sold_date_sk#24) : +- Relation[cs_sold_date_sk#24,cs_sold_time_sk#25,cs_ship_date_sk#26,cs_bill_customer_sk#27,cs_bill_cdemo_sk#28,cs_bill_hdemo_sk#29,cs_bill_addr_sk#30,cs_ship_customer_sk#31,cs_ship_cdemo_sk#32,cs_ship_hdemo_sk#33,cs_ship_addr_sk#34,cs_call_center_sk#35,cs_catalog_page_sk#36,cs_ship_mode_sk#37,cs_warehouse_sk#38,cs_item_sk#39,cs_promo_sk#40,cs_order_number#41,cs_quantity#42,cs_wholesale_cost#43,cs_list_price#44,cs_sales_price#45,cs_ext_discount_amt#46,cs_ext_sales_price#47,... 10 more fields] +- Project [d_date_sk#58] +- Filter (((isnotnull(d_date#60) && (cast(d_date#60 as string) >= 1999-02-22)) && (d_date#60 <= 10733)) && isnotnull(d_date_sk#58)) +- Relation[d_date_sk#58,d_date_id#59,d_date#60,d_month_seq#61,d_week_seq#62,d_quarter_seq#63,d_year#64,d_dow#65,d_moy#66,d_dom#67,d_qoy#68,d_fy_year#69,d_fy_quarter_seq#70,d_fy_week_seq#71,d_day_name#72,d_quarter_name#73,d_holiday#74,d_weekend#75,d_following_holiday#76,d_first_dom#77,d_last_dom#78,d_same_day_ly#79,d_same_day_lq#80,d_current_day#81,... 4 more fields]
placeholder: PlanLater Join Inner, (d_date_sk#58 = cs_sold_date_sk#24) logicalPlan: Join Inner, (d_date_sk#58 = cs_sold_date_sk#24) :- Project [cs_sold_date_sk#24, cs_ext_discount_amt#46] : +- Filter isnotnull(cs_sold_date_sk#24) : +- Relation[cs_sold_date_sk#24,cs_sold_time_sk#25,cs_ship_date_sk#26,cs_bill_customer_sk#27,cs_bill_cdemo_sk#28,cs_bill_hdemo_sk#29,cs_bill_addr_sk#30,cs_ship_customer_sk#31,cs_ship_cdemo_sk#32,cs_ship_hdemo_sk#33,cs_ship_addr_sk#34,cs_call_center_sk#35,cs_catalog_page_sk#36,cs_ship_mode_sk#37,cs_warehouse_sk#38,cs_item_sk#39,cs_promo_sk#40,cs_order_number#41,cs_quantity#42,cs_wholesale_cost#43,cs_list_price#44,cs_sales_price#45,cs_ext_discount_amt#46,cs_ext_sales_price#47,... 10 more fields] +- Project [d_date_sk#58] +- Filter (((isnotnull(d_date#60) && (cast(d_date#60 as string) >= 1999-02-22)) && (d_date#60 <= 10733)) && isnotnull(d_date_sk#58)) +- Relation[d_date_sk#58,d_date_id#59,d_date#60,d_month_seq#61,d_week_seq#62,d_quarter_seq#63,d_year#64,d_dow#65,d_moy#66,d_dom#67,d_qoy#68,d_fy_year#69,d_fy_quarter_seq#70,d_fy_week_seq#71,d_day_name#72,d_quarter_name#73,d_holiday#74,d_weekend#75,d_following_holiday#76,d_first_dom#77,d_last_dom#78,d_same_day_ly#79,d_same_day_lq#80,d_current_day#81,... 4 more fields]
placeholder: PlanLater Project [cs_sold_date_sk#24, cs_ext_discount_amt#46] logicalPlan: Project [cs_sold_date_sk#24, cs_ext_discount_amt#46] +- Filter isnotnull(cs_sold_date_sk#24) +- Relation[cs_sold_date_sk#24,cs_sold_time_sk#25,cs_ship_date_sk#26,cs_bill_customer_sk#27,cs_bill_cdemo_sk#28,cs_bill_hdemo_sk#29,cs_bill_addr_sk#30,cs_ship_customer_sk#31,cs_ship_cdemo_sk#32,cs_ship_hdemo_sk#33,cs_ship_addr_sk#34,cs_call_center_sk#35,cs_catalog_page_sk#36,cs_ship_mode_sk#37,cs_warehouse_sk#38,cs_item_sk#39,cs_promo_sk#40,cs_order_number#41,cs_quantity#42,cs_wholesale_cost#43,cs_list_price#44,cs_sales_price#45,cs_ext_discount_amt#46,cs_ext_sales_price#47,... 10 more fields]
placeholder: PlanLater Project [d_date_sk#58] logicalPlan: Project [d_date_sk#58] +- Filter (((isnotnull(d_date#60) && (cast(d_date#60 as string) >= 1999-02-22)) && (d_date#60 <= 10733)) && isnotnull(d_date_sk#58)) +- Relation[d_date_sk#58,d_date_id#59,d_date#60,d_month_seq#61,d_week_seq#62,d_quarter_seq#63,d_year#64,d_dow#65,d_moy#66,d_dom#67,d_qoy#68,d_fy_year#69,d_fy_quarter_seq#70,d_fy_week_seq#71,d_day_name#72,d_quarter_name#73,d_holiday#74,d_weekend#75,d_following_holiday#76,d_first_dom#77,d_last_dom#78,d_same_day_ly#79,d_same_day_lq#80,d_current_day#81,... 4 more fields]
那么可以看到,递归到最底处,就是project->filter->relation的三层节点组合,由于我实际是重写过了DataSource,这个时候会调用 DataSourceStrategy
,去读取获取数据,然后递归逐个返回根据每个planLater分割点会有对应的策略去对数据进行相应的操作。
以上就是本文的全部内容,希望本文的内容对大家的学习或者工作能带来一定的帮助,也希望大家多多支持 码农网
猜你喜欢:本站部分资源来源于网络,本站转载出于传递更多信息之目的,版权归原作者或者来源机构所有,如转载稿涉及版权问题,请联系我们。
进化式运营:从互联网菜鸟到绝顶高手
李少加 / 电子工业出版社 / 2016-11 / 59
互联网运营作为一个新兴的岗位,一方面它是企业的核心岗职,身负重任,另一方面,又由于其短暂的历史,缺乏成熟体系的工作方法论,而目前业界主流的运营方法却是从企业视角出发,存在极大的改进空间。 《进化式运营:从互联网菜鸟到绝顶高手》作者基于自身十年的互联网洞察、实践经验,并融合了信息论、心理学、经济学、管理学、甚至包括生态学、进化论等跨学科跨学业的知识,从无到有地构建了一套全新的互联网运营体系:基......一起来看看 《进化式运营:从互联网菜鸟到绝顶高手》 这本书的介绍吧!