Background

Materialized view can improve peformance of batch queries processing significantly, but the first and the most important step of the pipeline is finding common sql subexpressions. We researched on several existed tools, including signature-based methods proposed by Zhou. etc. and syntax rewriting based method Cosette developed by database group of Washington University. These two methods can not effectively get equivalent subexpressions. So we developed a new tool by combining rules and SMT solver for large scale query dataset on ODPS.

Related Work

ODPS SQL

This is the SQL module on ODPS whose optimizer is powered by optimizer Calcite. We have accumulated all queries submitted to ODPS service every day, this is the motivation of this project. Playback is a sub-module of ODPS-SQL, which afford methods to extract, transform and parse records in the databases that store daily queries in history, it helps us to focus on sql related research and development.

Signature-based method

This method tries to denote different operator in query by different signature, each signature contains informations in the query. This method bring less overhead, however, it’s hard for this method to recoganize equivalent queries with different join order, select condition and aggregation, and it’s also impossible for this method to be extended to solve containment relationship.

Rule-based deduction method

Cosette is rule-based deduction method, it mainly uses many rules to rewrite the query to another equivalent form. This method is still based on syntax while not semantic, thus, it’s hard to handle complicated math expressions. And what’s worse, Cosette do not afford easily used API for RelNode in Calcite, and many operators have not been supported yet.

SMT solver based method

SMT is the problem of determining whether such a formula is satisfiable. Imagine an instance of the boolean satisfiability problem (SAT) in which some of the binary variables are replaced by predicates over a suitable set of non-binary variables. A predicate is basically a binary-valued function of non-binary variables. Example predicates include linear inequalities (e.g., { 3x+2y-z >= 4}) or equalities involving uninterpreted terms and function symbols (e.g., {f(f(u,v),v)=f(u,v)} f(f(u,v),v)=f(u,v) where {f} f is some unspecified function of two arguments). In this work, we use z3 from microsoft as the efficient solver tool. There is a method developed by Qi Zhou, however, this method has severe performance problems with large queries (either wide or deep).

Framework

Program Structure

For different operators, this method adopts different strategies. To avoid heavy overhead with large queries, we partition the expression according to Aggregation and TableScan operator horizontally and postpone process the join operators to accalerate join equivalence checking. Equivalence of complicated condition and math expressions is hard to deal with by structure matching, thus we still use microsoft z3 to solve SPJ subexpressions where most condition constrains and math expressions exist. For TableSink and AdhocSink,only the columns matching matters. For Aggregation, we check the aggregation function and parameter equavalence. And for table scan, we only check the table name. For other unknow or rare operator, we check nothing but make it transparent to the whole checking process. program_structure.png

Expressions Solver

Within each part of the query (SPJ Subquery), the root is one of the SPJ operators and the leaves are Aggregation or TableScan. We construct one variable for each output column from leaf nodes. By matching Table Columns (TableScan) or using match information passed from below (Aggregation), we can get the first kind of constrain that is corresponding/matching input variables in two expressions to be compared must be the same. To determine if two SPJ subexpressions are equivalent, we need another constrain that is one tuple would be selected or not in both subexpressions, this is the condition constrain. Notice that we also need to check whether the output of each column on the root can be matched. After constructing these constrains, we use z3 to solve a unsatisfiable problem generated from them. expression_solver.png In the example above, two queries are equal if and only if we can prove that for all possible values of input variables $1,$2,$3,$4,$5,$6,$7,$8, the condition constrains and output expression of each column must be equal.

Online Deployment

Thanks to ODPS UDF and UDJ, Our method can be integrated into ODPS SQL seemlessly and easy to use for potential up-coming users. First, we use one UDTF to compile all the queries and extract subexpressions. Second, we compare each pair of subexpressions from different queries and output the equivalent pairs. Unforturnately, we have too much queries to be compared in one bucket which would bring n-square cost complexity. Table names existing in Table Scan, group by keys, attributes existing in where clause and database partition conditions are taken as UDJ Join conditions to reduce number of elements in single bucket.

Id Bucket Size Avg Mem Avg CPU
1 22299 71.14884263412884 1156.1174412345235
2 22295 2936.3897362282432 82333.42705903463
3 22290 43.823199461521206 717.687861790442
4 22241 201.96977873718293 3346.6372099298437
5 22234 91.23442350083225 1967.3051419317108
6 22230 187.0596175478065 3116.2835995500564
7 20662 94571.00087200853 987321.249103769
8 18516 196.38150788507238 3618.9239576582413
9 18414 188.47615944390137 3569.6854567177147
10 18054 269262.9001883239 4203192.010634762
11 17698 15275.535046993546 439658.8311629487
12 17642 4944.97333484625 78828.1712243277
13 13920 219.32748958183646 4124.467308521339
14 13846 206.61562906254514 3918.0907121190235
15 12964 27813.329219376734 435978.3647022524
16 11514 143.98019457956914 2766.084781097985
17 10307 301424.54642475984 8164997.613854662
18 10307 301424.54642475984 8164997.613854662
19 10300 329857.9854368932 8998025.726407766
20 10300 329857.9854368932 8998025.726407766
21 Average 1590155.1328669668 65276155.6693216

sql_equivalence_cpu sql_equivalence_mem Besides, we found that huge amount of similar queries have much less runtime resource consuming than average which would bring little benefits to share, therefore, we prune these queries to further tailor the buckets.

Evaluation

Performance

Because of the limitation of Cosette and Signature-based method, they are not suitable for our project, In the evaluation, we only show the improvement of method proposed in this work comparing to Qi’s method.
We randomly select 8 compilable DML sql queries and 2 special sql queries to show the scalability of the methods. Notice that No.9 query is very deep with lots of embedded joins and No.10 query is very wide with lots of columns in the aggregation operation. All the following experiments are conducted for self comparing in order to make comparator to walk through the whole queries.

time/s 1# 2# 3# 4# 5# 6# 7# 8# 9# 10#
Ji 0.268 1.355 0.712 0.617 0.184 0.348 0.233 0.35 14.146 0.388
Qi 0.105 3.952 0.107 0.103 0.141 wrong wrong wrong >1000 >1000

From table above, Qi’s method made 3 misjudgement, and timeout for large queries(either deep or wide).

Special Cases

In order to improve performence significantly, we ignored some criterions for sql equivalence judgement. However, we do not loss accuracy according to the result calculated from daily queries for the following reasons.

  1. WHEN...CASE is relatively stable for the equavalent queries.
  2. JOIN type is relatively stable for a static join graph.
  3. Two intermediate results with different order can be shared, so SORT is not that important.
Case Status
WHEN...CASE clause Only Type
Differences between INNER JOIN, LEFT JOIN, RIGHT JOIN and OUTER JOIN Taken As Inner
UNION TODO
SORT Always True
VALUES TODO

Appendix

SQL used in evaluation

1# select * from zeus_server_info where ds=’20180824’ and cabinet_id in (select armory_id from zeus_cabinet_info where ds=’20180824’ and room_id=201326910);
2# SELECT * FROM (SELECT 0, CAST (COUNT(CASE WHEN biztimelong IS NULL THEN TRUE ELSE NULL END) AS STRING) AS col_0,CAST (COUNT(CASE WHEN sel_selnick IS NULL THEN TRUE ELSE NULL END) AS STRING) AS col_1,CAST (COUNT(1) AS STRING) AS col_2,CAST (COUNT(CASE WHEN righttype IS NULL THEN TRUE ELSE NULL END) AS STRING) AS col_3,CAST (COUNT(CASE WHEN suborderactlpayfee IS NULL THEN TRUE ELSE NULL END) AS STRING) AS col_4,CAST (COUNT(CASE WHEN orderid IS NULL THEN TRUE ELSE NULL END) AS STRING) AS col_5,CAST (COUNT(CASE WHEN tradeendtime IS NULL THEN TRUE ELSE NULL END) AS STRING) AS col_6,CAST (COUNT(CASE WHEN orderpaytime IS NULL THEN TRUE ELSE NULL END) AS STRING) AS col_7,CAST (COUNT(CASE WHEN staterightsale IS NULL THEN TRUE ELSE NULL END) AS STRING) AS col_8,CAST (COUNT(CASE WHEN rightid IS NULL THEN TRUE ELSE NULL END) AS STRING) AS col_9,CAST (COUNT(CASE WHEN byr_usernick IS NULL THEN TRUE ELSE NULL END) AS STRING) AS col_10,CAST (COUNT(CASE WHEN refdstat IS NULL THEN TRUE ELSE NULL END) AS STRING) AS col_11,CAST (COUNT(CASE WHEN rightcostdays IS NULL THEN TRUE ELSE NULL END) AS STRING) AS col_12,CAST (COUNT(CASE WHEN selid IS NULL THEN TRUE ELSE NULL END) AS STRING) AS col_13,CAST (COUNT(CASE WHEN creattime IS NULL THEN TRUE ELSE NULL END) AS STRING) AS col_14,CAST (COUNT(CASE WHEN biztype IS NULL THEN TRUE ELSE NULL END) AS STRING) AS col_15,CAST (COUNT(CASE WHEN refdcashfeefen IS NULL THEN TRUE ELSE NULL END) AS STRING) AS col_16,CAST (COUNT(CASE WHEN questdesc IS NULL THEN TRUE ELSE NULL END) AS STRING) AS col_17,CAST (COUNT(CASE WHEN recvsituval IS NULL THEN TRUE ELSE NULL END) AS STRING) AS col_18,CAST (COUNT(CASE WHEN rptreasn IS NULL THEN TRUE ELSE NULL END) AS STRING) AS col_19,CAST (COUNT(CASE WHEN prodtitle IS NULL THEN TRUE ELSE NULL END) AS STRING) AS col_20,CAST (COUNT(CASE WHEN prodid IS NULL THEN TRUE ELSE NULL END) AS STRING) AS col_21,CAST (COUNT(CASE WHEN byrid IS NULL THEN TRUE ELSE NULL END) AS STRING) AS col_22,CAST (COUNT(CASE WHEN recvsitu IS NULL THEN TRUE ELSE NULL END) AS STRING) AS col_23,CAST (COUNT(CASE WHEN bizendtime IS NULL THEN TRUE ELSE NULL END) AS STRING) AS col_24,CAST (COUNT(CASE WHEN refdfeefen IS NULL THEN TRUE ELSE NULL END) AS STRING) AS col_25 FROM secods.odl_event_crm_rights_end WHERE ds = “201808251730”) t;
3# insert overwrite table mdl_tb_activity_false_trade_15min_fdt partition(ds=’20180825183000’,flag=’zhongshu’) select t2.order_id,t2.parent_order_id,t2.item_id,t2.category_level1_id,t2.category_leaf_id,t2.gmt_create,t2.gmt_pay,t2.price,t2.buyamount,t2.pay_fee,t2.postfee,t2.discountrate,t2.changepriceratiostr,t2.buyer_id,t2.buyer_nick,t2.seller_id,t2.seller_nick,t2.isblacklistbuyer,t2.sellertype,t2.isb2c,t2.isvirtualproduct,t2.isjhs,t2.iswap,t2.biztype,t2.receivingphone,t2.couponfee from(select seller_id,buyer_id,gmt_member_reg,buyer_cnt1,buyer_cnt2,gmt_pay_hour from mdl_tb_activity_trade_15min_byr_reg_fdt4where ds=’20180825183000’ and flag_zs=1 and buyer_cnt2>=10 and buyer_cnt2/buyer_cnt1>=0.3 )t1 join ( select*,substr(gmt_pay,1,13)as gmt_pay_hour from mdl_tb_activity_trade_15min_fdt where ds=’20180825183000’)t2 on(t1.seller_id=t2.seller_id and t1.buyer_id=t2.buyer_id and t1.gmt_pay_hour=t2.gmt_pay_hour);
4# INSERT OVERWRITE table idl_sec_log_dama_http_fht partition (ds=’20180825’, hh = ‘08’) SELECT request_datetime,host,uri,src_ip,src_port,dst_ip,dst_port,method,referer,user_agent,x_forward_for,cookie,post_data,ret_code,rsp_content_type,rqs_content_type,content_length,jump_location,set_cookie,ttl,token,region,https FROM secods.odl_beaver_log_abtn_http_cloud_split WHERE ds=’20180825’ and hh = ‘08’ and split_type=’dama_log’ and src_ip is not null;
5# select count(*) from alifin_bi.s_49341_hbshouqian_pid_info_tohb1 where dt=’20180824’;
6# SELECT * from yt_paycenter.t_resource_state where dt = ‘20180823’ and id >= ‘553459001’ and id < ‘553464001’ and resource_type = ‘video’;
7# SELECT *FROM (SELECT 0, CAST (COUNT(1) AS STRING) AS col_0,CAST (COUNT(1) AS STRING) AS col_1 FROM alisec_app.adl_yt_pwdlogin_trust_coverage_15min_fmt WHERE ds = “201808240000”) t;
8# SELECT ‘支付’ AS 业务场景 , T1.USER_ID AS 用户ID , T1.PROD_ID AS 产品码 , T1.ID AS 单据ID , T1.OUT_ORDER_NO AS 支付宝交易号 , T1.TRANS_AMOUNT AS pcpay金额 , T2.AMOUNT AS paycore金额 , T1.TRANS_DT AS 业务时间FROM ( SELECT USER_ID , PROD_ID , CARD_NO , ID , OUT_ORDER_NO , TRANS_AMOUNT , TRANS_DT FROM antods.ods_pcc_payment_order_all_tt_delta_hh WHERE DT = ‘20180825’ AND HOUR = ‘17’ AND TO_CHAR(SUBSTR(TRANS_DT, 1, 19), ‘yyyymmddhhmi’) >= CONCAT(TO_CHAR(DATEADD(TO_DATE(‘20180825’,’yyyymmdd’),-1,’DD’),’yyyymmdd’),’2315’) AND TO_CHAR(SUBSTR(TRANS_DT, 1, 19), ‘yyyymmddhhmi’) <= CONCAT(‘20180825’, ‘17’, ‘15’) AND STATUS = ‘Y’) T1LEFT OUTER JOIN ( SELECT paytool_payment_id , AMOUNT FROM antods.ods_pmt_comm_fd_dtl_ob_tt_delta_hh WHERE DT =’20180825’ AND HOUR = ‘17’ AND FUND_STATUS = ‘S’ AND TRANS_STATUS = ‘C’ AND TO_CHAR(SUBSTR(TRANS_DT, 1, 19), ‘yyyymmddhhmi’) >=CONCAT(TO_CHAR(DATEADD(TO_DATE(‘20180825’,’yyyymmdd’),-1,’DD’),’yyyymmdd’),’2315’) AND TO_CHAR(SUBSTR(TRANS_DT, 1, 19), ‘yyyymmddhhmi’) <= CONCAT(‘20180825’, ‘17’, ‘15’)) T2ON T2.paytool_payment_id = T1.IDWHERE COALESCE(T1.TRANS_AMOUNT, ‘/’) <> COALESCE(T2.AMOUNT, ‘/’);
9# INSERT OVERWRITE TABLE address_match_feature_test PARTITION (ds=’20180827’)select a.*,if(b.community_tf is not null,b.community_tf,0) as community_tf,if(c.community_sum is not null,c.community_sum,0) as community_sum, if(group_tf is not null,group_tf,0) as group_tf,if(group_sum is not null,group_sum,0)as group_sum, if(road_tf is not null,road_tf,0)as road_tf,if(road_sum is not null,road_sum,0)as road_sum, if(subroad_tf is not null,subroad_tf,0)as subroad_tf,if(subroad_sum is not null,subroad_sum,0)as subroad_sum, if(roadno_tf is not null,roadno_tf,0)as roadno_tf,if(roadno_sum is not null,roadno_sum,0)as roadno_sum, if(subroadno_tf is not null,subroadno_tf,0)as subroadno_tf,if(subroadno_sum is not null,subroadno_sum,0)as subroadno_sum, if(poi_tf is not null,poi_tf,0) as poi_tf,if(poi_sum is not null,poi_sum,0)as poi_sum, if(subpoi_tf is not null,subpoi_tf,0)as subpoi_tf,if(subpoi_sum is not null,subpoi_sum,0)as subpoi_sum, if(houseno_tf is not null,houseno_tf,0)as houseno_tf,if(houseno_sum is not null,houseno_sum,0)as houseno_sum, if(roadnameno_tf is not null,roadnameno_tf,0)as roadnameno_tf,if(roadnameno_sum is not null,roadnameno_sum,0)as roadnameno_sum, if(roadnamesub_tf is not null,roadnamesub_tf,0)as roadnamesub_tf,if(roadnamesub_sum is not null,roadnamesub_sum,0)as roadnamesub_sum, if(poisub_tf is not null,poisub_tf,0) as poisub_tf,if(poisub_sum is not null,poisub_sum,0)as poisub_sum, if(subpoino_tf is not null,subpoino_tf,0)as subpoino_tf,if(subpoino_sum is not null,subpoino_sum,0)as subpoino_sum, if(poino_tf is not null,poino_tf,0)as poino_tf,if(poino_sum is not null,poino_sum,0)as poino_sum, if(cellno_tf is not null,cellno_tf,0)as cellno_tf,if(cellno_sum is not null,cellno_sum,0)as cellno_sumfrom ( select mail_no, station_id , area_id, community, group, road, subroad, roadno, subroadno, poi, subpoi, houseno, roadnameno, roadnamesub, poisub, subpoino, poino, cellno, label from address_match_test where ds=’20180827’ )a left outer join ( select station_id,area_id,community,count(community) as community_tf from address_match_sample_pos where ds=’20180827’group by station_id,area_id,community )bon a.area_id=b.area_id and a.community=b.communityleft outer join ( select station_id,community,count(community) as community_sumfrom address_match_sample_pos where ds=’20180827’group by station_id,community )con a.station_id=c.station_id and a.community=c.communityleft outer join ( select station_id,area_id,group,count(group) as group_tf from address_match_sample_pos where ds=’20180827’group by station_id,area_id,group )don a.area_id=d.area_id and a.group=d.groupleft outer join ( select station_id,group,count(group) as group_sumfrom address_match_sample_pos where ds=’20180827’group by station_id,group )fon a.station_id=f.station_id and a.group=f.groupleft outer join ( select station_id,area_id,road,count(road) as road_tf from address_match_sample_pos where ds=’20180827’group by station_id,area_id,road )gon a.area_id=g.area_id and a.road=g.roadleft outer join ( select station_id,road,count(road) as road_sumfrom address_match_sample_pos where ds=’20180827’group by station_id,road )hon a.station_id=h.station_id and a.road=h.roadleft outer join ( select station_id,area_id,subroad,count(subroad) as subroad_tf from address_match_sample_pos where ds=’20180827’group by station_id,area_id,subroad )ion a.area_id=i.area_id and a.subroad=i.subroadleft outer join ( select station_id,subroad,count(subroad) as subroad_sumfrom address_match_sample_pos where ds=’20180827’group by station_id,subroad )jon a.station_id=j.station_id and a.subroad=j.subroadleft outer join ( select station_id,area_id,roadno,count(roadno) as roadno_tf from address_match_sample_pos where ds=’20180827’group by station_id,area_id,roadno )kon a.area_id=k.area_id and a.roadno=k.roadnoleft outer join ( select station_id,roadno,count(roadno) as roadno_sumfrom address_match_sample_pos where ds=’20180827’group by station_id,roadno )lon a.station_id=l.station_id and a.roadno=l.roadnoleft outer join ( select station_id,area_id,subroadno,count(subroadno) as subroadno_tf from address_match_sample_pos where ds=’20180827’group by station_id,area_id,subroadno )mon a.area_id=m.area_id and a.subroadno=m.subroadnoleft outer join ( select station_id,subroadno,count(subroadno) as subroadno_sumfrom address_match_sample_pos where ds=’20180827’group by station_id,subroadno )non a.station_id=n.station_id and a.subroadno=n.subroadnoleft outer join ( select station_id,area_id,poi,count(poi) as poi_tf from address_match_sample_pos where ds=’20180827’group by station_id,area_id,poi )oon a.area_id=o.area_id and a.poi=o.poileft outer join ( select station_id,poi,count(poi) as poi_sumfrom address_match_sample_pos where ds=’20180827’group by station_id,poi )pon a.station_id=p.station_id and a.poi=p.poileft outer join ( select station_id,area_id,subpoi,count(subpoi) as subpoi_tf from address_match_sample_pos where ds=’20180827’group by station_id,area_id,subpoi )qon a.area_id=q.area_id and a.subpoi=q.subpoileft outer join ( select station_id,subpoi,count(subpoi) as subpoi_sumfrom address_match_sample_pos where ds=’20180827’group by station_id,subpoi )ron a.station_id=r.station_id and a.subpoi=r.subpoi left outer join ( select station_id,area_id,houseno,count(houseno) as houseno_tf from address_match_sample_pos where ds=’20180827’group by station_id,area_id,houseno )son a.area_id=s.area_id and a.houseno=s.housenoleft outer join ( select station_id,houseno,count(houseno) as houseno_sumfrom address_match_sample_pos where ds=’20180827’group by station_id,houseno )ton a.station_id=t.station_id and a.houseno=t.houseno left outer join ( select station_id,area_id,roadnameno,count(roadnameno) as roadnameno_tf from address_match_sample_pos where ds=’20180827’group by station_id,area_id,roadnameno )uon a.area_id=u.area_id and a.roadnameno=u.roadnamenoleft outer join ( select station_id,roadnameno,count(roadnameno) as roadnameno_sumfrom address_match_sample_pos where ds=’20180827’group by station_id,roadnameno )von a.station_id=v.station_id and a.roadnameno=v.roadnameno left outer join ( select station_id,area_id,roadnamesub,count(roadnamesub) as roadnamesub_tf from address_match_sample_pos where ds=’20180827’group by station_id,area_id,roadnamesub )won a.area_id=w.area_id and a.roadnamesub=w.roadnamesubleft outer join ( select station_id,roadnamesub,count(roadnamesub) as roadnamesub_sumfrom address_match_sample_pos where ds=’20180827’group by station_id,roadnamesub )xon a.station_id=x.station_id and a.roadnamesub=x.roadnamesubleft outer join ( select station_id,area_id,poisub,count(poisub) as poisub_tf from address_match_sample_pos where ds=’20180827’group by station_id,area_id,poisub )yon a.area_id=y.area_id and a.poisub=y.poisubleft outer join ( select station_id,poisub,count(poisub) as poisub_sumfrom address_match_sample_pos where ds=’20180827’group by station_id,poisub )zon a.station_id=z.station_id and a.poisub=z.poisubleft outer join ( select station_id,area_id,subpoino,count(subpoino) as subpoino_tf from address_match_sample_pos where ds=’20180827’group by station_id,area_id,subpoino )aaon a.area_id=aa.area_id and a.subpoino=aa.subpoinoleft outer join ( select station_id,subpoino,count(subpoino) as subpoino_sumfrom address_match_sample_pos where ds=’20180827’group by station_id,subpoino )abon a.station_id=ab.station_id and a.subpoino=ab.subpoinoleft outer join ( select station_id,area_id,poino,count(poino) as poino_tf from address_match_sample_pos where ds=’20180827’group by station_id,area_id,poino )acon a.area_id=ac.area_id and a.poino=ac.poinoleft outer join ( select station_id,poino,count(poino) as poino_sumfrom address_match_sample_pos where ds=’20180827’group by station_id,poino )adon a.station_id=ad.station_id and a.poino=ad.poinoleft outer join ( select station_id,area_id,cellno,count(cellno) as cellno_tf from address_match_sample_pos where ds=’20180827’group by station_id,area_id,cellno )aeon a.area_id=ae.area_id and a.cellno=ae.cellnoleft outer join ( select station_id,cellno,count(cellno) as cellno_sumfrom address_match_sample_pos where ds=’20180827’group by station_id,cellno )afon a.station_id=af.station_id and a.cellno=af.cellno;
10# INSERT OVERWRITE TABLE ads_rec_youku_d_ffmv1fg_test_d PARTITION(ds=’20180827’)SELECT sample_id, MAX(label) as label, MAX(features) as featuresFROM (SELECT sample_id, label, CONCAT( ‘c1:’,c1,’:1,’, ‘c2:’,c2,’:1,’, ‘c3:’,c3,’:1,’, ‘c4:’,c4,’:1,’, ‘c5:’,c5,’:1,’, ‘c6:’,c6,’:1,’, ‘c7:’,c7,’:1,’, ‘c8:’,c8,’:1,’, ‘c9:’,c9,’:1,’, ‘c10:’,c10,’:1,’, ‘c11:’,c11,’:1,’, ‘c12:’,c12,’:1,’, ‘c13:’,c13,’:1,’, ‘c14:’,c14,’:1,’, ‘c15:’,c15,’:1,’, ‘u16:’,u16,’:1,’, ‘u22:’,u22,’:1,’, ‘u23:’,u23,’:1,’, ‘u24:’,u24,’:1,’, ‘u25:’,u25,’:1,’, ‘u26:’,u26,’:1,’, ‘u32:’,u32,’:1,’, ‘u33:’,u33,’:1,’, ‘u34:’,u34,’:1,’, ‘u35:’,u35,’:1,’, ‘u36:’,u36,’:1,’, ‘u37:’,u37,’:1,’, ‘u38:’,u38,’:1,’, ‘u39:’,u39,’:1,’, ‘u40:’,u40,’:1,’, ‘u41:’,u41,’:1,’, ‘u42:’,u42,’:1,’, ‘i1:’,i1,’:1,’, ‘i3:’,i3,’:1,’, ‘i4:’,i4,’:1,’, ‘i5:’,i5,’:1,’, ‘i6:’,i6,’:1,’, ‘i11:’,i11,’:1,’, ‘i12:’,i12,’:1,’, ‘i13:’,i13,’:1,’, ‘i14:’,i14,’:1,’, ‘i16:’,i16,’:1,’, ‘i19:’,i19,’:1,’, ‘i21:’,i21,’:1,’, ‘i24:’,i24,’:1,’, ‘i31:’,i31,’:1,’, ‘i32:’,i32,’:1,’, ‘i33:’,i33,’:1,’, ‘i34:’,i34,’:1,’, ‘i35:’,i35,’:1,’, ‘i36:’,i36,’:1,’, ‘i37:’,i37,’:1,’, ‘i38:’,i38,’:1,’, ‘i39:’,i39,’:1,’, ‘i40:’,i40,’:1,’, ‘i41:’,i41,’:1,’, ‘i42:’,i42,’:1,’, ‘i43:’,i43,’:1,’, ‘i44:’,i44,’:1,’, ‘i45:’,i45,’:1,’, ‘i46:’,i46,’:1,’, ‘i52:’,i52,’:1,’, ‘y1:’,y1,’:1,’, ‘y2:’,y2,’:1,’, ‘y3:’,y3,’:1,’, ‘y4:’,y4,’:1,’, ‘y5:’,y5,’:1,’, ‘y6:’,y6,’:1,’, ‘y7:’,y7,’:1,’, ‘y8:’,y8,’:1,’, ‘y9:’,y9,’:1,’, ‘y10:’,y10,’:1,’, ‘y11:’,y11,’:1,’, ‘y12:’,y12,’:1,’, ‘y13:’,y13,’:1,’, ‘y14:’,y14,’:1,’, ‘y15:’,y15,’:1,’, ‘y16:’,y16,’:1,’, ‘y17:’,y17,’:1,’, ‘y18:’,y18,’:1,’, ‘y19:’,y19,’:1,’, ‘y20:’,y20,’:1,’, ‘y21:’,y21,’:1,’, ‘y22:’,y22,’:1,’, ‘y23:’,y23,’:1,’, ‘y24:’,y24,’:1,’, ‘y25:’,y25,’:1,’, ‘y26:’,y26,’:1,’, ‘y27:’,y27,’:1,’, ‘y28:’,y28,’:1,’, ‘y29:’,y29,’:1,’, ‘y30:’,y30,’:1,’, ‘y31:’,y31,’:1,’, ‘y32:’,y32,’:1,’, ‘y33:’,y33,’:1,’, ‘y34:’,y34,’:1,’, ‘y35:’,y35,’:1,’, ‘y36:’,y36,’:1,’, ‘y37:’,y37,’:1,’, ‘y38:’,y38,’:1,’, ‘y39:’,y39,’:1,’, ‘y40:’,y40,’:1,’, ‘y41:’,y41,’:1,’, ‘y42:’,y42,’:1,’, ‘y43:’,y43,’:1,’, ‘y44:’,y44,’:1,’, ‘y45:’,y45,’:1,’, ‘y46:’,y46,’:1,’, ‘y47:’,y47,’:1,’, ‘y48:’,y48,’:1,’, ‘y49:’,y49,’:1,’, ‘y50:’,y50,’:1’ ) AS features FROM (SELECT CONCAT(CAST(RAND() AS STRING),’’, ‘’ ,’,’,a.utdid,’,’,a.content_id) AS sample_id, label, COALESCE(CONCAT(‘c1’,’‘,ui_click_1d),’0’) AS c1, COALESCE(CONCAT(‘c2’,’‘,ui_non_click_1d),’0’) AS c2, COALESCE(CONCAT(‘c3’,’‘,ui_click_3d),’0’) AS c3, COALESCE(CONCAT(‘c4’,’‘,ui_non_click_3d),’0’) AS c4, COALESCE(CONCAT(‘c5’,’‘,ui_click_7d),’0’) AS c5, COALESCE(CONCAT(‘c6’,’‘,ui_non_click_7d),’0’) AS c6, COALESCE(CONCAT(‘c7’,’‘,ui_click_14d),’0’) AS c7, COALESCE(CONCAT(‘c8’,’‘,ui_non_click_14d),’0’) AS c8, COALESCE(CONCAT(‘c9’,’‘,ui_vv_1d),’0’) AS c9, COALESCE(CONCAT(‘c10’,’‘,ui_vv_3d),’0’) AS c10, COALESCE(CONCAT(‘c11’,’‘,ui_vv_7d),’0’) AS c11, COALESCE(CONCAT(‘c12’,’‘,ui_vv_14d),’0’) AS c12, COALESCE(CONCAT(‘c13’,’‘,ui_last_vv_time),’0’) AS c13, COALESCE(CONCAT(‘c14’,’‘,week_day),’0’) AS c14, COALESCE(CONCAT(‘c15’,’‘,hour),’0’) AS c15, COALESCE(CONCAT(‘u16’,’‘,active),’0’) AS u16, COALESCE(CONCAT(‘u22’,’‘,u_video_play_cnt_1d),’0’) AS u22, COALESCE(CONCAT(‘u23’,’‘,u_video_play_cnt_2d),’0’) AS u23, COALESCE(CONCAT(‘u24’,’‘,u_video_play_cnt_3d),’0’) AS u24, COALESCE(CONCAT(‘u25’,’‘,u_video_play_cnt_7d),’0’) AS u25, COALESCE(CONCAT(‘u26’,’‘,u_video_play_cnt_14d),’0’) AS u26, COALESCE(CONCAT(‘u32’,’‘,u_video_play_ts_1d),’0’) AS u32, COALESCE(CONCAT(‘u33’,’‘,u_video_play_ts_2d),’0’) AS u33, COALESCE(CONCAT(‘u34’,’‘,u_video_play_ts_3d),’0’) AS u34, COALESCE(CONCAT(‘u35’,’‘,u_video_play_ts_7d),’0’) AS u35, COALESCE(CONCAT(‘u36’,’‘,u_video_play_ts_14d),’0’) AS u36, COALESCE(CONCAT(‘u37’,’‘,u_video_play_days_1d),’0’) AS u37, COALESCE(CONCAT(‘u38’,’‘,u_video_play_days_2d),’0’) AS u38, COALESCE(CONCAT(‘u39’,’‘,u_video_play_days_3d),’0’) AS u39, COALESCE(CONCAT(‘u40’,’‘,u_video_play_days_7d),’0’) AS u40, COALESCE(CONCAT(‘u41’,’‘,u_video_play_days_14d),’0’) AS u41, COALESCE(CONCAT(‘u42’,’‘,u_video_view_time_score),’0’) AS u42, COALESCE(CONCAT(‘i1’,’‘,a.content_id),’0’) AS i1, COALESCE(CONCAT(‘i3’,’‘,i_vdo_len),’0’) AS i3, COALESCE(CONCAT(‘i4’,’‘,i_vdo_type),’0’) AS i4, COALESCE(CONCAT(‘i5’,’‘,i_mbr_id),’0’) AS i5, COALESCE(CONCAT(‘i6’,’‘,i_show_id),’0’) AS i6, COALESCE(CONCAT(‘i11’,’‘,i_vv),’0’) AS i11, COALESCE(CONCAT(‘i12’,’‘,i_video_publishday),’0’) AS i12, COALESCE(CONCAT(‘i13’,’‘,i_video_titlelength),’0’) AS i13, COALESCE(CONCAT(‘i14’,’‘,i_video_symbolcount),’0’) AS i14, COALESCE(CONCAT(‘i16’,’‘,i_video_click_1d),’0’) AS i16, COALESCE(CONCAT(‘i19’,’‘,i_video_click_7d),’0’) AS i19, COALESCE(CONCAT(‘i21’,’‘,i_video_click_user_1d),’0’) AS i21, COALESCE(CONCAT(‘i24’,’‘,i_video_click_user_7d),’0’) AS i24, COALESCE(CONCAT(‘i31’,’‘,i_video_click_avg_cnt),’0’) AS i31, COALESCE(CONCAT(‘i32’,’‘,i_video_expose_1d),’0’) AS i32, COALESCE(CONCAT(‘i33’,’‘,i_video_expose_2d),’0’) AS i33, COALESCE(CONCAT(‘i34’,’‘,i_video_expose_3d),’0’) AS i34, COALESCE(CONCAT(‘i35’,’‘,i_video_expose_7d),’0’) AS i35, COALESCE(CONCAT(‘i36’,’‘,i_video_expose_14d),’0’) AS i36, COALESCE(CONCAT(‘i37’,’‘,i_play_cnt_1d),’0’) AS i37, COALESCE(CONCAT(‘i38’,’‘,i_play_cnt_2d),’0’) AS i38, COALESCE(CONCAT(‘i39’,’‘,i_play_cnt_3d),’0’) AS i39, COALESCE(CONCAT(‘i40’,’‘,i_play_cnt_7d),’0’) AS i40, COALESCE(CONCAT(‘i41’,’‘,i_play_cnt_14d),’0’) AS i41, COALESCE(CONCAT(‘i42’,’‘,i_play_user_cnt_1d),’0’) AS i42, COALESCE(CONCAT(‘i43’,’‘,i_play_user_cnt_2d),’0’) AS i43, COALESCE(CONCAT(‘i44’,’‘,i_play_user_cnt_3d),’0’) AS i44, COALESCE(CONCAT(‘i45’,’‘,i_play_user_cnt_7d),’0’) AS i45, COALESCE(CONCAT(‘i46’,’‘,i_play_user_cnt_14d),’0’) AS i46, COALESCE(CONCAT(‘i52’,’‘,i_view_time_score),’0’) AS i52, COALESCE(concat(‘y1’,’‘,gender,’‘,a.content_id),’’) as y1, COALESCE(concat(‘y2’,’‘,age,’‘,a.content_id),’’) as y2, COALESCE(concat(‘y3’,’‘,constellation,’‘,a.content_id),’’) as y3, COALESCE(concat(‘y4’,’‘,province,’‘,a.content_id),’’) as y4, COALESCE(concat(‘y5’,’‘,os_name,’‘,a.content_id),’’) as y5, COALESCE(concat(‘y6’,’‘,user_id,’‘,vdo_chnl_id),’’) as y6, COALESCE(concat(‘y7’,’‘,gender,’‘,vdo_chnl_id),’’) as y7, COALESCE(concat(‘y8’,’‘,age,’‘,vdo_chnl_id),’’) as y8, COALESCE(concat(‘y9’,’‘,constellation,’‘,vdo_chnl_id),’’) as y9, COALESCE(concat(‘y10’,’‘,province,’‘,vdo_chnl_id),’’) as y10, COALESCE(concat(‘y11’,’‘,os_name,’‘,vdo_chnl_id),’’) as y11, COALESCE(concat(‘y12’,’‘,gender,’‘,age,’‘,vdo_chnl_id),’’) as y12, COALESCE(concat(‘y13’,’‘,gender,’‘,province,’‘,vdo_chnl_id),’’) as y13, COALESCE(concat(‘y14’,’‘,gender,’‘,os_name,’‘,vdo_chnl_id),’’) as y14, COALESCE(concat(‘y15’,’‘,user_id,’‘,i_vdo_len),’’) as y15, COALESCE(concat(‘y16’,’‘,gender,’‘,i_vdo_len),’’) as y16, COALESCE(concat(‘y17’,’‘,age,’‘,i_vdo_len),’’) as y17, COALESCE(concat(‘y18’,’‘,os_name,’‘,i_vdo_len),’’) as y18, COALESCE(concat(‘y19’,’‘,gender,’‘,age,’‘,i_vdo_len),’’) as y19, COALESCE(concat(‘y20’,’‘,gender,’‘,os_name,’‘,i_vdo_len),’’) as y20, COALESCE(concat(‘y21’,’‘,user_id,’‘,i_video_publishday),’’) as y21, COALESCE(concat(‘y22’,’‘,gender,’‘,i_video_publishday),’’) as y22, COALESCE(concat(‘y23’,’‘,age,’‘,i_video_publishday),’’) as y23, COALESCE(concat(‘y24’,’‘,os_name,’‘,i_video_publishday),’’) as y24, COALESCE(concat(‘y25’,’‘,user_id,’‘,i_mbr_id),’’) as y25, COALESCE(concat(‘y26’,’‘,gender,’‘,i_mbr_id),’’) as y26, COALESCE(concat(‘y27’,’‘,age,’‘,i_mbr_id),’’) as y27, COALESCE(concat(‘y28’,’‘,constellation,’‘,i_mbr_id),’’) as y28, COALESCE(concat(‘y29’,’‘,province,’‘,i_mbr_id),’’) as y29, COALESCE(concat(‘y30’,’‘,os_name,’‘,i_mbr_id),’’) as y30, COALESCE(concat(‘y31’,’‘,gender,’‘,age,’‘,i_mbr_id),’’) as y31, COALESCE(concat(‘y32’,’‘,gender,’‘,province,’‘,i_mbr_id),’’) as y32, COALESCE(concat(‘y33’,’‘,gender,’‘,os_name,’‘,i_mbr_id),’’) as y33, COALESCE(concat(‘y34’,’‘,user_id,’‘,i_vdo_type),’’) as y34, COALESCE(concat(‘y35’,’‘,gender,’‘,i_vdo_type),’’) as y35, COALESCE(concat(‘y36’,’‘,age,’‘,i_vdo_type),’’) as y36, COALESCE(concat(‘y37’,’‘,os_name,’‘,i_vdo_type),’’) as y37, COALESCE(concat(‘y38’,’‘,gender,’‘,age,’‘,i_vdo_type),’’) as y38, COALESCE(concat(‘y39’,’‘,user_id,’‘,i_show_id),’’) as y39, COALESCE(concat(‘y40’,’‘,gender,’‘,i_show_id),’’) as y40, COALESCE(concat(‘y41’,’‘,age,’‘,i_show_id),’’) as y41, COALESCE(concat(‘y42’,’‘,province,’‘,i_show_id),’’) as y42, COALESCE(concat(‘y43’,’‘,os_name,’‘,i_show_id),’’) as y43, COALESCE(concat(‘y44’,’‘,age,’‘,vdo_chnl_id,’‘,i_mbr_id),’’) as y44, COALESCE(concat(‘y45’,’‘,gender,’‘,vdo_chnl_id,’‘,i_mbr_id),’’) as y45, COALESCE(concat(‘y46’,’‘,gender,’‘,vdo_chnl_id,’‘,i_video_publishday),’’) as y46, COALESCE(concat(‘y47’,’‘,user_id,’‘,vdo_chnl_id,’‘,i_video_publishday),’’) as y47, COALESCE(concat(‘y48’,’‘,user_id,’‘,vdo_chnl_id,’‘,i_vdo_type),’’) as y48, COALESCE(concat(‘y49’,’‘,gender,’‘,vdo_chnl_id,’‘,i_vdo_type),’’) as y49, COALESCE(concat(‘y50’,’‘,age,’‘,gender,’‘,week_day,’‘,a.hour),’’) as y50 FROM (SELECT utdid,content_id,ui_click_1d,ui_non_click_1d,ui_click_3d,ui_non_click_3d, ui_click_7d,ui_non_click_7d,ui_click_14d,ui_non_click_14d ui_vv_1d,ui_vv_3d,ui_vv_7d,ui_vv_14d,ui_non_click_14d, ui_last_vv_time,week_day,hour,if(label>0,1,0) AS label FROM ytrec.ads_rec_youkurecsys_feed_mergedsample_discoverylrv1_d WHERE ds=20180827) a JOIN (select utdid,user_id,gender,age,constellation,province,os_name,active,u_video_click_cnt_1d,u_video_click_cnt_2d,u_video_click_cnt_3d, u_video_click_cnt_7d,u_video_click_cnt_14d,u_video_click_item_cnt_1d,u_video_click_item_cnt_2d,u_video_click_item_cnt_3d, u_video_click_item_cnt_7d,u_video_click_item_cnt_14d,u_video_click_days_1d,u_video_click_days_2d,u_video_click_days_3d, u_video_click_days_7d,u_video_click_days_14d,u_video_play_cnt_1d,u_video_play_cnt_2d,u_video_play_cnt_3d, u_video_play_cnt_7d,u_video_play_cnt_14d,u_video_play_item_cnt_1d,u_video_play_item_cnt_2d,u_video_play_item_cnt_3d, u_video_play_item_cnt_7d,u_video_play_item_cnt_14d,u_video_play_ts_1d,u_video_play_ts_2d,u_video_play_ts_3d, u_video_play_ts_7d,u_video_play_ts_14d,u_video_play_days_1d,u_video_play_days_2d,u_video_play_days_3d, u_video_play_days_7d,u_video_play_days_14d,u_video_view_time_score,u_play_cnt_1d,u_play_cnt_2d,u_play_cnt_3d, u_play_cnt_7d,u_play_cnt_14d,u_play_ts_1d,u_play_ts_2d,u_play_ts_3d,u_play_ts_7d,u_play_ts_14d,u_play_item_cnt_1d,u_play_item_cnt_2d,u_play_item_cnt_3d, u_play_item_cnt_7d,u_play_item_cnt_14d,u_play_days_1d,u_play_days_2d,u_play_days_3d, u_play_days_7d,u_play_days_14d,u_view_time_score,u_play_last_diff,u_video_subs_uploader_list, u_video_like_uploader_list,pre_tags_30d,pre_vdo,pre_chnl,chnl_click_1d,chnl_non_click_1d,chnl_click_3d,chnl_non_click_3d, chnl_click_7d,chnl_non_click_7d,chnl_click_14d,chnl_non_click_14d,tag_click_1d,tag_non_click_1d,tag_click_3d,tag_non_click_3d, tag_click_7d,tag_non_click_7d,tag_click_14d,tag_non_click_14d,category_id_ctr_1d,category_id_ctr_3d, category_id_ctr_7d,category_id_ctr_14d,tag_ctr_1d,tag_ctr_3d,tag_ctr_7d,tag_ctr_14d,tags_1m,tags_3m,tags_6m FROM ytrec.youku_short_video_merge_user_feature_yh2 WHERE ds=20180826 ) b ON a.utdid = b.utdid JOIN (select content_id,vdo_chnl_id,i_vdo_len,i_vdo_type,i_mbr_id,i_show_id,i_up_count,i_favorite_count,i_down_count,i_comment_count,i_vv, i_video_publishday,i_video_titlelength,i_video_symbolcount,i_tags,i_video_click_1d,i_video_click_2d,i_video_click_3d,i_video_click_7d,i_video_click_14d, i_video_click_user_1d,i_video_click_user_2d,i_video_click_user_3d,i_video_click_user_7d,i_video_click_user_14d, i_video_click_rate_1d,i_video_click_rate_2d,i_video_click_rate_3d,i_video_click_rate_7d,i_video_click_rate_14d, i_video_click_avg_cnt,i_video_expose_1d,i_video_expose_2d,i_video_expose_3d,i_video_expose_7d,i_video_expose_14d, i_play_cnt_1d,i_play_cnt_2d,i_play_cnt_3d,i_play_cnt_7d,i_play_cnt_14d,i_play_user_cnt_1d,i_play_user_cnt_2d,i_play_user_cnt_3d,i_play_user_cnt_7d,i_play_user_cnt_14d, i_play_pcr_1d,i_play_pcr_2d,i_play_pcr_3d,i_play_pcr_7d,i_play_pcr_14d, i_view_time_score,i_video_uchannel,i_video_ucrmlevel,i_video_ufans,i_video_uisoriginal,i_video_uisvip,i_video_uisofficial, i_video_uoftype,i_video_uispgc,i_video_uisverify,i_video_clarityscore,i_video_dclick_1d,i_video_dclick_7d,i_video_dctr_1d,i_video_dctr_7d FROM ytrec.ads_youku_feed_feature_item_merge_d WHERE ds=20180826) c ON a.content_id = c.content_id ) d )e GROUP BY sample_id;