
Alibaba Cloud

MaxCompute Best Practices

Document Version: 20220630


Legal disclaimer
Alibaba Cloud reminds you to carefully read and fully understand the terms and conditions of this legal disclaimer before you read or use this document. If you have read or used this document, it shall be deemed as your total acceptance of this legal disclaimer.

1. You shall download and obtain this document from the Alibaba Cloud website or other Alibaba Cloud-authorized channels, and use this document for your own legal business activities only. The content of this document is considered confidential information of Alibaba Cloud. You shall strictly abide by the confidentiality obligations. No part of this document shall be disclosed or provided to any third party for use without the prior written consent of Alibaba Cloud.

2. No part of this document shall be excerpted, translated, reproduced, transmitted, or disseminated by any organization, company or individual in any form or by any means without the prior written consent of Alibaba Cloud.

3. The content of this document may be changed because of product version upgrade, adjustment, or other reasons. Alibaba Cloud reserves the right to modify the content of this document without notice, and an updated version of this document will be released through Alibaba Cloud-authorized channels from time to time. You should pay attention to the version changes of this document as they occur and download and obtain the most up-to-date version of this document from Alibaba Cloud-authorized channels.

4. This document serves only as a reference guide for your use of Alibaba Cloud products and services. Alibaba Cloud provides this document based on the "status quo", "being defective", and "existing functions" of its products and services. Alibaba Cloud makes every effort to provide relevant operational guidance based on existing technologies. However, Alibaba Cloud hereby makes a clear statement that it in no way guarantees the accuracy, integrity, applicability, and reliability of the content of this document, either explicitly or implicitly. Alibaba Cloud shall not take legal responsibility for any errors or lost profits incurred by any organization, company, or individual arising from download, use, or trust in this document. Alibaba Cloud shall not, under any circumstances, take responsibility for any indirect, consequential, contingent, special, or punitive damages, including lost profits arising from the use or trust in this document (even if Alibaba Cloud has been notified of the possibility of such a loss).

5. By law, all the contents in Alibaba Cloud documents, including but not limited to pictures, architecture design, page layout, and text description, are intellectual property of Alibaba Cloud and/or its affiliates. This intellectual property includes, but is not limited to, trademark rights, patent rights, copyrights, and trade secrets. No part of this document shall be used, modified, reproduced, publicly transmitted, changed, disseminated, distributed, or published without the prior written consent of Alibaba Cloud and/or its affiliates. The names owned by Alibaba Cloud shall not be used, published, or reproduced for marketing, advertising, promotion, or other purposes without the prior written consent of Alibaba Cloud. The names owned by Alibaba Cloud include, but are not limited to, "Alibaba Cloud", "Aliyun", "HiChina", and other brands of Alibaba Cloud and/or its affiliates, which appear separately or in combination, as well as the auxiliary signs and patterns of the preceding brands, or anything similar to the company names, trade names, trademarks, product or service names, domain names, patterns, logos, marks, signs, or special descriptions that third parties identify as Alibaba Cloud and/or its affiliates.

6. Please directly contact Alibaba Cloud if you find any errors in this document.


Document conventions
Style, description, and example:

Danger: A danger notice indicates a situation that will cause major system changes, faults, physical injuries, and other adverse results. Example: "Danger: Resetting will result in the loss of user configuration data."

Warning: A warning notice indicates a situation that may cause major system changes, faults, physical injuries, and other adverse results. Example: "Warning: Restarting will cause business interruption. About 10 minutes are required to restart an instance."

Notice: A caution notice indicates warning information, supplementary instructions, and other content that the user must understand. Example: "Notice: If the weight is set to 0, the server no longer receives new requests."

Note: A note indicates supplemental instructions, best practices, tips, and other content. Example: "Note: You can use Ctrl + A to select all files."

>: Closing angle brackets are used to indicate a multi-level menu cascade. Example: Click Settings > Network > Set network type.

Bold: Bold formatting is used for buttons, menus, page names, and other UI elements. Example: Click OK.

Courier font: Courier font is used for commands. Example: Run the cd /d C:/window command to enter the Windows system folder.

Italic: Italic formatting is used for parameters and variables. Example: bae log list --instanceid Instance_ID

[] or [a|b]: This format is used for an optional value, where only one item can be selected. Example: ipconfig [-all|-t]

{} or {a|b}: This format is used for a required value, where only one item can be selected. Example: switch {active|stand}


Table of Contents
1. SQL
   1.1. Write MaxCompute SQL statements
   1.2. Rewrite incompatible SQL statements
   1.3. Export SQL execution results
   1.4. Check whether partition pruning is effective
   1.5. Query the first N data records of each group
   1.6. Merge multiple rows of data into one row
   1.7. Transpose rows to columns or columns to rows
   1.8. JOIN operations in MaxCompute SQL
2. Data migration
   2.1. Overview
   2.2. Migrate data across DataWorks workspaces
   2.3. Synchronize data from Hadoop to MaxCompute
   2.4. Best practice to migrate data from Oracle to MaxCompute
   2.5. Migrate data from Kafka to MaxCompute
   2.6. Migrate data from Elasticsearch to MaxCompute
   2.7. Migrate JSON-formatted data from MongoDB to MaxCompute
   2.8. Migrate data from ApsaraDB RDS to MaxCompute based on dynamic partitioning
   2.9. Migrate JSON data from OSS to MaxCompute
   2.10. Migrate data from MaxCompute to Tablestore
   2.11. Migrate data from MaxCompute to OSS
   2.12. Migrate data from a user-created MySQL database on an ECS instance to MaxCompute
   2.13. Migrate data from Amazon Redshift to MaxCompute
   2.14. Migrate data from BigQuery to MaxCompute
   2.15. Migrate log data to MaxCompute
      2.15.1. Overview
      2.15.2. Use Tunnel to upload log data to MaxCompute
      2.15.3. Use DataHub to migrate log data to MaxCompute
      2.15.4. Use DataWorks Data Integration to migrate log data to MaxCompute
3. Data development
   3.1. Convert data types among STRING, TIMESTAMP, and DATETIME
   3.2. Use a MaxCompute UDF to convert IPv4 or IPv6 addresses into geolocations
   3.3. Use IntelliJ IDEA to develop a Java UDF
   3.4. Use MaxCompute to query geolocations of IP addresses
   3.5. Resolve the issue that you cannot upload files that exceed 10 MB to DataWorks
   3.6. Grant a specified user the access permissions on a specific UDF
   3.7. Use a PyODPS node to segment Chinese text based on Jieba
   3.8. Use a PyODPS node to download data to a local directory for processing or to ...
4. Compute optimization
   4.1. Optimize SQL statements
   4.2. Optimize JOIN long tails
   4.3. Long-tail computing optimization
   4.4. Optimize the calculation for long-period metrics
5. Job diagnostics
   5.1. Use Logview to diagnose jobs that run slowly
6. Cost optimization
   6.1. Overview
   6.2. Select a billing method
   6.3. Optimize computing costs
   6.4. Optimize storage costs
   6.5. Optimize the costs of data uploads and downloads
   6.6. Manage costs
   6.7. Command reference
   6.8. Analyze MaxCompute bills
7. Security management
   7.1. Set a RAM user as the super administrator for a MaxCompute project
   7.2. Policy-based permission management for users assigned built-in roles



1. SQL

1.1. Write MaxCompute SQL statements

This topic describes the common scenarios of using MaxCompute SQL statements and how to write them.

Prepare a dataset
The emp and dept tables are used as the dataset in this example. You can create a table in a MaxCompute project and upload data to the table. For more information about how to import data, see Overview.

Download the data files of the emp table and the data files of the dept table.

Execute the following statement to create the emp table:

CREATE TABLE IF NOT EXISTS emp (


EMPNO STRING,
ENAME STRING,
JOB STRING,
MGR BIGINT,
HIREDATE DATETIME,
SAL DOUBLE,
COMM DOUBLE,
DEPTNO BIGINT);

Execute the following statement to create the dept table:

CREATE TABLE IF NOT EXISTS dept (


DEPTNO BIGINT,
DNAME STRING,
LOC STRING);

Examples
Example 1: Query all departments that have at least one employee.

We recommend that you use the JOIN clause to avoid large amounts of data in the query. Execute the following SQL statement:

SELECT d.*
FROM dept d
JOIN (
SELECT DISTINCT deptno AS no
FROM emp
) e
ON d.deptno = e.no;

Example 2: Query all employees who have higher salaries than SMITH.


The following code shows how to use MAPJOIN in the SQL statement for this scenario:

SELECT /*+ MapJoin(a) */ e.empno


, e.ename
, e.sal
FROM emp e
JOIN (
SELECT MAX(sal) AS sal
FROM `emp`
WHERE `ENAME` = 'SMITH'
) a
ON e.sal > a.sal;

Example 3: Query the names of all employees and the names of their immediate superiors.

The following code shows how to use EQUI JOIN in the SQL statement for this scenario:

SELECT a.ename
, b.ename
FROM emp a
LEFT OUTER JOIN emp b
ON b.empno = a.mgr;

Example 4: Query all jobs that have basic salaries higher than USD 1,500.

The following code shows how to use the HAVING clause in the SQL statement for this scenario:

SELECT emp.`JOB`
     , MIN(emp.sal) AS sal
FROM `emp`
GROUP BY emp.`JOB`
HAVING MIN(emp.sal) > 1500;

Example 5: Query the number of employees in each department, the average salary, and the average length of service.

The following code shows how to use built-in functions in the SQL statement for this scenario:

SELECT COUNT(empno) AS cnt_emp


, ROUND(AVG(sal), 2) AS avg_sal
, ROUND(AVG(datediff(getdate(), hiredate, 'dd')), 2) AS avg_hire
FROM `emp`
GROUP BY `DEPTNO`;

Example 6: Query the names and the sorting order of the first three employees who have the highest salaries.

The following code shows how to use the TOP N clause in the SQL statement for this scenario:


SELECT *
FROM (
SELECT deptno
, ename
, sal
, ROW_NUMBER() OVER (PARTITION BY deptno ORDER BY sal DESC) AS nums
FROM emp
) emp1
WHERE emp1.nums < 4;

Example 7: Query the number of employees in each department and the proportion of clerks in these departments.

SELECT deptno
, COUNT(empno) AS cnt
, ROUND(SUM(CASE
WHEN job = 'CLERK' THEN 1
ELSE 0
END) / COUNT(empno), 2) AS rate
FROM `EMP`
GROUP BY deptno;

Notes
When you use the GROUP BY clause, the SELECT list can consist only of aggregate functions or columns that are part of the GROUP BY clause.
ORDER BY must be followed by LIMIT N.
The SELECT expression does not support subqueries. To use subqueries, you can rewrite the code to include a JOIN clause (see the sketch after this list).
The JOIN clause does not support Cartesian products. You can replace the JOIN clause with MAPJOIN.
UNION ALL must be replaced with subqueries.
The subquery that is specified in the IN or NOT IN clause must contain only one column and return a maximum of 1,000 rows. Otherwise, use the JOIN clause.
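The following sketch illustrates these rules. The orders and users tables and their columns are hypothetical and are used only to show the rewrite pattern: the aggregation is moved from the SELECT list into a joined subquery, and ORDER BY is followed by LIMIT.

-- Hypothetical tables: orders(order_id, user_id, amount) and users(user_id, name).
-- Aggregate in a subquery and join it, instead of placing a subquery in the SELECT list.
SELECT u.name
     , t.total_amount
FROM users u
JOIN (
    SELECT user_id
         , SUM(amount) AS total_amount
    FROM orders
    GROUP BY user_id
) t
ON u.user_id = t.user_id
ORDER BY t.total_amount DESC
LIMIT 100;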

1.2. Rewrite incompatible SQL statements

This topic describes how to modify SQL statements that are incompatible with MaxCompute V2.0.

Background information
MaxCompute V2.0 fully embraces open source ecosystems, supports more programming languages and features, and provides higher performance. It also inspects syntax more rigorously. As a result, errors may be returned for some statements that use less rigorous syntax and were successfully executed in earlier versions.


To enable a smooth canary upgrade to MaxCompute V2.0, the MaxCompute framework supports rollback. If MaxCompute V2.0 fails to execute a job, MaxCompute V1.0 executes the job instead. The rollback increases the latency of the job. Before you submit jobs, we recommend that you configure set odps.sql.planner.mode=lot; to manually disable the rollback feature. This prevents the impacts of modifications to the MaxCompute rollback policy.

The MaxCompute team notifies the owners of the jobs for which the required SQL statements cannot be executed by email or DingTalk based on the online rollback condition. The job owners must modify the SQL statements for the jobs at the earliest opportunity. Otherwise, the jobs may fail.

group.by.with.star
This statement is equivalent to the select * ...group by... statement.

In MaxCompute V2.0, all the columns of a source table must be included in the GROUP BY clause. Otherwise, an error is returned.
In earlier versions of MaxCompute, select * from group by key is supported even if not all columns of a source table are included in the GROUP BY clause.

Examples

Scenario 1: The GROUP BY key does not include all columns.

Invalid syntax:

select * from t group by key;

Error message:

FAILED: ODPS-0130071:[1,8] Semantic analysis exception - column reference t.value should appear in GROUP BY key

Valid syntax:

select distinct key from t;

Scenario 2: The GROUP BY key includes all columns.

We recommend that you do not use the following syntax:

select * from t group by key, value; -- t has columns key and value

Even if the preceding syntax causes no errors in MaxCompute V2.0, we recommend that you use the following syntax:

select distinct key, value from t;

bad.escape
The escape sequence is invalid.

MaxCompute defines that, in a string literal, each ASCII character that ranges from 0 to 127 must be written in the format of a backslash (\) followed by three octal digits. For example, 0 is written as \001, and 1 is written as \002. However, \01 and \0001 are processed as \001.

This method confuses new users. For example, "\0001" cannot be processed as "\000"+"1". For users who migrate data from other systems to MaxCompute, invalid data may be generated.


Note: If numbers are appended to \000, such as numbers in the range of \0001 to \0009 or the number \00001, an error may be returned.

MaxCompute V2.0 corrects the sequences in scripts to handle this issue.

Invalid syntax:

select split(key, "\01"), value like "\0001" from t;

Error message:

FAILED: ODPS-0130161:[1,19] Parse exception - unexpected escape sequence: 01
ODPS-0130161:[1,38] Parse exception - unexpected escape sequence: 0001

Valid syntax:

select split(key, "\001"), value like "\001" from t;

column.repeated.in.creation
If duplicate column names are detected when the CREATE TABLE statement is executed, MaxCompute V2.0 returns an error.

Examples

Invalid syntax:

create table t (a BIGINT, b BIGINT, a BIGINT);

Error message:

FAILED: ODPS-0130071:[1,37] Semantic analysis exception - column repeated in creation: a

Valid syntax:

create table t (a BIGINT, b BIGINT);

string.join.double
You want to join values of the STRING type with values of the DOUBLE type.
In earlier versions of MaxCompute, the values of the STRING and DOUBLE types are converted into the BIGINT type. This causes precision loss. For example, 1.1 = "1" in a JOIN condition is considered equal.
In MaxCompute V2.0, the values of the STRING and DOUBLE types are converted into the DOUBLE type because MaxCompute V2.0 is compatible with Hive.
Examples

Syntax that is not recommended:

select * from t1 join t2 on t1.double_value = t2.string_value;

Warning information:

WARNING:[1,48] implicit conversion from STRING to DOUBLE, potential data loss, use CAST function to suppress


Recommended syntax:

select * from t1 join t2 on t1.double_value = cast(t2.string_value as double);

window.ref.prev.window.alias
Window functions reference the aliases of other window functions in the SELECT clause at the same level.

Examples

Assume that rn does not exist in t1. Invalid syntax:

select row_number() over (partition by c1 order by c1) rn,


row_number() over (partition by c1 order by rn) rn2
from t1;

Error message:

FAILED: ODPS-0130071:[2,45] Semantic analysis exception - column rn cannot be resolved

Valid syntax:

select row_number() over (partition by c1 order by rn) rn2


from
(select c1, row_number() over (partition by c1 order by c1) rn
from t1
) tmp;

select.invalid.token.after.star
The SELECT clause allows you to use an asterisk (*) to select all the columns of a table. However, the asterisk cannot be followed by aliases, even if the asterisk specifies only one column. The new editor returns errors for similar syntax.
Examples

Invalid syntax:

select * as alias from table_test;

Error message:

FAILED: ODPS-0130161:[1,10] Parse exception - invalid token 'as'

Valid syntax:

select * from table_test;

agg.having.ref.prev.agg.alias
If HAVING exists, the SELECT clause can reference aggregate function aliases.

Examples

Invalid syntax:


select count(c1) cnt,


sum(c1) / cnt avg
from t1
group by c2
having cnt > 1;

Error message:

FAILED: ODPS-0130071:[2,11] Semantic analysis exception - column cnt cannot be resolved
ODPS-0130071:[2,11] Semantic analysis exception - column reference cnt should appear in GROUP BY key

s and cnt do not exist in the source table t1. However, earlier versions of MaxCompute do not return an error because HAVING exists. In MaxCompute V2.0, the error message column cannot be resolved is returned.

Valid syntax:

select cnt, s, s/cnt avg


from
(
select count(c1) cnt,
sum(c1) s
from t1
group by c2
having count(c1) > 1
) tmp;

order.by.no.limit
In MaxCompute, the ORDER BY clause must be followed by a LIMIT clause to limit the number of data records. ORDER BY sorts all data records. If ORDER BY is not followed by a LIMIT clause, the execution performance is low.

Examples

Invalid syntax:

select * from (select *


from (select cast(login_user_cnt as int) as uv, '3' as shuzi
from test_login_cnt where type = 'device' and type_name = 'mobile') v
order by v.uv desc) v
order by v.shuzi limit 20;

Error message:

FAILED: ODPS-0130071:[4,1] Semantic analysis exception - ORDER BY must be used with a LIMIT clause

Add a LIMIT clause to the subquery order by v.uv desc, as shown in the following sketch.
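A minimal sketch of the corrected query follows. The LIMIT value 10000 in the subquery is an assumed upper bound; choose a value that fits your data.

select * from (select *
    from (select cast(login_user_cnt as int) as uv, '3' as shuzi
          from test_login_cnt where type = 'device' and type_name = 'mobile') v
    order by v.uv desc limit 10000) v
order by v.shuzi limit 20;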

In MaxCompute V1.0, view checks are not rigorous. For example, a view is created in a project that does not require a check on the LIMIT clause. odps.sql.validate.orderby.limit=false indicates that the project does not require a check on the LIMIT clause.

create view table_view as select id from table_view order by id;


Execute the following statement to access the view:

select * from table_view;

MaxCompute V1.0 does not return an error, whereas MaxCompute V2.0 returns the following error:

FAILED: ODPS-0130071:[1,15] Semantic analysis exception - while resolving view xdj.xdj_view_limit - ORDER BY must be used with a LIMIT clause

generated.column.name.multi.window
Automatically generated aliases are used.

In earlier versions of MaxCompute, an alias is automatically generated for each expression of a SELECT statement. The alias is displayed on the MaxCompute client. However, earlier versions of MaxCompute do not guarantee that the alias generation rule is correct or remains unchanged. We recommend that you do not use automatically generated aliases.

MaxCompute V2.0 warns you against the use of automatically generated aliases. However, MaxCompute V2.0 does not prohibit the use of automatically generated aliases to avoid adverse impacts.

In some cases, known changes are made to the alias generation rules in different versions of MaxCompute. Some online jobs depend on automatically generated aliases. These jobs may fail when MaxCompute is being upgraded or rolled back. If you encounter these issues, modify your queries and explicitly specify the aliases of the columns.

Examples

Syntax that is not recommended:

select _c0 from (select count(*) from table_name) t;

Recommended syntax:

select c from (select count(*) c from table_name) t;

non.boolean.filter
Non-BOOLEAN filter conditions are used.

MaxCompute prohibits implicit conversions between the BOOLEAN type and other data types. However, earlier versions of MaxCompute allow the use of BIGINT filter conditions in some cases. MaxCompute V2.0 prohibits the use of BIGINT filter conditions. If your scripts have BIGINT filter conditions, modify them at the earliest opportunity. Examples:

Invalid syntax:

select id, count(*) from table_name group by id having id;

Error message:

FAILED: ODPS-0130071:[1,50] Semantic analysis exception - expect a BOOLEAN expression

Valid syntax:


select id, count(*) from table_name group by id having id <> 0;

post.select.ambiguous
The ORDER BY, CLUSTER BY, DISTRIBUTE BY, and SORT BY clauses reference columns with conflicting names.

In earlier versions of MaxCompute, the system automatically selects the last column in a SELECT clause as the operation object. However, MaxCompute V2.0 reports an error in this case. Modify your queries at the earliest opportunity. Examples:

Invalid syntax:

select a, b as a from t order by a limit 10;

Error message:

FAILED: ODPS-0130071:[1,34] Semantic analysis exception - a is ambiguous, can be both t.a or null.a

Valid syntax:

select a as c, b as a from t order by a limit 10;

The change covers the statements that have conflicting column names but have the same syntax. Even though no ambiguity is caused, the system returns an error to warn you against these statements. We recommend that you modify the relevant statements.

duplicated.partition.column
Partitions with the same name are specified in a query.

In earlier versions of MaxCompute, no error is returned if two partition keys with the same name are specified. The latter partition key overwrites the former partition key. This causes confusion. MaxCompute V2.0 returns an error in this case. Examples:

Invalid syntax 1:

insert overwrite table partition (ds = '1', ds = '2')select ... ;

ds = '1' is ignored during execution.

Valid syntax:

insert overwrite table partition (ds = '2')select ... ;

Invalid syntax 2:

create table t (a bigint, ds string) partitioned by (ds string);

Valid syntax:


create table t (a bigint) partitioned by (ds string);

order.by.col.ambiguous
The ORDER BY clause references duplicate aliases in a SELECT clause.

Invalid syntax:

select id, id
from table_test
order by id;

Valid syntax:

select id, id id2


from table_name
order by id;

Remove the duplicate aliases so that the ORDER BY clause can reference them.

in.subquery.without.result
In a colx in subquery clause, colx does not exist in the source table. Because the subquery returns no results, earlier versions of MaxCompute do not detect this issue, whereas MaxCompute V2.0 returns an error.

Invalid syntax:

select * from table_name


where not_exist_col in (select id from table_name limit 0);

Error message:

FAILED: ODPS-0130071:[2,7] Semantic analysis exception - column not_exist_col cannot be resolved
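A sketch of a variant that passes the check simply references a column that exists in the source table. The column name id is assumed here for illustration.

select * from table_name
where id in (select id from table_name limit 0);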

ctas.if.not.exists
The syntax of a destination table is invalid.
If the destination table exists, earlier versions of MaxCompute do not check the syntax. However, MaxCompute V2.0 checks the syntax. As a result, a large number of errors may be returned. Examples:

Invalid syntax:

create table if not exists table_name


as
select * from not_exist_table;

Error message:

FAILED: ODPS-0130131:[1,50] Table not found - table meta_dev.not_exist_table cannot be resolved


worker.restart.instance.timeout
In earlier versions of MaxCompute, each time a UDF generates a record, a write operation is triggered on Apsara Distributed File System, and a heartbeat packet is sent to Job Scheduler. If the UDF does not generate records for 10 minutes, the following error is returned:

FAILED: ODPS-0123144: Fuxi job failed - WorkerRestart errCode:252,errMsg:kInstanceMonitorTimeout, usually caused by bad udf performance.

The runtime framework of MaxCompute V2.0 supports vectorization to process multiple rows of a column at a time. This makes execution more efficient. If multiple records are processed at a time and no heartbeat packets are sent to Job Scheduler within the specified period, vectorization may cause normal statements to time out. The interval between two output records cannot exceed 10 minutes.

If a timeout error occurs, we recommend that you first check the performance of the UDFs, especially if several seconds are required to process each record. If the UDFs cannot be optimized, you can manually set batch.rowcount to handle this issue. The default value of batch.rowcount is 1024.

set odps.sql.executionengine.batch.rowcount=16;

divide.nan.or.overflow
Earlier versions of MaxCompute do not support division constant folding.

The following code shows the physical execution plan in earlier versions of MaxCompute:

explain
select if(false, 0/0, 1.0)
from table_name;
in task M1_Stg1:
Data source: meta_dev.table_name
TS: alias: table_name
SEL: If(False, Divide(UDFToDouble(0), UDFToDouble(0)), 1.0)
FS: output: None

The IF and DIVIDE functions are retained. During execution, the first parameter of IF is set to False, and the expression of DIVIDE is not evaluated. Divide-by-zero errors do not occur.

However, MaxCompute V2.0 supports division constant folding. As a result, an error is returned. Examples:

Invalid syntax:

select IF(FALSE, 0/0, 1.0)


from table_name;

Error message:

FAILED: ODPS-0130071:[1,19] Semantic analysis exception - encounter runtime exception while evaluating function /, detailed message: DIVIDE func result NaN, two params are 0.000000 and 0.000000

An overflow error may also occur. Examples:


Invalid syntax:

select if(false, 1/0, 1.0)


from table_name;

Error message:

FAILED: ODPS-0130071:[1,19] Semantic analysis exception - encounter runtime exception while evaluating function /, detailed message: DIVIDE func result overflow, two params are 1.000000 and 0.000000

Valid syntax:

We recommend that you remove /0 and use valid constants, as shown in the following sketch.
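For example, a minimal sketch of the corrected statement (using the same table_name as above) replaces the unused 0/0 branch with a plain constant, so constant folding no longer encounters a division by zero:

select if(false, 0, 1.0)
from table_name;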

A similar issue occurs in the constant folding for CASE WHEN, such as CASE WHEN TRUE THEN 0 ELSE 0/0. During constant folding in MaxCompute V2.0, all subexpressions are evaluated, which causes divide-by-zero errors.

CASE WHEN may involve more complex optimization scenarios. Example:

select case when key = 0 then 0 else 1/key end


from (
select 0 as key from src
union all
select key from src) r;

The optimizer pushes down the division operation to the subqueries. The following code shows a similar conversion:

M (
select case when 0 = 0 then 0 else 1/0 end c1 from src
UNION ALL
select case when key = 0 then 0 else 1/key end c1 from src) r;

Error message:

FAILED: ODPS-0130071:[0,0] Semantic analysis exception - physical plan generation failed: java.lang.ArithmeticException: DIVIDE func result overflow, two params are 1.000000 and 0.000000

An error is returned for the constant folding in the first clause of UNION ALL. We recommend that you move the CASE WHEN in the SQL statement to the subqueries and remove the useless CASE WHEN statements and /0.

select c1
from (
select 0 c1 from src
union all
select case when key = 0 then 0 else 1/key end c1 from src) r;

small.table.exceeds.mem.limit


Earlier versions of MaxCompute support multi-way join optimization. Multiple JOIN operations with the same join key are merged for execution in the same Fuxi task, such as J4_1_2_3_Stg1 in this example.

explain
select t1.*
from t1 join t2 on t1.c1 = t2.c1
join t3 on t1.c1 = t3.c1;

The following code shows the physical execution plan in earlier versions of MaxCompute:

In Job job0:
root Tasks: M1_Stg1, M2_Stg1, M3_Stg1
J4_1_2_3_Stg1 depends on: M1_Stg1, M2_Stg1, M3_Stg1
In Task M1_Stg1:
Data source: meta_dev.t1
In Task M2_Stg1:
Data source: meta_dev.t2
In Task M3_Stg1:
Data source: meta_dev.t3
In Task J4_1_2_3_Stg1:
JOIN: t1 INNER JOIN unknown INNER JOIN unknown
SEL: t1._col0, t1._col1, t1._col2
FS: output: None

If MAPJOIN hints are added, the physical execution plan in earlier versions of MaxCompute remains unchanged. In earlier versions of MaxCompute, multi-way join optimization is preferentially used, and user-defined MAPJOIN hints can be ignored.

explain
select /* +mapjoin(t1) */ t1.*
from t1 join t2 on t1.c1 = t2.c1
join t3 on t1.c1 = t3.c1;

The preceding physical execution plan in earlier versions of MaxCompute is applied.

The optimizer of MaxCompute V2.0 preferentially uses user-defined MAPJOIN hints. In this example, if t1 is a large table, an error similar to the following one is returned:

FAILED: ODPS-0010000:System internal error - SQL Runtime Internal Error: Hash Join Cursor H
ashJoin_REL… small table exceeds, memory limit(MB) 640, fixed memory used …, variable memor
y used …

In this case, if MAPJOIN is not required, we recommend that you remove the MAPJOIN hints.

sigkill.oom


sigkill.oom has the same issue as small.table.exceeds.mem.limit. If you specify MAPJOIN hints and the sizes of the small tables are large, multiple JOIN statements may be optimized by using multi-way joins in earlier versions of MaxCompute. As a result, the statements are successfully executed in earlier versions of MaxCompute. However, in MaxCompute V2.0, some users may use odps.sql.mapjoin.memory.max to prevent small tables from exceeding the size limit. Each MaxCompute worker has a memory limit. If the sizes of the small tables are large, MaxCompute workers may be terminated because the memory limit is exceeded. If this happens, an error similar to the following one is returned:

Fuxi job failed - WorkerRestart errCode:9,errMsg:SigKill(OOM), usually caused by OOM(out of memory).

We recommend that you remove the MAPJOIN hints and use multi-way joins.

wm_concat.first.argument.const
Based on the WM_CONCAT function described in Aggregate functions, the first parameter of WM_CONCAT must be a constant. However, earlier versions of MaxCompute do not have rigorous check standards. For example, if the source table has no data, no error is returned even if the first parameter of WM_CONCAT is ColumnReference.

Function declaration:
string wm_concat(string separator, string str)
Parameters:
separator: the delimiter, which is a constant of the STRING type. Delimiters of other types
or non-constant delimiters result in exceptions.

MaxCompute V2.0 checks the validity of parameters during the planning stage. If the first parameter of WM_CONCAT is not a constant, an error is returned. Examples:

Invalid syntax:

select wm_concat(value, ',') FROM src group by value;

Error message:

FAILED: ODPS-0130071:[0,0] Semantic analysis exception - physical plan generation failed: com.aliyun.odps.lot.cbo.validator.AggregateCallValidator$AggregateCallValidationException: Invalid argument type - The first argument of WM_CONCAT must be constant string.
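A sketch of a compliant call passes the constant delimiter as the first argument and the column as the second argument, as described in the function declaration above:

select wm_concat(',', value) from src group by value;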

pt.implicit.convertion.failed
srcpt is a partitioned table that has two partitions.

create table srcpt(key STRING, value STRING) partitioned by (pt STRING);


alter table srcpt add partition (pt='pt1');
alter table srcpt add partition (pt='pt2');


In the preceding SQL statements, the INT constants and the values in the pt column of the STRING type are converted into the DOUBLE type for comparison. Even if odps.sql.udf.strict.mode=true is configured for the project, earlier versions of MaxCompute do not return an error and instead filter out all the pt partitions. However, MaxCompute V2.0 returns an error. Examples:

Invalid syntax:

select key from srcpt where pt in (1, 2);

Error message:

FAILED: ODPS-0130071:[0,0] Semantic analysis exception - physical plan generation failed: java.lang.NumberFormatException: ODPS-0123091:Illegal type cast - In function cast, value 'pt1' cannot be casted from String to Double.

We recommend that you do not compare the values in partition key columns of the STRING type with INT constants. If such a comparison is required, convert the INT constants into the STRING type, as shown in the following sketch.
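As a minimal sketch of this recommendation, using the srcpt table defined above, write the partition values as STRING constants so that no implicit conversion is needed:

select key from srcpt where pt in ('1', '2');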

having.use.select.alias
SQL specifications define that the GROUP BY and HAVING clauses precede a SELECT clause. Therefore, the column alias generated by the SELECT clause cannot be used in the HAVING clause.

Examples

Invalid syntax:

select id id2 from table_name group by id having id2 > 0;

Error message:

FAILED: ODPS-0130071:[1,44] Semantic analysis exception - column id2 cannot be resolved
ODPS-0130071:[1,44] Semantic analysis exception - column reference id2 should appear in GROUP BY key

id2 is the column alias generated by the SELECT clause and cannot be used in the HAVING clause.
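A sketch of a compliant rewrite references the original column in the HAVING clause instead of the alias:

select id id2 from table_name group by id having id > 0;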

dynamic.pt.to.static
In MaxCompute V2.0, dynamic partitions may be converted into static partitions by the optimizer.

Examples

insert overwrite table srcpt partition(pt) select id, 'pt1' from table_name;

The preceding statement is converted into the following statement:

insert overwrite table srcpt partition(pt='pt1') select id from table_name;

If the specified partition value is invalid, such as '${bizdate}', MaxCompute V2.0 returns an error during syntax checks. For more information, see Partition.

Invalid syntax:


insert overwrite table srcpt partition(pt) select id, '${bizdate}' from table_name limit 0;

Error message:

FAILED: ODPS-0130071:[1,24] Semantic analysis exception - wrong columns count 2 in data source, requires 3 columns (includes dynamic partitions if any)

In earlier versions of MaxCompute, no results are returned by the SQL statement due to LIMIT 0, and no dynamic partitions are created. As a result, no error is returned.

lot.not.in.subquery
Processing of NULL values in the IN subquery.

In a standard SQL IN operation, if the value list contains a NULL value, the return value may be NULL or true, but cannot be false. For example, 1 in (null, 1, 2, 3) returns true, 1 in (null, 2, 3) returns NULL, and null in (null, 1, 2, 3) returns NULL. Likewise, for the NOT IN operation, if the value list contains a NULL value, the return value may be false or NULL, but cannot be true.

MaxCompute V2.0 processes NULL values by using standard execution rules. If you receive a notification for this issue, check whether the subqueries in the IN operation have a NULL value and whether the related execution meets your expectations. If the related execution does not meet your expectations, modify the queries.

Examples

select * from t where c not in (select accepted from c_list);

If the accepted column does not contain NULL values, ignore this issue. If the accepted column contains NULL values, c not in (select accepted from c_list) returns true in earlier versions of MaxCompute and NULL in MaxCompute V2.0.

Valid syntax:

select * from t where c not in (select accepted from c_list where accepted is not null)

1.3. Export SQL execution results

This topic describes how to export SQL execution results in MaxCompute.

Note: This topic provides examples based on Alibaba Cloud MaxCompute SDK for Java.

Overview

You can use the following methods to export the execution results of SQL statements:

If the amount of data is small, use SQLTask to obtain all query results.
If you want to export the query results of a table or a partition, use Tunnel.
If the SQL statements are complex, use Tunnel and SQLTask to export the query results.
Use DataWorks to execute SQL statements, synchronize data, perform timed scheduling, and configure task dependencies.
Use the open source tool DataX to export data from MaxCompute to specified destination data sources.

Use SQLTask to export data

SQLTask uses Alibaba Cloud MaxCompute SDK to call SQLTask.getResult(i) to execute SQL statements and obtain the query results. For more information, see SQLTask.

When you use SQLTask, note the following rules:

SQLTask.getResult(i) is used to export the results of SELECT statements. You cannot use it to export the execution results of other MaxCompute SQL statements such as SHOW TABLES.
You can use READ_TABLE_MAX_ROW to specify the maximum number of data records that the SELECT statement returns to a client. For more information, see Project operations. (See the example after this list.)
The SELECT statement returns a maximum of 10,000 data records to a client. You can execute the SELECT statement on a client such as SQLTask. This is equivalent to appending a LIMIT N clause to the SELECT statement.

This rule does not apply if you execute the CREATE TABLE XX AS SELECT or INSERT INTO/OVERWRITE TABLE statement to solidify the results into a specified table.
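For example, the following command raises this cap for the current project. The setproject command and the property name are assumptions based on the MaxCompute project settings documentation; verify them for your environment before use.

setproject READ_TABLE_MAX_ROW=1000;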

Use Tunnel to export data


If a query returns all the data of a table or a partition, you can use Tunnel to export the data. For more information, see Tunnel commands and MaxCompute Tunnel overview.

The following example shows how to run a Tunnel command to export data. If the Tunnel command cannot meet your requirements, you can use the Tunnel SDK to export data. For more information, see MaxCompute Tunnel overview.

tunnel d wc_out c:\wc_out.dat;


2016-12-16 19:32:08 - new session: 201612161932082d3c9b0a012f68e7 total lines: 3
2016-12-16 19:32:08 - file [0]: [0, 3), c:\wc_out.dat
downloading 3 records into 1 file
2016-12-16 19:32:08 - file [0] start
2016-12-16 19:32:08 - file [0] OK. total: 21 bytes
download OK

Use SQLTask and Tunnel to export data

SQLTask cannot be used to process more than 10,000 data records, whereas Tunnel can. These two methods complement each other. You can use SQLTask and Tunnel together to export more than 10,000 data records.

The following sample code provides an example of how to use SQLTask and Tunnel to export data:

private static final String accessId = "userAccessId";


private static final String accessKey = "userAccessKey";
private static final String endPoint = "http://service.cn-shanghai.maxcompute.aliyun.co
m/api";
private static final String project = "userProject";
private static final String sql = "userSQL";
private static final String table = "Tmp_" + UUID.randomUUID().toString().replace("-",
"_");// Use a random string as the name of the temporary table.
private static final Odps odps = getOdps();
public static void main(String[] args) {


System.out.println(table);
runSql();
tunnel();
}
/*
* Download the results that are returned by SQLTask.
* */
private static void tunnel() {
TableTunnel tunnel = new TableTunnel(odps);
try {
DownloadSession downloadSession = tunnel.createDownloadSession(
project, table);
System.out.println("Session Status is : "
+ downloadSession.getStatus().toString());
long count = downloadSession.getRecordCount();
System.out.println("RecordCount is: " + count);
RecordReader recordReader = downloadSession.openRecordReader(0,
count);
Record record;
while ((record = recordReader.read()) != null) {
consumeRecord(record, downloadSession.getSchema());
}
recordReader.close();
} catch (TunnelException e) {
e.printStackTrace();
} catch (IOException e1) {
e1.printStackTrace();
}
}
/*
* Save the data.
* If the amount of data is small, you can directly copy the data from the output. You
can also use Java.io to write the data to a local file or a remote storage system to save t
he data.
* */
private static void consumeRecord(Record record, TableSchema schema) {
System.out.println(record.getString("username")+","+record.getBigint("cnt"));
}
/*
* Execute an SQL statement to save the query results to a temporary table.
* The time-to-live (TTL) of the saved data is one day. Saved data does not consume muc
h storage space. The storage of the system is not affected even if an error occurs when the
system deletes the data.
* */
private static void runSql() {
Instance i;
StringBuilder sb = new StringBuilder("Create Table ").append(table)
.append(" lifecycle 1 as ").append(sql);
try {
System.out.println(sb.toString());
i = SQLTask.run(getOdps(), sb.toString());
i.waitForSuccess();
} catch (OdpsException e) {
e.printStackTrace();


}
}
/*
* Initialize the connection information of MaxCompute.
* */
private static Odps getOdps() {
Account account = new AliyunAccount(accessId, accessKey);
Odps odps = new Odps(account);
odps.setEndpoint(endPoint);
odps.setDefaultProject(project);
return odps;
}

Use DataWorks to synchronize and export data


DataWorks allows you to execute SQL statements and configure data synchronization tasks to generate and export data.

1. Log on to the DataWorks console.
2. In the left-side navigation pane, click Workspaces.
3. On the Workspaces page, find the workspace that you want to manage and click Data Analytics in the Actions column.
4. Create a business process.
   i. On the Data Analytics page, right-click Business process and select New business process.
   ii. Enter a name in the Business Name field.
   iii. Click New.
5. Create an ODPS SQL node.
   i. Right-click the business process and choose New > MaxCompute > ODPS SQL.
   ii. Enter runsql in the Node name field and click Submit.
   iii. Configure the ODPS SQL node and click the Save icon.
6. Create a data synchronization node.
   i. Right-click the business process and choose New > Data Integration > Offline synchronization.
   ii. Enter sync2mysql in the Node name field and click Submit.
   iii. Specify a data source and a data destination.
   iv. Configure the mapping between columns in the source and destination tables.
   v. Configure channel control.
   vi. Click the Save icon.
7. Configure a dependency between the data synchronization node and the ODPS SQL node. Configure the ODPS SQL node as the output node and the data synchronization node as the export node.
8. Configure workflow scheduling or use the default settings. Then, click the Run icon. The following information shows the operational log for data synchronization:


2016-12-17 23:43:46.394 [job-15598025] INFO JobContainer -


Task start time: 2016-12-17 23:43:34
Task end time: 2016-12-17 23:43:46
Total execution time: 11s
Average amounts of data per task: 31.36 KB/s
Write speed: 1,668 rec/s
Read records: 16,689
Failed read and write attempts: 0

9. Execute the following SQL statement to query the data synchronization results:

select count(*) from result_in_db;

1.4. Check whether partition pruning is effective

This topic describes how to check whether partition pruning is effective.

Background information
A MaxCompute partitioned table is a table with partitions. You can specify one or more columns as the partition key to create a partitioned table. If you specify the name of the partition that you want to access, MaxCompute reads data only from that partition and does not scan the entire table. This reduces costs and improves efficiency.

Partition pruning allows you to specify filter conditions on partition key columns, so that MaxCompute reads data only from the partitions that meet the filter conditions specified in your SQL statements. This avoids the errors and the waste of resources caused by full table scans. However, partition pruning may not take effect in some cases.

This topic describes partition pruning from the following aspects:

Check whether partition pruning is effective
Scenarios where partition pruning does not take effect

Check whether partition pruning is effective


To check whether partition pruning is effective for a query, execute the EXPLAIN statement to view the execution plan of the query.

For a query where partition pruning does not take effect:

explain
select seller_id
from xxxxx_trd_slr_ord_1d
where ds=rand();

The execution plan indicates that all 1,344 partitions of the xxxxx_trd_slr_ord_1d table are read.

For a query where partition pruning is effective:


explain
select seller_id
from xxxxx_trd_slr_ord_1d
where ds='20150801';

The execution plan indicates that only partition 20150801 of the xxxxx_trd_slr_ord_1d table is read.

Scenarios where partition pruning does not take effect


Improper use of UDFs

If you use user-defined functions (UDFs) or specific built-in functions to specify partitions, partition pruning may not take effect. In this case, we recommend that you execute the EXPLAIN statement to check whether partition pruning is effective.

explain
select ...
from xxxxx_base2_brd_ind_cw
where ds = concat(SPLIT_PART(bi_week_dim(' ${bdp.system.bizdate}'), ',', 1), SPLIT_PART(b
i_week_dim(' ${bdp.system.bizdate}'), ',', 2))

Note: For more information about UDF-based partition pruning, see the "WHERE" section in WHERE clause (where_condition).

Improper use of joins

When you join tables, pay attention to the following rules:

If partition pruning conditions are specified in the WHERE clause, partition pruning is effective.
If partition pruning conditions are specified in the ON clause, partition pruning is effective for the secondary table, but not for the primary table.

The following examples describe how partition pruning works when three different types of JOIN operations are performed:

LEFT OUTER JOIN


For a query where partition pruning conditions are specified in the ON clause:

set odps.sql.allow.fullscan=true;
explain
select a.seller_id
,a.pay_ord_pbt_1d_001
from xxxxx_trd_slr_ord_1d a
left outer join
xxxxx_seller b
on a.seller_id=b.user_id
and a.ds='20150801'
and b.ds='20150801';

The execution plan indicates that partition pruning is effective for the right table, but not for the left table.


For a query where partition pruning conditions are specified in the WHERE clause:

set odps.sql.allow.fullscan=true;
explain
select a.seller_id
,a.pay_ord_pbt_1d_001
from xxxxx_trd_slr_ord_1d a
left outer join
xxxxx_seller b
on a.seller_id=b.user_id
where a.ds='20150801'
and b.ds='20150801';

The execution plan indicates that partition pruning is effective for both tables.

RIGHT OUTER JOIN

A RIGHT OUTER JOIN operation is similar to a LEFT OUTER JOIN operation. If partition pruning conditions are specified in the ON clause, partition pruning is effective only for the left table, but not for the right table. If partition pruning conditions are specified in the WHERE clause, partition pruning is effective for both tables.

FULL OUTER JOIN

Partition pruning is effective only when partition pruning conditions are specified in the WHERE clause, but not in the ON clause.

Impact and consideration


If partition pruning does not take effect, the query performance can be greatly deteriorated, and this issue can hardly be discovered. We recommend that you check whether partition pruning is effective before you commit the code.
To use UDFs for partition pruning, you must modify the classes of the UDFs or add set odps.sql.udf.ppr.deterministic = true; before the SQL statements that you execute. For more information, see WHERE clause (where_condition).

1.5. Query the first N data records of each group

This topic describes how to group data records and query the first N data records of each group.

Sample data

empno   ename    job        sal
7369    SMITH    CLERK      800.0
7876    SMITH    CLERK      1100.0
7900    JAMES    CLERK      950.0
7934    MILLER   CLERK      1300.0
7499    ALLEN    SALESMAN   1600.0
7654    MARTIN   SALESMAN   1250.0
7844    TURNER   SALESMAN   1500.0
7521    WARD     SALESMAN   1250.0

Implementation
You can use one of the following methods to query the first N data records of each group:

Query the row number of each record and use the WHERE clause to filter the records.

SELECT * FROM (
SELECT empno
, ename
, sal
, job
, ROW_NUMBER() OVER (PARTITION BY job ORDER BY sal) AS rn
FROM emp
) tmp
WHERE rn < 10;

Use the SPLIT function.

For more information, see the last example in MaxCompute learning plan. This method can be used to determine the sequence number of a data record. If the sequence number is greater than the specified number, such as 10, the remaining data records are no longer processed. This improves computing efficiency.


1.6. Merge multiple rows of data into one row

This topic describes how to use SQL statements to merge multiple rows of data into one row.

Sample data

class gender name

1 M LiLei

1 F HanMM

1 M Jim

1 F HanMM

2 F Kate

2 M Peter

Examples
Example 1: Execute the following statement to merge the rows whose values in the class column are the same into one row based on the values in the name column and deduplicate the values in the name column. You can also implement the deduplication by using nested subqueries.

SELECT class, wm_concat(distinct ',', name) FROM students GROUP BY class;

Note: The wm_concat function is used to aggregate data. For more information, see Aggregate functions.

The following result is returned.

class names

1 LiLei,HanMM,Jim

2 Kate,Peter

Example 2: Execute the following statement to collect statistics on the numbers of males and females based on the values in the class column:

SELECT
class
,SUM(CASE WHEN gender = 'M' THEN 1 ELSE 0 END) AS cnt_m
,SUM(CASE WHEN gender = 'F' THEN 1 ELSE 0 END) AS cnt_f
FROM students
GROUP BY class;

The following result is returned.


class cnt_m cnt_f

1 2 2

2 1 1

1.7. Transpose rows to columns or columns to rows

This topic describes how to use SQL statements to transpose rows to columns or columns to rows.

Background information
The following figure shows the effect of transposing rows to columns or columns to rows.

Rows to columns
Transpose multiple rows to one row, or transpose one column to multiple columns.

Columns to rows
Transpose one row to multiple rows, or transpose multiple columns to one column.

Sample data
Sample source data is provided for you to better understand the examples of transposing rows to columns or columns to rows.

Create a source table and insert data into the source table. The table is used to transpose rows to columns. Sample statements:

create table rowtocolumn (name string, subject string, result bigint);
insert into table rowtocolumn values
('Bob' , 'chinese' , 74),
('Bob' , 'mathematics' , 83),
('Bob' , 'physics' , 93),
('Alice' , 'chinese' , 74),
('Alice' , 'mathematics' , 84),
('Alice' , 'physics' , 94);

Create a source table and insert data into the source table. The table is used to transpose columns to rows. Sample statements:


create table columntorow (name string, chinese bigint, mathematics bigint, physics bigint
);
insert into table columntorow values
('Bob' , 74, 83, 93),
('Alice' , 74, 84, 94);

Examples of transposing rows to columns


You can use one of the following methods to transpose rows to columns:

Method 1: Use the CASE WHEN expression to extract the values of each subject as separate columns. Sample statement:

select name as name,


max(case subject when 'chinese' then result end) as chinese,
max(case subject when 'mathematics' then result end) as mathematics,
max(case subject when 'physics' then result end) as physics
from rowtocolumn
group by name;

The following result is returned:

+--------+------------+------------+------------+
name chinese mathematics physics
+--------+------------+------------+------------+
Bob 74 83 93
Alice 74 84 94
+--------+------------+------------+------------+

Method 2: Use built-in functions to transpose rows to columns. Merge the values of the subject and result columns into one column by using the CONCAT and WM_CONCAT functions. Then, parse the values of the subject column as separate columns by using the KEYVALUE function. Sample statement:

select name as name,
       keyvalue(subject, 'chinese') as chinese,
       keyvalue(subject, 'mathematics') as mathematics,
       keyvalue(subject, 'physics') as physics
from(
    select name, wm_concat(';',concat(subject,':',result)) as subject
    from rowtocolumn
    group by name);

The following result is returned:

+--------+------------+------------+------------+
name chinese mathematics physics
+--------+------------+------------+------------+
Bob 74 83 93
Alice 74 84 94
+--------+------------+------------+------------+

Examples of transposing columns to rows


You can use one of the following methods to transpose columns to rows:


Method 1: Use the UNION ALL clause to combine the values in the chinese, mathematics, and physics columns into one column. Sample statements:

-- Remove the limit on the simultaneous execution of the ORDER BY and LIMIT clauses. This way, you can use ORDER BY to sort the results by name.
set odps.sql.validate.orderby.limit=false;
-- Transpose columns to rows.
select name as name, subject as subject, result as result
from(
select name, 'chinese' as subject, chinese as result from columntorow
union all
Choose name, 'mathematics' as subject, mathematics as result from columntorow
union all
select name, 'physics' as subject, physics as result from columntorow)
order by name;

The following result is returned:

+--------+--------+------------+
name subject result
+--------+--------+------------+
Bob chinese 74
Bob mathematics 83
Bob physics 93
Alice chinese 74
Alice mathematics 84
Alice physics 94
+--------+--------+------------+

Method 2: Use built-in functions to transpose columns to rows. Concatenate the column name of each subject and the values in each column by using the CONCAT function. Then, split the concatenated values into the subject and result columns by using the TRANS_ARRAY and SPLIT_PART functions. Sample statement:

select name as name,
       split_part(subject,':',1) as subject,
       split_part(subject,':',2) as result
from(
    select trans_array(1,';',name,subject) as (name,subject)
    from(
        select name,
               concat('chinese',':',chinese,';','mathematics',':',mathematics,';','physics',':',physics) as subject
        from columntorow)tt)tx;

The following result is returned:


+--------+--------+------------+
name subject result
+--------+--------+------------+
Bob chinese 74
Bob mathematics 83
Bob physics 93
Alice chinese 74
Alice mathematics 84
Alice physics 94
+--------+--------+------------+
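Similarly, you can run only the inner queries of Method 2 to inspect the rows that TRANS_ARRAY produces before SPLIT_PART splits each pair. A minimal sketch (not part of the original example):

select trans_array(1,';',name,subject) as (name,subject)
from(
    select name,
           concat('chinese',':',chinese,';','mathematics',':',mathematics,';','physics',':',physics) as subject
    from columntorow) tt;

-- Each source row expands into three rows, for example:
-- Bob    chinese:74
-- Bob    mathematics:83
-- Bob    physics:93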

1.8. JOIN operations in MaxCompute SQL
This topic describes the JOIN operations that MaxCompute SQL supports.

Overview
The following table describes the JOIN operations that MaxCompute SQL supports.

Operation         Description

INNER JOIN        Returns the rows that have matching column values in both the left table and the right table based on the join condition.

LEFT JOIN         Returns all the rows from the left table and the matched rows from the right table based on the join condition. If a row in the left table has no matching rows in the right table, NULL values are returned in the columns from the right table in the result set.

RIGHT JOIN        Returns all the rows from the right table and the matched rows from the left table based on the join condition. If a row in the right table has no matching rows in the left table, NULL values are returned in the columns from the left table in the result set.

FULL JOIN         Returns all the rows in both the left table and the right table, whether the join condition is met or not. In the result set, NULL values are returned in the columns from the table that lacks a matching row in the other table.

LEFT SEMI JOIN    Returns only the rows in the left table that have a matching row in the right table.

LEFT ANTI JOIN    Returns only the rows in the left table that have no matching rows in the right table.


The ON clause and the WHERE clause can be used in the same SQL statement. For example, consider the following SQL statement:

(SELECT * FROM A WHERE {subquery_where_condition}) A
JOIN
(SELECT * FROM B WHERE {subquery_where_condition}) B
ON {on_condition}
WHERE {where_condition}

The conditions in the preceding SQL statement are evaluated in the following order:

1. The {subquery_where_condition} condition in the WHERE clause of the subqueries
2. The {on_condition} condition in the ON clause
3. The {where_condition} condition in the WHERE clause after the JOIN clause

Therefore, a JOIN operation may return different results, depending on whether the filter conditions are specified in {subquery_where_condition}, {on_condition}, or {where_condition}. For more information, see Case-by-case study.

Test tables
Table A
Execute the following statement to create Table A:

CREATE TABLE A AS SELECT * FROM VALUES (1, 20180101),(2, 20180101),(2, 20180102) t (key, ds);

Table A has the following three rows and is used as the left table for all JOIN operations in this topic.

key ds

1 20180101

2 20180101

2 20180102

Table B
Execute the following statement to create Table B:

CREATE TABLE B AS SELECT * FROM VALUES (1, 20180101),(3, 20180101),(2, 20180102) t (key, ds);

Table B has the following three rows and is used as the right table for all JOIN operations in this topic.

key ds

1 20180101

3 20180101

2 20180102


Cartesian product of Table A and Table B

a.key a.ds b.key b.ds

1 20180101 1 20180101

1 20180101 3 20180101

1 20180101 2 20180102

2 20180101 1 20180101

2 20180101 3 20180101

2 20180101 2 20180102

2 20180102 1 20180101

2 20180102 3 20180101

2 20180102 2 20180102
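If you want to reproduce this Cartesian product yourself, a minimal sketch (not part of the original case study) is a join whose condition is always true:

SELECT A.key, A.ds, B.key, B.ds
FROM A JOIN B
ON 1 = 1;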

Case-by-case study
INNER JOIN
An INNER JOIN operation first takes the Cartesian product of the rows in Table A and Table B and returns the rows that have matching column values in Table A and Table B based on the ON clause.
Conclusion: An INNER JOIN operation returns the same results regardless of whether the filter conditions are specified in {subquery_where_condition}, {on_condition}, or {where_condition}.

Case 1: Specify the filter conditions in the {subquery_where_condition} clause, as shown in the following statement:

SELECT A.*, B.*
FROM
(SELECT * FROM A WHERE ds='20180101') A
JOIN
(SELECT * FROM B WHERE ds='20180101') B
ON a.key = b.key;

The following table lists the results that the preceding statement returns.

a.key a.ds b.key b.ds

1 20180101 1 20180101


Case 2: Specify the filter conditions in the {on_condition} clause, as shown in the following statement:

SELECT A.*, B.*
FROM A JOIN B
ON a.key = b.key and A.ds='20180101' and B.ds='20180101';

The Cartesian product of Table A and Table B contains nine rows, of which only one meets the join condition. The following table lists the results that the preceding statement returns.

a.key a.ds b.key b.ds

1 20180101 1 20180101

Case 3: Specify the filter conditions in the WHERE clause after the ON clause, as shown in the following statement:

SELECT A.*, B.*
FROM A JOIN B
ON a.key = b.key
WHERE A.ds='20180101' and B.ds='20180101';

The Cartesian product of Table A and Table B contains nine rows, of which only three meet the join condition. The following table lists the result set.

a.key a.ds b.key b.ds

1 20180101 1 20180101

2 20180102 2 20180102

2 20180101 2 20180102

The query processor then filters the preceding result set based on the A.ds='20180101' and B.ds='20180101' filter condition. The following table lists the results that the preceding statement returns.

a.key a.ds b.key b.ds

1 20180101 1 20180101

LEFT JOIN
A LEFT JOIN operation first takes the Cartesian product of the rows in Table A and Table B and returns all the rows of Table A and the rows in Table B that meet the join condition. If the join condition finds no matching rows in Table B for a row in Table A, the row in Table A is returned in the result set with NULL values in each column from Table B.
Conclusion: A LEFT JOIN operation may return different results, depending on whether the filter conditions are specified in {subquery_where_condition}, {on_condition}, or {where_condition}:

The operation returns the same results, regardless of whether the filter condition for Table A is specified in {subquery_where_condition} or {where_condition}.


The operation returns the same results, regardless of whether the filter condition for Table B is specified in {subquery_where_condition} or {on_condition}.

Case 1: Specify the filter conditions in the {subquery_where_condition} clause, as shown in the following statement:

SELECT A.*, B.*
FROM
(SELECT * FROM A WHERE ds='20180101') A
LEFT JOIN
(SELECT * FROM B WHERE ds='20180101') B
ON a.key = b.key;

The following table lists the results that the preceding statement returns.

a.key a.ds b.key b.ds

1 20180101 1 20180101

2 20180101 NULL NULL

Case 2: Specify the filter conditions in the {on_condition} clause, as shown in the following statement:

SELECT A.*, B.*
FROM A LEFT JOIN B
ON a.key = b.key and A.ds='20180101' and B.ds='20180101';

The Cartesian product of Table A and Table B contains nine rows, of which only one meets the join condition. The other two rows in Table A do not have matching rows in Table B. Therefore, NULL values are returned in the columns from Table B for these two rows. The following table lists the results that the preceding statement returns.

a.key a.ds b.key b.ds

1 20180101 1 20180101

2 20180101 NULL NULL

2 20180102 NULL NULL


Case 3: Specify the filter conditions in the WHERE clause after the ON clause, as shown in the following statement:

SELECT A.*, B.*
FROM A LEFT JOIN B
ON a.key = b.key
WHERE A.ds='20180101' and B.ds='20180101';

The Cartesian product of Table A and Table B contains nine rows, of which only three meet the join condition. The following table lists the result set.

a.key a.ds b.key b.ds

1 20180101 1 20180101

2 20180101 2 20180102

2 20180102 2 20180102

The query processor then filters the preceding result set based on the A.ds='20180101' and B.ds='20180101' filter condition. The following table lists the results that the preceding statement returns.

a.key a.ds b.key b.ds

1 20180101 1 20180101

RIGHT JOIN
A RIGHT JOIN operation is similar to a LEFT JOIN operation, except that the two tables are used in a reversed manner. A RIGHT JOIN operation returns all the rows of Table B and the rows in Table A that meet the join condition.

Conclusion: A RIGHT JOIN operation may return different results, depending on whether the filter conditions are specified in {subquery_where_condition}, {on_condition}, or {where_condition}.

The operation returns the same results, regardless of whether the filter condition for Table B is specified in {subquery_where_condition} or {where_condition}.
The operation returns the same results, regardless of whether the filter condition for Table A is specified in {subquery_where_condition} or {on_condition}.

FULL JOIN
A FULL JOIN operation takes the Cartesian product of the rows in Table A and Table B and returns all the rows in Table A and Table B, whether the join condition is met or not. In the result set, NULL values are returned in the columns from the table that lacks a matching row in the other table.
Conclusion: A FULL JOIN operation may return different results, depending on whether the filter conditions are specified in {subquery_where_condition}, {on_condition}, or {where_condition}.


Case 1: Specify the filter conditions in the {subquery_where_condition} clause, as shown in the following statement:

SELECT A.*, B.*
FROM
(SELECT * FROM A WHERE ds='20180101') A
FULL JOIN
(SELECT * FROM B WHERE ds='20180101') B
ON a.key = b.key;

The following table lists the results that the preceding statement returns.

a.key a.ds b.key b.ds

1 20180101 1 20180101

2 20180101 NULL NULL

NULL NULL 3 20180101

Case 2: Specify the filter conditions in the {on_condition} clause, as shown in the following statement:

SELECT A.*, B.*
FROM A FULL JOIN B
ON a.key = b.key and A.ds='20180101' and B.ds='20180101';

The Cartesian product of Table A and Table B contains nine rows, of which only one meets the join condition. In the result set, for the two rows in Table A that match no rows in Table B, NULL values are returned in the columns from Table B. For the two rows in Table B that match no rows in Table A, NULL values are returned in the columns from Table A. The following table lists the results that the preceding statement returns.

a.key a.ds b.key b.ds

1 20180101 1 20180101

2 20180101 NULL NULL

2 20180102 NULL NULL

NULL NULL 3 20180101

NULL NULL 2 20180102


Case 3: Specify the filter conditions in the WHERE clause after the ON clause, as shown in the following statement:

SELECT A.*, B.*
FROM A FULL JOIN B
ON a.key = b.key
WHERE A.ds='20180101' and B.ds='20180101';

The Cartesian product of Table A and Table B contains nine rows, of which only three meet the join condition.

a.key a.ds b.key b.ds

1 20180101 1 20180101

2 20180101 2 20180102

2 20180102 2 20180102

The row in Table B that has no matching rows in Table A is also returned in the result set, with NULL values in the columns from Table A for that row. The following table lists the result set.

a.key a.ds b.key b.ds

1 20180101 1 20180101

2 20180101 2 20180102

2 20180102 2 20180102

NULL NULL 3 20180101

The query processor then filters the preceding result set based on the A.ds='20180101' and B.ds='20180101' filter condition. The following table lists the results that the preceding statement returns.

a.key a.ds b.key b.ds

1 20180101 1 20180101

LEFT SEMI JOIN
A LEFT SEMI JOIN operation returns only the rows in Table A that have a matching row in Table B. A LEFT SEMI JOIN operation does not return rows from Table B. Therefore, you cannot specify a filter condition for Table B in the WHERE clause after the ON clause.
Conclusion: A LEFT SEMI JOIN operation returns the same results regardless of whether the filter conditions are specified in {subquery_where_condition}, {on_condition}, or {where_condition}.


Case 1: Specify the filter conditions in the {subquery_where_condition} clause, as shown in the following statement:

SELECT A.*
FROM
(SELECT * FROM A WHERE ds='20180101') A
LEFT SEMI JOIN
(SELECT * FROM B WHERE ds='20180101') B
ON a.key = b.key;

The following table lists the results that the preceding statement returns.

a.key a.ds

1 20180101

Case 2: Specify the filter conditions in the {on_condition} clause, as shown in the following statement:

SELECT A.*
FROM A LEFT SEMI JOIN B
ON a.key = b.key and A.ds='20180101' and B.ds='20180101';

The following table lists the results that the preceding statement returns.

a.key a.ds

1 20180101

Case 3: Specify the filter conditions in the WHERE clause after the ON clause, as shown in the following statement:

SELECT A.*
FROM A LEFT SEMI JOIN
(SELECT * FROM B WHERE ds='20180101') B
ON a.key = b.key
WHERE A.ds='20180101';

The following table lists the result set.

a.key a.ds

1 20180101

The query processor then filters the preceding result set based on the A.ds='20180101' filter condition. The following table lists the results that the preceding statement returns.

a.key a.ds

1 20180101

LEFT ANTI JOIN


A LEFT ANTI JOIN operation returns only the rows in Table A that have no matching rows in Table B. A LEFT ANTI JOIN operation does not return rows from Table B. Therefore, you cannot specify a filter condition for Table B in the WHERE clause after the ON clause. A LEFT ANTI JOIN operation is usually used to replace the NOT EXISTS syntax, as shown in the sketch below.
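A minimal sketch of that rewrite, using the test tables in this topic (the NOT EXISTS form is shown only for comparison and assumes a MaxCompute version that supports correlated subqueries):

-- Rows in Table A that have no match in Table B, written with NOT EXISTS:
SELECT A.*
FROM A
WHERE NOT EXISTS (SELECT 1 FROM B WHERE B.key = A.key);

-- The same result written with LEFT ANTI JOIN:
SELECT A.*
FROM A LEFT ANTI JOIN B
ON a.key = b.key;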

Conclusion: A LEFT ANTI JOIN operation may return different results, depending on whether the filter conditions are specified in {subquery_where_condition}, {on_condition}, or {where_condition}.

The operation returns the same results, regardless of whether the filter condition for Table A is specified in {subquery_where_condition} or {where_condition}.
The operation returns the same results, regardless of whether the filter condition for Table B is specified in {subquery_where_condition} or {on_condition}.

Case 1: Specify the filter conditions in the {subquery_where_condition} clause, as shown in the following statement:

SELECT A.*
FROM
(SELECT * FROM A WHERE ds='20180101') A
LEFT ANTI JOIN
(SELECT * FROM B WHERE ds='20180101') B
ON a.key = b.key;

The following table lists the results that the preceding statement returns.

a.key a.ds

2 20180101

Case 2: Specify the filter conditions in the {on_condition} clause, as shown in the following statement:

SELECT A.*
FROM A LEFT ANTI JOIN B
ON a.key = b.key and A.ds='20180101' and B.ds='20180101';

The following table lists the results that the preceding statement returns.

a.key a.ds

2 20180101

2 20180102


Case 3: Specify the filter conditions in the WHERE clause after the ON clause, as shown in the following statement:

SELECT A.*
FROM A LEFT ANTI JOIN
(SELECT * FROM B WHERE ds='20180101') B
ON a.key = b.key
WHERE A.ds='20180101';

The following table lists the result set.

a.key a.ds

2 20180101

2 20180102

The query processor then filters the preceding result set based on the A.ds='20180101' filter condition. The following table lists the results that the preceding statement returns.

a.key a.ds

2 20180101

Usage notes
For an INNER JOIN operation or a LEFT SEMI JOIN operation, an SQL statement returns the same results regardless of where you specify the filter conditions for the left table and the right table.
For a LEFT JOIN operation or a LEFT ANTI JOIN operation, the filter condition for the left table functions the same whether it is specified in {subquery_where_condition} or {where_condition}. The filter condition for the right table functions the same whether it is specified in {subquery_where_condition} or {on_condition}.
For a RIGHT JOIN operation, the filter condition for the left table functions the same whether it is specified in {subquery_where_condition} or {on_condition}. The filter condition for the right table functions the same whether it is specified in {subquery_where_condition} or {where_condition}.

For a FULL OUTER JOIN operation, filter conditions can be specified only in {subquery_where_condition}.


2.Data migration
2.1. Overview
This topic describes the best practices for data migration, including migrating business data or log data from other business platforms to MaxCompute and migrating data from MaxCompute to other business platforms.

Background information
Traditional relational databases are not suitable for processing large amounts of data. If you have a large amount of data stored in a traditional relational database, you can migrate the data to MaxCompute.

MaxCompute provides a comprehensive set of data migration solutions and a variety of classic distributed computing models, which allow you to store large amounts of data and compute the data fast. By using MaxCompute, you can efficiently save costs for your enterprise.
DataWorks provides comprehensive features for MaxCompute, such as data integration, data analytics, data management, and data administration. Among these features, data integration enables stable, efficient, and scalable data synchronization.

Best practices
Migrate business data from other business platforms to MaxCompute
Migrate data across DataWorks workspaces. For more information, see Migrate data across DataWorks workspaces.
Migrate data from Hadoop to MaxCompute. For more information, see Best practices of migrating data from Hadoop to MaxCompute. For more information about the issues that you may encounter during data and script migration and the corresponding solutions, see Practices of migrating data from a user-created Hadoop cluster to MaxCompute.
Migrate data from Oracle to MaxCompute. For more information, see Migrate data from Oracle to MaxCompute.
Migrate data from a Kafka cluster to MaxCompute. For more information, see Migrate data from a Kafka cluster to MaxCompute.
Migrate data from an Elasticsearch cluster to MaxCompute. For more information, see Migrate data from an Elasticsearch cluster to MaxCompute.
Migrate data from RDS to MaxCompute. For more information, see Migrate data from RDS to MaxCompute to implement dynamic partitioning.
Migrate JSON data from Object Storage Service (OSS) to MaxCompute. For more information, see Migrate JSON data from OSS to MaxCompute.
Migrate JSON data from MongoDB to MaxCompute. For more information, see Migrate JSON data from MongoDB to MaxCompute.
Migrate data from a user-created MySQL database on an Elastic Compute Service (ECS) instance to MaxCompute. For more information, see Migrate data from a user-created MySQL database on an ECS instance to MaxCompute.

Migrate log data from other business platforms to MaxCompute


Use Tunnel to migrate log data to MaxCompute. For more information, see Use Tunnel to upload log data to MaxCompute.
Use DataHub to migrate log data to MaxCompute. For more information, see Use DataHub to migrate log data to MaxCompute.
Use DataWorks to migrate log data to MaxCompute. For more information, see Use DataWorks Data Integration to migrate log data to MaxCompute.

Migrate data from MaxCompute to other business platforms

Migrate data from MaxCompute to OSS. For more information, see Migrate data from MaxCompute to OSS.
Migrate data from MaxCompute to Tablestore. For more information, see Migrate data from MaxCompute to Tablestore.

After the business data and log data are processed by MaxCompute, you can use Quick BI to present the data processing results in a visualized manner. For more information, see Best practices of using MaxCompute to process data and Quick BI to present the data processing results.

2.2. Migrate data across DataWorks workspaces
This topic describes how to migrate data across DataWorks workspaces in the same region.

Prerequisites
All the steps in the tutorial Build an online operation analysis platform are completed. For more information, see Business scenarios and development process.

Context
This topic uses the bigdata_DOC workspace created in the tutorial Build an online operation analysis platform as the source workspace. You need to create a destination workspace to store the tables, resources, configurations, and data synchronized from the source workspace.

Procedure
1. Create a destination workspace.
i. Log on to the DataWorks console. In the left-side navigation pane, click Workspaces.
ii. On the Workspaces page that appears, select the China (Hangzhou) region in the upper-left corner and click Create Workspace.
iii. In the Create Workspace pane that appears, set the parameters in the Basic Settings step and click Next.

Basic Settings parameters:

Basic Information section
Workspace Name: The name of the workspace. The name must be 3 to 23 characters in length and can contain letters, underscores (_), and digits. The name must start with a letter.
Display Name: The display name of the workspace. The display name can be a maximum of 23 characters in length. It can contain letters, underscores (_), and digits and must start with a letter.
Mode: The mode of the workspace. Valid values: Basic Mode (Production Environment Only) and Standard Mode (Development and Production Environments).
    Basic Mode (Production Environment Only): A workspace in basic mode is associated with only one MaxCompute project. Workspaces in basic mode do not isolate the development environment from the production environment. In these workspaces, you can perform only basic data development and cannot strictly control the data development process and the permissions on tables.
    Standard Mode (Development and Production Environments): A workspace in standard mode is associated with two MaxCompute projects. One serves as the development environment, and the other serves as the production environment. Workspaces in standard mode allow you to develop code in a standard way and strictly control the permissions on tables. These workspaces impose limits on table operations in the production environment for data security.
    For more information, see Basic mode and standard mode.
Description: The description of the workspace.

Advanced Settings section
Download SELECT Query Result: Specifies whether the query results that are returned by SELECT statements in DataStudio can be downloaded. If you turn off this switch, the query results cannot be downloaded. You can change the setting of this parameter for the workspace in the Workspace Settings panel after the workspace is created. For more information, see Configure security settings.


The source workspace bigdata_DOC is in basic mode. For convenience, set Mode to Basic Mode (Production Environment Only) in the Basic Settings step when you create the destination workspace.

Set Workspace Name to a globally unique name. We recommend that you use a name that is easy to distinguish. In this example, set Workspace Name to clone_test_doc.
iv. In the Select Engines and Services step, select the MaxCompute check box, select Pay-As-You-Go in the Compute Engines section, and click Next.
v. In the Engine Details step, set the required parameters and click Create Workspace.

Parameters for the MaxCompute compute engine:

Instance Display Name: The display name of the compute engine instance. The display name must be 3 to 27 characters in length and can contain only letters, underscores (_), and digits. It must start with a letter.
MaxCompute Project Name: The name of the MaxCompute project. By default, the name is the same as that of the DataWorks workspace.
Account for Accessing MaxCompute: The identity used to access the MaxCompute project. For the development environment, the value is fixed to Task owner. For the production environment, the valid values are Alibaba Cloud primary account and Alibaba Cloud sub-account.
Resource Group: The quotas of computing resources and disk space for the compute engine instance.

2. Clone node configurations and resources across workspaces.
You can use the cross-workspace cloning feature of DataWorks to clone the node configurations and resources from the bigdata_DOC workspace to the clone_test_doc workspace. For more information, see Clone nodes across workspaces.

Note
The cross-workspace cloning feature cannot clone table schemas or data.
The cross-workspace cloning feature cannot clone combined nodes. If the destination workspace needs to use the combined nodes that exist in the source workspace, you need to manually create the combined nodes in the destination workspace.

i. Go to the bigdata_DOC workspace and click Cross-project cloning in the upper-right corner. The Create Clone Task page appears.


ii. Set Target Workspace to clone_test_doc and Workflow to the Workshop workflow that needs to be cloned. Select all the nodes in the workflow and click Add to List. Click To-Be-Cloned Node List in the upper-right corner.
iii. In the Nodes to Clone pane that appears, click Clone All. The selected nodes are cloned to the clone_test_doc workspace.
iv. Go to the destination workspace and check whether the nodes are cloned.
3. Create tables.
The cross-workspace cloning feature cannot clone table schemas. Therefore, you need to manually create the required tables in the destination workspace.
For non-partitioned tables, we recommend that you use the following SQL statement to synchronize the table schema from the source workspace:

create table table_name as select * from Source workspace.Table name;

For partitioned tables, we recommend that you use the following SQL statement to synchronize the table schema from the source workspace:

create table table_name partitioned by (Partition key column string);

Commit the tables to the production environment. For more information, see Create tables. A concrete example of these statements is shown below.
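For example, a minimal sketch that applies these statements to the rpt_user_trace_log table used later in this topic (the column list and the dt partition key are placeholders inferred from the verification query in this topic; adjust them to the actual schema):

-- Non-partitioned table: copy the schema (and data) from the source project.
create table rpt_user_trace_log_copy as
select * from bigdata_DOC.rpt_user_trace_log;

-- Partitioned table: create an empty table with the same columns and a dt partition key,
-- then load the data separately, for example with the batch sync node described below.
create table rpt_user_trace_log (
    uid string  -- placeholder column; list the real columns of the source table here
) partitioned by (dt string);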

4. Synchronize data.
The cross-workspace cloning feature cannot clone data from the source workspace to the destination workspace. You need to manually synchronize the required data to the destination workspace. To synchronize the data of the rpt_user_trace_log table from the source workspace to the destination workspace, follow these steps:
i. Create a connection.
a. Go to the Data Integration page and click Connection in the left-side navigation pane.
b. On the Data Source page that appears, click Add a Connection in the upper-right corner. In the Add Connection dialog box that appears, select MaxCompute(ODPS) in the Big Data Storage section.
c. In the Add MaxCompute(ODPS) Connection dialog box that appears, set Connection Name, MaxCompute Project Name, AccessKey ID, and AccessKey Secret, and click Complete. For more information, see Add a MaxCompute data source.
ii. Create a batch sync node.
a. Go to the DataStudio page, click the Data Analytics tab, and then click Workshop under Business Flow. Right-click Data Integration and choose Create > Batch Synchronization to create a batch sync node.
b. On the configuration tab of the batch sync node, set the required parameters. In this example, set Connection under Source to bigdata_DOC and Connection under Target to odps_first. Set Table to rpt_user_trace_log. After the configuration is complete, click the Properties tab in the right-side navigation pane.
c. Click Use Root Node in the Dependencies section and commit the batch sync node.


iii. Generate retroactive data for the batch sync node.
a. On the DataStudio page, click the DataWorks icon in the upper-left corner and choose All Products > Operation Center.
b. On the page that appears, choose Cycle Task Maintenance > Cycle Task in the left-side navigation pane.
c. On the page that appears, find the batch sync node that you created in the node list and click the node name. On the canvas that appears on the right, right-click the batch sync node and choose Run > Current Node Retroactively.
d. In the Patch Data dialog box that appears, set the required parameters. In this example, set Data Timestamp to Jun 11, 2019 - Jun 17, 2019 to synchronize data from multiple partitions. Click OK.
e. On the Patch Data page that appears, check the running status of the retroactive instances that are generated. If Successful appears in the STATUS column of a retroactive instance, the instance has run and the corresponding data is synchronized.
iv. Verify the data synchronization.
On the Data Analytics tab of the DataStudio page, right-click the Workshop workflow under Business Flow and choose Create > MaxCompute > ODPS SQL to create an ODPS SQL node. On the configuration tab of the ODPS SQL node, run the following SQL statement to check whether the data is synchronized to the destination workspace:

select * from rpt_user_trace_log where dt BETWEEN '20190611' and '20190617';
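Optionally, you can also compare row counts between the destination and the source. A minimal sketch (assuming the query runs in the destination project and the account has read access to the source project bigdata_DOC):

-- Row count in the destination project.
select count(*) from rpt_user_trace_log where dt between '20190611' and '20190617';
-- Row count in the source project, referenced as project_name.table_name.
select count(*) from bigdata_DOC.rpt_user_trace_log where dt between '20190611' and '20190617';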

2.3. Synchronize data from Hadoop to MaxCompute
This topic describes how to use the data synchronization feature of DataWorks to synchronize data from Hadoop Distributed File System (HDFS) to MaxCompute. Data synchronization between MaxCompute and Hadoop or Spark is supported.

Prerequisites
MaxCompute is activated and a MaxCompute project is created.
In this example, a project named bigdata_DOC in the China (Hangzhou) region is used. For more information, see Activate MaxCompute and DataWorks.

A Hadoop cluster is created.
Before you synchronize data, you must make sure that your Hadoop cluster works as expected. In this example, Alibaba Cloud E-MapReduce (EMR) is used to create the Hadoop cluster. For more information, see Create a cluster.
In this example, the following configurations are used for the EMR Hadoop cluster:

EMR version: EMR V3.11.0
Cluster type: Hadoop
Software: HDFS 2.7.2, YARN 2.7.2, Hive 2.3.3, Ganglia 3.7.2, Spark 2.2.1, Hue 4.1.0, Zeppelin 0.7.3, Tez 0.9.1, Sqoop 1.4.6, Pig 0.14.0, ApacheDS 2.0.0, and Knox 0.13.0


The EMR Hadoop cluster is a non-high-availability (HA) cluster that is deployed on the classic network in the China (Hangzhou) region. A public IP address and a private IP address are configured for the Elastic Compute Service (ECS) instance in the master node group of the EMR Hadoop cluster.

Step 1: Prepare test data
1. Create test data in the EMR Hadoop cluster.
i. Log on to the EMR console by using your Alibaba Cloud account.
ii. In the EMR console, click the Data Platform tab. On the Data Platform tab, find the desired project and create a job named doc in the project. In the job that you created, execute a table creation statement to create a table. In this example, the following statement is used to create a table named hive_doc_good_sale in the EMR Hadoop cluster. For more information about how to create an EMR job, see Edit jobs.

CREATE TABLE IF NOT EXISTS hive_doc_good_sale(
    create_time timestamp,
    category STRING,
    brand STRING,
    buyer_id STRING,
    trans_num BIGINT,
    trans_amount DOUBLE,
    click_cnt BIGINT
)
PARTITIONED BY (pt string) ROW FORMAT
DELIMITED FIELDS TERMINATED BY ',' lines terminated by '\n';

iii. Click Run in the upper-right corner of the code editor on the Data Platform tab. If the "Query executed successfully" message appears, the hive_doc_good_sale table is created in the EMR Hadoop cluster.
iv. Insert test data into the table. You can import test data from Object Storage Service (OSS) or other data sources into the table. You can also manually insert test data into the table. In this example, the following statement is used to manually insert test data into the table:

insert into hive_doc_good_sale PARTITION(pt=1) values
('2018-08-21','Coat','Brand A','lilei',3,500.6,7),
('2018-08-22','Fresh food','Brand B','lilei',1,303,8),
('2018-08-22','Coat','Brand C','hanmeimei',2,510,2),
('2018-08-22','Bathroom product','Brand A','hanmeimei',1,442.5,1),
('2018-08-22','Fresh food','Brand D','hanmeimei',2,234,3),
('2018-08-23','Coat','Brand B','jimmy',9,2000,7),
('2018-08-23','Fresh food','Brand A','jimmy',5,45.1,5),
('2018-08-23','Coat','Brand E','jimmy',5,100.2,4),
('2018-08-24','Fresh food','Brand G','peiqi',10,5560,7),
('2018-08-24','Bathroom product','Brand F','peiqi',1,445.6,2),
('2018-08-24','Coat','Brand A','ray',3,777,3),
('2018-08-24','Bathroom product','Brand G','ray',3,122,3),
('2018-08-24','Coat','Brand C','ray',1,62,7);

v. After you insert the data into the table, execute the select * from hive_doc_good_sale where pt=1; statement to check whether the data exists in the table that you created in the EMR Hadoop cluster.

2. Create a MaxCompute table in the DataWorks console.


i.
ii.
iii.
iv.
v.
vi.
vii. In the DDL Statement dialog box, enter the following table creation statement and click Generate Table Schema. In the Confirm message, click OK. In this example, the following table creation statement is used to create a MaxCompute table named hive_doc_good_sale:

CREATE TABLE IF NOT EXISTS hive_doc_good_sale(
    create_time string,
    category STRING,
    brand STRING,
    buyer_id STRING,
    trans_num BIGINT,
    trans_amount DOUBLE,
    click_cnt BIGINT
)
PARTITIONED BY (pt string);

When you create the table, you must consider the mappings between Hive data types and MaxCompute data types. For more information about the mappings, see Data type mappings.
You can also use the MaxCompute client odpscmd to create a MaxCompute table. For more information about how to install and configure the MaxCompute client, see Install and configure the MaxCompute client.

Note: If you need to resolve compatibility issues between Hive data types and MaxCompute data types, we recommend that you run the following commands on the MaxCompute client:

set odps.sql.type.system.odps2=true;
set odps.sql.hive.compatible=true;

viii. Click Commit to Production Environment. The table is created.


ix. In the left-side navigation pane of the DataStudio page, click Workspace Tables. In the Workspace Tables pane, view the MaxCompute table that you created.

Step 2: Synchronize data


1. Create a custom resource group.
In most cases, the network where a MaxCompute project resides is inaccessible to the data nodes in a Hadoop cluster. To resolve this connectivity issue, you can create a custom resource group to run your DataWorks synchronization node on the master node of the Hadoop cluster. In most cases, the master node and the data nodes in a Hadoop cluster are connected.
i. View information about the data nodes of the EMR Hadoop cluster.
a. Log on to the EMR console. Click the Cluster Management tab.
b. On the Cluster Management tab, find the EMR Hadoop cluster that you created and click the name of the cluster. On the Clusters and Services page, click Instances in the left-side navigation pane. On the Instances page, view information about the data nodes of the EMR Hadoop cluster.


You can also click the ECS instance ID of the master node to go to the Instance Details tab of the ECS instance in the ECS console. In the Basic Information section of the Instance Details tab, click Connect to log on to the ECS instance and run the hadoop dfsadmin -report command to view the information about the data nodes.

Note: In this example, each data node has only a private IP address and cannot communicate with the default resource group of DataWorks. Therefore, you must create a custom resource group to run your DataWorks synchronization node on the master node.


ii. Create a custom resource group.
a. Log on to the DataWorks console. In the left-side navigation pane, click Workspaces. On the Workspaces page, find the workspace in which you want to create a custom resource group and click Data Integration in the Actions column. On the Data Integration page, click Custom Resource Group in the left-side navigation pane. On the Custom Resource Groups page, click Add Resource Group in the upper-right corner.

Note: You can perform this step to create a custom resource group only when you use DataWorks Professional Edition or a more advanced edition.

b. When you add a server, enter information such as the UUID of the ECS instance and the server IP address. If the network type is classic network, enter the host name. If the network type is virtual private cloud (VPC), enter the UUID of the ECS instance. You can add scheduling resources whose network type is classic network in DataWorks V2.0 only in the China (Shanghai) region. In other regions, you must add scheduling resources whose network type is VPC, regardless of the network type of your ECS instances.
For the server IP address, enter the public IP address of the master node because the private IP address may be unreachable. To query the UUID of the ECS instance, log on to the master node and run the dmidecode | grep UUID command. You can also use this command to query the UUID of the ECS instance even if your Hadoop cluster is not created by using EMR.

c. After you add the server, you must make sure that the master node and DataWorks are connected. If you add an ECS instance, you must configure a security group for the instance.
If you use a private IP address, add the private IP address to the security group of the ECS instance. For more information, see Configure a security group for an ECS instance where a self-managed data store resides.
If you use a public IP address, configure the Internet inbound and outbound rules in the security group of the ECS instance. In this example, all ports are specified in the configured inbound rules to allow traffic from the Internet. In actual scenarios, we recommend that you configure specific security group rules for security purposes.

d. After you complete the preceding steps, install an agent for the custom resource group as prompted. If the status of the ECS instance is Available, the custom resource group is created.
If the status of the ECS instance is Unavailable, log on to the master node and run the tail -f /home/admin/alisatasknode/logs/heartbeat.log command to check whether the heartbeat packets between DataWorks and the master node timed out.

2. Add data sources.
After you create a workspace in DataWorks and associate a MaxCompute compute engine instance with the workspace, DataWorks creates the default MaxCompute data source odps_first. In this example, the default MaxCompute data source is used. Therefore, you need to add only a Hadoop data source. For more information about how to add a Hadoop data source, see Add an HDFS data source.
i. On the Data Integration page of the DataWorks console, click Data Source in the left-side navigation pane.
ii. On the Data Source page, click Add data source in the upper-right corner.
iii. In the Add data source dialog box, click HDFS in the Semi-structured storage section.
iv. In the Add HDFS data source dialog box, configure the parameters.

Data Source Name: The name of the data source. The name can contain letters, digits, and underscores (_) and must start with a letter.
Data Source Description: The description of the data source. The description cannot exceed 80 characters in length.
Environment: The environment in which the data source is used. Valid values: Development and Production. Note: This parameter is displayed only when the workspace is in standard mode.
DefaultFS: The address of the NameNode in HDFS. If the EMR Hadoop cluster is in HA mode, the address is hdfs://IP address of the emr-header-1 node:8020. If the EMR Hadoop cluster is in non-HA mode, the address is hdfs://IP address of the emr-header-1 node:9000. In this example, the emr-header-1 node is connected to DataWorks over the Internet. Therefore, enter the public IP address and allow traffic from the Internet.

v. Click Test Connectivity.
vi. If the connectivity test is successful, click Complete.

Note: If the network type of the EMR Hadoop cluster is VPC, the connectivity test is not supported.

3. Create and configure a data synchronization node.
i.
ii.


iii.
iv. In the Confirm message, click OK to switch to the code editor.
v. Click the Apply Template icon in the top toolbar.
vi. In the Apply Template dialog box, configure the Source Connection Type, Connection, Target Connection Type, and Connection parameters and click OK.
vii. After the template is applied, the basic settings of HDFS Reader are configured. You can further configure the data source and source table for HDFS Reader based on your business requirements. In this example, the following script is used. For more information, see HDFS Reader.

{
"configuration": {
"reader": {
"plugin": "hdfs",
"parameter": {
"path": "/user/hive/warehouse/hive_doc_good_sale/",
"datasource": "HDFS1",
"column": [
{
"index": 0,
"type": "string"
},
{
"index": 1,
"type": "string"
},
{
"index": 2,
"type": "string"
},
{
"index": 3,
"type": "string"
},
{
"index": 4,
"type": "long"
},
{
"index": 5,
"type": "double"
},
{
"index": 6,
"type": "long"
}
],
"defaultFS": "hdfs://47.100.XX.XXX:9000",
"fieldDelimiter": ",",

58 > Document Version: 20220630


MaxComput e Best Pract ices· Dat a migrat ion

"fieldDelimiter": ",",
"encoding": "UTF-8",
"fileType": "text"
}
},
"writer": {
"plugin": "odps",
"parameter": {
"partition": "pt=1",
"truncate": false,
"datasource": "odps_first",
"column": [
"create_time",
"category",
"brand",
"buyer_id",
"trans_num",
"trans_amount",
"click_cnt"
],
"table": "hive_doc_good_sale"
}
},
"setting": {
"errorLimit": {
"record": "1000"
},
"speed": {
"throttle": false,
"concurrent": 1,
"mbps": "1",
}
}
},
"type": "job",
"version": "1.0"
}

In the preceding script, the path parameter specifies the directory where the source data is stored in the EMR Hadoop cluster. You can log on to the master node and run the hdfs dfs -ls /user/hive/warehouse/hive_doc_good_sale command to check the directory. For a partitioned table, the data synchronization feature of DataWorks can automatically recurse to the partition where the data is stored.

viii. After the configuration is complete, click the Run icon in the top toolbar. If a message indicating that the synchronization node is successfully run appears, the data is synchronized. If a message indicating that the synchronization node failed to run appears, check the logs for troubleshooting.

Step 3: View the result


1. In the left-side navigation pane of the DataStudio page, click Ad Hoc Query.
2. In the Ad Hoc Query pane, create an ODPS SQL node.
3. In the code editor of the created ODPS SQL node, write and execute an SQL statement to view the data that is synchronized to the hive_doc_good_sale table.
Sample statement:

-- Check whether the data is synchronized to MaxCompute.
select * from hive_doc_good_sale where pt=1;

Note: You can also run the select * FROM hive_doc_good_sale where pt=1; command by using the MaxCompute client to query the synchronized data.

If you want to synchronize data from MaxCompute to Hadoop, you can also perform the preceding steps. However, you must exchange the reader and writer in the preceding script. You can use the following script to synchronize data from MaxCompute to Hadoop:

{
"configuration": {
"reader": {
"plugin": "odps",
"parameter": {
"partition": "pt=1",
"isCompress": false,
"datasource": "odps_first",
"column": [
"create_time",
"category",
"brand",
"buyer_id",
"trans_num",
"trans_amount",
"click_cnt"
],
"table": "hive_doc_good_sale"
}
},
"writer": {
"plugin": "hdfs",
"parameter": {
"path": "/user/hive/warehouse/hive_doc_good_sale",
"fileName": "pt=1",
"datasource": "HDFS_data_source",
"column": [
{
"name": "create_time",
"type": "string"
},
{
"name": "category",
"type": "string"
},
{
"name": "brand",
"type": "string"
},
{
"name": "buyer_id",
"type": "string"
},
{
"name": "trans_num",
"type": "BIGINT"
},
{
"name": "trans_amount",
"type": "DOUBLE"
},
{
"name": "click_cnt",
"type": "BIGINT"
}
],
"defaultFS": "hdfs://47.100.XX.XX:9000",
"writeMode": "append",
"fieldDelimiter": ",",
"encoding": "UTF-8",
"fileType": "text"
}
},
"setting": {
"errorLimit": {
"record": "1000"
},
"speed": {
"throttle": false,
"concurrent": 1,
"mbps": "1",
}
}
},
"type": "job",
"version": "1.0"
}

Note: Before you run a synchronization node to synchronize data from MaxCompute to Hadoop, you must configure the Hadoop cluster. For more information, see HDFS Writer. After the synchronization node is run, you can copy the file that is synchronized.

2.4. Best practice to migrate data from Oracle to MaxCompute
This topic describes how to use the data integration feature of DataWorks to migrate data from Oracle to MaxCompute.

Prerequisites
The DataWorks environment is ready.
i. Activate MaxCompute and DataWorks.
ii. Create a workspace. In this example, a workspace in basic mode is used.
iii. A workflow is created in the DataWorks console. For more information, see Create a workflow.
The Oracle database is ready.
In this example, the Oracle database is installed on an Elastic Compute Service (ECS) instance. To enable network communication, you must configure a public IP address for the ECS instance. In addition, you must configure a security group rule for the ECS instance to ensure that the common port 1521 of the Oracle database is accessible. For more information about how to configure a security group rule for an ECS instance, see Modify security group rules.

In this example, the type of the ECS instance is ecs.c5.xlarge. The ECS instance resides in a virtual private cloud (VPC) in the China (Hangzhou) region.

Context
In this example, DataWorks Oracle Reader is used to read the test data from the Oracle database. For more information, see Oracle Reader.

Prepare test data in the Oracle database
1. In the Oracle database, create the DTSTEST.GOOD_SALE table that contains the CREATE_TIME, CATEGORY, BRAND, BUYER_ID, TRANS_NUM, TRANS_AMOUNT, and CLICK_CNT columns.
2. Insert test data into the DTSTEST.GOOD_SALE table. In this example, the following statements are executed to insert the test data:

insert into good_sale values('28-December-19','Kitchenware','Brand A','hanmeimei','6','80.6','4');
insert into good_sale values('21-December-19','Fresh food','Brand B','lilei','7','440.6','5');
insert into good_sale values('29-December-19','Clothing','Brand C','lily','12','351.9','9');
commit;

3. After the data is inserted, execute the following statement to view the data in the table:

select * from good_sale;

Use DataWorks to migrate data from the Oracle database to MaxCompute
1. Go to the DataStudio page.
i. Log on to the DataWorks console.
ii. In the left-side navigation pane, click Workspaces.
iii. Select the region where the required workspace resides. Find the required workspace and click Data Analytics.


2. On the DataStudio page, create a destination table to receive the data migrated from the Oracle database.
i.
ii.
iii.
iv. In the DDL Statement dialog box, enter the following statement and click Generate Table Schema:

CREATE TABLE good_sale
(
    create_time string,
    category string,
    brand string,
    buyer_id string,
    trans_num bigint,
    trans_amount double,
    click_cnt bigint
);

When you create the MaxCompute table, make sure that the data types of the MaxCompute table match those of the Oracle table. For more information about the data types supported by Oracle Reader, see Data types.

v.
3. Create an Oracle connection. For more information, see Add an Oracle data source.
4. Create a batch sync node.
i.
ii.
iii. After you create the batch sync node, set the Connection parameter to the created Oracle connection and the Table parameter to the Oracle table that you created. Click Map Fields with the Same Name. Use the default values for the other parameters.
iv.
v.

Verify the result


1.
2.
3. On the configuration tab of the ODPS SQL node, enter the following statement:

-- Check whether the data is written to MaxCompute.
select * from good_sale;

4.
5.

2.5. Migrate data from Kafka to MaxCompute
This topic describes how to use DataWorks Data Integration to migrate data from a Kafka cluster to MaxCompute.

Prerequisites
MaxCompute is activated. For more information, see Activate MaxCompute and DataWorks.
A workflow is created in DataWorks. In this example, a DataWorks workspace in basic mode is used. For more information, see Create a workflow.
A Kafka cluster is created.
Before data migration, make sure that your Kafka cluster works as expected. In this example, Alibaba Cloud E-MapReduce (EMR) is used to automatically create a Kafka cluster. For more information, see Kafka quick start.

In this example, the following version of EMR Kafka is used:

EMR version: V3.12.1
Cluster type: Kafka
Software: Ganglia 3.7.2, ZooKeeper 3.4.12, Kafka 2.11-1.0.1, and Kafka Manager 1.3.3.16

The Kafka cluster is deployed in a virtual private cloud (VPC) in the China (Hangzhou) region. The Elastic Compute Service (ECS) instances in the primary instance group of the Kafka cluster are configured with public and private IP addresses.

Context
Kafka is distributed middleware that is used to publish and subscribe to messages. Kafka is widely used because of its high performance and high throughput, and it can process millions of messages per second. Kafka is applicable to streaming data processing and is used in scenarios such as user behavior tracing and log collection.

A typical Kafka cluster contains several producers, brokers, and consumers, as well as a ZooKeeper cluster. A Kafka cluster uses ZooKeeper to manage configurations and coordinate services in the cluster.

A topic is the most commonly used collection of messages in a Kafka cluster and is a logical concept for message storage. Topics are not stored on physical disks. Instead, messages in each topic are stored on the disks of each cluster node by partition. Multiple producers can publish messages to a topic, and multiple consumers can subscribe to messages in a topic.

When a message is stored in a partition, the message is allocated an offset. The offset is the unique ID of the message in the partition. The offsets of messages in each partition start from 0.

Step 1: Prepare Kafka data


You must prepare test data in the Kafka cluster. Configure a security group rule for the header node of the EMR cluster to allow requests on TCP ports 22 and 9092. This way, you can log on to the header node of the EMR cluster, and MaxCompute and DataWorks can communicate with the header node.

1. Log on to the header node of the EMR cluster.
i. Log on to the EMR console.


ii. In the top navigation bar, click Cluster Management.
iii. On the page that appears, find the cluster for which you want to prepare test data and go to the details page of the cluster.
iv. On the details page of the cluster, click Instances. Find the IP address of the header node of the EMR cluster and use the IP address to remotely log on to the header node by using Secure Shell (SSH).
2. Create a test topic.
Run the following command to create a test topic named testkafka:

[root@emr-header-1 ~]# kafka-topics.sh --zookeeper emr-header-1:2181/kafka-1.0.1 --partitions 10 --replication-factor 3 --topic testkafka --create
Created topic "testkafka".

3. Write test data.

Run the following command to simulate a producer that writes data to the testkafka topic. Kafka is used to process streaming data. You can continuously write data to the topic. To ensure that the test results are valid, we recommend that you write more than 10 records.

[root@emr-header-1 ~]# kafka-console-producer.sh --broker-list emr-header-1:9092 --topic testkafka
>123
>abc
>

To simulate a consumer and check whether the data has been written to Kafka, open another SSH window and run the following command. If the data that was written appears, the data has been written to the topic.

[root@emr-header-1 ~]# kafka-console-consumer.sh --bootstrap-server emr-header-1:9092 --topic testkafka --from-beginning
123
abc

Step 2: Create a destination table in DataWorks
Create a destination table in DataWorks to receive data from Kafka.

1.
2.
3.
4. Click DDL Statement. In the DDL Statement dialog box, enter the following CREATE TABLE statement and click Generate Table Schema:


CREATE TABLE testkafka
(
    key string,
    value string,
    partition1 string,
    timestamp1 string,
    offset string,
    t123 string,
    event_id string,
    tag string
);

Each column in the statement corresponds to a default column of Kafka Reader that is provided by DataWorks Data Integration:
__key__: the key of the message.
__value__: the complete content of the message.
__partition__: the partition where the message resides.
__headers__: the header of the message.
__offset__: the offset of the message.
__timestamp__: the timestamp of the message.

You can also customize a column. For more information, see Kafka Reader.
5.

Step 3: Synchronize the data
1. Create an exclusive resource group for Data Integration.

The Kafka plug-in cannot run on the default resource group of DataWorks as expected. You must use an exclusive resource group for Data Integration to synchronize data. For more information, see Create and use an exclusive resource group for Data Integration.

2.
3.
4.
5. Configure the script. In this example, enter the following code:

{
"type": "job",
"steps": [
{
"stepType": "kafka",
"parameter": {
"server": "47.xxx.xxx.xxx:9092",
"kafkaConfig": {
"group.id": "console-consumer-83505"
},
"valueType": "ByteArray",
"column": [
"__key__",
"__value__",
"__partition__",

66 > Document Version: 20220630


MaxComput e Best Pract ices· Dat a migrat ion

"__partition__",
"__timestamp__",
"__offset__",
"'123'",
"event_id",
"tag.desc"
],
"topic": "testkafka",
"keyType": "ByteArray",
"waitTime": "10",
"beginOffset": "0",
"endOffset": "3"
},
"name": "Reader",
"category": "reader"
},
{
"stepType": "odps",
"parameter": {
"partition": "",
"truncate": true,
"compress": false,
"datasource": "odps_first",
"column": [
"key",
"value",
"partition1",
"timestamp1",
"offset",
"t123",
"event_id",
"tag"
],
"emptyAsNull": false,
"table": "testkafka"
},
"name": "Writer",
"category": "writer"
}
],
"version": "2.0",
"order": {
"hops": [
{
"from": "Reader",
"to": "Writer"
}
]
},
"setting": {
"errorLimit": {
"record": ""
},
"speed": {
"throttle": false,

> Document Version: 20220630 67


Best Pract ices· Dat a migrat ion MaxComput e

"throttle": false,
"concurrent": 1,
}
}
}

To view the value of the group.id parameter and the names of consumer groups, run the kafka-consumer-groups.sh --bootstrap-server emr-header-1:9092 --list command on the header node.

[root@emr-header-1 ~]# kafka-consumer-groups.sh --bootstrap-server emr-header-1:9092 --list
Note: This will not show information about old Zookeeper-based consumers.
_emr-client-metrics-handler-group
console-consumer-69493
console-consumer-83505
console-consumer-21030
console-consumer-45322
console-consumer-14773

In this example, console-consumer-83505 is used. Run the kafka-consumer-groups.sh --bootstrap-server emr-header-1:9092 --describe --group console-consumer-83505 command on the header node to obtain the values of the beginOffset and endOffset parameters.

[root@emr-header-1 ~]# kafka-consumer-groups.sh --bootstrap-server emr-header-1:9092 --describe --group console-consumer-83505
Note: This will not show information about old Zookeeper-based consumers.
Consumer group 'console-consumer-83505' has no active members.
TOPIC      PARTITION  CURRENT-OFFSET  LOG-END-OFFSET  LAG  CONSUMER-ID  HOST  CLIENT-ID
testkafka  6          0               0               0    -            -     -
test       6          3               3               0    -            -     -
testkafka  0          0               0               0    -            -     -
testkafka  1          1               1               0    -            -     -
testkafka  5          0               0               0    -            -     -

6. Configure a resource group for scheduling.
i. On the node configuration tab, click the Properties tab in the right-side navigation pane.
ii. In the Resource Group section, set the Resource Group parameter to the exclusive resource group for Data Integration that you have created.

Note Assume that you want to write Kafka data to MaxCompute at a regular interval, for example, on an hourly basis. You can use the beginDateTime and endDateTime parameters to set the interval for data reading to 1 hour. Then, the data integration node is scheduled to run once per hour. For more information, see Kafka Reader.

7.


8.

What's next
You can create a data development job and run SQL statements to check whether the data has been synchronized from Message Queue for Apache Kafka to the current table. This topic uses the select * from testkafka statement as an example. Specific steps are as follows:

1. In the left-side navigation pane, choose Data Development > Business Flow.
2. Right-click and choose Data Development > Create Data Development Node ID > ODPS SQL.
3. In the Create Node dialog box, enter the node name, and then click Submit.
4. On the page of the created node, enter select * from testkafka and then click the Run icon.
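If you only need to confirm how many records have been synchronized rather than inspect each row, a row count is sufficient. The following query is a minimal sketch that you can run on the same ODPS SQL node:

SELECT COUNT(*) FROM testkafka;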

2.6. Migrate data from Elasticsearch to MaxCompute
This topic describes how to use the data synchronization feature of DataWorks to migrate data from an Alibaba Cloud Elasticsearch cluster to MaxCompute.

Prerequisites
MaxCompute is activated. For more information, see Activate MaxCompute and DataWorks.
DataWorks is activated.
A workflow is created in DataWorks. In this example, a DataWorks workspace in basic mode is used. For more information, see Create a workflow.
An Alibaba Cloud Elasticsearch cluster is created.

Before you migrate data, make sure that your Alibaba Cloud Elasticsearch cluster works as expected. For more information about how to create an Alibaba Cloud Elasticsearch cluster, see Quick start.

An Alibaba Cloud Elasticsearch cluster with the following configurations is used in this example:

Region: China (Shanghai)
Zone: Zone B
Version: Elasticsearch 5.5.3 with Commercial Feature

Context
Elasticsearch is a Lucene-based search server. It provides a distributed multi-tenant search engine that supports full-text search. Elasticsearch is an open source service that complies with the Apache open standards. It is a mainstream enterprise-class search engine.

Alibaba Cloud Elasticsearch includes Elasticsearch 5.5.3 with Commercial Feature, Elasticsearch 6.3.2 with Commercial Feature, and Elasticsearch 6.7.0 with Commercial Feature. It also contains the commercial X-Pack plug-in. You can use Alibaba Cloud Elasticsearch in scenarios such as data analysis and search. Based on open source Elasticsearch, Alibaba Cloud Elasticsearch provides enterprise-class access control, security monitoring and alerting, and automatic reporting.

Procedure
1. Create a source table in Elasticsearch. For more information, see Use DataWorks to synchronize data from MaxCompute to an Alibaba Cloud Elasticsearch cluster.


2. Create a destination table in MaxCompute.
i.
ii.
iii.
iv.
v. In the DDL Statement dialog box, enter the following CREATE TABLE statement and click Generate Table Schema:

create table elastic2mc_bankdata
(
    age string,
    job string,
    marital string,
    education string,
    default string,
    housing string,
    loan string,
    contact string,
    month string,
    day_of_week string,
    duration string,
    campaign string,
    pdays string,
    previous string,
    poutcome string,
    emp_var_rate string,
    cons_price_idx string,
    cons_conf_idx string,
    euribor3m string,
    nr_employed string,
    y string
);

vi.
3. Synchronize data.
i.
ii.
iii.
iv.
v.
vi. Configure the script.
In this example, enter the following code. For more information about the code description, see Elasticsearch Reader.

{
"type": "job",
"steps": [
{
"stepType": "elasticsearch",
"parameter": {
"retryCount": 3,
"column": [
"age",
"job",
"marital",
"education",
"default",
"housing",
"loan",
"contact",
"month",

"day_of_week",
"duration",
"campaign",
"pdays",
"previous",
"poutcome",
"emp_var_rate",
"cons_price_idx",
"cons_conf_idx",
"euribor3m",
"nr_employed",
"y"
],
"scroll": "1m",
"index": "es_index",
"pageSize": 1,
"sort": {
"age": "asc"
},
"type": "elasticsearch",
"connTimeOut": 1000,
"retrySleepTime": 1000,
"endpoint": "http://es-cn-xxxx.xxxx.xxxx.xxxx.com:9200",
"password": "xxxx",
"search": {
"match_all": {}
},
"readTimeOut": 5000,
"username": "xxxx"
},
"name": "Reader",
"category": "reader"
},
{
"stepType": "odps",
"parameter": {
"partition": "",
"truncate": true,
"compress": false,
"datasource": "odps_first",
"column": [
"age",
"job",
"marital",
"education",
"default",
"housing",
"loan",
"contact",
"month",
"day_of_week",
"duration",
"campaign",
"pdays",

"previous",
"poutcome",
"emp_var_rate",
"cons_price_idx",
"cons_conf_idx",
"euribor3m",
"nr_employed",
"y"
],
"emptyAsNull": false,
"table": "elastic2mc_bankdata"
},
"name": "Writer",
"category": "writer"
}
],
"version": "2.0",
"order": {
"hops": [
{
"from": "Reader",
"to": "Writer"
}
]
},
"setting": {
"errorLimit": {
"record": "0"
},
"speed": {
"throttle": false,
"concurrent": 1,
"dmu": 1
}
}
}

Note On the Basic Information page of the created Alibaba Cloud Elasticsearch cluster, you can view the public IP address and port number in the Public Network Access and Public Network Port fields.

vii. Click the icon to run the code.
viii. You can view the running result on the Runtime Log tab.
4. View the result.
i.
ii.
iii. On the configuration tab of the ODPS SQL node, enter the following statement:

SELECT * FROM elastic2mc_bankdata;


iv.
v.

2.7. Migrate JSON-formatted data from MongoDB to MaxCompute
This topic describes how to use the Data Integration service of DataWorks to migrate JSON-formatted fields from MongoDB to MaxCompute.

Prerequisites
MaxCompute is activated. For more information, see Activate MaxCompute and DataWorks.
DataWorks is activated.
A workflow is created in DataWorks. In this example, a DataWorks workspace in basic mode is used. For more information, see Create a workflow.

Prepare test data in MongoDB
1. Prepare an account.
Create a user in your database to prepare the information that is required to create a connection in DataWorks. In this example, run the following command:

db.createUser({user:"bookuser",pwd:"123456",roles:["root"]})

The username is bookuser, the password is 123456, and the permission is root.

2. Prepare data.

Upload the data to the MongoDB database. In this example, an ApsaraDB for MongoDB instance in a virtual private cloud (VPC) is used. You must apply for a public endpoint for the ApsaraDB for MongoDB instance so that it can communicate with the default resource group of DataWorks. The following test data is uploaded:


{
"store": {
"book": [
{
"category": "reference",
"author": "Nigel Rees",
"title": "Sayings of the Century",
"price": 8.95
},
{
"category": "fiction",
"author": "Evelyn Waugh",
"title": "Sword of Honour",
"price": 12.99
},
{
"category": "fiction",
"author": "J. R. R. Tolkien",
"title": "The Lord of the Rings",
"isbn": "0-395-19395-8",
"price": 22.99
}
],
"bicycle": {
"color": "red",
"price": 19.95
}
},
"expensive": 10
}

3. Log on to the MongoDB database in the Data Management (DMS) console. In this example, the name of the database is admin, and the name of the collection is userlog. You can run the following command to view the uploaded data:

db.userlog.find().limit(10)

Migrate JSON-formatted data from MongoDB to MaxCompute by using DataWorks
1.
2. Create a destination table in DataWorks. This table is used to store the data that is migrated from MongoDB.
i.
ii.
iii.
iv. In the DDL Statement dialog box, enter the following statement and click Generate Table Schema:

create table mqdata (mqdata string);


v. Click Commit to Production Environment.


3. Create a MongoDB connection. For more information, see Add a MongoDB data source.
4. Create a batch sync node.
i.
ii.
iii.
iv.
v.
vi. Enter the following script:

{
"type": "job",
"steps": [
{
"stepType": "mongodb",
"parameter": {
"datasource": "mongodb_userlog", // The name of the connection.
"column": [
{
"name": "store.bicycle.color", // The path of the JSON-formatted fi
eld. In this example, the color field is extracted.
"type": "document.String" // For fields other than top-level fields
, the data type of such a field is the type that is finally obtained. If the specif
ied JSON-formatted field is a top-level field, such as the expensive field in this
example, enter string.
}
],
"collectionName": "userlog" // The name of the collection.
},
"name": "Reader",
"category": "reader"
},
{
"stepType": "odps",
"parameter": {
"partition": "",
"isCompress": false,
"truncate": true,
"datasource": "odps_first",
"column": [
"mqdata" // The name of the column in the MaxCompute table.
],
"emptyAsNull": false,
"table": "mqdata"
},
"name": "Writer",
"category": "writer"
}
],
"version": "2.0",
"order": {


"hops": [
{
"from": "Reader",
"to": "Writer"
}
]
},
"setting": {
"errorLimit": {
"record": ""
},
"speed": {
"concurrent": 2,
"throttle": false,
}
}
}

vii.
viii.

Verify the result


1.
2.
3. On the configuration tab of the ODPS SQL node, enter the following statement:

SELECT * from mqdata;

4.
5.

2.8. Migrate data from ApsaraDB RDS to MaxCompute based on dynamic partitioning
This topic describes how to use the data integration and data synchronization features of DataWorks to migrate data from ApsaraDB RDS to MaxCompute based on dynamic partitioning.

Prerequisites
The DataWorks environment is ready.
i. MaxCompute is activated. For more information, see Activate MaxCompute and DataWorks.
ii. DataWorks is activated. To activate DataWorks, go to the DataWorks buy page.
iii. A workflow is created in the DataWorks console. In this example, a workflow is created in a DataWorks workspace in basic mode. For more information, see Create a workflow.
Connections to the source and destination data stores are created.
A MySQL connection is created as the source connection. For more information, see Add a MySQL data source.

A MaxCompute connection is created as the destination connection. For more information, see Add a MaxCompute data source.

Migrate data from ApsaraDB RDS to MaxCompute based on dynamic partitioning
After the preceding preparations are made, configure a sync node to migrate data from ApsaraDB RDS to MaxCompute based on dynamic partitioning every day at a scheduled time. For more information about how to configure a sync node, see Overview.

1.
2. Create a destination table in MaxCompute.
i.
ii.
iii.
iv.
v.
vi. In the DDL Statement dialog box, enter the following statement and click Generate Table Schema:

CREATE TABLE IF NOT EXISTS ods_user_info_d (


uid STRING COMMENT 'User ID',
gender STRING COMMENT 'Gender',
age_range STRING COMMENT 'Age range',
zodiac STRING COMMENT 'Zodiac sign'
)
PARTITIONED BY (
dt STRING
);

vii.
3. Create a batch sync node.
i.
ii.


iii. Configure the source and destination for the batch sync node.

4. Configure the partition parameter.
i. In the right-side navigation pane of the node configuration tab, click the Properties tab.


ii. In the General section, set the Arguments parameter. The default value is ${bizdate} in the format of yyyymmdd.

Note The value of the Arguments parameter in the General section on the Properties tab is the same as that of the Partition Key Column parameter in the Target section on the node configuration page. When the sync node is scheduled and run, the value of the partition parameter of the destination table is replaced with the date that is one day before the node is run, which is known as the data timestamp. By default, the data generated on the day before the node is run is migrated. To use the date when the node is run as the value of the partition parameter of the destination table, you must customize the partition parameter.

You can specify a date in one of the following formats for the partition parameter:

N years later: $[add_months(yyyymmdd,12*N)]
N years ago: $[add_months(yyyymmdd,-12*N)]
N months ago: $[add_months(yyyymmdd,-N)]
N weeks later: $[yyyymmdd+7*N]
N months later: $[add_months(yyyymmdd,N)]
N weeks ago: $[yyyymmdd-7*N]
N days later: $[yyyymmdd+N]
N days ago: $[yyyymmdd-N]
N hours later: $[hh24miss+N/24]
N hours ago: $[hh24miss-N/24]
N minutes later: $[hh24miss+N/24/60]
N minutes ago: $[hh24miss-N/24/60]

Note
Keep the value calculation formula in brackets []. For example, key1=$[yyyy-mm-dd].
The default unit of the calculation result is day. For example, $[hh24miss-N/24/60] refers to the calculation result of (yyyymmddhh24miss - (N/24/60 × 1 day)). The format hh24miss is used to align the value.
The unit of add_months is month. For example, $[add_months(yyyymmdd,12*N)-M/24/60] refers to the calculation result of (yyyymmddhh24miss - (12 × N × 1 month)) - (M/24/60 × 1 day). The format yyyymmdd is used to align the value.

5.
6.

Generate retroactive data


If you have a large amount of historical data in ApsaraDB RDS that was generated before the node is run, all of the historical data needs to be automatically migrated to MaxCompute and the partitions need to be automatically created. To generate retroactive data for the current sync node, you can use the Patch Data feature of DataWorks.
1. Filter historical data in ApsaraDB RDS by date.
You can set the Filter parameter in the Source section to filter data in ApsaraDB RDS.

2. Generate retroactive data for the node. For more information, see Perform retroactive data generation and view retroactive data generation instances.

3. View the process of extracting data from ApsaraDB RDS on the Run Log tab.
The logs indicate that Partition 20180913 is automatically created in MaxCompute.

4. Verify the execution result. Execute the following statement on the MaxCompute client to check whether the data is written to MaxCompute:

SELECT count(*) from ods_user_info_d where dt = 20180913;
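You can also list the partitions that the retroactive runs created automatically. The following command is a minimal sketch that you can run on the MaxCompute client:

SHOW PARTITIONS ods_user_info_d;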

Use hash functions to create partitions based on non-date fields


If you have a large amount of data or full data is migrated to partitions based on a non-date field for the first time, the partitions cannot be automatically created during the migration. In this case, you can map the values of a field in the source table to a corresponding partition in MaxCompute by using a hash function.

1. Create an SQL script node. Execute the following statements to create a temporary table in MaxCompute and migrate data to the table:

drop table if exists ods_user_t;
CREATE TABLE ods_user_t (
    dt STRING,
    uid STRING,
    gender STRING,
    age_range STRING,
    zodiac STRING);
-- Create a temporary table named ods_user_t and write data from Table ods_user_info_d to Table ods_user_t.
insert overwrite table ods_user_t select dt,uid,gender,age_range,zodiac from ods_user_info_d;

2. Create a sync node named mysql_to_odps to migrate full data from ApsaraDB RDS to MaxCompute. Partitioning is not required.

3. Execute the following SQL statements to migrate data from Table ods_user_t to Table ods_user_d based on dynamic partitioning:


drop table if exists ods_user_d;
-- Create a MaxCompute partitioned table named ods_user_d, which is the destination table.
CREATE TABLE ods_user_d (
    uid STRING,
    gender STRING,
    age_range STRING,
    zodiac STRING
)
PARTITIONED BY (
    dt STRING
);
-- Create dynamic partitions for Table ods_user_d based on the dt field in Table ods_user_t. In Table ods_user_d, a partition is automatically created for each unique value in the dt field in Table ods_user_t.
-- For example, if the value of the dt field is 20181025 in some rows of Table ods_user_t, Partition dt=20181025 is created in Table ods_user_d.
-- The following SQL statement migrates data from Table ods_user_t to Table ods_user_d based on dynamic partitioning.
-- The dt field is specified as the last column in the SELECT clause, so the partitions are automatically created based on this field.
insert overwrite table ods_user_d partition(dt) select uid,gender,age_range,zodiac,dt from ods_user_t;
-- After data migration is complete, you may drop the temporary table to release storage space.
drop table if exists ods_user_t;

You can use SQL statements to migrate data in MaxCompute. For more information about the SQL statements, see Use partitioned tables in MaxCompute.

4. Configure the three nodes to form a workflow to run these nodes sequentially, as shown in the following figure.

5. View the execution process. The last node represents the process of dynamic partitioning, as shown in the following figure.


6. Verify the execution result. Execute the following statement on the MaxCompute client to check whether the data is written to MaxCompute:

SELECT count(*) from ods_user_d where dt = 20180913;
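In addition to checking a single partition, you can verify how the rows were distributed across all of the dynamically created partitions. The following query is a minimal sketch:

SELECT dt, COUNT(*) AS cnt FROM ods_user_d GROUP BY dt;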

2.9. Migrate JSON data from OSS to MaxCompute
This topic describes how to use the data integration feature of DataWorks to migrate JSON data from Object Storage Service (OSS) to MaxCompute and use the GET_JSON_OBJECT function to extract JSON objects.

Prerequisites
MaxCompute is activated.
DataWorks is activated.
A workflow is created in the DataWorks console. In this example, a workflow is created in a DataWorks workspace in basic mode. For more information, see Create a workflow.
A TXT file that contains JSON data is uploaded to an OSS bucket. In this example, the OSS bucket is in the China (Shanghai) region. The TXT file contains the following JSON data:


{
"store": {
"book": [
{
"category": "reference",
"author": "Nigel Rees",
"title": "Sayings of the Century",
"price": 8.95
},
{
"category": "fiction",
"author": "Evelyn Waugh",
"title": "Sword of Honour",
"price": 12.99
},
{
"category": "fiction",
"author": "J. R. R. Tolkien",
"title": "The Lord of the Rings",
"isbn": "0-395-19395-8",
"price": 22.99
}
],
"bicycle": {
"color": "red",
"price": 19.95
}
},
"expensive": 10
}

Migrate JSON data from OSS to MaxCompute
1. Add an OSS connection. For more information, see Add an OSS data source.
2. Create a table in DataWorks to store the JSON data to be migrated from OSS.
i.
ii.
iii.
iv. In the DDL Statement dialog box, enter the following statement and click Generate Table Schema:

create table mqdata (mq_data string);

v.
3. Create a batch synchronization node.
i.
ii.
iii.
iv.


v.

vi. Modify the JSON code and click the icon.

Sample code:

{
"type": "job",
"steps": [
{
"stepType": "oss",
"parameter": {
"fieldDelimiterOrigin": "^",
"nullFormat": "",
"compress": "",
"datasource": "OSS_userlog",
"column": [
{
"name": 0,
"type": "string",
"index": 0
}
],
"skipHeader": "false",
"encoding": "UTF-8",
"fieldDelimiter": "^",
"fileFormat": "binary",
"object": [
"applog.txt"
]
},
"name": "Reader",
"category": "reader"
},
{
"stepType": "odps",
"parameter": {
"partition": "",
"isCompress": false,
"truncate": true,
"datasource": "odps_first",
"column": [
"mqdata"
],
"emptyAsNull": false,
"table": "mqdata"
},
"name": "Writer",
"category": "writer"
}
],
"version": "2.0",
"order": {
"hops": [
{


"from": "Reader",
"to": "Writer"
}
]
},
"setting": {
"errorLimit": {
"record": ""
},
"speed": {
"concurrent": 2,
"throttle": false,
}
}
}

Use the GET_JSON_OBJECT function to extract JSON objects
1. Create an ODPS SQL node.
i.
ii.
iii. On the configuration tab of the ODPS SQL node, enter the following statements:

-- Query data in the mqdata table.
SELECT * FROM mqdata;
-- Obtain the value of the expensive field.
SELECT GET_JSON_OBJECT(mqdata.mq_data, '$.expensive') FROM mqdata;

iv.
v.
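Because GET_JSON_OBJECT accepts JSONPath expressions, the same approach also works for nested objects and array elements in the sample data. The following queries are a minimal sketch based on the JSON shown in the prerequisites:

-- Obtain the title of the first book in the store.book array.
SELECT GET_JSON_OBJECT(mqdata.mq_data, '$.store.book[0].title') FROM mqdata;
-- Obtain the color of the bicycle.
SELECT GET_JSON_OBJECT(mqdata.mq_data, '$.store.bicycle.color') FROM mqdata;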

2.10. Migrate data from MaxCompute to Tablestore
This topic describes how to migrate data from MaxCompute to Tablestore.

Prerequisites

Procedure
1. Create a table in the DataWorks console.
i.
ii.
iii.
iv.
v.
vi.


vii. In the DDL Statement dialog box, enter the following statement and click Generate Table Schema:

create table Transs
(name string,
 id bigint,
 gender string);

viii.
2. Import data to the Transs table.
i.
ii.
iii. In the dialog box that appears, set Select Data Import Method to Upload Local File and click Browse next to Select File. Select the local file that you want to import. Then, specify the other parameters.
Example:

qwe,145,F
asd,256,F
xzc,345,M
rgth,234,F
ert,456,F
dfg,12,M
tyj,4,M
bfg,245,M
nrtjeryj,15,F
rwh,2344,M
trh,387,F
srjeyj,67,M
saerh,567,M

iv.
v.
vi.
3. Create a table in the Tablestore console.
i. Log on to the Tablestore console and create an instance. For more information, see Create instances.
ii. Create a table named Trans. For more information, see Create tables.
4. Add data sources in the DataWorks console.
i.
ii.
iii.
iv.
v. In the upper-right corner, click New data source. In the dialog box that appears, click MaxCompute(ODPS).
vi. In the Add MaxCompute(ODPS) data source dialog box, specify the required parameters and click Complete. For more information, see Add a MaxCompute data source.


vii. Add Tablestore as a data source. For more information, see Add a Tablestore data source.
5. Configure MaxCompute as the reader and Tablestore as the writer.
i.
ii.
iii.
iv.
v.
vi. Modify the JSON code and click the icon.

Sample code:

{
"type": "job",
"steps": [
{
"stepType": "odps",
"parameter": {
"partition": [],
"datasource": "odps_first",
"column": [
"name",
"id",
"gender"
],
"table": "Transs"
},
"name": "Reader",
"category": "reader"
},
{
"stepType": "ots",
"parameter": {
"datasource": "Transs",
"column": [
{
"name": "Gender",
"type": "STRING"
}
],
"writeMode": "UpdateRow",
"table": "Trans",
"primaryKey": [
{
"name": "Name",
"type": "STRING"
},
{
"name": "ID",
"type": "INT"
}
]
},

"name": "Writer",
"category": "writer"
}
],
"version": "2.0",
"order": {
"hops": [
{
"from": "Reader",
"to": "Writer"
}
]
},
"setting": {
"errorLimit": {
"record": "0"
},
"speed": {
"throttle": false,
"concurrent": 1,
"dmu": 1
}
}
}

6. View the data of the newly created table in the Tablestore console.
i. Log on to the Tablestore console.
ii. In the left-side navigation pane, click All Instances.
iii. On the page that appears, find the target instance and click the instance name to go to the Instance Management page. In the Tables section, click the name of the table whose data you want to view.
iv. On the page that appears, click the Data Editor tab to view the data.
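Before you check the rows in the Tablestore console, you can note the expected values by querying the source table on an ODPS SQL node. The following query is a minimal sketch:

SELECT name, id, gender FROM Transs;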

2.11. Migrate data from MaxCompute to OSS
This topic describes how to use the data synchronization feature of DataWorks to migrate data from MaxCompute to Object Storage Service (OSS).

Prerequisites

Procedure
1. Create a table in the DataWorks console.
i.
ii.
iii.
iv.


v.
vi.
vii. In the DDL Statement dialog box, enter the following statement and click Generate Table Schema:

create table Transs
(name string,
 id string,
 gender string);

viii.
2. Import data to the Transs table.
i.
ii.
iii. In the dialog box that appears, set Select Data Import Method to Upload Local File and click Browse next to Select File. Select the local file that you want to import. Then, specify the other parameters.
Example:

qwe,145,F
asd,256,F
xzc,345,M
rgth,234,F
ert,456,F
dfg,12,M
tyj,4,M
bfg,245,M
nrtjeryj,15,F
rwh,2344,M
trh,387,F
srjeyj,67,M
saerh,567,M

iv.
v.
vi.
3. Prepare a bucket and an object in the OSS console.
i. Log on to the OSS console and create a bucket. For more information, see Create buckets.
ii. Upload the qwee.csv file to OSS. For more information, see Upload objects.

Note Make sure that the fields in the qwee.csv file are exactly the same as those in the Transs table.

4. Add data sources in the DataWorks console.


i.
ii.
iii.


iv. In the left-side navigation pane of the page that appears, click Connection. The Data Source page appears.
v. In the upper-right corner, click New data source. In the dialog box that appears, click MaxCompute(ODPS).
vi. In the Add MaxCompute(ODPS) data source dialog box, specify the required parameters and click Complete. For more information, see Add a MaxCompute data source.
vii. Add OSS as a data source. For more information, see Add an OSS data source.
5. Configure MaxCompute as the reader and OSS as the writer.
i.
ii.
iii.
iv.
v.
vi. Modify the JSON code and click the icon.

Sample code:

{
"order":{
"hops":[
{
"from":"Reader",
"to":"Writer"
}
]
},
"setting":{
"errorLimit":{
"record":"0"
},
"speed":{
"concurrent":1,
"dmu":1,
"throttle":false
}
},
"steps":[
{
"category":"reader",
"name":"Reader",
"parameter":{
"column":[
"name",
"id",
"gender"
],
"datasource":"odps_first",
"partition":[],
"table":"Transs"
},


"stepType":"odps"
},
{
"category":"writer",
"name":"Writer",
"parameter":{
"datasource":"Trans",
"dateFormat":"yyyy-MM-dd HH:mm:ss",
"encoding":"UTF-8",
"fieldDelimiter":",",
"fileFormat":"csv",
"nullFormat":"null",
"object":"qweee.csv",
"writeMode":"truncate"
},
"stepType":"oss"
}
],
"type":"job",
"version":"2.0"
}

6. View the data of the newly created object in the OSS console. For more information, see Download objects.

2.12. Migrate data from a user-created MySQL database on an ECS instance to MaxCompute
This topic describes how to use an exclusive resource group for Data Integration to migrate data from a user-created MySQL database on an Elastic Compute Service (ECS) instance to MaxCompute.

Prerequisites
An ECS instance is purchased and bound to a virtual private cloud (VPC), not the classic network. A MySQL database that stores test data is deployed on the ECS instance. An account that is used to connect to the database is created. In this example, use the following statements to create a table in the MySQL database and insert test data into the table:


CREATE TABLE IF NOT EXISTS good_sale(


create_time timestamp,
category varchar(20),
brand varchar(20),
buyer_id varchar(20),
trans_num varchar(20),
trans_amount DOUBLE,
click_cnt varchar(20)
);
insert into good_sale values('2018-08-21','coat','brandA','lilei',3,500.6,7),
('2018-08-22','food','brandB','lilei',1,303,8),
('2018-08-22','coat','brandC','hanmeimei',2,510,2),
('2018-08-22','bath','brandA','hanmeimei',1,442.5,1),
('2018-08-22','food','brandD','hanmeimei',2,234,3),
('2018-08-23','coat','brandB','jimmy',9,2000,7),
('2018-08-23','food','brandA','jimmy',5,45.1,5),
('2018-08-23','coat','brandE','jimmy',5,100.2,4),
('2018-08-24','food','brandG','peiqi',10,5560,7),
('2018-08-24','bath','brandF','peiqi',1,445.6,2),
('2018-08-24','coat','brandA','ray',3,777,3),
('2018-08-24','bath','brandG','ray',3,122,3),
('2018-08-24','coat','brandC','ray',1,62,7) ;

The private IP address, VPC, and vSwitch of your ECS instance are noted.
A security group rule is added for the ECS instance to allow access requests on the port used by the MySQL database. By default, the MySQL database uses port 3306. For more information, see Add a security group rule. The name of the security group is noted.
A DataWorks workspace is created. In this example, create a DataWorks workspace that is in basic mode and uses a MaxCompute compute engine. Make sure that the created DataWorks workspace belongs to the same region as the ECS instance. For more information about how to create a workspace, see Create a workspace.
An exclusive resource group for Data Integration is purchased and bound to the VPC where the ECS instance resides. The exclusive resource group and the ECS instance are in the same zone. For more information, see Create and use an exclusive resource group for Data Integration. After the exclusive resource group is bound to the VPC, you can view information about the exclusive resource group on the Resource Groups page.
Check whether the VPC, vSwitch, and security group of the exclusive resource group are the same as those of the ECS instance.

Context
An exclusive resource group can transmit your data in a fast and stable manner. Make sure that the exclusive resource group for Data Integration belongs to the same zone in the same region as the data store that needs to be accessed. Also make sure that the exclusive resource group for Data Integration belongs to the same region as the DataWorks workspace. In this example, the data store that needs to be accessed is a user-created MySQL database on an ECS instance.

Procedure
1. Create a connection to the MySQL database in the DataWorks console.
i. Log on to the DataWorks console by using your Alibaba Cloud account.


ii. On the Workspaces page, find the required workspace and click Data Integration.
iii. In the left-side navigation pane, click Connection.
iv. On the Data Source page, click New data source in the upper-right corner.
v. In the Add data source dialog box, select MySQL.
vi. In the Add MySQL data source dialog box, set the parameters. For more information, see Add a MySQL data source.

For example, set the Data source type parameter to Connection string mode. Use the private IP address of the ECS instance and the default port number 3306 of the MySQL database when you specify the Java Database Connectivity (JDBC) URL.

Note DataWorks cannot test the connectivity of a user-created MySQL database in a VPC. Therefore, it is normal that a connectivity test fails.

vii. Find the required resource group and click Test connectivity.
During data synchronization, a sync node uses only one resource group. You must test the connectivity of all the resource groups for Data Integration on which your sync nodes will be run and make sure that the resource groups can connect to the data store. This ensures that your sync nodes can be run as expected. For more information, see Select a network connectivity solution.
viii. After the connection passes the connectivity test, click Complete.
2. Create a MaxCompute table.
You must create a table in DataWorks to receive the test data from the MySQL database.


i. Click the icon in the upper-left corner and choose All Products > DataStudio.
ii. Create a workflow. For more information, see Create a workflow.
iii. Right-click the created workflow and choose Create > MaxCompute > Table.
iv. Enter a name for your MaxCompute table. In this example, set the Table Name parameter to good_sale, which is the same as the name of the table in the MySQL database. Click DDL Statement, enter the table creation statement, and then click Generate Table Schema.
In this example, enter the following table creation statement. Pay attention to the data type conversion.

CREATE TABLE IF NOT EXISTS good_sale(


create_time string,
category STRING,
brand STRING,
buyer_id STRING,
trans_num BIGINT,
trans_amount DOUBLE,
click_cnt BIGINT
);

v. Set the Display Name parameter and click Commit to Production Environment. The MaxCompute table named good_sale is created.
3. Configure a data integration node.
i. Right-click the workflow that you just created and choose Create > Data Integration > Batch Synchronization to create a data integration node.
ii. Set the Connection parameter under Source to the created MySQL connection and the Connection parameter under Target to odps_first. Click the Switch to Code Editor icon to switch to the code editor.
If you cannot set the Table parameter under Source or an error is returned when you attempt to switch to the code editor, ignore the issue.
iii. Click the Resource Group configuration tab in the right-side navigation pane and select the exclusive resource group that you have purchased.
If you do not select the exclusive resource group as the resource group for Data Integration of your node, the node may fail to run.
iv. Enter the following code for the data integration node:

{
"type": "job",
"steps": [
{
"stepType": "mysql",
"parameter": {
"column": [// The columns in the source table.
"create_time",
"category",
"brand",
"buyer_id",
"trans_num",
"trans_amount",
"click_cnt"


],
"connection": [
{
"datasource": "shuai",// The source connection.
"table": [
"good_sale"// The name of the table in the source datab
ase. The name must be enclosed in brackets [].
]
}
],
"where": "",
"splitPk": "",
"encoding": "UTF-8"
},
"name": "Reader",
"category": "reader"
},
{
"stepType": "odps",
"parameter": {
"partition": "",
"truncate": true,
"datasource": "odps_first",// The destination connection.
"column": [// The columns in the destination table.
"create_time",
"category",
"brand",
"buyer_id",
"trans_num",
"trans_amount",
"click_cnt"
],
"emptyAsNull": false,
"table": "good_sale"// The name of the destination table.
},
"name": "Writer",
"category": "writer"
}
],
"version": "2.0",
"order": {
"hops": [
{
"from": "Reader",
"to": "Writer"
}
]
},
"setting": {
"errorLimit": {
"record": "0"
},
"speed": {
"throttle": false,
"concurrent": 2

}
}
}

v. Click the Run icon. You can view the Runtime Log tab in the lower part of the page to check whether the test data is synchronized to MaxCompute.

Result
To query the data in the MaxCompute table, create an ODPS SQL node. Enter the statement select * from good_sale;, and click the Run icon. If the test data appears, it has been synchronized to the MaxCompute table.
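The test data that is inserted in the prerequisites contains 13 rows, so a row count is a quick way to confirm that no records were lost during synchronization. The following query is a minimal sketch:

SELECT COUNT(*) FROM good_sale;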

2.13. Migrate data from Amazon Redshift to MaxCompute
This topic describes how to migrate data from Amazon Redshift to MaxCompute over the Internet.

Prerequisites
Create an Amazon Redshift cluster and prepare data for migration.
For more information about how to create an Amazon Redshift cluster, see Amazon Redshift Cluster Management Guide.

i. Create an Amazon Redshift cluster. If you already have an Amazon Redshift cluster, skip this step.
ii. Prepare the data that you want to migrate in the Amazon Redshift cluster.

In this example, a TPC-H dataset is available in the public schema. The dataset uses the MaxCompute V2.0 data types and the Decimal 2.0 data type.

Prepare a MaxCompute project.

For more information, see Prepare.


In this example, a MaxCompute project is created as the migration destination in the Singapore (Singapore) region. The project is created in MaxCompute V2.0 because the TPC-H dataset uses the MaxCompute V2.0 data types and the Decimal 2.0 data type.

Activate Alibaba Cloud Object Storage Service (OSS).

For more information, see Activate OSS.

Context
The following figure shows the process to migrate data from Amazon Redshift to MaxCompute.

No.  Description
1    Unload the data from Amazon Redshift to a data lake on Amazon Simple Storage Service (S3).
2    Migrate the data from Amazon S3 to an OSS bucket by using the Data Online Migration service of OSS.
3    Migrate the data from the OSS bucket to a MaxCompute project in the same region, and then verify the integrity and accuracy of the migrated data.

Step 1: Unload data from Amazon Redshift to Amazon S3
Amazon Redshift supports authentication based on Identity and Access Management (IAM) roles and temporary security credentials (AccessKey pairs). You can use the UNLOAD command of Amazon Redshift to unload data to Amazon S3 based on these two authentication methods. For more information, see Unloading data.

The syntax of the UNLOAD command varies based on the authentication method.


UNLOAD command based on an IAM role

-- Run the UNLOAD command to unload data from the customer table to Amazon S3.
UNLOAD ('SELECT * FROM customer')
TO 's3://bucket_name/unload_from_redshift/customer/customer_' -- The Amazon S3 bucket.
IAM_ROLE 'arn:aws:iam::****:role/MyRedshiftRole'; -- The Amazon Resource Name (ARN) of the IAM role.

UNLOAD command based on an AccessKey pair

-- Run the UNLOAD command to unload data in the customer table to Amazon S3.
UNLOAD ('SELECT * FROM customer')
TO 's3://bucket_name/unload_from_redshift/customer/customer_' -- The Amazon S3 bucket.
Access_Key_id '<access-key-id>' -- The AccessKey ID of the IAM user.
Secret_Access_Key '<secret-access-key>' -- The AccessKey secret of the IAM user.
Session_Token '<temporary-token>'; -- The temporary access token of the IAM user.

The UNLOAD command allows you to unload data in one of the following formats:

Default format
The following sample command shows how to unload data in the default format:

UNLOAD ('SELECT * FROM customer')


TO 's3://bucket_name/unload_from_redshift/customer/customer_'
IAM_ROLE 'arn:aws:iam::****:role/redshift_s3_role';

After the command is run, data is unloaded to text files in which values are separated by vertical bars (|). You can log on to the Amazon S3 console and view the unloaded text files in the specified bucket.

The following figure shows the unloaded text files in the default format.


Apache Parquet format

Data unloaded in the Apache Parquet format can be directly read by other engines. The following sample command shows how to unload data in the Apache Parquet format:

UNLOAD ('SELECT * FROM customer')


TO 's3://bucket_name/unload_from_redshift/customer_parquet/customer_'
FORMAT AS PARQUET
IAM_ROLE 'arn:aws:iam::xxxx:role/redshift_s3_role';

After the command is run, you can view the unloaded Parquet files in the specified bucket. Parquet files are smaller than text files and have a higher data compression ratio.

This section describes how to authenticate requests based on IAM roles and unload data in the Apache Parquet format.
1. Create an IAM role for Amazon Redshift.
i. Log on to the IAM console. In the left-side navigation pane, choose Access Management > Roles. On the Roles page, click Create role.


ii. In the Common use cases section of the Create role page, click Redshift. In the Select your use case section, click Redshift - Customizable, and then click Next: Permissions.

2. Add an IAM policy that grants the read and write permissions on Amazon S3. In the Attach permissions policies section of the Create role page, enter S3, select AmazonS3FullAccess, and then click Next: Tags.


3. Assign a name to the IAM role and complete the IAM role creation.
i. Click Next: Review. In the Review section of the Create role page, specify Role name and Role description, and click Create role. The IAM role is then created.
ii. Go to the IAM console, and enter redshift_s3_role in the search box to search for the role. Then, click the role name redshift_s3_role, and copy the value of Role ARN.
When you run the UNLOAD command to unload data, you must provide the Role ARN value to access Amazon S3.

4. Associate the created IAM role with the Amazon Redshift cluster to authorize the cluster to access Amazon S3.


i. Log on to the Amazon Redshift console. In the upper-right corner, select Asia Pacific (Singapore) from the drop-down list.
ii. In the left-side navigation pane, click CLUSTERS, find the created Amazon Redshift cluster, click Actions, and then click Manage IAM roles.
iii. On the Manage IAM roles page, click the icon next to the search box, and select redshift_s3_role. Click Add IAM role > Done to associate the redshift_s3_role role with the Amazon Redshift cluster.
5. Unload data from Amazon Redshift to Amazon S3.
i. Go to the Amazon Redshift console.
ii. In the left-side navigation pane, click EDITOR. Run the UNLOAD command to unload data from Amazon Redshift to each destination bucket on Amazon S3 in the Apache Parquet format.
The following sample command shows how to unload data from Amazon Redshift to Amazon S3:

UNLOAD ('SELECT * FROM customer')


TO 's3://bucket_name/unload_from_redshift/customer_parquet/customer_'
FORMAT AS PARQUET
IAM_ROLE 'arn:aws:iam::xxxx:role/redshift_s3_role';
UNLOAD ('SELECT * FROM orders')
TO 's3://bucket_name/unload_from_redshift/orders_parquet/orders_'
FORMAT AS PARQUET
IAM_ROLE 'arn:aws:iam::xxxx:role/redshift_s3_role';
UNLOAD ('SELECT * FROM lineitem')
TO 's3://bucket_name/unload_from_redshift/lineitem_parquet/lineitem_'
FORMAT AS PARQUET
IAM_ROLE 'arn:aws:iam::xxxx:role/redshift_s3_role';
UNLOAD ('SELECT * FROM nation')
TO 's3://bucket_name/unload_from_redshift/nation_parquet/nation_'
FORMAT AS PARQUET
IAM_ROLE 'arn:aws:iam::xxxx:role/redshift_s3_role';
UNLOAD ('SELECT * FROM part')
TO 's3://bucket_name/unload_from_redshift/part_parquet/part_'
FORMAT AS PARQUET
IAM_ROLE 'arn:aws:iam::xxxx:role/redshift_s3_role';
UNLOAD ('SELECT * FROM partsupp')
TO 's3://bucket_name/unload_from_redshift/partsupp_parquet/partsupp_'
FORMAT AS PARQUET
IAM_ROLE 'arn:aws:iam::xxxx:role/redshift_s3_role';
UNLOAD ('SELECT * FROM region')
TO 's3://bucket_name/unload_from_redshift/region_parquet/region_'
FORMAT AS PARQUET
IAM_ROLE 'arn:aws:iam::xxxx:role/redshift_s3_role';
UNLOAD ('SELECT * FROM supplier')
TO 's3://bucket_name/unload_from_redshift/supplier_parquet/supplier_'
FORMAT AS PARQUET
IAM_ROLE 'arn:aws:iam::xxxx:role/redshift_s3_role';

Note You can submit multiple UNLOAD commands at a time in EDITOR.


iii. Log on to the Amazon S3 console and check the unloaded data in the directory of each destination bucket on Amazon S3.
The unloaded data is available in the Apache Parquet format.

Step 2: Migrate the unloaded data from Amazon S3 to OSS
In MaxCompute, you can use the Data Online Migration service of OSS to migrate data from Amazon S3 to OSS. For more information, see Migrate data from Amazon Simple Storage Service (Amazon S3) to OSS. The Data Online Migration service is in public preview. Before you use this service, you must submit a ticket to contact Customer Service to activate the service.

1. Log on to the OSS console, and create a bucket to save the migrated data. For more information, see Create buckets.

2. Create a Resource Access Management (RAM) user and grant the relevant permissions to the RAM user.
i. Log on to the RAM console and create a RAM user. For more information, see Create a RAM user.
ii. Find the RAM user that you created, and click Add Permissions in the Actions column. On the page that appears, select AliyunOSSFullAccess and AliyunMGWFullAccess, and then click OK and Complete. The AliyunOSSFullAccess policy authorizes the RAM user to read data from and write data to OSS buckets. The AliyunMGWFullAccess policy authorizes the RAM user to perform online migration jobs.
iii. In the left-side navigation pane, click Overview. In the Account Management section of the Overview page, click the link under RAM user logon, and use the credentials of the RAM user to log on to the Alibaba Cloud Management Console.
3. On the Amazon Web Services (AWS) platform, create an IAM user who uses the programmatic access method to access Amazon S3.
i. Log on to the Amazon S3 console.


ii. Right-click the exported folder and select Get total size to obtain the total size of the folder and the number of files in the folder.

iii. Log on to the IAM console and click Add user.

iv. On the Add user page, specify the User name. In the Select AWS access type section, select Programmatic access and then click Next: Permissions.


v. On the Add user page, click Attach existing policies directly. Enter S3 in the search box, select the AmazonS3ReadOnlyAccess policy, and then click Next: Tags.

vi. Click Next: Review > Create user. The IAM user is created. Obtain the AccessKey pair.
If you create an online migration job, you must provide this AccessKey pair.

4. Create a source data address and a destination data address for online migration.
i. Log on to the Alibaba Cloud Data Transport console. In the left-side navigation pane, click Data Address.


ii. (Optional) If you have not activated Data Online Migration, click Application in the dialog box that appears. On the Online Migration Beta Test page, specify the required information and click Submit.

iii. On the Data Address page, click Create Data Address. In the Create Data Address panel, set the required parameters and click OK. For more information about the required parameters, see Migrate data.
Source data address


Note In the Access Key Id and Access Key Secret fields, enter the AccessKey ID and AccessKey secret of the IAM user.


Destination data address

Note In the Access Key Id and Access Key Secret fields, enter the AccessKey ID and the AccessKey secret of the RAM user.

5. Create an online migration job.
i. In the left-side navigation pane, click Migration Jobs.
ii. On the File Sync Management page, click Create Job. In the Create Job wizard, set the required parameters and click Create. For more information about the required parameters, see Migrate data.


Job Config

Performance


Note In the Data Size and File Count fields, enter the size and the number of the files that you want to migrate from Amazon S3.

iii. The migration job that you created is automatically run. If Finished is displayed in the Job Status column, the migration job is complete.


iv. In the Operation column of the migration job, click Manage to view the migration report and confirm that all the data is migrated.

v. Log on to the OSS console.
vi. In the left-side navigation pane, click Buckets. On the Buckets page, click the created bucket. In the left-side navigation pane of the bucket details page, click Files to view the migration results.

Step 3: Migrate data from the OSS bucket to the MaxCompute project in the same region
You can run the LOAD command of MaxCompute to migrate data from an OSS bucket to a MaxCompute project in the same region.

The LOAD command supports Security Token Service (STS) and AccessKey authentication. If you use AccessKey authentication, you must provide the AccessKey ID and AccessKey secret of your account in plaintext. STS authentication is highly secure because it does not expose the AccessKey pair. In this section, STS authentication is used as an example to show how to migrate data.

1. On the Ad-Hoc Query tab of DataWorks or on the MaxCompute client (odpscmd), execute the DDL statements to create the tables that store the migrated data in MaxCompute. The DDL statements that you execute must be the same as those executed in the Amazon Redshift cluster.
For more information about ad hoc queries, see Use the ad-hoc query feature to execute SQL statements (optional). The following sample commands show how to create the tables:

CREATE TABLE customer(


C_CustKey int ,
C_Name varchar(64) ,
C_Address varchar(64) ,

C_NationKey int ,
C_Phone varchar(64) ,
C_AcctBal decimal(13, 2) ,
C_MktSegment varchar(64) ,
C_Comment varchar(120) ,
skip varchar(64)
);
CREATE TABLE lineitem(
L_OrderKey int ,
L_PartKey int ,
L_SuppKey int ,
L_LineNumber int ,
L_Quantity int ,
L_ExtendedPrice decimal(13, 2) ,
L_Discount decimal(13, 2) ,
L_Tax decimal(13, 2) ,
L_ReturnFlag varchar(64) ,
L_LineStatus varchar(64) ,
L_ShipDate timestamp ,
L_CommitDate timestamp ,
L_ReceiptDate timestamp ,
L_ShipInstruct varchar(64) ,
L_ShipMode varchar(64) ,
L_Comment varchar(64) ,
skip varchar(64)
);
CREATE TABLE nation(
N_NationKey int ,
N_Name varchar(64) ,
N_RegionKey int ,
N_Comment varchar(160) ,
skip varchar(64)
);
CREATE TABLE orders(
O_OrderKey int ,
O_CustKey int ,
O_OrderStatus varchar(64) ,
O_TotalPrice decimal(13, 2) ,
O_OrderDate timestamp ,
O_OrderPriority varchar(15) ,
O_Clerk varchar(64) ,
O_ShipPriority int ,
O_Comment varchar(80) ,
skip varchar(64)
);
CREATE TABLE part(
P_PartKey int ,
P_Name varchar(64) ,
P_Mfgr varchar(64) ,
P_Brand varchar(64) ,
P_Type varchar(64) ,
P_Size int ,
P_Container varchar(64) ,
P_RetailPrice decimal(13, 2) ,
P_Comment varchar(64) ,
skip varchar(64)
);
CREATE TABLE partsupp(
PS_PartKey int ,
PS_SuppKey int ,
PS_AvailQty int ,
PS_SupplyCost decimal(13, 2) ,
PS_Comment varchar(200) ,
skip varchar(64)
);
CREATE TABLE region(
R_RegionKey int ,
R_Name varchar(64) ,
R_Comment varchar(160) ,
skip varchar(64)
);
CREATE TABLE supplier(
S_SuppKey int ,
S_Name varchar(64) ,
S_Address varchar(64) ,
S_NationKey int ,
S_Phone varchar(18) ,
S_AcctBal decimal(13, 2) ,
S_Comment varchar(105) ,
skip varchar(64)
);

In this example, the project uses the MaxCompute V2.0 data types because the TPC-H dataset uses the MaxCompute V2.0 data types and the Decimal 2.0 data type. If you want to configure the project to use the MaxCompute V2.0 data types and the Decimal 2.0 data type, add the following commands at the beginning of the CREATE TABLE statements:

setproject odps.sql.type.system.odps2=true;
setproject odps.sql.decimal.odps2=true;

2. Create a RAM role that has the OSS access permissions and assign the RAM role to the RAM user. For more information, see STS authorization.
3. Run the LOAD command multiple times to load all data from OSS to the MaxCompute tables that you created, and execute the SELECT statement to query and verify the imported data. For more information about the LOAD command, see LOAD.

LOAD OVERWRITE TABLE orders
FROM LOCATION 'oss://endpoint/oss_bucket_name/unload_from_redshift/orders_parquet/' -- The endpoint of the OSS bucket.
ROW FORMAT SERDE 'org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe'
WITH SERDEPROPERTIES ('odps.properties.rolearn'='acs:ram::xxx:role/xxx_role')
STORED AS PARQUET;

Note: If the data import fails, submit a ticket to contact the MaxCompute team.

Execute the following statement to query and verify the imported data:


SELECT * FROM orders limit 100;

The statement returns the following output.

4. Verify that the data migrated to MaxCompute is the same as the data in Amazon Redshift. This verification is based on the number of tables, the number of rows, and the query results of typical jobs. A row-count comparison sketch is provided after the following steps.
i. Log on to the Amazon Redshift console. In the upper-right corner, select Asia Pacific (Singapore) from the drop-down list. In the left-side navigation pane, click EDITOR. Execute the following statement to query data:

SELECT l_returnflag, l_linestatus, SUM(l_quantity) AS sum_qty,
SUM(l_extendedprice) AS sum_base_price, SUM(l_extendedprice*(1-l_discount)) AS sum_disc_price,
SUM(l_extendedprice*(1-l_discount)*(1+l_tax)) AS sum_charge, AVG(l_quantity) AS avg_qty,
AVG(l_extendedprice) AS avg_price, AVG(l_discount) AS avg_disc, COUNT(*) AS count_order
FROM lineitem
GROUP BY l_returnflag, l_linestatus
ORDER BY l_returnflag, l_linestatus;

The statement returns the following output.


ii. On the Ad-Hoc Query tab of DataWorks or on the MaxCompute client (odpscmd), execute the preceding statement and check whether the returned results are consistent with the data that is queried from the Amazon Redshift cluster.
The following output is returned.
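
A simple way to compare row counts between the two systems is to run the same COUNT statement on each side. The following statements are a minimal sketch that assumes the TPC-H tables created earlier; run them in the Amazon Redshift query editor and on MaxCompute, and compare the returned values:

SELECT COUNT(*) FROM orders;
SELECT COUNT(*) FROM lineitem;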

2.14. Migrate data from BigQuery to MaxCompute

This topic describes how to migrate data from BigQuery on Google Cloud Platform (GCP) to Alibaba Cloud MaxCompute over the Internet.

Prerequisites

Environment and data

Google Cloud Platform:
  Requirement: The Google BigQuery service is activated, and the environment and datasets for migration are prepared. The Google Cloud Storage service is activated and a bucket is created.
  Reference: If you do not have the relevant environment and datasets, see BigQuery: Quickstarts and Creating datasets; Google Cloud Storage: Quickstart: Using the Console and Creating storage buckets.

Alibaba Cloud:
  Requirement: The MaxCompute and DataWorks services are activated and a project is created. In this example, a MaxCompute project in the Indonesia (Jakarta) region is created as the migration destination. Object Storage Service (OSS) is activated and a bucket is created. The Data Online Migration service of OSS is activated.
  Reference: If you do not have the relevant environment, see MaxCompute and DataWorks: Prepare and Create a MaxCompute project; OSS: Activate OSS and Create buckets; Data Online Migration: Submit a ticket or apply for the service online.

Account

Google Cloud Platform:
  Requirement: An Identity and Access Management (IAM) user is created and granted the permissions to access Google Cloud Storage.
  Reference: IAM permissions for JSON methods

Alibaba Cloud:
  Requirement: A Resource Access Management (RAM) user and a RAM role are created. The RAM user is granted the read and write permissions on OSS buckets and the online migration permissions.
  Reference: Create a RAM user and STS authorization

Region

Google Cloud Platform: N/A

Alibaba Cloud:
  Requirement: The OSS bucket and the MaxCompute project are in the same region.

Context
The following figure shows the process to migrate datasets from BigQuery to Alibaba Cloud MaxCompute.


No.  Description

①  Export datasets from BigQuery to Google Cloud Storage.

②  Migrate data from Google Cloud Storage to an OSS bucket by using the Data Online Migration service of OSS.

③  Migrate data from the OSS bucket to a MaxCompute project in the same region, and then verify the integrity and accuracy of the migrated data.

Step 1: Export datasets from BigQuery to Google Cloud Storage

Use the bq command-line tool to run the bq extract command to export datasets from BigQuery to Google Cloud Storage.
1. Log on to Google Cloud Platform and create a bucket for the data that you want to migrate. For more information, see Creating storage buckets.

2. Use the bq command-line tool to query the Data Definition Language (DDL) scripts of tables in the TPC-DS datasets and download the scripts to an on-premises device. For more information, see Getting table metadata using INFORMATION_SCHEMA.
BigQuery does not support commands such as show create table to query the DDL scripts of tables. Instead, BigQuery allows you to use built-in user-defined functions (UDFs) to query the DDL scripts of the tables in a dataset. The following code shows examples of DDL scripts.

3. Use the bq command-line tool to run the bq extract command to export tables in BigQuery datasets to the destination bucket of Google Cloud Storage. For more information about the operations, formats of exported data, and compression types, see Exporting table data.
The following code shows a sample extract command:


bq extract \
  --destination_format AVRO \
  --compression SNAPPY \
  tpcds_100gb.web_site \
  'gs://bucket_name/web_site/web_site-*.avro.snappy'

4. View the bucket and check the data export result.

Step 2: Migrate the exported data from Google Cloud Storage to OSS

You can use the Data Online Migration service to migrate data from Google Cloud Storage to OSS. For more information, see Migrate data from Google Cloud Platform to OSS. The Data Online Migration service is in public preview. Before you use the service, you must submit a ticket to contact Customer Service to activate the service.

1. Estimate the size and the number of files that you want to migrate. You can query the data size in the bucket of Google Cloud Storage by using the gsutil tool or by checking the storage logs. For more information, see Getting bucket information.
2. (Optional) If you do not have a bucket in OSS, log on to the OSS console and create a bucket to store the migrated data. For more information, see Create buckets.

3. (Optional) If you do not have a RAM user, create a RAM user and grant the relevant permissions to the RAM user.
i. Log on to the RAM console and create a RAM user. For more information, see Create a RAM user.
ii. Find the newly created RAM user and click Add Permissions in the Actions column. On the page that appears, select AliyunOSSFullAccess and AliyunMGWFullAccess, and click OK > Complete. The AliyunOSSFullAccess policy authorizes the RAM user to read and write OSS buckets. The AliyunMGWFullAccess policy authorizes the RAM user to perform online migration jobs.
iii. In the left-side navigation pane, click Overview. In the Account Management section of the Overview page, click the link under RAM user logon, and use the credentials of the RAM user to log on to the Alibaba Cloud Management Console.
4. On Google Cloud Platform, create a user who uses the programmatic access method to access Google Cloud Storage. For more information, see IAM permissions for JSON methods.
i. Log on to the IAM & Admin console and find a user who has permissions to access BigQuery. In the Actions column, click > Create key.

ii. In the dialog box that appears, select JSON and click CREATE. Save the JSON file to an on-premises device and click CLOSE.
iii. In the Create service account wizard, click Select a role, and choose Cloud Storage > Storage Admin to authorize the IAM user to access Google Cloud Storage.
5. Create a source data address and a destination data address for online data migration.
i. Log on to the Alibaba Cloud Data Transport console. In the left-side navigation pane, click Data Address.


ii. (Optional) If you have not activated the Data Online Migration service, click Application in the dialog box that appears. On the Online Migration Beta Test page, specify the required information and click Submit.

Note: On the Online Migration Beta Test page, if the Source Storage Provider options do not include Google Cloud Platform, select a source storage provider and specify the actual source storage provider in the Notes field.

iii. On the Data Address page, click Create Data Address. In the Create Data Address dialog box, set the required parameters and click OK. For more information about the parameters, see Migrate data.


Source data address

Note: For the Key File field, upload the JSON file that you downloaded in Step 4.


Destination data address

Note: In the Access Key Id and Access Key Secret fields, enter the AccessKey ID and the AccessKey secret of the RAM user.

6. Create an online migration job.

i. In the left-side navigation pane, click Migration Jobs.
ii. On the File Sync Management page, click Create Job. In the Create Job wizard, set the required parameters and click Create. For more information about the parameters, see Migrate data.


Job Config

Performance


Note: In the Data Size and File Count fields, enter the size and the number of files that were migrated from Google Cloud Platform.

iii. The created migration job is automatically run. If Finished is displayed in the Job Status column, the migration job is complete.
iv. In the Operation column of the migration job, click Manage to view the migration report and confirm that all data is migrated.
v. Log on to the OSS console.
vi. In the left-side navigation pane, click Buckets. On the Buckets page, click the created bucket. In the left-side navigation pane of the bucket details page, choose Files > Files to view the migration results.

Step 3: Migrate data from the OSS bucket to a MaxCompute project in the same region

You can execute the LOAD statement of MaxCompute to migrate data from an OSS bucket to a MaxCompute project in the same region.


The LOAD statement supports Security Token Service (STS) and AccessKey authentication. If you use AccessKey authentication, you must provide the AccessKey ID and AccessKey secret of your account in plaintext. STS authentication is more secure because it does not expose the AccessKey information. In this section, STS authentication is used as an example to show how to migrate data.

1. On the Ad-Hoc Query tab of DataWorks or on the MaxCompute client (odpscmd), modify the DDL scripts of the tables in the BigQuery datasets, specify the MaxCompute data types, and create a destination table that stores the migrated data in MaxCompute.
For more information about ad hoc queries, see Use the ad-hoc query feature to execute SQL statements (optional). The following code shows a configuration example:

CREATE OR REPLACE TABLE


`****.tpcds_100gb.web_site`
(
web_site_sk INT64,
web_site_id STRING,
web_rec_start_date STRING,
web_rec_end_date STRING,
web_name STRING,
web_open_date_sk INT64,
web_close_date_sk INT64,
web_class STRING,
web_manager STRING,
web_mkt_id INT64,
web_mkt_class STRING,
web_mkt_desc STRING,
web_market_manager STRING,
web_company_id INT64,
web_company_name STRING,
web_street_number STRING,
web_street_name STRING,
web_street_type STRING,
web_suite_number STRING,
web_city STRING,
web_county STRING,
web_state STRING,
web_zip STRING,
web_country STRING,
web_gmt_offset FLOAT64,
web_tax_percentage FLOAT64
)
-- Modify the INT64 and FLOAT64 fields to obtain the following DDL script:
CREATE TABLE IF NOT EXISTS <your_maxcompute_project>.web_site_load
(
web_site_sk BIGINT,
web_site_id STRING,
web_rec_start_date STRING,
web_rec_end_date STRING,
web_name STRING,
web_open_date_sk BIGINT,
web_close_date_sk BIGINT,
web_class STRING,
web_manager STRING,
web_mkt_id BIGINT,
web_mkt_class STRING,
web_mkt_desc STRING,
web_market_manager STRING,
web_company_id BIGINT,
web_company_name STRING,
web_street_number STRING,
web_street_name STRING,
web_street_type STRING,
web_suite_number STRING,
web_city STRING,
web_county STRING,
web_state STRING,
web_zip STRING,
web_country STRING,
web_gmt_offset DOUBLE,
web_tax_percentage DOUBLE
);

The following table describes the mapping between BigQuery data types and MaxCompute data types.

BigQuery data type    MaxCompute data type
INT64                 BIGINT
FLOAT64               DOUBLE
NUMERIC               DECIMAL and DOUBLE
BOOL                  BOOLEAN
STRING                STRING
BYTES                 VARCHAR
DATE                  DATE
DATETIME              DATETIME
TIME                  DATETIME
TIMESTAMP             TIMESTAMP
STRUCT                STRUCT
GEOGRAPHY             STRING

2. (Optional) If you do not have a RAM role, create a RAM role that has the OSS access permissions and assign the role to the RAM user. For more information, see STS authorization.
3. Execute the LOAD statement to load all data from the OSS bucket to the MaxCompute table, and execute the SELECT statement to query and verify the imported data. You can load only one table at a time. To load multiple tables, you must execute the LOAD statement multiple times. For more information about the LOAD statement, see LOAD.

LOAD OVERWRITE TABLE web_site
FROM LOCATION 'oss://oss-<your_region_id>-internal.aliyuncs.com/bucket_name/tpc_ds_100gb/web_site/' -- The endpoint of the OSS bucket.
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.avro.AvroSerDe'
WITH SERDEPROPERTIES ('odps.properties.rolearn'='<Your_RoleARN>','mcfed.parquet.compression'='SNAPPY')
STORED AS AVRO;

Note: If the data import fails, submit a ticket to contact the MaxCompute team.

Execute the following statement to query and verify the imported data:

SELECT * FROM web_site;

The statement returns the following output.

4. Verify that the data migrated to MaxCompute is the same as the data in BigQuery. This verification is based on the number of tables, the number of rows, and the query results of typical jobs.

2.15. Migrate log data to MaxCompute

2.15.1. Overview
Enterprises develop their businesses based on real-time log data. The data includes log data from Elastic Compute Service (ECS) instances, containers, mobile terminals, open-source software, website services, and JavaScript. This topic describes how to migrate log data to MaxCompute by using Tunnel, DataHub, Log Service, and DataWorks Data Integration.

Migration method: Tunnel
Description: Use Tunnel in MaxCompute to upload log data to MaxCompute. For more information, see Use Tunnel to upload log data to MaxCompute.
Scenario: This method is used to upload large volumes of offline data to MaxCompute tables. Tunnel is suitable for offline computing.

Migration method: DataHub
Description: Use DataHub to migrate data to MaxCompute. The DataHub DataConnector synchronizes streaming data from DataHub to MaxCompute. You only need to write data to DataHub and configure the data synchronization feature in DataHub. Then, you can use the data in MaxCompute. For more information, see Use DataHub to migrate log data to MaxCompute.
Scenario: This method is mainly used for public preview and development. DataHub is used to upload data in real time and is suitable for stream processing. After data is uploaded to DataHub, the data is stored in a table for real-time processing. DataHub executes scheduled tasks to synchronize the data to a MaxCompute table within a few minutes for offline computing.

Migration method: DataWorks Data Integration
Description: Configure batch synchronization nodes and synchronization tasks in DataWorks Data Integration to synchronize data to MaxCompute. For more information, see Use DataWorks Data Integration to migrate log data to MaxCompute.
Scenario: This method is used after you configure batch synchronization tasks in DataWorks Data Integration. The tasks are executed on a regular basis.

2.15.2. Use Tunnel to upload log data to MaxCompute

This topic describes how to use Tunnel to upload log data to MaxCompute.

Prerequisites
The odpscmd client is installed. For more information, see Install and configure the MaxCompute client.
Log data is stored in a local directory. The loghub.csv file is used as an example in this topic.

Context
Tunnel is a tool that can be used to upload large volumes of data to MaxCompute at a time. It is suitable for offline computing. For more information, see Usage notes.

Procedure
1. On the odpscmd client, run the following commands to create a table named loghub that is used to store the uploaded data:


-- Enable the new data types supported by MaxCompute V2.0. Commit the following command with the SQL statement that is used to create the table:
set odps.sql.type.system.odps2=true;
-- Create a table named loghub.
CREATE TABLE loghub
(
client_ip STRING ,
receive_time STRING ,
topic STRING,
id STRING,
name VARCHAR(32),
salenum STRING
);

2. Run the following command to upload the log data to MaxCompute:

Tunnel u D:\loghub.csv loghub;

In the command:

D:\loghub.csv: the path where the log data file is stored.
loghub: the name of the MaxCompute table that is used to store the log data.

Note: Wildcards and regular expressions are not supported for Tunnel-based data uploads. A sketch of uploading several files one by one is provided after this procedure.

3. Execute the following statement to check whether the data is uploaded to the table:

SELECT * FROM loghub;

If the result shown in the following figure is displayed, the data is uploaded.
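
Because wildcards are not supported, each local file must be uploaded with its own command. The following commands are a minimal sketch that assumes two hypothetical files, D:\loghub1.csv and D:\loghub2.csv; the -fd option sets the field delimiter and should be verified against the Tunnel command reference for your client version:

Tunnel u D:\loghub1.csv loghub -fd ",";
Tunnel u D:\loghub2.csv loghub -fd ",";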

2.15.3. Use DataHub to migrate log data to MaxCompute

This topic describes how to use DataHub to migrate log data to MaxCompute.

Prerequisites
The following permissions are granted to the account that is authorized to access MaxCompute:
CreateInstance permission on MaxCompute projects

Permissions to view, modify, and update MaxCompute tables

For more information, see Permissions. A sketch of the corresponding grant commands follows.
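
The following commands are a minimal sketch of how these permissions might be granted on the MaxCompute client. The project name, table name, and account are placeholders that must be replaced with your own values:

grant CreateInstance on project <your_project> to user ALIYUN$<account>;
grant Describe, Select, Alter, Update on table <your_table> to user ALIYUN$<account>;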

Context
DataHub is a platform that is designed to process streaming data. After data is uploaded to DataHub, the data is stored in a table for real-time processing. DataHub executes scheduled tasks within five minutes to synchronize the data to a MaxCompute table for offline computing.

To periodically archive streaming data in DataHub to MaxCompute, you only need to create and configure a DataConnector.

Procedure
1. On the odpscmd client, create a table that is used to store the data synchronized from DataHub. Example:

CREATE TABLE test(f1 string, f2 string, f3 double) partitioned by (ds string);

2. Create a project in the DataHub console.

i. Log on to the DataHub console. In the upper-left corner, select a region.
ii. In the left-side navigation pane, click Project Manager.
iii. In the upper-right corner of the Project List page, click Create Project.
iv. In the Create Project dialog box, specify Name and Comment, and click Create.
3. Create a topic.
i. On the Project List page, find the project for which you want to create a topic and click View in the Operate column.
ii. In the upper-right corner of the project details page, click Create Topic. In the Create Topic dialog box, configure the parameters.


iii. Click Next Step to complete the topic configuration.

Note:
Schema corresponds to a MaxCompute table. The field names, data types, and field sequence specified by Schema must be consistent with those of the MaxCompute table. You can create a DataConnector only if these three conditions are met.
You can migrate topics of the TUPLE and BLOB types to MaxCompute tables.
A maximum of 20 topics can be created by default. If you require more topics, submit a ticket.
The owner of a DataHub topic or the Creator account has the permissions to manage a DataConnector. For example, you can create or delete a DataConnector.

4. Write data to the newly created topic.

i. Click View in the Operate column of the newly created topic.
ii. On the topic details page, click Connector.
iii. In the Create connector dialog box, click MaxCompute, configure the parameters, and then click Create.
5. View the DataConnector details.
i. In the left-side navigation pane, click Project Manager.
ii. On the Project List page, find the project whose DataConnector details you want to view and click View in the Operate column.
iii. On the Topic List tab, find the topic of the project and click View in the Operate column.
iv. On the topic details page, click the Connector tab to view the created DataConnector.
v. Click View to view the DataConnector details.

By default, DataHub migrates data to MaxCompute tables at five-minute intervals or when the amount of data reaches 60 MB. Sync Offset indicates the number of migrated data entries.


6. Execute the following statement to check whether the log data is migrated to MaxCompute:

SELECT * FROM test;

If the result shown in the following figure is displayed, the log data is migrated to MaxCompute.

2.15.4. Use DataWorks Data Integration to migrate log data to MaxCompute

This topic describes how to use DataWorks Data Integration to synchronize LogHub data to MaxCompute.

Context


3. Data development
3.1. Convert data types among STRING, TIMESTAMP, and DATETIME
This topic describes how to convert date values among the STRING, TIMESTAMP, and DATETIME types. This topic provides multiple date conversion methods that you can use to improve your business efficiency.

Common scenarios of data type conversions for date values:

STRING to TIMESTAMP
STRING to DATETIME
TIMESTAMP to STRING
TIMESTAMP to DATETIME
DATETIME to TIMESTAMP
DATETIME to STRING

STRING to TIMESTAMP
Scenarios

Convert a date value of the STRING type to the TIMESTAMP type. The date value of the TIMESTAMP type is in the yyyy-mm-dd hh:mi:ss.ff3 format.

Conversion methods

Use the CAST function.

Limits

Date values of the STRING type must be at least accurate to the second and must be specified in the yyyy-mm-dd hh:mi:ss format.

Examples
Example 1: Use the CAST function to convert the string 2009-07-01 16:09:00 to the TIMESTAMP type. Sample statement:

-- The return value is 2009-07-01 16:09:00.000.


select cast('2009-07-01 16:09:00' as timestamp);

Example 2: Incorrect usage of the CAST function

-- The return value is NULL because the input data value is invalid. The date value must be in the yyyy-mm-dd hh:mi:ss format.
select cast('2009-07-01' as timestamp);
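
If a string contains only a date, one possible workaround is to convert it to the DATETIME type with the TO_DATE function first and then cast the result to the TIMESTAMP type. The following statement is a sketch of this approach; the expected return value is 2009-07-01 00:00:00.000, but verify it in your own environment:

select cast(to_date('2009-07-01','yyyy-mm-dd') as timestamp);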

STRING to DATETIME
Scenarios

Convert a date value of the STRING type to the DATETIME type. The date value of the DATETIME type is in the yyyy-mm-dd hh:mi:ss format.

Conversion methods


Method 1: Use the CAST function.
Method 2: Use the TO_DATE function.

Limits
If you use the CAST function, the date value of the STRING type must be specified in the yyyy-mm-dd hh:mi:ss format.

If you use the TO_DATE function, you must set the value of the format parameter to yyyy-mm-dd hh:mi:ss.

Examples
Example 1: Use the CAST function to convert the string 2009-07-01 16:09:00 to the DATETIME type. Sample statement:

-- The return value is 2009-07-01 16:09:00.


select cast('2009-07-01 16:09:00' as datetime);

Example 2: Use the TO_DATE function and specify the format parameter to convert the string 2009-07-01 16:09:00 to the DATETIME type. Sample statement:

-- The return value is 2009-07-01 16:09:00.
select to_date('2009-07-01 16:09:00','yyyy-mm-dd hh:mi:ss');

Example 3: Incorrect usage of the CAST function

-- The return value is NULL because the input data value is invalid. The date value must be in the yyyy-mm-dd hh:mi:ss format.
select cast('2009-07-01' as datetime);

Example 4: Incorrect usage of the TO_DATE function

-- The return value is NULL because the input data value is invalid. The date value must be in the yyyy-mm-dd hh:mi:ss format.
select to_date('2009-07-01','yyyy-mm-dd hh:mi:ss');

TIMESTAMP to STRING
Scenarios

Convert a date value of the TIMESTAMP type to the STRING type. The value of the TIMESTAMP type is in the yyyy-mm-dd hh:mi:ss.ff3 format.

Conversion methods

Method 1: Use the CAST function.
Method 2: Use the TO_CHAR function. The format of the value after the conversion is specified by the format parameter.

Examples
Example 1: Use the CAST function to convert the TIMESTAMP value 2009-07-01 16:09:00 to the STRING type. To construct data of the TIMESTAMP type, the CAST function is used twice in this example. Sample statement:

-- The return value is 2009-07-01 16:09:00.


select cast(cast('2009-07-01 16:09:00' as timestamp) as string);


Example 2: Use the TO_CHAR function to convert the TIMESTAMP value 2009-07-01 16:09:00 to the STRING type. To construct data of the TIMESTAMP type, the CAST function is used once in this example. Sample statement:

-- The return value is 2009-07-01 16:09:00.


select to_char(cast('2009-07-01 16:09:00' as timestamp),'yyyy-mm-dd hh:mi:ss');

TIMESTAMP to DATETIME
Scenarios

Convert a date value of the TIMESTAMP type to the DATETIME type. Before the conversion, the date value of the TIMESTAMP type is in the yyyy-mm-dd hh:mi:ss.ff3 format. After the conversion, the date value of the DATETIME type is in the yyyy-mm-dd hh:mi:ss format.

Conversion methods

Method 1: Use the CAST function.
Method 2: Use the TO_DATE function.

Limits

If you use the TO_DATE function, you must set the value of the format parameter to yyyy-mm-dd hh:mi:ss.

Examples
Example 1: Use the CAST function to convert the TIMESTAMP value 2009-07-01 16:09:00 to the DATETIME type. To construct data of the TIMESTAMP type, the CAST function is used twice in this example. Sample statement:

-- The return value is 2009-07-01 16:09:00.


select cast(cast('2009-07-01 16:09:00' as timestamp) as datetime);

Example 2: Use the TO_DATE function and specify the format parameter to convert the TIMESTAMP value 2009-07-01 16:09:00 to the DATETIME type. To construct data of the TIMESTAMP type, the CAST function is used once in this example. Sample statement:

-- The return value is 2009-07-01 16:09:00.


select to_date(cast('2009-07-01 16:09:00' as timestamp),'yyyy-mm-dd hh:mi:ss');

DATETIME to TIMESTAMP
Scenarios

Convert a date value of the DATETIME type to the TIMESTAMP type. Before the conversion, the date value of the DATETIME type is in the yyyy-mm-dd hh:mi:ss format. After the conversion, the date value of the TIMESTAMP type is in the yyyy-mm-dd hh:mi:ss.ff3 format.

Conversion methods

Use the CAST function.

Examples

Use the CAST function to convert a DATETIME value to the TIMESTAMP type. To construct data of the DATETIME type, the GETDATE function is used once in this example. Sample statement:


-- The return value is 2021-10-14 10:21:47.939.


select cast(getdate() as timestamp);

DATETIME to STRING
Scenarios

Convert a date value of the DATETIME type to the STRING type. The date value of the DATETIME type is in the yyyy-mm-dd hh:mi:ss format.

Conversion methods

Method 1: Use the CAST function.
Method 2: Use the TO_CHAR function. The format of the value after the conversion is specified by the format parameter.

Examples
Example 1: Use the CAST function to convert a DATETIME value to the STRING type. To construct data of the DATETIME type, the GETDATE function is used once in this example. Sample statement:

-- The return value is 2021-10-14 10:21:47.


select cast(getdate() as string);

Example 2: Use the TO_CHAR function to convert a DATETIME value to the STRING type in a specified format. To construct data of the DATETIME type, the GETDATE function is used once in this example. Sample statements:

-- The return value is 2021-10-14 10:21:47.


select to_char (getdate(),'yyyy-mm-dd hh:mi:ss');
-- The return value is 2021-10-14.
select to_char (getdate(),'yyyy-mm-dd');
-- The return value is 2021.
select to_char (getdate(),'yyyy');
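
The TO_DATE and TO_CHAR functions can also be combined to reformat a date string from one STRING layout to another. The following statement is a sketch that assumes the yyyymmdd format token is accepted by TO_DATE in your project; the expected return value is 2009-07-01:

select to_char(to_date('20090701','yyyymmdd'),'yyyy-mm-dd');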

3.2. Use a MaxCompute UDF to convert IPv4 or IPv6 addresses into geolocations

The development of big data platforms allows you to process multiple types of unstructured and semi-structured data. For example, you can convert IP addresses into geolocations. This topic describes how to use a MaxCompute user-defined function (UDF) to convert IPv4 or IPv6 addresses into geolocations.

Prerequisites
Make sure that the following requirements are met:
The MaxCompute client is installed.

For more information about how to install and configure the MaxCompute client, see Install and configure the MaxCompute client.

MaxCompute Studio is installed and connected to a MaxCompute project. A MaxCompute Java module is created.


For more information, see Install MaxCompute Studio, Manage project connections, and Create a MaxCompute Java module.

Context
To convert IPv4 or IPv6 addresses into geolocations, you must download the IP address library file that includes the IP addresses, and upload the file to the MaxCompute project as a resource. After you develop and create a MaxCompute UDF based on the IP address library file, you can call the UDF in SQL statements to convert IP addresses into geolocations.

Usage notes
The IP address library file provided in this topic is for reference only. You must maintain the IP address library file based on your business requirements.

Procedure
To convert IPv4 or IPv6 addresses into geolocations by using a MaxCompute UDF, perform the following steps:

1. Step 1: Upload an IP address library file
Upload an IP address library file as a resource to your MaxCompute project. The resource is used when you create a MaxCompute UDF in subsequent steps.
2. Step 2: Write a MaxCompute UDF
Write a MaxCompute UDF by using IntelliJ IDEA.
3. Step 3: Create the MaxCompute UDF
Create the MaxCompute UDF.
4. Step 4: Call the MaxCompute UDF to convert an IP address into a geolocation
Call the MaxCompute UDF that you created in an SQL statement to convert IP addresses into geolocations.

Step 1: Upload an IP address library file

1. Download an IP address library file to your on-premises machine, decompress the file to obtain the ipv4.txt and ipv6.txt files, and then place the files in the installation directory of the MaxCompute client, ...\odpscmd_public\bin.

The IP address library file provided in this topic is for reference only. You must maintain the IP address library file based on your business requirements.

2. Start the MaxCompute client and go to the MaxCompute project to which you want to upload the ipv4.txt and ipv6.txt files.
3. Run the add file command to upload the two files as file resources to the MaxCompute project.
Sample commands:

add file ipv4.txt -f;
add file ipv6.txt -f;

For more information about how to add resources, see Add resources.


Step 2: Write a MaxCompute UDF

1. Create a Java class.
The Java class is used for writing a MaxCompute UDF in the next substep.
i. Start IntelliJ IDEA. In the left-side navigation pane of the Project tab, choose src > main > java, right-click java, and then choose New > Java Class.

ii. In the New Java Class dialog box, enter a class name, press Enter, and then enter the code in the code editor.
You must create three Java classes. The following sections show the names and code of these classes. You can reuse the code without modification.
IpUtils
IpUt ils

package com.aliyun.odps.udf.utils;
import java.math.BigInteger;
import java.net.Inet4Address;
import java.net.Inet6Address;
import java.net.InetAddress;
import java.net.UnknownHostException;
import java.util.Arrays;
public class IpUtils {
/**
* Convert the data type of IP addresses from STRING to LONG.
*
* @param ipInString
* IP addresses of the STRING type.
* @return Return the IP addresses of the LONG type.
*/
public static long StringToLong(String ipInString) {
ipInString = ipInString.replace(" ", "");
byte[] bytes;
if (ipInString.contains(":"))
bytes = ipv6ToBytes(ipInString);
else
bytes = ipv4ToBytes(ipInString);
BigInteger bigInt = new BigInteger(bytes);
// System.out.println(bigInt.toString());
return bigInt.longValue();
}
/**
* Convert the data type of IP addresses from STRING to LONG.
*
* @param ipInString
* IP addresses of the STRING type.
* @return Return the IP addresses of the STRING type that is converted from
BIGINT.
*/
public static String StringToBigIntString(String ipInString) {
ipInString = ipInString.replace(" ", "");
byte[] bytes;
if (ipInString.contains(":"))
bytes = ipv6ToBytes(ipInString);
else
bytes = ipv4ToBytes(ipInString);
BigInteger bigInt = new BigInteger(bytes);
return bigInt.toString();
}
/**
* Convert the data type of IP addresses from BIGINT to STRING.
*
* @param ipInBigInt
* IP addresses of the BIGINT type.
* @return Return the IP addresses of the STRING type.
*/
public static String BigIntToString(BigInteger ipInBigInt) {
byte[] bytes = ipInBigInt.toByteArray();
byte[] unsignedBytes = Arrays.copyOfRange(bytes, 1, bytes.length);
// Remove the sign bit.
try {
String ip = InetAddress.getByAddress(unsignedBytes).toString();
return ip.substring(ip.indexOf('/') + 1).trim();
} catch (UnknownHostException e) {
throw new RuntimeException(e);
}
}
/**
* Convert the data type of IPv6 addresses into signed byte 17.
*/

private static byte[] ipv6ToBytes(String ipv6) {


byte[] ret = new byte[17];
ret[0] = 0;
int ib = 16;
boolean comFlag=false;// IPv4/IPv6 flag.
if (ipv6.startsWith(":")) // Remove the colon (:) from the start of IPv6 addresses.
ipv6 = ipv6.substring(1);
String groups[] = ipv6.split(":");
for (int ig=groups.length - 1; ig > -1; ig--) {// Reverse scan.
if (groups[ig].contains(".")) {
// Both IPv4 and IPv6 addresses exist.
byte[] temp = ipv4ToBytes(groups[ig]);
ret[ib--] = temp[4];
ret[ib--] = temp[3];
ret[ib--] = temp[2];
ret[ib--] = temp[1];
comFlag = true;
} else if ("".equals(groups[ig])) {
// Zero-length compression. Calculate the number of missing groups.
int zlg = 9 - (groups.length + (comFlag ? 1 : 0));
while (zlg-- > 0) {// Set these groups to 0.
ret[ib--] = 0;
ret[ib--] = 0;
}
} else {
int temp = Integer.parseInt(groups[ig], 16);
ret[ib--] = (byte) temp;
ret[ib--] = (byte) (temp >> 8);
}
}
return ret;
}
/**
* Convert the data type of IPv4 addresses into signed byte 5.
*/
private static byte[] ipv4ToBytes(String ipv4) {
byte[] ret = new byte[5];
ret[0] = 0;
// Find the positions of the periods (.) in the IP addresses of the STRING type.
int position1 = ipv4.indexOf(".");
int position2 = ipv4.indexOf(".", position1 + 1);
int position3 = ipv4.indexOf(".", position2 + 1);
// Convert the IP addresses of the STRING type between periods (.) into INTEGER.
ret[1] = (byte) Integer.parseInt(ipv4.substring(0, position1));
ret[2] = (byte) Integer.parseInt(ipv4.substring(position1 + 1,
position2));
ret[3] = (byte) Integer.parseInt(ipv4.substring(position2 + 1,
position3));
ret[4] = (byte) Integer.parseInt(ipv4.substring(position3 + 1));
return ret;
}

/**
* @param ipAdress IPv4 or IPv6 addresses of the STRING type.
* @return 4:IPv4, 6:IPv6, 0: Invalid IP addresses.
* @throws Exception
*/
public static int isIpV4OrV6(String ipAdress) throws Exception {
InetAddress address = InetAddress.getByName(ipAdress);
if (address instanceof Inet4Address)
return 4;
else if (address instanceof Inet6Address)
return 6;
return 0;
}
/*
* Check whether the IP address belongs to a specific IP section.
*
* ipSection The IP sections that are separated by hyphens (-).
*
* The IP address to check.
*/
public static boolean ipExistsInRange(String ip, String ipSection) {
ipSection = ipSection.trim();
ip = ip.trim();
int idx = ipSection.indexOf('-');
String beginIP = ipSection.substring(0, idx);
String endIP = ipSection.substring(idx + 1);
return getIp2long(beginIP) <= getIp2long(ip)
&& getIp2long(ip) <= getIp2long(endIP);
}
public static long getIp2long(String ip) {
ip = ip.trim();
String[] ips = ip.split("\\.");
long ip2long = 0L;
for (int i = 0; i < 4; ++i) {
ip2long = ip2long << 8 | Integer.parseInt(ips[i]);
}
return ip2long;
}
public static long getIp2long2(String ip) {
ip = ip.trim();
String[] ips = ip.split("\\.");
long ip1 = Integer.parseInt(ips[0]);
long ip2 = Integer.parseInt(ips[1]);
long ip3 = Integer.parseInt(ips[2]);
long ip4 = Integer.parseInt(ips[3]);
long ip2long = 1L * ip1 * 256 * 256 * 256 + ip2 * 256 * 256 + ip3 * 256
+ ip4;
return ip2long;
}
public static void main(String[] args) {
System.out.println(StringToLong("2002:7af3:f3be:ffff:ffff:ffff:ffff:ffff"
));
System.out.println(StringToLong("54.38.72.63"));
}

private class Invalid{
private Invalid()
{
}
}
}

IpV4Obj


package com.aliyun.odps.udf.objects;
public class IpV4Obj {
public long startIp ;
public long endIp ;
public String city;
public String province;
public IpV4Obj(long startIp, long endIp, String city, String province) {
this.startIp = startIp;
this.endIp = endIp;
this.city = city;
this.province = province;
}
@Override
public String toString() {
return "IpV4Obj{" +
"startIp=" + startIp +
", endIp=" + endIp +
", city='" + city + '\'' +
", province='" + province + '\'' +
'}';
}
public void setStartIp(long startIp) {
this.startIp = startIp;
}
public void setEndIp(long endIp) {
this.endIp = endIp;
}
public void setCity(String city) {
this.city = city;
}
public void setProvince(String province) {
this.province = province;
}
public long getStartIp() {
return startIp;
}
public long getEndIp() {
return endIp;
}
public String getCity() {
return city;
}
public String getProvince() {
return province;
}
}


IpV6Obj

package com.aliyun.odps.udf.objects;
public class IpV6Obj {
public String startIp ;
public String endIp ;
public String city;
public String province;
public String getStartIp() {
return startIp;
}
@Override
public String toString() {
return "IpV6Obj{" +
"startIp='" + startIp + '\'' +
", endIp='" + endIp + '\'' +
", city='" + city + '\'' +
", province='" + province + '\'' +
'}';
}
public IpV6Obj(String startIp, String endIp, String city, String province) {
this.startIp = startIp;
this.endIp = endIp;
this.city = city;
this.province = province;
}
public void setStartIp(String startIp) {
this.startIp = startIp;
}
public String getEndIp() {
return endIp;
}
public void setEndIp(String endIp) {
this.endIp = endIp;
}
public String getCity() {
return city;
}
public void setCity(String city) {
this.city = city;
}
public String getProvince() {
return province;
}
public void setProvince(String province) {
this.province = province;
}
}

2. Write a MaxCompute UDF.


i. In the left-side navigation pane of the Project tab, choose src > main > java, right-click java, and then choose New > MaxCompute Java.

ii. In the Create new MaxCompute java class dialog box, click UDF and enter a class name in the Name field. Then, press Enter and enter the code in the code editor.

The following code shows how to write a UDF based on a Java class named IpLocation. You can reuse the code without modification.

package com.aliyun.odps.udf.udfFunction;
import com.aliyun.odps.udf.ExecutionContext;
import com.aliyun.odps.udf.UDF;
import com.aliyun.odps.udf.UDFException;
import com.aliyun.odps.udf.utils.IpUtils;

import com.aliyun.odps.udf.objects.IpV4Obj;
import com.aliyun.odps.udf.objects.IpV6Obj;
import java.io.*;
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.stream.Collectors;
public class IpLocation extends UDF {
public static IpV4Obj[] ipV4ObjsArray;
public static IpV6Obj[] ipV6ObjsArray;
public IpLocation() {
super();
}
@Override
public void setup(ExecutionContext ctx) throws UDFException, IOException {
//IPV4
if(ipV4ObjsArray==null)
{
BufferedInputStream bufferedInputStream = ctx.readResourceFileAsStream(
"ipv4.txt");
BufferedReader br = new BufferedReader(new InputStreamReader(bufferedInputStream));
ArrayList<IpV4Obj> ipV4ObjArrayList=new ArrayList<>();
String line = null;
while ((line = br.readLine()) != null) {
String[] f = line.split("\\|", -1);
if(f.length>=5)
{
long startIp = IpUtils.StringToLong(f[0]);
long endIp = IpUtils.StringToLong(f[1]);
String city=f[3];
String province=f[4];
IpV4Obj ipV4Obj = new IpV4Obj(startIp, endIp, city, province);
ipV4ObjArrayList.add(ipV4Obj);
}
}
br.close();
List<IpV4Obj> collect = ipV4ObjArrayList.stream().sorted(Comparator.comparing(IpV4Obj::getStartIp)).collect(Collectors.toList());
ArrayList<IpV4Obj> basicIpV4DataList=(ArrayList)collect;
IpV4Obj[] ipV4Objs = new IpV4Obj[basicIpV4DataList.size()];
ipV4ObjsArray = basicIpV4DataList.toArray(ipV4Objs);
}
//IPV6
if(ipV6ObjsArray==null)
{
BufferedInputStream bufferedInputStream = ctx.readResourceFileAsStream(
"ipv6.txt");
BufferedReader br = new BufferedReader(new InputStreamReader(bufferedInputStream));
ArrayList<IpV6Obj> ipV6ObjArrayList=new ArrayList<>();
String line = null;
while ((line = br.readLine()) != null) {
String[] f = line.split("\\|", -1);
if(f.length>=5)
{
String startIp = IpUtils.StringToBigIntString(f[0]);
String endIp = IpUtils.StringToBigIntString(f[1]);
String city=f[3];
String province=f[4];
IpV6Obj ipV6Obj = new IpV6Obj(startIp, endIp, city, province);
ipV6ObjArrayList.add(ipV6Obj);
}
}
br.close();
List<IpV6Obj> collect = ipV6ObjArrayList.stream().sorted(Comparator.comparing(IpV6Obj::getStartIp)).collect(Collectors.toList());
ArrayList<IpV6Obj> basicIpV6DataList=(ArrayList)collect;
IpV6Obj[] ipV6Objs = new IpV6Obj[basicIpV6DataList.size()];
ipV6ObjsArray = basicIpV6DataList.toArray(ipV6Objs);
}
}
public String evaluate(String ip){
if(ip==null||ip.trim().isEmpty()||!(ip.contains(".")||ip.contains(":")))
{
return null;
}
int ipV4OrV6=0;
try {
ipV4OrV6= IpUtils.isIpV4OrV6(ip);
} catch (Exception e) {
return null;
}
// IPv4 addresses are used.
if(ipV4OrV6==4)
{
int i = binarySearch(ipV4ObjsArray, IpUtils.StringToLong(ip));
if(i>=0)
{
IpV4Obj ipV4Obj = ipV4ObjsArray[i];
return ipV4Obj.city+","+ipV4Obj.province;
}else{
return null;
}
} else if(ipV4OrV6==6)// IPv6 addresses are used.
{
int i = binarySearchIPV6(ipV6ObjsArray, IpUtils.StringToBigIntString(ip
));
if(i>=0)
{
IpV6Obj ipV6Obj = ipV6ObjsArray[i];
return ipV6Obj.city+","+ipV6Obj.province;
}else{
return null;
}
} else{// IP addresses are invalid.
return null;
}

}
@Override
public void close() throws UDFException, IOException {
super.close();
}
private static int binarySearch(IpV4Obj[] array,long ip){
int low=0;
int hight=array.length-1;
while (low<=hight)
{
int middle=(low+hight)/2;
if((ip>=array[middle].startIp)&&(ip<=array[middle].endIp))
{
return middle;
}
if (ip < array[middle].startIp)
hight = middle - 1;
else {
low = middle + 1;
}
}
return -1;
}
private static int binarySearchIPV6(IpV6Obj[] array,String ip){
int low=0;
int hight=array.length-1;
while (low<=hight)
{
int middle=(low+hight)/2;
if((ip.compareTo(array[middle].startIp)>=0)&&(ip.compareTo(array[middle
].endIp)<=0))
{
return middle;
}
if (ip.compareTo(array[middle].startIp) < 0)
hight = middle - 1;
else {
low = middle + 1;
}
}
return -1;
}
private class Invalid{
private Invalid()
{
}
}
}

3. Debug the MaxCompute UDF to check whether the code runs as expected.
For more information about how to debug UDFs, see Perform a local run to debug the UDF.
i. Right-click the MaxCompute UDF script that you wrote and select Run.


ii. In the Run/Debug Configurations dialog box, configure the required parameters and click OK, as shown in the following figure.

If no error is returned, the code runs successfully and you can proceed with the subsequent steps. If an error is reported, troubleshoot the issue based on the error information displayed in IntelliJ IDEA.

Note: The parameter settings in the preceding figure are provided for reference.

Step 3: Create the MaxCompute UDF

1. Right-click the MaxCompute UDF script that you compiled and select Deploy to server….


2. In the Package a jar, submit resource and register function dialog box, configure the parameters.
For more information about the parameters, see Package a Java program, upload the package, and create a MaxCompute UDF.


Extra resources: You must select the IP address library files ipv4.txt and ipv6.txt that you uploaded in Step 1. In this topic, the created function is named ipv4_ipv6_aton.

Step 4: Call the MaxCompute UDF to convert an IP address into a geolocation

1. Start the MaxCompute client.
2. Execute an SQL SELECT statement to call the MaxCompute UDF to convert an IPv4 or IPv6 address into a geolocation.
Sample statements:
Convert an IPv4 address into a geolocation:

select ipv4_ipv6_aton('116.11.34.15');

The following result is returned:

Beihai, Guangxi Zhuang Autonomous Region

Convert an IPv6 address into a geolocation:

select ipv4_ipv6_aton('2001:0250:080b:0:0:0:0:0');

The following result is returned:

Baoding, Hebei Province
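
The UDF can also be applied to a column of IP addresses. The following statement is a minimal sketch that assumes a hypothetical table named access_log with a STRING column named client_ip:

select client_ip, ipv4_ipv6_aton(client_ip) as geolocation
from access_log
limit 10;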

3.3. Use IntelliJ IDEA to develop a Java UDF

IntelliJ IDEA is an integrated development environment (IDE) that is written in Java and helps you develop Java programs. This topic describes how to use MaxCompute Studio, a plug-in that is developed based on IntelliJ IDEA, to develop a user-defined function (UDF) that converts uppercase letters into lowercase letters.

Prerequisites
Make sure that the following operations are performed:
1. Install MaxCompute Studio.
2. Establish a connection to a MaxCompute project.
3. Create a MaxCompute Java module.

Procedure
1. Write a UDF in Java.
i. In the left-side navigation pane of the Project tab, choose src > main > java, right-click java, and then choose New > MaxCompute Java.


ii. In the Create new MaxCompute java class dialog box, click UDF, enter a class name in the Name field, and then press Enter. In this example, the class is named Lower.

Name: the name of the MaxCompute Java class. If no package is created, enter packagename.classname. The system automatically generates a package.

iii. Write code in the code editor.

Sample code:

package <packagename>;
import com.aliyun.odps.udf.UDF;
public final class Lower extends UDF {
public String evaluate(String s) {
if (s == null) {
return null;
}
return s.toLowerCase();
}
}

2. Debug the UDF to check whether the code runs as expected.

i. In the java directory, right-click the Java script that you wrote and select Run.


ii. In the Run/Debug Configurations dialog box, configure the required parameters.

MaxCompute project: the MaxCompute project in which the UDF runs. To perform a local run, select local from the drop-down list.
MaxCompute table: the name of the MaxCompute table on which the UDF runs.
Table columns: the columns of the MaxCompute table on which the UDF runs.
iii. Click OK. The following figure shows the returned result.


3. Create a MaxCompute UDF.

i. Right-click the UDF Java file and select Deploy to server....
ii. In the Package a jar, submit resource and register function dialog box, configure the parameters.

MaxCompute project: the name of the MaxCompute project to which the UDF belongs. Retain the default value, which indicates the MaxCompute project to which the connection was established when you wrote the UDF.
Resource file: the path of the resource file on which the UDF depends. Retain the default value.
Resource name: the name of the resource on which the UDF depends. Retain the default value.
Function name: the name of the UDF that you want to create. This name is used in the SQL statements that call the UDF. Example: Lower_test.
iii. Click OK.
4. Call the UDF.
In the left-side navigation pane, click the Project Explore tab. Right-click the MaxCompute project to which the UDF belongs, select Open in Console, enter the SQL statement that is used to call the UDF, and then press Enter to execute the SQL statement.


Sample statement:

select Lower_test('ALIYUN');

The following figure shows the result that the preceding statement returns. The result indicates that the Java UDF Lower_test runs as expected.
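
The registered UDF can also be applied to a table column. The following statement is a minimal sketch that assumes a hypothetical table named t_user with a STRING column named name:

select name, Lower_test(name) as name_lower from t_user limit 10;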

3.4. Use MaxCompute to query geolocations of IP addresses

This topic describes how to query geolocations of IP addresses by using MaxCompute. To query geolocations of IP addresses, you must download an IP address geolocation library, upload the library to MaxCompute, create a user-defined function (UDF), and then execute an SQL statement.

Prerequisites

Context
To query the geolocation of an IP address, you can send an HTTP request to call the API provided by the IP address geolocation library of Taobao. After the API is called, a string that indicates the geolocation of the IP address is returned. The following figure shows an example of a returned string.

You cannot send HTTP requests in MaxCompute. You can query geolocations of IP addresses in MaxCompute by using one of the following methods:

Execute SQL statements to download data in the IP address geolocation library to your on-premises machine. Then, send HTTP requests to query the geolocation information.

Note: This method is inefficient. The query frequency must be less than 10 queries per second (QPS). Otherwise, query requests are rejected by the IP address geolocation library of Taobao.

Download the IP address geolocation library to your on-premises machine. Then, query the geolocation information in the library.

Note: This method is inefficient and is not suitable for scenarios in which data is analyzed by using data warehouses.

Maintain an IP address geolocation library and upload it to MaxCompute on a regular basis. Then, query geolocations of IP addresses in the IP address geolocation library.

Note: This method is efficient. You must maintain the IP address geolocation library on a regular basis.

Download an IP address geolocation library

1. Obtain an IP address geolocation library. In this example, the sample IP address geolocation library ipdata.txt.utf8 is used. This IP address geolocation library is a library demo in the UTF-8 format.
2. Download the ipdata.txt.utf8 file. The following figure shows the data in the file.

The following content describes the data in the sample IP address geolocation library.

The data is in the UTF-8 format.

The first two strings in a data record are the start IP address and the end IP address of an IP address range, in the decimal integer literal format. The third and fourth strings are equivalent to the first two strings, but are expressed in dotted decimal notation. The decimal integer literal format helps you check whether an IP address is within a specific IP address range.

Note: You can also use your own IP address geolocation library.

Upload the IP address geolocation library


1. Execute the following statements on the MaxCompute client to create a table named ipresource. This table is used to store geolocation data of IP addresses.


DROP TABLE IF EXISTS ipresource;

CREATE TABLE IF NOT EXISTS ipresource
(
    start_ip      BIGINT
    ,end_ip       BIGINT
    ,start_ip_arg STRING
    ,end_ip_arg   STRING
    ,country      STRING
    ,area         STRING
    ,city         STRING
    ,county       STRING
    ,isp          STRING
);

2. Run the following Tunnel command to upload data in the ipdata.txt.utf8 file to the ipresource table:

odps@ workshop_demo>tunnel upload D:/ipdata.txt.utf8 ipresource;

In the preceding command, D:/ipdata.txt.utf8 is the on-premises path of the ipdata.txt.utf8 file.

For more information about the command, see Tunnel commands.

You can execute the following statement to check whether the data in the file is uploaded:

-- Query the number of data records in the ipresource table.


select count(*) from ipresource;

3. Execute the following SQL statement to obtain the first 10 data records in the ipresource table:

select * from ipresource limit 10;

The following result is returned.

Create a UDF
1.
2. Create a Python resource.
i.
ii. In the Create Resource dialog box, enter a resource name, select Upload to MaxCompute, and then click Create.


iii. Enter the following code in the Python resource and click the icon.

from odps.udf import annotate

# Convert a dotted-decimal IPv4 address string to its BIGINT representation.
@annotate("string->bigint")
class ipint(object):
    def evaluate(self, ip):
        try:
            # Fold the four octets into one integer: ((a*256 + b)*256 + c)*256 + d.
            return reduce(lambda x, y: (x << 8) + y, map(int, ip.split('.')))
        except:
            return 0

iv. Click the icon.

3. Create a function.

i. Right-click the workflow that you created and choose Create > MaxCompute > Function.
ii. In the Create Function dialog box, enter a function name, and click Create.

Note: If multiple MaxCompute engines are bound to the workspace, select one of the engines from the Engine Instance drop-down list.

iii. On the configuration tab of the function, configure the parameters.

Function Type: The type of the function. Valid values: Mathematical Operation Functions, Aggregate Functions, String Processing Functions, Date Functions, Window Functions, and Other Functions.

Engine Instance: A default value (MaxCompute) is displayed and cannot be changed.

Function Name: The name of the function. You can use the function name to reference the function in SQL statements. The function name must be globally unique and cannot be changed after the function is created.

Owner: The value of this parameter is automatically displayed.

Class Name: Required. The name of the class that implements the function. Note: If the resource type is Python, enter the class name in the "Python resource name.Class name" format. Do not include the .py extension in the resource name.

Resources: Required. The list of resources. You can search for existing resources in the workspace in fuzzy match mode. Separate multiple resources with commas (,).

Description: The description of the function.

Expression Syntax: The syntax of the function. Example: test.

Parameter Description: The description of the supported data types of input and output parameters.

Return Value: Optional. The value to return. Example: 1.

Example: Optional. The example of the function.

4. Click the icon in the toolbar.

5. Commit the function.

i. Click the icon in the toolbar to commit the function.

ii. In the Commit Node dialog box, enter your comments in the Change description field.
iii. Click OK.

Use the UDF that you created in an SQL statement to query the geolocation of an IP address
1.
2.
3. On the configuration tab of the ODPS SQL node, enter the following statement:

select * from ipresource


WHERE ipint('1.2.24.2') >= start_ip
AND ipint('1.2.24.2') <= end_ip
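An equivalent way to write the filter, selecting only the geolocation columns defined in the ipresource table, is shown below (the IP address 1.2.24.2 is only an example):

select country, area, city, county, isp
from ipresource
where ipint('1.2.24.2') between start_ip and end_ip;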

4.
5.

3.5. Resolve the issue that you cannot upload files that exceed 10 MB to DataWorks
The JAR package or resource file that is used in a MapReduce job often exceeds 10 MB. However, the maximum size of a file that you can upload to DataWorks is 10 MB. This topic describes how to resolve this issue to schedule a MapReduce job where a JAR package or resource file that exceeds 10 MB is used.

Prerequisites
The MaxCompute client is installed. For more information, see Install and configure the MaxCompute client.

Procedure
1. Run the following command on the MaxCompute client to upload a JAR package that exceeds 10 MB:

--Upload a JAR package.


add jar C:\test_mr\test_mr.jar -f;

2. Resources that you upload on the MaxCompute client are not displayed on the DataStudio page of the DataWorks console. You must run the following command to check whether the resource is uploaded:

--View resources.
list resources;

3. Reduce the size of the JAR package. DataWorks runs a MapReduce job on the computer where the MaxCompute client resides. Therefore, you can submit only the Main function to DataWorks to run a MapReduce job.

jar
-resources test_mr.jar,test_ab.jar -- A file can be referenced after it is registered on the MaxCompute client.
-classpath test_mr.jar -- Reduce the size of a JAR package by using the following method: submit only the Mapper and Reducer that contain the Main function on the gateway. You do not need to submit third-party dependencies. You can store the resources in the wc_in directory of the MaxCompute client.

3.6. Grant a specified user the access permissions on a specific UDF
This topic describes how to grant specified users the access permissions on specific resources, such as tables and user-defined functions (UDFs). This operation involves data encryption and decryption algorithms and is related to data security.

Prerequisites
The MaxCompute client is installed. For more information, see Install and configure the MaxCompute client.

Context
You can use one of the following methods to manage the access permissions of users:
Use packages to achieve fine-grained access control.

This method is used for data sharing and resource authorization across projects. After you assign the developer role to a user by using a package, the user has full permissions on all objects in the package. This may cause uncontrollable risks. For more information, see Cross-project resource access based on packages.

The following figure shows the permissions of the developer role that is defined in DataWorks.

By default, the developer role has full permissions on all packages, functions, resources, and tables in a workspace. This does not meet the requirements for permission management.
The following figure shows the permissions that are granted to a RAM user that is assigned the developer role in DataWorks.

You cannot grant a specified user the access permissions on a specific UDF by using package-based authorization or by assigning the developer role in DataWorks to the user. For example, if you assign the developer role to the RAM user named [email protected]:ramtest, the RAM user has full permissions on all objects in the current workspace. For more information, see Authorize users.

Create a role in the DataWorks console to manage access permissions.

On the MaxCompute Management page in the DataWorks console, you can manage the access permissions of custom user roles. On this page, you can grant permissions on a table or a project. You cannot grant permissions on resources or UDFs.


Note: For more information about MaxCompute projects for DataWorks workspaces, see Configure MaxCompute.

Use a role policy and a project policy.

Role and project policies allow you to grant a specified user the permissions on specific resources.

Note: To ensure security, we recommend that you verify role and project policies in a test workspace.

You can use a role policy and a project policy to grant access permissions on a specific UDF to a specified user.

To prevent a user from accessing a specific resource in a workspace, assign the developer role to the user in the DataWorks console and configure a role policy for the user to deny access requests for the resource on the MaxCompute client.
To allow a user to access a specific resource, assign the developer role to the user in the DataWorks console and configure a project policy for the user to allow access requests for the resource on the MaxCompute client.

Procedure
1. Create a role that has no permission to access a UDF named getregion by default.
i. On the MaxCompute client, run the following command to create a role named denyudfrole:

create role denyudfrole;

ii. Create a role policy file that contains the following content:

{
    "Version": "1", "Statement":
    [{
        "Effect":"Deny",
        "Action":["odps:Read","odps:List"],
        "Resource":"acs:odps:*:projects/sz_mc/resources/getaddr.jar"
    },
    {
        "Effect":"Deny",
        "Action":["odps:Read","odps:List"],
        "Resource":"acs:odps:*:projects/sz_mc/registration/functions/getregion"
    }
] }

iii. Configure the storage path for the role policy file.

On the MaxCompute client, run the following command:

put policy /Users/yangyi/Desktop/role_policy.json on role denyudfrole;


iv. On the MaxCompute client, run the following command to view the role policy:

get policy on role denyudfrole;

The following result is returned:

v. On the MaxCompute client, run the following command to assign the denyudfrole role to a RAM user:

grant denyudfrole to [email protected]:ramtest;

2. Check whether the denyudfrole role is created.

i. Log on to the MaxCompute client as the RAM user to which the denyudfrole role is assigned. Then, run the whoami; command to check the current logon user.

ii. Run the show grants; command to check the permissions of the current logon user.

The result indicates that the RAM user has the following two roles: role_project_dev and denyudfrole. role_project_dev is the default developer role in DataWorks.


iii. Check the permissions of the RAM user on the getregion UDF and its dependencies.

The result indicates that the RAM user with the developer role in DataWorks does not have read permissions on the getregion UDF. You can perform the next step to configure a project policy to ensure that only a specified RAM user can access the UDF.

3. Configure a project policy.

i. Create a project policy file that contains the following content:

{
"Version": "1", "Statement":
[{
"Effect":"Allow",
"Principal":"[email protected]:yangyitest",
"Action":["odps:Read","odps:List","odps:Select"],
"Resource":"acs:odps:*:projects/sz_mc/resources/getaddr.jar"
},
{
"Effect":"Allow",
"Principal":"[email protected]:yangyitest",
"Action":["odps:Read","odps:List","odps:Select"],
"Resource":"acs:odps:*:projects/sz_mc/registration/functions/getregion"
}] }

ii. Configure the storage path for the project policy file.

On the MaxCompute client, run the following command:

put policy /Users/yangyi/Desktop/project_policy.json;

iii. On the MaxCompute client, run the following command to view the project policy:

get policy;

The following result is returned:


iv. Run the whoami; command to check the current logon user. Then, run the show grants; command to check the permissions of the user.

v. Run an SQL job to check whether only the specified RAM user can access the specific UDF and its dependencies.


The following result indicates that the specified RAM user can access the specific UDF:

The following result indicates that the specified RAM user can access the dependencies of the UDF:

3.7. Use a PyODPS node to segment Chinese text based on Jieba
This topic describes how to use a PyODPS node in DataWorks to segment Chinese text based on the open source segmentation tool Jieba and write the segmented words and phrases to a new table. This topic also describes how to use closure functions to segment Chinese text based on a custom dictionary.

Prerequisites
A DataWorks workspace is created. In this example, a workspace in basic mode is used. The workspace is associated with multiple MaxCompute compute engines. For more information, see Create a workspace.
The open source Jieba package is downloaded from GitHub.

Context
PyODPS nodes are integrated with MaxCompute SDK for Python. You can directly edit Python code and use MaxCompute SDK for Python on PyODPS nodes of DataWorks. For more information about PyODPS nodes, see Create a PyODPS 2 node.

This topic describes how to use a PyODPS node to segment Chinese text based on Jieba.

Use open source packages to segment Chinese text based on Jieba

Use custom dictionaries to segment Chinese text based on Jieba

Notice: Sample code in this topic is for reference only. We recommend that you do not use the code in your production environment.

Use open source packages to segment Chinese text based on Jieba


1. Create a workflow.
i. Log on to the DataWorks console.
ii. In the left-side navigation pane, click Workspaces.
iii. Select the region where the workspace resides, find the workspace, and then click Data Development in the Actions column.

iv. Move the pointer over the icon and click Workflow.

v. In the Create Workflow dialog box, specify the Workflow Name and Description parameters. Then, click Create.

Notice: The workflow name must be 1 to 128 characters in length, and can contain letters, digits, underscores (_), and periods (.).

2. Upload the jieba-master.zip package.

i. Click the workflow that you created, expand MaxCompute, right-click Resource, and then choose Create > Archive.
ii. In the Create Resource dialog box, configure the parameters and click Create. The following list describes the parameters.


Engine Type: Select the compute engine where the resource resides from the drop-down list. Note: If only one instance is bound to your workspace, this parameter is not displayed.

Engine Instance: The name of the MaxCompute engine to which the task is bound.

Location: The folder that is used to store the resource. The default value is the path of the current folder. You can modify the path based on your business requirements.

File Type: Select Archive from the File Type drop-down list. Note: If the resource package has been uploaded to the MaxCompute client, clear Upload to MaxCompute. Otherwise, an error occurs during the upload process.

File: Click Upload, select the downloaded file named jieba-master.zip from your on-premises machine, and then click Open.

Resource Name: The name of the resource. The resource name can be different from the name of the file that you uploaded but must comply with the following conventions:
The resource name can contain only letters, digits, periods (.), underscores (_), and hyphens (-).
If you select Archive from the File Type drop-down list, the extension of the resource name must be the same as that of the file name. The extension can be .zip, .tgz, .tar.gz, or .tar.

iii. In the toolbar, click the icon.

iv. In the Commit dialog box, specify the Change description parameter and click Commit.
3. Create a table that is used to store test data.
i. Click the workflow that you created, expand MaxCompute, right-click Table, and then select Create Table.
ii. In the Create Table dialog box, specify the Table Name parameter and click Create.

Note: In this example, the table name is jieba_test.
iii. Click DDL Statement and enter the following DDL statement to create a table:

CREATE TABLE jieba_test (
    `chinese` string,
    `content` string
);

Note: The table in this example contains two columns. You can segment text in one column during data development.

iv. In the Confirm message, click OK.

v. In the General section, specify the Display Name parameter for the table. Click Commit to Production Environment.
vi. In the Commit to Production Environment dialog box, select I am aware of the risk and confirm the commissions and click OK.
4. Use the same method to create a table named jieba_result. This table is used to store the test result. Sample DDL statement:

CREATE TABLE jieba_result (
    `chinese` string
);

Note: In this example, only the text in the chinese column of the test data is segmented. Therefore, the result table contains only one column.

5. Click Test Data to download the test data.


6. Upload test data.

i. Click the icon on the DataStudio page.

ii. In the Data Import Wizard dialog box, enter the name of the test table jieba_test to which you want to import data, select the table, and then click Next.
iii. Click Browse, upload the jieba_test.csv file from your on-premises machine, and then click Next.
iv. Select By Name and click Import Data.
7. Create a PyODPS 2 node.
i. Click the workflow, expand MaxCompute, right-click Data Analytics, and then choose Create > PyODPS 2.
ii. In the Create Node dialog box, specify the Node Name and Location parameters and click Commit.

Note:
The node name must be 1 to 128 characters in length, and can contain letters, digits, underscores (_), and periods (.).
In this example, the Node Name parameter is set to word_split.

iii. On the configuration tab of the node, enter the following PyODPS code:

def test(input_var):
    import jieba
    import sys
    reload(sys)
    sys.setdefaultencoding('utf-8')
    result = jieba.cut(input_var, cut_all=False)
    return "/ ".join(result)
hints = {
    'odps.isolation.session.enable': True
}
libraries = ['jieba-master.zip']  # Reference the jieba-master.zip package.
iris = o.get_table('jieba_test').to_df()  # Reference the data in the jieba_test table.
example = iris.chinese.map(test).execute(hints=hints, libraries=libraries)
print(example)  # Display the text segmentation result. The result is of the MAP type.
abci = list(example)  # Convert the text segmentation result into the LIST type.
i = 0
for i in range(i, len(abci)):
    pq = str(abci[i])
    o.write_table('jieba_result', [pq])  # Write the data records to the jieba_result table one by one.
    i += 1
else:
    print("done")

iv. Click the icon in the toolbar to save the code.

v. Click the icon in the toolbar. In the Parameters dialog box, select a resource group from the Resource Group drop-down list and click OK.

Note: For more information about resource groups for scheduling, see Overview.

vi. View the execution result of the Jieba segmentation program on the Runtime Log tab in the lower part of the page.
8. Create and run an ODPS SQL node.
i. Click the workflow, expand MaxCompute, right-click Data Analytics, and then choose Create > ODPS SQL.
ii. In the Create Node dialog box, specify the Node Name and Location parameters and click Commit.

Note: The node name must be 1 to 128 characters in length and can contain letters, digits, underscores (_), and periods (.).

iii. On the configuration tab of the node, enter the following SQL statement:

select * from jieba_result;

iv. Click the icon in the toolbar to save the code.

v. Click the icon in the toolbar. In the Parameters dialog box, select a resource group from the Resource Group drop-down list and click OK.

Note: For more information about resource groups for scheduling, see Overview.

vi. In the Expense Estimate dialog box, check the estimated cost and click Run.
vii. View the execution result on the Runtime Log tab in the lower part of the page.

Use custom dictionaries to segment Chinese text based on Jieba


If the dictionary of the Jieba tool does not meet your requirements, you can use a custom dictionary.

You can use a PyODPS user-defined function (UDF) to read table or file resources that are uploaded to MaxCompute. In this case, you must write the UDF as a closure function or a callable class. If you need to reference complex UDFs, you can create a MaxCompute function in DataWorks. For more information, see Register a MaxCompute function.

In this topic, a closure function is used to reference the custom dictionary file key_words.txt that is uploaded to MaxCompute.

Note: In this example, the custom dictionary file name is key_words.txt.

1. Click the workflow, expand MaxCompute, right-click Resource, and then choose Create > File.
2. In the Create Resource dialog box, configure the parameters and click Create. The following list describes the parameters.


Engine Type: Select the compute engine where the resource resides from the drop-down list. Note: If only one instance is bound to your workspace, this parameter is not displayed.

Engine Instance: The name of the MaxCompute engine to which the task is bound.

Location: The folder that is used to store the resource. The default value is the path of the current folder. You can modify the path based on your business requirements.

File Type: Select File from the File Type drop-down list. Note: If you want to upload a dictionary file from the on-premises machine to DataWorks, the file must be encoded in UTF-8.

Upload: Click Upload, select the key_words.txt file from your on-premises machine, and then click Open.

Resource Name: The name of the resource. The resource name can contain only letters, digits, periods (.), underscores (_), and hyphens (-).

3. On the configuration tab of the key_words.txt resource, enter the content of the custom dictionary. Dictionary format:
Each word occupies one line.
Each line contains the following parts in sequence: word, frequency, and part of speech. The frequency and part of speech are optional. Separate every two parts with a space. The order of the three parts cannot be adjusted. For example, a line such as 云计算 5 n adds the word 云计算 with a frequency of 5 and the part of speech n.

4. Click the icon in the toolbar to commit the resource.

5. Create a PyODPS 2 node and enter the following code:

def test(resources):
    import jieba
    import sys
    reload(sys)
    sys.setdefaultencoding('utf-8')
    fileobj = resources[0]
    def h(input_var):  # Use the nested function h() to load the dictionary and segment text.
        import jieba
        jieba.load_userdict(fileobj)
        result = jieba.cut(input_var, cut_all=False)
        return "/ ".join(result)
    return h
hints = {
    'odps.isolation.session.enable': True
}
libraries = ['jieba-master.zip']  # Reference the jieba-master.zip package.
iris = o.get_table('jieba_test').to_df()  # Reference the data in the jieba_test table.

file_object = o.get_resource('key_words.txt')  # Use the get_resource() function to reference the MaxCompute resource.

example = iris.chinese.map(test, resources=[file_object]).execute(hints=hints, libraries=libraries)  # Call the map function to pass the resources parameter.
print(example)  # Display the text segmentation result. The result is of the MAP type.
abci = list(example)  # Convert the text segmentation result into the LIST type.
i = 0  # Initialize the loop counter.
for i in range(i, len(abci)):
    pq = str(abci[i])
    o.write_table('jieba_result', [pq])  # Write the data records to the jieba_result table one by one.
    i += 1
else:
    print("done")

6. Run the code and compare the results before and after the custom dictionary is referenced.

3.8. Use a PyODPS node to download data to a local directory for processing or to process data online

This topic describes how to use a PyODPS node to download data to a local directory for processing or to process data online.

Background information

PyODPS provides multiple methods to download data to a local directory. You can download data to a local directory for processing and then upload the data to MaxCompute. However, local data processing is inefficient because the massively parallel processing capability of MaxCompute cannot be used if you download data to a local directory. If the data volume is greater than 10 MB, we recommend that you do not download data to a local directory for processing. You can use one of the following methods to download data to a local directory:

Use the head, tail, or to_pandas method. In most cases, use the head or tail method to obtain small volumes of data. If you want to obtain large volumes of data, use the persist method to store data in a MaxCompute table. For more information, see Execution.
Use the open_reader method. You can execute open_reader on a table or an SQL instance to obtain the data. If you need to process large volumes of data, we recommend that you use PyODPS DataFrame or MaxCompute SQL. A PyODPS DataFrame object is created based on a MaxCompute table. This method provides higher efficiency than local data processing.

Sample code
Convert a JSON string to multiple rows. Each row consists of a key and its value.

For local testing, use the head method to obtain small volumes of data:

In [12]: df.head(2)
json
0 {"a": 1, "b": 2}
1 {"c": 4, "b": 3}
In [14]: from odps.df import output
In [16]: @output(['k', 'v'], ['string', 'int'])
...: def h(row):
...: import json
...: for k, v in json.loads(row.json).items():
...: yield k, v
...:
In [21]: df.apply(h, axis=1).head(4)
k v
0 a 1
1 b 2
2 c 4
3 b 3

For online production, use the persist method to store large volumes of data in a MaxCompute table:

In [14]: from odps.df import output


In [16]: @output(['k', 'v'], ['string', 'int'])
...: def h(row):
...: import json
...: for k, v in json.loads(row.json).items():
...: yield k, v
...:
In [21]: df.apply(h, axis=1).persist('my_table')


4.Compute optimization
4.1. Optimize SQL statements
This topic describes common scenarios where you can optimize SQL statements to achieve better performance.

Reduce impacts of data skew


Data skew can lead to an extreme imbalance of work. When data is skewed, some workers need to process larger amounts of data than the others. As a result, these workers take much longer to complete. This prolongs the overall time used to process data and may lead to latency.

Skewed joins
An imbalance of work may occur when you join tables based on a key that is not evenly distributed. For example, execute the following statement to join a large table named A and a small table named B:

select * from A join B on A.value= B.value;

Copy the Logview URL of the query and open it in a browser to go to the Logview page. Double-click the Job Scheduler job that performs the JOIN operation. On the Long-tails tab, you can see that long tails exist, as shown in the following figure. This indicates that data is skewed.

To optimize the preceding statement, you can use one of the following methods:

Use a MAPJOIN statement. Table B is a small table which does not exceed 512 MB in size. In this case, you can replace the preceding statement with the following statement:

select /*+ MAPJOIN(B) */ * from A join B on A.value= B.value;


Handle the skewed key separately. If data skew occurs because a large number of null key values exist in both tables, you must filter out these null values or generate random numbers to replace them before you perform the JOIN operation. For example, you can replace the preceding statement with the following statement:

select * from A join B
on case when A.value is null then concat('value',rand()) else A.value end = B.value;

The following example describes how to identify the key values that cause data skew:

-- Data skew leads to an imbalance of work when the following statement is executed:
select * from a join b on a.key=b.key;
-- Execute the following statement to view the distribution of key values and identify
the key values that cause data skew:
select left.key, left.cnt * right.cnt from
(select key, count(*) as cnt from a group by key) left
join
(select key, count(*) as cnt from b group by key) right
on left.key=right.key;

Skewed GROUP BY operations

An imbalance of work may occur when you perform a GROUP BY operation based on a key that is not evenly distributed.

Assume that Table A has two fields, which are Key and Value. The table contains a large amount of data and the values of the Key field are not evenly distributed. Execute the following statement to perform a GROUP BY operation on the table:

select key,count(value) from A group by key;

When the amount of data in the table is large enough, you may find long tails on the Logview page of the query. To resolve the issue, add set odps.sql.groupby.skewindata=true before the preceding statement to enable anti-skew before the query is performed, as shown below.
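For example, the statement above can be run with the anti-skew setting enabled as follows:

set odps.sql.groupby.skewindata=true;
select key,count(value) from A group by key;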

Skewed reduce tasks during improper use of dynamic partitioning

If you use dynamic partitioning in MaxCompute, one or more reduce tasks are assigned to each partition to aggregate data by partition. This brings the following benefits:

Reduce the number of small files generated by MaxCompute and improve the efficiency of processing.
Avoid high memory usage when a worker needs to write many files to a partition.

If the data to be written to partitions is skewed, long tails may occur during the reduce stage. Each partition can be assigned a maximum of 10 map tasks. If a larger amount of data is to be written to a partition than the other partitions, long tails may occur. If you can determine the partition to which data is to be written, we recommend that you do not use dynamic partitioning. For example, long tails may occur if you execute the following statement to write data from a specific partition in a table to another table:

insert overwrite table A2 partition(dt) select split_part(value,'\t',1) as field1, split_part(value,'\t',2) as field2, dt from A where dt='20151010';

In this case, you can replace the preceding statement with the following statement:


insert overwrite table A2 partition(dt='20151010')


select
split_part(value,'\t',1) as field1,
split_part(value,'\t',2) as field2
from A
where dt='20151010';

For more information about how to reduce impacts of data skew, see Long-tail computing optimization.

Optimize window functions

If window functions are used in SQL statements, a reduce task is assigned to each window function. A large number of window functions consume a large amount of resources. You can optimize the window functions that meet both of the following conditions:

The OVER clause, which defines how to partition and sort rows in a table, must be the same.
Multiple window functions must be executed at the same level of nesting in an SQL statement.
The window functions that meet the preceding conditions are merged to be executed by one reduce task. The following SQL statement provides an example:

select
rank()over(partition by A order by B desc) as rank,
row_number()over(partition by A order by B desc) as row_num
from MyTable;

Optimize subqueries
The following statement contains a subquery:

SELECT * FROM table_a a WHERE a.col1 IN (SELECT col1 FROM table_b b WHERE xxx);

If the subquery on the table_b table returns more than 1,000 values from the col1 column, the system reports the following error: records returned from subquery exceeded limit of 1000. In this case, you can replace the preceding statement with the following statement:

SELECT a.* FROM table_a a JOIN (SELECT DISTINCT col1 FROM table_b b WHERE xxx) c ON (a.col1 = c.col1);

Note:
If the DISTINCT keyword is not used, the subquery result table c may contain duplicate values in the col1 column. In this case, the query on the a table returns more results.
If the DISTINCT keyword is used, only one worker is assigned to perform the subquery. If the subquery involves a large amount of data, the whole query slows down.
If you are sure that the values that meet the subquery conditions in the col1 column are unique, you can delete the DISTINCT keyword to improve the query performance.

Optimize joins
When you join two tables, we recommend that you use the WHERE clause based on the following rules:


Specify the partition limits of the primary table in the WHERE clause. We recommend that you define a subquery for the primary table to obtain the required data first.
Write the WHERE clause of the primary table at the end of the statement.
Specify the partition limits of the secondary table in the ON clause or a subquery.

Examples:
select * from A join (select * from B where dt=20150301)B on B.id=A.id where A.dt=20150301;
select * from A join B on B.id=A.id where B.dt=20150301; -- We recommend that you do not use this statement. The system performs the JOIN operation before it performs partition pruning. This can result in a large amount of data and deteriorate the query performance.
select * from (select * from A where dt=20150301)A join (select * from B where dt=20150301)B on B.id=A.id;

4.2. Optimize JOIN long tails

This topic describes the data skew issue and provides related solutions. This issue may occur when the JOIN statement in MaxCompute SQL is executed.

Background information
When the JOIN statement in MaxCompute SQL is executed, the data with the same join key is sent to and processed on the same instance. If a key contains a large amount of data, the instance takes a longer time to process the data than other instances. Long tails exist if the execution log shows that a few instances in this JOIN task remain in the executing state while other instances are in the completed state.

Long tails caused by data skew are common and significantly prolong task execution. During promotions such as Double 11, severe long tails may occur. For example, page views of large sellers are much higher than page views of small sellers. If page view log data is associated with the seller dimension table, data is distributed by seller ID. This causes some instances to process far more data than others. In this case, the task cannot be completed due to a few long tails.

You can resolve long tails from four perspectives:

If you want to join one large table and one small table, you can execute the MAP JOIN statement to cache the small table. For more information about the MAP JOIN statement, see SELECT syntax.
To join two large tables, deduplicate data first.
Try to find the cause of the Cartesian product of two large keys and optimize these keys from the business perspective.
It takes a long time to directly execute the LEFT JOIN statement for a small table and a large table. In this case, we recommend that you execute the MAP JOIN statement for the small and large tables to generate an intermediate table that contains the intersection of the two tables. This intermediate table is not greater than the large table because the MAP JOIN statement filters out unnecessary data from the large table. Then, execute the LEFT JOIN statement for the small and intermediate tables. The effect of this operation is equivalent to that of executing the LEFT JOIN statement for the small and large tables. A sketch of this approach is shown after this list.
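The following is a minimal sketch of this intermediate-table approach. The table names small_t, big_t, and tmp_join and the join key id are illustrative only and are not part of the original example:

-- Step 1: MAP JOIN the small table with the large table to keep only the matching rows of the large table.
create table tmp_join as
select /*+ mapjoin(s) */ b.*
from big_t b join small_t s on b.id = s.id;

-- Step 2: LEFT JOIN the small table with the much smaller intermediate table.
select s.*, t.*
from small_t s left outer join tmp_join t on s.id = t.id;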

Check data skew


Perform the following steps to check data skew:
1. Open the log file generated on Logview when SQL statements are executed and check the execution details of each Fuxi task. In the following figure, Long-Tails(115) indicates that 115 long tails exist.

2. Find your Fuxi instance and click the icon in the StdOut column to view the size of data read by the instance.

For example, Read from 0 num:52743413 size:1389941257 indicates that 52,743,413 rows of data, 1,389,941,257 bytes in total, are being read when the JOIN statement is executed. If an instance listed in Long-Tails reads far more data than other instances, a long tail occurs due to the large data size.

Common causes and solutions


MAP JOIN statement: If data skew occurs when the JOIN statement is executed on a large table and a small table, you can execute the MAP JOIN statement to prevent a long tail.

When you use the MAP JOIN statement, the JOIN operation is performed at the Map side. This prevents data skew caused by uneven key distribution. The MAP JOIN statement is subject to the following limits:

The MAP JOIN statement is applicable only when the secondary table is small. A secondary table refers to the right table in the LEFT OUTER JOIN statement or the left table in the RIGHT OUTER JOIN statement.
The size of the small table is also limited when the MAP JOIN statement is used. By default, the maximum size is 512 MB after the small table is loaded into the memory. You can execute the following statement to increase the maximum size to 10,000 MB:

set odps.sql.mapjoin.memory.max=10000

The MAP JOIN statement is easy to use. You can append /* mapjoin(b) */ to the SELECT statement, where b indicates the alias of the small table or the subquery. Example:

select /* mapjoin(b) */
a.c2
,b.c3
from
(select c1
,c2
from t1 ) a
left outer join
(select c1
,c3
from t2 ) b
on a.c1 = b.c1;

JOIN long tails caused by hot key values


If hot key values cause a long tail and the MAP JOIN statement cannot be used because no small table is involved, extract hot key values. Hot key values in the primary table are separated from non-hot key values, processed independently, and then joined with non-hot key values. In the following example, the page view log table of the Taobao website is associated with the commodity dimension table.
i. Extract hot key values: Extract the IDs of the commodities whose page views are greater than 50,000 to a temporary table.

insert overwrite table topk_item PARTITION (ds = '${bizdate}')


select item_id
from
(select item_id
,count(1) as cnt
from dwd_tb_log_pv_di
where ds = '${bizdate}'
and url_type = 'ipv'
and item_id is not null
group by item_id
) a
where cnt >= 50000;

ii. Extract non-hot key values.

Execute the OUTER JOIN statement to associate the dwd_tb_log_pv_di primary table with the topk_item hot key table. Then, apply the condition b1.item_id is null to extract the log data of non-hot commodities that cannot be associated. In this case, execute the MAP JOIN statement to extract non-hot key values. Then, associate the non-hot key table with the commodity dimension table. No long tails occur because hot key values have been removed.

select ...
from
(select *
from dim_tb_itm
where ds = '${bizdate}'
) a
right outer join
(select /* mapjoin(b1) */
b2.*
from
(select item_id
from topk_item
where ds = '${bizdate}'
) b1
right outer join
(select *
from dwd_tb_log_pv_di
where ds = '${bizdate}'
and url_type = 'ipv'
) b2
on b1.item_id = coalesce(b2.item_id,concat("tbcdm",rand()))
where b1.item_id is null
) l
on a.item_id = coalesce(l.item_id,concat("tbcdm",rand()));


iii. Extract hot key values.

Execute the INNER JOIN statement to associate the dwd_tb_log_pv_di primary table with the topk_item hot key table. In this case, execute the MAP JOIN statement to extract the log data of hot commodities. Execute the INNER JOIN statement to associate the dim_tb_itm commodity dimension table with the topk_item hot key table to extract the data of the hot commodity dimension table. Execute the OUTER JOIN statement to associate the log data with the data of the dimension table. The dimension table contains a small amount of data and MAP JOIN can be used to prevent long tails.

select /* mapjoin(a) */
...
from
(select /* mapjoin(b1) */
b2.*
from
(select item_id
from topk_item
where ds = '${bizdate}'
)b1
join
(select *
from dwd_tb_log_pv_di
where ds = '${bizdate}'
and url_type = 'ipv'
and item_id is not null
) b2
on (b1.item_id = b2.item_id)
) l
left outer join
(select /* mapjoin(a1) */
a2.*
from
(select item_id
from topk_item
where ds = '${bizdate}'
) a1
join
(select *
from dim_tb_itm
where ds = '${bizdate}'
) a2
on (a1.item_id = a2.item_id)
) a
on a.item_id = l.item_id;

iv. Execute the UNION ALL statement to merge the data obtained in Substeps ii and iii to generate complete log data, with commodity information associated.

Set the odps.sql.skewjoin parameter to resolve long tails.

This is a simple solution. However, you must modify code and execute the statements again if skewed key values change. In addition, value changes cannot be predicted. If many skewed key values exist, it is inconvenient for you to configure them in parameters. In this case, you can split code or specify parameters as required. Perform the following steps to set the odps.sql.skewjoin parameter:


i. Set the odps.sql.skewjoin parameter to true.

set odps.sql.skewjoin=true

ii. Set a skewed key and its value.

set odps.sql.skewinfo=skewed_src:(skewed_key) [("skewed_value")]

skewed_key indicates the skewed column and skewed_value indicates the skewed value of this column.
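As a purely hypothetical instantiation of the template above (the table name src_a, column c0, and value "k1" are illustrative only):

set odps.sql.skewjoin=true;
set odps.sql.skewinfo=src_a:(c0)[("k1")];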
Use SKEWJOIN HINT to avoid skewed hot key values. For more information about SKEWJOIN HINT, see SKEWJOIN HINT.

Procedure

-- Method 1: Include the alias of the table in SKEWJOIN HINT.
select /*+ skewjoin(a) */ * from T0 a join T1 b on a.c0 = b.c0 and a.c1 = b.c1;
-- Method 2: Include the table name and possibly skewed columns in SKEWJOIN HINT. In the following statement, the c0 and c1 columns of table a are skewed columns.
select /*+ skewjoin(a(c0, c1)) */ * from T0 a join T1 b on a.c0 = b.c0 and a.c1 = b.c1 and a.c2 = b.c2;
-- Method 3: Include the table name, columns, and skewed hot key values in SKEWJOIN HINT. If skewed key values are of the STRING type, enclose each value with double quotation marks. In the following statement, (a.c0=1 and a.c1="2") and (a.c0=3 and a.c1="4") contain skewed hot key values.
select /*+ skewjoin(a(c0, c1)((1, "2"), (3, "4"))) */ * from T0 a join T1 b on a.c0 = b.c0 and a.c1 = b.c1 and a.c2 = b.c2;

Note: Method 3 is more efficient than Method 1 and Method 2.

Identify the JOIN statement that causes data skew

In the following snapshot captured on Logview, J5_3_4 is the Fuxi task that took the longest time to execute.


Click the J5_3_4 task and query the instances of this task on the tab that appears. The query results show that the J5_3_4#215_0 instance took the longest time to execute and its I/O records and I/O bytes are much more than those of other instances.

In this case, you can find that data skew occurs on the J5_3_4#215_0 instance. The JOIN statement that causes data skew needs to be further determined. Find the skewed instance, and click the icon in the StdOut column. Find a non-skewed instance, and click the icon in the StdOut column. The content in the StdOut column cannot be completely displayed. You can click Download and view the complete information.


In the following figures, you can find that the value of record count in StreamLineRead7 of the skewed instance is much greater than the value of record count of the non-skewed instance. Therefore, data skew occurs when data in StreamLineWrite7 and StreamLineRead7 is shuffled.

On the DAG page, right-click the skewed instance and select expand all to find StreamLineWrite7 and StreamLineRead7.


You can find that data skew occurs on StreamLineRead7 in MergeJoin2. MergeJoin2 is generated after the dim_hm_item and dim_tb_itm_brand tables are joined and then the joined table and the dim_tb_brand table are joined.


Use these table names to find the skewed table. The result shows that data skew occurs when the LEFT OUTER JOIN statement is executed and the t1 table is skewed. You can add /*+ skewjoin(t1) */ to the SQL statement to resolve the data skew issue.

4.3. Long-tail computing optimization

Long tails may occur in JOIN operations and other computing jobs. This topic describes the scenarios in which long tails occur and the solutions.

Long tails are one of the common issues in distributed computing. The main cause of a long tail is uneven data distribution. As a result, the workloads of individual nodes differ. The entire job can be completed only after the slowest node processes all its data.
To prevent one worker from running a large number of jobs, the jobs must be distributed to multiple workers.

GROUP BY long tail

Causes

The computing workloads for the key of a GROUP BY clause are heavy.

Solution

You can use one of the following methods to handle this issue:

Rewrite the SQL statement and add random numbers to split the key. Example:

SELECT Key,COUNT(*) AS Cnt FROM TableName GROUP BY Key;


Regardless of combiners, a mapper shuffles data to a reducer, and the reducer performs the count operation. The execution plan is in the following sequence: Mapper > Reducer. However, if the jobs of the long-tailed key are distributed again, use the following statement:

-- Assume that the long-tailed key is KEY001.


SELECT a.Key
, SUM(a.Cnt) AS Cnt
FROM (
SELECT Key
, COUNT(*) AS Cnt
FROM TableName
GROUP BY Key,
CASE
WHEN Key = 'KEY001' THEN Hash(Random()) % 50
ELSE 0
END
) a
GROUP BY a.Key;

The execution plan for this statement is in the following sequence: Mapper > Reducer > Reducer. Although more steps are required for the execution, the jobs of the long-tailed key are processed in two steps, and the time required may be shorter.

Note: If you use this method to add a reducer execution step to handle a long tail that has only slight impacts on your jobs, the time required may be longer.

Specify system parameters.

set odps.sql.groupby.skewindata=true

This configuration is used for general optimization instead of business-specific optimization. Therefore, the optimization effect may not be optimal. You can rewrite SQL statements in a more efficient way based on your data.

DISTINCT long tail

If a long tail occurs for the DISTINCT keyword, the key splitting method does not apply. In this case, you must seek other methods.

Solution

-- The original SQL statement, regardless of the case where uid is not specified.
SELECT COUNT(uid) AS Pv
, COUNT(DISTINCT uid) AS Uv
FROM UserLog;

The preceding statement can be rewritten into the following statement:


SELECT SUM(PV) AS Pv
, COUNT(*) AS UV
FROM (
SELECT COUNT(*) AS Pv
, uid
FROM UserLog
GROUP BY uid
) a;

This method changes DISTINCT to COUNT. This way, the computing workloads are distributed to different reducers. After you rewrite the statement, you can use the optimization method for GROUP BY, and the combiner is involved in the computation. This greatly improves the performance.

Long tail of dynamic partitions

Causes
To sort the data of small files, the dynamic partition feature starts a reducer at the final stage of execution. If data written by using the dynamic partition feature is skewed, a long tail occurs.
In general, the incorrect use of the dynamic partition feature causes long tails.

Solution

If you are sure about the partition to which data is written, you can specify the partition before you insert the data instead of using dynamic partitions, as shown in the example below.
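As an illustration only (the table names dst and src and the partition value are hypothetical):

insert overwrite table dst partition (dt='20220630')
select c1, c2 from src where dt='20220630';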

Use combiners to handle long tails


Combiners are frequently used to handle long tails in MapReduce jobs. Combiners can be used to reduce the amount of data that needs to be shuffled from mappers to reducers. This greatly reduces the overheads of network transmission. This optimization is automatically implemented in MaxCompute SQL.

Note: Combiners only optimize execution in the map stages. Make sure that the results of the execution during which combiners are used are the same as those of the execution during which combiners are not used. WordCount is used in this example. The result of passing (KEY,1) twice is the same as that of passing (KEY,2) once. For more information, see WordCount. However, when you calculate the average value, you cannot use a combiner to directly combine (KEY,1) and (KEY,2) to obtain (KEY,1.5).

Optimize the system to handle long tails

In addition to combiners, MaxCompute is also optimized. For example, the following logs (+N backups) are generated during the running of a job.

M1_Stg1_job0:0/521/521[100%] M2_Stg1_job0:0/1/1[100%] J9_1_2_Stg5_job0:0/523/523[100%] J3_1_2_Stg1_job0:0/523/523[100%] R6_3_9_Stg2_job0:1/1046/1047[100%]
M1_Stg1_job0:0/521/521[100%] M2_Stg1_job0:0/1/1[100%] J9_1_2_Stg5_job0:0/523/523[100%] J3_1_2_Stg1_job0:0/523/523[100%] R6_3_9_Stg2_job0:1/1046/1047[100%]
M1_Stg1_job0:0/521/521[100%] M2_Stg1_job0:0/1/1[100%] J9_1_2_Stg5_job0:0/523/523[100%] J3_1_2_Stg1_job0:0/523/523[100%] R6_3_9_Stg2_job0:1/1046/1047(+1 backups)[100%]
M1_Stg1_job0:0/521/521[100%] M2_Stg1_job0:0/1/1[100%] J9_1_2_Stg5_job0:0/523/523[100%] J3_1_2_Stg1_job0:0/523/523[100%] R6_3_9_Stg2_job0:1/1046/1047(+1 backups)[100%]


A total of 1,047 reducers are used. Among these reducers, 1,046 reducers have completed their calculations, but the last one has not. After MaxCompute detects this issue, it automatically starts a new reducer that calculates the same data, and the results of whichever reducer completes the calculation earlier are aggregated into the final result set.

Optimize business logic to handle long tails

The aforementioned optimization methods cannot handle all long tails. In some cases, you must optimize your business logic to handle long tails.

A large amount of noisy data may exist in calculations. For example, you need to calculate the data based on visitor IDs to check the access records of each user. In this case, you must filter out crawler data. Otherwise, a long tail may occur due to the crawler data during calculation. It is increasingly difficult to identify crawler data. Similarly, if you want to use the xxid field for associations, you must check whether the associated field is empty.
Long tails may occur in some special business scenarios. For example, the operation records of independent software vendors (ISVs) are greatly different from those of individuals in terms of the amount of data and behavior. In this case, you must use specific analysis methods to handle the issues of important customers.
If data is unevenly distributed, we recommend that you do not use constants as the key of DISTRIBUTE BY to sort all the data records (see the example after this list).
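The following contrast is illustrative only (the table t and column c1 are hypothetical); a constant distribute key sends every record to a single reducer, while a real column spreads the work:

select * from t distribute by 1 sort by c1;   -- not recommended: one reducer receives all records
select * from t distribute by c1 sort by c1;  -- recommended: records are spread across reducers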

4.4. Optimize the calculation for long-period metrics

This topic describes how to optimize the calculation for long-period metrics.

Background information
When e-commerce companies build data warehouses or analyze their business, they often need to calculate metrics such as the numbers of visitors, buyers, and regular buyers in a period of time. These metrics are calculated based on the data that is accumulated over the period of time.
In general, these metrics are calculated based on the data in log tables. For example, you can execute the following statement to calculate the number of visitors for each item in the last 30 days:

select item_id -- The field that indicates the item ID.


,count(distinct visitor_id) as ipv_uv_1d_001
from vistor_item_detail_log
where ds <= ${bdp.system.bizdate}
and ds >=to_char(dateadd(to_date(${bdp.system.bizdate},'yyyymmdd'),-29,'dd'),'yyyymmdd')
group by item_id;

Note: All the variables in the code samples in this topic are scheduling variables in DataWorks. Therefore, the code samples in this topic are applicable only to scheduling nodes in DataWorks.

If a large amount of log data is generated every day, the preceding SELECT statement requires a large number of map tasks. If more than 99,999 map tasks are required, the map tasks fail.

Objective


Calculate long-period metrics with minimal impact on the query performance.

The amount of the data accumulated over a long period of time is huge. If the system calculates metrics based on the data, the query performance deteriorates. We recommend that you create an intermediate table that is used to summarize the data generated every day. This can remove duplicate data records and reduce the amount of data to be queried.

Solution
1. Create an intermediate table to summarize the data generated every day.

In this example, you can create an intermediate table based on the data in the item_id and visitor_id fields. The following code provides an example:

insert overwrite table mds_itm_vsr_xx(ds='${bdp.system.bizdate} ')


select item_id,visitor_id,count(1) as pv
from
(
select item_id,visitor_id
from vistor_item_detail_log
where ds =${bdp.system.bizdate}
group by item_id,visitor_id
) a;

2. Summarize t he dat a t hat is accumulat ed over a long period of t ime from t he int ermediat e t able.

T he following code calculat es t he number of visit ors for each it em in t he last 30 days:

select item_id
,count(distinct visitor_id) as uv
,sum(pv) as pv
from mds_itm_vsr_xx
where ds <= '${bdp.system.bizdate} '
and ds >= to_char(dateadd(to_date('${bdp.system.bizdate} ','yyyymmdd'),-29,'dd'),'yyy
ymmdd')
group by item_id;

Impact and consideration

In the preceding solution, the log data is deduplicated on a daily basis. This reduces the amount of data involved in the calculation and improves query performance. However, the system still needs to read data from multiple partitions for every calculation on the data that is accumulated over a long period of time.

To resolve this issue, you can merge data from multiple partitions into one partition that contains all historical data. This way, you can accumulate data in an incremental manner and calculate long-period metrics based on the data in a single partition, as shown in the following sketch.
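
The following sketch shows one way to maintain such an all-history partition. The table mds_itm_vsr_all and the ${yesterday} placeholder are hypothetical and are not part of the original example; in DataWorks, map ${yesterday} to an appropriate scheduling parameter for the previous business date.

-- A minimal sketch: merge the previously accumulated partition with the latest
-- daily summary of mds_itm_vsr_xx into a single all-history partition.
insert overwrite table mds_itm_vsr_all partition (ds='${bdp.system.bizdate}')
select item_id
      ,visitor_id
      ,sum(pv) as pv
from (
    select item_id, visitor_id, pv
    from mds_itm_vsr_all
    where ds = '${yesterday}'            -- history accumulated so far
    union all
    select item_id, visitor_id, pv
    from mds_itm_vsr_xx
    where ds = '${bdp.system.bizdate}'   -- the latest daily summary
) t
group by item_id
        ,visitor_id;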

Scenarios
Calculate the number of regular buyers as of the last day. A regular buyer is a buyer who made a purchase within a specific period of time, for example, within the last 30 days.
The following code calculates the number of regular buyers in a period of time:

select item_id -- The field that indicates the item ID.
      ,buyer_id as old_buyer_id
from buyer_item_detail_log
where ds < ${bdp.system.bizdate}
and ds >= to_char(dateadd(to_date(${bdp.system.bizdate},'yyyymmdd'),-29,'dd'),'yyyymmdd')
group by item_id
        ,buyer_id;

Improvement:

Create and maintain a dimension table that records the relationship between buyers and purchased items, such as the first purchase time, the last purchase time, the total number of purchased items, and the total purchase amount.
Update the data in the dimension table every day with the data in the billing logs of the previous day.
To determine whether a buyer is a regular buyer, check whether the last purchase time of the buyer falls within the last 30 days. This deduplicates the buyer-item mappings and reduces the amount of data involved in the calculation. A hedged sketch of this approach follows.
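
The following sketch illustrates the improved approach. The dimension table dim_buyer_item and the columns first_buy_date, last_buy_date, buy_cnt, buy_amt, and total_price are hypothetical names chosen for illustration; they are not defined in the original topic.

-- 1. Merge the previous day's billing log into the dimension table (a minimal sketch).
insert overwrite table dim_buyer_item
select buyer_id
      ,item_id
      ,min(first_buy_date) as first_buy_date
      ,max(last_buy_date)  as last_buy_date
      ,sum(buy_cnt)        as buy_cnt
      ,sum(buy_amt)        as buy_amt
from (
    select buyer_id, item_id, first_buy_date, last_buy_date, buy_cnt, buy_amt
    from dim_buyer_item
    union all
    select buyer_id, item_id, ds as first_buy_date, ds as last_buy_date,
           count(1) as buy_cnt, sum(total_price) as buy_amt
    from buyer_item_detail_log
    where ds = ${bdp.system.bizdate}
    group by buyer_id, item_id, ds
) t
group by buyer_id
        ,item_id;

-- 2. A buyer is a regular buyer of an item if the last purchase falls within the last 30 days.
select item_id
      ,count(distinct buyer_id) as old_buyer_cnt
from dim_buyer_item
where last_buy_date >= to_char(dateadd(to_date(${bdp.system.bizdate},'yyyymmdd'),-29,'dd'),'yyyymmdd')
group by item_id;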


5. Job diagnostics
5.1. Use Logview to diagnose jobs that run slowly
In most cases, enterprises need job results to be generated earlier than expected so that they can make business development decisions based on the results at the earliest opportunity. Job developers must therefore pay attention to the job status to identify and optimize the jobs that run slowly. You can use Logview of MaxCompute to diagnose jobs that run slowly. This topic provides the causes for which jobs run slowly and the related solutions. This topic also describes how to view information about the jobs that run slowly.

Background information
Logview of MaxCompute records all logs of jobs and provides guidance for you to view and debug jobs. You can obtain the Logview URL below Log view in the job result. MaxCompute provides two versions of Logview. We recommend that you use Logview V2.0 because it provides faster page loading and a better design style. For more information about Logview V2.0, see Logview V2.0.

Common causes for which jobs run slowly:

Insufficient CUs

If the MaxCompute project uses the subscription billing method and a large number of jobs are submitted or a large number of small files are generated within a specific period of time, all the purchased compute units (CUs) are occupied and the jobs become queued.

Data skew

If a large amount of data is processed or some jobs are dedicated to special data, long tails may occur even if most jobs are completed.

Inefficient code logic

If the SQL or user-defined function (UDF) logic is inefficient or parameter settings are not optimal, a Fuxi task may run for a long period of time even though each Fuxi instance runs for almost the same amount of time. For more information about the relationships between jobs, Fuxi tasks, and Fuxi instances, see Job details section.

Insufficient CUs
Problem description
If the CUs are insufficient, the following issues may occur after you submit a job:

Issue 1: Job Queueing... is displayed.

The job may be queued because other jobs occupy the resources of the resource group. You can perform the following steps to view the duration for which the job is queued:
i. Obtain the Logview URL in the job result and open the URL in a browser.


ii. On the SubStatusHistory tab of Logview, find Waiting for scheduling in the Description column and view the value in the Latency column. The value indicates the duration for which the job is queued.

Issue 2: The job runs slowly.

After a job is submitted, a large number of CUs are required. However, the resource group cannot start all Fuxi instances at the same time. As a result, the job runs slowly. You can perform the following steps to view the job status:

i. Obtain the Logview URL in the job result and open the URL in a browser.
ii. In the Fuxi Instance section of the Job Details tab, click Latency chart to view the job status diagram.

The following figure shows the status of a job that has sufficient resources. The lower blue part in the diagram remains at approximately the same height, which indicates that all Fuxi instances of the job start at approximately the same time.

The following figure shows the status of a job that does not have sufficient resources. The diagram shows an upward trend, which indicates that the Fuxi instances of the job are gradually scheduled.

Causes

To locate the causes of the preceding issues, perform the following steps:

1. Go to MaxCompute Management.
2. In the left-side navigation pane, click Quotas.
3. In the Subscription Quota Groups section, click the quota group that corresponds to the MaxCompute project.
4. In the Usage Trend of Reserved CUs chart on the Resource Consumption tab, click the point at which the CU usage is the highest and record the point in time.
5. In the left-side navigation pane, click Jobs. On the right part of the page, click the Job Management tab.
6. On the Job Management tab, configure Time Range based on the point in time that you recorded, select Running from the Job Status drop-down list, and then click OK.
7. In the job list, click the icon next to CPU Utilization (%) to sort jobs by CPU utilization in descending order.
If the CPU utilization of a job is excessively high, click Logview in the Actions column and view I/O Bytes in the Fuxi Instance section. If I/O Bytes is only 1 MB or tens of KB and multiple Fuxi instances are running in the job, a large number of small files are generated when the job is run. In this case, you need to merge the small files or adjust the parallelism.
If the values of CPU Utilization (%) are almost the same across jobs, multiple large jobs are submitted at the same time and consume all CUs. In this case, you must purchase additional CUs or use pay-as-you-go resources to run the jobs.

Solutions
Merge small files.
Adjust the parallelism.

The parallelism of MaxCompute jobs is automatically estimated based on the amount of input data and the job complexity. In most cases, you do not need to manually adjust the parallelism. If you adjust the parallelism to a higher value, the job processing speed increases, but subscription resource groups may be fully occupied. In that case, jobs are queued to wait for resources and therefore run slowly. You can configure the odps.stage.mapper.split.size, odps.stage.reducer.num, odps.stage.joiner.num, or odps.stage.num parameter to adjust the parallelism, as shown in the sketch after this list. For more information, see SET operations.
Purchase CUs.

For more information about how to purchase CUs, see Upgrade resource configurations.

Use pay-as-you-go resources.

Purchase pay-as-you-go resources and use MaxCompute Management to allow subscription projects to use the pay-as-you-go resources.
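
The following sketch shows what the first two solutions can look like on the MaxCompute client. The table name, partition spec, and parameter values are illustrative placeholders, not recommendations from this topic.

-- Merge small files in a partition of a hypothetical table.
ALTER TABLE my_table PARTITION (ds='20220101') MERGE SMALLFILES;

-- Adjust the parallelism for the current session before submitting the job.
SET odps.stage.mapper.split.size=512;   -- larger splits, fewer mappers
SET odps.stage.reducer.num=100;         -- explicit number of reducers
-- SET odps.stage.joiner.num=100;       -- number of joiners, if needed
-- SET odps.stage.num=100;              -- overall instance count, if needed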

Data skew
Problem description

Some Fuxi instances in a Fuxi task continue to run even after most Fuxi instances of the task have stopped. As a result, long tails occur.

In the Fuxi Instance section of the Job Details tab of Logview, you can click Long-Tails to view the Fuxi instances that have a long tail.

Cause

The Fuxi instances that continue to run process large amounts of data or are dedicated to special data.

Solution

For more information about how to resolve data skew, see Reduce impacts of data skew.

Inefficient code logic

Problem description

If the code logic is inefficient, the following issues may occur after you submit a job:
Issue 1: Data expansion occurs. The amount of output data of a Fuxi task is significantly greater than the amount of input data.

You can view I/O Record and I/O Bytes in the Fuxi Task section to check the amounts of input and output data of a Fuxi task. In the following figure, 1 GB of data expands to 1 TB after the data is processed. One Fuxi instance processes 1 TB of data, which reduces data processing efficiency.

Issue 2: The UDF execution efficiency is low.

A Fuxi task runs slowly, and the Fuxi task has UDFs. When a timeout error occurs on a UDF, the error Fuxi job failed - WorkerRestart errCode:252,errMsg:kInstanceMonitorTimeout, usually caused by bad udf performance is returned. You can perform the following steps to view the location and execution speed of the UDF:

i. Obtain the Logview URL in the job result and open the URL in a browser.
ii. In the progress chart, double-click the Fuxi task that runs slowly or fails to run. In the operator graph, view the location of the UDF, as shown in the following figure.


iii. In the Fuxi Instance section, click StdOut to view the execution speed of the UDF.

In normal cases, the value of Speed(records/s) indicates that hundreds of thousands or millions of records are processed per second.

Causes
Issue 1: The business processing logic causes data expansion. In this case, check whether the business logic meets your business requirements.
Issue 2: The UDF code logic does not meet your business requirements. In this case, adjust the code logic.

Solutions
Issue 1: Check whether the business logic has a defect. If the logic has a defect, modify the code. If the logic does not have a defect, configure the odps.stage.mapper.split.size, odps.stage.reducer.num, odps.stage.joiner.num, or odps.stage.num parameter to adjust the parallelism. For more information, see SET operations.
Issue 2: Check and modify the UDF code logic. We recommend that you preferentially use built-in functions. If built-in functions cannot meet your business requirements, use UDFs. For more information about built-in functions, see Built-in functions.


6. Cost optimization
6.1. Overview
This topic describes the process of cost optimization.

Enterprises must continually optimize their costs on MaxCompute in response to changes in big data. You can reference the following process for cost optimization:

1. Before you use MaxCompute, make sure that you fully understand the billing methods, accurately estimate the resources that you require, and then select an appropriate billing method. For more information, see Select a billing method.
2. To reduce costs when you use MaxCompute, optimize the resources that are used for data computing, storage, uploads, and downloads. For more information, see Optimize computing costs, Optimize storage costs, and Optimize the costs of data uploads and downloads.
3. View your bills in a timely manner. Analyze any exceptions in the bills and perform optimization. For more information, see Manage costs.

6.2. Select a billing method

This topic describes how to select a cost-effective billing method.

Billing methods
MaxCompute supports the following billing methods:

Subscription: Computing resources are charged on a monthly or annual basis. Storage and download resources are charged on a pay-as-you-go basis.
Pay-as-you-go: Storage, computing, and download resources are all charged on a pay-as-you-go basis.

For more information, see Billing method. You can select a billing method with the help of Total Cost of Ownership (TCO) tools and the best practices of cost estimation.

TCO tools
You can use the following TCO tools to estimate costs:

MaxCompute price calculator: This tool is suitable for the subscription billing method. To calculate the monthly cost, enter the required computing resources and the volumes of the data you want to upload and download.
Cost SQL: This tool is suitable for the pay-as-you-go billing method.
You can run the cost sql command to estimate the cost of an SQL job before you execute the SQL job in a production environment, as shown in the example after this list. For more information, see Cost estimation.
If you use IntelliJ IDEA, you can submit SQL scripts for automatic cost estimation. For more information, see Develop and submit an SQL script.
If you use DataWorks, you can also estimate costs.
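
The following example shows what a cost sql invocation on the MaxCompute client can look like. The table name is a placeholder, and the exact output format may differ by version.

-- Estimate the cost of an SQL job before running it.
cost sql SELECT item_id, COUNT(DISTINCT visitor_id) FROM vistor_item_detail_log WHERE ds='20220101' GROUP BY item_id;
-- The estimate reflects the amount of input data and the SQL complexity used for billing.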


Note
The costs of some SQL jobs cannot be estimated, such as SQL jobs that involve external tables.
The actual costs are subject to final bills.

Best practices of cost estimation

This section provides some cost estimation examples and tips for your reference. You can select a cost-effective billing method based on this information.

Billing methods for 1 TB of data

The following table describes the estimated costs for reference.

Billing method | Business scenario | Response speed | Estimated cost per month
Subscription | Compute-intensive | Within a few minutes | 3768 USD
Subscription | Storage-intensive | Within a few hours | 1177.5 USD
Pay-as-you-go | - | - | 1413 USD (estimated with an SQL complexity of 1 and an execution frequency of once per day)

If you select the subscription billing method, the costs vary depending on your business type:

Compute-intensive scenario: In this scenario, a large number of CPU resources are required. 160 compute units are used to process 1 TB of data. The system responds to a request within a few minutes. The estimated cost is 3768 USD per month.
Storage-intensive scenario: If your jobs are not sensitive to the response speed, we recommend that you purchase a storage plan. About 50 compute units are used to process 1 TB of data. The system responds to a request within a few hours. The estimated cost is 1177.5 USD per month.

If you select the pay-as-you-go billing method, the cost of the computing resources that are used to process 1 TB of data once is about 47.1 USD per day, or 1413 USD per month, assuming an SQL complexity of 1 and one processing run per day. If the data is processed multiple times per day, the cost is multiplied accordingly.

When you migrate data to the cloud for the first time, we recommend that you select the pay-as-you-go billing method first. Perform a Proof of Concept (POC) test to calculate the approximate number of workers used for your jobs. Then, calculate the number of compute units that you need to purchase based on the number of workers.
Billing methods for Hadoop users who migrate data to the cloud

Assume that a Hadoop cluster has one controller node and five compute nodes. Each node has 32 cores, equivalent to 32 CPUs. The total number of CPUs for the compute nodes is 160. The estimated cost of the cluster is 3768 USD per month with no discounts or promotional offers applied.

MaxCompute does not require any controller nodes, and its performance is 80% higher than that of Hive. It also frees you from operations and maintenance (O&M), which further reduces costs.

Mixed billing methods

Subscription billing method for production businesses, such as hourly extract, transform, load (ETL) jobs, and pay-as-you-go billing method for aperiodic jobs or ad hoc queries

We recommend that you select the subscription billing method for periodic computing jobs that are frequently executed and the pay-as-you-go billing method for aperiodic jobs that are used to process large amounts of data. In pay-as-you-go mode, you can choose not to store data. Instead, you can read data from tables under other accounts. This reduces data storage costs. Authorization is required for computing operations on tables under different accounts. For more information, see Create a project-level role.

Subscription billing method for aperiodic jobs or ad hoc queries and pay-as-you-go billing method for production businesses, such as daily ETL jobs
Daily data testing may cause uncontrollable costs. To avoid this issue, you can add data testing and aperiodic jobs to fixed resource groups. Then, use MaxCompute Management to configure custom development groups and business intelligence (BI) groups. If production jobs are executed only once per day, you can add them to a pay-as-you-go resource group.

Switching between billing methods

If you select the subscription billing method, you can upgrade or downgrade the configurations in the following scenarios: the data volume changes, and the purchased resources are insufficient or become idle. For more information, see Upgrade or downgrade configurations.

You can also switch between the subscription and pay-as-you-go billing methods. For more information, see Switch billing methods.

Note: Before you switch the billing method from pay-as-you-go to subscription, evaluate the computing performance and cycles of jobs to determine the number of compute units you need to purchase. If the compute units you purchase are insufficient, the computing cycle of a job may be prolonged, and the computing performance may not meet your expectations. If this occurs, you may need to switch the billing method again.

6.3. Optimize computing costs

This topic describes how to optimize SQL and MapReduce jobs to reduce computing costs.

You can estimate computing costs before you execute computing jobs. For more information, see TCO tools. You can also configure alerts for resource consumption to avoid extra costs. If computing costs are high, you can use the methods described in this topic to reduce them.

Control the computing costs of SQL jobs

Some SQL jobs that trigger full table scans incur high computing costs. The frequent scheduling of SQL jobs may cause an accumulation of jobs, which also increases computing costs. If an accumulation occurs and the pay-as-you-go billing method is used, jobs are queued and require more resources. As a result, the bill generated the next day is abnormally high. You can use the following methods to control the computing costs of SQL jobs:
Avoid frequent scheduling. MaxCompute provides a computing service that processes large amounts of data at a time; it is different from real-time computing services. If SQL jobs are executed at short intervals, the computing frequency increases. The increased computing frequency and improper execution of SQL jobs increase computing costs. If you require frequent scheduling, use Cost SQL to estimate the costs of the SQL jobs to avoid extra costs.


Reduce full table scans. You can use the following methods:
Specify the required parameters to disable the full table scan feature. You can disable the feature for a session or a project.

-- Disable the feature for a session.
set odps.sql.allow.fullscan=false;
-- Disable the feature for a project.
SetProject odps.sql.allow.fullscan=false;

Prune columns. Column pruning allows the system to read data only from the required columns. We recommend that you do not use the SELECT * statement, which triggers a full table scan.

SELECT a,b FROM T WHERE e < 10;

In this statement, the T table contains the a, b, c, d, and e columns. However, only the a, b, and e columns are read.

Prune partitions. Partition pruning allows you to specify filter conditions on partition key columns. This way, the system reads data only from the required partitions, which avoids the errors and waste of resources caused by full table scans.

SELECT a,b FROM T WHERE partitiondate='2017-10-01';

Optimize SQL keywords that incur costs. The keywords include JOIN, GROUP BY, ORDER BY, DISTINCT, and INSERT INTO. You can optimize the keywords based on the following rules:
Before a JOIN operation, prune partitions. Otherwise, a full table scan may be performed. For more information about scenarios in which partition pruning is invalid, see Scenarios where partition pruning does not take effect.
Use UNION ALL instead of FULL OUTER JOIN.

SELECT COALESCE(t1.id, t2.id) AS id, SUM(t1.col1) AS col1
     , SUM(t2.col2) AS col2
FROM (
    SELECT id, col1
    FROM table1
) t1
FULL OUTER JOIN (
    SELECT id, col2
    FROM table2
) t2
ON t1.id = t2.id
GROUP BY COALESCE(t1.id, t2.id);
-- Optimized:
SELECT t.id, SUM(t.col1) AS col1, SUM(t.col2) AS col2
FROM (
    SELECT id, col1, 0 AS col2
    FROM table1
    UNION ALL
    SELECT id, 0 AS col1, col2
    FROM table2
) t
GROUP BY t.id;


Try not to include GROUP BY inside UNION ALL. Use GROUP BY outside UNION ALL instead.

SELECT t.id, SUM(t.val) AS val
FROM (
    SELECT id, SUM(col3) AS val
    FROM table3
    GROUP BY id
    UNION ALL
    SELECT id, SUM(col4) AS val
    FROM table4
    GROUP BY id
) t
GROUP BY t.id;
-- Optimized:
SELECT t.id, SUM(t.val) AS val
FROM (
    SELECT id, col3 AS val
    FROM table3
    UNION ALL
    SELECT id, col4 AS val
    FROM table4
) t
GROUP BY t.id;

To sort temporarily exported data, sort the data by using tools such as Excel instead of ORDER BY.
Try not to use DISTINCT. Use GROUP BY instead.

SELECT COUNT(DISTINCT id) AS cnt
FROM table1;
-- Optimized:
SELECT COUNT(1) AS cnt
FROM (
    SELECT id
    FROM table1
    GROUP BY id
) t;

Try not to use INSERT INTO to write data. Add a partition field instead. This reduces SQL complexity and saves computing costs.

Try not to execute SQL statements just to view table data. You can use the table preview feature instead, which does not consume computing resources. If you use DataWorks, you can preview a table and query details about the table on the Data Map page. For more information, see View the details of a table. If you use MaxCompute Studio, double-click a table to preview its data.
Select an appropriate tool for data computing. MaxCompute responds to a query within minutes, so it is not suitable for frontend queries. Computing results are synchronized to an external storage system; most users use relational databases to store the results. We recommend that you use MaxCompute for lightweight computing jobs and relational databases, such as ApsaraDB for RDS, for frontend queries. Frontend queries require the real-time generation of query results. If the query results are displayed in the frontend, no conditional clauses are executed on the data, the data is not aggregated or associated with dictionaries, and the queries often do not even include a WHERE clause.


Control the computing costs of MapReduce jobs

You can use the following methods to control the computing costs of MapReduce jobs:

Configure the required settings

Split size

The default split size for a mapper is 256 MB. The split size determines the number of mappers. If your code logic for a mapper is time-consuming, you can use JobConf#setSplitSize to reduce the split size. You must configure an appropriate split size. Otherwise, excessive computing resources are required.

MapReduce Reduce Instance

By default, the number of reducers that are used to complete a job is one fourth of the number of mappers. You can set the number of reducers to a value that ranges from 0 to 2,000. More reducers require more computing resources, which increases costs. You must configure the number of reducers appropriately. A hedged sketch of both settings follows.
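
The following Java sketch shows where these two settings are typically applied when a job is configured. It assumes the MaxCompute MapReduce JobConf class (com.aliyun.odps.mapred.conf.JobConf) and uses illustrative values; the split size unit follows the 256 MB default described above.

import com.aliyun.odps.mapred.conf.JobConf;

public class JobCostTuning {
    public static JobConf configure() {
        JobConf job = new JobConf();
        // Reduce the split size so that a time-consuming mapper processes
        // less data per instance (default: 256 MB).
        job.setSplitSize(128);
        // Explicitly set the number of reducers (allowed range: 0 to 2,000).
        // By default it is one fourth of the number of mappers.
        job.setNumReduceTasks(50);
        return job;
    }
}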

Reduce the number of MapReduce jobs

If multiple MapReduce jobs are correlated and the output of one job is the input of the next job, we recommend that you use the pipeline mode. The pipeline mode allows you to merge multiple serial MapReduce jobs into a single job. This reduces the redundant disk I/O operations caused by intermediate tables and improves performance. It also simplifies job scheduling and makes the process easier to maintain. For more information, see Pipeline examples.

Prune the columns of input tables

For input tables that contain a large number of columns, only a few columns may be processed by a mapper. When you add an input table, you can specify the columns to reduce the amount of data that needs to be read. For example, to process data in the c1 and c2 columns, use the following configuration:

InputUtils.addTable(TableInfo.builder().tableName("wc_in").cols(new String[]{"c1","c2"}).
build(), job);

After the configuration, the mapper reads data only from the c1 and c2 columns. This does not affect the data that is obtained based on column names. However, it may affect the data that is obtained based on column subscripts.

Avoid duplicate reads of resources

We recommend that you read resources in the setup stage. This avoids the performance loss caused by duplicate resource reads. A resource can be read up to 64 times. For more information, see Resource usage example.

Reduce the overheads of object construction

Java objects are used in each map or reduce stage. You can construct Java objects in the setup stage instead of the map or reduce stage. This reduces the overheads of object construction.


{
    ...
    Record word;
    Record one;
    public void setup(TaskContext context) throws IOException {
        // Create the Java objects in the setup stage. This avoids the repeated
        // creation of Java objects in each map stage.
        word = context.createMapOutputKeyRecord();
        one = context.createMapOutputValueRecord();
        one.set(new Object[]{1L});
    }
    ...
}

Use a combiner in the proper manner

If the output of a map task contains multiple duplicate keys, you can use a combiner to merge them. This reduces transmission bandwidth and shuffling overheads. If the output of a map task does not contain duplicate keys, using a combiner may incur extra overheads. A combiner implements the reducer interface. The following code defines the combiner in a WordCount program:

/**
 * A combiner class that combines map output by summing the values.
 */
public static class SumCombiner extends ReducerBase {
    private Record count;

    @Override
    public void setup(TaskContext context) throws IOException {
        count = context.createMapOutputValueRecord();
    }

    @Override
    public void reduce(Record key, Iterator<Record> values, TaskContext context)
            throws IOException {
        long c = 0;
        while (values.hasNext()) {
            Record val = values.next();
            c += (Long) val.get(0);
        }
        count.set(0, c);
        context.write(key, count);
    }
}

Appropriately select partition key columns or customize a partitioner

You can use JobConf#setPartitionColumns to specify partition key columns. The default partition key columns are defined in the key schema. If you use this method, data is distributed to reducers according to the hash values of the specified columns. This avoids long-tail issues caused by data skew. You can also customize a partitioner if necessary. The following code shows how to customize a partitioner:


import com.aliyun.odps.mapred.Partitioner;

public static class MyPartitioner extends Partitioner {
    @Override
    public int getPartition(Record key, Record value, int numPartitions) {
        // numPartitions indicates the number of reducers.
        // This function determines the reducer to which each map output key is transferred.
        String k = key.get(0).toString();
        return k.length() % numPartitions;
    }
}

Configure the following settings in jobconf:

jobconf.setPartitionerClass(MyPartitioner.class)

Specify the number of reducers in jobconf:

jobconf.setNumReduceTasks(num)

Configure JVM memory parameters as required

A large memory configuration for a MapReduce job increases computing costs. We recommend that you configure one CPU core and 4 GB of memory for a MapReduce job, that is, set odps.stage.reducer.jvm.mem to 4096 for a reducer. A CPU core-to-memory ratio larger than 1:4 also increases computing costs.
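
For example, the memory setting mentioned above can be applied at the session level as follows (a sketch; 4096 corresponds to the recommended 4 GB):

-- Limit reducer JVM memory to 4 GB for the current session.
SET odps.stage.reducer.jvm.mem=4096;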

6.4. Optimize storage costs

This topic describes how to optimize storage costs in terms of data partitions, table lifecycles, and the periodic deletion of deprecated tables.

You can perform the following operations to optimize storage costs:

Properly configure data partitions.
Configure reasonable lifecycles for tables.
Periodically delete deprecated tables.

Properly configure data partitions

In MaxCompute, each value of a partition key column is called a partition. You can group multiple fields of a table into a multi-level partition, which is similar to a multi-level directory. If you specify the name of the partition you want to access, the system reads data only from that partition and does not scan the entire table. This reduces costs and improves efficiency.

If the minimum period for data collection is one day, we recommend that you use the date field as a partition field. The system migrates data to the specified partitions every day and then reads the data from the specified partitions for subsequent operations.
If the minimum period for data collection is one hour, we recommend that you use the combination of the date and hour fields as partition fields. The system migrates data to the specified partitions every hour and then reads the data from the specified partitions for subsequent operations. If data that is collected on an hourly basis is partitioned only by date, data in each partition is appended every hour. As a result, the system reads large amounts of unnecessary data, which increases storage costs.


You can choose partition fields based on your business needs. In addition to the date and time fields, you can use other fields that have a relatively fixed number of enumerated values, such as channel, country, or province, or a combination of time and other fields. We recommend that you specify two levels of partitions in a table. Each table supports a maximum of 60,000 partitions. The sketch below shows the daily and hourly layouts described above.
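
The following sketch shows the daily and hourly partition layouts described above. The table and column names are illustrative only.

-- Daily collection: partition by date.
CREATE TABLE IF NOT EXISTS ods_log_daily (
    visitor_id STRING,
    item_id    STRING
) PARTITIONED BY (ds STRING) LIFECYCLE 100;

-- Hourly collection: two-level partition by date and hour.
CREATE TABLE IF NOT EXISTS ods_log_hourly (
    visitor_id STRING,
    item_id    STRING
) PARTITIONED BY (ds STRING, hh STRING) LIFECYCLE 100;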

Configure reasonable lifecycles for tables

When you create a table, you can configure its lifecycle based on data usage. MaxCompute deletes data that exceeds the lifecycle threshold in a timely manner, which saves storage space.

For example, you can execute the following statement to create a table with a lifecycle of 100 days. If the last modification of the table or a partition occurred more than 100 days ago, MaxCompute deletes the table or partition.

CREATE TABLE test3 (key boolean) PARTITIONED BY (pt string, ds string) LIFECYCLE 100;

The lifecycle takes a partition as the smallest unit. If some partitions in a partitioned table reach the lifecycle threshold, those partitions are deleted. Partitions that do not reach the lifecycle threshold are not affected.

You can execute the following statement to modify the lifecycle settings for an existing table. For more information, see Lifecycle management operations.

ALTER TABLE table_name SET lifecycle days;

Periodically delete deprecated tables

We recommend that you periodically delete deprecated tables that have not been accessed for a long period of time. The following tables are considered deprecated:

Tables that are not accessed within the last three months
Non-partitioned tables that are not accessed within the last month
Tables that do not consume storage resources

6.5. Optimize the costs of data uploads and downloads

This topic describes how to optimize the synchronization costs incurred by data uploads and downloads.

Use the classic network or a virtual private cloud (VPC)

You can use an internal network, such as the classic network or a VPC, to upload or download data at no cost. For more information about how to configure networks, see Endpoints.

Use Elastic Compute Service (ECS) to download resources

If you create a subscription ECS instance, you can use a data synchronization tool such as Tunnel to synchronize data from MaxCompute to the ECS instance. Then, download the data to your local directory. For more information, see Export SQL execution results.

Optimize Tunnel-based file uploads


Uploading small files separately consumes too many computing resources. We recommend that you upload a large number of small files at a time. For example, if you call Tunnel SDK, we recommend that you upload files when the cache of the files reaches 64 MB.

Estimate the VPC bandwidth

If you want to synchronize data from your on-premises data center to MaxCompute over a physical connection, you must estimate the bandwidth and costs of data synchronization. For example, if you migrate 50 TB of data to MaxCompute within one day, the estimated bandwidth is about 5 Gbit/s. The estimated bandwidth is calculated by using the following formula:
50 × 1024 × 8/(24 × 3600) = 4.7 Gbit/s

6.6. Manage costs

This topic describes how to track resource consumption, optimize resource usage, and reduce costs.

You can use the following items to manage costs:

Bill details: You can view bill details on the Billing Management page of the Alibaba Cloud Management Console.
Usage records: Each usage record contains the complexity and metering information of an SQL statement, as well as details about daily storage and download traffic.
Command-line interface (CLI): You can use a CLI to reproduce operation scenarios and determine the causes of high costs incurred by SQL statements.

Bill details
We recommend that you regularly view your bills to optimize costs in a timely manner. You can view bill details in the Alibaba Cloud Management Console. If you select the subscription billing method, bills are generated at 12:00 the next day. If you select the pay-as-you-go billing method, bills are generated at 09:00 the next day. For more information, see View billing details.

Usage records
If the bill amount of a project reaches thousands of dollars on a given day and is a multiple of the normal bill amount, you must view the bill details. You can download usage records to view details about the exception records. For more information, see View billing details.

Metering information about a storage fee is pushed every hour. To calculate a storage fee, obtain the total number of bytes, calculate the average value over a 24-hour period, and then apply the tiered pricing method to obtain the storage fee.

The calculation of metering information depends on the end time of each task. If a task is completed in the early morning of the day after it starts, the metering information of this task is included in the calculation for the day the task is completed.

You are not charged for the resources that are used to download data over an internal network, such as the classic network. The resources that are used to upload data are also free of charge. You are charged only for the resources that are used to download data over the Internet.

CLI
If an abnormal SQL statement is detected, you can use a CLI to reproduce the operation scenario.

You can check the usage records or run the show p; command to obtain the ID of the instance on which abnormal data is detected. Then, run the wait InstanceId command to obtain the Logview URL of the instance. The logs of the SQL statement are displayed in Logview. You can view the logs to determine the causes of high costs.

Note: You can obtain only the information generated in the last seven days in Logview.

You can also run the desc instance instid command to show information about the SQL statement in the console. The following sketch shows the sequence of these commands.
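
A minimal sketch of this workflow on the MaxCompute client follows; the instance ID is a placeholder taken from the command reference in the next topic.

-- List recent instances of the current user and note the suspicious instance ID.
show p;
-- Obtain the Logview URL of that instance.
wait 20131225123302267gk3u6k4y2;
-- Show details about the SQL statement that the instance executed.
desc instance 20131225123302267gk3u6k4y2;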

6.7. Command reference

This topic describes the commonly used commands and whether these commands are free of charge.

Command | Description | Charged? | Example
TUNNEL DOWNLOAD (classic network) | Download data over the classic network. | No | TUNNEL DOWNLOAD table_name e:/table_name.txt; -- configure the classic-network endpoint of MaxCompute, such as http://dt.cn-shanghai.maxcompute.aliyun-inc.com. For more information, see Endpoints.
TUNNEL DOWNLOAD (Internet) | Download data over the Internet. | Yes | TUNNEL DOWNLOAD table_name e:/table_name.txt; -- configure the public endpoint of MaxCompute, such as http://dt.cn-shanghai.maxcompute.aliyun.com. For more information, see Endpoints.
TUNNEL UPLOAD | Upload data. | No | TUNNEL UPLOAD e:/table_name.txt table_name;
COST SQL | Estimate costs. | No | COST SQL SELECT * FROM table_name;
INSERT OVERWRITE...SELECT | Update data. | Yes | INSERT OVERWRITE TABLE table_name PARTITION (sale_date='20180122') SELECT shop_name, customer_id, total_price FROM sale_detail;
DESC TABLE | Query table information. | No | DESC table_name;
DROP TABLE | Delete a table. | No | DROP TABLE IF EXISTS table_name;
CREATE TABLE | Create a table. | No | CREATE TABLE IF NOT EXISTS table_name (key STRING, value BIGINT) PARTITIONED BY (p STRING);
CREATE TABLE...SELECT | Create a table from a query. | Yes | CREATE TABLE IF NOT EXISTS table_name AS SELECT * FROM a_tab;
INSERT INTO TABLE...VALUES | Quickly insert constant data. | No | INSERT INTO TABLE table_name PARTITION (p)(key,p) VALUES ('d','20170101'),('e','20170101'),('f','20170101');
INSERT INTO TABLE...SELECT | Insert data. | Yes | INSERT INTO TABLE table_name SELECT shop_name, customer_id, total_price FROM sale_detail;
SELECT UDF [NOT COUNT or ALL] FROM TABLE | Query table data. | Yes | SELECT SUM(a) FROM table_name;
SET FLAG | Configure session settings. | No | SET odps.sql.allow.fullscan=true;
JAR MR | Execute a MapReduce job. | Yes | JAR -l com.aliyun.odps.mapred.example.WordCount wc_in wc_out
ADD JAR/FILE/ARCHIVE/TABLE | Add a resource. | No | ADD JAR data\resources\mapreduce-examples.jar -f;
DROP JAR/FILE/ARCHIVE/TABLE | Remove a resource. | No | DROP RESOURCE sale.res
LIST RESOURCES | Query resources. | No | LIST RESOURCES;
GET RESOURCES | Download resources. | No | GET RESOURCES odps-udf-examples.jar d:\;
CREATE FUNCTION | Create a function. | No | CREATE FUNCTION test_lower ...;
DROP FUNCTION | Delete a function. | No | DROP FUNCTION test_lower;
LIST FUNCTIONS | Query functions. | No | LIST FUNCTIONS;
ALTER TABLE...DROP PARTITION | Delete a partition from a table. | No | ALTER TABLE user DROP IF EXISTS PARTITION (region='hangzhou',dt='20150923');
TRUNCATE TABLE | Remove all records from a non-partitioned table. | No | TRUNCATE TABLE table_name;
CREATE EXTERNAL TABLE | Create an external table. | No | CREATE EXTERNAL TABLE IF NOT EXISTS ambulance_data_csv_external ... LOCATION 'oss://oss-cn-shanghai-internal.aliyuncs.com/oss-odps-test/Demo/'
SELECT [EXTERNAL] TABLE | Read an external table. | Yes | SELECT recordId, patientId, direction FROM ambulance_data_csv_external WHERE patientId > 25;
SHOW TABLES | Query all tables in the current project. | No | SHOW TABLES;
SHOW PARTITIONS table_name | Query all partitions of a table. | No | SHOW PARTITIONS <table_name>
SHOW INSTANCE/SHOW P | Show information about the instances that the current user creates. | No | SHOW INSTANCES; SHOW P;
WAIT INSTANCE | Return the Logview URL of the specified instance. | No | WAIT 20131225123302267gk3u6k4y2
STATUS INSTANCE | Return the status of the specified instance. | No | STATUS 20131225123302267gk3u6k4y2
KILL INSTANCE | Stop the specified instance. | No | KILL 20131225123302267gk3u6k4y2

6.8. Analyze MaxCompute bills

If you want to know the distribution of your expenses or prevent MaxCompute costs from increasing beyond expectations, you can obtain and analyze your MaxCompute bills. The analysis helps maximize resource utilization and reduce costs. This topic describes how to analyze the cost distribution of MaxCompute based on the usage records of bills.

Background information
MaxCompute is a big data analytics platform. Computing resources of MaxCompute support two billing methods: subscription and pay-as-you-go. You are charged based on MaxCompute projects on a daily basis, and daily bills are generated before 06:00 of the next day. For more information about the billable items and billing methods of MaxCompute, see Billing method.

Alibaba Cloud provides information about MaxCompute bill fluctuations (cost increases in most cases) during data development or before a version release of MaxCompute. You can analyze the bill fluctuations and optimize the jobs in your MaxCompute projects based on the analysis results. You can download the usage records of all commercial services on the Billing Management page in the Alibaba Cloud Management Console. For more information about how to obtain and download bills, see View billing details.

Upload usage records of bills to MaxCompute

1. Create a MaxCompute table that is named maxcomputefee on the MaxCompute client (odpscmd). Sample statements:


DROP TABLE IF EXISTS maxcomputefee;


CREATE TABLE IF NOT EXISTS maxcomputefee
(
projectid STRING COMMENT 'ProjectId'
,feeid STRING COMMENT 'MeteringId'
,type STRING COMMENT 'MeteringType, such as Storage, ComputationSQL, or DownloadEx'
,storage BIGINT COMMENT 'Storage'
,endtime STRING COMMENT 'EndTime'
,computationsqlinput BIGINT COMMENT 'SQLInput(Byte)'
,computationsqlcomplexity DOUBLE COMMENT 'SQLComplexity'
,uploadex BIGINT COMMENT 'UploadEx'
,download BIGINT COMMENT 'DownloadEx(Byte)'
,cu_usage DOUBLE COMMENT 'MRCompute(Core*Second)'
,input_ots BIGINT COMMENT 'InputOTS(Byte)'
,input_oss BIGINT COMMENT 'InputOSS(Byte)'
,starttime STRING COMMENT 'StartTime'
,source_type STRING COMMENT 'SpecificationType'
,source_id STRING COMMENT 'DataWorksNodeID'
);

Fields of usage records:

ProjectId: a MaxCompute project of your Alibaba Cloud account or a MaxCompute project of the Alibaba Cloud account to which the current RAM user belongs.
MeteringId: a billing ID, which indicates the ID of a storage task, an SQL computing task, an upload task, or a download task. The ID of an SQL computing task is specified by InstanceId, and the ID of an upload or download task is specified by Tunnel SessionId.
MeteringType: a billing type. Valid values: Storage, ComputationSql, UploadIn, UploadEx, DownloadIn, and DownloadEx.
Storage: the amount of data that is read per hour. Unit: bytes.
StartTime or EndTime: the time when a job started to run or the time when a job stopped. Only storage data is obtained on an hourly basis.
SQLInput(Byte): the SQL computation item. This field specifies the amount of input data each time an SQL statement is executed. Unit: bytes.
SQLComplexity: the complexity of the SQL statements. This field is one of the SQL billing factors.
UploadEx or DownloadEx(Byte): the amount of data that is uploaded or downloaded over the Internet. Unit: bytes.
MRCompute(Core*Second): the resource consumption of a MapReduce job or a Spark job, calculated by using the following formula: Number of cores × Number of seconds. After the calculation, you must convert the result into billable hours.
InputOTS(Byte) or InputOSS(Byte): the amount of data that is read from Tablestore or OSS by using external tables. Unit: bytes. These fields are used when fees for external tables are generated.

2. Run Tunnel commands to upload the usage records of bills.

To upload a CSV file that contains usage records of bills to MaxCompute, make sure that the number and data types of the columns in the CSV file are the same as those in the maxcomputefee table. Otherwise, the data upload fails.


tunnel upload ODPS_2019-01-12_2019-01-14.csv maxcomputefee -c "UTF-8" -h "true" -dfp "yyyy-MM-dd HH:mm:ss";

Note
For more information about the configurations of Tunnel commands, see Tunnel commands.
You can also upload the usage records of bills by using the data import feature of DataWorks. For more information, see Import data by using Data Integration.

3. Execute the following statement to check whether all usage records are uploaded:

SELECT * FROM maxcomputefee limit 10;

Use SQL statements to analyze usage records of bills

1. Analyze the costs of SQL jobs. MaxCompute SQL meets the business requirements of 95% of cloud users, and the fees it generates account for a large proportion of MaxCompute bills.

Note: Costs of an SQL job = Amount of input data × Complexity of the SQL statements × Unit price (USD 0.0438/GB)


-- Sort SQL jobs based on sqlmoney to analyze the costs of SQL jobs.
SELECT to_char(endtime,'yyyymmdd') AS ds
      ,feeid AS instanceid
      ,projectid
      ,computationsqlcomplexity  -- SQL complexity
      ,SUM((computationsqlinput / 1024 / 1024 / 1024)) AS computationsqlinput  -- Amount of input data (GB)
      ,SUM((computationsqlinput / 1024 / 1024 / 1024)) * computationsqlcomplexity * 0.0438 AS sqlmoney
FROM maxcomputefee
WHERE TYPE = 'ComputationSql'
AND to_char(endtime,'yyyymmdd') >= '20190112'
GROUP BY to_char(endtime,'yyyymmdd')
        ,feeid
        ,projectid
        ,computationsqlcomplexity
ORDER BY sqlmoney DESC
LIMIT 10000
;

The following conclusions can be drawn from the execution result:

To reduce the costs of large jobs, you can reduce the amount of data that you want to read and the complexity of the SQL statements.
You can summarize daily data based on the ds field and analyze the trend of the costs of SQL jobs in a specified period of time. For example, you can create a line chart in an Excel file or by using tools, such as Quick BI, to display the trend.
You can perform the following steps to locate the node that you want to optimize based on the execution result:
a. Obtain the ID of a job instance.

Run the wait InstanceId; command on the MaxCompute client (odpscmd) or in the DataWorks console to view the information about a specific job and the related SQL statements.

b. Enter the returned Logview URL in a web browser and press Enter to view the information about the SQL job.

For more information about how to use Logview to view information about jobs, see Use Logview to view job information.


c. Obtain the name of the DataWorks node from Logview.

In Logview, find the job whose information you want to view and click XML in the SourceXML column to view the job details. In the following figure, SKYNET_NODENAME indicates the name of the DataWorks node. This parameter is displayed only for the jobs that are run by the scheduling system and is left empty for ad hoc queries. After you obtain the node name, you can quickly locate the node in the DataWorks console to optimize the node or view the node owner.

2. Analyze the trend of the number of jobs. In most cases, a surge in the number of jobs due to repeated operations or invalid settings of scheduling attributes results in cost increases.

-- Analyze the trend of the number of jobs.
SELECT TO_CHAR(endtime,'yyyymmdd') AS ds
      ,projectid
      ,COUNT(*) AS tasknum
FROM maxcomputefee
WHERE TYPE = 'ComputationSql'
AND TO_CHAR(endtime,'yyyymmdd') >= '20190112'
GROUP BY TO_CHAR(endtime,'yyyymmdd')
        ,projectid
ORDER BY tasknum DESC
LIMIT 10000
;


The following figure shows the execution result.

The execution result shows the trend of the number of jobs that were submitted to MaxCompute and were successfully run from January 12, 2019 to January 14, 2019.

3. Analyze storage costs.

-- Analyze storage costs.
SELECT t.ds
      ,t.projectid
      ,t.storage
      ,CASE WHEN t.storage < 0.5 THEN 0.01
            WHEN t.storage >= 0.5 AND t.storage <= 10240 THEN t.storage*0.0072
            WHEN t.storage > 10240 AND t.storage <= 102400 THEN (10240*0.0072+(t.storage-10240)*0.006)
            WHEN t.storage > 102400 THEN (10240*0.0072+(102400-10240)*0.006+(t.storage-102400)*0.004)
       END AS storage_fee
FROM (
    SELECT to_char(starttime,'yyyymmdd') AS ds
          ,projectid
          ,SUM(storage/1024/1024/1024)/24 AS storage
    FROM maxcomputefee
    WHERE TYPE = 'Storage'
    AND to_char(starttime,'yyyymmdd') >= '20190112'
    GROUP BY to_char(starttime,'yyyymmdd')
            ,projectid
) t
ORDER BY storage_fee DESC
;

The following figure shows the execution result.

The following conclusions can be drawn from the execution result:

Storage costs increased on January 12, 2019 and decreased on January 14, 2019.
To reduce storage costs, we recommend that you configure a lifecycle for tables and delete unnecessary temporary tables.

4. Analyze download costs.


For Internet-based data downloads or cross-region data downloads in your MaxCompute project, you are charged based on the amount of data that is downloaded.

Note: Costs of a download job = Amount of downloaded data × Unit price (USD 0.1166/GB)

-- Analyze download costs.
SELECT TO_CHAR(starttime,'yyyymmdd') AS ds
      ,projectid
      ,SUM((download/1024/1024/1024)*0.1166) AS download_fee
FROM maxcomputefee
WHERE type = 'DownloadEx'
AND TO_CHAR(starttime,'yyyymmdd') >= '20190112'
GROUP BY TO_CHAR(starttime,'yyyymmdd')
        ,projectid
ORDER BY download_fee DESC
;

5. Analyze the costs of MapReduce jobs.

Note: Computing fees for MapReduce jobs on a day = Total billable hours on the day × Unit price (USD 0.0690/hour/job)

-- Analyze the costs of MapReduce jobs.
SELECT TO_CHAR(starttime,'yyyymmdd') AS ds
      ,projectid
      ,(cu_usage/3600)*0.0690 AS mr_fee
FROM maxcomputefee
WHERE type = 'MapReduce'
AND TO_CHAR(starttime,'yyyymmdd') >= '20190112'
GROUP BY TO_CHAR(starttime,'yyyymmdd')
        ,projectid
        ,cu_usage
ORDER BY mr_fee DESC
;

6. Analyze the costs of jobs that involve Tablestore external tables or OSS external tables.

Note: Computing fees for an SQL job that involves external tables = Amount of input data × Unit price (USD 0.0044/GB)


-- Analyze the costs of SQL jobs that involve Tablestore external tables.
SELECT TO_CHAR(starttime,'yyyymmdd') AS ds
,projectid
,(input_ots/1024/1024/1024)*1*0.0044 AS ots_fee
FROM maxcomputefee
WHERE type = 'ComputationSql'
AND TO_CHAR(starttime,'yyyymmdd') >= '20190112'
GROUP BY TO_CHAR(starttime,'yyyymmdd')
,projectid
,input_ots
ORDER BY ots_fee DESC
;
-- Analyze the costs of SQL jobs that involve OSS external tables.
SELECT TO_CHAR(starttime,'yyyymmdd') AS ds
      ,projectid
      ,(input_oss/1024/1024/1024)*1*0.0044 AS oss_fee
FROM maxcomputefee
WHERE type = 'ComputationSql'
AND TO_CHAR(starttime,'yyyymmdd') >= '20190112'
GROUP BY TO_CHAR(starttime,'yyyymmdd')
        ,projectid
        ,input_oss
ORDER BY oss_fee DESC
;

7. Analyze the costs of Spark jobs.

Note: Computing fees for Spark jobs on a day = Total billable hours on the day × Unit price (USD 0.1041/hour/job)

-- Analyze the costs of Spark jobs.
SELECT TO_CHAR(starttime,'yyyymmdd') AS ds
      ,projectid
      ,(cu_usage/3600)*0.1041 AS spark_fee
FROM maxcomputefee
WHERE type = 'spark'
AND TO_CHAR(starttime,'yyyymmdd') >= '20190112'
GROUP BY TO_CHAR(starttime,'yyyymmdd')
        ,projectid
        ,cu_usage
ORDER BY spark_fee DESC
;


7.Security management
7.1. Set a RAM user as the super
administrator for a MaxCompute
project
T his t opic describes how t o set a RAM user as t he super administ rat or for a MaxComput e project , and
provides suggest ions on how t o manage members and permissions.

Background information
T o ensure dat a securit y, t he Alibaba Cloud account of a project is used only by aut horized personnel.
Common users can only log on t o MaxComput e as RAM users. A project owner must be t he Alibaba
Cloud account , and some operat ions can only be performed by t he project owner, such as set t ing a
project flag and configuring cross-project resource sharing by using packages. If you use a RAM user,
make sure t hat it has been grant ed t he super administ rat or role.

T he built -in management role Super_Administ rat or has been added t o MaxComput e. T his role has
permissions on all t ypes of resources in a project and project management permissions. For more
informat ion about permissions, see Role planning and management .

A project owner can grant t he Super_Administ rat or role t o a RAM user. As a super administ rat or, t he
RAM user has t he permissions needed t o manage t he project , such as common project flag set t ing
permissions and permissions on managing all resources.

Authorization methods
We recommend that you grant the Super_Administrator role to a RAM user that has the permissions to create a project. This way, the RAM user can manage both DataWorks workspaces and the MaxCompute projects that are associated with these DataWorks workspaces.

Note
For information about how to authorize a RAM user to create projects, see Grant a RAM user the permissions to perform operations in the DataWorks console.
To ensure data security, we recommend that you clarify the responsibilities of the owners of RAM users. Make sure that each RAM user belongs to only one developer.
Only one RAM user can be granted the Super_Administrator role in a project. You can grant the Admin role to other RAM users that require basic management permissions.

After you select a RAM user and use the RAM user to create a project, the project owner is still the Alibaba Cloud account. The owner can grant the Super_Administrator role to the RAM user in the following ways:
Grant the Super_Administrator role on the MaxCompute client.

Assume that the user [email protected] is the owner of the project_a project, and the user Allen is a RAM user under [email protected].

i. Run the following commands to grant the Super_Administrator and Admin roles as user [email protected]:

-- Open project_a.
use project_a;
-- Add the RAM user Allen to project_a.
add user [email protected]:Allen;
-- Grant the Super_Administrator role to Allen.
grant super_administrator TO [email protected]:Allen;
-- Grant the Admin role to Allen.
grant admin TO [email protected]:Allen;

ii. Run the following command to view the permissions as the authorized RAM user:

show grants;

If the Super_Administrator role is in the command output, the authorization succeeded.

Grant the Super_Administrator role in the DataWorks console.


i. Log on to the DataWorks console and choose Workspace Management.
ii. Optional. Add a RAM user as a project member. Skip this step if the RAM user is already a project member.
a. In the left-side navigation pane, click User Management.
b. In the upper-right corner, click Add Member.
c. In the Add Member dialog box, select the members that you want to add from the Accounts to be added section and click the rightwards arrow to add them to the Added account section.

Note In the note block, click Refresh to synchronize the RAM users under the current Alibaba Cloud account to the Accounts to be added section.

d. Select the required roles and click Confirm.

iii. Grant the Super_Administrator role to the RAM user.
a. In the left-side navigation pane, click MaxCompute Management.
b. In the navigation tree, click Custom user roles.


c. Find the role that you want to grant to the user and click Member management in the Operation column. In the Member management dialog box, select the members that you want to add from the Accounts to be added section and click the rightwards arrow to add them to the Added account section.

d. Click Confirm.

iv. Run the following command to view the permissions as the authorized RAM user:

show grants;

If the Super_Administrator role is in the command output, the authorization succeeded.

Usage notes
Member management
MaxCompute supports the Alibaba Cloud account and RAM users. To ensure data security, we recommend that you add only RAM users under the project owner as project members.

The Alibaba Cloud account is used to control RAM users, for example, to revoke or update their credentials. This ensures data security in the case of personnel transfers and resignations.

Note If you use DataWorks to manage project members, you can add only RAM users under the project owner as project members.

RAM users can be added by the Alibaba Cloud account and the super administrator. If you want to add RAM users to a project as the super administrator, wait until the RAM users are created by the Alibaba Cloud account.
We recommend that you add only the users who need to develop data in the current project, namely, the users who need to run jobs, as project members. For users who require data interactions, you can use packages to share resources across projects. This reduces the complexity of member management because fewer members are added to the project.
If an employee who has a RAM user is transferred to another position or resigns, a RAM user with the Super_Administrator role needs to remove the RAM user of the employee from the project, and then notify the project owner to revoke its credentials. If an employee who has a RAM user with the Super_Administrator role is transferred to another position or resigns, the Alibaba Cloud account must be used to remove the RAM user and revoke its credentials. A removal sketch is provided after this list.
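The following is a minimal sketch of the removal procedure on the MaxCompute client. It assumes a hypothetical project project_a, a departing RAM user Allen under the account [email protected], and a hypothetical role role_project_dev that was previously assigned to Allen; adjust all names to your environment. Run the commands as the project owner or as a user that holds the Super_Administrator role.

-- Open the project. All project, role, and user names below are placeholders.
use project_a;
-- Revoke any roles that are still assigned to the departing RAM user.
revoke role_project_dev from [email protected]:Allen;
-- Remove the RAM user from the project.
remove user [email protected]:Allen;

After the RAM user is removed from the project, the project owner should revoke or update the credentials of the RAM user in the RAM console.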

Permission management


We recommend that you manage permissions by role: permissions are associated with roles, and roles are associated with users.
We recommend that you follow the principle of least privilege to avoid the security risks caused by excessive permissions.
If you need to use data across projects, we recommend that you share resources by using packages (see the sketch after the following note). This way, resource providers only need to manage the packages, which avoids the extra costs caused by the management of additional members.

Note A RAM user who has been granted the Super_Administrator role has the permissions to query and manage all resources in a project. Therefore, no additional permissions need to be granted to the RAM user.
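The following is a minimal sketch of cross-project sharing with a package, assuming a hypothetical provider project prj_a that shares a table tbl_share with a hypothetical consumer project prj_b; all project, package, and table names are placeholders.

-- In the provider project: create a package, add the table to share, and allow prj_b to install the package.
use prj_a;
create package datashare;
add table tbl_share to package datashare;
allow project prj_b to install package datashare;
-- In the consumer project: install the package. Access within prj_b can then be granted to its own roles or users.
use prj_b;
install package prj_a.datashare;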

Permission auditing

You can use the views provided by the MaxCompute metadata service to audit permissions. For more information, see Metadata views.

Cost management

For more information, see View billing details. RAM users can query the billing details only after the Alibaba Cloud account grants them the permissions to access Billing Management. For information about how to grant permissions, see Grant permissions to a RAM role. The following permissions are required:
AliyunBSSFullAccess: the permissions to manage Billing Management.
AliyunBSSReadOnlyAccess: the read-only permissions on Billing Management.
AliyunBSSOrderAccess: the permissions to view, pay for, and cancel orders in Billing Management.

Note Permissions on Billing Management are independent of the Super_Administrator role of a MaxCompute project. You must grant these permissions to the user separately.

Resource usage management

If you use subscription computing resources of MaxCompute, you can view the usage of computing resources and manage all the computing resources in MaxCompute Management. For more information, see Use MaxCompute Management.
If you use pay-as-you-go computing resources of MaxCompute, you can view the usage of computing resources in the views provided by the MaxCompute metadata service. For example, TASKS_HISTORY allows you to view the execution details of jobs for auditing, such as the time, content, and resource consumption. For more information, see TASKS_HISTORY. A query sketch is provided after the following note.

Note The views provided by the metadata service retain only the data generated in the last 15 days. If you need to store data for a longer period of time, we recommend that you regularly read and save the data to local storage.
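The following is a minimal query sketch against TASKS_HISTORY. It assumes that the view is partitioned by ds in the yyyymmdd format and exposes columns such as inst_id, owner_name, task_type, input_bytes, cost_cpu, start_time, and end_time; check the TASKS_HISTORY reference for the exact column names available in your environment.

-- List the most resource-intensive jobs of a given day. The date is a placeholder.
SELECT inst_id
,owner_name
,task_type
,input_bytes/1024/1024/1024 AS input_gb
,cost_cpu
,start_time
,end_time
FROM information_schema.tasks_history
WHERE ds = '20190112'
ORDER BY cost_cpu DESC
LIMIT 100;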

7.2. Policy-based permission management for users assigned built-in roles
If a user is assigned a built-in role in MaxCompute, the user has the permissions of the built-in role. For example, if a user is assigned the Development role, the user is granted the operation permissions on tables and resources. In actual business scenarios, you may need to manage the operation permissions of such users in a fine-grained manner. For example, you may need to prohibit the users from deleting important tables. This topic describes how to perform policy-based permission management for users assigned built-in roles.

Prerequisites
The MaxCompute client is installed. For more information, see Install and configure the MaxCompute client.

Context
If a user is assigned a built-in role and you want to manage the permissions of the user in a fine-grained manner, we recommend that you use the policy-based permission management mechanism instead of the access control list (ACL) mechanism. For more information about built-in roles, see Users and roles. For more information about the policy-based permission management mechanism, see Policy-based access control and download control.

The policy-based access control mechanism manages permissions based on roles. This mechanism allows you to grant or revoke operation permissions on project objects, such as tables, for roles. The operations include read and write operations. After you assign a role to a user, the permissions granted to or revoked from the role also take effect on the user. For more information about the GRANT and REVOKE syntax, see Policy-based access control and download control.

Grant permissions by using the policy-based access control mechanism
In the following example, the RAM user Alice is assigned the Development role of a MaxCompute project, and you need to prohibit the RAM user Alice from deleting all tables whose names start with tb_. The RAM user Alice belongs to the Alibaba Cloud account [email protected].

This operation can be performed only by the project owner or by users assigned the Super_Administrator or Admin role.

1. Start the MaxCompute client.


2. Execute the CREATE ROLE statement to create a role named delete_test.

Sample statement:

create role delete_test;

For more information about how to create a role, see Role planning and management.
3. Execute the GRANT statement to grant the delete_test role the permission that prohibits the role from deleting all tables whose names start with tb_.
Sample statement:


grant drop on table tb_* to role delete_test privilegeproperties("policy" = "true", "allow"="false");

For more information about the GRANT syntax, see the "Policy-based access control by using the GRANT statement" section in Policy-based access control and download control.

4. Execute the GRANT statement to assign the delete_test role to the RAM user Alice.

Sample statement:

grant delete_test to [email protected]:Alice;

If you do not know the Alibaba Cloud account to which the RAM user belongs, you can execute the LIST USERS; statement on the MaxCompute client to obtain the account. For more information about how to assign a role to a user, see Role planning and management.

5. Execute the SHOW GRANTS statement to view the permissions of the RAM user Alice.

Sample statement:

show grants for [email protected]:Alice;

The following results are returned:

[roles]
role_project_admin, delete_test                    -- Alice is assigned the delete_test role.
Authorization Type: Policy                         -- The authorization method is Policy.
[role/delete_test]
D projects/mcproject_name/tables/tb_*: Drop        -- Alice is not allowed to delete the tables whose names start with tb_ in the project. D indicates Deny.
[role/role_project_admin]
A projects/mcproject_name: *
A projects/mcproject_name/instances/*: *
A projects/mcproject_name/jobs/*: *
A projects/mcproject_name/offlinemodels/*: *
A projects/mcproject_name/packages/*: *
A projects/mcproject_name/registration/functions/*: *
A projects/mcproject_name/resources/*: *
A projects/mcproject_name/tables/*: *
A projects/mcproject_name/volumes/*: *
Authorization Type: ObjectCreator
AG projects/mcproject_name/tables/local_test: All
AG projects/mcproject_name/tables/mr_multiinout_out1: All
AG projects/mcproject_name/tables/mr_multiinout_out2: All
AG projects/mcproject_name/tables/ramtest: All
AG projects/mcproject_name/tables/wc_in: All
AG projects/mcproject_name/tables/wc_in1: All
AG projects/mcproject_name/tables/wc_in2: All
AG projects/mcproject_name/tables/wc_out: All

For more information about how to view user permissions, see Query permissions by using MaxCompute SQL.

6. Log on to the MaxCompute client as Alice and execute the DROP TABLE statement to delete the tables whose names start with tb_.


Sample statement:

drop table tb_test;

The following results are returned, which indicate that the permission takes effect. If the table is deleted, the permission does not take effect. In this case, you must check whether the preceding steps are correctly performed.

FAILED: Catalog Service Failed, ErrorCode: 50, Error Message: ODPS-0130013:Authorization exception - Authorization Failed [4011],
You have NO privilege 'odps:Drop' on {acs:odps:*:projects/mcproject_name/tables/tb_test}.
Explicitly denied by policy.
Context ID:85efa8e9-40da-4660-bbfd-b503dfa64c0a. --->Tips: Pricipal:[email protected]:Alice; Deny by policy

Revoke permissions by using the policy-based access control mechanism
The RAM user Alice is not allowed to delete the tables whose names start with tb_, as described in Grant permissions by using the policy-based access control mechanism. If the tables are no longer required and you want to allow the RAM user Alice to delete them, you can revoke the related permission from the RAM user Alice.

This operation can be performed only by the project owner or by users assigned the Super_Administrator or Admin role. You can use one of the following methods to revoke the permission from the RAM user Alice based on your business requirements.

Revoke the permission that is granted to the role and retain the role

Perform the following steps:

i. Start the MaxCompute client.

ii. Execute the REVOKE statement to revoke the permission that is granted to the delete_test role. This way, the delete_test role is allowed to delete the tables whose names start with tb_.

Sample statement:

revoke drop on table tb_* from role delete_test privilegeproperties("policy" = "true", "allow"="false");

For more information about the REVOKE syntax, see the "Policy-based access control by using the GRANT statement" section in Policy-based access control and download control.

iii. Execute the SHOW GRANTS statement to view the permissions of the RAM user Alice. Sample statement:

show grants for [email protected]:Alice;

The following results are returned:


[roles]
role_project_admin, delete_test                    -- The delete_test role is retained.
Authorization Type: Policy                         -- The permission is revoked.
[role/role_project_admin]
A projects/mcproject_name: *
A projects/mcproject_name/instances/*: *
A projects/mcproject_name/jobs/*: *
A projects/mcproject_name/offlinemodels/*: *
A projects/mcproject_name/packages/*: *
A projects/mcproject_name/registration/functions/*: *
A projects/mcproject_name/resources/*: *
A projects/mcproject_name/tables/*: *
A projects/mcproject_name/volumes/*: *
Authorization Type: ObjectCreator
AG projects/mcproject_name/tables/local_test: All
AG projects/mcproject_name/tables/mr_multiinout_out1: All
AG projects/mcproject_name/tables/mr_multiinout_out2: All
AG projects/mcproject_name/tables/ramtest: All
AG projects/mcproject_name/tables/tb_test: All
AG projects/mcproject_name/tables/wc_in: All
AG projects/mcproject_name/tables/wc_in1: All
AG projects/mcproject_name/tables/wc_in2: All
AG projects/mcproject_name/tables/wc_out: All

For more information about how to view user permissions, see Query permissions by using MaxCompute SQL.

iv. Log on to the MaxCompute client as Alice and execute the DROP TABLE statement to delete the tables whose names start with tb_.

Sample statement:

drop table tb_test;

If OK is returned, the permission is revoked.


Revoke the role from the user and delete the role if required

Perform the following steps:

i. Start the MaxCompute client.

ii. Execute the REVOKE statement to revoke the delete_test role from Alice.

Sample statement:

revoke delete_test from [email protected]:Alice;

For more information about how to revoke a role from a user, see Role planning and management.

iii. Execute the SHOW GRANTS statement to view the permissions of the RAM user Alice. Sample statement:

show grants for [email protected]:Alice;

The following results are returned:


[roles]
role_project_admin                                 -- The delete_test role is revoked.
Authorization Type: Policy
[role/role_project_admin]
A projects/mcproject_name: *
A projects/mcproject_name/instances/*: *
A projects/mcproject_name/jobs/*: *
A projects/mcproject_name/offlinemodels/*: *
A projects/mcproject_name/packages/*: *
A projects/mcproject_name/registration/functions/*: *
A projects/mcproject_name/resources/*: *
A projects/mcproject_name/tables/*: *
A projects/mcproject_name/volumes/*: *
Authorization Type: ObjectCreator
AG projects/mcproject_name/tables/local_test: All
AG projects/mcproject_name/tables/mr_multiinout_out1: All
AG projects/mcproject_name/tables/mr_multiinout_out2: All
AG projects/mcproject_name/tables/ramtest: All
AG projects/mcproject_name/tables/wc_in: All
AG projects/mcproject_name/tables/wc_in1: All
AG projects/mcproject_name/tables/wc_in2: All
AG projects/mcproject_name/tables/wc_out: All

iv. Log on to the MaxCompute client as Alice and execute the DROP TABLE statement to delete the tables whose names start with tb_.

Sample statement:

drop table tb_test;

If OK is returned, the permission is revoked.

v. Optional. Execute the DROP ROLE statement to delete the delete_test role.

Sample statement:

drop role delete_test;

If OK is returned, the role is deleted. For more information about how to delete a role, see Role planning and management.
