Alibaba Cloud MaxCompute
Best Practices
Legal disclaimer
Alibaba Cloud reminds you to carefully read and fully understand the terms and conditions of this legal disclaimer before you read or use this document. If you have read or used this document, it shall be deemed as your total acceptance of this legal disclaimer.
1. You shall download and obtain this document from the Alibaba Cloud website or other Alibaba Cloud-authorized channels, and use this document for your own legal business activities only. The content of this document is considered confidential information of Alibaba Cloud. You shall strictly abide by the confidentiality obligations. No part of this document shall be disclosed or provided to any third party for use without the prior written consent of Alibaba Cloud.
2. No part of this document shall be excerpted, translated, reproduced, transmitted, or disseminated by any organization, company, or individual in any form or by any means without the prior written consent of Alibaba Cloud.
3. The content of this document may be changed because of product version upgrades, adjustments, or other reasons. Alibaba Cloud reserves the right to modify the content of this document without notice, and an updated version of this document will be released through Alibaba Cloud-authorized channels from time to time. You should pay attention to the version changes of this document as they occur and download and obtain the most up-to-date version of this document from Alibaba Cloud-authorized channels.
4. This document serves only as a reference guide for your use of Alibaba Cloud products and services. Alibaba Cloud provides this document based on the "status quo", "being defective", and "existing functions" of its products and services. Alibaba Cloud makes every effort to provide relevant operational guidance based on existing technologies. However, Alibaba Cloud hereby makes a clear statement that it in no way guarantees the accuracy, integrity, applicability, and reliability of the content of this document, either explicitly or implicitly. Alibaba Cloud shall not take legal responsibility for any errors or lost profits incurred by any organization, company, or individual arising from download, use, or trust in this document. Alibaba Cloud shall not, under any circumstances, take responsibility for any indirect, consequential, punitive, contingent, or special damages, including lost profits arising from the use or trust in this document (even if Alibaba Cloud has been notified of the possibility of such a loss).
5. By law, all the contents in Alibaba Cloud documents, including but not limited to pictures, architecture design, page layout, and text description, are intellectual property of Alibaba Cloud and/or its affiliates. This intellectual property includes, but is not limited to, trademark rights, patent rights, copyrights, and trade secrets. No part of this document shall be used, modified, reproduced, publicly transmitted, changed, disseminated, distributed, or published without the prior written consent of Alibaba Cloud and/or its affiliates. The names owned by Alibaba Cloud shall not be used, published, or reproduced for marketing, advertising, promotion, or other purposes without the prior written consent of Alibaba Cloud. The names owned by Alibaba Cloud include, but are not limited to, "Alibaba Cloud", "Aliyun", "HiChina", and other brands of Alibaba Cloud and/or its affiliates, which appear separately or in combination, as well as the auxiliary signs and patterns of the preceding brands, or anything similar to the company names, trade names, trademarks, product or service names, domain names, patterns, logos, marks, signs, or special descriptions that third parties identify as Alibaba Cloud and/or its affiliates.
6. Please directly contact Alibaba Cloud for any errors in this document.
Document conventions

Style: Warning
Description: A warning notice indicates a situation that may cause major system changes, faults, physical injuries, and other adverse results.
Example: Warning: Restarting will cause business interruption. About 10 minutes are required to restart an instance.

Style: >
Description: Closing angle brackets are used to indicate a multi-level menu cascade.
Example: Click Settings > Network > Set network type.
Table of Contents
1. SQL
2. Data migration
2.1. Overview
3.5. Resolve the issue that you cannot upload files that exceed 10 MB to DataWorks
1. SQL
1.1. Write MaxCompute SQL statements
This topic describes the common scenarios of using MaxCompute SQL statements and how to write them.
Prepare a dataset
The emp and dept tables are used as the dataset in this example. You can create a table in a MaxCompute project and upload data to the table. For more information about how to import data, see Overview.
Download the data files of the emp table and the data files of the dept table.
Examples
Example 1: Query all departments that have at least one employee.
We recommend that you use the JOIN clause to avoid large amounts of data in the query. Execute the following SQL statement:
SELECT d.*
FROM dept d
JOIN (
SELECT DISTINCT deptno AS no
FROM emp
) e
ON d.deptno = e.no;
Example 2: Query all employees who have higher salaries than Smith.
The following code shows how to use MAPJOIN in the SQL statement for this scenario:
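A statement for this scenario might look like the following sketch against the emp table described above. The MAX(sal) subquery shape and the mapjoin hint placement are assumptions for illustration, not the original example:

```sql
-- MAPJOIN keeps the one-row subquery in memory and allows the
-- non-equi join condition (e.sal > s.sal).
SELECT /*+ mapjoin(s) */ e.empno, e.ename, e.sal
FROM emp e
JOIN (
    SELECT MAX(sal) AS sal
    FROM emp
    WHERE ename = 'SMITH'
) s
ON e.sal > s.sal;
```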
Example 3: Query the names of all employees and the names of their immediate superiors.
The following code shows how to use EQUI JOIN in the SQL statement for this scenario:
SELECT a.ename
, b.ename
FROM emp a
LEFT OUTER JOIN emp b
ON b.empno = a.mgr;
Example 4: Query all jobs that have basic salaries higher than USD 1,500.
The following code shows how to use the HAVING clause in the SQL statement for this scenario:
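One way to express this with HAVING is the following sketch over the emp table; treating "basic salary" as the minimum salary per job is an assumption:

```sql
-- Keep only the jobs whose lowest salary exceeds 1,500.
SELECT job, MIN(sal) AS min_sal
FROM emp
GROUP BY job
HAVING MIN(sal) > 1500;
```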
Example 5: Query the number of employees in each department, the average salary, and the average length of service.
The following code shows how to use built-in functions in the SQL statement for this scenario:
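A sketch using the COUNT, AVG, DATEDIFF, and GETDATE built-in functions; the hiredate column and the 'dd' date part are assumptions based on the classic emp schema:

```sql
-- Per-department headcount, average salary, and average days of service.
SELECT deptno
     , COUNT(empno) AS cnt
     , AVG(sal) AS avg_sal
     , AVG(DATEDIFF(GETDATE(), hiredate, 'dd')) AS avg_service_days
FROM emp
GROUP BY deptno;
```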
Example 6: Query the names and the sorting order of the first three employees who have the highest salaries.
The following code shows how to use the TOP N clause in the SQL statement for this scenario:
SELECT *
FROM (
SELECT deptno
, ename
, sal
, ROW_NUMBER() OVER (PARTITION BY deptno ORDER BY sal DESC) AS nums
FROM emp
) emp1
WHERE emp1.nums < 4;
Example 7: Query the number of employees in each department and the proportion of clerks in these departments.
SELECT deptno
, COUNT(empno) AS cnt
, ROUND(SUM(CASE
WHEN job = 'CLERK' THEN 1
ELSE 0
END) / COUNT(empno), 2) AS rate
FROM `EMP`
GROUP BY deptno;
Notes
When you use the GROUP BY clause, the SELECT list can consist only of aggregate functions or columns that are part of the GROUP BY clause.
ORDER BY must be followed by LIMIT N.
The SELECT expression does not support subqueries. To use subqueries, you can rewrite the code to include a JOIN clause.
The JOIN clause does not support Cartesian products. You can replace the JOIN clause with MAPJOIN.
UNION ALL must be replaced with subqueries.
The subquery that is specified in the IN or NOT IN clause can contain only one column and return a maximum of 1,000 rows. Otherwise, use the JOIN clause.
Background information
MaxCompute V2.0 fully embraces open source ecosystems, supports more programming languages and features, and provides higher performance. It also inspects syntax more rigorously. As a result, errors may be returned for some statements that use less rigorous syntax and were successfully executed in the earlier versions.
The MaxCompute team notifies the owners of the jobs whose SQL statements can no longer be executed by email or DingTalk based on the online rollback condition. The job owners must modify the SQL statements for the jobs at the earliest opportunity. Otherwise, the jobs may fail.
group.by.with.star
This statement is equivalent to the select * … group by … statement.
In MaxCompute V2.0, all the columns of a source table must be included in the GROUP BY clause. Otherwise, an error is returned.
In the earlier versions of MaxCompute, select * from … group by key is supported even if not all columns of a source table are included in the GROUP BY clause.
Examples
Error message:
select * from t group by key, value; -- t has columns key and value
Even if the preceding syntax causes no errors in MaxCompute V2.0, we recommend that you use the following syntax:
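The recommended explicit form would look like the following, matching the key and value columns named above (an illustrative sketch, not the original snippet):

```sql
-- List the columns explicitly instead of using *:
select key, value from t group by key, value;
```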
bad.escape
The escape sequence is invalid.
MaxCompute defines that, in a string literal, each ASCII character that ranges from 0 to 127 must be written in the format of a backslash (\) followed by three octal digits. For example, 0 is written as \001, and 1 is written as \002. However, \01 and \0001 are processed as \001.
This method confuses new users. For example, "\0001" cannot be processed as "\000"+"1". For users who migrate data from other systems to MaxCompute, invalid data may be generated.
Error message:
column.repeated.in.creation
If duplicate column names are detected when the CREATE TABLE statement is executed, MaxCompute V2.0 returns an error.
Examples
Error message:
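A hypothetical statement that would trigger this check (the table and column names are placeholders, not the original example):

```sql
-- The column name a appears twice, so MaxCompute V2.0 rejects the statement:
create table t (a BIGINT, b BIGINT, a BIGINT);
```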
string.join.double
You want to join the values of the STRING type with those of the DOUBLE type.
In the early version of MaxCompute, the values of the STRING and DOUBLE types are converted into the BIGINT type. This causes precision loss. For example, 1.1 = "1" in a JOIN condition is considered equal.
In MaxCompute V2.0, the values of the STRING and DOUBLE types are converted into the DOUBLE type because MaxCompute V2.0 is compatible with Hive.
Examples
WARNING:[1,48] implicit conversion from STRING to DOUBLE, potential data loss, use CAST function to suppress
window.ref.prev.window.alias
Window functions reference the aliases of other window functions in the SELECT clause of the same level.
Examples
Error message:
select.invalid.token.after.star
The SELECT clause allows you to use an asterisk (*) to select all the columns of a table. However, the asterisk cannot be followed by aliases even if the asterisk specifies only one column. The new editor returns errors for similar syntax.
Examples
Error message:
agg.having.ref.prev.agg.alias
If HAVING exists, the SELECT clause can reference aggregate function aliases.
Examples
Error message:
s and cnt do not exist in the source table t1. However, the early version of MaxCompute does not return an error because HAVING exists. In MaxCompute V2.0, the error message column cannot be resolve is returned.
order.by.no.limit
In MaxCompute, the ORDER BY clause must be followed by a LIMIT clause to limit the number of data records. ORDER BY is used to sort all data records. If ORDER BY is not followed by a LIMIT clause, the execution performance is low.
Examples
Error message:
FAILED: ODPS-0130071:[4,1] Semantic analysis exception - ORDER BY must be used with a LIMIT clause
In MaxCompute V1.0, view checks are not rigorous. For example, a view is created in a project which does not require a check on the LIMIT clause. odps.sql.validate.orderby.limit=false indicates that the project does not require a check on the LIMIT clause.
MaxCompute V1.0 does not return an error, whereas MaxCompute V2.0 returns the following error:
generated.column.name.multi.window
Automatically generated aliases are used.
In the early version of MaxCompute, an alias is automatically generated for each expression of a SELECT statement. The alias is displayed on the MaxCompute client. However, the early version of MaxCompute does not guarantee that the alias generation rule is correct or remains unchanged. We recommend that you do not use automatically generated aliases.
MaxCompute V2.0 warns you against the use of automatically generated aliases. However, MaxCompute V2.0 does not prohibit the use of automatically generated aliases, to avoid adverse impacts.
In some cases, known changes are made to the alias generation rules in the different versions of MaxCompute. Some online jobs depend on automatically generated aliases. These jobs may fail when MaxCompute is being upgraded or rolled back. If you encounter these issues, modify your queries and explicitly specify the aliases of the columns.
Examples
non.boolean.filter
Non-BOOLEAN filter conditions are used.
MaxCompute prohibits implicit conversions between the BOOLEAN type and other data types. However, the early version of MaxCompute allows the use of BIGINT filter conditions in some cases. MaxCompute V2.0 prohibits the use of BIGINT filter conditions. If your scripts have BIGINT filter conditions, modify them at the earliest opportunity. Examples:
Error message:
post.select.ambiguous
The ORDER BY, CLUSTER BY, DISTRIBUTE BY, and SORT BY clauses reference columns with conflicting names.
In the early version of MaxCompute, the system automatically selects the last column in a SELECT clause as the operation object. However, MaxCompute V2.0 reports an error in this case. Modify your queries at the earliest opportunity. Examples:
Error message:
The change covers the statements that have conflicting column names but have the same syntax. Even though no ambiguity is caused, the system returns an error to warn you against these statements. We recommend that you modify the relevant statements.
duplicated.partition.column
Partitions with the same name are specified in a query.
In the early version of MaxCompute, no error is returned if two partition keys with the same name are specified. The latter partition key overwrites the former one. This causes confusion. MaxCompute V2.0 returns an error in this case. Examples:
Invalid syntax 1:
Invalid syntax 2:
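A hypothetical statement of this shape (the table, column, and partition names are placeholders, not the original examples):

```sql
-- The partition key ds is specified twice, so MaxCompute V2.0 rejects this:
insert overwrite table partition_table partition (ds = '1', ds = '2')
select key, value from src;
```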
order.by.col.ambiguous
The ORDER BY clause references duplicate aliases in a SELECT clause.
select id, id
from table_test
order by id;
in.subquery.without.result
If colx in subquery returns no results and colx does not exist in the source table, MaxCompute V2.0 returns an error.
Error message:
ctas.if.not.exists
The syntax of a destination table is invalid.
If the destination table exists, the early version of MaxCompute does not check the syntax. However, MaxCompute V2.0 checks the syntax. As a result, a large number of errors may be returned. Examples:
Error message:
worker.restart.instance.timeout
In the early version of MaxCompute, each time a UDF generates a record, a write operation is triggered on Apsara Distributed File System, and a heartbeat packet is sent to Job Scheduler. If the UDF does not generate records for 10 minutes, the following error is returned:
The runtime framework of MaxCompute V2.0 supports vectoring to process multiple rows of a column at a time. This makes execution more efficient. If multiple records are processed at a time and no heartbeat packets are sent to Job Scheduler within the specific period, vectoring may cause normal statements to time out. The interval between two output records cannot exceed 10 minutes.
If a timeout error occurs, we recommend that you first check the performance of the UDFs. It may take several seconds to process each record. If the UDFs cannot be optimized, you can manually set batch.rowcount to handle this issue. The default value of batch.rowcount is 1024.
set odps.sql.executionengine.batch.rowcount=16;
divide.nan.or.overflow
The early version of MaxCompute does not support division constant folding.
The following code shows the physical execution plan in the early version of MaxCompute:
explain
select if(false, 0/0, 1.0)
from table_name;
in task M1_Stg1:
Data source: meta_dev.table_name
TS: alias: table_name
SEL: If(False, Divide(UDFToDouble(0), UDFToDouble(0)), 1.0)
FS: output: None
The IF and DIVIDE functions are retained. During execution, the first parameter of IF is set to False, and the expression of DIVIDE is not evaluated. Divide-by-zero errors do not occur.
However, MaxCompute V2.0 supports division constant folding. As a result, an error is returned. Examples:
Error message:
A similar issue occurs in the constant folding for CASE WHEN, such as CASE WHEN TRUE THEN 0 ELSE 0/0. During constant folding in MaxCompute V2.0, all subexpressions are evaluated, which causes divide-by-zero errors.
CASE WHEN may involve more complex optimization scenarios. Example:
The optimizer pushes down the division operation to subqueries. The following code shows a similar conversion:
select c1 from (
select case when 0 = 0 then 0 else 1/0 end c1 from src
UNION ALL
select case when key = 0 then 0 else 1/key end c1 from src) r;
Error message:
An error is returned for the constant folding in the first clause of UNION ALL. We recommend that you move CASE WHEN in the SQL statement to subqueries and remove the useless CASE WHEN statements and /0.
select c1
from (
select 0 c1 from src
union all
select case when key = 0 then 0 else 1/key end c1 from src) r;
small.table.exceeds.mem.limit
The early version of MaxCompute supports multi-way join optimization. Multiple JOIN operations with the same join key are merged for execution in the same Fuxi task, such as J4_1_2_3_Stg1 in this example.
explain
select t1.*
from t1 join t2 on t1.c1 = t2.c1
join t3 on t1.c1 = t3.c1;
The following code shows the physical execution plan in the early version of MaxCompute:
In Job job0:
root Tasks: M1_Stg1, M2_Stg1, M3_Stg1
J4_1_2_3_Stg1 depends on: M1_Stg1, M2_Stg1, M3_Stg1
In Task M1_Stg1:
Data source: meta_dev.t1
In Task M2_Stg1:
Data source: meta_dev.t2
In Task M3_Stg1:
Data source: meta_dev.t3
In Task J4_1_2_3_Stg1:
JOIN: t1 INNER JOIN unknown INNER JOIN unknown
SEL: t1._col0, t1._col1, t1._col2
FS: output: None
If MAPJOIN hints are added, the physical execution plan in the early version of MaxCompute remains unchanged. In the early version of MaxCompute, multi-way join optimization is preferentially used, and user-defined MAPJOIN hints can be ignored.
explain
select /* +mapjoin(t1) */ t1.*
from t1 join t2 on t1.c1 = t2.c1
join t3 on t1.c1 = t3.c1;
The optimizer of MaxCompute V2.0 preferentially uses user-defined MAPJOIN hints. In this example, if t1 is a large table, an error similar to the following one is returned:
FAILED: ODPS-0010000:System internal error - SQL Runtime Internal Error: Hash Join Cursor HashJoin_REL… small table exceeds, memory limit(MB) 640, fixed memory used …, variable memory used …
In this case, if MAPJOIN is not required, we recommend that you remove the MAPJOIN hints.
sigkill.oom
sigkill.oom has the same issue as small.table.exceeds.mem.limit. If you specify MAPJOIN hints and the sizes of small tables are large, multiple JOIN statements may be optimized by using multi-way joins in the early version of MaxCompute. As a result, the statements are successfully executed in the early version of MaxCompute. However, in MaxCompute V2.0, some users may use odps.sql.mapjoin.memory.max to prevent small tables from exceeding the size limit. Each MaxCompute worker has a memory limit. If the sizes of small tables are large, MaxCompute workers may be terminated because the memory limit is exceeded. If this happens, an error similar to the following one is returned:
We recommend that you remove the MAPJOIN hints and use multi-way joins.
wm_concat.first.argument.const
Based on the WM_CONCAT function described in Aggregate functions, the first parameter of WM_CONCAT must be a constant. However, the early version of MaxCompute does not have rigorous check standards. For example, if the source table has no data, no error is returned even if the first parameter of WM_CONCAT is a ColumnReference.
Function declaration:
string wm_concat(string separator, string str)
Parameters:
separator: the delimiter, which is a constant of the STRING type. Delimiters of other types or non-constant delimiters result in exceptions.
MaxCompute V2.0 checks the validity of parameters during the planning stage. If the first parameter of WM_CONCAT is not a constant, an error is returned. Examples:
Error message:
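A hypothetical pair of statements showing the invalid and valid forms (the table and column names are placeholders, not the original example):

```sql
-- Invalid: the separator is a column reference, so MaxCompute V2.0 returns an error.
select wm_concat(sep_col, name) from t;
-- Valid: the separator is a STRING constant.
select wm_concat(',', name) from t;
```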
pt.implicit.convertion.failed
srcpt is a partitioned table that has two partitions.
In the preceding SQL statements, the constants of the INT type that are compared with the pt columns of the STRING type are converted into the DOUBLE type for comparison. Even if odps.sql.udf.strict.mode=true is configured for the project, the early version of MaxCompute does not return an error and it filters out all pt columns. However, in MaxCompute V2.0, an error is returned. Examples:
Error message:
We recommend that you do not compare the values in partition key columns of the STRING type with INT constants. If such a comparison is required, convert the INT constants into the STRING type.
having.use.select.alias
SQL specifications define that the GROUP BY and HAVING clauses precede a SELECT clause. Therefore, the column alias generated by the SELECT clause cannot be used in the HAVING clause.
Examples
Error message:
id2 is the column alias generated by the SELECT clause and cannot be used in the HAVING clause.
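A hypothetical query of this shape (id2 matches the alias mentioned above; the table name is a placeholder, not the original example):

```sql
-- Invalid: the HAVING clause references the SELECT alias id2.
select id, count(*) as id2 from table_name group by id having id2 > 0;
-- Rewrite with the underlying expression instead of the alias:
select id, count(*) as id2 from table_name group by id having count(*) > 0;
```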
dynamic.pt.to.static
In MaxCompute V2.0, dynamic partitions may be converted into static partitions by the optimizer.
Examples
insert overwrite table srcpt partition(pt) select id, 'pt1' from table_name;
If the specified partition value is invalid, such as '${bizdate}', MaxCompute V2.0 returns an error during syntax checks. For more information, see Partition.
insert overwrite table srcpt partition(pt) select id, '${bizdate}' from table_name limit 0;
Error message:
FAILED: ODPS-0130071:[1,24] Semantic analysis exception - wrong columns count 2 in data source, requires 3 columns (includes dynamic partitions if any)
In the early version of MaxCompute, no results are returned by the SQL statements due to LIMIT 0, and no dynamic partitions are created. As a result, no error is returned.
lot.not.in.subquery
Processing of NULL values in the IN subquery.
In a standard SQL IN operation, if the value list contains a NULL value, the return value may be NULL or true, but cannot be false. For example, 1 in (null, 1, 2, 3) returns true, 1 in (null, 2, 3) returns NULL, and null in (null, 1, 2, 3) returns NULL. Likewise, for the NOT IN operation, if the value list contains a NULL value, the return value may be false or NULL, but cannot be true.
MaxCompute V2.0 processes NULL values by using standard execution rules. If you receive a notification for this issue, check whether the subqueries in the IN operation have a NULL value and whether the related execution meets your expectations. If the related execution does not meet your expectations, modify the queries.
Examples
If the accepted column does not contain NULL values, ignore this issue. If the accepted column contains NULL values, c not in (select accepted from c_list) returns true in the early version of MaxCompute and NULL in MaxCompute V2.0.
select * from t where c not in (select accepted from c_list where accepted is not null)
Note: This topic provides examples based on Alibaba Cloud MaxCompute SDK for Java.
Overview
You can use the following methods to export the execution results of SQL statements:
If the amount of data is small, use SQLTask to obtain all query results.
If you want to export the query results of a table or a partition, use Tunnel.
If the SQL statements are complex, use Tunnel and SQLTask to export the query results.
Use DataWorks to execute SQL statements, synchronize data, perform timed scheduling, and configure task dependencies.
Use the open source tool DataX to export data from MaxCompute to specified destination data sources.
SQLTask.getResult(i) is used to export the results of SELECT statements. You cannot use it to export the execution results of other MaxCompute SQL statements such as SHOW TABLES.
You can use READ_TABLE_MAX_ROW to specify the maximum number of data records that the SELECT statement returns to a client. For more information, see Project operations.
The SELECT statement returns a maximum of 10,000 data records to a client. You can execute the SELECT statement on a client such as SQLTask. This is equivalent to appending a LIMIT N clause to the SELECT statement.
This rule does not apply if you execute the CREATE TABLE XX AS SELECT or INSERT INTO/OVERWRITE TABLE statement to solidify the results into a specified table.
The following example shows how to run a Tunnel command to export data. If the Tunnel command cannot be used to export data, you can compile the Tunnel SDK to export data. For more information, see MaxCompute Tunnel overview.
The following sample code provides an example to show how to use SQLTask and Tunnel to export data:
/*
 * Initialize the connection information of MaxCompute.
 */
private static Odps getOdps() {
Account account = new AliyunAccount(accessId, accessKey);
Odps odps = new Odps(account);
odps.setEndpoint(endPoint);
odps.setDefaultProject(project);
return odps;
}
Background information
A MaxCompute partitioned table is a table with partitions. You can specify one or more columns as the partition key to create a partitioned table. If you have specified the name of a partition that you want to access, MaxCompute reads data only from that partition and does not scan the entire table. This reduces costs and improves efficiency.
Partition pruning allows you to specify filter conditions for partition key columns. This way, MaxCompute reads data only from the partitions that meet the filter conditions that you have specified in SQL statements. This avoids the errors and waste of resources that are caused by full table scans. However, partition pruning may not take effect sometimes.
For a query where partition pruning does not take effect:
explain
select seller_id
from xxxxx_trd_slr_ord_1d
where ds=rand();
The execution plan indicates that all the 1,344 partitions of Table xxxxx_trd_slr_ord_1d are read.
explain
select seller_id
from xxxxx_trd_slr_ord_1d
where ds='20150801';
The execution plan indicates that only Partition 20150801 of Table xxxxx_trd_slr_ord_1d is read.
If you use user-defined functions (UDFs) or specific built-in functions to specify partitions, partition pruning may not take effect. In this case, we recommend that you execute the EXPLAIN statement to check whether partition pruning is effective.
explain
select ...
from xxxxx_base2_brd_ind_cw
where ds = concat(SPLIT_PART(bi_week_dim(' ${bdp.system.bizdate}'), ',', 1), SPLIT_PART(bi_week_dim(' ${bdp.system.bizdate}'), ',', 2))
Note: For more information about UDF-based partition pruning, see the "WHERE" section in WHERE clause (where_condition).
If partition pruning conditions are specified in the WHERE clause, partition pruning is effective.
If partition pruning conditions are specified in the ON clause, partition pruning is effective for the secondary table, but not the primary table.
The following examples describe how partition pruning works when three different types of JOIN operations are performed:
For a query where partition pruning conditions are specified in the ON clause:
set odps.sql.allow.fullscan=true;
explain
select a.seller_id
,a.pay_ord_pbt_1d_001
from xxxxx_trd_slr_ord_1d a
left outer join
xxxxx_seller b
on a.seller_id=b.user_id
and a.ds='20150801'
and b.ds='20150801';
The execution plan indicates that partition pruning is effective for the right table, but not the left table.
For a query where partition pruning conditions are specified in the WHERE clause:
set odps.sql.allow.fullscan=true;
explain
select a.seller_id
,a.pay_ord_pbt_1d_001
from xxxxx_trd_slr_ord_1d a
left outer join
xxxxx_seller b
on a.seller_id=b.user_id
where a.ds='20150801'
and b.ds='20150801';
The execution plan indicates that partition pruning is effective for both tables.
A RIGHT OUTER JOIN operation is similar to a LEFT OUTER JOIN operation. If partition pruning conditions are specified in the ON clause, partition pruning is effective only for the left table, but not the right table. If partition pruning conditions are specified in the WHERE clause, partition pruning is effective for both tables.
Partition pruning is effective only when partition pruning conditions are specified in the WHERE clause, but not the ON clause.
This issue can hardly be discovered. We recommend that you check whether partition pruning is effective before you commit the code.
To use UDFs for partition pruning, you must modify the classes of the UDFs or add set odps.sql.udf.ppr.deterministic = true; before the SQL statements to execute. For more information, see WHERE clause (where_condition).
Sample data
Implementation
You can use one of the following methods to query the first N data records of each group:
Query the row ID of each record and use the WHERE clause to filter the records.
SELECT * FROM (
SELECT empno
, ename
, sal
, job
, ROW_NUMBER() OVER (PARTITION BY job ORDER BY sal) AS rn
FROM emp
) tmp
WHERE rn < 10;
For more information, see the last example in MaxCompute learning plan. This method can be used to determine the sequence number of a data record. If the sequence number is greater than the specified number, such as 10, the data records that remain are no longer processed. This improves computing efficiency.
Sample data
class gender name
1 M LiLei
1 F HanMM
1 M Jim
1 F HanMM
2 F Kate
2 M Peter
Examples
Example 1: Execute the following statement to merge the rows whose values in the class column are the same into one row based on the values in the name column and deduplicate the values in the name column. You can implement the deduplication by using nested subqueries.
Note: The wm_concat function is used to aggregate data. For more information, see Aggregate functions.
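A hedged sketch of such a statement, assuming the sample data lives in a students(class, gender, name) table (the table name matches Example 2 below):

```sql
-- Deduplicate names per class in a nested subquery, then aggregate them
-- into one comma-separated string per class with wm_concat:
select class, wm_concat(',', name) as names
from (
    select distinct class, name from students
) t
group by class;
```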
class names
1 LiLei,HanMM,Jim
2 Kate,Peter
Example 2: Execute the following statement to collect statistics on the numbers of males and females based on the values in the class column:
SELECT
class
,SUM(CASE WHEN gender = 'M' THEN 1 ELSE 0 END) AS cnt_m
,SUM(CASE WHEN gender = 'F' THEN 1 ELSE 0 END) AS cnt_f
FROM students
GROUP BY class;
class cnt_m cnt_f
1 2 2
2 1 1
Background information
The following figure shows the effect of transposing rows to columns or columns to rows.
Rows to columns
Transpose multiple rows to one row, or transpose one column to multiple columns.
Columns to rows
Transpose one row to multiple rows, or transpose multiple columns to one column.
Sample data
Sample source data is provided for you to better understand the examples of transposing rows to columns or columns to rows.
Create a source table and insert data into the source table. The table is used to transpose rows to columns. Sample statements:
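A hedged sketch of such statements, assuming a table named rowtocolumn whose shape mirrors the columntorow table created below:

```sql
-- One row per (name, subject) pair; this is the long form to be pivoted.
create table rowtocolumn (name string, subject string, result bigint);
insert into table rowtocolumn values
('Bob', 'chinese', 74),
('Bob', 'mathematics', 83),
('Bob', 'physics', 93),
('Alice', 'chinese', 74),
('Alice', 'mathematics', 84),
('Alice', 'physics', 94);
```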
Create a source table and insert data into the source table. The table is used to transpose columns to rows. Sample statements:
create table columntorow (name string, chinese bigint, mathematics bigint, physics bigint
);
insert into table columntorow values
('Bob' , 74, 83, 93),
('Alice' , 74, 84, 94);
Method 1: Use the CASE WHEN expression to extract the values of each subject as separate columns. Sample statement:
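The sample statement is missing from this copy. A minimal sketch, assuming a source table named rowtocolumn with columns name, subject, and result (the table name is an assumption):

```sql
-- Pick each subject's value with CASE WHEN, then collapse the
-- per-subject rows into one row per name with GROUP BY.
select name,
       max(case when subject = 'chinese' then result end) as chinese,
       max(case when subject = 'mathematics' then result end) as mathematics,
       max(case when subject = 'physics' then result end) as physics
from rowtocolumn
group by name;
```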
+--------+------------+-------------+------------+
| name   | chinese    | mathematics | physics    |
+--------+------------+-------------+------------+
| Bob    | 74         | 83          | 93         |
| Alice  | 74         | 84          | 94         |
+--------+------------+-------------+------------+
Method 2: Use built-in functions to transpose rows to columns. Merge the values of the subject and result columns into one column by using the CONCAT and WM_CONCAT functions. Then, parse the values of the subject column as separate columns by using the KEYVALUE function. Sample statement:
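The sample statement is missing from this copy. A sketch of the CONCAT/WM_CONCAT/KEYVALUE approach, again assuming a source table named rowtocolumn with columns name, subject, and result:

```sql
-- KEYVALUE(src, pair_delimiter, kv_delimiter, key) pulls a named
-- value back out of the merged string.
select name,
       keyvalue(kv, ';', ':', 'chinese') as chinese,
       keyvalue(kv, ';', ':', 'mathematics') as mathematics,
       keyvalue(kv, ';', ':', 'physics') as physics
from (
    -- Merge each subject:result pair into one delimited string per name.
    select name, wm_concat(';', concat(subject, ':', result)) as kv
    from rowtocolumn
    group by name
) t;
```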
+--------+------------+-------------+------------+
| name   | chinese    | mathematics | physics    |
+--------+------------+-------------+------------+
| Bob    | 74         | 83          | 93         |
| Alice  | 74         | 84          | 94         |
+--------+------------+-------------+------------+
Method 1: Use the UNION ALL clause to combine the values in the chinese, mathematics, and physics columns into one column. Sample statements:
-- Remove the limit on the simultaneous execution of the ORDER BY and LIMIT clauses. This way, you can use ORDER BY to sort the results by name.
set odps.sql.validate.orderby.limit=false;
-- Transpose columns to rows.
select name as name, subject as subject, result as result
from(
select name, 'chinese' as subject, chinese as result from columntorow
union all
select name, 'mathematics' as subject, mathematics as result from columntorow
union all
select name, 'physics' as subject, physics as result from columntorow)
order by name;
+--------+-------------+------------+
| name   | subject     | result     |
+--------+-------------+------------+
| Bob    | chinese     | 74         |
| Bob    | mathematics | 83         |
| Bob    | physics     | 93         |
| Alice  | chinese     | 74         |
| Alice  | mathematics | 84         |
| Alice  | physics     | 94         |
+--------+-------------+------------+
Method 2: Use built-in functions to transpose columns to rows. Concatenate the column name of each subject and the values in each column by using the CONCAT function. Then, split the concatenated values into the subject and result columns as separate columns by using the TRANS_ARRAY and SPLIT_PART functions. Sample statement:
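The sample statement is missing from this copy. A sketch of the CONCAT/TRANS_ARRAY/SPLIT_PART approach against the columntorow table created above:

```sql
select name,
       split_part(kv, ':', 1) as subject,
       split_part(kv, ':', 2) as result
from (
    -- TRANS_ARRAY(1, ';', ...) keeps name as the key column and expands
    -- the semicolon-delimited string into one row per subject.
    select trans_array(1, ';', name, kv) as (name, kv)
    from (
        select name,
               concat('chinese:', chinese,
                      ';mathematics:', mathematics,
                      ';physics:', physics) as kv
        from columntorow
    ) t1
) t2;
```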
+--------+-------------+------------+
| name   | subject     | result     |
+--------+-------------+------------+
| Bob    | chinese     | 74         |
| Bob    | mathematics | 83         |
| Bob    | physics     | 93         |
| Alice  | chinese     | 74         |
| Alice  | mathematics | 84         |
| Alice  | physics     | 94         |
+--------+-------------+------------+
Overview
The following table describes the JOIN operations that MaxCompute SQL supports.
Operation Description
FULL JOIN Returns all the rows in both the left table and the right table whether the join condition is met or not. In the result set, NULL values are returned in the columns from the table that lacks a matching row in the other table.
The ON clause and the WHERE clause can be used in the same SQL statement. For example, consider the following SQL statement:
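The statement itself was lost in extraction; based on the placeholders used throughout this topic, it presumably has the following shape (a sketch, not the original statement):

```sql
SELECT A.*, B.*
FROM (SELECT * FROM A WHERE {subquery_where_condition}) A
JOIN (SELECT * FROM B WHERE {subquery_where_condition}) B
ON {on_condition}
WHERE {where_condition};
```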
Therefore, a JOIN operation may return different results, depending on whether the filter conditions are specified in {subquery_where_condition}, {on_condition}, or {where_condition}. For more information, see Case-by-case study.
Test tables
Table A
Execute the following statement to create Table A:
CREATE TABLE A AS SELECT * FROM VALUES (1, 20180101), (2, 20180101), (2, 20180102) t (key, ds);
Table A has the following three rows and is used as the left table for all JOIN operations in this topic.
key ds
1 20180101
2 20180101
2 20180102
Table B
Execute the following statement to create Table B:
CREATE TABLE B AS SELECT * FROM VALUES (1, 20180101), (3, 20180101), (2, 20180102) t (key, ds);
Table B has the following three rows and is used as the right table for all JOIN operations in this topic.
key ds
1 20180101
3 20180101
2 20180102
The following table lists the Cartesian product of Table A and Table B.
a.key a.ds b.key b.ds
1 20180101 1 20180101
1 20180101 3 20180101
1 20180101 2 20180102
2 20180101 1 20180101
2 20180101 3 20180101
2 20180101 2 20180102
2 20180102 1 20180101
2 20180102 3 20180101
2 20180102 2 20180102
Case-by-case study
INNER JOIN
An INNER JOIN operation first takes the Cartesian product of the rows in Table A and Table B and returns the rows that have matching column values in Table A and Table B based on the ON clause.
Conclusion: An INNER JOIN operation returns the same results regardless of whether the filter conditions are specified in {subquery_where_condition}, {on_condition}, or {where_condition}.
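The case statements for this section were lost in extraction. Mirroring the LEFT SEMI JOIN examples later in this topic, the three cases presumably take the following form (a sketch, not the original statements):

```sql
-- Case 1: filter conditions in the subqueries
SELECT A.*, B.*
FROM (SELECT * FROM A WHERE ds='20180101') A
JOIN (SELECT * FROM B WHERE ds='20180101') B
ON a.key = b.key;

-- Case 2: filter conditions in the ON clause
SELECT A.*, B.*
FROM A JOIN B
ON a.key = b.key AND A.ds='20180101' AND B.ds='20180101';

-- Case 3: filter conditions in the WHERE clause after the ON clause
SELECT A.*, B.*
FROM A JOIN B
ON a.key = b.key
WHERE A.ds='20180101' AND B.ds='20180101';
```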
1 20180101 1 20180101
The Cartesian product of Table A and Table B contains nine rows, of which only one meets the join condition. The following table lists the results that the preceding statement returns.
1 20180101 1 20180101
Case 3: Specify the filter conditions in the WHERE clause after the ON clause, as shown in the following statement:
The Cartesian product of Table A and Table B contains nine rows, of which only three meet the join condition. The following table lists the result set.
1 20180101 1 20180101
2 20180102 2 20180102
2 20180101 2 20180102
The query processor then filters the preceding result set based on the A.ds='20180101' and B.ds='20180101' filter condition. The following table lists the results that the preceding statement returns.
1 20180101 1 20180101
LEFT JOIN
A LEFT JOIN operation first takes the Cartesian product of the rows in Table A and Table B and returns all the rows of Table A and the rows in Table B that meet the join condition. If the join condition finds no matching rows in Table B for a row in Table A, the row in Table A is returned in the result set with NULL values in each column from Table B.
Conclusion: A LEFT JOIN operation may return different results, depending on whether the filter conditions are specified in {subquery_where_condition}, {on_condition}, or {where_condition}:
The operation returns the same results regardless of whether the filter condition for Table A is specified in {subquery_where_condition} or {where_condition}.
The operation returns the same results regardless of whether the filter condition for Table B is specified in {subquery_where_condition} or {on_condition}.
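The case statements for this section were lost in extraction. Mirroring the LEFT SEMI JOIN examples later in this topic, they presumably take the following form (a sketch, not the original statements):

```sql
-- Case 1: filter conditions in the subqueries
SELECT A.*, B.*
FROM (SELECT * FROM A WHERE ds='20180101') A
LEFT JOIN (SELECT * FROM B WHERE ds='20180101') B
ON a.key = b.key;

-- Case 2: filter conditions in the ON clause
SELECT A.*, B.*
FROM A LEFT JOIN B
ON a.key = b.key AND A.ds='20180101' AND B.ds='20180101';

-- Case 3: filter conditions in the WHERE clause after the ON clause
SELECT A.*, B.*
FROM A LEFT JOIN B
ON a.key = b.key
WHERE A.ds='20180101' AND B.ds='20180101';
```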
1 20180101 1 20180101
The Cartesian product of Table A and Table B contains nine rows, of which only one meets the join condition. The other two rows in Table A do not have matching rows in Table B. Therefore, NULL values are returned in the columns from Table B for those two rows in Table A. The following table lists the results that the preceding statement returns.
1 20180101 1 20180101
Case 3: Specify the filter conditions in the WHERE clause after the ON clause, as shown in the following statement:
The Cartesian product of Table A and Table B contains nine rows, of which only three meet the join condition. The following table lists the result set.
1 20180101 1 20180101
2 20180101 2 20180102
2 20180102 2 20180102
The query processor then filters the preceding result set based on the A.ds='20180101' and B.ds='20180101' filter condition. The following table lists the results that the preceding statement returns.
1 20180101 1 20180101
RIGHT JOIN
A RIGHT JOIN operation is similar to a LEFT JOIN operation, except that the two tables are used in a reversed manner. A RIGHT JOIN operation returns all the rows of Table B and the rows in Table A that meet the join condition.
Conclusion: A RIGHT JOIN operation may return different results, depending on whether the filter conditions are specified in {subquery_where_condition}, {on_condition}, or {where_condition}:
The operation returns the same results regardless of whether the filter condition for Table B is specified in {subquery_where_condition} or {where_condition}.
The operation returns the same results regardless of whether the filter condition for Table A is specified in {subquery_where_condition} or {on_condition}.
FULL JOIN
A FULL JOIN operation takes the Cartesian product of the rows in Table A and Table B and returns all the rows in Table A and Table B, whether the join condition is met or not. In the result set, NULL values are returned in the columns from the table that lacks a matching row in the other table.
Conclusion: A FULL JOIN operation may return different results, depending on whether the filter conditions are specified in {subquery_where_condition}, {on_condition}, or {where_condition}.
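The case statements for this section were lost in extraction. Mirroring the LEFT SEMI JOIN examples later in this topic, they presumably take the following form (a sketch, not the original statements):

```sql
-- Case 1: filter conditions in the subqueries
SELECT A.*, B.*
FROM (SELECT * FROM A WHERE ds='20180101') A
FULL OUTER JOIN (SELECT * FROM B WHERE ds='20180101') B
ON a.key = b.key;

-- Case 2: filter conditions in the ON clause
SELECT A.*, B.*
FROM A FULL OUTER JOIN B
ON a.key = b.key AND A.ds='20180101' AND B.ds='20180101';

-- Case 3: filter conditions in the WHERE clause after the ON clause
SELECT A.*, B.*
FROM A FULL OUTER JOIN B
ON a.key = b.key
WHERE A.ds='20180101' AND B.ds='20180101';
```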
1 20180101 1 20180101
The Cartesian product of Table A and Table B contains nine rows, of which only one meets the join condition. In the result set, for the two rows in Table A that match no rows in Table B, NULL values are returned in the columns from Table B. For the two rows in Table B that match no rows in Table A, NULL values are returned in the columns from Table A. The following table lists the results that the preceding statement returns.
1 20180101 1 20180101
Case 3: Specify the filter conditions in the WHERE clause after the ON clause, as shown in the following statement:
The Cartesian product of Table A and Table B contains nine rows, of which only three meet the join condition.
1 20180101 1 20180101
2 20180101 2 20180102
2 20180102 2 20180102
The row in Table B that has no matching rows in Table A is returned in the result set, with NULL values in the columns from Table A for that row. The following table lists the result set.
1 20180101 1 20180101
2 20180101 2 20180102
2 20180102 2 20180102
The query processor then filters the preceding result set based on the A.ds='20180101' and B.ds='20180101' filter condition. The following table lists the results that the preceding statement returns.
1 20180101 1 20180101
LEFT SEMI JOIN
A LEFT SEMI JOIN operation returns only the rows in Table A that have a matching row in Table B. A LEFT SEMI JOIN operation does not return rows from Table B. Therefore, you cannot specify a filter condition for Table B in the WHERE clause after the ON clause.
Conclusion: A LEFT SEMI JOIN operation returns the same results regardless of whether the filter conditions are specified in {subquery_where_condition}, {on_condition}, or {where_condition}.
SELECT A.*
FROM
(SELECT * FROM A WHERE ds='20180101') A
LEFT SEMI JOIN
(SELECT * FROM B WHERE ds='20180101') B
ON a.key = b.key;
a.key a.ds
1 20180101
SELECT A.*
FROM A LEFT SEMI JOIN B
ON a.key = b.key and A.ds='20180101' and B.ds='20180101';
a.key a.ds
1 20180101
Case 3: Specify the filter conditions in the WHERE clause after the ON clause, as shown in the following statement:
SELECT A.*
FROM A LEFT SEMI JOIN
(SELECT * FROM B WHERE ds='20180101') B
ON a.key = b.key
WHERE A.ds='20180101';
a.key a.ds
1 20180101
The query processor then filters the preceding result set based on the A.ds='20180101' filter condition. The following table lists the results that the preceding statement returns.
a.key a.ds
1 20180101
LEFT ANTI JOIN
A LEFT ANTI JOIN operation returns only the rows in Table A that have no matching rows in Table B. A LEFT ANTI JOIN operation does not return rows from Table B. Therefore, you cannot specify a filter condition for Table B in the WHERE clause after the ON clause. A LEFT ANTI JOIN operation is usually used to replace the NOT EXISTS syntax.
Conclusion: A LEFT ANTI JOIN operation may return different results, depending on whether the filter conditions are specified in {subquery_where_condition}, {on_condition}, or {where_condition}:
The operation returns the same results regardless of whether the filter condition for Table A is specified in {subquery_where_condition} or {where_condition}.
The operation returns the same results regardless of whether the filter condition for Table B is specified in {subquery_where_condition} or {on_condition}.
SELECT A.*
FROM
(SELECT * FROM A WHERE ds='20180101') A
LEFT ANTI JOIN
(SELECT * FROM B WHERE ds='20180101') B
ON a.key = b.key;
a.key a.ds
2 20180101
SELECT A.*
FROM A LEFT ANTI JOIN B
ON a.key = b.key and A.ds='20180101' and B.ds='20180101';
a.key a.ds
2 20180101
2 20180102
Case 3: Specify the filter conditions in the WHERE clause after the ON clause, as shown in the following statement:
SELECT A.*
FROM A LEFT ANTI JOIN
(SELECT * FROM B WHERE ds='20180101') B
ON a.key = b.key
WHERE A.ds='20180101';
a.key a.ds
2 20180101
2 20180102
The query processor then filters the preceding result set based on the A.ds='20180101' filter condition. The following table lists the results that the preceding statement returns.
a.key a.ds
2 20180101
Usage notes
For an INNER JOIN operation or a LEFT SEMI JOIN operation, an SQL statement returns the same results, regardless of where you specify the filter conditions for the left table and the right table.
For a LEFT JOIN operation or a LEFT ANTI JOIN operation, the filter condition for the left table functions the same whether it is specified in {subquery_where_condition} or {where_condition}. The filter condition for the right table functions the same whether it is specified in {subquery_where_condition} or {on_condition}.
For a RIGHT JOIN operation, the filter condition for the left table functions the same whether it is specified in {subquery_where_condition} or {on_condition}. The filter condition for the right table functions the same whether it is specified in {subquery_where_condition} or {where_condition}.
For a FULL OUTER JOIN operation, filter conditions can be specified only in {subquery_where_condition}.
2. Data migration
2.1. Overview
This topic describes the best practices for data migration, including migrating business data or log data from other business platforms to MaxCompute and migrating data from MaxCompute to other business platforms.
Background information
Traditional relational databases are not suitable for processing large amounts of data. If you have a large amount of data stored in a traditional relational database, you can migrate the data to MaxCompute.
MaxCompute provides a comprehensive set of data migration solutions and a variety of classic distributed computing models, allowing you to store large amounts of data and compute data fast. By using MaxCompute, you can efficiently save costs for your enterprise.
DataWorks provides comprehensive features for MaxCompute, such as data integration, data analytics, data management, and data administration. Among these features, data integration enables stable, efficient, and scalable data synchronization.
Best practices
Migrate business data from other business platforms to MaxCompute:
Migrate data across DataWorks workspaces. For more information, see Migrate data across DataWorks workspaces.
Migrate data from Hadoop to MaxCompute. For more information, see Best practices of migrating data from Hadoop to MaxCompute. For more information about the issues that you may encounter during data and script migration and the solutions, see Practices of migrating data from a user-created Hadoop cluster to MaxCompute.
Migrate data from Oracle to MaxCompute. For more information, see Migrate data from Oracle to MaxCompute.
Migrate data from a Kafka cluster to MaxCompute. For more information, see Migrate data from a Kafka cluster to MaxCompute.
Migrate data from an Elasticsearch cluster to MaxCompute. For more information, see Migrate data from an Elasticsearch cluster to MaxCompute.
Migrate data from RDS to MaxCompute. For more information, see Migrate data from RDS to MaxCompute to implement dynamic partitioning.
Migrate JSON data from Object Storage Service (OSS) to MaxCompute. For more information, see Migrate JSON data from OSS to MaxCompute.
Migrate JSON data from MongoDB to MaxCompute. For more information, see Migrate JSON data from MongoDB to MaxCompute.
Migrate data from a user-created MySQL database on an Elastic Compute Service (ECS) instance to MaxCompute. For more information, see Migrate data from a user-created MySQL database on an ECS instance to MaxCompute.
Use Tunnel to migrate log data to MaxCompute. For more information, see Use Tunnel to upload log data to MaxCompute.
Use DataHub to migrate log data to MaxCompute. For more information, see Use DataHub to migrate log data to MaxCompute.
Use DataWorks to migrate log data to MaxCompute. For more information, see Use DataWorks Data Integration to migrate log data to MaxCompute.
After the business data and log data are processed by MaxCompute, you can use Quick BI to present the data processing results in a visualized manner. For more information, see Best practices of using MaxCompute to process data and Quick BI to present the data processing results.
Prerequisites
All the steps in the tutorial Build an online operation analysis platform are completed. For more information, see Business scenarios and development process.
Context
This topic uses the bigdata_DOC workspace created in the tutorial Build an online operation analysis platform as the source workspace. You need to create a destination workspace to store the tables, resources, configurations, and data synchronized from the source workspace.
Procedure
1. Create a destination workspace.
i. Log on to the DataWorks console. In the left-side navigation pane, click Workspaces.
ii. On the Workspaces page that appears, select the China (Hangzhou) region in the upper-left corner and click Create Workspace.
iii. In the Create Workspace pane that appears, set parameters in the Basic Settings step and click Next.
The source workspace bigdata_DOC is in the basic mode. For convenience, set Mode to Basic Mode (Production Environment Only) in the Basic Settings step when you create the destination workspace.
Set Workspace Name to a globally unique name. We recommend that you use a name that is easy to distinguish. In this example, set Workspace Name to clone_test_doc.
iv. In the Select Engines and Services step, select the MaxCompute check box and Pay-As-You-Go in the Compute Engines section and click Next.
v. In the Engine Details step, set the required parameters and click Create Workspace.
Note
The cross-workspace cloning feature cannot clone table schemas or data.
The cross-workspace cloning feature cannot clone combined nodes. If the destination workspace needs to use the combined nodes that exist in the source workspace, you need to manually create the combined nodes in the destination workspace.
ii. Set Target Workspace to clone_test_doc and Workflow to the Workshop workflow that needs to be cloned. Select all the nodes in the workflow and click Add to List. Click To-Be-Cloned Node List in the upper-right corner.
iii. In the Nodes to Clone pane that appears, click Clone All. The selected nodes are cloned to the clone_test_doc workspace.
iv. Go to the destination workspace and check whether the nodes are cloned.
3. Create tables.
The cross-workspace cloning feature cannot clone table schemas. Therefore, you need to manually create the required tables in the destination workspace.
For non-partitioned tables, we recommend that you use the following SQL statement to synchronize the table schema from the source workspace:
For partitioned tables, we recommend that you use the following SQL statement to synchronize the table schema from the source workspace:
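The statements themselves are not present in this copy. A common pattern, shown here as a sketch with the table and project names from this tutorial assumed:

```sql
-- Non-partitioned tables: CREATE TABLE ... AS copies the schema (and data).
CREATE TABLE IF NOT EXISTS rpt_user_trace_log AS
SELECT * FROM bigdata_DOC.rpt_user_trace_log;

-- Partitioned tables: CREATE TABLE ... LIKE copies only the schema;
-- the data is synchronized separately in the next step.
CREATE TABLE IF NOT EXISTS rpt_user_trace_log
LIKE bigdata_DOC.rpt_user_trace_log;
```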
Commit the tables to the production environment. For more information, see Create tables.
4. Synchronize data.
The cross-workspace cloning feature cannot clone data from the source workspace to the destination workspace. You need to manually synchronize the required data to the destination workspace. To synchronize the data of the rpt_user_trace_log table from the source workspace to the destination workspace, follow these steps:
i. Create a connection.
a. Go to the Data Integration page and click Connection in the left-side navigation pane.
b. On the Data Source page that appears, click Add a Connection in the upper-right corner. In the Add Connection dialog box that appears, select MaxCompute(ODPS) in the Big Data Storage section.
c. In the Add MaxCompute(ODPS) Connection dialog box that appears, set Connection Name, MaxCompute Project Name, AccessKey ID, and AccessKey Secret, and click Complete. For more information, see Add a MaxCompute data source.
ii. Create a batch sync node.
a. Go to the DataStudio page, click the Data Analytics tab, and then click Workshop under Business Flow. Right-click Data Integration and choose Create > Batch Synchronization to create a batch sync node.
b. On the configuration tab of the batch sync node, set the required parameters. In this example, set Connection under Source to bigdata_DOC and Connection under Target to odps_first. Set Table to rpt_user_trace_log. After the configuration is completed, click the Properties tab in the right-side navigation pane.
c. Click Use Root Node in the Dependencies section and commit the batch sync node.
iii. Generate retroactive data for the batch sync node.
a. On the DataStudio page, click the DataWorks icon in the upper-left corner and choose All Products > Operation Center.
b. On the page that appears, choose Cycle Task Maintenance > Cycle Task in the left-side navigation pane.
c. On the page that appears, find the batch sync node you created in the node list and click the node name. On the canvas that appears on the right, right-click the batch sync node and choose Run > Current Node Retroactively.
d. In the Patch Data dialog box that appears, set the required parameters. In this example, set Data Timestamp to Jun 11, 2019 - Jun 17, 2019 to synchronize data from multiple partitions. Click OK.
e. On the Patch Data page that appears, check the running status of the retroactive instances that are generated. If Successful appears in the STATUS column of a retroactive instance, the instance is run and the corresponding data is synchronized.
iv. Verify the data synchronization.
On the Data Analytics tab of the DataStudio page, right-click the Workshop workflow under Business Flow and choose Create > MaxCompute > ODPS SQL to create an ODPS SQL node. On the configuration tab of the ODPS SQL node, run the following SQL statement to check whether the data is synchronized to the destination workspace:
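The verification statement is not preserved in this copy. A minimal sketch; the partition column name dt and the date range are assumptions matching the retroactive run above:

```sql
-- Count the synchronized rows per partition for the retroactive date range.
SELECT dt, COUNT(*) AS cnt
FROM rpt_user_trace_log
WHERE dt BETWEEN '20190611' AND '20190617'
GROUP BY dt;
```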
Prerequisites
MaxCompute is activated. A MaxCompute project is created.
In this example, a project named bigdata_DOC in the China (Hangzhou) region is used. For more information, see Activate MaxCompute and DataWorks.
The EMR Hadoop cluster is a non-high availability (HA) cluster that is deployed on the classic network in the China (Hangzhou) region. A public IP address and a private IP address are configured for the Elastic Compute Service (ECS) instance in the master node group of the EMR Hadoop cluster.
iii. Click Run in the upper-right corner of the code editor on the Data Platform tab. If the Query executed successfully message appears, the table hive_doc_good_sale is created in the EMR Hadoop cluster.
Create a table
iv. Insert test data into the table. You can import test data from Object Storage Service (OSS) or other data sources to the table. You can also manually insert test data into the table. In this example, the following statement is used to manually insert test data into the table:
insert into hive_doc_good_sale PARTITION(pt=1) values
('2018-08-21','Coat','Brand A','lilei',3,500.6,7),
('2018-08-22','Fresh food','Brand B','lilei',1,303,8),
('2018-08-22','Coat','Brand C','hanmeimei',2,510,2),
('2018-08-22','Bathroom product','Brand A','hanmeimei',1,442.5,1),
('2018-08-22','Fresh food','Brand D','hanmeimei',2,234,3),
('2018-08-23','Coat','Brand B','jimmy',9,2000,7),
('2018-08-23','Fresh food','Brand A','jimmy',5,45.1,5),
('2018-08-23','Coat','Brand E','jimmy',5,100.2,4),
('2018-08-24','Fresh food','Brand G','peiqi',10,5560,7),
('2018-08-24','Bathroom product','Brand F','peiqi',1,445.6,2),
('2018-08-24','Coat','Brand A','ray',3,777,3),
('2018-08-24','Bathroom product','Brand G','ray',3,122,3),
('2018-08-24','Coat','Brand C','ray',1,62,7);
v. After you insert the data into the table, execute the select * from hive_doc_good_sale where pt=1; statement to check whether the data exists in the table that you created in the EMR Hadoop cluster.
vii. In the DDL Statement dialog box, enter the following table creation statement and click Generate Table Schema. In the Confirm message, click OK. In this example, the following table creation statement is used to create a MaxCompute table named hive_doc_good_sale:
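The DDL itself is missing from this copy. The following is a sketch reconstructed from the column lists and types used by the synchronization scripts later in this topic; the partition column type is an assumption:

```sql
CREATE TABLE IF NOT EXISTS hive_doc_good_sale (
    create_time  STRING,
    category     STRING,
    brand        STRING,
    buyer_id     STRING,
    trans_num    BIGINT,
    trans_amount DOUBLE,
    click_cnt    BIGINT
)
PARTITIONED BY (pt STRING);
```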
When you create the table, you must consider the mappings between Hive data types and MaxCompute data types. For more information about the mappings, see Data type mappings.
You can also use the MaxCompute client odpscmd to create a MaxCompute table. For more information about how to install and configure the MaxCompute client, see Install and configure the MaxCompute client.
Note: If you need to resolve compatibility issues between Hive data types and MaxCompute data types, we recommend that you run the following commands on the MaxCompute client:
set odps.sql.type.system.odps2=true;
set odps.sql.hive.compatible=true;
ix. In the left-side navigation pane of the DataStudio page, click Workspace Tables. In the Workspace Tables pane, view the MaxCompute table that you created.
You can also click the ECS instance ID of the master node to go to the Instance Details tab of the ECS instance in the ECS console. In the Basic Information section of the Instance Details tab, click Connect to log on to the ECS instance and run the hadoop dfsadmin -report command to view the information about the data nodes.
Note: In this example, each data node has only a private IP address and cannot communicate with the default resource group of DataWorks. Therefore, you must create a custom resource group to run your DataWorks synchronization node on the master node.
Note: You can perform this step to create a custom resource group only when you use DataWorks Professional Edition or a more advanced edition.
b. When you add a server, enter information such as the UUID of the ECS instance and the server IP address. If the network type is classic network, enter the host name. If the network type is virtual private cloud (VPC), enter the UUID of the ECS instance. You can add scheduling resources whose network type is classic network in DataWorks V2.0 only in the China (Shanghai) region. In other regions, you must add scheduling resources whose network type is VPC, regardless of the network type of your ECS instances.
c. After you add the server, you must make sure that the master node and DataWorks are connected. If you add an ECS instance, you must configure a security group for the instance.
If you use a private IP address, add the private IP address to the security group of the ECS instance. For more information, see Configure a security group for an ECS instance where a self-managed data store resides.
If you use a public IP address, configure the Internet inbound and outbound rules in the security group of the ECS instance. In this example, all ports are specified in the configured inbound rules to allow traffic from the Internet. In actual scenarios, we recommend that you configure specific security group rules for security purposes.
Inbound and outbound rules
d. After you complete the preceding steps, install an agent for the custom resource group as prompted. If the status of the ECS instance is Available, the custom resource group is created.
In this example, the default MaxCompute data source is used. Therefore, you need to add only a Hadoop data source. For more information about how to add a Hadoop data source, see Add an HDFS data source.
i. On the Data Integration page of the DataWorks console, click Data Source in the left-side navigation pane.
ii. On the Data Source page, click Add data source in the upper-right corner.
iii. In the Add data source dialog box, click HDFS in the Semi-structured storage section.
iv. In the Add HDFS data source dialog box, configure the parameters.
Parameter     Description
Environment   Note: This parameter is displayed only when the workspace is in standard mode.
Note: If the network type of the EMR Hadoop cluster is VPC, the connectivity test is not supported.
iv. In the Confirm message, click OK to switch to the code editor.
v. Click the Apply Template icon in the top toolbar.
Apply Template icon
vi. In the Apply Template dialog box, configure the Source Connection Type, Connection, Target Connection Type, and Connection parameters and click OK.
Apply Template dialog box
vii. After the template is applied, the basic settings of HDFS Reader are configured.
You can further configure the data source and source table for HDFS Reader based on your business requirements. In this example, the following script is used. For more information, see HDFS Reader.
{
"configuration": {
"reader": {
"plugin": "hdfs",
"parameter": {
"path": "/user/hive/warehouse/hive_doc_good_sale/",
"datasource": "HDFS1",
"column": [
{
"index": 0,
"type": "string"
},
{
"index": 1,
"type": "string"
},
{
"index": 2,
"type": "string"
},
{
"index": 3,
"type": "string"
},
{
"index": 4,
"type": "long"
},
{
"index": 5,
"type": "double"
},
{
"index": 6,
"type": "long"
}
],
"defaultFS": "hdfs://47.100.XX.XXX:9000",
"fieldDelimiter": ",",
"encoding": "UTF-8",
"fileType": "text"
}
},
"writer": {
"plugin": "odps",
"parameter": {
"partition": "pt=1",
"truncate": false,
"datasource": "odps_first",
"column": [
"create_time",
"category",
"brand",
"buyer_id",
"trans_num",
"trans_amount",
"click_cnt"
],
"table": "hive_doc_good_sale"
}
},
"setting": {
"errorLimit": {
"record": "1000"
},
"speed": {
"throttle": false,
"concurrent": 1,
"mbps": "1"
}
}
},
"type": "job",
"version": "1.0"
}
In the preceding script, the path parameter specifies the directory where the source data is stored in the EMR Hadoop cluster. You can log on to the master node and run the hdfs dfs -ls /user/hive/warehouse/hive_doc_good_sale command to check the directory. For a partitioned table, the data synchronization feature of DataWorks can automatically recurse to the partition where the data is stored.
viii. After the configuration is complete, click the Run icon in the top toolbar. If a message indicating that the synchronization node is successfully run appears, the data is synchronized. If a message indicating that the synchronization node failed to be run appears, check the logs for troubleshooting.
ODPS SQL
3. In the code editor of the created ODPS SQL node, write and execute an SQL statement to view the data that is synchronized to the hive_doc_good_sale table.
Sample statement:
select * FROM hive_doc_good_sale where pt=1;
Note: You can also run the select * FROM hive_doc_good_sale where pt=1; command by using the MaxCompute client to query the synchronized data.
If you want t o synchronize dat a from MaxComput e t o Hadoop, you can also perform t he preceding
st eps. However, you must exchange t he reader and writ er in t he preceding script . You can use t he
following script t o synchronize dat a from MaxComput e t o Hadoop:
{
"configuration": {
"reader": {
"plugin": "odps",
"parameter": {
"partition": "pt=1",
"isCompress": false,
"datasource": "odps_first",
"column": [
"create_time",
"category",
"brand",
"buyer_id",
"trans_num",
"trans_amount",
"click_cnt"
],
"table": "hive_doc_good_sale"
}
},
"writer": {
"plugin": "hdfs",
"parameter": {
"path": "/user/hive/warehouse/hive_doc_good_sale",
"fileName": "pt=1",
"datasource": "HDFS_data_source",
"column": [
{
"name": "create_time",
"type": "string"
},
{
"name": "category",
"type": "string"
},
{
"name": "brand",
"type": "string"
},
{
"name": "buyer_id",
"type": "string"
},
{
"name": "trans_num",
"type": "BIGINT"
},
{
"name": "trans_amount",
"type": "DOUBLE"
},
{
"name": "click_cnt",
"type": "BIGINT"
}
],
"defaultFS": "hdfs://47.100.XX.XX:9000",
"writeMode": "append",
"fieldDelimiter": ",",
"encoding": "UTF-8",
"fileType": "text"
}
},
"setting": {
"errorLimit": {
"record": "1000"
},
"speed": {
"throttle": false,
"concurrent": 1,
"mbps": "1"
}
}
},
"type": "job",
"version": "1.0"
}
Note Before you run a synchronization node to synchronize data from MaxCompute to Hadoop, you must configure the Hadoop cluster. For more information, see HDFS Writer. After the synchronization node is run, you can copy the file that is synchronized.
This topic describes how to use the data integration feature of DataWorks to migrate data from Oracle to MaxCompute.
Prerequisites
The DataWorks environment is ready.
i. Activate MaxCompute and DataWorks.
ii. Create a workspace. In this example, a workspace in basic mode is used.
iii. A workflow is created in the DataWorks console. For more information, see Create a workflow.
The Oracle database is ready.
In this example, the Oracle database is installed on an Elastic Compute Service (ECS) instance. To enable network communication, you must configure a public IP address for the ECS instance. In addition, you must configure a security group rule for the ECS instance to ensure that the common port 1521 of the Oracle database is accessible. For more information about how to configure a security group rule for an ECS instance, see Modify security group rules.
In this example, the type of the ECS instance is ecs.c5.xlarge. The ECS instance resides in a virtual private cloud (VPC) in the China (Hangzhou) region.
Context
In this example, DataWorks Oracle Reader is used to read test data from the Oracle database. For more information, see Oracle Reader.
3. After data insertion, execute the following statement to view the data in the table:
2. On the DataStudio page, create a destination table to receive data migrated from the Oracle database.
i.
ii.
iii.
iv. In the DDL Statement dialog box, enter the following statement and click Generate Table Schema:
When you create the MaxCompute table, make sure that the data types of the MaxCompute table match those of the Oracle table. For more information about the data types supported by Oracle Reader, see Data types.
v.
3. Create an Oracle connection. For more information, see Add an Oracle data source.
4. Create a batch sync node.
i.
ii.
iii. After you create the batch sync node, set the Connection parameter to the created Oracle connection and the Table parameter to the Oracle table that you have created. Click Map Fields with the Same Name. Use the default values for other parameters.
iv.
v.
4.
5.
Prerequisites
MaxCompute is activated. For more information, see Activate MaxCompute and DataWorks.
A workflow is created in DataWorks. In this example, a DataWorks workspace in basic mode is used. For more information, see Create a workflow.
A Kafka cluster is created.
Before data migration, make sure that your Kafka cluster works as expected. In this example, Alibaba Cloud E-MapReduce (EMR) is used to automatically create a Kafka cluster. For more information, see Kafka quick start.
The Kafka cluster is deployed in a virtual private cloud (VPC) in the China (Hangzhou) region. The Elastic Compute Service (ECS) instances in the primary instance group of the Kafka cluster are configured with public and private IP addresses.
Context
Kafka is distributed middleware that is used to publish and subscribe to messages. Kafka is widely used because of its high performance and high throughput: it can process millions of messages per second. Kafka is applicable to streaming data processing and is used in scenarios such as user behavior tracing and log collection.
A typical Kafka cluster contains several producers, brokers, consumers, and a ZooKeeper cluster. A Kafka cluster uses ZooKeeper to manage configurations and coordinate services in the cluster.
A topic is the most commonly used collection of messages in a Kafka cluster, and is a logical concept for message storage. Topics are not stored on physical disks. Instead, messages in each topic are stored on the disks of each cluster node by partition. Multiple producers can publish messages to a topic, and multiple consumers can subscribe to messages in a topic.
When a message is stored to a partition, the message is allocated an offset. The offset is the unique ID of the message in the partition. The offsets of messages in each partition start from 0.
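The per-partition offset numbering described above can be sketched in a few lines of Python. This is an illustration only, not the Kafka implementation:

```python
# Illustration of per-partition offsets: each partition numbers its own
# messages consecutively, starting from 0.
class Partition:
    def __init__(self):
        self.messages = []

    def append(self, message):
        offset = len(self.messages)  # the next offset equals the current length
        self.messages.append(message)
        return offset

topic = {0: Partition(), 1: Partition()}  # a topic with two partitions

first = topic[0].append("msg-a")   # offset 0 in partition 0
second = topic[0].append("msg-b")  # offset 1 in partition 0
other = topic[1].append("msg-c")   # offset 0 in partition 1: offsets are per partition
```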
Run the following command to simulate a producer that writes data to the testkafka topic. Kafka is used to process streaming data, so you can continuously write data to the topic. To ensure that the test results are valid, we recommend that you write more than 10 records.
To simulate a consumer and check whether the data is written to Kafka, open another SSH window and run the following command. If the data that was written appears, the data has been written to the topic.
1.
2.
3.
4. Click DDL Statement. In the DDL Statement dialog box, enter the following CREATE TABLE statement and click Generate Table Schema:
Each column in the statement corresponds to a default column of Kafka Reader that is provided by DataWorks Data Integration:
__key__: the key of the message.
__value__: the complete content of the message.
__partition__: the partition where the message resides.
__headers__: the header of the message.
__offset__: the offset of the message.
__timestamp__: the timestamp of the message.
You can also customize a column. For more information, see Kafka Reader.
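As an illustration, the following Python sketch shows how one consumed message could populate the default columns listed above. The sample message and its field values are assumptions for illustration, not output of Kafka Reader:

```python
# Hypothetical consumed message; the field values are made up for illustration.
message = {
    "key": "user01",
    "value": '{"event_id": 7}',
    "partition": 2,
    "headers": {},
    "offset": 15,
    "timestamp": 1600000000000,
}

# Map the message to the default columns of Kafka Reader.
row = {
    "__key__": message["key"],
    "__value__": message["value"],          # the complete message content
    "__partition__": message["partition"],
    "__headers__": message["headers"],
    "__offset__": message["offset"],
    "__timestamp__": message["timestamp"],
}
```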
5.
The Kafka plug-in cannot run on the default resource group of DataWorks as expected. You must use an exclusive resource group for Data Integration to synchronize data. For more information, see Create and use an exclusive resource group for Data Integration.
2.
3.
4.
5. Configure the script. In this example, enter the following code:
{
"type": "job",
"steps": [
{
"stepType": "kafka",
"parameter": {
"server": "47.xxx.xxx.xxx:9092",
"kafkaConfig": {
"group.id": "console-consumer-83505"
},
"valueType": "ByteArray",
"column": [
"__key__",
"__value__",
"__partition__",
"__timestamp__",
"__offset__",
"'123'",
"event_id",
"tag.desc"
],
"topic": "testkafka",
"keyType": "ByteArray",
"waitTime": "10",
"beginOffset": "0",
"endOffset": "3"
},
"name": "Reader",
"category": "reader"
},
{
"stepType": "odps",
"parameter": {
"partition": "",
"truncate": true,
"compress": false,
"datasource": "odps_first",
"column": [
"key",
"value",
"partition1",
"timestamp1",
"offset",
"t123",
"event_id",
"tag"
],
"emptyAsNull": false,
"table": "testkafka"
},
"name": "Writer",
"category": "writer"
}
],
"version": "2.0",
"order": {
"hops": [
{
"from": "Reader",
"to": "Writer"
}
]
},
"setting": {
"errorLimit": {
"record": ""
},
"speed": {
"throttle": false,
"concurrent": 1
}
}
}
To view the value of the group.id parameter and the names of the consumer groups, run the kafka-consumer-groups.sh --bootstrap-server emr-header-1:9092 --list command on the header node.
Note Assume that you want to write Kafka data to MaxCompute at a regular interval, for example, on an hourly basis. You can use the beginDateTime and endDateTime parameters to set the interval for data reading to 1 hour. Then, the data integration node is scheduled to run once per hour. For more information, see Kafka Reader.
7.
8.
What's next
You can create a data development job and run SQL statements to check whether the data has been synchronized from Message Queue for Apache Kafka to the current table. This topic uses the select * from testkafka statement as an example. Perform the following steps:
1. In the left-side navigation pane, choose Data Development > Business Flow.
2. Right-click and choose Data Development > Create Data Development Node ID > ODPS SQL.
3. In the Create Node dialog box, enter the node name and click Submit.
4. On the page of the created node, enter select * from testkafka and click the Run icon.
Prerequisites
MaxCompute is activated. For more information, see Activate MaxCompute and DataWorks.
DataWorks is activated.
A workflow is created in DataWorks. In this example, a DataWorks workspace in basic mode is used. For more information, see Create a workflow.
An Alibaba Cloud Elasticsearch cluster is created.
Before you migrate data, make sure that your Alibaba Cloud Elasticsearch cluster works as expected. For more information about how to create an Alibaba Cloud Elasticsearch cluster, see Quick start.
An Alibaba Cloud Elasticsearch cluster with the following configurations is used in this example:
Context
Elasticsearch is a Lucene-based search server. It provides a distributed multi-tenant search engine that supports full-text search. Elasticsearch is an open source service that complies with the Apache open standards. It is a mainstream enterprise-class search engine.
Alibaba Cloud Elasticsearch includes Elasticsearch 5.5.3 with Commercial Feature, Elasticsearch 6.3.2 with Commercial Feature, and Elasticsearch 6.7.0 with Commercial Feature, and contains the commercial X-Pack plug-in. You can use Alibaba Cloud Elasticsearch in scenarios such as data analysis and search. Based on open source Elasticsearch, Alibaba Cloud Elasticsearch provides enterprise-class access control, security monitoring and alerting, and automatic reporting.
Procedure
1. Create a source table in Elasticsearch. For more information, see Use DataWorks to synchronize data from MaxCompute to an Alibaba Cloud Elasticsearch cluster.
vi.
3. Synchronize data.
i.
ii.
iii.
iv.
v.
vi. Configure the script.
In this example, enter the following code. For more information about the code, see Elasticsearch Reader.
{
"type": "job",
"steps": [
{
"stepType": "elasticsearch",
"parameter": {
"retryCount": 3,
"column": [
"age",
"job",
"marital",
"education",
"default",
"housing",
"loan",
"contact",
"month",
"day_of_week",
"duration",
"campaign",
"pdays",
"previous",
"poutcome",
"emp_var_rate",
"cons_price_idx",
"cons_conf_idx",
"euribor3m",
"nr_employed",
"y"
],
"scroll": "1m",
"index": "es_index",
"pageSize": 1,
"sort": {
"age": "asc"
},
"type": "elasticsearch",
"connTimeOut": 1000,
"retrySleepTime": 1000,
"endpoint": "http://es-cn-xxxx.xxxx.xxxx.xxxx.com:9200",
"password": "xxxx",
"search": {
"match_all": {}
},
"readTimeOut": 5000,
"username": "xxxx"
},
"name": "Reader",
"category": "reader"
},
{
"stepType": "odps",
"parameter": {
"partition": "",
"truncate": true,
"compress": false,
"datasource": "odps_first",
"column": [
"age",
"job",
"marital",
"education",
"default",
"housing",
"loan",
"contact",
"month",
"day_of_week",
"duration",
"campaign",
"pdays",
"previous",
"poutcome",
"emp_var_rate",
"cons_price_idx",
"cons_conf_idx",
"euribor3m",
"nr_employed",
"y"
],
"emptyAsNull": false,
"table": "elastic2mc_bankdata"
},
"name": "Writer",
"category": "writer"
}
],
"version": "2.0",
"order": {
"hops": [
{
"from": "Reader",
"to": "Writer"
}
]
},
"setting": {
"errorLimit": {
"record": "0"
},
"speed": {
"throttle": false,
"concurrent": 1,
"dmu": 1
}
}
}
Note On the Basic Information page of the created Alibaba Cloud Elasticsearch cluster, you can view the public IP address and port number in the Public Network Access and Public Network Port fields.
viii. You can view the running result on the Runtime Log tab.
4. View the result.
i.
ii.
iii. On the configuration tab of the ODPS SQL node, enter the following statement:
iv.
v.
Prerequisites
MaxCompute is activated. For more information, see Activate MaxCompute and DataWorks.
DataWorks is activated.
A workflow is created in DataWorks. In this example, a DataWorks workspace in basic mode is used. For more information, see Create a workflow.
db.createUser({user:"bookuser",pwd:"123456",roles:["root"]})
2. Prepare data.
Upload the data to the MongoDB database. In this example, an ApsaraDB for MongoDB instance in a virtual private cloud (VPC) is used. You must apply for a public endpoint for the ApsaraDB for MongoDB instance to communicate with the default resource group of DataWorks. The following test data is uploaded:
{
"store": {
"book": [
{
"category": "reference",
"author": "Nigel Rees",
"title": "Sayings of the Century",
"price": 8.95
},
{
"category": "fiction",
"author": "Evelyn Waugh",
"title": "Sword of Honour",
"price": 12.99
},
{
"category": "fiction",
"author": "J. R. R. Tolkien",
"title": "The Lord of the Rings",
"isbn": "0-395-19395-8",
"price": 22.99
}
],
"bicycle": {
"color": "red",
"price": 19.95
}
},
"expensive": 10
}
3. Log on to the MongoDB database in the Data Management (DMS) console. In this example, the name of the database is admin, and the name of the collection is userlog. You can run the following command to view the uploaded data:
db.userlog.find().limit(10)
{
"type": "job",
"steps": [
{
"stepType": "mongodb",
"parameter": {
"datasource": "mongodb_userlog", // The name of the connection.
"column": [
{
"name": "store.bicycle.color", // The path of the JSON-formatted field. In this example, the color field is extracted.
"type": "document.String" // For fields other than top-level fields, the data type of such a field is the type that is finally obtained. If the specified JSON-formatted field is a top-level field, such as the expensive field in this example, enter string.
}
],
"collectionName": "userlog" // The name of the collection.
},
"name": "Reader",
"category": "reader"
},
{
"stepType": "odps",
"parameter": {
"partition": "",
"isCompress": false,
"truncate": true,
"datasource": "odps_first",
"column": [
"mqdata" // The name of the column in the MaxCompute table.
],
"emptyAsNull": false,
"table": "mqdata"
},
"name": "Writer",
"category": "writer"
}
],
"version": "2.0",
"order": {
"hops": [
{
"from": "Reader",
"to": "Writer"
}
]
},
"setting": {
"errorLimit": {
"record": ""
},
"speed": {
"concurrent": 2,
"throttle": false
}
}
}
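To clarify how a dotted column path such as store.bicycle.color in the reader configuration above resolves against the sample document, here is a minimal Python sketch (an illustration only, not the MongoDB Reader implementation):

```python
# Walk a dotted path one level at a time through a nested document.
def resolve(document, path):
    value = document
    for part in path.split("."):
        value = value[part]  # descend one nesting level per dotted segment
    return value

# Abridged version of the sample document uploaded earlier.
doc = {
    "store": {
        "bicycle": {"color": "red", "price": 19.95},
    },
    "expensive": 10,
}

color = resolve(doc, "store.bicycle.color")  # nested field -> "red"
top_level = resolve(doc, "expensive")        # top-level field -> 10
```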
vii.
viii.
4.
5.
Prerequisites
The DataWorks environment is ready.
i. MaxCompute is activated. For more information, see Activate MaxCompute and DataWorks.
ii. DataWorks is activated. To activate DataWorks, go to the DataWorks buy page.
iii. A workflow is created in the DataWorks console. In this example, a workflow is created in a DataWorks workspace in basic mode. For more information, see Create a workflow.
Connections to the source and destination data stores are created.
A MySQL connection is created as the source connection. For more information, see Add a MySQL data source.
A MaxCompute connection is created as the destination connection. For more information, see Add a MaxCompute data source.
1.
2. Create a destination table in MaxCompute.
i.
ii.
iii.
iv.
v.
vi. In the DDL Statement dialog box, enter the following statement and click Generate Table Schema:
vii.
3. Create a batch sync node.
i.
ii.
iii. Configure the source and destination for the batch sync node.
ii. In the General section, set the Arguments parameter. The default value is ${bizdate} in the format of yyyymmdd.
You can specify a date in one of the following formats for the partition parameter:
Note
Keep the value calculation formula in brackets []. For example, key1=$[yyyy-mm-dd].
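As a sketch of what these scheduling parameters evaluate to, assuming the common DataWorks convention that the business date is the day before the scheduled run date:

```python
from datetime import date, timedelta

run_date = date(2018, 9, 14)              # example scheduled run date (assumed)
bizdate = run_date - timedelta(days=1)    # business date = run date minus one day

compact = bizdate.strftime("%Y%m%d")      # the yyyymmdd form, like ${bizdate}
dashed = bizdate.strftime("%Y-%m-%d")     # the dashed form, like $[yyyy-mm-dd]
```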
5.
6.
If you have a large amount of historical data in ApsaraDB RDS that was generated before the node is run, all historical data needs to be automatically migrated to MaxCompute and the partitions need to be automatically created. To generate retroactive data for the current sync node, you can use the Patch Data feature of DataWorks.
1. Filter historical data in ApsaraDB RDS by date.
You can set the Filter parameter in the Source section to filter data in ApsaraDB RDS.
2. Generate retroactive data for the node. For more information, see Perform retroactive data generation and view retroactive data generation instances.
3. View the process of extracting data from ApsaraDB RDS on the Run Log tab.
The logs indicate that Partition 20180913 is automatically created in MaxCompute.
4. Verify the execution result. Execute the following statement on the MaxCompute client to check whether the data is written to MaxCompute:
If you have a large amount of data or full data is migrated to partitions based on a non-date field for the first time, the partitions cannot be automatically created during the migration. In this case, you can map the values in a field in the source table to a corresponding partition in MaxCompute by using a hash function.
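The hash-based mapping can be sketched as follows. This is an illustration only; the partition count and field value are assumptions, and in practice the hashing is done in SQL on the temporary table:

```python
import hashlib

NUM_PARTITIONS = 16  # assumed number of destination partitions

def partition_for(value: str) -> str:
    """Map a non-date field value to a stable MaxCompute partition name."""
    digest = hashlib.md5(value.encode("utf-8")).hexdigest()
    bucket = int(digest, 16) % NUM_PARTITIONS  # stable bucket in [0, NUM_PARTITIONS)
    return f"pt={bucket}"

p1 = partition_for("user_1001")
p2 = partition_for("user_1001")  # the same value always maps to the same partition
```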
1. Create an SQL script node. Execute the following statements to create a temporary table in MaxCompute and migrate data to the table:
2. Create a sync node named mysql_to_odps to migrate the full data from ApsaraDB RDS to MaxCompute. Partitioning is not required.
3. Execute the following SQL statements to migrate data from table ods_user_t to table ods_user_d based on dynamic partitioning:
You can use SQL statements to migrate data in MaxCompute. For more information about SQL statements, see Use partitioned tables in MaxCompute.
4. Configure the three nodes to form a workflow to run these nodes sequentially, as shown in the following figure.
5. View the execution process. The last node represents the process of dynamic partitioning, as shown in the following figure.
6. Verify the execution result. Execute the following statement on the MaxCompute client to check whether the data is written to MaxCompute:
Prerequisites
MaxCompute is activated.
DataWorks is activated.
A workflow is created in the DataWorks console. In this example, a workflow is created in a DataWorks workspace in basic mode. For more information, see Create a workflow.
A TXT file that contains JSON data is uploaded to an OSS bucket. In this example, the OSS bucket is in the China (Shanghai) region. The TXT file contains the following JSON data:
{
"store": {
"book": [
{
"category": "reference",
"author": "Nigel Rees",
"title": "Sayings of the Century",
"price": 8.95
},
{
"category": "fiction",
"author": "Evelyn Waugh",
"title": "Sword of Honour",
"price": 12.99
},
{
"category": "fiction",
"author": "J. R. R. Tolkien",
"title": "The Lord of the Rings",
"isbn": "0-395-19395-8",
"price": 22.99
}
],
"bicycle": {
"color": "red",
"price": 19.95
}
},
"expensive": 10
}
v.
3. Create a batch synchronization node.
i.
ii.
iii.
iv.
v.
Sample code:
{
"type": "job",
"steps": [
{
"stepType": "oss",
"parameter": {
"fieldDelimiterOrigin": "^",
"nullFormat": "",
"compress": "",
"datasource": "OSS_userlog",
"column": [
{
"name": 0,
"type": "string",
"index": 0
}
],
"skipHeader": "false",
"encoding": "UTF-8",
"fieldDelimiter": "^",
"fileFormat": "binary",
"object": [
"applog.txt"
]
},
"name": "Reader",
"category": "reader"
},
{
"stepType": "odps",
"parameter": {
"partition": "",
"isCompress": false,
"truncate": true,
"datasource": "odps_first",
"column": [
"mqdata"
],
"emptyAsNull": false,
"table": "mqdata"
},
"name": "Writer",
"category": "writer"
}
],
"version": "2.0",
"order": {
"hops": [
{
"from": "Reader",
"to": "Writer"
}
]
},
"setting": {
"errorLimit": {
"record": ""
},
"speed": {
"concurrent": 2,
"throttle": false
}
}
}
iv.
v.
Prerequisites
Procedure
1. Create a table in the DataWorks console.
i.
ii.
iii.
iv.
v.
vi.
vii. In the DDL Statement dialog box, enter the following statement and click Generate Table Schema:
viii.
2. Import data to table Transs.
i.
ii.
iii. In the dialog box that appears, set Select Data Import Method to Upload Local File and click Browse next to Select File. Select the local file that you want to import. Then, specify other parameters.
Example:
qwe,145,F
asd,256,F
xzc,345,M
rgth,234,F
ert,456,F
dfg,12,M
tyj,4,M
bfg,245,M
nrtjeryj,15,F
rwh,2344,M
trh,387,F
srjeyj,67,M
saerh,567,M
iv.
v.
vi.
3. Create a table in the Tablestore console.
i. Log on to the Tablestore console and create an instance. For more information, see Create instances.
ii. Create a table named Trans. For more information, see Create tables.
4. Add data sources in the DataWorks console.
i.
ii.
iii.
iv.
v. In the upper-right corner, click New data source. In the dialog box that appears, click MaxCompute(ODPS).
vi. In the Add MaxCompute(ODPS) data source dialog box, specify the required parameters and click Complete. For more information, see Add a MaxCompute data source.
vii. Add Tablestore as a data source. For more information, see Add a Tablestore data source.
5. Configure MaxCompute as the reader and Tablestore as the writer.
i.
ii.
iii.
iv.
v.
Sample code:
{
"type": "job",
"steps": [
{
"stepType": "odps",
"parameter": {
"partition": [],
"datasource": "odps_first",
"column": [
"name",
"id",
"gender"
],
"table": "Transs"
},
"name": "Reader",
"category": "reader"
},
{
"stepType": "ots",
"parameter": {
"datasource": "Transs",
"column": [
{
"name": "Gender",
"type": "STRING"
}
],
"writeMode": "UpdateRow",
"table": "Trans",
"primaryKey": [
{
"name": "Name",
"type": "STRING"
},
{
"name": "ID",
"type": "INT"
}
]
},
"name": "Writer",
"category": "writer"
}
],
"version": "2.0",
"order": {
"hops": [
{
"from": "Reader",
"to": "Writer"
}
]
},
"setting": {
"errorLimit": {
"record": "0"
},
"speed": {
"throttle": false,
"concurrent": 1,
"dmu": 1
}
}
}
Prerequisites
Procedure
1. Create a table in the DataWorks console.
i.
ii.
iii.
iv.
v.
vi.
vii. In the DDL Statement dialog box, enter the following statement and click Generate Table Schema:
viii.
2. Import data to table Transs.
i.
ii.
iii. In the dialog box that appears, set Select Data Import Method to Upload Local File and click Browse next to Select File. Select the local file that you want to import. Then, specify other parameters.
Example:
qwe,145,F
asd,256,F
xzc,345,M
rgth,234,F
ert,456,F
dfg,12,M
tyj,4,M
bfg,245,M
nrtjeryj,15,F
rwh,2344,M
trh,387,F
srjeyj,67,M
saerh,567,M
iv.
v.
vi.
3. Create a bucket in the OSS console.
i. Log on to the OSS console and create a bucket. For more information, see Create buckets.
ii. Upload the qwee.csv file to OSS. For more information, see Upload objects.
Note Make sure that the fields in the qwee.csv file are exactly the same as those in the Transs table.
iv. In the left-side navigation pane of the page that appears, click Connection. The Data Source page appears.
v. In the upper-right corner, click New data source. In the dialog box that appears, click MaxCompute(ODPS).
vi. In the Add MaxCompute(ODPS) data source dialog box, specify the required parameters and click Complete. For more information, see Add a MaxCompute data source.
vii. Add OSS as a data source. For more information, see Add an OSS data source.
5. Configure MaxCompute as the reader and OSS as the writer.
i.
ii.
iii.
iv.
v.
Sample code:
{
"order":{
"hops":[
{
"from":"Reader",
"to":"Writer"
}
]
},
"setting":{
"errorLimit":{
"record":"0"
},
"speed":{
"concurrent":1,
"dmu":1,
"throttle":false
}
},
"steps":[
{
"category":"reader",
"name":"Reader",
"parameter":{
"column":[
"name",
"id",
"gender"
],
"datasource":"odps_first",
"partition":[],
"table":"Transs"
},
"stepType":"odps"
},
{
"category":"writer",
"name":"Writer",
"parameter":{
"datasource":"Trans",
"dateFormat":"yyyy-MM-dd HH:mm:ss",
"encoding":"UTF-8",
"fieldDelimiter":",",
"fileFormat":"csv",
"nullFormat":"null",
"object":"qweee.csv",
"writeMode":"truncate"
},
"stepType":"oss"
}
],
"type":"job",
"version":"2.0"
}
6. View the data of the newly created table in the OSS console. For more information, see Download objects.
Prerequisites
An ECS instance is purchased and bound to a virtual private cloud (VPC), not the classic network. A MySQL database that stores test data is deployed on the ECS instance. An account used to connect to the database is created. In this example, use the following statements to create a table in the MySQL database and insert test data into the table:
The private IP address, VPC, and vSwitch of your ECS instance are noted.
A security group rule is added for the ECS instance to allow access requests on the port used by the MySQL database. By default, the MySQL database uses port 3306. For more information, see Add a security group rule. The name of the security group is noted.
A DataWorks workspace is created. In this example, create a DataWorks workspace that is in basic mode and uses a MaxCompute compute engine. Make sure that the created DataWorks workspace belongs to the same region as the ECS instance. For more information about how to create a workspace, see Create a workspace.
An exclusive resource group for Data Integration is purchased and bound to the VPC where the ECS instance resides. The exclusive resource group and the ECS instance are in the same zone. For more information, see Create and use an exclusive resource group for Data Integration. After the exclusive resource group is bound to the VPC, you can view information about the exclusive resource group on the Resource Groups page.
Check whether the VPC, vSwitch, and security group of the exclusive resource group are the same as those of the ECS instance.
Context
An exclusive resource group can transmit your data in a fast and stable manner. Make sure that the exclusive resource group for Data Integration belongs to the same zone in the same region as the data store that needs to be accessed. Also make sure that the exclusive resource group for Data Integration belongs to the same region as the DataWorks workspace. In this example, the data store that needs to be accessed is a user-created MySQL database on an ECS instance.
Procedure
1. Create a connection to the MySQL database in the DataWorks console.
i. Log on to the DataWorks console by using your Alibaba Cloud account.
ii. On the Workspaces page, find the required workspace and click Data Integration.
iii. In the left-side navigation pane, click Connection.
iv. On the Data Source page, click New data source in the upper-right corner.
v. In the Add data source dialog box, select MySQL.
vi. In the Add MySQL data source dialog box, set the parameters. For more information, see Add a MySQL data source.
For example, set the Data source type parameter to Connection string mode. Use the private IP address of the ECS instance and the default port number 3306 of the MySQL database when you specify the Java Database Connectivity (JDBC) URL.
Note DataWorks cannot test the connectivity of a user-created MySQL database in a VPC. Therefore, it is normal that a connectivity test fails.
vii. Find the required resource group and click Test connectivity.
During data synchronization, a sync node uses only one resource group. You must test the connectivity of all the resource groups for Data Integration on which your sync nodes will be run and make sure that the resource groups can connect to the data store. This ensures that your sync nodes can be run as expected. For more information, see Select a network connectivity solution.
viii. After the connection passes the connectivity test, click Complete.
2. Create a MaxCompute table.
You must create a table in DataWorks to receive test data from the MySQL database.
i. Click the icon in the upper-left corner and choose All Products > DataStudio.
ii. Create a workflow. For more information, see Create a workflow.
iii. Right-click the created workflow and choose Create > MaxCompute > Table.
iv. Enter a name for your MaxCompute table. In this example, set the Table Name parameter to good_sale, which is the same as the name of the table in the MySQL database. Click DDL Statement, enter the table creation statement, and then click Generate Table Schema.
In this example, enter the following table creation statement. Pay attention to data type conversion.
v. Set the Display Name parameter and click Commit to Production Environment. The MaxCompute table named good_sale is created.
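The data type conversion mentioned above can be checked with a small mapping table. The MySQL column types below are assumptions for illustration; adjust them to your actual schema:

```python
# Hypothetical MySQL -> MaxCompute type mapping used when writing the
# CREATE TABLE statement for the destination table.
TYPE_MAP = {
    "varchar": "STRING",
    "datetime": "DATETIME",
    "int": "BIGINT",    # MaxCompute integer columns are 64-bit
    "double": "DOUBLE",
    "decimal": "DECIMAL",
}

# Assumed source columns of the good_sale table.
source_columns = [
    ("create_time", "varchar"),
    ("trans_num", "int"),
    ("trans_amount", "double"),
]

dest_types = [TYPE_MAP[mysql_type] for _, mysql_type in source_columns]
```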
3. Configure a dat a int egrat ion node.
i. Right -click t he workflow you just creat ed and choose Creat e > Dat a Int egrat ion > Bat ch
Synchroniz at ion t o creat e a dat a int egrat ion node.
ii. Set t he Connect ion paramet er under Source t o t he creat ed MySQL connect ion and t he
Connect ion paramet er under T arget t o odps_first . Click t he Swit ch t o Code Edit or icon t o
swit ch t o t he code edit or.
If you cannot set t he T able paramet er under Source or an error is ret urned when you at t empt
t o swit ch t o t he code edit or, ignore t he issue.
iii. Click t he Resource Group conf igurat ion t ab in t he right -side navigat ion pane and select an
exclusive resource group t hat you have purchased.
If you do not select t he exclusive resource group as t he resource group for Dat a Int egrat ion of
your node, t he node may fail t o be run.
iv. Ent er t he following code for t he dat a int egrat ion node:
{
"type": "job",
"steps": [
{
"stepType": "mysql",
"parameter": {
"column": [// The columns in the source table.
"create_time",
"category",
"brand",
"buyer_id",
"trans_num",
"trans_amount",
"click_cnt"
],
"connection": [
{
"datasource": "shuai",// The source connection.
"table": [
"good_sale"// The name of the table in the source database. The name must be enclosed in brackets [].
]
}
],
"where": "",
"splitPk": "",
"encoding": "UTF-8"
},
"name": "Reader",
"category": "reader"
},
{
"stepType": "odps",
"parameter": {
"partition": "",
"truncate": true,
"datasource": "odps_first",// The destination connection.
"column": [// The columns in the destination table.
"create_time",
"category",
"brand",
"buyer_id",
"trans_num",
"trans_amount",
"click_cnt"
],
"emptyAsNull": false,
"table": "good_sale"// The name of the destination table.
},
"name": "Writer",
"category": "writer"
}
],
"version": "2.0",
"order": {
"hops": [
{
"from": "Reader",
"to": "Writer"
}
]
},
"setting": {
"errorLimit": {
"record": "0"
},
"speed": {
"throttle": false,
            "concurrent": 2
}
}
}
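Before you run the node, it can help to check the job configuration locally. The following Python sketch is not part of Data Integration; it is a hypothetical local sanity check that the reader and writer column lists of a batch-sync job config align, which is a common cause of sync failures:

```python
# Illustrative sketch: verify that a batch-sync job config maps every
# source column to a destination column. The job JSON below is a trimmed
# version of the config shown above.
import json

job_json = """
{
  "steps": [
    {"name": "Reader", "category": "reader",
     "parameter": {"column": ["create_time", "category", "brand", "buyer_id",
                              "trans_num", "trans_amount", "click_cnt"]}},
    {"name": "Writer", "category": "writer",
     "parameter": {"column": ["create_time", "category", "brand", "buyer_id",
                              "trans_num", "trans_amount", "click_cnt"]}}
  ]
}
"""

def columns_aligned(job: dict) -> bool:
    """Return True if the reader and writer declare the same number of columns."""
    cols = {step["category"]: step["parameter"]["column"] for step in job["steps"]}
    return len(cols["reader"]) == len(cols["writer"])

job = json.loads(job_json)
print(columns_aligned(job))  # True for the config above
```

If the lists have different lengths, fix the `column` arrays before you commit the node.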
v. Click the Run icon. You can view the Runtime Log tab in the lower part of the page to check whether the test data is synchronized to MaxCompute.
Result
To query data in the MaxCompute table, create an ODPS SQL node. Enter the statement select * from good_sale;, and click the Run icon. If the test data appears, it is synchronized to the MaxCompute table.
Prerequisites
Create an Amazon Redshift cluster and prepare data for migration.
For more information about how to create an Amazon Redshift cluster, see Amazon Redshift Cluster Management Guide.
i. Create an Amazon Redshift cluster. If you already have an Amazon Redshift cluster, skip this step.
ii. Prepare the data that you want to migrate in the Amazon Redshift cluster.
In this example, a TPC-H dataset is available in the public schema. The dataset uses the MaxCompute V2.0 data types and the Decimal 2.0 data type.
In this example, a MaxCompute project is created as the migration destination in the Singapore (Singapore) region. The project is created in MaxCompute V2.0 because the TPC-H dataset uses the MaxCompute V2.0 data types and the Decimal 2.0 data type.
Context
The following figure shows the process to migrate data from Amazon Redshift to MaxCompute.
No. Description
① Unload the data from Amazon Redshift to a data lake on Amazon Simple Storage Service (S3).
② Migrate the data from Amazon S3 to an OSS bucket by using the Data Online Migration service of OSS.
③ Migrate the data from the OSS bucket to a MaxCompute project in the same region, and then verify the integrity and accuracy of the migrated data.
The syntax of the UNLOAD command varies based on the authentication method.
-- Run the UNLOAD command to unload data from the customer table to Amazon S3 by using IAM role authentication.
UNLOAD ('SELECT * FROM customer')
TO 's3://bucket_name/unload_from_redshift/customer/customer_' -- The Amazon S3 bucket.
IAM_ROLE 'arn:aws:iam::****:role/MyRedshiftRole'; -- The Amazon Resource Name (ARN) of the IAM role.
-- Run the UNLOAD command to unload data from the customer table to Amazon S3 by using AccessKey authentication.
UNLOAD ('SELECT * FROM customer')
TO 's3://bucket_name/unload_from_redshift/customer/customer_' -- The Amazon S3 bucket.
Access_Key_id '<access-key-id>' -- The AccessKey ID of the IAM user.
Secret_Access_Key '<secret-access-key>' -- The AccessKey secret of the IAM user.
Session_Token '<temporary-token>'; -- The temporary access token of the IAM user.
Default format
The following sample command shows how to unload data in the default format:
After the command is run, data is unloaded to text files in which values are separated by vertical bars (|). You can log on to the Amazon S3 console and view the unloaded text files in the specified bucket.
Data unloaded in the Apache Parquet format can be directly read by other engines. The following sample command shows how to unload data in the Apache Parquet format:
After the command is run, you can view the unloaded Parquet files in the specified bucket. Parquet files are smaller than text files and have a higher data compression ratio.
This section describes how to authenticate requests based on IAM roles and unload data in the Apache Parquet format.
1. Create an IAM role for Amazon Redshift.
i. Log on to the IAM console. In the left-side navigation pane, choose Access Management > Roles. On the Roles page, click Create role.
ii. In the Common use cases section of the Create role page, click Redshift. In the Select your use case section, click Redshift-Customizable, and then click Next: Permissions.
2. Add an IAM policy that grants the read and write permissions on Amazon S3. In the Attach permissions policies section of the Create role page, enter S3, select AmazonS3FullAccess, and then click Next: Tags.
3. Assign a name to the IAM role and complete the IAM role creation.
i. Click Next: Review. In the Review section of the Create role page, specify Role name and Role description, and click Create role. The IAM role is then created.
ii. Go to the IAM console, and enter redshift_s3_role in the search box to search for the role. Then, click the role name redshift_s3_role, and copy the value of Role ARN.
When you run the UNLOAD command to unload data, you must provide the Role ARN value to access Amazon S3.
4. Associate the created IAM role with the Amazon Redshift cluster to authorize the cluster to access Amazon S3.
iii. On the Manage IAM roles page, click the icon next to the search box, and select redshift_s3_role. Click Add IAM role > Done to associate the redshift_s3_role role with the Amazon Redshift cluster.
5. Unload data from Amazon Redshift to Amazon S3.
i. Go to the Amazon Redshift console.
ii. In the left-side navigation pane, click EDITOR. Run the UNLOAD command to unload data from Amazon Redshift to each destination bucket on Amazon S3 in the Apache Parquet format. The following sample command shows how to unload data from Amazon Redshift to Amazon S3:
Note You can submit multiple UNLOAD commands at a time in EDITOR.
iii. Log on to the Amazon S3 console and check the unloaded data in the directory of each destination bucket on Amazon S3.
The unloaded data is available in the Apache Parquet format.
1. Log on to the OSS console, and create a bucket to store the migrated data. For more information, see Create buckets.
2. Create a Resource Access Management (RAM) user and grant the required permissions to the RAM user.
i. Log on to the RAM console and create a RAM user. For more information, see Create a RAM user.
ii. Find the RAM user that you created, and click Add Permissions in the Actions column. On the page that appears, select AliyunOSSFullAccess and AliyunMGWFullAccess, and click OK > Complete. The AliyunOSSFullAccess policy authorizes the RAM user to read data from and write data to OSS buckets. The AliyunMGWFullAccess policy authorizes the RAM user to perform online migration jobs.
iii. In the left-side navigation pane, click Overview. In the Account Management section of the Overview page, click the link under RAM user logon, and use the credentials of the RAM user to log on to the Alibaba Cloud Management Console.
3. On the Amazon Web Services (AWS) platform, create an IAM user who uses the programmatic access method to access Amazon S3.
i. Log on to the Amazon S3 console.
ii. Right-click the exported folder and select Get total size to obtain the total size of the folder and the number of files in the folder.
iv. On the Add user page, specify the User name. In the Select AWS access type section, select Programmatic access and then click Next: Permissions.
v. On the Add user page, click Attach existing policies directly. Enter S3 in the search box, select the AmazonS3ReadOnlyAccess policy, and then click Next: Tags.
vi. Click Next: Review > Create user. The IAM user is created. Obtain the AccessKey pair.
If you create an online migration job, you must provide this AccessKey pair.
4. Create a source data address and a destination data address for online migration.
i. Log on to the Alibaba Cloud Data Transport console. In the left-side navigation pane, click Data Address.
ii. (Optional) If you have not activated Data Online Migration, click Application in the dialog box that appears. On the Online Migration Beta Test page, specify the required information and click Submit.
iii. On the Data Address page, click Create Data Address. In the Create Data Address panel, set the required parameters and click OK. For more information about the required parameters, see Migrate data.
Source data address
Note In the Access Key Id and Access Key Secret fields, enter the AccessKey ID and the AccessKey secret of the IAM user.
Note In the Access Key Id and Access Key Secret fields, enter the AccessKey ID and the AccessKey secret of the RAM user.
Job Config
Performance
Note In the Data Size and File Count fields, enter the size and the number of files that you want to migrate from Amazon S3.
iii. The migration job that you created is automatically run. If Finished is displayed in the Job Status column, the migration job is complete.
iv. In the Operation column of the migration job, click Manage to view the migration report and confirm that all the data is migrated.
The LOAD command supports Security Token Service (STS) and AccessKey authentication. If you use AccessKey authentication, you must provide the AccessKey ID and AccessKey secret of your account in plaintext. STS authentication is more secure because it does not expose the AccessKey pair. In this section, STS authentication is used as an example to show how to migrate data.
C_NationKey int ,
C_Phone varchar(64) ,
C_AcctBal decimal(13, 2) ,
C_MktSegment varchar(64) ,
C_Comment varchar(120) ,
skip varchar(64)
);
CREATE TABLE lineitem(
L_OrderKey int ,
L_PartKey int ,
L_SuppKey int ,
L_LineNumber int ,
L_Quantity int ,
L_ExtendedPrice decimal(13, 2) ,
L_Discount decimal(13, 2) ,
L_Tax decimal(13, 2) ,
L_ReturnFlag varchar(64) ,
L_LineStatus varchar(64) ,
L_ShipDate timestamp ,
L_CommitDate timestamp ,
L_ReceiptDate timestamp ,
L_ShipInstruct varchar(64) ,
L_ShipMode varchar(64) ,
L_Comment varchar(64) ,
skip varchar(64)
);
CREATE TABLE nation(
N_NationKey int ,
N_Name varchar(64) ,
N_RegionKey int ,
N_Comment varchar(160) ,
skip varchar(64)
);
CREATE TABLE orders(
O_OrderKey int ,
O_CustKey int ,
O_OrderStatus varchar(64) ,
O_TotalPrice decimal(13, 2) ,
O_OrderDate timestamp ,
O_OrderPriority varchar(15) ,
O_Clerk varchar(64) ,
O_ShipPriority int ,
O_Comment varchar(80) ,
skip varchar(64)
);
CREATE TABLE part(
P_PartKey int ,
P_Name varchar(64) ,
P_Mfgr varchar(64) ,
P_Brand varchar(64) ,
P_Type varchar(64) ,
P_Size int ,
P_Container varchar(64) ,
P_RetailPrice decimal(13, 2) ,
P_Comment varchar(64) ,
skip varchar(64)
);
CREATE TABLE partsupp(
PS_PartKey int ,
PS_SuppKey int ,
PS_AvailQty int ,
PS_SupplyCost decimal(13, 2) ,
PS_Comment varchar(200) ,
skip varchar(64)
);
CREATE TABLE region(
R_RegionKey int ,
R_Name varchar(64) ,
R_Comment varchar(160) ,
skip varchar(64)
);
CREATE TABLE supplier(
S_SuppKey int ,
S_Name varchar(64) ,
S_Address varchar(64) ,
S_NationKey int ,
S_Phone varchar(18) ,
S_AcctBal decimal(13, 2) ,
S_Comment varchar(105) ,
skip varchar(64)
);
In this example, the project uses the MaxCompute V2.0 data types because the TPC-H dataset uses the MaxCompute V2.0 data types and the Decimal 2.0 data type. If you want to configure the project to use the MaxCompute V2.0 data types and the Decimal 2.0 data type, add the following commands at the beginning of the CREATE TABLE statements:
setproject odps.sql.type.system.odps2=true;
setproject odps.sql.decimal.odps2=true;
2. Create a RAM role that has the OSS access permissions and assign the RAM role to the RAM user. For more information, see STS authorization.
3. Run the LOAD command multiple times to load all data from OSS to the MaxCompute tables that you created, and execute the SELECT statement to query and verify the imported data. For more information about the LOAD command, see LOAD.
Note If the data import fails, submit a ticket to contact the MaxCompute team.
4. Verify that the data migrated to MaxCompute is the same as the data in Amazon Redshift. This verification is based on the number of tables, the number of rows, and the query results of typical jobs.
i. Log on to the Amazon Redshift console. In the upper-right corner, select Asia Pacific (Singapore) from the drop-down list. In the left-side navigation pane, click EDITOR. Execute the following statement to query data:
Prerequisites
Environment and data:
Alibaba Cloud account: A Resource Access Management (RAM) user and a RAM role are created. The RAM user is granted the read and write permissions on OSS buckets and the online migration permissions. For more information, see Create a RAM user and STS authorization.
Google Cloud Platform account: N/A
Region
Context
The following figure shows the process to migrate datasets from BigQuery to Alibaba Cloud MaxCompute.
No. Description
② Migrate data from Google Cloud Storage to an OSS bucket by using the Data Online Migration service of OSS.
③ Migrate data from the OSS bucket to a MaxCompute project in the same region, and then verify the integrity and accuracy of the migrated data.
2. Use the bq command-line tool to query the data definition language (DDL) scripts of tables in the TPC-DS datasets and download the scripts to an on-premises device. For more information, see Getting table metadata using INFORMATION_SCHEMA.
BigQuery does not support commands such as show create table to query the DDL scripts of tables. However, BigQuery allows you to use built-in user-defined functions (UDFs) to query the DDL scripts of the tables in a dataset. The following code shows examples of DDL scripts.
bq extract
--destination_format AVRO
--compression SNAPPY
tpcds_100gb.web_site
gs://bucket_name/web_site/web_site-*.avro.snappy;
1. Estimate the size and the number of files that you want to migrate. You can query the data size in the bucket of Google Cloud Storage by using the gsutil tool or by checking the storage logs. For more information, see Getting bucket information.
2. (Optional) If you do not have a bucket in OSS, log on to the OSS console and create a bucket to store the migrated data. For more information, see Create buckets.
3. (Optional) If you do not have a RAM user, create a RAM user and grant the required permissions to the RAM user.
i. Log on to the RAM console and create a RAM user. For more information, see Create a RAM user.
ii. Find the newly created RAM user, and click Add Permissions in the Actions column. On the page that appears, select AliyunOSSFullAccess and AliyunMGWFullAccess, and click OK > Complete. The AliyunOSSFullAccess policy authorizes the RAM user to read data from and write data to OSS buckets. The AliyunMGWFullAccess policy authorizes the RAM user to perform online migration jobs.
iii. In the left-side navigation pane, click Overview. In the Account Management section of the Overview page, click the link under RAM user logon, and use the credentials of the RAM user to log on to the Alibaba Cloud Management Console.
4. On Google Cloud Platform, create a user who uses the programmatic access method to access Google Cloud Storage. For more information, see IAM permissions for JSON methods.
i. Log on to the IAM & Admin console, and find a user who has permissions to access BigQuery. Click the icon in the Actions column, and then click Create key.
ii. In the dialog box that appears, select JSON, and click CREATE. Save the JSON file to an on-premises device and click CLOSE.
iii. In the Create service account wizard, click Select a role, and choose Cloud Storage > Storage Admin to authorize the IAM user to access Google Cloud Storage.
5. Create a source data address and a destination data address for online data migration.
i. Log on to the Alibaba Cloud Data Transport console. In the left-side navigation pane, click Data Address.
ii. (Optional) If you have not activated the Data Online Migration service, click Application in the dialog box that appears. On the Online Migration Beta Test page, specify the required information and click Submit.
Note On the Online Migration Beta Test page, if the Source Storage Provider options do not include Google Cloud Platform, select a source storage provider and specify the actual source storage provider in the Notes field.
iii. On the Data Address page, click Create Data Address. In the Create Data Address dialog box, set the required parameters and click OK. For more information about the parameters, see Migrate data.
Note For the Key File field, upload the JSON file that is downloaded in Step 4.
Note In the Access Key Id and Access Key Secret fields, enter the AccessKey ID and the AccessKey secret of the RAM user.
Job Config
Performance
Note In the Data Size and File Count fields, enter the size and the number of files that were migrated from Google Cloud Platform.
iii. The created migration job is automatically run. If Finished is displayed in the Job Status column, the migration job is complete.
iv. In the Operation column of the migration job, click Manage to view the migration report and confirm that all data is migrated.
v. Log on to the OSS console.
vi. In the left-side navigation pane, click Buckets. On the Buckets page, click the created bucket. In the left-side navigation pane of the bucket details page, choose Files > Files to view the migration results.
The LOAD statement supports Security Token Service (STS) and AccessKey authentication. If you use AccessKey authentication, you must provide the AccessKey ID and AccessKey secret of your account in plaintext. STS authentication is more secure because it does not expose the AccessKey information. In this section, STS authentication is used as an example to show how to migrate data.
1. On the Ad-Hoc Query tab of DataWorks or on the MaxCompute client (odpscmd), modify the DDL scripts of the tables in the BigQuery datasets, specify the MaxCompute data types, and then create a destination table that stores the migrated data in MaxCompute.
For more information about ad hoc queries, see Use the ad-hoc query feature to execute SQL statements (optional). The following code shows a configuration example:
web_manager STRING,
web_mkt_id BIGINT,
web_mkt_class STRING,
web_mkt_desc STRING,
web_market_manager STRING,
web_company_id BIGINT,
web_company_name STRING,
web_street_number STRING,
web_street_name STRING,
web_street_type STRING,
web_suite_number STRING,
web_city STRING,
web_county STRING,
web_state STRING,
web_zip STRING,
web_country STRING,
web_gmt_offset DOUBLE,
web_tax_percentage DOUBLE
);
The following table describes the mapping between BigQuery data types and MaxCompute data types.
BigQuery data type | MaxCompute data type
INT64 | BIGINT
FLOAT64 | DOUBLE
BOOL | BOOLEAN
STRING | STRING
BYTES | VARCHAR
DATE | DATE
STRUCT | STRUCT
GEOGRAPHY | STRING
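When you rewrite many DDL scripts, it can be convenient to apply this mapping programmatically. The following Python sketch encodes the table above and applies it to a hypothetical column definition; the function name and column names are illustrative, not part of any BigQuery or MaxCompute tool:

```python
# The BigQuery-to-MaxCompute type mapping from the table above.
BQ_TO_MAXCOMPUTE = {
    "INT64": "BIGINT",
    "FLOAT64": "DOUBLE",
    "BOOL": "BOOLEAN",
    "STRING": "STRING",
    "BYTES": "VARCHAR",
    "DATE": "DATE",
    "STRUCT": "STRUCT",
    "GEOGRAPHY": "STRING",
}

def convert_column(name: str, bq_type: str) -> str:
    """Rewrite one BigQuery column definition with the MaxCompute type."""
    return f"{name} {BQ_TO_MAXCOMPUTE[bq_type]}"

print(convert_column("web_mkt_id", "INT64"))        # web_mkt_id BIGINT
print(convert_column("web_gmt_offset", "FLOAT64"))  # web_gmt_offset DOUBLE
```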
2. (Optional) If you do not have a RAM role, create a RAM role that has the OSS access permissions and assign the role to the RAM user. For more information, see STS authorization.
3. Execute the LOAD statement to load all data from the OSS bucket to the MaxCompute table, and execute the SELECT statement to query and verify the imported data. You can load only one table at a time. To load multiple tables, you must execute the LOAD statement multiple times. For more information about the LOAD statement, see LOAD.
Note If the data import fails, submit a ticket to contact the MaxCompute team.
4. Verify that the data migrated to MaxCompute is the same as the data in BigQuery. This verification is based on the number of tables, the number of rows, and the query results of typical jobs.
Prerequisites
The odpscmd client is installed. For more information, see Install and configure the MaxCompute client.
Log data is stored in a local directory. The loghub.csv file is used as an example in this topic.
Context
Tunnel is a tool that can be used to upload large volumes of data to MaxCompute at a time. It is suitable for offline computing. For more information, see Usage notes.
Procedure
1. On the odpscmd client, run the following commands to create a table named loghub that is used to store the uploaded data:
-- Enable the new data types supported by MaxCompute V2.0. Commit the following command with the SQL statement that is used to create the table:
set odps.sql.type.system.odps2=true;
-- Create a table named loghub.
CREATE TABLE loghub
(
    client_ip STRING,
    receive_time STRING,
    topic STRING,
    id STRING,
    name VARCHAR(32),
    salenum STRING
);
where:
Note Wildcards or regular expressions are not supported for Tunnel-based data uploads.
Prerequisites
The following permissions are granted to the account authorized to access MaxCompute:
CreateInstance permission on MaxCompute projects
Context
DataHub is a platform that is designed to process streaming data. After data is uploaded to DataHub, the data is stored in a table for real-time processing. DataHub executes scheduled tasks within five minutes to synchronize the data to a MaxCompute table for offline computing.
To periodically archive streaming data in DataHub to MaxCompute, you only need to create and configure a DataConnector.
Procedure
1. On the odpscmd client, create a table that is used to store the data synchronized from DataHub. Example:
Note
A DataHub schema corresponds to a MaxCompute table. The field names, data types, and field sequence specified by the schema must be consistent with those of the MaxCompute table. You can create a DataConnector only if these three conditions are met.
You can migrate the topics of the TUPLE and BLOB types to MaxCompute tables.
A maximum of 20 topics can be created by default. If you require more topics, submit a ticket.
The owner of a DataHub topic or the Creator account has the permissions to manage a DataConnector. For example, you can create or delete a DataConnector.
By default, DataHub migrates data to MaxCompute tables at five-minute intervals or when the amount of data reaches 60 MB. Sync Offset indicates the number of migrated data entries.
6. Execute the following statement to check whether the log data is migrated to MaxCompute:
Context
3. Data development
3.1. Convert data types among STRING, TIMESTAMP, and DATETIME
This topic describes how to convert data types among STRING, TIMESTAMP, and DATETIME. This topic provides multiple date conversion methods that you can use to improve your business efficiency.
STRING to TIMESTAMP
Scenarios
Convert a date value of the STRING type to the TIMESTAMP type. The date value of the TIMESTAMP type is in the yyyy-mm-dd hh:mi:ss.ff3 format.
Limits
Date values of the STRING type must be at least accurate to the second and must be specified in the yyyy-mm-dd hh:mi:ss format.
Examples
Example 1: Use the CAST function to convert the string 2009-07-01 16:09:00 to the TIMESTAMP type. Sample statement:
select cast('2009-07-01 16:09:00' as timestamp);
-- The return value is NULL because the input value is invalid. The date value must be in the yyyy-mm-dd hh:mi:ss format.
select cast('2009-07-01' as timestamp);
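The accuracy rule can be illustrated outside MaxCompute. The following Python sketch is a local analog, not MaxCompute code: a string parses only when it is accurate to the second, and a date-only string fails, mirroring the NULL return of CAST above:

```python
# Local analog of CAST(string AS TIMESTAMP): return a datetime for a
# yyyy-mm-dd hh:mi:ss string, or None (the analog of NULL) otherwise.
from datetime import datetime

def cast_to_timestamp(value: str):
    """Parse a string accurate to the second; return None for invalid input."""
    try:
        return datetime.strptime(value, "%Y-%m-%d %H:%M:%S")
    except ValueError:
        return None

print(cast_to_timestamp("2009-07-01 16:09:00"))  # parses successfully
print(cast_to_timestamp("2009-07-01"))           # None, like the NULL above
```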
STRING to DATETIME
Scenarios
Convert a date value of the STRING type to the DATETIME type. The date value of the DATETIME type is in the yyyy-mm-dd hh:mi:ss format.
Limits
If you use the CAST function, the date value of the STRING type must be specified in the yyyy-mm-dd hh:mi:ss format.
If you use the TO_DATE function, you must set the value of the format parameter to yyyy-mm-dd hh:mi:ss.
Examples
Example 1: Use the CAST function to convert the string 2009-07-01 16:09:00 to the DATETIME type. Sample statement:
select cast('2009-07-01 16:09:00' as datetime);
Example 2: Use the TO_DATE function and specify the format parameter to convert the string 2009-07-01 16:09:00 to the DATETIME type. Sample statement:
select to_date('2009-07-01 16:09:00','yyyy-mm-dd hh:mi:ss');
-- The return value is NULL because the input value is invalid. The date value must be in the yyyy-mm-dd hh:mi:ss format.
select cast('2009-07-01' as datetime);
-- The return value is NULL because the input value is invalid. The date value must be in the yyyy-mm-dd hh:mi:ss format.
select to_date('2009-07-01','yyyy-mm-dd hh:mi:ss');
TIMESTAMP to STRING
Scenarios
Convert a date value of the TIMESTAMP type to the STRING type. The date value of the TIMESTAMP type is in the yyyy-mm-dd hh:mi:ss.ff3 format.
Examples
Example 1: Use the CAST function to convert the TIMESTAMP value 2009-07-01 16:09:00 to the STRING type. To construct data of the TIMESTAMP type, you must use the CAST function twice. Sample statement:
Example 2: Use the TO_CHAR function to convert the TIMESTAMP value 2009-07-01 16:09:00 to the STRING type. To construct data of the TIMESTAMP type, you must use the CAST function once. Sample statement:
TIMESTAMP to DATETIME
Scenarios
Convert a date value of the TIMESTAMP type to the DATETIME type. Before the conversion, the date value of the TIMESTAMP type is in the yyyy-mm-dd hh:mi:ss.ff3 format. After the conversion, the date value of the DATETIME type is in the yyyy-mm-dd hh:mi:ss format.
Limits
If you use the TO_DATE function, you must set the value of the format parameter to yyyy-mm-dd hh:mi:ss.
Examples
Example 1: Use the CAST function to convert the TIMESTAMP value 2009-07-01 16:09:00 to the DATETIME type. To construct data of the TIMESTAMP type, you must use the CAST function twice. Sample statement:
Example 2: Use the TO_DATE function and specify the format parameter to convert the TIMESTAMP value 2009-07-01 16:09:00 to the DATETIME type. To construct data of the TIMESTAMP type, you must use the CAST function once. Sample statement:
DATETIME to TIMESTAMP
Scenarios
Convert a date value of the DATETIME type to the TIMESTAMP type. Before the conversion, the date value of the DATETIME type is in the yyyy-mm-dd hh:mi:ss format. After the conversion, the date value of the TIMESTAMP type is in the yyyy-mm-dd hh:mi:ss.ff3 format.
Examples
Use the CAST function to convert a DATETIME value to the TIMESTAMP type. To construct data of the DATETIME type, you must use the GETDATE function once. Sample statement:
DATETIME to STRING
Scenarios
Convert a date value of the DATETIME type to the STRING type. The date value of the DATETIME type is in the yyyy-mm-dd hh:mi:ss format.
Examples
Example 1: Use the CAST function to convert a DATETIME value to the STRING type. To construct data of the DATETIME type, you must use the GETDATE function once. Sample statement:
Example 2: Use the TO_CHAR function to convert a DATETIME value to the STRING type in the specified format. To construct data of the DATETIME type, you must use the GETDATE function once. Sample statements:
Prerequisites
Make sure that the following requirements are met:
The MaxCompute client is installed. For more information about how to install and configure the MaxCompute client, see Install and configure the MaxCompute client.
MaxCompute Studio is installed and connected to a MaxCompute project. A MaxCompute Java module is created. For more information, see Install MaxCompute Studio, Manage project connections, and Create a MaxCompute Java module.
Context
To convert IPv4 or IPv6 addresses into geolocations, you must download the IP address library file that includes the IP addresses, and upload the file to the MaxCompute project as a resource. After you develop and create a MaxCompute UDF based on the IP address library file, you can call the UDF in SQL statements to convert IP addresses into geolocations.
Usage notes
The IP address library file provided in this topic is for reference only. You must maintain the IP address library file based on your business requirements.
Procedure
To convert IPv4 or IPv6 addresses into geolocations by using a MaxCompute UDF, perform the following steps:
2. Start the MaxCompute client and go to the MaxCompute project to which you want to upload the ipv4.txt and ipv6.txt files.
3. Run the add file command to upload the two files as file resources to the MaxCompute project. Sample commands:
For more information about how to add resources, see Add resources.
ii. In the New Java Class dialog box, enter a class name, press Enter, and then enter the code in the code editor.
You must create three Java classes. The following sections show the names and code of these classes. You can reuse the code without modification.
IpUtils
IpUt ils
package com.aliyun.odps.udf.utils;
import java.math.BigInteger;
import java.net.Inet4Address;
import java.net.Inet6Address;
import java.net.InetAddress;
import java.net.UnknownHostException;
import java.util.Arrays;
public class IpUtils {
/**
* Convert the data type of IP addresses from STRING to LONG.
*
*
* @param ipInString
* IP addresses of the STRING type.
* @return Return the IP addresses of the LONG type.
*/
public static long StringToLong(String ipInString) {
ipInString = ipInString.replace(" ", "");
byte[] bytes;
if (ipInString.contains(":"))
bytes = ipv6ToBytes(ipInString);
else
bytes = ipv4ToBytes(ipInString);
BigInteger bigInt = new BigInteger(bytes);
// System.out.println(bigInt.toString());
return bigInt.longValue();
}
/**
* Convert the data type of IP addresses from STRING to LONG.
*
* @param ipInString
* IP addresses of the STRING type.
* @return Return the IP addresses of the STRING type that is converted from
BIGINT.
*/
public static String StringToBigIntString(String ipInString) {
ipInString = ipInString.replace(" ", "");
byte[] bytes;
if (ipInString.contains(":"))
bytes = ipv6ToBytes(ipInString);
else
bytes = ipv4ToBytes(ipInString);
BigInteger bigInt = new BigInteger(bytes);
return bigInt.toString();
}
/**
* Convert the data type of IP addresses from BIGINT to STRING.
*
* @param ipInBigInt
* IP addresses of the BIGINT type.
* @return Return the IP addresses of the STRING type.
*/
public static String BigIntToString(BigInteger ipInBigInt) {
byte[] bytes = ipInBigInt.toByteArray();
byte[] unsignedBytes = Arrays.copyOfRange(bytes, 1, bytes.length);
// Remove the sign bit.
try {
String ip = InetAddress.getByAddress(unsignedBytes).toString();
return ip.substring(ip.indexOf('/') + 1).trim();
} catch (UnknownHostException e) {
throw new RuntimeException(e);
}
}
    /**
     * Convert an IPv6 address into a signed 17-byte array.
     */
}
    /**
     * @param ipAddress IPv4 or IPv6 addresses of the STRING type.
     * @return 4: IPv4, 6: IPv6, 0: invalid IP address.
     * @throws Exception
     */
    public static int isIpV4OrV6(String ipAddress) throws Exception {
        InetAddress address = InetAddress.getByName(ipAddress);
if (address instanceof Inet4Address)
return 4;
else if (address instanceof Inet6Address)
return 6;
return 0;
}
/*
* Check whether the IP address belongs to a specific IP section.
*
* ipSection The IP sections that are separated by hyphens (-).
*
* The IP address to check.
*/
public static boolean ipExistsInRange(String ip, String ipSection) {
ipSection = ipSection.trim();
ip = ip.trim();
int idx = ipSection.indexOf('-');
String beginIP = ipSection.substring(0, idx);
String endIP = ipSection.substring(idx + 1);
return getIp2long(beginIP) <= getIp2long(ip)
&& getIp2long(ip) <= getIp2long(endIP);
}
public static long getIp2long(String ip) {
ip = ip.trim();
String[] ips = ip.split("\\.");
long ip2long = 0L;
for (int i = 0; i < 4; ++i) {
ip2long = ip2long << 8 | Integer.parseInt(ips[i]);
}
return ip2long;
}
public static long getIp2long2(String ip) {
ip = ip.trim();
String[] ips = ip.split("\\.");
long ip1 = Integer.parseInt(ips[0]);
long ip2 = Integer.parseInt(ips[1]);
long ip3 = Integer.parseInt(ips[2]);
long ip4 = Integer.parseInt(ips[3]);
long ip2long = 1L * ip1 * 256 * 256 * 256 + ip2 * 256 * 256 + ip3 * 256 + ip4;
return ip2long;
}
public static void main(String[] args) {
System.out.println(StringToLong("2002:7af3:f3be:ffff:ffff:ffff:ffff:ffff"));
System.out.println(StringToLong("54.38.72.63"));
}
}
private class Invalid{
private Invalid()
{
}
}
}
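The Java helpers above (string-to-integer conversion, version detection, range checks) can be cross-checked with Python's standard ipaddress module. This is an illustrative sketch, not part of the UDF; the function names below are our own, and unlike InetAddress.getByName, ipaddress does not resolve hostnames.

```python
import ipaddress

def ip_to_int(ip: str) -> int:
    # Works for both IPv4 and IPv6, like StringToBigIntString above.
    return int(ipaddress.ip_address(ip.strip()))

def ip_version(ip: str) -> int:
    # 4: IPv4, 6: IPv6, 0: invalid, mirroring isIpV4OrV6.
    try:
        return ipaddress.ip_address(ip).version
    except ValueError:
        return 0

def ip_in_range(ip: str, section: str) -> bool:
    # section is "beginIP-endIP", as in ipExistsInRange.
    begin, end = section.split('-')
    return ip_to_int(begin) <= ip_to_int(ip) <= ip_to_int(end)

print(ip_to_int("54.38.72.63"))        # 908478527
print(ip_version("2001:250:80b::"))    # 6
print(ip_in_range("10.0.0.5", "10.0.0.0-10.0.0.255"))  # True
```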
IpV4Obj
package com.aliyun.odps.udf.objects;
public class IpV4Obj {
public long startIp ;
public long endIp ;
public String city;
public String province;
public IpV4Obj(long startIp, long endIp, String city, String province) {
this.startIp = startIp;
this.endIp = endIp;
this.city = city;
this.province = province;
}
@Override
public String toString() {
return "IpV4Obj{" +
"startIp=" + startIp +
", endIp=" + endIp +
", city='" + city + '\'' +
", province='" + province + '\'' +
'}';
}
public void setStartIp(long startIp) {
this.startIp = startIp;
}
public void setEndIp(long endIp) {
this.endIp = endIp;
}
public void setCity(String city) {
this.city = city;
}
public void setProvince(String province) {
this.province = province;
}
public long getStartIp() {
return startIp;
}
public long getEndIp() {
return endIp;
}
public String getCity() {
return city;
}
public String getProvince() {
return province;
}
}
IpV6Obj
package com.aliyun.odps.udf.objects;
public class IpV6Obj {
public String startIp ;
public String endIp ;
public String city;
public String province;
public String getStartIp() {
return startIp;
}
@Override
public String toString() {
return "IpV6Obj{" +
"startIp='" + startIp + '\'' +
", endIp='" + endIp + '\'' +
", city='" + city + '\'' +
", province='" + province + '\'' +
'}';
}
public IpV6Obj(String startIp, String endIp, String city, String province) {
this.startIp = startIp;
this.endIp = endIp;
this.city = city;
this.province = province;
}
public void setStartIp(String startIp) {
this.startIp = startIp;
}
public String getEndIp() {
return endIp;
}
public void setEndIp(String endIp) {
this.endIp = endIp;
}
public String getCity() {
return city;
}
public void setCity(String city) {
this.city = city;
}
public String getProvince() {
return province;
}
public void setProvince(String province) {
this.province = province;
}
}
i. In the left-side navigation pane of the Project tab, choose src > main > java, right-click java, and then choose New > MaxCompute Java.
ii. In the Create new MaxCompute java class dialog box, click UDF and enter a class name in the Name field. Then, press Enter and enter the code in the code editor.
The following code shows how to write a UDF based on a Java class named IpLocation. You can reuse the code without modification.
package com.aliyun.odps.udf.udfFunction;
import com.aliyun.odps.udf.ExecutionContext;
import com.aliyun.odps.udf.UDF;
import com.aliyun.odps.udf.UDFException;
import com.aliyun.odps.udf.utils.IpUtils;
import com.aliyun.odps.udf.objects.IpV4Obj;
import com.aliyun.odps.udf.objects.IpV6Obj;
import java.io.*;
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.stream.Collectors;
public class IpLocation extends UDF {
public static IpV4Obj[] ipV4ObjsArray;
public static IpV6Obj[] ipV6ObjsArray;
public IpLocation() {
super();
}
@Override
public void setup(ExecutionContext ctx) throws UDFException, IOException {
    // IPv4
    if (ipV4ObjsArray == null) {
        BufferedInputStream bufferedInputStream = ctx.readResourceFileAsStream("ipv4.txt");
        BufferedReader br = new BufferedReader(new InputStreamReader(bufferedInputStream));
        ArrayList<IpV4Obj> ipV4ObjArrayList = new ArrayList<>();
        String line = null;
        while ((line = br.readLine()) != null) {
            String[] f = line.split("\\|", -1);
            if (f.length >= 5) {
                long startIp = IpUtils.StringToLong(f[0]);
                long endIp = IpUtils.StringToLong(f[1]);
                String city = f[3];
                String province = f[4];
                ipV4ObjArrayList.add(new IpV4Obj(startIp, endIp, city, province));
            }
        }
        br.close();
        List<IpV4Obj> collect = ipV4ObjArrayList.stream()
                .sorted(Comparator.comparing(IpV4Obj::getStartIp))
                .collect(Collectors.toList());
        ipV4ObjsArray = collect.toArray(new IpV4Obj[0]);
    }
    // IPv6. The list is populated the same way as the IPv4 branch, assuming
    // ipv6.txt uses the same pipe-delimited field layout as ipv4.txt.
    if (ipV6ObjsArray == null) {
        BufferedInputStream bufferedInputStream = ctx.readResourceFileAsStream("ipv6.txt");
        BufferedReader br = new BufferedReader(new InputStreamReader(bufferedInputStream));
        ArrayList<IpV6Obj> ipV6ObjArrayList = new ArrayList<>();
        String line = null;
        while ((line = br.readLine()) != null) {
            String[] f = line.split("\\|", -1);
            if (f.length >= 5) {
                ipV6ObjArrayList.add(new IpV6Obj(f[0], f[1], f[3], f[4]));
            }
        }
        br.close();
        List<IpV6Obj> sorted = ipV6ObjArrayList.stream()
                .sorted(Comparator.comparing(IpV6Obj::getStartIp))
                .collect(Collectors.toList());
        ipV6ObjsArray = sorted.toArray(new IpV6Obj[0]);
    }
}
@Override
public void close() throws UDFException, IOException {
super.close();
}
private static int binarySearch(IpV4Obj[] array, long ip) {
    int low = 0;
    int high = array.length - 1;
    while (low <= high) {
        int middle = (low + high) / 2;
        if ((ip >= array[middle].startIp) && (ip <= array[middle].endIp)) {
            return middle;
        }
        if (ip < array[middle].startIp) {
            high = middle - 1;
        } else {
            low = middle + 1;
        }
    }
    return -1;
}
private static int binarySearchIPV6(IpV6Obj[] array, String ip) {
    int low = 0;
    int high = array.length - 1;
    while (low <= high) {
        int middle = (low + high) / 2;
        if ((ip.compareTo(array[middle].startIp) >= 0) && (ip.compareTo(array[middle].endIp) <= 0)) {
            return middle;
        }
        if (ip.compareTo(array[middle].startIp) < 0) {
            high = middle - 1;
        } else {
            low = middle + 1;
        }
    }
    return -1;
}
private class Invalid{
private Invalid()
{
}
}
}
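The lookup strategy used by binarySearch above can be prototyped in a few lines of Python: sort the ranges by start value, then binary-search for the interval that contains the queried value. A sketch with made-up ranges:

```python
def find_range(ranges, value):
    """ranges: list of (start, end, payload) tuples, sorted by start, non-overlapping."""
    low, high = 0, len(ranges) - 1
    while low <= high:
        mid = (low + high) // 2
        start, end, payload = ranges[mid]
        if start <= value <= end:
            return payload
        if value < start:
            high = mid - 1
        else:
            low = mid + 1
    return None  # the Java version returns -1 when no interval matches

ranges = [(0, 9, "a"), (10, 19, "b"), (20, 29, "c")]
print(find_range(ranges, 15))  # b
print(find_range(ranges, 42))  # None
```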
3. Debug the MaxCompute UDF to check whether the code runs as expected.
For more information about how to debug UDFs, see Perform a local run to debug the UDF.
i. Right-click the MaxCompute UDF script that you wrote and select Run.
ii. In the Run/Debug Configurations dialog box, configure the required parameters and click OK, as shown in the following figure.
If no error is returned, the code runs successfully and you can proceed with subsequent steps. If an error is reported, troubleshoot based on the error information displayed in IntelliJ IDEA.
Note: The parameter settings in the preceding figure are provided for reference only.
2. In the Package a jar, submit resource and register function dialog box, configure the parameters.
For more information about the parameters, see Package a Java program, upload the package, and create a MaxCompute UDF.
Extra resources: You must select the IP address library files ipv4.txt and ipv6.txt that you uploaded in Step 1. In this topic, the created function is named ipv4_ipv6_aton.
select ipv4_ipv6_aton('116.11.34.15');
select ipv4_ipv6_aton('2001:0250:080b:0:0:0:0:0');
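Assuming ipv4_ipv6_aton performs the address-to-number conversion implemented by the helpers above, the expected outputs of the two test queries can be computed locally with Python's ipaddress module (a cross-check, not the UDF itself):

```python
import ipaddress

# The BIGINT value the UDF should return for each test address.
for ip in ('116.11.34.15', '2001:0250:080b:0:0:0:0:0'):
    print(ip, '->', int(ipaddress.ip_address(ip)))
```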
Prerequisites
Make sure that the following operations are performed:
1. Install MaxCompute Studio.
2. Establish a connection to a MaxCompute project.
3. Create a MaxCompute Java module.
Procedure
1. Write a UDF in Java.
i. In the left-side navigation pane of the Project tab, choose src > main > java, right-click java, and then choose New > MaxCompute Java.
ii. In the Create new MaxCompute java class dialog box, click UDF, enter a class name in the Name field, and then press Enter. In this example, the class is named Lower.
Sample code:
package <packagename>;
import com.aliyun.odps.udf.UDF;
public final class Lower extends UDF {
public String evaluate(String s) {
if (s == null) {
return null;
}
return s.toLowerCase();
}
}
ii. In the Run/Debug Configurations dialog box, configure the required parameters.
MaxCompute project: the MaxCompute project in which the UDF runs. To perform a local run, select local from the drop-down list.
MaxCompute table: the name of the MaxCompute table in which the UDF runs.
Table columns: the columns of the MaxCompute table in which the UDF runs.
iii. Click OK. The following figure shows the return result.
Sample statement:
select Lower_test('ALIYUN');
The following figure shows the result that the preceding statement returns. The result indicates that the Java UDF Lower_test runs as expected.
Prerequisites
Context
To query the geolocation of an IP address, you can send an HTTP request to call the API provided by the IP address geolocation library of Taobao. After the API is called, a string that indicates the geolocation of the IP address is returned. The following figure shows an example of a returned string.
You cannot send HTTP requests in MaxCompute. You can query geolocations of IP addresses in MaxCompute by using one of the following methods:
Execute SQL statements to download data in the IP address geolocation library to your on-premises machine. Then, send HTTP requests to query the geolocation information.
Note: This method is inefficient. The query frequency must be less than 10 queries per second (QPS). Otherwise, query requests are rejected by the IP address geolocation library of Taobao.
Download the IP address geolocation library to your on-premises machine. Then, query the geolocation information in the library.
Note: This method is inefficient and is not suitable for scenarios in which data is analyzed by using data warehouses.
Maintain an IP address geolocation library and upload it to MaxCompute on a regular basis. Then, query geolocations of IP addresses in the library.
Note: This method is efficient, but you must maintain the IP address geolocation library on a regular basis.
The following content describes the data in the sample IP address geolocation library.
Note: You can also use your own IP address geolocation library.
2. Run the following Tunnel command to upload data in the ipdata.txt.utf8 file to the ipresource table:
You can execute the following statement to check whether the data in the file is uploaded:
3. Execute the following SQL statement to obtain the first 10 data records in the ipresource table:
Create a UDF
1.
2. Create a Python resource.
i.
ii. In the Create Resource dialog box, enter a resource name, select Upload to MaxCompute, and then click Create.
iii. Enter the following code in the Python resource and click the icon.
Note: If multiple MaxCompute engines are bound to the workspace, select one of the engines from the Engine Instance drop-down list.
Parameter Description
ii. In the Commit Node dialog box, enter your comments in the Change description field.
iii. Click OK.
4.
5.
Prerequisites
The MaxCompute client is installed. For more information, see Install and configure the MaxCompute client.
Procedure
1. Run the following command on the MaxCompute client to upload a JAR package that exceeds 10 MB:
2. Resources that you upload on the MaxCompute client are not displayed on the DataStudio page of the DataWorks console. You must run the following command to check whether the resource is uploaded:
--View resources.
list resources;
3. Reduce the size of the JAR package. DataWorks runs a MapReduce job on the computer where the MaxCompute client resides. Therefore, you can submit only the Main function to DataWorks to run a MapReduce job.
jar
-resources test_mr.jar,test_ab.jar -- A file can be referenced after it is registered on the MaxCompute client.
-classpath test_mr.jar -- Reduce the size of a JAR package by using the following method: Submit only the Mapper and Reducer that contain the Main function on the gateway. You do not need to submit third-party dependencies. You can store the resources in the wc_in directory of the MaxCompute client.
Prerequisites
The MaxCompute client is installed. For more information, see Install and configure the MaxCompute client.
Context
You can use one of the following methods to manage the access permissions of users:
Use packages to achieve fine-grained access control.
This method is used for data sharing and resource authorization across projects. After you assign the developer role to a user by using a package, the user has full permissions on all objects in the package. This may cause uncontrollable risks. For more information, see Cross-project resource access based on packages.
The following figure shows the permissions of the developer role that is defined in DataWorks.
By default, the developer role has full permissions on all packages, functions, resources, and tables in a workspace. This does not meet the requirements for permission management.
The following figure shows the permissions that are granted to a RAM user that is assigned the developer role in DataWorks.
You cannot grant a specified user the access permissions on a specific UDF by using package-based authorization or by assigning the developer role in DataWorks to the user. For example, if you assign the developer role to the RAM user named [email protected]:ramtest, the RAM user has full permissions on all objects in the current workspace. For more information, see Authorize users.
On the MaxCompute Management page in the DataWorks console, you can manage the access permissions of custom user roles. On this page, you can grant permissions on a table or a project. You cannot grant permissions on resources or UDFs.
Note: For more information about MaxCompute projects for DataWorks workspaces, see Configure MaxCompute.
Role and project policies allow you to grant a specified user the permissions on specific resources.
Note: To ensure security, we recommend that you verify role and project policies in a test workspace.
You can use a role policy and a project policy to grant access permissions on a specific UDF to a specified user.
To prevent a user from accessing a specific resource in a workspace, assign the developer role to the user in the DataWorks console and configure a role policy for the user to deny access requests for the resource on the MaxCompute client.
To allow a user to access a specific resource, assign the developer role to the user in the DataWorks console and configure a project policy for the user to allow access requests for the resource on the MaxCompute client.
Procedure
1. Create a role that has no permission to access a UDF named getregion by default.
i. On the MaxCompute client, run the following command to create a role named denyudfrole:
ii. Create a role policy file that contains the following content:
{
    "Version": "1",
    "Statement": [{
        "Effect": "Deny",
        "Action": ["odps:Read", "odps:List"],
        "Resource": "acs:odps:*:projects/sz_mc/resources/getaddr.jar"
    },
    {
        "Effect": "Deny",
        "Action": ["odps:Read", "odps:List"],
        "Resource": "acs:odps:*:projects/sz_mc/registration/functions/getregion"
    }]
}
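Role policies must be well-formed JSON: every key, including "Statement", needs a colon before its value, and the statement list must be properly bracketed. A quick, illustrative way to validate a Deny policy file locally before applying it:

```python
import json

# The Deny policy from the step above, embedded as a string for the check.
policy = json.loads("""
{
    "Version": "1",
    "Statement": [{
        "Effect": "Deny",
        "Action": ["odps:Read", "odps:List"],
        "Resource": "acs:odps:*:projects/sz_mc/resources/getaddr.jar"
    }, {
        "Effect": "Deny",
        "Action": ["odps:Read", "odps:List"],
        "Resource": "acs:odps:*:projects/sz_mc/registration/functions/getregion"
    }]
}
""")
print([s["Effect"] for s in policy["Statement"]])  # ['Deny', 'Deny']
```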
ii. Run the show grants; command to check the permissions of the current logon user.
The result indicates that the RAM user has the following two roles: role_project_dev and denyudfrole. role_project_dev is the default developer role in DataWorks.
iii. Check the permissions of the RAM user on the getregion UDF and its dependencies.
The result indicates that the RAM user with the developer role in DataWorks does not have read permissions on the getregion UDF. You can perform the next step to configure a project policy to ensure that only a specified RAM user can access the UDF.
{
    "Version": "1",
    "Statement": [{
        "Effect": "Allow",
        "Principal": "[email protected]:yangyitest",
        "Action": ["odps:Read", "odps:List", "odps:Select"],
        "Resource": "acs:odps:*:projects/sz_mc/resources/getaddr.jar"
    },
    {
        "Effect": "Allow",
        "Principal": "[email protected]:yangyitest",
        "Action": ["odps:Read", "odps:List", "odps:Select"],
        "Resource": "acs:odps:*:projects/sz_mc/registration/functions/getregion"
    }]
}
get policy;
iv. Run the whoami; command to check the current logon user. Then, run the show grants; command to check the permissions of the user.
v. Run an SQL job to check whether only the specified RAM user can access the specific UDF and its dependencies.
The following result indicates that the specified RAM user can access the specific UDF:
The following result indicates that the specified RAM user can access the dependencies of the UDF:
Prerequisites
A DataWorks workspace is created. In this example, a workspace in basic mode is used. The workspace is associated with multiple MaxCompute compute engines. For more information, see Create a workspace.
The open source Jieba package is downloaded from GitHub.
Context
PyODPS nodes are integrated with MaxCompute SDK for Python. You can directly edit Python code and use MaxCompute SDK for Python on PyODPS nodes of DataWorks. For more information about PyODPS nodes, see Create a PyODPS 2 node.
This topic describes how to use a PyODPS node to segment Chinese text based on Jieba.
Notice: Sample code in this topic is for reference only. We recommend that you do not use the code in your production environment.
v. In the Create Workflow dialog box, specify the Workflow Name and Description parameters. Then, click Create.
Notice: The workflow name must be 1 to 128 characters in length, and can contain letters, digits, underscores (_), and periods (.).
Parameter Description
Engine Type: Select the compute engine where the resource resides from the drop-down list.
Note: If only one instance is bound to your workspace, this parameter is not displayed.
Engine Instance: The name of the MaxCompute engine to which the task is bound.
iv. In the Commit dialog box, specify the Change description parameter and click Commit.
3. Create a table that is used to store test data.
i. Click the workflow that you created, expand MaxCompute, right-click Table, and then select Create Table.
ii. In the Create Table dialog box, specify the Table Name parameter and click Create.
iii. Click DDL Statement and enter the following DDL statement to create a table:
Note: The table in this example contains two columns. You can segment text in one column during data development.
Note: In this example, only the text in the chinese column of the test data is segmented. Therefore, the result table contains only one column.
ii. In the Data Import Wizard dialog box, enter the name of the test table jieba_test to which you want to import data, select the table, and then click Next.
iii. Click Browse, upload the jieba_test.csv file from your on-premises machine, and then click Next.
iv. Select By Name and click Import Data.
7. Create a PyODPS 2 node.
i. Click the workflow, expand MaxCompute, right-click Data Analytics, and then choose Create > PyODPS 2.
ii. In the Create Node dialog box, specify the Node Name and Location parameters and click Commit.
Note:
The node name must be 1 to 128 characters in length, and can contain letters, digits, underscores (_), and periods (.).
In this example, the Node Name parameter is set to word_split.
def test(input_var):
    import jieba
    import sys
    reload(sys)
    sys.setdefaultencoding('utf-8')
    result = jieba.cut(input_var, cut_all=False)
    return "/ ".join(result)
hints = {
    'odps.isolation.session.enable': True
}
libraries = ['jieba-master.zip']  # Reference the jieba-master.zip package.
iris = o.get_table('jieba_test').to_df()  # Reference the data in the jieba_test table.
example = iris.chinese.map(test).execute(hints=hints, libraries=libraries)
print(example)  # Display the text segmentation result. The result is of the MAP type.
abci = list(example)  # Convert the text segmentation result into the LIST type.
i = 0
for i in range(i, len(abci)):
    pq = str(abci[i])
    o.write_table('jieba_result', [pq])  # Write the data records to the jieba_result table one by one.
    i += 1
else:
    print("done")
v. Click the icon in the toolbar. In the Parameters dialog box, select a resource group from the Resource Group drop-down list and click OK.
Note: For more information about resource groups for scheduling, see Overview.
vi. View the execution result of the Jieba segmentation program on the Runtime Log tab in the lower part of the page.
8. Create and run an ODPS SQL node.
i. Click the workflow, expand MaxCompute, right-click Data Analytics, and then choose Create > ODPS SQL.
ii. In the Create Node dialog box, specify the Node Name and Location parameters and click Commit.
Note: The node name must be 1 to 128 characters in length and can contain letters, digits, underscores (_), and periods (.).
v. Click the icon in the toolbar. In the Parameters dialog box, select a resource group from the Resource Group drop-down list and click OK.
Note: For more information about resource groups for scheduling, see Overview.
vi. In the Expense Estimate dialog box, check the estimated cost and click Run.
vii. View the execution result on the Runtime Log tab in the lower part of the page.
You can use a PyODPS user-defined function (UDF) to read table or file resources that are uploaded to MaxCompute. In this case, you must write the UDF as a closure function or a callable class. If you need to reference complex UDFs, you can create a MaxCompute function in DataWorks. For more information, see Register a MaxCompute function.
In this topic, a closure function is used to reference the custom dictionary file key_words.txt that is uploaded to MaxCompute.
1. Click the workflow, expand MaxCompute, right-click Resource, and then choose Create > File.
2. In the Create Resource dialog box, configure the parameters and click Create. The following table describes the parameters.
Parameter Description
Engine Type: Select the compute engine where the resource resides from the drop-down list.
Note: If only one instance is bound to your workspace, this parameter is not displayed.
Engine Instance: The name of the MaxCompute engine to which the task is bound.
Location: The folder that is used to store the resource. The default value is the path of the current folder. You can modify the path based on your business requirements.
File Type: Note: If you want to upload a dictionary file from the on-premises machine to DataWorks, the file must be encoded in UTF-8.
Upload: Click Upload, select the key_words.txt file from your on-premises machine, and then click Open.
Resource Name: The name of the resource. The resource name can contain only letters, digits, periods (.), underscores (_), and hyphens (-).
The frequency and part of speech are optional. Separate every two parts with a space. The order of the three parts cannot be adjusted.
def test(resources):
    import jieba
    import sys
    reload(sys)
    sys.setdefaultencoding('utf-8')
    fileobj = resources[0]
    def h(input_var):  # Use the nested function h() to load the dictionary and segment text.
        import jieba
        jieba.load_userdict(fileobj)
        result = jieba.cut(input_var, cut_all=False)
        return "/ ".join(result)
    return h
hints = {
    'odps.isolation.session.enable': True
}
libraries = ['jieba-master.zip']  # Reference the jieba-master.zip package.
iris = o.get_table('jieba_test').to_df()  # Reference the data in the jieba_test table.
6. Run the code and compare the results before and after the custom dictionary is referenced.
Background information
PyODPS provides multiple methods to download data to a local directory. You can download data to a local directory for processing and then upload the data to MaxCompute. However, local data processing is inefficient because the massively parallel processing capability of MaxCompute cannot be used if you download data to a local directory. If the data volume is greater than 10 MB, we recommend that you do not download data to a local directory for processing. You can use one of the following methods to download data to a local directory:
Use the head, tail, or to_pandas method. In most cases, use the head or tail method to obtain small volumes of data. If you want to obtain large volumes of data, use the persist method to store data in a MaxCompute table. For more information, see Execution.
Use the open_reader method. You can execute open_reader on a table or an SQL instance to obtain the data. If you need to process large volumes of data, we recommend that you use PyODPS DataFrame or MaxCompute SQL. A PyODPS DataFrame object is created based on a MaxCompute table. This method provides higher efficiency than local data processing.
Sample code
Convert a JSON string to multiple rows. Each row consists of a key and its value.
For local testing, use the head method to obtain small volumes of data.
In [12]: df.head(2)
json
0 {"a": 1, "b": 2}
1 {"c": 4, "b": 3}
In [14]: from odps.df import output
In [16]: @output(['k', 'v'], ['string', 'int'])
...: def h(row):
...: import json
...: for k, v in json.loads(row.json).items():
...: yield k, v
...:
In [21]: df.apply(h, axis=1).head(4)
k v
0 a 1
1 b 2
2 c 4
3 b 3
For online production, use the persist method to store large volumes of data in a MaxCompute table.
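The row expansion performed by the @output-decorated function h() above can be tried locally with plain Python, independent of PyODPS:

```python
import json

def expand_json_rows(rows):
    # One output row per (key, value) pair, as in h() above.
    for row in rows:
        for k, v in json.loads(row).items():
            yield k, v

data = ['{"a": 1, "b": 2}', '{"c": 4, "b": 3}']
print(list(expand_json_rows(data)))  # [('a', 1), ('b', 2), ('c', 4), ('b', 3)]
```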
4. Compute optimization
4.1. Optimize SQL statements
This topic describes common scenarios where you can optimize SQL statements to achieve better performance.
Skewed joins
An imbalance of work may occur when you join tables based on a key that is not evenly distributed. For example, execute the following statement to join a large table named A and a small table named B:
Copy the Logview URL of the query and open it in a browser to go to the Logview page. Double-click the Job Scheduler job that performs the JOIN operation. On the Long-tails tab, you can see that long tails exist, as shown in the following figure. This indicates that data is skewed.
To optimize the preceding statement, you can use one of the following methods:
Use a MAPJOIN statement. Table B is a small table that does not exceed 512 MB in size. In this case, you can replace the preceding statement with the following statement:
Handle the skewed key separately. If data skew occurs because a large number of null key values exist in both tables, you must filter out these null values or generate random numbers to replace them before you perform the JOIN operation. For example, you can replace the preceding statement with the following statement:
The following example describes how to identify the key values that cause data skew:
-- Data skew leads to an imbalance of work when the following statement is executed:
select * from a join b on a.key=b.key;
-- Execute the following statement to view the distribution of key values and identify the key values that cause data skew:
select left.key, left.cnt * right.cnt from
(select key, count(*) as cnt from a group by key) left
join
(select key, count(*) as cnt from b group by key) right
on left.key=right.key;
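What that diagnostic query computes can be modeled in plain Python: count each key on both sides of the join; the product of the two counts is the number of rows the join emits for that key, so a disproportionately large product marks a skewed key. An illustrative sketch with made-up key lists:

```python
from collections import Counter

def join_output_per_key(keys_a, keys_b):
    # Per-key join cardinality: rows a JOIN b produces for each shared key.
    ca, cb = Counter(keys_a), Counter(keys_b)
    return {k: ca[k] * cb[k] for k in ca if k in cb}

a = ["u1"] * 5 + ["u2"]
b = ["u1"] * 4 + ["u2", "u3"]
print(join_output_per_key(a, b))  # {'u1': 20, 'u2': 1}: u1 is the skewed key
```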
An imbalance of work may occur when you perform a GROUP BY operation based on a key that is not evenly distributed.
Assume that Table A has two fields, Key and Value. The table contains a large amount of data and the values of the Key field are not evenly distributed. Execute the following statement to perform a GROUP BY operation on the table:
When the amount of data in the table is large enough, you may find long tails on the Logview page of the query. To resolve the issue, add set odps.sql.groupby.skewindata=true before the preceding statement to enable anti-skew before the query is performed.
If you use dynamic partitioning in MaxCompute, one or more reduce tasks are assigned to each partition to aggregate data by partition. This brings the following benefits:
If the data to be written to partitions is skewed, long tails may occur during the reduce stage. Each partition can be assigned a maximum of 10 map tasks. If a larger amount of data is to be written to a partition than the other partitions, long tails may occur. If you can determine the partition to which data is to be written, we recommend that you do not use dynamic partitioning. For example, long tails may occur if you execute the following statement to write data from a specific partition in a table to another table:
In this case, you can replace the preceding statement with the following statement:
For more information about how to reduce the impact of data skew, see Long-tail computing optimization.
The OVER clause, which defines how to partition and sort rows in a table, must be the same.
Multiple window functions must be executed at the same level of nesting in an SQL statement.
Window functions that meet the preceding conditions are merged and executed by one reduce task. The following SQL statement provides an example:
select
rank()over(partition by A order by B desc) as rank,
row_number()over(partition by A order by B desc) as row_num
from MyTable;
Optimize subqueries
The following statement contains a subquery:
SELECT * FROM table_a a WHERE a.col1 IN (SELECT col1 FROM table_b b WHERE xxx);
If the subquery on the table_b table returns more than 1,000 values from the col1 column, the system reports the following error: records returned from subquery exceeded limit of 1000. In this case, you can replace the preceding statement with the following statement:
SELECT a.* FROM table_a a JOIN (SELECT DISTINCT col1 FROM table_b b WHERE xxx) c ON (a.col1 = c.col1)
Note:
If the DISTINCT keyword is not used, the subquery result table c may contain duplicate values in the col1 column. In this case, the query on the a table returns more results.
If the DISTINCT keyword is used, only one worker is assigned to perform the subquery. If the subquery involves a large amount of data, the whole query slows down.
If you are sure that the values that meet the subquery conditions in the col1 column are unique, you can delete the DISTINCT keyword to improve query performance.
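The rewrite above preserves the semantics of IN only when the joined subquery has no duplicate keys, which is exactly what the Note explains. A toy model of the two query forms in plain Python (hypothetical data, single-column keys):

```python
def in_filter(a_rows, b_col1):
    # SELECT * FROM table_a WHERE col1 IN (subquery)
    wanted = set(b_col1)
    return [r for r in a_rows if r in wanted]

def join_rewrite(a_rows, b_col1, distinct=True):
    # SELECT a.* FROM table_a a JOIN (SELECT [DISTINCT] col1 FROM table_b) c ON a.col1 = c.col1
    c = set(b_col1) if distinct else list(b_col1)
    return [r for r in a_rows for k in c if r == k]

a_rows = [1, 2, 2, 3]
b_col1 = [2, 3, 3]
print(in_filter(a_rows, b_col1))            # [2, 2, 3]
print(join_rewrite(a_rows, b_col1))         # [2, 2, 3]  (DISTINCT: same as IN)
print(join_rewrite(a_rows, b_col1, False))  # [2, 2, 3, 3]  (duplicates inflate the result)
```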
Optimize joins
When you join two tables, we recommend that you use the WHERE clause based on the following rules:
Specify the partition limits of the primary table in the WHERE clause. We recommend that you define a subquery for the primary table to obtain the required data first.
Write the WHERE clause of the primary table at the end of the statement.
Specify the partition limits of the secondary table in the ON clause or a subquery.
Examples:
select * from A join (select * from B where dt=20150301)B on B.id=A.id where A.dt=20150301;
select * from A join B on B.id=A.id where B.dt=20150301; -- We recommend that you do not use this statement. The system performs the JOIN operation before it performs partition pruning. This can result in a large amount of data and deteriorate the query performance.
select * from (select * from A where dt=20150301)A join (select * from B where dt=20150301)B on B.id=A.id;
Background information
When the JOIN statement in MaxCompute SQL is executed, the data with the same join key is sent to and processed on the same instance. If a key contains a large amount of data, the instance takes a longer time to process the data than other instances. Long tails exist if the execution log shows that a few instances in the JOIN task remain in the executing state while other instances are in the completed state.
Long tails caused by data skew are common and significantly prolong task execution. During promotions such as Double 11, severe long tails may occur. For example, page views of large sellers are much higher than page views of small sellers. If page view log data is associated with the seller dimension table, data is distributed by seller ID. This causes some instances to process far more data than others. In this case, the task cannot be completed due to a few long tails.
If you want to join one large table and one small table, you can execute the MAPJOIN statement to cache the small table. For more information about the MAPJOIN statement, see SELECT syntax.
To join two large tables, deduplicate the data first.
Try to find out the cause of the Cartesian product of two large keys and optimize these keys from the business perspective.
It takes a long time to directly execute the LEFT JOIN statement for a small table and a large table. In this case, we recommend that you execute the MAPJOIN statement for the small and large tables to generate an intermediate table that contains the intersection of the two tables. This intermediate table is not greater than the large table because the MAPJOIN statement filters out unnecessary data from the large table. Then, execute the LEFT JOIN statement for the small and intermediate tables. The effect of this operation is equivalent to that of executing the LEFT JOIN statement for the small and large tables.
2. Find your Fuxi instance and click the icon in the StdOut column to view the size of the data read
by the instance.
For example, Read from 0 num:52743413 size:1389941257 indicates that 52,743,413 rows of
data (1,389,941,257 bytes) are being read when the JOIN statement is executed. If an instance listed
in Long-Tails reads far more data than other instances, a long tail occurs due to the large data size.
When you use the MAP JOIN statement, the JOIN operation is performed on the Map side. This prevents
data skew caused by uneven key distribution. The MAP JOIN statement is subject to the following
limits:
The MAP JOIN statement is applicable only when the secondary table is small. A secondary table
refers to the right table in the LEFT OUTER JOIN statement or the left table in the RIGHT OUTER JOIN
statement.
The size of the small table is also limited when the MAP JOIN statement is used. By default, the
maximum size is 512 MB after the small table is loaded into memory. You can execute the
following statement to increase the maximum size to 10,000 MB:
set odps.sql.mapjoin.memory.max=10000
The MAP JOIN statement is easy to use. You can append /*+ mapjoin(b) */ to the SELECT
statement, where b indicates the alias of the small table or the subquery. Example:
select /*+ mapjoin(b) */
a.c2
,b.c3
from
(select c1
,c2
from t1 ) a
left outer join
(select c1
,c3
from t2 ) b
on a.c1 = b.c1;
If hot key values cause a long tail and the MAP JOIN statement cannot be used because no small
table is involved, extract the hot key values. Hot key values in the primary table are separated from the
non-hot key values, processed independently, and then joined with the non-hot key values. In the following
example, the page view log table of the Taobao website is associated with the commodity
dimension table.
i. Extract hot key values: Extract the IDs of the commodities whose page views are greater than
50,000 to a temporary table.
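This extraction step might be written as follows. This is a sketch only: the table names and partition filters are taken from the later examples in this topic, and the exact schemas are assumptions.

```sql
-- Hypothetical step i: extract the IDs of hot commodities (more than 50,000
-- page views) into the temporary table topk_item.
insert overwrite table topk_item partition (ds='${bizdate}')
select item_id
from dwd_tb_log_pv_di
where ds = '${bizdate}'
and url_type = 'ipv'
and item_id is not null
group by item_id
having count(*) > 50000;
```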
select ...
from
(select *
from dim_tb_itm
where ds = '${bizdate}'
) a
right outer join
(select /*+ mapjoin(b1) */
b2.*
from
(select item_id
from topk_item
where ds = '${bizdate}'
) b1
right outer join
(select *
from dwd_tb_log_pv_di
where ds = '${bizdate}'
and url_type = 'ipv'
) b2
on b1.item_id = coalesce(b2.item_id,concat("tbcdm",rand()))
where b1.item_id is null
) l
on a.item_id = coalesce(l.item_id,concat("tbcdm",rand()));
select /*+ mapjoin(a) */
...
from
(select /*+ mapjoin(b1) */
b2.*
from
(select item_id
from topk_item
where ds = '${bizdate}'
) b1
join
(select *
from dwd_tb_log_pv_di
where ds = '${bizdate}'
and url_type = 'ipv'
and item_id is not null
) b2
on (b1.item_id = b2.item_id)
) l
left outer join
(select /*+ mapjoin(a1) */
a2.*
from
(select item_id
from topk_item
where ds = '${bizdate}'
) a1
join
(select *
from dim_tb_itm
where ds = '${bizdate}'
) a2
on (a1.item_id = a2.item_id)
) a
on a.item_id = l.item_id;
iv. Execute the UNION ALL statement to merge the data obtained in Substeps ii and iii to
generate the complete log data, with the commodity information associated.
set odps.sql.skewjoin=true
skewed_key indicates the skewed column, and skewed_value indicates the skewed value of this
column.
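A minimal sketch of these settings follows. The table name a, the key column, and the skewed value "0" are hypothetical, and the odps.sql.skewinfo syntax should be verified against your MaxCompute version.

```sql
set odps.sql.skewjoin=true;
-- Declare that the key column (skewed_key) of table a is skewed
-- on the value "0" (skewed_value).
set odps.sql.skewinfo=a:(key)[("0")];
```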
Use SKEWJOIN HINT to avoid skewed hot key values. For more information about SKEWJOIN HINT, see
SKEWJOIN HINT.
Procedure
Note: Method 3 is more efficient than Method 1 and Method 2.
In the following snapshot captured in Logview, J5_3_4 is the Fuxi task that took the longest time to
execute.
Click the J5_3_4 task and query the instances of this task on the tab that appears. The query results
show that the J5_3_4#215_0 instance took the longest time to execute and that its I/O records and I/O
bytes are much greater than those of other instances.
In this case, you can find that data skew occurs on the J5_3_4#215_0 instance. The JOIN statement
that causes the data skew needs to be further determined. Find the skewed instance, and click the icon in
the StdOut column. Then, find a non-skewed instance, and click the icon in the StdOut column. The
content in the StdOut column cannot be completely displayed. You can click Download to view
the complete information.
In the following figures, you can find that the value of record count in StreamLineRead7 of the
skewed instance is much greater than the value of record count of the non-skewed instance.
Therefore, data skew occurs when data in StreamLineWrite7 and StreamLineRead7 is shuffled.
On the DAG page, right-click the skewed instance and select expand all to find StreamLineWrite7
and StreamLineRead7.
You can find that data skew occurs on StreamLineRead7 in MergeJoin2. MergeJoin2 is generated after
the dim_hm_item and dim_tb_itm_brand tables are joined and the joined table is then joined with the
dim_tb_brand table.
Use these table names to find the skewed table. The result shows that data skew occurs when the
LEFT OUTER JOIN statement is executed and the t1 table is skewed. You can add /*+ skewjoin(t1)
*/ to the SQL statement to resolve the data skew issue.
Long tails are one of the common issues in distributed computing. The main cause of a long tail is
uneven data distribution. As a result, the workloads of individual nodes differ. The entire job can be
completed only after the slowest node processes all of its data.
To prevent one worker from running a large number of jobs, the jobs must be distributed to multiple
workers.
You can use one of the following methods to handle this issue:
Rewrite the SQL statement and add random numbers to split the key. Example:
Regardless of combiners, a mapper shuffles data to a reducer, and the reducer performs the count
operation. The execution plan is in the following sequence: Mapper > Reducer. However, if the jobs
of the long-tailed key are distributed again, use the following statement:
The execution plan for this statement is in the following sequence: Mapper > Reducer > Reducer.
Although more steps are required for the execution, the jobs of the long-tailed key are processed in
two steps, and the time required may be shorter.
Note: If you use this method to add a reducer execution step to handle a long tail that has only a
slight impact on your jobs, the total time required may be longer.
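As a sketch of this rewrite, the following statement splits a hot key into ten random buckets so that the counting work is shared across two reducer rounds. The src table and key column are hypothetical.

```sql
-- Stage 1 (first reducer round): count within each random bucket per key.
-- Stage 2 (second reducer round): merge the per-bucket counts.
select key, sum(cnt) as cnt
from (
    select key
           , bucket
           , count(*) as cnt
    from (
        select key
               , cast(rand() * 10 as bigint) as bucket -- the random number splits the hot key
        from src
    ) s
    group by key, bucket
) t
group by key;
```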
set odps.sql.groupby.skewindata=true
This configuration is used for general optimization instead of business-specific optimization.
Therefore, the optimization effect may not be optimal. You can rewrite SQL statements in a more
efficient way based on your data.
Solution
-- The original SQL statement. The case in which uid is not specified is not considered.
SELECT COUNT(uid) AS Pv
, COUNT(DISTINCT uid) AS Uv
FROM UserLog;
SELECT SUM(PV) AS Pv
, COUNT(*) AS UV
FROM (
SELECT COUNT(*) AS Pv
, uid
FROM UserLog
GROUP BY uid
) a;
This method replaces the DISTINCT aggregation with COUNT over a grouped subquery. This way, the
computing workloads are distributed to different reducers. After you rewrite the statement, you can use
the optimization method for GROUP BY, and the combiner is involved in the computation. This greatly
improves performance.
Solution
If you are sure about the partition to which data is written, you can specify the partition before you
insert the data instead of using dynamic partitions.
Note: Combiners optimize execution only in the map stage. Make sure that the results of an
execution in which combiners are used are the same as those of an execution in which
combiners are not used. WordCount is used in this example. The result of passing (KEY,1) twice
is the same as that of passing (KEY,2) once. For more information, see WordCount. However,
when you calculate an average value, you cannot use a combiner to directly combine (KEY,1)
and (KEY,2) into (KEY,1.5).
A total of 1,047 reducers are used. Among these reducers, 1,046 have completed their
calculations, but the last one has not. After MaxCompute detects this issue, it automatically starts a
new reducer, calculates the same data, and then aggregates the results with those of the reducer that
completed the calculation earlier into the final result set.
A large amount of noisy data may exist in calculations. For example, you need to calculate data
based on visitor IDs to check the access records of each user. In this case, you must filter out crawler
data. Otherwise, a long tail may occur due to the crawler data during the calculation. It is increasingly
difficult to identify crawler data. Similarly, if you want to use the xxid field for associations, you must
check whether the associated field is empty.
Long tails may occur in some special business scenarios. For example, the operation records of
independent software vendors (ISVs) are greatly different from those of individuals in terms of the
amount of data and behavior. In this case, you must use specific analysis methods to handle the
issues of important customers.
If data is unevenly distributed, we recommend that you do not use constants as the key of
DISTRIBUTE BY to sort all the data records.
Background information
When e-commerce companies build data warehouses or analyze their business, they often need to
calculate metrics such as the numbers of visitors, buyers, and regular buyers in a period of time. These
metrics are calculated based on the data that is accumulated over the period of time.
In general, these metrics are calculated based on the data in log tables. For example, you can execute
the following statement to calculate the number of visitors for each item in the last 30 days:
Note: All the variables in the code samples in this topic are scheduling variables in DataWorks.
Therefore, the code samples in this topic are applicable only to scheduling nodes in DataWorks.
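The referenced statement might look as follows. This is a sketch: the log table name dwd_tb_log_pv_di and the visitor_id column are borrowed from the other examples in this topic.

```sql
select item_id
       , count(distinct visitor_id) as uv
from dwd_tb_log_pv_di -- hypothetical daily page-view log table
where ds <= '${bdp.system.bizdate}'
and ds >= to_char(dateadd(to_date('${bdp.system.bizdate}','yyyymmdd'),-29,'dd'),'yyyymmdd')
group by item_id;
```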
If a large amount of log data is generated every day, the preceding SELECT statement requires a large
number of map tasks. If more than 99,999 map tasks are required, the map tasks fail.
Objective
The amount of data accumulated over a long period of time is huge. If the system calculates
metrics based on all of this data, query performance deteriorates. We recommend that you create an
intermediate table that is used to summarize the data generated every day. This removes duplicate
data records and reduces the amount of data to be queried.
Solution
1. Create an intermediate table to summarize the data generated every day.
In this example, you can create an intermediate table based on the data in the item_id and
visitor_id fields. The following code provides an example:
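A possible shape for this daily summarization is shown below. This is a sketch: the source log table name is hypothetical, while mds_itm_vsr_xx is the intermediate table queried in the next step.

```sql
insert overwrite table mds_itm_vsr_xx partition (ds='${bdp.system.bizdate}')
select item_id
       , visitor_id
       , count(*) as pv -- daily page views per item and visitor
from dwd_tb_log_pv_di   -- hypothetical daily log table
where ds = '${bdp.system.bizdate}'
group by item_id, visitor_id;
```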
2. Summarize the data accumulated over a long period of time from the intermediate table.
The following code calculates the number of visitors for each item in the last 30 days:
select item_id
       ,count(distinct visitor_id) as uv
       ,sum(pv) as pv
from mds_itm_vsr_xx
where ds <= '${bdp.system.bizdate}'
and ds >= to_char(dateadd(to_date('${bdp.system.bizdate}','yyyymmdd'),-29,'dd'),'yyyymmdd')
group by item_id;
To resolve this issue, you can merge data from multiple partitions into one partition that contains all
historical data. This way, you can accumulate data in an incremental manner and calculate long-period
metrics based on the data in one partition.
Scenarios
Calculate the number of regular buyers in the last day. A regular buyer is a buyer who made a
purchase in a specific period of time, for example, in the last 30 days.
The following code calculates the number of regular buyers in a period of time:
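A sketch of such a query over the billing logs follows. The table name dwd_tb_trd_di and the buyer_id column are hypothetical.

```sql
-- Count buyers who made at least one purchase in the last 30 days.
select count(distinct buyer_id) as regular_buyer_cnt
from dwd_tb_trd_di -- hypothetical daily billing log table
where ds <= '${bdp.system.bizdate}'
and ds >= to_char(dateadd(to_date('${bdp.system.bizdate}','yyyymmdd'),-29,'dd'),'yyyymmdd');
```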
Improvement:
Create and maintain a dimension table. This table records the relationship between buyers
and purchased items, such as the first purchase time, the last purchase time, the total number of
purchased items, and the total amount of the purchases.
Update the data in the dimension table every day with the data in the billing logs of the last day.
To determine whether a buyer is a regular buyer, check whether the last purchase time of the buyer is
within the last 30 days. This deduplicates data mappings and reduces the amount of data for
calculation.
5. Job diagnostics
5.1. Use Logview to diagnose jobs that run slowly
In most cases, enterprises need job results to be generated earlier than expected. This way, they can
make business development decisions based on the results at the earliest opportunity. In this case, job
developers must pay attention to the job status to identify and optimize the jobs that run slowly. You
can use Logview of MaxCompute to diagnose jobs that run slowly. This topic provides the causes for
which jobs run slowly and the related solutions. This topic also describes how to view information
about the jobs that run slowly.
Background information
Logview of MaxCompute records all logs of jobs and provides guidance for you to view and debug jobs.
You can obtain the Logview URL below Log view in the job result. MaxCompute provides two versions
of Logview. We recommend that you use Logview V2.0 because it provides faster page loading and a
better design style. For more information about Logview V2.0, see Logview V2.0.
Insufficient CUs
If the MaxCompute project uses the subscription billing method and a large number of jobs are
submitted or a large number of small files are generated within a specific period of time, all the
purchased compute units (CUs) are occupied and the jobs become queued.
Data skew
If a large amount of data is processed or some jobs are dedicated to special data, long tails
may occur even if most jobs are completed.
If the SQL or user-defined function (UDF) logic is inefficient or parameter settings are not optimal, a
Fuxi task may run for a long period of time. However, the time for which each Fuxi instance runs is
almost the same. For more information about the relationships between jobs, Fuxi tasks, and Fuxi
instances, see the Job details section.
Insufficient CUs
Problem description
If the CUs are insufficient, the following issues may occur after you submit a job:
The job may be queued because other jobs occupy the resources of the resource group. You can
perform the following steps to view the duration for which the job is queued:
i. Obtain the Logview URL in the job result and open the URL in a browser.
ii. On the SubStatusHistory tab of Logview, find Waiting for scheduling in the Description
column and view the value in the Latency column. The value indicates the duration for which the
job is queued.
After a job is submitted, a large number of CUs are required. However, the resource group cannot
start all Fuxi instances at the same time. As a result, the job runs slowly. You can perform the
following steps to view the job status:
i. Obtain the Logview URL in the job result and open the URL in a browser.
ii. In the Fuxi Instance section of the Job Details tab, click Latency chart to view the job status
diagram.
The following figure shows the status of a job that has sufficient resources. The lower blue part
in the diagram remains at approximately the same height, which indicates that all Fuxi instances
of the job start at approximately the same time.
The following figure shows the status of a job that does not have sufficient resources. The
diagram shows an upward trend, which indicates that the Fuxi instances of the job are gradually
scheduled.
Causes
1. Go to MaxCompute Management.
3. In the Subscription Quota Groups section, click the quota group that corresponds to the
MaxCompute project.
4. In the Usage Trend of Reserved CUs chart on the Resource Consumption tab, click the point
at which the CU usage is the highest and record the point in time.
5. In the left-side navigation pane, click Jobs. On the right part of the page, click the Job
Management tab.
6. On the Job Management tab, configure Time Range based on the point in time that you recorded,
select Running from the Job Status drop-down list, and then click OK.
7. In the job list, click the icon next to CPU Utilization (%) to sort the jobs by CPU utilization in
descending order.
If the CPU utilization of a job is excessively high, click Logview in the Actions column and view
I/O Bytes in the Fuxi Instance section. If I/O Bytes is only 1 MB or tens of KB and multiple Fuxi
instances are running in the job, a large number of small files are generated when the job is run. In
this case, you need to merge the small files or adjust the parallelism.
If the values of CPU Utilization (%) are almost the same, multiple large jobs are submitted at the
same time and the jobs consume all CUs. In this case, you must purchase additional CUs or use
pay-as-you-go resources to run the jobs.
Solutions
Merge small files.
Adjust the parallelism.
The parallelism of MaxCompute jobs is automatically estimated based on the amount of input data
and the job complexity. In most cases, you do not need to manually adjust the parallelism. If you
adjust the parallelism to a higher value, the job processing speed increases. However, subscription
resource groups may be fully occupied. In this case, jobs are queued to wait for resources and
therefore run slowly. You can configure the odps.stage.mapper.split.size, odps.stage.reducer.num,
odps.stage.joiner.num, or odps.stage.num parameter to adjust the parallelism. For more information,
see SET operations.
Purchase CUs.
For more information about how to purchase CUs, see Upgrade resource configurations.
Purchase pay-as-you-go resources and use MaxCompute Management to allow subscription projects
to use the pay-as-you-go resources.
Data skew
Problem description
Some Fuxi instances in a Fuxi task continue to run even after most Fuxi instances of the Fuxi task have
stopped. As a result, long tails occur.
In the Fuxi Instance section of the Job Details tab of Logview, you can click Long-Tails to view the
Fuxi instances that have a long tail.
Cause
The Fuxi instances that continue to run process large amounts of data or are dedicated to special
data.
Solution
For more information about how to resolve data skew, see Reduce impacts of data skew.
If the code logic is inefficient, the following issues may occur after you submit a job:
Issue 1: Data expansion occurs. The amount of output data of a Fuxi task is significantly greater than
the amount of input data.
You can view I/O Record and I/O Bytes in the Fuxi Task section to check the amounts of input and
output data of a Fuxi task. In the following figure, 1 GB of data is changed to 1 TB after the data is
processed. One Fuxi instance processes 1 TB of data, which reduces data processing efficiency.
Issue 2: A Fuxi task runs slowly, and the Fuxi task has UDFs. When a timeout error occurs on a UDF, the error Fuxi job failed - WorkerRestart errCode:252,errMsg:kInstanceMonitorTimeout, usually caused by bad udf performance is returned. You can perform the following steps to view the location and
execution speed of the UDF:
i. Obtain the Logview URL in the job result and open the URL in a browser.
ii. In the progress chart, double-click the Fuxi task that runs slowly or fails to run. In the operator
graph, view the location of the UDF, as shown in the following figure.
iii. In the Fuxi Instance section, click StdOut to view the execution speed of the UDF.
Causes
Issue 1: The business processing logic causes data expansion. In this case, check whether the business
logic meets your business requirements.
Issue 2: The UDF code logic does not meet your business requirements. In this case, adjust the code
logic.
Solutions
Issue 1: Check whether the business logic has a defect. If the logic has a defect, modify the code. If
the logic does not have a defect, configure the odps.stage.mapper.split.size,
odps.stage.reducer.num, odps.stage.joiner.num, or odps.stage.num parameter to adjust the
parallelism. For more information, see SET operations.
Issue 2: Check and modify the UDF code logic. We recommend that you preferentially use built-in
functions. If built-in functions cannot meet your business requirements, use UDFs. For more
information about built-in functions, see Built-in functions.
6. Cost optimization
6.1. Overview
This topic describes the process of cost optimization.
Enterprises must continually optimize their costs on MaxCompute in response to the changes in big
data. You can reference the following process for cost optimization:
1. Before you use MaxCompute, make sure that you fully understand the billing methods, accurately
estimate the resources that you require, and then select an appropriate billing method. For more
information, see Select a billing method.
2. To reduce costs when you use MaxCompute, optimize the resources that are used for data
computing, storage, uploads, and downloads. For more information, see Optimize computing
costs, Optimize storage costs, and Optimize the costs of data uploads and downloads.
3. View your bills in a timely manner. Analyze any exceptions in the bills and perform optimization. For
more information, see Manage costs.
Billing methods
MaxCompute supports the following billing methods:
Subscription: Computing resources are charged on a monthly or annual basis. Storage and download
resources are charged on a pay-as-you-go basis.
Pay-as-you-go: Storage, computing, and download resources are all charged on a pay-as-you-go
basis.
For more information, see Billing method. You can select a billing method with the help of Total Cost of
Ownership (TCO) tools and the best practices of cost estimation.
TCO tools
You can use the following TCO tools to estimate costs:
MaxCompute price calculator: This tool is suitable for the subscription billing method. To calculate
the monthly cost, enter the required computing resources and the volumes of the data that you want to
upload and download.
Cost SQL: This tool is suitable for the pay-as-you-go billing method.
You can run the cost sql command to estimate the cost of an SQL job before you execute the SQL
job in a production environment. For more information, see Cost estimation.
If you use IntelliJ IDEA, you can submit SQL scripts for automatic cost estimation. For more
information, see Develop and submit an SQL script.
If you use DataWorks, you can also estimate costs.
Note
The costs of some SQL jobs cannot be estimated, such as SQL jobs that involve external
tables.
The actual costs are subject to final bills.
Pay-as-you-go: 1413 USD per month (estimated with an SQL complexity of 1 and an execution frequency of once per day).
If you select the subscription billing method, the costs vary depending on your business type:
Compute-intensive scenario: In this scenario, a large number of CPU resources are required. 160
compute units are used to process 1 TB of data. The system responds to a request within a few
minutes. The estimated cost is 3768 USD per month.
Storage-intensive scenario: If your jobs are not sensitive to the response speed, we recommend
that you purchase a storage plan. About 50 compute units are used to process 1 TB of data. The
system responds to a request within a few hours. The estimated cost is 1177.5 USD per month.
If you select the pay-as-you-go billing method, the cost of the computing resources that are used
to process 1 TB of data once is about 47.1 USD per day and 1413 USD per month. The prerequisites
are that the SQL complexity is 1 and the data is processed once per day. If the data is processed
multiple times per day, the cost is multiplied.
When you migrate data to the cloud for the first time, we recommend that you select the pay-as-
you-go billing method first. Perform a Proof of Concept (POC) test to calculate the approximate
number of workers used for your jobs. Then, calculate the number of compute units that you need to
purchase based on the number of workers.
Billing methods for Hadoop users who migrate data to the cloud
Assume that a Hadoop cluster has one controller node and five compute nodes. Each node has 32
cores, equivalent to 32 CPUs. The total number of CPUs for the compute nodes is 160. The estimated
cost of the cluster is 3768 USD per month with no discounts or promotional offers applied.
MaxCompute does not require any controller nodes. The performance of MaxCompute is 80% higher
than that of Hive. MaxCompute also frees you from operations and maintenance (O&M), which further reduces costs.
Subscription billing method for production businesses, such as hourly extract, transform, load (ETL),
and pay-as-you-go billing method for aperiodic jobs or ad hoc queries:
We recommend that you select the subscription billing method for periodic computing jobs that
are frequently executed and the pay-as-you-go billing method for aperiodic jobs that are used to
process large amounts of data. In pay-as-you-go mode, you can choose not to store data.
Instead, you can read data from tables under other accounts. This reduces data storage costs.
Authorization is required for computing operations on tables under different accounts. For more
information, see Create a project-level role.
Subscription billing method for aperiodic jobs or ad hoc queries and pay-as-you-go billing method
for production businesses, such as daily ETL:
Daily data testing may cause the issue of uncontrollable costs. To avoid this issue, you can add
data testing and aperiodic jobs to fixed resource groups. Then, use MaxCompute Management to
configure custom development groups and business intelligence (BI) groups. If production jobs are
executed only once per day, you can add them to a pay-as-you-go resource group.
You can also switch between the subscription and pay-as-you-go billing methods. For more
information, see Switch billing methods.
Note: Before you switch the billing method from pay-as-you-go to subscription, evaluate the
computing performance and cycles of your jobs to determine the number of compute units that you need to
purchase. If the compute units that you purchase are insufficient, the computing cycle of a job may be
prolonged, and the computing performance may not meet your expectations. If this occurs, you
may need to switch the billing method again.
You can estimate computing costs before you execute computing jobs. For more information, see TCO
tools. You can also configure alerts for resource consumption to avoid extra costs. If computing costs
are high, you can use the methods described in this topic to reduce the costs.
Reduce full table scans. You can use the following methods:
Specify the required parameters to disable the full table scan feature. You can disable the feature
for a session or project.
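For example, the following flag disables full table scans on partitioned tables at the session level. This is a sketch based on a commonly documented MaxCompute flag; verify the flag name and default for your MaxCompute version.

```sql
-- Queries on partitioned tables must then specify partition filter conditions.
set odps.sql.allow.fullscan=false;
```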
Prune columns. Column pruning allows the system to read data only from the required columns. We
recommend that you do not use the SELECT * statement, which triggers a full table scan.
Prune partitions. Partition pruning allows you to specify filter conditions for partition key columns.
This way, the system reads data only from the required partitions. This avoids the errors and waste
of resources caused by full table scans.
Optimize SQL keywords that incur costs. The keywords include JOIN, GROUP BY, ORDER BY, DISTINCT,
and INSERT INTO. You can optimize the keywords based on the following rules:
Before a JOIN operation, you must prune partitions. Otherwise, a full table scan may be
performed. For more information about scenarios in which partition pruning is invalid, see
Scenarios where partition pruning does not take effect.
Use UNION ALL instead of FULL OUTER JOIN.
Try not to include GROUP BY in UNION ALL. Use GROUP BY outside UNION ALL.
To sort temporarily exported data, sort the data by using tools such as Excel instead of ORDER
BY.
Try not to use DISTINCT. Use GROUP BY instead.
Try not to use INSERT INTO to write data. Add a partition field instead. This reduces SQL
complexity and saves computing costs.
Try not to execute SQL statements to view table data. You can use the table preview feature to
view table data. This method does not consume computing resources. If you use DataWorks, you can
preview a table and query details about the table on the Data Map page. For more information, see
View the details of a table. If you use MaxCompute Studio, double-click a table to preview its data.
Select an appropriate tool for data computing. MaxCompute responds to a query within minutes. It is
not suitable for frontend queries. Computing results are synchronized to an external storage system.
Most users use relational databases to store results. We recommend that you use MaxCompute for
lightweight computing jobs and relational databases, such as ApsaraDB for RDS, for frontend queries.
Frontend queries require the real-time generation of query results. If the query results are displayed in
the frontend, no conditional clauses are executed on the data. The data is not aggregated or
associated with dictionaries. The queries do not even include the WHERE clause.
The default split size for a mapper is 256 MB. The split size determines the number of mappers. If your code logic for a mapper is time-consuming, you can use JobConf#setSplitSize to reduce the split size. You must configure an appropriate split size. Otherwise, excessive computing resources are required.
By default, the number of reducers that are used to complete a job is one fourth of the number of mappers. You can set the number of reducers to a value that ranges from 0 to 2,000. More reducers require more computing resources, which increases costs. You must appropriately configure the number of reducers.
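As a rough illustration of the relationship between input size, split size, and mapper count (plain Python, not part of the MaxCompute SDK; the 256 MB default is taken from the text above):

```python
import math

DEFAULT_SPLIT_SIZE_MB = 256  # default split size per mapper, per the text above

def estimated_mappers(input_size_mb: float, split_size_mb: int = DEFAULT_SPLIT_SIZE_MB) -> int:
    """Roughly estimate the number of mappers: one per split."""
    return max(1, math.ceil(input_size_mb / split_size_mb))

# Halving the split size roughly doubles the mappers (and the parallelism),
# at the cost of more scheduling overhead and computing resources.
print(estimated_mappers(1024))       # 1024 MB / 256 MB -> 4 mappers
print(estimated_mappers(1024, 128))  # 1024 MB / 128 MB -> 8 mappers
```

This is why an overly small split size wastes resources: the mapper count grows inversely with the split size.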
For input tables that contain a large number of columns, only a few columns are processed by a mapper. When you add an input table, you can specify the columns to reduce the amount of data that needs to be read. For example, to process data in the c1 and c2 columns, use the following configuration:
InputUtils.addTable(TableInfo.builder().tableName("wc_in").cols(new String[]{"c1","c2"}).build(), job);
After the configuration, the mapper reads data only from the c1 and c2 columns. This does not affect the data that is obtained based on column names. However, this may affect the data that is obtained based on subscripts.
We recommend that you read resources in the setup stage. This avoids performance loss caused by duplicate resource reads. You can read resources up to 64 times. For more information, see Resource usage example.
Java objects are used in each map or reduce stage. You can construct Java objects in the setup stage instead of the map or reduce stage. This reduces the overheads of object construction.
{
    ...
    Record word;
    Record one;
    public void setup(TaskContext context) throws IOException {
        // Create the Java objects in the setup stage. This avoids the repeated
        // creation of Java objects in each map stage.
        word = context.createMapOutputKeyRecord();
        one = context.createMapOutputValueRecord();
        one.set(new Object[]{1L});
    }
    ...
}
/**
 * A combiner class that combines map output records by summing their values.
 */
public static class SumCombiner extends ReducerBase {
    private Record count;
    @Override
    public void setup(TaskContext context) throws IOException {
        count = context.createMapOutputValueRecord();
    }
    @Override
    public void reduce(Record key, Iterator<Record> values, TaskContext context)
            throws IOException {
        long c = 0;
        while (values.hasNext()) {
            Record val = values.next();
            c += (Long) val.get(0);
        }
        count.set(0, c);
        context.write(key, count);
    }
}
Appropriately select partition key columns or customize a partitioner
You can use JobConf#setPartitionColumns to specify partition key columns. The default partition key columns are defined in the key schema. If you use this method, data is transferred to reducers according to the hash values of the specified columns. This avoids long-tail issues caused by data skew. You can also customize a partitioner if necessary. The following code shows how to customize a partitioner:
import com.aliyun.odps.mapred.Partitioner;
public static class MyPartitioner extends Partitioner {
    @Override
    public int getPartition(Record key, Record value, int numPartitions) {
        // numPartitions indicates the number of reducers.
        // This function determines the reducer to which each key of a map task is transferred.
        String k = key.get(0).toString();
        return k.length() % numPartitions;
    }
}
jobconf.setPartitionerClass(MyPartitioner.class);
jobconf.setNumReduceTasks(num);
Configuring a large amount of memory for a MapReduce job increases computing costs. We recommend that you configure one CPU core and 4 GB of memory for a MapReduce job and set odps.stage.reducer.jvm.mem to 4006 for a reducer. A CPU core-to-memory ratio greater than 1:4 also increases computing costs.
You can perform the following operations to optimize storage costs:
If the minimum period for data collection is one day, we recommend that you use the date field as a partition field. The system migrates data to the specified partitions every day. Then, it reads the data from the specified partitions for subsequent operations.
If the minimum period for data collection is one hour, we recommend that you use the combination of the date and hour fields as a partition field. The system migrates data to the specified partitions every hour. Then, it reads the data from the specified partitions for subsequent operations. If data that is collected on an hourly basis is partitioned based on dates, data in each partition is appended every hour. As a result, the system reads large amounts of unnecessary data, which increases storage costs.
You can use partition fields based on your business needs. In addition to the date and time fields, you can use other fields that have a relatively fixed number of enumerated values, such as channel, country, or province. Alternatively, you can use a combination of time and other fields as a partition field. We recommend that you specify two levels of partitions in a table. Each table supports a maximum of 60,000 partitions.
For example, you can execute the following statement to create a table with a lifecycle of 100 days. If the last modification of the table or a partition occurred more than 100 days ago, MaxCompute deletes the table or partition.
CREATE TABLE test3 (key boolean) PARTITIONED BY (pt string, ds string) LIFECYCLE 100;
The lifecycle takes a partition as the smallest unit. If some partitions in a partitioned table reach the lifecycle threshold, these partitions are deleted. Partitions that do not reach the lifecycle threshold are not affected.
You can execute the following statement to modify the lifecycle settings for an existing table. For more information, see Lifecycle management operations.
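The lifecycle-modification statement itself did not survive in this copy of the document; the following is a minimal sketch, assuming the table test3 from the earlier example and the standard ALTER TABLE ... SET LIFECYCLE syntax:

```sql
-- Shorten the lifecycle of the existing table test3 from 100 days to 30 days.
ALTER TABLE test3 SET LIFECYCLE 30;
```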
Tables that are not accessed within the last three months
Non-partitioned tables that are not accessed within the last month
Tables that do not consume storage resources
You can use an internal network, such as the classic network or a VPC, to upload or download data at no cost. For more information about how to configure networks, see Endpoints.
If you create a subscription ECS instance, you can use a data synchronization tool such as Tunnel to synchronize data from MaxCompute to the ECS instance. Then, download the data to your local directory. For more information, see Export SQL execution results.
Separate uploads of small files consume too many computing resources. We recommend that you upload a large number of small files at a time. For example, if you call Tunnel SDK, we recommend that you upload files when the cache of the files reaches 64 MB.
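The 64 MB batching rule can be illustrated with a small buffer sketch. This is plain Python, not the Tunnel SDK; flush_to_tunnel is a hypothetical stand-in for the actual upload call:

```python
FLUSH_THRESHOLD = 64 * 1024 * 1024  # 64 MB, per the recommendation above

class BufferedUploader:
    """Accumulate small payloads and flush them in one batch at 64 MB."""

    def __init__(self, flush_to_tunnel):
        self._flush = flush_to_tunnel  # hypothetical upload callback
        self._buffer = []
        self._size = 0
        self.flush_count = 0

    def add(self, payload: bytes) -> None:
        self._buffer.append(payload)
        self._size += len(payload)
        if self._size >= FLUSH_THRESHOLD:
            self.flush()

    def flush(self) -> None:
        if self._buffer:
            self._flush(b"".join(self._buffer))
            self.flush_count += 1
            self._buffer, self._size = [], 0

# Uploading 128 small 1 MB files triggers only two 64 MB uploads
# instead of 128 separate ones.
uploader = BufferedUploader(lambda blob: None)
for _ in range(128):
    uploader.add(b"x" * (1024 * 1024))
uploader.flush()  # flush any remainder
print(uploader.flush_count)  # -> 2
```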
Bill details: You can view bill details on the Billing Management page of the Alibaba Cloud Management Console.
Usage records: Each usage record contains the complexity and metering information of an SQL statement, as well as details about daily storage and download traffic.
Command-line interface (CLI): You can use a CLI to reproduce operation scenarios and determine the causes of high costs incurred by SQL statements.
Bill details
We recommend that you regularly view your bills to optimize costs in a timely manner. You can view bill details in the Alibaba Cloud Management Console. If you select the subscription billing method, bills are generated at 12:00 the next day. If you select the pay-as-you-go billing method, bills are generated at 09:00 the next day. For more information, see View billing details.
Usage records
If the bill amount of a project reaches thousands of dollars on a given day and is a multiple of the normal bill amount, you must view the bill details. You can download usage records to view details about exception records. For more information, see View billing details.
Metering information about a storage fee is pushed every hour. To calculate a storage fee, obtain the total number of bytes and calculate the average value over a 24-hour period. Then, use the tiered pricing method to obtain the storage fee.
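As a sketch of that calculation (plain Python; the flat price_per_gb_day rate is a placeholder assumption, because the real service applies tiered pricing to the averaged volume):

```python
def daily_storage_fee(hourly_bytes, price_per_gb_day=0.0006):
    """Average the 24 hourly storage readings, then price the result.

    `price_per_gb_day` is an assumed flat rate; the real service applies
    tiered pricing to the averaged volume.
    """
    avg_bytes = sum(hourly_bytes) / len(hourly_bytes)
    avg_gb = avg_bytes / 1024 ** 3
    return avg_gb * price_per_gb_day

# 24 hourly readings of exactly 100 GB average out to 100 GB billed.
readings = [100 * 1024 ** 3] * 24
print(round(daily_storage_fee(readings), 4))  # -> 0.06
```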
The calculation of metering information depends on the end time of each task. If a task is completed in the early morning of the next day after it starts, the metering information of this task is included in the calculation for the day the task is completed.
You are not charged for the resources that are used to download data over an internal network, such as the classic network. The resources that are used to upload data are also free of charge. You are charged only for the resources that are used to download data over the Internet.
CLI
If an abnormal SQL statement is detected, you can use a CLI to reproduce the operation scenario. You can check usage records or run the show p; command to obtain the ID of the instance on which abnormal data is detected. Then, run the wait InstanceId command to obtain the Logview URL of the instance. The logs of the SQL statement are displayed in Logview. You can view the logs to determine the causes of high costs.
Note You can obtain only the information generated in the last seven days in Logview.
You can also run the desc instance instid command to show information about the SQL statement in the console.
| Command | Description | Billable | Example |
|---|---|---|---|
| TUNNEL DOWNLOAD | Download data over an internal network. | No | TUNNEL DOWNLOAD table_name e:/table_name.txt; |
| TUNNEL DOWNLOAD | Download data over the Internet. Configure the public endpoint of MaxCompute: http://dt.cn-shanghai.maxcompute.aliyun.com. | Yes | TUNNEL DOWNLOAD table_name e:/table_name.txt; |
| TUNNEL UPLOAD | Upload data. | No | TUNNEL UPLOAD e:/table_name.txt table_name; |
| INSERT OVERWRITE...SELECT | Update data. | Yes | INSERT OVERWRITE TABLE table_name PARTITION (sale_date='20180122') SELECT shop_name, customer_id, total_price FROM sale_detail; |
| DESC TABLE | Query table information. | No | DESC table_name; |
| DROP TABLE | Delete a table. | No | DROP TABLE IF EXISTS table_name; |
| CREATE TABLE | Create a table. | No | CREATE TABLE IF NOT EXISTS table_name (key string, value bigint) PARTITIONED BY (p string); |
| CREATE TABLE...SELECT | Create a table. | Yes | CREATE TABLE IF NOT EXISTS table_name AS SELECT * FROM a_tab; |
| SET FLAG | Configure session settings. | No | SET odps.sql.allow.fullscan=true; |
| JAR MR | Execute a MapReduce job. | Yes | JAR -l com.aliyun.odps.mapred.example.WordCount wc_in wc_out |
| ADD JAR/FILE/ARCHIVE/TABLE | Add a resource. | No | ADD JAR data\resources\mapreduce-examples.jar -f; |
| GET RESOURCES | Download resources. | No | GET RESOURCES odps-udf-examples.jar d:\; |
| CREATE FUNCTIONS | Create a function. | No | CREATE FUNCTION test_lower; |
| DROP FUNCTIONS | Delete a function. | No | DROP FUNCTION test_lower; |
| CREATE EXTERNAL TABLE | Create an external table. | No | CREATE EXTERNAL TABLE IF NOT EXISTS ambulance_data_csv_external…LOCATION 'oss://oss-cn-shanghai-internal.aliyuncs.com/oss-odps-test/Demo/' |
| SELECT [EXTERNAL] TABLE | Read an external table. | Yes | SELECT recordId, patientId, direction FROM ambulance_data_csv_external WHERE patientId > 25; |
| SHOW INSTANCE/SHOW P | Show information about the instances that the current user creates. | No | SHOW INSTANCES; / SHOW P; |
| STATUS INSTANCE | Return the status of the specified instance. | No | STATUS 20131225123302267gk3u6k4y2 |
| KILL INSTANCE | Stop the specified instance. | No | KILL 20131225123302267gk3u6k4y2 |
Background information
MaxComput e is a big dat a analyt ics plat form. Comput ing resources of MaxComput e support t wo t ypes
of billing met hods: subscript ion and pay-as-you-go. You are charged based on MaxComput e project s
on a daily basis, and daily bills are generat ed before 06:00 of t he next day. For more informat ion about
billable it ems and billing met hods of MaxComput e, see Billing method.
Alibaba Cloud provides informat ion about t he MaxComput e bill fluct uat ions (cost increases in most
cases) during dat a development or before a version release of MaxComput e. You can analyze t he bill
fluct uat ions and opt imize jobs in your MaxComput e project s based on t he analysis result s. You can
download t he usage records of all commercial services on t he Billing Management page in t he Alibaba
Cloud Management Console. For more informat ion about how t o obt ain and download bills, see View
billing details.
ProjectId: a MaxCompute project of your Alibaba Cloud account or a MaxCompute project of the Alibaba Cloud account to which the current RAM user belongs.
MeteringId: a billing ID, which indicates the ID of a storage task, an SQL computing task, an upload task, or a download task. The ID of an SQL computing task is specified by InstanceId, and the ID of an upload or download task is specified by Tunnel SessionId.
MeteringType: a billing type. Valid values: Storage, ComputationSql, UploadIn, UploadEx, DownloadIn, and DownloadEx.
Storage: the amount of data that is read per hour. Unit: bytes.
StartTime or EndTime: the time when a job started to run or the time when a job stopped. Only storage data is obtained on an hourly basis.
SQLInput(Byte): the SQL computation item. This field specifies the amount of input data each time an SQL statement is executed. Unit: bytes.
SQLComplexity: the complexity of SQL statements. This field is one of the SQL billing factors.
UploadEx or DownloadEx(Byte): the amount of data that is uploaded or downloaded over the Internet. Unit: bytes.
MRCompute(Core*Second): the billable hours of a MapReduce job or a Spark job, which are calculated by using the following formula: Number of cores × Number of seconds. After calculation, you must convert the calculation result into billable hours.
InputOTS(Byte) or InputOSS(Byte): the amount of data that is read from Tablestore or OSS by using external tables. Unit: bytes. These fields are used when fees for external tables are generated.
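The core-second to billable-hour conversion mentioned for MRCompute can be sketched as follows (plain Python; the division by 3,600 follows from the seconds-to-hours conversion, and the absence of rounding is an assumption, since the billing system may round differently):

```python
def billable_hours(core_seconds: float) -> float:
    """Convert a Core*Second metering value into billable hours.

    Formula from the text: number of cores x number of seconds,
    then converted into hours. Rounding behavior is an assumption.
    """
    return core_seconds / 3600.0

# A job that used 10 cores for 30 minutes: 10 * 1800 = 18000 core-seconds.
print(billable_hours(10 * 1800))  # -> 5.0 billable hours
```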
To upload a CSV file that contains usage records of bills to MaxCompute, you must make sure that the number and data types of columns in the CSV file are the same as the number and data types of columns in the maxcomputefee table. Otherwise, the data upload fails.
Note
For more information about the configurations of Tunnel commands, see Tunnel commands.
You can also upload usage records of bills by using the data import feature of DataWorks. For more information, see Import data by using Data Integration.
3. Execute the following statement to check whether all usage records are uploaded:
Note Costs of an SQL job = Amount of input data × Complexity of SQL statements × Unit price (USD 0.0438/GB)
-- Sort SQL jobs based on sqlmoney to analyze the costs of SQL jobs.
SELECT  to_char(endtime,'yyyymmdd') AS ds, feeid AS instanceid
        ,projectid
        ,computationsqlcomplexity  -- SQL complexity
        ,SUM((computationsqlinput / 1024 / 1024 / 1024)) AS computationsqlinput  -- Amount of input data (GB)
        ,SUM((computationsqlinput / 1024 / 1024 / 1024)) * computationsqlcomplexity * 0.0438 AS sqlmoney
FROM    maxcomputefee
WHERE   TYPE = 'ComputationSql'
AND     to_char(endtime,'yyyymmdd') >= '20190112'
GROUP BY to_char(endtime,'yyyymmdd'), feeid
        ,projectid
        ,computationsqlcomplexity
ORDER BY sqlmoney DESC
LIMIT   10000
;
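The cost formula used in the query above can be spot-checked with a few lines of plain Python (the unit price is taken from the note above):

```python
SQL_PRICE_PER_GB = 0.0438  # USD per GB, per the note above

def sql_job_cost(input_bytes: int, complexity: float) -> float:
    """Cost of an SQL job = input data (GB) x SQL complexity x unit price."""
    input_gb = input_bytes / 1024 ** 3
    return input_gb * complexity * SQL_PRICE_PER_GB

# A job that scans 10 GB with complexity 1.5:
print(round(sql_job_cost(10 * 1024 ** 3, 1.5), 4))  # -> 0.657
```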
To reduce the costs of large jobs, you can reduce the amount of data that you want to read and the complexity of SQL statements.
You can summarize daily data based on the ds field and analyze the trend of the costs of SQL jobs in a specified period of time. For example, you can create a line chart in Excel or by using tools, such as Quick BI, to display the trend.
You can perform the following steps to locate the node that you want to optimize based on the execution result:
a. Obtain the ID of a job instance.
b. Enter the returned Logview URL in a web browser and press Enter to view the information about the SQL job.
For more information about how to use Logview to view information about jobs, see Use Logview to view job information.
In Logview, find the job whose information you want to view and click XML in the SourceXML column to view the job details. In the following figure, SKYNET_NODENAME indicates the name of the DataWorks node. This parameter is displayed only for the jobs that are run by the scheduling system. This parameter is left empty for ad hoc queries. After you obtain the node name, you can quickly locate the node in the DataWorks console to optimize the node or view the node owner.
2. Analyze the trend of the number of jobs. In most cases, a surge in the number of jobs due to repeated operations or invalid settings of scheduling attributes results in cost increases.
The execution result shows the trend of the number of jobs that were submitted to MaxCompute and were successfully run from January 12, 2019 to January 14, 2019.
Storage costs increased on January 12, 2019 and decreased on January 14, 2019.
To reduce storage costs, we recommend that you configure a lifecycle for tables and delete unnecessary temporary tables.
For Internet-based data downloads or cross-region data downloads in your MaxCompute project, you are charged based on the amount of data that is downloaded.
Note Costs of a download job = Amount of downloaded data × Unit price (USD 0.1166/GB)
Note Computing fees for MapReduce jobs on the day = Total billable hours on the day × Unit price (USD 0.0690/Hour/Job)
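A quick sketch of the two formulas above (plain Python; the unit prices come from the notes):

```python
DOWNLOAD_PRICE_PER_GB = 0.1166  # USD per GB downloaded over the Internet
MR_PRICE_PER_HOUR = 0.0690      # USD per billable hour per MapReduce job

def download_cost(downloaded_bytes: int) -> float:
    """Download cost = downloaded data (GB) x unit price."""
    return downloaded_bytes / 1024 ** 3 * DOWNLOAD_PRICE_PER_GB

def mapreduce_daily_cost(total_billable_hours: float) -> float:
    """Daily MapReduce fee = total billable hours on the day x unit price."""
    return total_billable_hours * MR_PRICE_PER_HOUR

print(round(download_cost(50 * 1024 ** 3), 4))  # 50 GB  -> 5.83
print(round(mapreduce_daily_cost(100), 2))      # 100 h  -> 6.9
```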
6. Analyze the costs of jobs that involve Tablestore external tables or OSS external tables.
Note Computing fees for an SQL job that involves external tables = Amount of input data × Unit price (USD 0.0044/GB).
-- Analyze the costs of SQL jobs that involve Tablestore external tables.
SELECT TO_CHAR(starttime,'yyyymmdd') AS ds
,projectid
,(input_ots/1024/1024/1024)*1*0.0044 AS ots_fee
FROM maxcomputefee
WHERE type = 'ComputationSql'
AND TO_CHAR(starttime,'yyyymmdd') >= '20190112'
GROUP BY TO_CHAR(starttime,'yyyymmdd')
,projectid
,input_ots
ORDER BY ots_fee DESC
;
-- Analyze the costs of SQL jobs that involve OSS external tables.
SELECT TO_CHAR(starttime,'yyyymmdd') AS ds
,projectid
,(input_oss/1024/1024/1024)*1*0.0044 AS oss_fee
FROM maxcomputefee
WHERE type = 'ComputationSql'
AND TO_CHAR(starttime,'yyyymmdd') >= '20190112'
GROUP BY TO_CHAR(starttime,'yyyymmdd')
,projectid
,input_oss
ORDER BY oss_fee DESC
;
Note Computing fees for Spark jobs on the day = Total billable hours on the day × Unit price (USD 0.1041/Hour/Job)
7. Security management
7.1. Set a RAM user as the super administrator for a MaxCompute project
This topic describes how to set a RAM user as the super administrator for a MaxCompute project, and provides suggestions on how to manage members and permissions.
Background information
To ensure data security, the Alibaba Cloud account of a project is used only by authorized personnel. Common users can only log on to MaxCompute as RAM users. A project owner must be the Alibaba Cloud account, and some operations can only be performed by the project owner, such as setting a project flag and configuring cross-project resource sharing by using packages. If you use a RAM user, make sure that it has been granted the super administrator role.
The built-in management role Super_Administrator has been added to MaxCompute. This role has permissions on all types of resources in a project and project management permissions. For more information about permissions, see Role planning and management.
A project owner can grant the Super_Administrator role to a RAM user. As a super administrator, the RAM user has the permissions needed to manage the project, such as common project flag setting permissions and permissions on managing all resources.
Authorization methods
We recommend that you grant the Super_Administrator role to a RAM user that has the permissions to create a project. This way, the RAM user can manage both DataWorks workspaces and MaxCompute projects that are associated with these DataWorks workspaces.
Note
For information about how to authorize a RAM user to create projects, see Grant a RAM user the permissions to perform operations in the DataWorks console.
To ensure data security, we recommend that you clarify the responsibilities of owners of RAM users. Make sure that each RAM user belongs to one developer.
Only one RAM user can be granted the Super_Administrator role in a project. You can grant the Admin role to other RAM users that require basic management permissions.
After you select a RAM user and use the RAM user to create a project, the project owner is still the Alibaba Cloud account, who can grant the Super_Administrator role to the RAM user in the following ways:
Grant the Super_Administrator role on the MaxCompute client.
Assume that user [email protected] is the owner of the project_a project, and user Allen is a RAM user under [email protected].
i. Run the following commands to grant the Super_Administrator and Admin roles to Allen as user [email protected]:
-- Open project_a.
use project_a;
-- Add the RAM user Allen to project_a.
add user [email protected]:Allen;
-- Grant the Super_Administrator role to Allen.
grant super_administrator TO [email protected]:Allen;
-- Grant the Admin role to Allen.
grant admin TO [email protected]:Allen;
ii. Run the following command to view the permissions as the authorized RAM user:
show grants;
If the Super_Administrator role is in the command output, the authorization succeeded.
Note In the note block, click Refresh to synchronize the RAM users under the current Alibaba Cloud account to the Account to be added section.
c. Find the role that you want to grant to the user and click Member management in the Operation column. In the Member management dialog box, select the members you want to add from the Account to be added section and click the rightwards arrow to add them to the Added account section.
show grants;
If the Super_Administrator role is in the command output, the authorization succeeded.
Usage notes
Member management
MaxCompute supports the Alibaba Cloud account and RAM users. To ensure data security, we recommend that you only add RAM users under the project owner as project members.
The Alibaba Cloud account is used to control RAM users, such as revoking or updating their credentials. This ensures data security in the case of personnel transfers and resignations.
Note If you use DataWorks to manage project members, you can add only RAM users under the project owner as project members.
RAM users can be added by the Alibaba Cloud account and the super administrator. If you want to add RAM users to a project as the super administrator, wait until the RAM users are created by the Alibaba Cloud account.
We recommend that you only add the users who need to develop data, namely, users who need to run jobs in the current project, as project members. For users who require data interactions, you can use packages to share resources across projects. This reduces the complexity of member management because fewer members are added to the project.
If an employee who has a RAM user is transferred to another position or resigns, the RAM user with the Super_Administrator role needs to remove the RAM user of the employee from the project, and then notify the project owner to revoke its credentials. If an employee who has a RAM user with the Super_Administrator role is transferred to another position or resigns, the Alibaba Cloud account must be used to remove the RAM user and revoke its credentials.
Permission management
We recommend that you manage permissions by role. Permissions are associated with roles, and roles are associated with users.
We recommend that you use the principle of least privilege to avoid security risks caused by excessive permissions.
If you need to use cross-project data, we recommend that you share resources by using packages. In this way, resource providers only need to manage packages, which avoids the extra costs caused by the management of additional members.
Note A RAM user who has been granted the Super_Administrator role has the permissions to query and manage all resources in a project. Therefore, no additional permissions need to be granted to the RAM user.
You can use the views provided by the MaxCompute metadata service to audit permissions. For more information, see Metadata views.
Cost management
For more information, see View billing details. RAM users can query the billing details only after the Alibaba Cloud account grants them the permissions to access Billing Management. For information about how to grant permissions, see Grant permissions to a RAM role. The following permissions are required:
AliyunBSSFullAccess: the permissions to manage Billing Management.
AliyunBSSReadOnlyAccess: the access and read-only permissions on Billing Management.
AliyunBSSOrderAccess: the permissions to view, pay for, and cancel orders in Billing Management.
Note The views provided by the metadata service only retain data generated in the last 15 days. If you need to store data for a longer period of time, we recommend that you regularly read and save the data locally.
Prerequisites
The MaxCompute client is installed. For more information, see Install and configure the MaxCompute client.
Context
If a user is assigned a built-in role and you want to manage the permissions of the user in a fine-grained manner, we recommend that you use the policy-based permission management mechanism instead of the access control list (ACL) mechanism. For more information about built-in roles, see Users and roles. For more information about the policy-based permission management mechanism, see Policy-based access control and download control.
The policy-based access control mechanism is used to manage permissions based on roles. This mechanism allows you to grant or revoke operation permissions on project objects, such as tables, for roles. The operations include read and write operations. After you assign a role to a user, the permissions granted to or revoked from the role also take effect on the user. For more information about the GRANT and REVOKE syntax, see Policy-based access control and download control.
This operation can be performed only by the project owner or users assigned the Super_Administrator or Admin role.
Sample statement:
For more information about how to create a role, see Role planning and management.
3. Execute the GRANT statement to grant the delete_test role a policy that prohibits the role from deleting tables whose names start with tb_.
Sample statement:
For more information about the GRANT syntax, see the "Policy-based access control by using the GRANT statement" section in Policy-based access control and download control.
4. Execute the GRANT statement to assign the delete_test role to the RAM user Alice.
Sample statement:
If you do not know the Alibaba Cloud account to which the RAM user belongs, you can execute the LIST USERS; statement on the MaxCompute client to obtain the account. For more information about how to assign a role to a user, see Role planning and management.
Sample statement:
[roles]
role_project_admin, delete_test              -- Alice is assigned the delete_test role.
Authorization Type: Policy                   -- The authorization method is Policy.
[role/delete_test]
D projects/mcproject_name/tables/tb_*: Drop  -- Alice is not allowed to delete the tables whose names start with tb_ in the project. D indicates Deny.
[role/role_project_admin]
A projects/mcproject_name: *
A projects/mcproject_name/instances/*: *
A projects/mcproject_name/jobs/*: *
A projects/mcproject_name/offlinemodels/*: *
A projects/mcproject_name/packages/*: *
A projects/mcproject_name/registration/functions/*: *
A projects/mcproject_name/resources/*: *
A projects/mcproject_name/tables/*: *
A projects/mcproject_name/volumes/*: *
Authorization Type: ObjectCreator
AG projects/mcproject_name/tables/local_test: All
AG projects/mcproject_name/tables/mr_multiinout_out1: All
AG projects/mcproject_name/tables/mr_multiinout_out2: All
AG projects/mcproject_name/tables/ramtest: All
AG projects/mcproject_name/tables/wc_in: All
AG projects/mcproject_name/tables/wc_in1: All
AG projects/mcproject_name/tables/wc_in2: All
AG projects/mcproject_name/tables/wc_out: All
For more information about how to view user permissions, see Query permissions by using MaxCompute SQL.
6. Log on to the MaxCompute client as Alice and execute the DROP TABLE statement to delete the tables whose names start with tb_.
The following results are returned. The results indicate that the permission takes effect. If the tables are deleted, the permission does not take effect. In this case, you must check whether the preceding steps are correctly performed.
This operation can be performed only by the project owner or users assigned the Super_Administrator or Admin role. You can use one of the following methods to revoke the permission from the RAM user Alice based on your business requirements.
Sample statement:
For more information about the REVOKE syntax, see the "Policy-based access control by using the GRANT statement" section in Policy-based access control and download control.
iii. Execute the SHOW GRANTS statement to view the permissions of the RAM user Alice. Sample statement:
[roles]
role_project_admin, delete_test  -- The delete_test role is retained.
Authorization Type: Policy       -- The permission is revoked.
[role/role_project_admin]
A projects/mcproject_name: *
A projects/mcproject_name/instances/*: *
A projects/mcproject_name/jobs/*: *
A projects/mcproject_name/offlinemodels/*: *
A projects/mcproject_name/packages/*: *
A projects/mcproject_name/registration/functions/*: *
A projects/mcproject_name/resources/*: *
A projects/mcproject_name/tables/*: *
A projects/mcproject_name/volumes/*: *
Authorization Type: ObjectCreator
AG projects/mcproject_name/tables/local_test: All
AG projects/mcproject_name/tables/mr_multiinout_out1: All
AG projects/mcproject_name/tables/mr_multiinout_out2: All
AG projects/mcproject_name/tables/ramtest: All
AG projects/mcproject_name/tables/tb_test: All
AG projects/mcproject_name/tables/wc_in: All
AG projects/mcproject_name/tables/wc_in1: All
AG projects/mcproject_name/tables/wc_in2: All
AG projects/mcproject_name/tables/wc_out: All
For more information about how to view user permissions, see Query permissions by using MaxCompute SQL.
iv. Log on to the MaxCompute client as Alice and execute the DROP TABLE statement to delete the tables whose names start with tb_.
Sample statement:
Sample statement:
For more information about how to revoke a role from a user, see Role planning and management.
iii. Execute the SHOW GRANTS statement to view the permissions of the RAM user Alice. Sample statement:
[roles]
role_project_admin  -- The delete_test role is revoked.
Authorization Type: Policy
[role/role_project_admin]
A projects/mcproject_name: *
A projects/mcproject_name/instances/*: *
A projects/mcproject_name/jobs/*: *
A projects/mcproject_name/offlinemodels/*: *
A projects/mcproject_name/packages/*: *
A projects/mcproject_name/registration/functions/*: *
A projects/mcproject_name/resources/*: *
A projects/mcproject_name/tables/*: *
A projects/mcproject_name/volumes/*: *
Authorization Type: ObjectCreator
AG projects/mcproject_name/tables/local_test: All
AG projects/mcproject_name/tables/mr_multiinout_out1: All
AG projects/mcproject_name/tables/mr_multiinout_out2: All
AG projects/mcproject_name/tables/ramtest: All
AG projects/mcproject_name/tables/wc_in: All
AG projects/mcproject_name/tables/wc_in1: All
AG projects/mcproject_name/tables/wc_in2: All
AG projects/mcproject_name/tables/wc_out: All
iv. Log on to the MaxCompute client as Alice and execute the DROP TABLE statement to delete the tables whose names start with tb_.
Sample statement:
Sample statement:
If OK is returned, the role is deleted. For more information about how to delete a role, see Role planning and management.