Abinitio Vectors: Created and Presented by Avishek Gupta Roy
Vector
Index
Index: An index is a pointer value that points to a specific position in a vector. The index starts
at 0 and increases by 1 for each subsequent position, so a vector of length n has indexes 0 through n - 1.
DML and data types
A Typical Vector DML
record
string("|") merchant_name;
string("|")[6] transaction_flag;
decimal("|")[6] purchase_amount;
decimal("|")[6] sale_amount;
string("\n") newline;
end;
Note:
The value inside [ ] describes the vector length and is placed immediately after the data type. The length of the vector
can also be variable, based on the value of another field. In the example below, the length of the vectors for
the amount fields depends on the value contained in the field length_of_amount_vector:
record
string("|") merchant_name;
decimal("|") length_of_amount_vector;
decimal("|")[length_of_amount_vector] purchase_amount;
decimal("|")[length_of_amount_vector] sale_amount;
string("\n") newline;
end;
Understanding Data
A Typical Raw data from a File:
Pantaloons|Y|Y|Y|Y|Y|Y|6532|8451|7854|7598|7594|9845|7584|4851|2561|8546|9865|10653|
Shoppers Stop|N|N|N|Y|Y|Y|0|0|0|7584|7542|7548|0|0|0|8965|10596|15240|
A Vector Representation of this data:
record
string("|") merchant_name;
string("|")[6] transaction_flag;
decimal("|")[6] purchase_amount;
decimal("|")[6] sale_amount;
string("\n") newline;
end;
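To make the mapping between the raw line and the vector fields concrete, here is a minimal Python sketch (not Ab Initio DML) that splits the Pantaloons record from the raw data above into the four fields of the record format. The function name parse_fixed_record and the dictionary layout are illustrative assumptions.

```python
VECTOR_LENGTH = 6  # matches the [6] in the record format


def parse_fixed_record(line, n=VECTOR_LENGTH):
    """Split one pipe-delimited line into a scalar field and three vectors of length n."""
    fields = line.rstrip("\n").split("|")
    # The trailing "|" leaves one empty string at the end; drop it.
    if fields and fields[-1] == "":
        fields.pop()
    return {
        "merchant_name": fields[0],
        "transaction_flag": fields[1:1 + n],
        "purchase_amount": [int(v) for v in fields[1 + n:1 + 2 * n]],
        "sale_amount": [int(v) for v in fields[1 + 2 * n:1 + 3 * n]],
    }


raw = ("Pantaloons|Y|Y|Y|Y|Y|Y|6532|8451|7854|7598|7594|9845|"
       "7584|4851|2561|8546|9865|10653|\n")
pantaloons = parse_fixed_record(raw)
print(pantaloons["purchase_amount"])
print(pantaloons["sale_amount"])
```

The first six values after the merchant name fill transaction_flag, the next six fill purchase_amount, and the last six fill sale_amount.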
Normalizing a Vector
Normalizing Record with fixed vector length:
Transformation for Normalize:
out :: length(in) =
begin
out :: 6;
end;
out :: normalize(in, index) =
begin
out.purchase_amount :: in.purchase_amount[index];
out.sale_amount :: in.sale_amount[index];
out.newline :: in.newline;
end;
Explanation: In the above transform, index is a parameter equal to the vector position. Here the length of the
vector is 6, so for any record the value of index starts at 0 and increments up to 5, thereby creating 6
records and assigning the 1st value of the vector to the 1st record, the 2nd value to the 2nd record, and so on. Finally we
get 6 records from each single record for a merchant.
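The behaviour described above can be modelled with a short Python sketch. It is not Ab Initio code: the functions length and normalize only imitate the two transform functions, and run_normalize is a hypothetical stand-in for the component that calls them. The sketch also carries merchant_name and transaction_flag through, which is an assumption about the target record.

```python
def length(rec):
    # Imitates the length function: every record expands to 6 output records.
    return 6


def normalize(rec, index):
    # Imitates the normalize function: pick the element at position index.
    return {
        "merchant_name": rec["merchant_name"],
        "transaction_flag": rec["transaction_flag"][index],
        "purchase_amount": rec["purchase_amount"][index],
        "sale_amount": rec["sale_amount"][index],
    }


def run_normalize(rec):
    # Hypothetical driver: index runs from 0 up to length(rec) - 1.
    return [normalize(rec, index) for index in range(length(rec))]


pantaloons = {
    "merchant_name": "Pantaloons",
    "transaction_flag": ["Y"] * 6,
    "purchase_amount": [6532, 8451, 7854, 7598, 7594, 9845],
    "sale_amount": [7584, 4851, 2561, 8546, 9865, 10653],
}
rows = run_normalize(pantaloons)
for row in rows:
    print(row)
```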
Normalizing Record with variable vector length:
Transformation for Normalize:
out :: length(in) =
begin
out :: in.length_of_amount_vector;
end;
out :: normalize(in, index) =
begin
out.merchant_name :: in.merchant_name;
out.transaction_flag :: in.transaction_flag[index];
out.purchase_amount :: in.purchase_amount[index];
out.sale_amount :: in.sale_amount[index];
end;
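With a variable-length vector, each input record can produce a different number of output records. The Python sketch below illustrates this under stated assumptions: the merchants "A" and "B" and their values are made-up sample data, and normalize_variable is a hypothetical helper, not an Ab Initio function.

```python
def normalize_variable(rec):
    """Yield one output row per vector element; the count comes from the record itself."""
    n = rec["length_of_amount_vector"]
    for index in range(n):
        yield (
            rec["merchant_name"],
            rec["purchase_amount"][index],
            rec["sale_amount"][index],
        )


# Made-up sample records with different vector lengths.
records = [
    {"merchant_name": "A", "length_of_amount_vector": 2,
     "purchase_amount": [10, 20], "sale_amount": [15, 25]},
    {"merchant_name": "B", "length_of_amount_vector": 3,
     "purchase_amount": [1, 2, 3], "sale_amount": [4, 5, 6]},
]
rows = [row for rec in records for row in normalize_variable(rec)]
print(len(rows))
```

Here the first record contributes 2 rows and the second contributes 3, so the output holds 5 rows in total.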
Transformation for denormalize using Rollup:
out :: rollup(in) =
begin
out.merchant_name :: in.merchant_name;
out.transaction_flag :: accumulation(in.transaction_flag);
out.purchase_amount :: accumulation(in.purchase_amount);
out.sale_amount :: accumulation(in.sale_amount);
out.newline :: in.newline;
end;
Note: The data must be sorted before the rollup. The key for the rollup is the field or fields
common to the records being grouped. The vector fields should be defined with the proper length,
and a group should not contain more records than the vector length.
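The grouping idea behind this denormalization can be sketched in Python with itertools.groupby, which, like a rollup on sorted input, only merges adjacent rows that share a key. This is an illustration under assumptions, not the Ab Initio implementation: the function name denormalize, the sample rows, and the overflow check are all hypothetical.

```python
from itertools import groupby
from operator import itemgetter

VECTOR_LENGTH = 6  # fixed vector length assumed for the target record


def denormalize(rows, vector_length=VECTOR_LENGTH):
    """Collect the rows of each merchant into vectors, one output record per merchant."""
    key = itemgetter("merchant_name")
    out = []
    # groupby merges only adjacent rows, so sort on the key first.
    for name, group in groupby(sorted(rows, key=key), key=key):
        group = list(group)
        if len(group) > vector_length:
            raise ValueError(f"{name}: {len(group)} rows exceed vector length {vector_length}")
        out.append({
            "merchant_name": name,
            "transaction_flag": [r["transaction_flag"] for r in group],
            "purchase_amount": [r["purchase_amount"] for r in group],
            "sale_amount": [r["sale_amount"] for r in group],
        })
    return out


rows = [
    {"merchant_name": "Shoppers Stop", "transaction_flag": "N", "purchase_amount": 0, "sale_amount": 0},
    {"merchant_name": "Pantaloons", "transaction_flag": "Y", "purchase_amount": 6532, "sale_amount": 7584},
    {"merchant_name": "Pantaloons", "transaction_flag": "Y", "purchase_amount": 8451, "sale_amount": 4851},
]
result = denormalize(rows)
print(result)
```

Because Python's sort is stable, the rows of each merchant keep their original relative order inside the vectors.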
Transformation to denormalize to Variable vector using Rollup:
out :: rollup(in) =
begin
out.merchant_name :: in.merchant_name;
out.number_of_tran :: count(1);
out.transaction_flag :: accumulation(in.transaction_flag);
out.purchase_amount :: accumulation(in.purchase_amount);
out.sale_amount :: accumulation(in.sale_amount);
out.newline :: in.newline;
end;
The number_of_tran field stores the length of the vector (which is a group-by count on the key).
Note: One very important thing to notice here is the vector definition
string("|")[int] transaction_flag;
Here the vector length is not defined as a fixed value.
Normalize transform for Vector within a vector:
out :: length(in) =
begin
out :: in.period_tran_count * 4;
end;
out :: normalize(in, index) =
begin
out.Name :: in.Name;
out.Month :: in.period[index/4].Month;
out.Amount :: in.period[index/4].Amount[index%4];
out.newline :: in.newline;
end;
Explanation: In this example, where we have a vector within a vector, the total number of records after normalization is
the product of the lengths of the two vectors. Take the source data for Farahan as an example.
Farahan spends money during two months, January and February. Based on the source DML, he and all others can
make 4 transactions each month (fixed-length vector). So the total number of records after normalization is 4 *
2 = 8.
Now for the data population part using index: for Farahan the value of index ranges from 0 to 7 (8 records). The
expression index / 4 selects the position in the outer vector (the month), and index % 4 selects the position in the
inner vector (the amount). In this way each value of index maps to the proper position in both vectors.
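The mapping from index to the two vector positions can be tabulated with a few lines of Python. This is only an arithmetic illustration: it assumes that index / 4 in the transform truncates to a whole number, which Python writes as index // 4, and the name index_map is hypothetical.

```python
INNER_LENGTH = 4  # transactions per month (fixed-length inner vector)


def index_map(period_tran_count, inner_length=INNER_LENGTH):
    """Return (index, outer position, inner position) for every output record."""
    total = period_tran_count * inner_length
    return [(index, index // inner_length, index % inner_length) for index in range(total)]


# Two months, as in the example: index runs from 0 to 7.
print("index  index/4  index%4")
for index, period_pos, amount_pos in index_map(2):
    print(f"{index:5d}  {period_pos:7d}  {amount_pos:7d}")
```

For indexes 0 to 3 the outer position is 0 (the first month) and for indexes 4 to 7 it is 1 (the second month), while the inner position cycles through 0, 1, 2, 3 in each month.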
Vector Functions
Here are some Vector functions:
Allocate function:
This function is synonymous with allocate_with_defaults() in current versions. It allocates specific values to both
vector and non-vector elements. The function takes no arguments. It comes in handy when initializing a
vector for a specific operation.
vector_sum function:
This function adds all the values inside a vector and returns the sum of its elements.
vector_product function:
This function returns the product of all the values in a vector. Usage: out.b = vector_product(vector_field);
vector_difference function:
This function returns the elements of the first vector that are not present in the second vector, based on a key field.
Below is an example:
let string("|")[4] vector_1 = [vector "a", "b", "c", "d"];
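The descriptions of vector_sum, vector_product and vector_difference can be illustrated with plain Python equivalents. These are analogues written for this explanation, not the Ab Initio functions themselves, and the vector_difference sketch compares whole values rather than a key field, which is a simplifying assumption.

```python
import math


def vector_sum(values):
    # Sum of all elements in the vector.
    return sum(values)


def vector_product(values):
    # Product of all elements in the vector.
    return math.prod(values)


def vector_difference(first, second):
    # Elements of the first vector that do not appear in the second, in original order.
    excluded = set(second)
    return [v for v in first if v not in excluded]


print(vector_sum([1, 2, 3]))
print(vector_product([2, 3, 4]))
print(vector_difference(["a", "b", "c", "d"], ["b", "d"]))
```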
This code can be written as a statement in the transform (xfr) of a Normalize or Reformat component.
Code:
begin
max_length_of_inner_vector = in.period[0].amount_tran_count;
max_length_of_inner_vector = in.period[i].amount_tran_count;
end;
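The variable name in the fragment above suggests that it finds the largest inner vector length across all periods. Under that assumption, a Python sketch of the idea looks like this; the loop and the comparison are not shown in the fragment and are filled in here as a guess, and the function name and sample data are hypothetical.

```python
def max_inner_vector_length(periods):
    """Return the largest amount_tran_count found across all periods."""
    # Start from the first period, as in the first assignment of the fragment.
    max_length = periods[0]["amount_tran_count"]
    for i in range(1, len(periods)):
        # Assumed comparison: keep the larger of the two counts.
        if periods[i]["amount_tran_count"] > max_length:
            max_length = periods[i]["amount_tran_count"]
    return max_length


# Made-up sample: three periods with different inner vector lengths.
periods = [
    {"amount_tran_count": 2},
    {"amount_tran_count": 4},
    {"amount_tran_count": 3},
]
print(max_inner_vector_length(periods))
```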
Some Points to remember
A vector length is specified by [ ] after the data definition for a field.
The vector examples used in this presentation are all delimited
vectors. Apart from delimited vectors, there can also be fixed
column size vectors.
Vectors are a good way of storing data because they eliminate
redundancy. However, they do not perform well when transforming
data. For large transforms it is advisable either to normalize the
data first or to read the data in non-vector, denormalized form.
Apart from the types of vectors mentioned in this presentation, there
is also a kind of vector, called a delimited vector, whose length is
defined by delimiter characters in the data. This is not common
practice in real scenarios, so such vectors are seldom used.
THANK YOU