Big Data Testing
Testing
Big data
Mobile BI
Big Data
3 Vs
Volume, Velocity, and Variety
Silent 4th V
Value
HDFS CLI
mkdir - Takes path URIs as arguments and creates the directory or directories.
hadoop fs -mkdir <paths>
E.g. hadoop fs -mkdir /user/hadoop/dir1 /user/hadoop/dir2
hadoop fs -mkdir hdfs://nn1.example.com/user/hadoop/dir
ls - Lists the contents of a directory.
hadoop fs -ls <args>
E.g. hadoop fs -ls /user/hadoop/dir1
put - Copies a single src file, or multiple src files, from the local file system to the Hadoop distributed file system.
hadoop fs -put <localsrc> ... <HDFS_dest_Path>
get - Copies/downloads files from HDFS to the local file system.
hadoop fs -get <hdfs_src> <localdst>
NoSQL Characteristics
1. No fixed schema
2. Highly scalable
3. High availability
4. Fast access
Reference architecture
Different layers
1. Source-to-Hadoop data ingestion (EL)
2. Hadoop processing using MapReduce, Pig, Hive (T)
3. Loading of processed data from Hadoop to the EDW
4. BI reports and analytics using the processed data stored in Hadoop, via Hive
Validation of Hadoop processing
Validate by firing queries against HDFS using Hive.
Validation of reports (Hive/EDW)
Validate by running SQL against the EDW; normal report-testing approach.
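One way to sketch the Hive-vs-EDW result comparison in Python. All names and row data here are hypothetical; it assumes both query results have already been fetched (e.g. via their JDBC/ODBC drivers) into lists of tuples:

```python
# Sketch: compare a result set fetched via Hive against the same query run on the EDW.
# Row values below are made-up sample data, not from any real system.
from collections import Counter

def compare_results(hive_rows, edw_rows):
    """Return rows present in one source but not the other, with multiplicity."""
    hive_counts = Counter(hive_rows)
    edw_counts = Counter(edw_rows)
    only_in_hive = hive_counts - edw_counts
    only_in_edw = edw_counts - hive_counts
    return dict(only_in_hive), dict(only_in_edw)

# Hypothetical fetched rows: (customer_id, total)
hive_rows = [("cust1", 100), ("cust2", 250), ("cust3", 75)]
edw_rows = [("cust1", 100), ("cust2", 250)]
missing_in_edw, missing_in_hive = compare_results(hive_rows, edw_rows)
```

Using `Counter` (a multiset) rather than `set` catches duplicate-row mismatches, which plain set difference would silently hide.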
Challenges
Volume - Prepare comparison scripts to validate the data. To reduce run time, execute all the comparison scripts in parallel, just as the data itself is processed in MapReduce.
https://community.informatica.com/solutions/file_table_compare_utility_hdfs_hive
Alternatively, sample the data, ensuring maximum scenario coverage.
Variety - Unstructured data can be converted to structured form and then compared.
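The "run comparisons in parallel" idea above can be sketched with Python's multiprocessing, one worker per partition. The (source, target) file pairing is a hypothetical stand-in for real HDFS part files pulled to a staging area:

```python
# Sketch: compare partitioned source/target files in parallel, one worker per pair.
# File paths are assumed to be local copies of HDFS part files (hypothetical setup).
import hashlib
from multiprocessing import Pool

def compare_partition(pair):
    """Hash both files of a (source, target) pair and report whether they match."""
    src, tgt = pair

    def digest(path):
        h = hashlib.sha256()
        with open(path, "rb") as f:
            # Read in 1 MiB chunks so large part files do not exhaust memory.
            for chunk in iter(lambda: f.read(1 << 20), b""):
                h.update(chunk)
        return h.hexdigest()

    return (src, tgt, digest(src) == digest(tgt))

def compare_all(pairs, workers=4):
    """Fan the per-partition comparisons out across a worker pool."""
    with Pool(workers) as pool:
        return pool.map(compare_partition, pairs)
```

Hashing whole files only proves byte-for-byte equality; if row order can differ between source and target, a sort or a row-level multiset comparison is needed instead.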
Summary
1. Data ingestion into Hadoop via Sqoop, Flume, Kafka
2. Data processing within Hadoop using MapReduce, Pig, Hive (or ETL tools like Informatica Big Data Edition, Talend)
3. Reporting using tools like Tableau, MicroStrategy (via Hive)
4. Loading of data from Hadoop to an EDW (Teradata/Oracle) or analytical database (Greenplum/Netezza)
Use cases
1. ETL processing moved to Hadoop to take advantage of processing structured/unstructured data
2. Machine learning over Hadoop, e.g. recommendation engines (Amazon, Flipkart)
3. Fraud detection in the credit card or insurance industry
4. Retail: understanding customers' buying patterns, market-basket analysis, etc.
Initial data
Transaction 1: cracker, icecream, soda
Transaction 2: chicken, pizza, coke, bread
Transaction 3: baguette, soda, herring, cracker, soda
Transaction 4: bourbon, coke, turkey, bread
Transaction 5: sardines, soda, chicken, coke
Transaction 6: apples, peppers, avocado, steak
Transaction 7: sardines, apples, peppers, avocado, steak
Data setup
< (cracker, icecream), (cracker, soda), (icecream, soda) >
< (chicken, pizza), (chicken, coke), (chicken, bread), (pizza, coke) .. >
< (baguette, soda), (baguette, herring), (baguette, cracker), (baguette, soda) >
< (bourbon, coke), (bourbon, turkey) >
< (sardines, soda), (sardines, chicken), (sardines, coke) >
Map phase
< ((cracker, icecream), 1), ((cracker, soda), 1), ((icecream, soda), 1) >
< ((chicken, pizza), 1), ((chicken, coke), 1), ((chicken, bread), 1), ((pizza, coke), 1) >
Output
((cracker, icecream), 1)
((cracker, soda), 1) ((cracker, soda), 1)
((key), value)
Reduce phase
Input: ((cracker, icecream), <1,1,1,..>)
((cracker, soda), <1,1>)
Output: ((cracker, icecream), 540)
((cracker, soda), 240)
Result:
1. Ice cream should be placed near crackers.
2. Create combo offers for the combination to increase sales.
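The map and reduce steps above can be simulated locally in plain Python (this is a sketch, not actual MapReduce; the transactions are the sample data from the slides, while counts like 540 on the slide come from a larger dataset):

```python
# Sketch: local simulation of the pair-counting MapReduce shown above.
from collections import defaultdict
from itertools import combinations

# Sample transactions from the slides (first three only, for brevity).
transactions = [
    ["cracker", "icecream", "soda"],
    ["chicken", "pizza", "coke", "bread"],
    ["baguette", "soda", "herring", "cracker"],
]

def map_phase(transaction):
    """Emit ((item_a, item_b), 1) for every distinct item pair in one transaction."""
    return [((a, b), 1) for a, b in combinations(sorted(set(transaction)), 2)]

def reduce_phase(mapped):
    """Sum the 1s per pair key, like the reducer aggregating the <1,1,..> lists."""
    counts = defaultdict(int)
    for key, value in mapped:
        counts[key] += value
    return dict(counts)

mapped = [pair for t in transactions for pair in map_phase(t)]
pair_counts = reduce_phase(mapped)
```

Sorting each transaction's items before pairing ensures (cracker, soda) and (soda, cracker) land under one key, so the reducer aggregates them together, just as the shuffle phase would group identical keys in a real job.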
What next
1. Learn Java and MapReduce (not mandatory, but will definitely help)
2. Learn Hadoop ("Hadoop: The Definitive Guide" by Tom White; "Hadoop in Action" by Chuck Lam)
3. Install a free VM (Cloudera/Hortonworks)
4. Learn some Pig and Hive
5. Plenty of tutorials on the net
Thank you
Questions?