This repo is a hands-on lab for scalable machine learning with Spark MLlib on Google Cloud, powered by Dataproc Serverless Spark. It also showcases integration with Vertex AI, Google Cloud's AI/ML platform. The focus is on demystifying the products and their integration rather than on building a perfect model, so the lab features a minimum viable end-to-end machine learning use case.
The lab is fully scripted (no research needed), with fully automated environment setup, data, code, commands, notebooks, orchestration, and configuration. Clone the repo and follow the step-by-step instructions for an end-to-end MLOps experience.
Expect to spend ~8 hours to fully understand and execute the lab if you are new to GCP and the services, and at least ~6 hours otherwise.
L300 - framework (Spark), services/products, integration
The intended audience is anyone with access to Google Cloud and an interest in the use case, products, and features showcased.
Knowledge of Apache Spark, Machine Learning, and GCP products would be beneficial but is not entirely required, given the format of the lab. Access to Google Cloud is a must unless you want to just read the content.
Simplify your learning and adoption journey of our product stack for scalable data science with:
- Just enough product knowledge of Dataproc Serverless Spark & Vertex AI integration for machine learning at scale on Google Cloud
- Quick start code for ML at scale with Spark that can be repurposed for your data and ML experiments
- Terraform for provisioning a variety of Google Cloud data services in the Spark ML context, that can be repurposed for your use case
Telco Customer Churn Prediction with a Kaggle dataset and the Spark MLlib Random Forest Classifier
For your convenience, all the code is pre-authored, so you can focus on understanding product features and integration.
Complete the lab modules in sequence. For a better lab experience, read all the modules before you start working on them.
Although the ML use case in this lab does not need a custom container image, the lab includes container image creation and usage for educational purposes.
Shut down/delete resources when done to avoid unnecessary billing.
# | Google Cloud Collaborators | Contribution |
---|---|---|
1. | Anagha Khanolkar | Creator |
2. | Dr. Thomas Abraham | ML consultation, testing, best practices and feedback |
3. | Rob Vogelbacher, Proshanta Saha | ML consultation |
4. | Ivan Nardini, Win Woo | ML consultation, inspiration through samples and blogs |
Community contributions to improve the lab are very much appreciated.
If you have any questions or find any problems with this repository, please report them through GitHub issues.