Manage Your Data Science Project Structure in Early Stage
towardsdatascience.com/manage-your-data-science-project-structure-in-early-stage-95f91d4d0600
September 22, 2018
Jupyter Notebook (or Colab, Databricks notebooks, etc.) provides a very efficient way to build a project in a short time. We can create any Python class or function in the notebook without re-launching the kernel, which shortens the waiting time. It is good for small-scale projects and experiments. However, it may not be good for long-term growth.
“white and multicolored building scale model” by Alphacolor 13 on Unsplash
Intuitively, velocity is the major consideration when we are doing a Proof of Concept (PoC). When working on one of my previous projects, I explored whether text summarization techniques could be applied to my data science problem. I did not intend to build anything “beautiful” or “structured” for my PoC because I did not know whether it would be useful.
In other words, I took a “quick-and-dirty” approach to the PoC, as no one cares whether it is well structured or whether the performance is optimized. From a technical point of view, I may create functions inside the notebook. The benefit is that I do not need to handle any external files (e.g. loading other Python files).
Packaging Solution
After several rounds of PoC, a solution should be identified. From this moment on, code refactoring is necessary for multiple reasons. It does not just keep the code well organized; it also helps to achieve better predictions.
Modularization
“editing video screengrab” by Vladimir Kudinov on Unsplash
First, you have to make your experiments repeatable. When you finalize your work (at least the initial version), you may need to work with other team members to combine results or build an ensemble model. Other members may need to check out and review your code, and it is not a good idea if they cannot reproduce your work.
Another reason is hyperparameter tuning. I do not focus on hyperparameter tuning in the early stage, as it may consume too many resources and too much time. Once the solution is confirmed, it is a good time to find better hyperparameters before launching the prediction service. This means not only tuning the number of neurons in the neural network or the dimension of the embedding layer, but also trying different architectures (for instance, comparing GRU, LSTM and attention mechanisms) or other reasonable changes. Therefore, modularizing your processing, training and metric-evaluation functions is an important step to manage this kind of tuning; a minimal sketch follows below.
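For illustration only, here is a minimal sketch of what a modularized training entry point could look like, assuming Keras is used as in the article. The function and parameter names (build_model, train, rnn_type, embedding_dim, units) are hypothetical, not part of the original project.

```python
# A minimal sketch: the architecture itself becomes a tunable hyperparameter.
from tensorflow.keras import layers, models

def build_model(vocab_size, embedding_dim=128, rnn_type="lstm", units=64):
    """Build a simple sequence classifier; rnn_type selects the architecture."""
    rnn_layer = {"lstm": layers.LSTM, "gru": layers.GRU}[rnn_type]
    model = models.Sequential([
        layers.Embedding(vocab_size, embedding_dim),
        rnn_layer(units),
        layers.Dense(1, activation="sigmoid"),
    ])
    model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
    return model

def train(x_train, y_train, **hyperparams):
    """Keeping training behind one function makes hyperparameter sweeps easy."""
    model = build_model(**hyperparams)
    model.fit(x_train, y_train, epochs=3, batch_size=32)
    return model

# Example sweep: train(x_train, y_train, vocab_size=20000, rnn_type="gru")
```

With this kind of split, comparing GRU against LSTM is just a different argument rather than a different notebook.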
Furthermore, you may need to ship your code to a different cluster (e.g. Spark, or a self-built distributed system) for tuning. For various reasons, I did not use dist-keras (a distributed Keras training library). Instead, I built a simple cluster by myself to speed up distributed training, and it supports Keras, scikit-learn and other ML/DL libraries. For this, it is better to package the implementation as Python modules instead of notebooks.
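As one possible way to make the code installable on a cluster, here is a minimal packaging sketch; the package name and dependency list are hypothetical illustrations, not the article's actual setup.

```python
# setup.py -- minimal sketch so the source modules can be installed with `pip install .`
from setuptools import setup, find_packages

setup(
    name="my_ds_project",            # hypothetical project name
    version="0.1.0",
    packages=find_packages(),        # picks up the Python modules
    install_requires=["pandas", "scikit-learn"],
)
```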
Recommendation
After delivering several projects, I revisited papers and project structure templates, and followed an Agile project management strategy. I now shape my project structure in the following style. It gives me a well-organized project and much less pain in every project iteration or sprint.
In the diagram, a rounded rectangle means a folder. Python code, model files, data and notebooks should be put under the corresponding folder.
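As a plain-text sketch, the layout described in the sections below looks like this:

```
project
├── src
│   ├── preparation
│   ├── processing
│   └── modeling
├── test
├── model
├── data
│   ├── raw
│   └── processed
└── notebook
    ├── eda
    ├── poc
    ├── modeling
    └── evaluation
```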
src
Stores source code (Python, R, etc.) which serves multiple scenarios. During data exploration and model training, we have to transform data for a particular purpose, and we have to use the same code to transform data during online prediction as well. So it is better to separate this code from the notebooks so that it can serve different purposes.
preparation: Data ingestion, such as retrieving data from CSV, relational databases, NoSQL, Hadoop, etc. We have to retrieve data from multiple sources all the time, so it is better to have dedicated functions for data retrieval.
processing: Data transformation, as the source data does not always fit what the model needs. Ideally, we would have clean data, but I never get it. You may say that the data engineering team should help with data transformation; however, we may not know what we need while we are still studying the data. One important requirement is that both offline training and online prediction should use the same pipeline to reduce misalignment.
modeling: Model building, such as tackling a classification problem. It should include not just the model training part but also the evaluation part. We also have to think about the multiple-models scenario. A typical use case is an ensemble model, such as combining a Logistic Regression model and a Neural Network model. A minimal sketch of these modules follows after this list.
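To make the split concrete, here is a minimal sketch of how the three modules could look. The module and function names (preparation.load_raw, processing.build_features, modeling.train_and_evaluate) and the 'text' column are hypothetical illustrations, not the article's actual code.

```python
# src/preparation.py -- data ingestion
import pandas as pd

def load_raw(csv_path: str) -> pd.DataFrame:
    """Read a raw CSV extract into a DataFrame."""
    return pd.read_csv(csv_path)

# src/processing.py -- data transformation shared by offline training and online prediction
def build_features(df: pd.DataFrame) -> pd.DataFrame:
    df = df.dropna()
    df["text_length"] = df["text"].str.len()  # assumes a 'text' column exists
    return df

# src/modeling.py -- model building and evaluation kept together
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

def train_and_evaluate(x_train, y_train, x_test, y_test):
    model = LogisticRegression(max_iter=1000)
    model.fit(x_train, y_train)
    return model, accuracy_score(y_test, model.predict(x_test))
```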
test
In R&D, data science focuses on building the model rather than making sure everything works well in unexpected scenarios. However, that becomes a problem when deploying the model behind an API. Test cases also guard against backward-compatibility issues, although they take time to implement.
Test cases assert the behavior of the Python source code and make sure there are no bugs when the code changes. Rather than relying on manual testing, automated testing is an essential piece of a successful project. Teammates will have the confidence to modify code, knowing that the test cases validate that their changes do not break previous usage. A minimal sketch follows below.
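For illustration, here is a minimal pytest sketch against the hypothetical build_features function from the earlier src sketch; the file name and module layout are assumptions.

```python
# test/test_processing.py -- minimal automated test sketch (run with `pytest`)
import pandas as pd
from src.processing import build_features  # assumed module layout

def test_build_features_adds_text_length():
    df = pd.DataFrame({"text": ["hello", "data science"]})
    result = build_features(df)
    assert "text_length" in result.columns
    assert result["text_length"].tolist() == [5, 12]
```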
model
Folder for storing binary model files (JSON or other formats) for local use.
Store intermediate results here only. For the long term, they should be stored separately in a model repository. Besides the binary model, you should also store model metadata such as the date and the size of the training data; a minimal sketch follows below.
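As one possible way to do this, here is a minimal sketch of persisting a model together with its metadata locally; the file names and metadata fields are illustrative assumptions.

```python
# Minimal sketch: write the binary model and a small metadata file side by side.
import json
from datetime import date
import joblib

def save_model(model, x_train, model_dir="model"):
    joblib.dump(model, f"{model_dir}/model.joblib")
    metadata = {
        "date": date.today().isoformat(),      # when the model was trained
        "training_rows": len(x_train),         # size of training data
    }
    with open(f"{model_dir}/metadata.json", "w") as f:
        json.dump(metadata, f, indent=2)
```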
data
Folder for storing subsets of data for experiments. It includes both raw data and processed data for temporary use.
raw: Stores the raw result generated by the “preparation” folder code. My practice is to store a local subset copy rather than retrieving data from the remote data store every time. It guarantees that you have a static dataset for the rest of the work. Furthermore, it isolates you from data platform instability and network latency issues.
processed: To shorten model training time, it is a good idea to persist the processed data. It should be generated by the “processing” folder code. See the sketch after this list.
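Here is a minimal sketch of caching a raw subset and the processed features locally, reusing the hypothetical helpers from the earlier src sketch; paths, sample size and the parquet format are illustrative assumptions.

```python
import pandas as pd
from src.preparation import load_raw       # assumed module layout
from src.processing import build_features  # assumed module layout

# Keep a static local subset instead of hitting the remote store every run.
raw = load_raw("remote_export.csv").sample(n=10_000, random_state=42)
raw.to_csv("data/raw/sample.csv", index=False)

# Persist processed features to shorten later training runs (parquet needs pyarrow).
processed = build_features(raw)
processed.to_parquet("data/processed/features.parquet")
```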
notebook
eda: Exploratory Data Analysis (aka data exploration) is a step for exploring what you have available for later steps. For the short term, it should show what you explored; a typical example is showing the data distribution (a minimal sketch follows after this list). For the long term, the findings should be stored in a centralized place.
poc: Sometimes you have to do a PoC (Proof of Concept). It can live here for temporary purposes.
modeling: Notebooks containing your core work, including model building and training.
evaluation: Besides modeling, evaluation is another important step, but a lot of people are not aware of it. To earn the trust of the product team, we have to demonstrate how good the model is.
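As a small illustration of an EDA cell showing a data distribution, here is a sketch; the column name and file path are assumptions carried over from the earlier examples.

```python
# Minimal EDA sketch: plot the distribution of text length in the raw subset.
import pandas as pd
import matplotlib.pyplot as plt

df = pd.read_csv("data/raw/sample.csv")
df["text_length"] = df["text"].str.len()
df["text_length"].hist(bins=30)
plt.title("Distribution of text length")
plt.show()
```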
Take Away
To access the project template, you can visit this GitHub repo.
The structure above is good for small and medium-size data science projects.
For a large-scale data science project, it should include other components such as a feature store and a model repository. I will write a blog post about that part later.
Modify it according to your situation.
About Me
I am a Data Scientist in the Bay Area, focusing on the state of the art in Data Science and Artificial Intelligence, especially NLP and platform-related topics. You can reach me on my Medium blog or GitHub.
Reference
Microsoft Data Science Project Template