Data Science Workflow

In the previous article, you got an overview of data science. A well-implemented data science workflow can help you achieve the same results in much less time. This article breaks that workflow down into a set of steps.

What is a data science workflow?

A workflow is a set of defined steps to achieve a goal. Most real-world problems are complex and difficult to track end to end. A workflow helps break the problem into smaller chunks and makes the progress of the overall project measurable.


There is no single standard workflow design; different industries have their own customised ways of following a data science process. However, the CRISP-DM (Cross-Industry Standard Process for Data Mining) model is quite popular.

(CRISP-DM process diagram; image source: Wikipedia)

There are 6 major steps in this workflow:


Business Understanding

The idea here is to understand the domain or the business problem. Data science is not just about gathering data and publishing results; the output has to answer a question the business actually cares about.



In a nutshell, know your audience first.



Data Understanding

Once the objective is clear, the next step is to collect the data and check its quality. Two questions worth asking early:


Can you have a single source of data?

How can you improve data quality?
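
As a rough first pass at data quality, a sketch like the following can surface missing values, duplicates, and type problems. It assumes pandas is installed, and the customers.csv file name is purely illustrative:

import pandas as pd

# Load a hypothetical raw extract (the file name is illustrative only)
df = pd.read_csv("customers.csv")

# Basic shape and column types
print(df.shape)
print(df.dtypes)

# Missing values per column
print(df.isna().sum())

# Count of fully duplicated rows
print(df.duplicated().sum())

# Quick summary statistics for numeric columns
print(df.describe())

The answers to these checks feed directly back into the business conversation: a column full of missing values usually means a question for whoever owns that data source.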


Business understanding and data understanding generally happen together, as you may need to go back to the business side to make sense of the data, and vice versa.



Data Preparation

After data collection, the next task is to combine data from different sources into a single dataset, clean it, and perform the necessary transformations. This is generally the most time-consuming of all the steps.
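
To make that combine-clean-transform flow concrete, here is a minimal sketch using pandas. The file names, the customer_id key, and the order_amount and signup_date columns are all hypothetical stand-ins for your own data:

import pandas as pd

# Combine two hypothetical sources on a shared key
customers = pd.read_csv("customers.csv")
orders = pd.read_csv("orders.csv")
df = customers.merge(orders, on="customer_id", how="left")

# Clean: drop exact duplicates, fill missing order amounts with 0
df = df.drop_duplicates()
df["order_amount"] = df["order_amount"].fillna(0)

# Transform: parse dates and derive a feature a model can use
df["signup_date"] = pd.to_datetime(df["signup_date"])
df["signup_year"] = df["signup_date"].dt.year

# Persist one consolidated, model-ready file
df.to_csv("prepared.csv", index=False)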



Good data preparation can significantly enhance the performance of the whole project, especially the models built on top of it.



Modeling

Modeling is the phase where standard algorithms are trained on the prepared data: machine learning algorithms are applied to the dataset to learn patterns and generate rules, and the resulting models are the output of this step.


If the data is not well prepared, the models can be biased, and that bias will show up in the predictions; modeling and data preparation therefore depend on each other.


Do you observe that some algorithms perform better than others? A quick way to find out is to train a few of them side by side, as in the sketch below.
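
This sketch trains three standard scikit-learn classifiers on the same data and compares their cross-validated accuracy. The synthetic dataset from make_classification is only a placeholder for your own prepared feature matrix X and labels y:

from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for your prepared data
X, y = make_classification(n_samples=1000, n_features=20, random_state=42)

models = {
    "logistic_regression": LogisticRegression(max_iter=1000),
    "decision_tree": DecisionTreeClassifier(random_state=42),
    "random_forest": RandomForestClassifier(random_state=42),
}

# 5-fold cross-validated accuracy for each candidate algorithm
for name, model in models.items():
    scores = cross_val_score(model, X, y, cv=5)
    print(f"{name}: {scores.mean():.3f}")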



Evaluation

Modeling without evaluation makes little sense, so once you shortlist the better-performing algorithms, you should test the output of these models on data they have not seen before.


A common practice is to hold out part of the data for testing. There is no hard rule that the train-test split must be 80-20; it can be 70-30 or other values.

Does the model perform noticeably worse on the test data than on the training data? If yes, can that gap be measured and improved by revisiting the modeling or data preparation phase?


Answers to these questions will help you understand how good your evaluation is. If you are satisfied, the final stage is deployment.
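
As a minimal sketch of the hold-out idea above, the following splits a dataset 80-20, trains a model, and compares training and test scores so that a large gap (a sign of overfitting) is visible. As before, the synthetic dataset stands in for your own prepared data:

from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, classification_report
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=1000, n_features=20, random_state=42)

# 80-20 split; test_size=0.3 would give a 70-30 split instead
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

model = RandomForestClassifier(random_state=42)
model.fit(X_train, y_train)

# Compare training and test performance to spot overfitting
print("train accuracy:", accuracy_score(y_train, model.predict(X_train)))
print("test accuracy:", accuracy_score(y_test, model.predict(X_test)))
print(classification_report(y_test, model.predict(X_test)))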



Deployment


Once you deploy your data science model, business stakeholders can test it and see how it performs in practice.
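
Deployment can take many forms, from batch scoring jobs to full production services. One lightweight option is to persist the trained model and expose it behind a small HTTP endpoint. The sketch below uses joblib and Flask purely as an illustration; the model.joblib file name and the /predict route are assumptions, not a prescribed setup:

import joblib
from flask import Flask, jsonify, request

# Earlier, after evaluation, the trained model would be saved once:
# joblib.dump(model, "model.joblib")

app = Flask(__name__)
model = joblib.load("model.joblib")  # hypothetical saved model file

@app.route("/predict", methods=["POST"])
def predict():
    # Expects a JSON body like {"features": [[0.1, 0.2, ...]]}
    features = request.get_json()["features"]
    prediction = model.predict(features).tolist()
    return jsonify({"prediction": prediction})

if __name__ == "__main__":
    app.run(port=5000)

Stakeholders (or a front-end application) can then send records to the endpoint and inspect the predictions, which is often where the next round of feedback for the workflow begins.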



How much time do I need to spend?

It depends entirely on the project. At the same time, the proportion of effort per step changes with the project's scope and maturity. Most real projects put heavy emphasis on data collection and preparation, whereas learning projects often start from already-prepared data.


In real projects, data collection and preparation typically dominate the schedule and are often cited as taking well over half of the total time, with modeling, evaluation, and deployment sharing the remainder.

What next?

You now have a brief idea of the steps needed to complete a data science project. A good next step is to pick a dataset from a topic that interests you and apply what you have learned.