While all the steps are equally important, let us discuss each one in further detail.
Get started with the first stage of CRISP-ML(Q):
This stage is pivotal to the success of the project: garbage in, garbage out!
If the business problem is misunderstood, the organization drifts away from the problem at hand, which can eventually lead to catastrophic outcomes. Defining the business problem properly leads to assessing the as-is processes from the right perspective, so that the data mining problem is clearly defined. To set a project up for success, the Project Charter, the first document created on any project, is prepared and signed off by the sponsor. It serves as a mandatory input for crafting a detailed project management plan.
A KPMG survey says that most organizations (~70%) have experienced at least one project failure in the last year.
An IBM CTO says that most data science projects (~87%) never see the light of production.
The Four Key Steps of Business Understanding Phase of CRISP-ML(Q)
For data-driven decision making, one should understand data collection and the various sources that generate the data.
Data is generated and collected from a wide range of sources, including Primary Data Sources (surveys, design of experiments, simulations) and Secondary Data Sources (RDBMS, Industrial IoT sensors, etc.). Once the data is collected, the team should describe it and document a data dictionary, so that every team member is well informed about the variables and data used for further analysis.
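As a rough illustration of documenting a data dictionary, the sketch below summarizes each column of a small CSV extract. The sample sensor data, the column names and the helper names `infer_type` and `build_data_dictionary` are hypothetical and chosen only for this example; a real project would read from its own sources and likely record business definitions alongside these technical fields.

```python
import csv
import io

# Hypothetical extract from an IoT sensor table (illustrative data only).
SAMPLE_CSV = (
    "sensor_id,temperature_c,status\n"
    "1,21.5,ok\n"
    "2,,ok\n"
    "3,19.0,fault\n"
)


def infer_type(values):
    """Guess a simple type label from the non-missing string values."""
    if not values:
        return "unknown"
    for caster, label in ((int, "integer"), (float, "float")):
        try:
            for value in values:
                caster(value)
            return label
        except ValueError:
            continue
    return "text"


def build_data_dictionary(rows):
    """Summarize every column: type, missing count, distinct count, example."""
    if not rows:
        return {}
    dictionary = {}
    for column in rows[0].keys():
        values = [row[column] for row in rows]
        present = [v for v in values if v not in ("", None)]
        dictionary[column] = {
            "type": infer_type(present),
            "missing": len(values) - len(present),
            "distinct": len(set(present)),
            "example": present[0] if present else None,
        }
    return dictionary


rows = list(csv.DictReader(io.StringIO(SAMPLE_CSV)))
data_dictionary = build_data_dictionary(rows)
for column, summary in data_dictionary.items():
    print(column, summary)
```

The printed summary can be pasted into the project documentation and extended with a plain-language description of each variable.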
IDC forecasts that connected IoT devices will generate ~79.4 zettabytes of data in 2025.
By 2025, the total volume of data is expected to reach 175 zettabytes. Other research says that the amount of data will double every year from 2022.
The Two Key Steps of Data Understanding Phase of CRISP-ML(Q)
The majority of the effort is spent on this step. The more effort invested in this stage, the easier the model building stage will be.
One of the activities in Data Preparation goes by many names, such as Data Munging, Data Wrangling and Data Cleansing. Given the wide variety of data sources and formats, bringing all of them into a common format ensures that the next steps are smooth sailing. Exploring the data and computing descriptive statistics helps strike gold: these insights help the business address low-hanging problems for lightning-fast results, and help list the quick wins, short-term wins and long-term wins. Another aim of this step is to get the data into the format needed for building models. Finally, we shortlist the critical few features from the trivial many using various feature selection techniques, alongside carefully curated features produced by feature engineering.
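To make the descriptive statistics and feature shortlisting concrete, here is a minimal sketch. It assumes a simple filter-style selection that ranks features by absolute Pearson correlation with the target; this is only one of many feature selection techniques and is not prescribed by CRISP-ML(Q). The feature names and values are made up for illustration.

```python
import statistics


def describe(values):
    """Basic descriptive statistics for one numeric column."""
    return {
        "mean": statistics.mean(values),
        "median": statistics.median(values),
        "stdev": statistics.pstdev(values),
        "min": min(values),
        "max": max(values),
    }


def pearson(xs, ys):
    """Pearson correlation; returns 0.0 when either column is constant."""
    mean_x, mean_y = statistics.mean(xs), statistics.mean(ys)
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = sum((x - mean_x) ** 2 for x in xs) ** 0.5
    spread_y = sum((y - mean_y) ** 2 for y in ys) ** 0.5
    if spread_x == 0 or spread_y == 0:
        return 0.0
    return covariance / (spread_x * spread_y)


def select_features(features, target, top_k):
    """Keep the top_k features with the strongest absolute correlation."""
    scores = {name: abs(pearson(column, target)) for name, column in features.items()}
    ranked = sorted(scores, key=scores.get, reverse=True)
    return ranked[:top_k]


# Hypothetical prepared data: three candidate features and one target.
features = {
    "machine_age_years": [1, 2, 3, 4, 5],
    "ambient_noise": [3, 1, 4, 1, 5],
    "plant_code": [7, 7, 7, 7, 7],  # constant, carries no signal
}
maintenance_cost = [10, 20, 30, 40, 50]

print(describe(maintenance_cost))
selected = select_features(features, maintenance_cost, top_k=2)
print(selected)
```

Correlation-based filtering only captures linear relationships, so in practice it is usually combined with domain knowledge and other selection methods.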
60% to 80% of the data scientists’ time is spent on data cleansing.
15% to 25% of the data scientists’ time is spent on modelling.
5% to 15% of the time is spent on deploying the model into production.
The Five Key Steps of Data Preparation Phase of CRISP-ML(Q)
The proof of the pudding is in the eating. In this stage we identify which of the various machine learning models are applicable to the business problem at hand. That in turn helps organizations take proactive as well as strategic decisions.
Given the business problem and the data available, we must decide which supervised or unsupervised learning techniques are apt. Deciding on evaluation techniques that match the objective helps the business trust the solution in production. Applying regularization techniques and hyperparameter tuning during modelling is imperative for success. Finally, meeting the success criteria defines the success of the project.
Unsupervised Learning
Supervised Learning
Semi-Supervised Learning
Forecasting and Time Series
Self-Supervised Learning
Reinforcement Learning
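The regularization and hyperparameter tuning mentioned in this stage can be sketched with a deliberately small example: a one-feature ridge regression whose penalty strength is chosen by a grid search on a hold-out set. The data, the grid values and the function names are assumptions made for illustration; real projects would typically rely on an established library and cross-validation.

```python
def fit_ridge(xs, ys, penalty):
    """Fit y = slope * x + intercept with an L2 penalty on the slope."""
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / (sxx + penalty)  # a larger penalty shrinks the slope toward 0
    intercept = mean_y - slope * mean_x
    return slope, intercept


def mse(model, xs, ys):
    """Mean squared error of a fitted (slope, intercept) model."""
    slope, intercept = model
    return sum((slope * x + intercept - y) ** 2 for x, y in zip(xs, ys)) / len(xs)


def tune_penalty(train_x, train_y, valid_x, valid_y, grid):
    """Grid search: return (penalty, validation error, model) with lowest error."""
    best = None
    for penalty in grid:
        model = fit_ridge(train_x, train_y, penalty)
        error = mse(model, valid_x, valid_y)
        if best is None or error < best[1]:
            best = (penalty, error, model)
    return best


# Hypothetical training and validation data.
train_x, train_y = [1, 2, 3, 4], [2.1, 3.9, 6.2, 7.8]
valid_x, valid_y = [5, 6], [10.1, 11.9]
grid = [0.0, 0.1, 1.0, 10.0]

best_penalty, best_error, best_model = tune_penalty(train_x, train_y, valid_x, valid_y, grid)
print(best_penalty, round(best_error, 4), best_model)
```

The penalty is the hyperparameter here: it is not learned from the training data but chosen by comparing validation error across candidate values.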
38.76% is the growth rate of the global machine learning market between 2020 and 2030. - Market Research Future
38% is the share of US jobs that could be automated by 2030. - PwC
The Six Key Steps of Machine Learning Phase of CRISP-ML(Q)
Identifying the metric for measuring the model's efficacy and performance is a key factor in ensuring its usability in a production environment.
Experimentation is the key to identifying the model that fits the business, with the right parameters. The model with the least error is not always the right one for the business problem being solved. It is critical to select a model that is accurate, aligned with business objectives and within business constraints.
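One way to picture "least error is not always right" is to filter candidate models by business constraints first and only then compare accuracy. The sketch below assumes two example constraints, a latency budget and an explainability requirement; the candidate names, scores and the `pick_model` helper are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Candidate:
    name: str
    accuracy: float      # measured on a hold-out set
    latency_ms: float    # average prediction latency
    explainable: bool    # whether the business can interpret its decisions


def pick_model(candidates, max_latency_ms, require_explainable):
    """Return the most accurate candidate that satisfies the constraints."""
    eligible = [
        c for c in candidates
        if c.latency_ms <= max_latency_ms
        and (c.explainable or not require_explainable)
    ]
    if not eligible:
        return None  # no candidate fits; revisit constraints or experiments
    return max(eligible, key=lambda c: c.accuracy)


# Hypothetical experiment results.
candidates = [
    Candidate("deep_net", accuracy=0.93, latency_ms=180, explainable=False),
    Candidate("gradient_boosting", accuracy=0.90, latency_ms=40, explainable=False),
    Candidate("logistic_regression", accuracy=0.86, latency_ms=5, explainable=True),
]

chosen = pick_model(candidates, max_latency_ms=50, require_explainable=True)
print(chosen.name if chosen else "no eligible model")
```

In this made-up scenario the most accurate model is rejected because it breaks the latency budget, and the final choice depends on whether explainability is required.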
60% to 80% accuracy is good enough to get started on social science projects.
10% improvement in accuracy in the initial year of model upgrades.
The Four Steps of Evaluation Phase of CRISP-ML(Q)
Deploying the solution in the most cost-effective and performance-efficient manner is yet another key factor.
Identifying the “Resource Requirement” for deployment in line with customer constraints is imperative for the smooth operation of the deployed solution.
The transition from the Development to the Test environment, and eventually into Production, should be seamless and backed by appropriate testing. Understanding the infrastructure requirements, including servers, business continuity planning, disaster recovery planning, etc., is key to handling unforeseen situations. Risk management, which is performed throughout the project, has to be implemented effectively. If a risk materializes, triggering the risk response plan helps avert customer dissatisfaction.
Deciding between cloud and on-premise, and suggesting a deployment strategy that keeps models scalable, reliable, secure and maintainable, are critical factors for success.
94% of enterprises use cloud services
67% of enterprise infrastructure is now cloud-based
The Five Key Steps of Model Deployment Phase of CRISP-ML(Q)
[Figure: Estimated Number of Projects Deployed (Mean)]
It is a journey, not a destination. Continual course correction of the deployed models, and sometimes retraining them, goes a long way in achieving customer delight.
Defining the maintenance strategy marks the closure of the project and yet the start of a new journey. This is a cyclical process. Various industry standards are in place with respect to when a model needs to be retrained. One must also account for the factors that lead to model drift and decay.
When to retrain the model?
The new data is ~20% of the training data
The accuracy in production changes by more than 5%
When there are substantial policy changes within or across organizations that are bound to affect the model assumptions
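The three retraining triggers above can be expressed as a simple check. This is a minimal sketch: the function name and parameters are hypothetical, the 20% rule is read as "at least 20%", and the 5% accuracy change is interpreted as 5 percentage points in either direction, which is an assumption since the rule of thumb does not say whether the change is absolute or relative.

```python
def retraining_reasons(
    new_rows,
    training_rows,
    baseline_accuracy,
    production_accuracy,
    policy_changed,
    data_ratio_threshold=0.20,      # new data is ~20% of the training data
    accuracy_delta_threshold=0.05,  # assumed to mean 5 percentage points
):
    """Return the list of triggers that currently call for retraining."""
    reasons = []
    if training_rows > 0 and new_rows / training_rows >= data_ratio_threshold:
        reasons.append("new data volume")
    if abs(production_accuracy - baseline_accuracy) > accuracy_delta_threshold:
        reasons.append("accuracy shift")
    if policy_changed:
        reasons.append("policy change")
    return reasons


# Hypothetical monitoring snapshot.
reasons = retraining_reasons(
    new_rows=2500,
    training_rows=10000,
    baseline_accuracy=0.90,
    production_accuracy=0.82,
    policy_changed=False,
)
print(reasons or "no retraining needed")
```

Returning the reasons rather than a bare yes/no keeps the monitoring report explainable: the team can see which trigger fired before committing to a retraining run.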
The Five Key Aspects of Monitoring & Maintenance Phase of CRISP-ML(Q)