360DigiTMG

CRoss Industry Standard Process for Machine Learning with Quality Assurance – CRISP–ML(Q)
A process model well suited to Machine Learning and Analytics projects
A six-step process model that gives a structured approach to handling Data Science and Artificial Intelligence projects

While all the steps are equally important, let us discuss each step in further detail.

Get started with the first stage of CRISP–ML(Q):

If the business problem is misunderstood, the organization drifts away from the problem at hand, which can eventually lead to catastrophic outcomes. Defining the business problem properly leads to assessing the as-is processes from the right perspective, and the Data Mining problem is then clearly defined. To set a project up for success, the Project Charter, which is the first document created on any project, is prepared and signed off by the sponsor. It serves as a mandatory input for crafting a detailed project management plan.

A KPMG survey says that most organizations (~70%) have experienced at least one project failure in the last year.

An IBM CTO says that most projects (~87%) in the Data Science space never see the light of production.

The Four Key Steps of Business Understanding Phase of CRISP-ML(Q)

Define the Scope of the ML Application
- Business Problem
- Business Objectives
- Business Constraints
Define Success Criteria
- Business Success Criteria
- ML Success Criteria
- Economic Success Criteria
Feasibility
- ML Technology Applicability
- Legal Constraints
- Requirements on the Application
Project Charter is the key document

Data is generated and collected from a wide range of sources, including Primary Data Sources (Surveys, Design of Experiments, Simulations) and Secondary Data Sources (RDBMS, Industrial IoT sensors, etc.). Once the data is collected, the team should describe it in a data dictionary, so that every team member is well informed about the variables and data used in further analysis.

IDC forecasts that connected IoT devices will generate ~79.4 zettabytes of data in 2025.

By 2025, the total volume of data is expected to reach 175 zettabytes. Other research says that the amount of data will double every year from 2022.

The Two Key Steps of Data Understanding Phase of CRISP-ML(Q)

Data Collection
- Data Version Control
Data Quality Verification
- Data Description
- Data Requirements
- Data Verification
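
As a rough illustration of the data description step, the sketch below builds a minimal data dictionary from a list of records. The function name `build_data_dictionary` and the sample records are hypothetical and invented for this example. A real project would typically add business definitions, units and owners for each variable.

```python
from collections import Counter

def build_data_dictionary(records):
    """Summarise each variable: observed types, missing count, distinct values."""
    columns = []
    for row in records:
        for name in row:
            if name not in columns:
                columns.append(name)  # keep first-seen column order

    dictionary = {}
    for name in columns:
        values = [row.get(name) for row in records]
        # Treat None and empty strings as missing (an assumption of this sketch).
        present = [v for v in values if v is not None and v != ""]
        type_counts = Counter(type(v).__name__ for v in present)
        dictionary[name] = {
            "types": sorted(type_counts),
            "missing": len(values) - len(present),
            "distinct": len(set(present)),
        }
    return dictionary

# Hypothetical sample records, invented for illustration only.
sample = [
    {"customer_id": 1, "age": 34, "segment": "retail"},
    {"customer_id": 2, "age": None, "segment": "retail"},
    {"customer_id": 3, "age": 51, "segment": "corporate"},
]
data_dictionary = build_data_dictionary(sample)
for name, info in data_dictionary.items():
    print(name, info)
```

Storing the missing count next to each variable also gives a first input to the data verification step.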

One part of Data Preparation goes by many names, such as Data Munging, Data Wrangling and Data Cleansing. Data arrives from a wide variety of sources and formats, and bringing all of it to a common format ensures that the next steps are smooth sailing. Exploring the data and computing Descriptive Statistics surfaces insights that help the business address low-hanging problems for fast results, and helps list the quick wins, short-term wins and long-term wins. Another aim of this step is to get the data into the format needed for building models. We also shortlist the critical few features from the trivial many, using feature selection techniques alongside features carefully curated through feature engineering.

60% to 80% of the data scientists’ time is spent on data cleansing.

15% to 25% of the data scientists’ time is spent on modelling.

5% to 15% of the time is spent on deploying the model into production.

The Five Key Steps of Data Preparation Phase of CRISP-ML(Q)

Selecting Data
- Feature Selection
- Data Selection
- Imbalanced Classes Verification
Cleaning Data
- Noise Reduction
- Data Imputation
Construct Data
- Feature Engineering
- Data Augmentation
Standardize Data
- File Format
- Feature Scaling
Exploratory Data Analysis / Descriptive Analytics
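
Three of the sub-steps listed above (data imputation, feature scaling and imbalanced classes verification) can be sketched in a few lines. This is a minimal illustration using only the Python standard library. Median imputation and z-score scaling are just one possible choice of technique, and the function names and sample values are hypothetical.

```python
from collections import Counter
from statistics import mean, median, pstdev

def impute_median(values):
    """Replace None with the median of the observed values."""
    observed = [v for v in values if v is not None]
    fill = median(observed)
    return [fill if v is None else v for v in values]

def standardize(values):
    """Z-score scaling: subtract the mean, divide by the population std dev."""
    mu = mean(values)
    sigma = pstdev(values)
    if sigma == 0:
        return [0.0 for _ in values]  # constant feature carries no spread
    return [(v - mu) / sigma for v in values]

def class_ratio(labels):
    """Size of the rarest class relative to the most common one (1.0 = balanced)."""
    counts = Counter(labels)
    return min(counts.values()) / max(counts.values())

# Hypothetical values, invented for illustration only.
ages = [20, None, 40, 60]
ages_imputed = impute_median(ages)       # median of 20, 40, 60 fills the gap
ages_scaled = standardize(ages_imputed)  # mean 0, unit variance
labels = ["no", "no", "no", "yes"]
imbalance = class_ratio(labels)

print(ages_imputed, ages_scaled, imbalance)
```

In practice the imputation and scaling statistics are usually computed on the training split only and then reused on validation and production data, so that no information leaks from the held-out data.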

Given the business problem and the available data, we must decide which Data Mining techniques, Supervised or Unsupervised Learning, are apt. Deciding on evaluation techniques that match the objective helps the business trust the solution in production use. Regularization techniques and hyperparameter tuning are imperative for success. Ultimately, meeting the success criteria defines the success of the project.
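
To make regularization and hyperparameter tuning concrete, here is a small sketch under stated assumptions: a one-feature ridge regression without an intercept, synthetic data with a known slope, and a hand-picked grid of penalty values. None of these specifics come from CRISP-ML(Q) itself. They only illustrate the idea of choosing the regularization strength on held-out data.

```python
import random

def fit_ridge_1d(xs, ys, lam):
    """Closed-form ridge fit for y ~ w * x (no intercept): w = sum(xy) / (sum(x^2) + lam)."""
    sxy = sum(x * y for x, y in zip(xs, ys))
    sxx = sum(x * x for x in xs)
    return sxy / (sxx + lam)

def mse(w, xs, ys):
    """Mean squared error of the prediction w * x."""
    return sum((y - w * x) ** 2 for x, y in zip(xs, ys)) / len(xs)

rng = random.Random(42)  # fixed seed so the experiment is repeatable
xs = [i / 10 for i in range(1, 41)]
ys = [2.0 * x + rng.gauss(0, 0.5) for x in xs]  # synthetic data, true slope 2.0

# Hold out every fourth point for validation.
pairs = list(zip(xs, ys))
train = [p for i, p in enumerate(pairs) if i % 4 != 0]
valid = [p for i, p in enumerate(pairs) if i % 4 == 0]
tx, ty = zip(*train)
vx, vy = zip(*valid)

# Grid search: fit on the training split, score on the validation split.
grid = [0.0, 0.1, 1.0, 10.0, 100.0]  # candidate penalties, an arbitrary choice
results = {lam: mse(fit_ridge_1d(tx, ty, lam), vx, vy) for lam in grid}
best_lam = min(results, key=results.get)

print("validation MSE per penalty:", results)
print("selected penalty:", best_lam)
```

A larger penalty shrinks the fitted slope towards zero, so the validation error shows where shrinkage starts to hurt. Fixing the random seed is one simple step towards reproducible results.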

Unsupervised Learning

Supervised Learning

Semi-Supervised Learning

Forecasting and Time Series

Self-Supervised Learning

Reinforcement Learning

38.76% is the growth rate of global machine learning market between 2020 and 2030. – Market Research Future

38% is the share of US jobs that will be automated by 2030 - PwC

The Six Key Steps of Machine Learning Phase of CRISP-ML(Q)

Research for Similar Problems
Define Quality Measures of the Model
- Performance
- Robustness
- Scalability
- Explainability
- Model Complexity
- Resource Demand
Model Selection
- Using Unlabelled Data and Pre-trained Models
- Ensemble Methods
Incorporate Domain Knowledge
Model Compression
Assure Reproducibility
- Method Reproducibility
- Result Reproducibility
- Experiments Documentation

Experimentation is the key to identifying the model that fits the business, with the right parameters. The model with the lowest error is not always the right one for the business problem being solved. It is critical to select a model that is accurate and also aligned with the business objectives and business constraints.
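
One way to read this in code: filter the candidate models by the business constraints first, and only then compare accuracy. The sketch below is illustrative. The `Candidate` fields, the constraint names and all the numbers are hypothetical, not measurements from any real project.

```python
from dataclasses import dataclass

@dataclass
class Candidate:
    name: str
    accuracy: float      # validation accuracy, between 0 and 1
    latency_ms: float    # average prediction latency
    explainable: bool    # whether the business can inspect its reasoning

def select_model(candidates, max_latency_ms, require_explainable, min_accuracy):
    """Filter by business constraints first, then pick the most accurate survivor."""
    eligible = [
        c for c in candidates
        if c.latency_ms <= max_latency_ms
        and c.accuracy >= min_accuracy
        and (c.explainable or not require_explainable)
    ]
    if not eligible:
        return None  # no candidate satisfies the constraints
    return max(eligible, key=lambda c: c.accuracy)

# Hypothetical candidates, invented for illustration only.
candidates = [
    Candidate("deep_ensemble", 0.93, 180.0, False),
    Candidate("gradient_boosting", 0.90, 40.0, False),
    Candidate("logistic_regression", 0.86, 5.0, True),
]
chosen = select_model(
    candidates, max_latency_ms=50.0, require_explainable=True, min_accuracy=0.80
)
print(chosen.name if chosen else "no candidate meets the constraints")
```

Here the most accurate candidate is rejected because it is too slow and cannot be explained, which matches the point that the lowest-error model is not always the right one.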

Industry Standards for Continual Improvement:

1. First time model deployment

60% to 80% accuracy is good enough to get started on social sciences projects.

2. One year from time of model deployment

10% improvement in accuracy in the first year of model upgrades.

3. Every year after one year

The Four Steps of Evaluation Phase of CRISP-ML(Q)

Validate Performance
Determine Robustness
Increase Explainability for ML Practitioner & End User
Compare Results with Defined Success Criteria

Identifying the “Resource Requirement” for deployment in line with customer constraints is imperative for the smooth operations of the deployed solution.

Transitioning from the Development environment to Test, and eventually into Production, should be seamless and backed by appropriate testing. Understanding the infrastructure requirements, including servers, business continuity planning and disaster recovery planning, is key to handling unforeseen situations. Risk management, which is performed throughout the project, has to be implemented effectively. If a risk materializes, triggering the risk response plan helps limit customer dissatisfaction.

Deciding between cloud and on-premise, and choosing a deployment strategy that keeps models scalable, reliable, secure and maintainable, are critical factors for success.

94% of enterprises use cloud services

67% of enterprise infrastructure is now cloud-based

The Five Key Steps of Model Deployment Phase of CRISP-ML(Q)

Define Inference Hardware
Model Evaluation Under Production Condition
Assure User Acceptance & Usability
Minimize the Risks of Unforeseen Errors
Deployment Strategy

Figure: Average Number of AI or ML Projects Deployed (axis: Estimated Number of Projects Deployed, Mean)

Defining the maintenance strategy marks the closure of the project and the start of a new journey, since this is a cyclical process. Various industry standards exist for when a model needs to be retrained. One must also account for the factors that lead to model drift and decay.
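
One common way to watch for drift, shown here purely as an illustration, is the Population Stability Index (PSI), which compares the distribution of a feature or model score at training time with its distribution in production. The source does not prescribe PSI. The bin edges, the sample scores and the 0.25 retraining threshold below are all assumptions made for this sketch.

```python
import math

def bin_shares(values, edges):
    """Fraction of values in each bin; the last bin includes its right edge."""
    counts = [0] * (len(edges) - 1)
    for v in values:
        for i in range(len(counts)):
            is_last = i == len(counts) - 1
            if edges[i] <= v < edges[i + 1] or (is_last and v == edges[-1]):
                counts[i] += 1
                break
    total = sum(counts)
    if total == 0:
        raise ValueError("no values fall inside the bin edges")
    # A small floor avoids log(0) and division by zero for empty bins.
    return [max(c / total, 1e-6) for c in counts]

def psi(expected, actual, edges):
    """Population Stability Index: sum of (a - e) * ln(a / e) over the bins."""
    e = bin_shares(expected, edges)
    a = bin_shares(actual, edges)
    return sum((ai - ei) * math.log(ai / ei) for ei, ai in zip(e, a))

# Hypothetical model scores, invented for illustration only.
training_scores = [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8]
live_scores = [0.6, 0.7, 0.7, 0.8, 0.8, 0.9, 0.9, 1.0]
edges = [0.0, 0.25, 0.5, 0.75, 1.0]

RETRAIN_THRESHOLD = 0.25  # rule-of-thumb value, an assumption of this sketch
psi_value = psi(training_scores, live_scores, edges)
needs_retraining = psi_value > RETRAIN_THRESHOLD
print(f"PSI = {psi_value:.3f}, retrain: {needs_retraining}")
```

A check like this can run on a schedule against production data, with the threshold agreed as part of the maintenance strategy rather than fixed in code.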

When to retrain the model?