
Amazon SageMaker - Amazon is the front runner in AutoML

June 23, 2023

Meet the Author : Mr. Bharani Kumar

Bharani Kumar Depuru is a well-known IT personality from Hyderabad. He is the Founder and Director of Innodatatics Pvt Ltd and 360DigiTMG. An IIT and ISB alumnus with more than 17 years of experience, he has held prominent positions at leading IT firms such as HSBC, ITC Infotech, Infosys, and Deloitte. He is an IT consultant specializing in Industrial Revolution 4.0 implementation, Data Analytics practice setup, Artificial Intelligence, Big Data Analytics, Industrial IoT, Business Intelligence, and Business Management. Bharani Kumar is also the chief trainer at 360DigiTMG, with more than ten years of experience, and has been making the IT transition journey easy for his students. 360DigiTMG is at the forefront of delivering quality education, thereby bridging the gap between academia and industry.

The Data Preparation phase of Amazon SageMaker includes:

SageMaker Data Wrangler - This capability covers data exploration and preparation for ML. Data Wrangler's process for data preparation comprises importing data from numerous data sources and warehouses, joining data from multiple tables, helping to detect data bias and target leakage, and exploring the data through various visualisation charts. Common data transformations rarely need to be written by hand because over 300 transformations are built in. Additionally, custom transformations written in Python, SQL, and PySpark can extend the data engineering steps.

SageMaker Clarify - This functionality covers detecting bias and explaining model predictions. Feature importance explanations based on model predictions, together with an understanding of training data bias, help developers immensely. SageMaker Clarify detects and measures data bias, such as imbalance in the training data caused by sampling error in certain categories. It also integrates well with the Data Wrangler, Experiments, and Model Monitor capabilities of Amazon SageMaker. This ensures that bias and feature importance are identified both in trained models and in inference data while the model is in production.

SageMaker Processing - This functionality provides fully managed data processing. With it, you can run your scripts and containers on a fully managed AWS compute instance rather than on a local computer. Data processing and model evaluation scale with the workload, and developers are free to run any code at any stage of the model building lifecycle, which makes the developer's life easier. Features prepared in SageMaker Data Wrangler can then be processed and saved.


SageMaker Clarify to compute bias metrics & feature importance.

A few of the pre-training metrics are provided here.

Conditional Demographic Disparity in Labels (CDDL): This metric ranges from -1 to +1. It asks whether the disadvantaged class accounts for a larger share of the rejected outcomes than of the accepted outcomes. The "conditional" part means the disparity is computed within subgroups of another attribute and then averaged. For instance, in a banking setting, suppose the advantaged class is male applicants and the disadvantaged class is female applicants, with 40% of loan applicants being female and 60% male. If men account for 58% of the denied applications and 72% of the approved applications, there is no demographic disparity against the male group. It follows that women account for the remaining 42% of denials but only 28% of approvals, so demographic disparity is seen for the female group.
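
The arithmetic behind the loan example can be sketched in a few lines of Python. This is only an illustration: it reads the example as men making up 58% of denials and 72% of approvals (so women make up the remaining 42% and 28%), and the applicant counts and function names are hypothetical, chosen so that the percentages work out.

```python
def demographic_disparity(rejected_in_group, rejected_total,
                          accepted_in_group, accepted_total):
    """Group's share of rejected outcomes minus its share of accepted outcomes."""
    return (rejected_in_group / rejected_total
            - accepted_in_group / accepted_total)


def conditional_demographic_disparity(subgroups):
    """Size-weighted average of per-subgroup disparities.

    `subgroups` is a list of (subgroup_size, disparity) pairs.
    """
    total = sum(size for size, _ in subgroups)
    return sum(size * dd for size, dd in subgroups) / total


# Hypothetical counts consistent with the percentages in the example:
# 700 applications (280 from women, 420 from men), 600 denied, 100 approved.
dd_women = demographic_disparity(252, 600, 28, 100)  # 0.42 - 0.28
dd_men = demographic_disparity(348, 600, 72, 100)    # 0.58 - 0.72

print(f"women: {dd_women:+.2f}, men: {dd_men:+.2f}")
```

A positive value (here +0.14 for women) means the group is over-represented among rejections relative to approvals; a negative value means the opposite.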

Class Imbalance (CI): This metric ranges from -1 to +1. It asks whether the advantaged group is represented more heavily in the dataset than the disadvantaged group.

Difference in Positive Proportions in Labels (DPL): This metric ranges from -1 to +1. It asks whether one class has a higher proportion of desirable outcomes in the training data than the other.

Jensen-Shannon Divergence (JS): This metric ranges from 0 to ln(2) when the natural logarithm is used. It measures how much the label distributions of the different classes diverge from each other. It applies whether the label is binary, multi-class, or continuous.

Kullback-Leibler Divergence (KL): This metric ranges from 0 to infinity. It measures how much the label distribution of the advantaged class diverges from that of the disadvantaged class.

Kolmogorov-Smirnov Distance (KS): This metric ranges from 0 to 1. This is equal to the maximum divergence in a label across the classes of a dataset.

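Lp Norm (LP): This metric is non-negative, and 0 means the label distributions are identical. Where KL and JS compare the probability distributions through entropy-based divergences, the Lp norm measures the p-norm distance between them.

Total Variation Distance (TVD): This is half of the L1 distance between the label distributions; it ranges from 0 to 1.

Of all these, the signed metrics (CI, DPL, and CDDL) can take negative values, which indicate bias in the opposite direction. The rest are non-negative and measure only how far apart the label distributions of the two classes are.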
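
A minimal sketch of how the pre-training metrics described above could be computed by hand, using only the Python standard library. The function names and the loan counts are hypothetical, and the formulas follow the textbook definitions rather than any particular SageMaker implementation.

```python
import math


def class_imbalance(n_adv, n_dis):
    """CI: +1 if only the advantaged group is present, -1 if only the other."""
    return (n_adv - n_dis) / (n_adv + n_dis)


def dpl(pos_adv, n_adv, pos_dis, n_dis):
    """Difference in the proportion of positive labels between the groups."""
    return pos_adv / n_adv - pos_dis / n_dis


def kl(p, q):
    """KL(p || q); assumes q[i] > 0 wherever p[i] > 0."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def js(p, q):
    """Jensen-Shannon divergence: average KL to the midpoint distribution."""
    mid = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    return 0.5 * kl(p, mid) + 0.5 * kl(q, mid)


def lp_norm(p, q, order=2):
    """p-norm distance between two label distributions."""
    return sum(abs(pi - qi) ** order for pi, qi in zip(p, q)) ** (1 / order)


def tvd(p, q):
    """Total variation distance: half of the L1 distance."""
    return 0.5 * lp_norm(p, q, order=1)


def ks(p, q):
    """Largest absolute difference for any single label value."""
    return max(abs(pi - qi) for pi, qi in zip(p, q))


# Hypothetical loan data: 420 advantaged applicants (72 approved),
# 280 disadvantaged applicants (28 approved).
p_adv = [72 / 420, 348 / 420]   # [approved, denied] label distribution
p_dis = [28 / 280, 252 / 280]

metrics = {
    "CI": class_imbalance(420, 280),
    "DPL": dpl(72, 420, 28, 280),
    "KL": kl(p_adv, p_dis),
    "JS": js(p_adv, p_dis),
    "LP": lp_norm(p_adv, p_dis),
    "TVD": tvd(p_adv, p_dis),
    "KS": ks(p_adv, p_dis),
}
for name, value in metrics.items():
    print(f"{name}: {value:.4f}")
```

Swapping the two groups flips the sign of CI and DPL, while the remaining values stay non-negative.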

A few of the post-training metrics are provided here.

· Difference in Positive Proportions in Predicted Labels (DPPL): This is very similar to DPL, except that it is computed on the model's predicted labels, which shows whether the bias in the dataset increases or decreases after training is completed. This measure is very similar to the Calders and Verwer score (CV score), with the exception that DPPL does extra normalization. This measure is also called a Group Discrimination Score, and some call it Statistical Parity.

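· Disparate Impact (DI): This is the ratio version of DPPL: the proportion of positive predictions for the disadvantaged class divided by that for the advantaged class.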

· Difference in Conditional Outcomes (DCO): This metric compares observed labels with predicted labels for each class and takes the difference between the classes. It has two measures:

o Difference in Conditional Acceptance (DCA): Compares the ratio of observed accepted labels to predicted accepted labels between the advantaged and disadvantaged classes.

o Difference in Conditional Rejection (DCR): Same as DCA but for rejected outcomes.

· Recall Difference (RD): Higher recall for the advantaged class means that the model is better at separating true positives from false negatives for that class than for the disadvantaged class.

· Difference in Label Rates (DLR): Here again we have two measures:

o Difference in Acceptance Rates (DAR): This measures whether the loan acceptance rate is the same for both advantaged and disadvantaged classes. It is also the difference in precision between the two classes. This is sometimes called “conditional use accuracy equality” or “predictive parity”. DAR is related to DCA.

o Difference in Rejection Rates (DRR): This measures whether the loan rejection rate is the same for both advantaged and disadvantaged classes. DRR is related to DCR.

· Accuracy Difference (AD): Measures the accuracy for each of the classes and then assesses which one is more accurate.

· Treatment Equality (TE): It is the difference between the classes in the ratio of false negatives to false positives.

· Conditional Demographic Disparity of Predicted Labels (CDDPL): Same as CDDL, with the exception that this is for predicted labels.

· Counterfactual Difference (CD): It measures whether model predictions change if we flip an individual's membership between the advantaged and disadvantaged classes. These are also called causal discrimination scores or counterfactual sensitivity.

· FlipTest (FT): This is a black-box technique using the discriminator of a GAN (generative adversarial network).

o SageMaker Model Monitor to compute data drift

o SageMaker Processing for data exploration & feature engineering
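
Several of the post-training metrics described above reduce to simple arithmetic on each class's confusion matrix. The sketch below is an illustration under stated assumptions: the counts are hypothetical, the class and function names are made up for this example, and every difference is taken as advantaged minus disadvantaged, which may not match the sign convention of a given tool.

```python
from dataclasses import dataclass


@dataclass
class Confusion:
    """Confusion-matrix counts for one class (group) of applicants."""
    tp: int
    fp: int
    tn: int
    fn: int

    @property
    def total(self):
        return self.tp + self.fp + self.tn + self.fn

    @property
    def predicted_positive_rate(self):
        return (self.tp + self.fp) / self.total

    @property
    def recall(self):
        return self.tp / (self.tp + self.fn)

    @property
    def acceptance_precision(self):
        return self.tp / (self.tp + self.fp)

    @property
    def rejection_precision(self):
        return self.tn / (self.tn + self.fn)

    @property
    def accuracy(self):
        return (self.tp + self.tn) / self.total

    @property
    def fn_fp_ratio(self):
        return self.fn / self.fp


def post_training_metrics(adv, dis):
    """All differences are advantaged minus disadvantaged (an assumption)."""
    return {
        "DPPL": adv.predicted_positive_rate - dis.predicted_positive_rate,
        "DI": dis.predicted_positive_rate / adv.predicted_positive_rate,
        "RD": adv.recall - dis.recall,
        "DAR": adv.acceptance_precision - dis.acceptance_precision,
        "DRR": adv.rejection_precision - dis.rejection_precision,
        "AD": adv.accuracy - dis.accuracy,
        "TE": adv.fn_fp_ratio - dis.fn_fp_ratio,
    }


# Hypothetical counts for a loan-approval model.
advantaged = Confusion(tp=60, fp=12, tn=330, fn=18)
disadvantaged = Confusion(tp=20, fp=8, tn=240, fn=12)

results = post_training_metrics(advantaged, disadvantaged)
for name, value in results.items():
    print(f"{name}: {value:+.4f}")
```

With these counts the two classes have identical accuracy (AD is zero) and yet differ in recall and precision, which is why more than one metric is worth checking.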

SageMaker Feature Store - This functionality provides a feature repository for machine learning training and inference. It supports data governance and enables ML developers to store, update, query, and share ML features with other ML developers. It provides an offline store for training and batch use and an online store for real-time, low-latency ML inference. Versioning is also included so that earlier versions can be quickly accessed.
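
The difference between online and offline feature storage can be pictured with a toy in-memory class. This is purely a conceptual sketch: `ToyFeatureStore` and its methods are hypothetical names invented here and are not the SageMaker Feature Store API.

```python
class ToyFeatureStore:
    """In-memory illustration only; not the SageMaker Feature Store API."""

    def __init__(self):
        self.offline = []   # append-only history of every record written
        self.online = {}    # latest record per ID, for low-latency lookups

    def put(self, record_id, event_time, features):
        # Every write is kept in the offline history.
        self.offline.append((record_id, event_time, dict(features)))
        # The online view only keeps the most recent record per ID.
        latest = self.online.get(record_id)
        if latest is None or event_time >= latest[0]:
            self.online[record_id] = (event_time, dict(features))

    def get_latest(self, record_id):
        return self.online[record_id][1]

    def history(self, record_id):
        rows = [(t, f) for rid, t, f in self.offline if rid == record_id]
        return sorted(rows, key=lambda row: row[0])


store = ToyFeatureStore()
store.put("customer-1", 1, {"avg_balance": 1200.0})
store.put("customer-1", 2, {"avg_balance": 1350.0})

print(store.get_latest("customer-1"))    # what real-time inference would read
print(len(store.history("customer-1")))  # earlier versions remain available
```

The online lookup serves a single latest value quickly, while the offline history keeps every version for building training datasets.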

Model Building phase of Amazon SageMaker includes:

SageMaker Autopilot - This feature allows for the automated creation of fully transparent machine learning models. It automatically explores the data, transforms it, and trains and tunes models. The target column ('Y') must be specified when selecting the data stored in the S3 bucket. Notebooks are also made available so that we can understand how Autopilot works internally. Additionally, one can view every model created along with its accuracy metrics and then deploy the best model.

SageMaker JumpStart - This functionality includes pre-built Machine Learning solutions and a model zoo. It has a collection of over 150 open source deep learning models built to handle computer vision as well as NLP problems. It also has reference architectures for end-to-end systems that can be deployed in AWS. One can deploy a solution with one click and explore how things work on a production system in the cloud.

SageMaker Studio Notebooks - This functionality includes Jupyter notebooks with elastic compute and an option for sharing. The SageMaker Studio interface is built on JupyterLab with additional features. One gets to choose the notebook kernel as well as the compute instance on which to run the notebook. Many SageMaker instance types are supported in Studio Notebooks.

Training Algorithms - This functionality helps train highly scalable algorithms without writing code. SageMaker offers 17 scalable built-in algorithms that are optimized for the infrastructure. These algorithms support tabular data as well as computer vision and NLP use cases. Just select the algorithm and hyperparameters, and SageMaker provisions an instance, takes your data, and performs training in a Docker container. By infrastructure-optimized, we mean that the code is capable of running in a distributed environment.

Model Training & Tuning phase of Amazon SageMaker includes:

Managed Training - This functionality supports distributed infrastructure management. On-demand compute resources can be used from anywhere, so training feels as if it runs in a serverless environment. For Deep Learning, you can use GPU instances. You submit a training job, it is carried out on a server, and the server is shut down after the training finishes. Training metrics and logs can be monitored via Amazon CloudWatch, so one can focus on building models rather than on infrastructure.

Distributed Training Libraries - This functionality supports training on large datasets and large models. Deep learning requires huge volumes of data, so a single GPU instance may not be enough. This is where the Data Parallelism and Model Parallelism approaches of SageMaker distributed training come into play. Model Parallelism splits models that are too large to fit on a single GPU across many GPUs, whereas Data Parallelism distributes massive datasets across numerous instances. Both exist to speed up training, and enabling them requires only a few lines of code.
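
The idea behind Data Parallelism can be sketched without any GPU at all. The example below is a simplified illustration, not the SageMaker distributed training library: the one-weight model, the data, and the function names are hypothetical, and the "workers" run one after another instead of concurrently.

```python
def mse_gradient(w, batch):
    """Gradient of mean squared error for the one-weight model y_hat = w * x."""
    return sum(2 * (w * x - y) * x for x, y in batch) / len(batch)


def data_parallel_gradient(w, data, num_workers):
    # Each worker receives an equally sized shard of the data.
    shards = [data[i::num_workers] for i in range(num_workers)]
    # On real hardware these run concurrently, one per GPU or instance.
    local_gradients = [mse_gradient(w, shard) for shard in shards]
    # An all-reduce step averages the local gradients.
    return sum(local_gradients) / num_workers


data = [(float(x), 3.0 * x) for x in range(1, 9)]   # y = 3x, eight samples
w = 1.0

full = mse_gradient(w, data)
parallel = data_parallel_gradient(w, data, num_workers=4)
print(full, parallel)
```

Both values come out the same, which is the point: when the shards are equally sized, averaging the per-worker gradients gives the same update as processing the whole batch on one machine.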

SageMaker Debugger - This functionality allows debugging and profiling of training runs. It helps to understand and fix the problems that arise while training the model and to see how the compute resources are utilised. This helps training jobs run without failing so that time is not wasted. Training metrics and compute resource details are captured and presented in an interactive dashboard, from which actions can be taken.

SageMaker Experiments - This functionality allows capturing, organising, and comparing every step. One can track the work performed across the ML lifecycle. Once the trials and experiments grow beyond a manageable point, we are notified. SageMaker Studio allows users to create charts to visualise the trials and experiments and compare model performance.

Model Deployment & MLOps phase of Amazon SageMaker includes:

Managed Deployment - This functionality provides high throughput, very low latency, and a fully managed environment. We don't need to worry about managing, deploying, or scaling the compute instances used for model inference. The deployed model should return an inference quickly. Your model is deployed onto a fully managed ML instance using the SageMaker model hosting capability. Auto scaling can be configured for endpoints so that SageMaker spins up new instances to handle growing traffic to the models. In addition, by using multi-model endpoints rather than a separate endpoint for each model, for example one per geography, one can make use of an underutilised endpoint's compute capacity and cut hosting costs.
SageMaker Model Monitor - This functionality monitors deployed models so that their accuracy can be maintained. It captures data to compute baseline statistics and thereafter monitors live endpoints for data drift on a defined schedule. These drift details are visualised on a dashboard. Alarms can be set to fire whenever data drift or model drift arises.
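
The baseline-then-compare idea can be illustrated with a deliberately simple check. This is a stand-in for explanation only and is not how Model Monitor itself evaluates drift; the function names, the feature values, and the three-standard-deviation threshold are all assumptions made for this sketch.

```python
import statistics


def compute_baseline(values):
    """Summary statistics captured once from the training data."""
    return {"mean": statistics.mean(values), "stdev": statistics.stdev(values)}


def has_drifted(baseline, live_values, threshold=3.0):
    """True when the live mean is over `threshold` baseline stdevs away."""
    shift = abs(statistics.mean(live_values) - baseline["mean"])
    return shift > threshold * baseline["stdev"]


training_feature = [10, 12, 11, 9, 10, 11, 12, 10]
baseline = compute_baseline(training_feature)

print(has_drifted(baseline, [10, 11, 12, 10]))   # similar traffic
print(has_drifted(baseline, [20, 22, 21, 19]))   # shifted traffic
```

The first batch stays within the baseline range and is not flagged, while the second batch is; in practice such a check would run on a schedule and feed an alarm.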
SageMaker Pipelines - This functionality allows for orchestration and automation of pipelines. Data preprocessing, model training, tuning, evaluation, deployment, monitoring, etc., form the series of steps in an ML pipeline. Business logic can be incorporated into the pipeline to maintain the model's quality. Audit trails for models are created so that the entire pipeline history is maintained. ML pipelines can be run on demand, scheduled to run at defined time intervals, or run in response to trigger events.
SageMaker Projects and Model Registry - This feature enables and manages MLOps for the ML workflow. By adhering to CI/CD (continuous integration / continuous delivery) best practices, SageMaker Projects consolidates all machine learning assets in a single place to guarantee that models in production have reproducibility, auditability, and governance. MLOps templates are provided so that, after deploying a template, you can start filling in your own ML process. The templates cover:
1. Version control by maintaining code repositories
2. Model training is automated through ML pipelines
3. Process for code building to ensure that code committed to the whole code base works
4. Model deployment quality control gate as well as a model registry for version control
5. Finally an automated model deployment process
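
Items 4 and 5 above, the quality control gate, the versioned registry, and the automated deployment, can be pictured with a toy example. Everything here is hypothetical: `ToyModelRegistry`, `deploy`, the artifact names, and the 0.90 accuracy threshold are invented for illustration and are not the SageMaker Model Registry API.

```python
class ToyModelRegistry:
    """Illustration only; not the SageMaker Model Registry API."""

    def __init__(self, min_accuracy):
        self.min_accuracy = min_accuracy
        self.versions = []

    def register(self, artifact, accuracy):
        # Quality gate: only models meeting the threshold are approved.
        status = "Approved" if accuracy >= self.min_accuracy else "Rejected"
        entry = {
            "version": len(self.versions) + 1,
            "artifact": artifact,
            "accuracy": accuracy,
            "status": status,
        }
        self.versions.append(entry)
        return entry

    def latest_approved(self):
        approved = [v for v in self.versions if v["status"] == "Approved"]
        return approved[-1] if approved else None


def deploy(registry):
    """Automated deployment step: ship the newest approved version, if any."""
    model = registry.latest_approved()
    return None if model is None else f"deployed version {model['version']}"


registry = ToyModelRegistry(min_accuracy=0.90)
registry.register("model-a.tar.gz", accuracy=0.93)
registry.register("model-b.tar.gz", accuracy=0.85)   # fails the gate
print(deploy(registry))
```

Version 2 is recorded but rejected by the gate, so the deployment step keeps serving version 1.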

With all of these features and many more in research, AutoML will certainly become the gold standard and Amazon is leading the race in the AutoML world.
