This new capability lets customers orchestrate HyperPod clusters using Amazon Elastic Kubernetes Service (Amazon EKS), combining the power of Kubernetes with the resilient Amazon SageMaker HyperPod environment designed for training large models. Amazon SageMaker HyperPod helps scale efficiently across more than 1,000 artificial intelligence (AI) accelerators and can reduce training time by up to 40%.
Amazon SageMaker HyperPod now lets customers manage their clusters through a Kubernetes-based interface. This integration allows switching between Slurm and Amazon EKS to suit different workloads, including training, fine-tuning, experimentation, and inference. The CloudWatch Observability EKS add-on provides comprehensive monitoring, with insights into CPU, network, disk, and other low-level node metrics on a unified dashboard. This observability extends to resource utilization across the entire cluster, node-level metrics, pod-level performance, and container-specific utilization data, which helps with troubleshooting and optimization.
Amazon SageMaker HyperPod has become a go-to solution for AI startups and enterprises looking to train and deploy large-scale models efficiently. It is compatible with SageMaker’s distributed training libraries, which offer Model Parallel and Data Parallel software optimizations that help reduce training time by up to 20%. SageMaker HyperPod automatically detects and repairs or replaces faulty instances, so data scientists can train models without interruption for weeks or months. This lets data scientists focus on model development rather than on managing the underlying infrastructure.
Integrating Amazon EKS with Amazon SageMaker HyperPod brings the advantages of Kubernetes, which has become popular for machine learning (ML) workloads because of its scalability and rich open-source tooling. Organizations often standardize on Kubernetes to build applications, including those required for generative AI use cases, because it lets them reuse capabilities across environments while meeting compliance and governance requirements. With this announcement, customers can scale and optimize resource utilization across more than 1,000 AI accelerators. This flexibility improves the developer experience, containerized application management, and dynamic scaling for foundation model (FM) training and inference workloads.
Amazon EKS support in Amazon SageMaker HyperPod strengthens resilience through deep health checks, automated node recovery, and job auto-resume capabilities, ensuring uninterrupted training for large-scale or long-running jobs. Job management can be streamlined with the optional HyperPod CLI, which is designed for Kubernetes environments, although customers can also use their own CLI tools. Integration with Amazon CloudWatch Container Insights provides advanced observability, with deeper insights into cluster performance, health, and utilization. Data scientists can also use tools such as Kubeflow for automated ML workflows. The integration also includes Amazon SageMaker managed MLflow, providing a robust solution for experiment tracking and model management.
At a high level, a cloud administrator creates the Amazon SageMaker HyperPod cluster, and the HyperPod service fully manages it, removing the undifferentiated heavy lifting involved in building and optimizing ML infrastructure. Amazon Elastic Kubernetes Service (Amazon EKS) orchestrates the HyperPod nodes, giving customers a familiar Kubernetes-based administrator experience.
I start by preparing the scenario, checking the prerequisites, and creating an Amazon EKS cluster by following the official documentation, configured with VPC and storage resources.
You can create and manage Amazon SageMaker HyperPod clusters using either the AWS Management Console or the AWS Command Line Interface (AWS CLI). Using the AWS CLI, I specify my cluster configuration in a JSON file. I choose the Amazon EKS cluster created previously as the orchestrator of the SageMaker HyperPod cluster.
Then I create the cluster worker nodes in an instance group called “worker-group-1”, with a private Subnet, NodeRecovery set to Automatic to enable automatic node recovery, and, for OnStartDeepHealthChecks, I add InstanceStress and InstanceConnectivity to enable deep health checks.
cat > eli-cluster-config.json << EOL
{
    "ClusterName": "example-hp-cluster",
    "Orchestrator": {
        "Eks": {
            "ClusterArn": "${EKS_CLUSTER_ARN}"
        }
    },
    "InstanceGroups": [
        {
            "InstanceGroupName": "worker-group-1",
            "InstanceType": "ml.p5.48xlarge",
            "InstanceCount": 32,
            "LifeCycleConfig": {
                "SourceS3Uri": "s3://${BUCKET_NAME}",
                "OnCreate": "on_create.sh"
            },
            "ExecutionRole": "${EXECUTION_ROLE}",
            "ThreadsPerCore": 1,
            "OnStartDeepHealthChecks": [
                "InstanceStress",
                "InstanceConnectivity"
            ]
        },
        ....
    ],
    "VpcConfig": {
        "SecurityGroupIds": [
            "$SECURITY_GROUP"
        ],
        "Subnets": [
            "$SUBNET_ID"
        ]
    },
    "ResilienceConfig": {
        "NodeRecovery": "Automatic"
    }
}
EOL
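Because the heredoc above uses an unquoted delimiter, the shell expands variables such as ${EKS_CLUSTER_ARN} while writing the file, and an unset variable silently becomes an empty string. The following is a minimal pre-flight sketch, assuming a POSIX-compatible shell; the helper name check_required_vars is hypothetical and is not part of the AWS CLI.

```shell
# Hypothetical helper (not an AWS CLI feature): report which of the named
# shell variables are unset or empty, and return non-zero if any are.
check_required_vars() {
  missing=0
  for name in "$@"; do
    # Look up the value of the variable whose name is stored in $name.
    eval "value=\${$name:-}"
    if [ -z "$value" ]; then
      echo "missing required variable: $name" >&2
      missing=1
    fi
  done
  return "$missing"
}

# Example call, to run before generating the config file:
# check_required_vars EKS_CLUSTER_ARN BUCKET_NAME EXECUTION_ROLE SECURITY_GROUP SUBNET_ID || exit 1
```

Running the check before the heredoc means a forgotten export surfaces as a clear message instead of a rejected or misconfigured cluster request.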
You can add InstanceStorageConfigs to an instance group to provision and mount additional Amazon EBS volumes on HyperPod nodes.
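As a rough illustration of the extra storage setting, the sketch below writes a standalone fragment showing the shape I would expect for the InstanceStorageConfigs field of an instance group in the CreateCluster request. The file name storage-fragment.json and the 500 GB size are assumptions chosen for the example; check the field names against the current API reference before relying on them.

```shell
# Illustrative fragment only. The file name and the 500 GB volume size are
# assumed values, not taken from the original configuration.
# The quoted 'EOL' delimiter prevents any shell expansion inside the JSON.
cat > storage-fragment.json << 'EOL'
{
    "InstanceStorageConfigs": [
        {
            "EbsVolumeConfig": {
                "VolumeSizeInGB": 500
            }
        }
    ]
}
EOL
```

In a real configuration the "InstanceStorageConfigs" key would sit inside an entry of "InstanceGroups", next to fields such as "InstanceType", rather than in its own file.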
To create the cluster using the SageMaker HyperPod APIs, I run the following AWS CLI command:
aws sagemaker create-cluster --cli-input-json file://eli-cluster-config.json
The command returns the Amazon Resource Name (ARN) of the new HyperPod cluster.
{
"ClusterArn": "arn:aws:sagemaker:us-east-2:ACCOUNT-ID:cluster/wccy5z4n4m49"
}
I then verify the HyperPod cluster status, waiting until it changes to InService.
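Waiting for the InService status can also be scripted. The following is a minimal polling sketch, assuming the AWS CLI is installed and configured; it reads the ClusterStatus field from aws sagemaker describe-cluster. The function names, attempt limit, and delay are hypothetical choices for illustration.

```shell
# Hypothetical helper: print the current status of the named HyperPod cluster.
get_cluster_status() {
  aws sagemaker describe-cluster --cluster-name "$1" \
    --query 'ClusterStatus' --output text
}

# Hypothetical helper: poll until the cluster is InService.
# Arguments: cluster name, max attempts (default 60), delay in seconds (default 30).
# Returns 0 on InService, 1 on Failed, 2 if the status query fails,
# and 3 if the attempts run out.
wait_for_in_service() {
  cluster_name=$1
  max_attempts=${2:-60}
  delay_seconds=${3:-30}
  attempt=1
  while [ "$attempt" -le "$max_attempts" ]; do
    status=$(get_cluster_status "$cluster_name") || return 2
    echo "attempt $attempt: $status"
    case "$status" in
      InService) return 0 ;;
      Failed) return 1 ;;
    esac
    sleep "$delay_seconds"
    attempt=$((attempt + 1))
  done
  return 3
}

# Example call:
# wait_for_in_service example-hp-cluster
```

Keeping the status query in its own function makes the loop easy to exercise without touching a real cluster, since the query can be swapped for a stub.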
Once the cluster is in service, I can monitor cluster performance and health metrics using Amazon CloudWatch Container Insights.
Here are some key things to know about Amazon EKS support in Amazon SageMaker HyperPod:
This integration provides a more resilient training environment with deep health checks, automated node recovery, and job auto-resume. SageMaker HyperPod automatically detects, diagnoses, and recovers from faults, so you can train foundation models continuously for weeks or months without disruption. This can reduce training time by up to 40%.
Amazon CloudWatch Container Insights provides detailed metrics and logs for your containerized applications and microservices. This enables comprehensive monitoring of cluster performance and health.
This launch provides a custom HyperPod CLI for job management, Kubeflow Training Operators for distributed training, Kueue for scheduling, and integration with SageMaker Managed MLflow for experiment tracking. It also works with SageMaker’s distributed training libraries, which provide Model Parallel and Data Parallel optimizations that significantly reduce training time. These libraries, combined with auto-resumption of jobs, enable efficient and uninterrupted training of large models.
This integration improves the developer experience and scalability for FM workloads. Data scientists can share compute capacity efficiently across training and inference tasks. You can use your existing Amazon EKS clusters or create and attach new ones to HyperPod compute, and you can bring your own tools for job submission, queuing, and monitoring.
To get started with Amazon SageMaker HyperPod on Amazon EKS, you can explore resources including the documentation, tutorials, and the GitHub repository. This release is generally available in the AWS Regions where Amazon SageMaker HyperPod is available, except Europe (London). For pricing information, visit the Amazon SageMaker pricing page.
This blog post was a collaborative effort. We would like to thank Manoj Ravi, Adhesh Garg, Tomonori Shimomura, Alex Iankoulski, Anoop Saha, and the entire team for their significant contributions in compiling and refining the information presented here. Their collective expertise was crucial in creating a comprehensive article.