Thursday, December 5, 2024

Accelerate foundation model training and fine-tuning with new Amazon SageMaker HyperPod recipes

Today, we're announcing Amazon SageMaker HyperPod recipes, which help data scientists and developers of all skill levels get started training and fine-tuning foundation models (FMs) in minutes with state-of-the-art performance. The recipes are optimized for popular publicly available FMs such as Meta Llama 3.1 405B.

SageMaker HyperPod, announced at AWS re:Invent 2023, reduces the time to train FMs by up to 40% and scales across more than 1,000 compute resources in parallel with preconfigured distributed training libraries. With SageMaker HyperPod, you can find the accelerated compute resources needed for training, create the most optimal training plans, and run training workloads across different blocks of capacity based on the availability of compute resources.

SageMaker HyperPod recipes include a training stack tested by AWS, removing weeks of tedious work experimenting with different model configurations. The recipes automate several critical steps, such as loading training datasets, applying distributed training techniques, automating checkpoints for faster recovery from faults, and managing the end-to-end training loop.

With a simple recipe change, you can switch between GPU- or Trainium-based instances to further optimize training performance and reduce costs. You can run workloads in production on SageMaker HyperPod or on SageMaker training jobs.
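To make this concrete, here is a minimal sketch of what such a switch could look like when running recipes as SageMaker training jobs with the SageMaker Python SDK. The PyTorch estimator and its training_recipe parameter are shown in full later in this post; the IAM role value and the Trainium recipe path below are hypothetical placeholders.

from sagemaker.pytorch import PyTorch

# GPU-based run: P5 instances with a GPU recipe (this recipe path is
# used again later in this post).
gpu_estimator = PyTorch(
    role="<role>",  # placeholder IAM role
    instance_count=1,
    instance_type="ml.p5.48xlarge",
    training_recipe="fine-tuning/llama/hf_llama3_405b_seq8k_gpu_qlora",
)

# Trainium-based run: only the instance type and recipe path change.
# The trn1 recipe path below is a hypothetical placeholder; check the
# recipes repository for the actual Trainium recipe names.
trainium_estimator = PyTorch(
    role="<role>",
    instance_count=1,
    instance_type="ml.trn1.32xlarge",
    training_recipe="fine-tuning/llama/<a_trn1_recipe>",  # hypothetical
)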


To get started, browse the training recipes for popular publicly available FMs in the SageMaker HyperPod recipes GitHub repository.

You only need to edit straightforward recipe parameters to specify an instance type and the location of your dataset in the cluster configuration, then run the recipe with a single-line command to achieve state-of-the-art performance.

After cloning the repository, edit the recipe's config.yaml file to specify the model and cluster type.

$ git clone --recursive https://github.com/aws/sagemaker-hyperpod-recipes.git
$ cd sagemaker-hyperpod-recipes
$ pip3 install -r requirements.txt
$ cd ./recipes_collections
$ vim config.yaml

The recipes support SageMaker HyperPod with Slurm, SageMaker HyperPod with Amazon Elastic Kubernetes Service (Amazon EKS), and SageMaker training jobs. For example, you can set up a cluster type (the Slurm orchestrator), a model name (the Meta Llama 3.1 405B language model), an instance type (ml.p5.48xlarge), and your data locations, such as where to store the training data, results, and logs.

defaults:
- cluster: slurm # supported: slurm / k8s / sm_jobs
- recipes: fine-tuning/llama/hf_llama3_405b_seq8k_gpu_qlora # name of model to be trained
debug: False # set to True to debug the launcher configuration
instance_type: ml.p5.48xlarge # or other supported cluster instance types
base_results_dir: # location(s) to store results, checkpoints, logs, and more

You can optionally adjust model-specific training parameters in this YAML file, which outlines the optimal configuration, including the number of accelerator devices, instance type, training precision, parallelization and sharding techniques, the optimizer, and logging options for tracking experiments.

run:
  name: llama-405b
  results_dir: ${base_results_dir}/${.name}
  time_limit: "6-00:00:00"
restore_from_path: null
trainer:
  devices: 8
  num_nodes: 2
  accelerator: gpu
  precision: bf16
  max_steps: 50
  log_every_n_steps: 10
  ...
exp_manager:
  exp_dir: # location for TensorBoard logging
  name: helloworld
  create_tensorboard_logger: True
  create_checkpoint_callback: True
  checkpoint_callback_params:
    ...
  auto_checkpoint: True # for automated checkpointing
use_smp: True
distributed_backend: smddp # optimized collectives
# start training from pretrained model
model:
  model_type: llama_v3
  train_batch_size: 4
  tensor_model_parallel_degree: 1
  expert_model_parallel_degree: 1
  # other model-specific params

To run this recipe on SageMaker HyperPod with Slurm, first create a SageMaker HyperPod cluster by following the cluster setup instructions in the documentation.

Then, connect to the SageMaker HyperPod head node, access the Slurm controller, and copy the edited recipe.

Next, run a helper file to generate a Slurm submission script for the job, which you can use for a dry run to inspect the content before starting the training job.

$ python3 main.py --config-path=recipes_collection --config-name=config

After the training job completes, the trained model is automatically saved to your assigned data location.

To run this recipe on SageMaker HyperPod with Amazon EKS, clone the recipe repository from GitHub, install the dependencies, and edit the recipe (cluster: k8s) on your laptop. Then, connect your laptop to the running EKS cluster and use the SageMaker HyperPod command line interface (CLI) to run the recipe.

$ hyperpod start-job --recipe fine-tuning/llama/hf_llama3_405b_seq8k_gpu_qlora \
--persistent-volume-claims fsx-claim:data \
--override-parameters \
'{
  "recipes.run.name": "hf-llama3-405b-seq8k-gpu-qlora",
  "recipes.exp_manager.exp_dir": "/data/<your_exp_dir>",
  "cluster": "k8s",
  "cluster_type": "k8s",
  "container": "658645717510.dkr.ecr.<region>.amazonaws.com/smdistributed-modelparallel:2.4.1-gpu-py311-cu121",
  "recipes.model.data.train_dir": "<your_train_data_dir>",
  "recipes.model.data.val_dir": "<your_val_data_dir>"
}'

You can also run recipes using SageMaker training jobs. The following example runs a PyTorch training script on SageMaker training jobs, overriding the training recipe's parameters.

...
recipe_overrides = {
    "run": {
        "results_dir": "/opt/ml/model",
    },
    "exp_manager": {
        "exp_dir": "",
        "explicit_log_dir": "/opt/ml/output/tensorboard",
        "checkpoint_dir": "/opt/ml/checkpoints",
    },
    "model": {
        "data": {
            "train_dir": "/opt/ml/input/data/train",
            "val_dir": "/opt/ml/input/data/val",
        },
    },
}
pytorch_estimator = PyTorch(
    output_path=<output_path>,
    base_job_name="llama-recipe",
    role=<role>,
    instance_type="ml.p5.48xlarge",
    training_recipe="fine-tuning/llama/hf_llama3_405b_seq8k_gpu_qlora",
    recipe_overrides=recipe_overrides,
    sagemaker_session=sagemaker_session,
    tensorboard_output_config=tensorboard_output_config,
)
...
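As with any SageMaker training job, you would then start the run by calling fit() on the estimator. A minimal sketch, assuming your training and validation datasets live at hypothetical Amazon S3 locations that map to the train_dir and val_dir overrides above:

# Start the recipe-driven training job. The channel names ("train", "val")
# map to /opt/ml/input/data/train and /opt/ml/input/data/val inside the
# container, matching the recipe_overrides above. The S3 URIs are placeholders.
pytorch_estimator.fit(
    inputs={
        "train": "s3://<your_bucket>/train",
        "val": "s3://<your_bucket>/val",
    },
    wait=True,  # block until the training job finishes
)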

As training progresses, automatic checkpointing saves the model's progress, enabling fast recovery from training faults and instance restarts.
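The checkpoint settings live under the exp_manager section of the recipe YAML shown earlier, so when launching through SageMaker training jobs they could presumably be adjusted the same way as the other values — a minimal sketch, reusing the override dictionary pattern from the example above:

# Sketch: toggling automated checkpointing through recipe overrides.
# The keys mirror the exp_manager section of the recipe YAML shown earlier;
# the checkpoint path matches the training-jobs example above.
recipe_overrides = {
    "exp_manager": {
        "checkpoint_dir": "/opt/ml/checkpoints",
        "auto_checkpoint": True,  # automated checkpointing
    },
}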

Amazon SageMaker HyperPod recipes are now available in the SageMaker HyperPod recipes GitHub repository. To learn more, visit the SageMaker HyperPod product page and documentation.

Give the SageMaker HyperPod recipes a try, and send feedback through your usual AWS Support contacts.
