Thursday, December 5, 2024

Accelerate foundation model training and fine-tuning with new Amazon SageMaker HyperPod recipes

Today, we're announcing Amazon SageMaker HyperPod recipes, which help data scientists and developers of all skill levels get started training and fine-tuning foundation models (FMs) in minutes with state-of-the-art performance. The recipes are optimized for popular publicly available FMs such as Meta Llama 3.1 405B.

SageMaker HyperPod, announced at AWS re:Invent 2023, reduces the time to train FMs by up to 40% and scales across more than 1,000 compute resources in parallel with preconfigured distributed training libraries. With SageMaker HyperPod, you can find the accelerated compute resources needed for training, create the most optimal training plans, and run training workloads across different blocks of capacity based on the availability of compute resources.

SageMaker HyperPod recipes include a training stack tested by AWS, removing weeks of tedious work experimenting with different model configurations. The recipes automate several critical steps, such as loading training datasets, applying distributed training techniques, automating checkpoints for faster recovery from faults, and managing the end-to-end training loop.

With a simple recipe change, you can switch between GPU- or Trainium-based instances to further optimize training performance and reduce costs. You can run workloads in production on SageMaker HyperPod or on SageMaker training jobs.
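To make this concrete, here is a minimal sketch of what such a switch could look like when running recipes as SageMaker training jobs with the SageMaker Python SDK. The PyTorch estimator and its training_recipe parameter are shown in full later in this post; the IAM role value and the Trainium recipe path below are hypothetical placeholders.

from sagemaker.pytorch import PyTorch

# GPU-based run: P5 instances with a GPU recipe (this recipe path is
# used again later in this post).
gpu_estimator = PyTorch(
    role="<role>",  # placeholder IAM role
    instance_count=1,
    instance_type="ml.p5.48xlarge",
    training_recipe="fine-tuning/llama/hf_llama3_405b_seq8k_gpu_qlora",
)

# Trainium-based run: only the instance type and recipe path change.
# The trn1 recipe path below is a hypothetical placeholder; check the
# recipes repository for the actual Trainium recipe names.
trainium_estimator = PyTorch(
    role="<role>",
    instance_count=1,
    instance_type="ml.trn1.32xlarge",
    training_recipe="fine-tuning/llama/<a_trn1_recipe>",  # hypothetical
)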


To get started, browse the training recipes for popular publicly available FMs in the SageMaker HyperPod recipes GitHub repository.

You only need to edit straightforward recipe parameters to specify an instance type and the location of your dataset in the cluster configuration, then run the recipe with a single-line command to achieve state-of-the-art performance.

After cloning the repository, edit the recipe's config.yaml file to specify the model and cluster type.

$ git clone --recursive https://github.com/aws/sagemaker-hyperpod-recipes.git
$ cd sagemaker-hyperpod-recipes
$ pip3 install -r requirements.txt
$ cd ./recipes_collections
$ vim config.yaml

The recipes support SageMaker HyperPod with Slurm, SageMaker HyperPod with Amazon Elastic Kubernetes Service (Amazon EKS), and SageMaker training jobs. For example, you can set up a cluster type (the Slurm orchestrator), a model name (the Meta Llama 3.1 405B language model), an instance type (ml.p5.48xlarge), and your data locations, such as where to store the training data, results, and logs.

defaults:
- cluster: slurm # supported: slurm / k8s / sm_jobs
- recipes: fine-tuning/llama/hf_llama3_405b_seq8k_gpu_qlora # name of model to be trained
debug: False # set to True to debug the launcher configuration
instance_type: ml.p5.48xlarge # or other supported cluster instance types
base_results_dir: # location(s) to store results, checkpoints, logs, and more

You can optionally adjust model-specific training parameters in this YAML file, which outlines the optimal configuration, including the number of accelerator devices, instance type, training precision, parallelization and sharding techniques, the optimizer, and logging options for tracking experiments.

run:
  name: llama-405b
  results_dir: ${base_results_dir}/${.name}
  time_limit: "6-00:00:00"
restore_from_path: null
trainer:
  devices: 8
  num_nodes: 2
  accelerator: gpu
  precision: bf16
  max_steps: 50
  log_every_n_steps: 10
  ...
exp_manager:
  exp_dir: # location for TensorBoard logging
  name: helloworld
  create_tensorboard_logger: True
  create_checkpoint_callback: True
  checkpoint_callback_params:
    ...
  auto_checkpoint: True # for automated checkpointing
use_smp: True
distributed_backend: smddp # optimized collectives
# start training from pretrained model
model:
  model_type: llama_v3
  train_batch_size: 4
  tensor_model_parallel_degree: 1
  expert_model_parallel_degree: 1
  # other model-specific params

To run this recipe on SageMaker HyperPod with Slurm, first create a SageMaker HyperPod cluster by following the cluster setup instructions in the documentation.

Then, connect to the SageMaker HyperPod head node, access the Slurm controller, and copy the edited recipe.

Next, run a helper file to generate a Slurm submission script for the job, which you can use for a dry run to inspect the content before starting the training job.

$ python3 main.py --config-path=recipes_collection --config-name=config

After the training job completes, the trained model is automatically saved to your assigned data location.

To run this recipe on SageMaker HyperPod with Amazon EKS, clone the recipe repository from GitHub, install the dependencies, and edit the recipe (cluster: k8s) on your laptop. Then, connect your laptop to the running EKS cluster and use the SageMaker HyperPod command line interface (CLI) to run the recipe.

$ hyperpod start-job --recipe fine-tuning/llama/hf_llama3_405b_seq8k_gpu_qlora \
--persistent-volume-claims fsx-claim:data \
--override-parameters \
'{
  "recipes.run.name": "hf-llama3-405b-seq8k-gpu-qlora",
  "recipes.exp_manager.exp_dir": "/data/<your_exp_dir>",
  "cluster": "k8s",
  "cluster_type": "k8s",
  "container": "658645717510.dkr.ecr.<region>.amazonaws.com/smdistributed-modelparallel:2.4.1-gpu-py311-cu121",
  "recipes.model.data.train_dir": "<your_train_data_dir>",
  "recipes.model.data.val_dir": "<your_val_data_dir>"
}'

You can also run recipes using SageMaker training jobs. The following example runs a PyTorch training script on SageMaker training jobs, overriding the training recipe's parameters.

...
recipe_overrides = {
    "run": {
        "results_dir": "/opt/ml/model",
    },
    "exp_manager": {
        "exp_dir": "",
        "explicit_log_dir": "/opt/ml/output/tensorboard",
        "checkpoint_dir": "/opt/ml/checkpoints",
    },
    "model": {
        "data": {
            "train_dir": "/opt/ml/input/data/train",
            "val_dir": "/opt/ml/input/data/val",
        },
    },
}
pytorch_estimator = PyTorch(
    output_path=<output_path>,
    base_job_name="llama-recipe",
    role=<role>,
    instance_type="ml.p5.48xlarge",
    training_recipe="fine-tuning/llama/hf_llama3_405b_seq8k_gpu_qlora",
    recipe_overrides=recipe_overrides,
    sagemaker_session=sagemaker_session,
    tensorboard_output_config=tensorboard_output_config,
)
...
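As with any SageMaker training job, you would then start the run by calling fit() on the estimator. A minimal sketch, assuming your training and validation datasets live at hypothetical Amazon S3 locations that map to the train_dir and val_dir overrides above:

# Start the recipe-driven training job. The channel names ("train", "val")
# map to /opt/ml/input/data/train and /opt/ml/input/data/val inside the
# container, matching the recipe_overrides above. The S3 URIs are placeholders.
pytorch_estimator.fit(
    inputs={
        "train": "s3://<your_bucket>/train",
        "val": "s3://<your_bucket>/val",
    },
    wait=True,  # block until the training job finishes
)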

As training progresses, automatic checkpointing saves the model's progress, enabling fast recovery from training faults and instance restarts.
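The checkpoint settings live under the exp_manager section of the recipe YAML shown earlier, so when launching through SageMaker training jobs they could presumably be adjusted the same way as the other values — a minimal sketch, reusing the override dictionary pattern from the example above:

# Sketch: toggling automated checkpointing through recipe overrides.
# The keys mirror the exp_manager section of the recipe YAML shown earlier;
# the checkpoint path matches the training-jobs example above.
recipe_overrides = {
    "exp_manager": {
        "checkpoint_dir": "/opt/ml/checkpoints",
        "auto_checkpoint": True,  # automated checkpointing
    },
}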

Amazon SageMaker HyperPod recipes are now available in the SageMaker HyperPod recipes GitHub repository. To learn more, visit the SageMaker HyperPod product page and documentation.

Give the SageMaker HyperPod recipes a try, and send feedback through your usual AWS Support contacts.
