Since the release of the original GPT-2 models, OpenAI has now shared two more open-source models with the community. The recent release of gpt-oss-20B and gpt-oss-120B marked an exciting step forward, receiving strong feedback from the open-source ecosystem. These models introduced several architectural innovations that make them particularly efficient and capable.
At the same time, large reasoning models like OpenAI's o3 have shown how generating structured chains of thought can improve response accuracy and quality. However, despite their strengths, current OSS models still face a notable drawback: they struggle with deep reasoning, multi-step logic, and advanced math.
This is why fine-tuning and alignment are essential. By leveraging reinforcement learning and curated datasets, we can shape OSS models to reason more reliably and avoid biased or unsafe outputs. In this blog, we'll explore how to fine-tune gpt-oss-20B using a multilingual thinking dataset from Hugging Face, enabling the model to deliver more accurate, logical, and trustworthy results across diverse contexts.
Fine-Tuning LLMs
Fine-tuning a large language model (LLM) might sound complex, but the process can be broken down into five clear steps. In this blog, we'll walk through these stages using gpt-oss-20B and a multilingual thinking dataset from Hugging Face to improve reasoning performance and alignment.
1. Setup
The first step is to install all the required libraries and dependencies. This includes frameworks like Hugging Face Transformers, Accelerate, PEFT (Parameter-Efficient Fine-Tuning), TRL, and other utilities that will help us run training efficiently on GPUs.
2. Prepare the Dataset
Next, we'll download and preprocess the multilingual thinking dataset from Hugging Face. The data needs to be formatted into an instruction-response style, ensuring the model learns step-by-step reasoning across different languages.
3. Prepare the Model
We'll then load the base gpt-oss-20B model and configure it for fine-tuning. Instead of updating all parameters (which would be extremely resource-intensive), we'll use LoRA (Low-Rank Adaptation). This memory-efficient approach updates only small adapter weights while keeping the main model frozen.
4. Fine-Tuning
With everything in place, we train the model on our reasoning dataset. During this phase, reinforcement learning techniques can also be applied to align the model's behavior, reduce biases, and encourage safe, logical outputs.
5. Inference
Finally, we test the fine-tuned model by generating reasoning responses in multiple languages. This allows us to evaluate how well the model handles complex, multi-step logic across diverse linguistic contexts.

The Hardware
The parameters (the weights the model has learned) are the primary memory-consuming factor when fine-tuning a model. Since each parameter is 4 bytes, a 20-billion-parameter model already requires around 80 GB of memory just to store the weights in standard 32-bit precision. This drops to about 40 GB or even 10 GB for the weights alone if we use smaller formats, such as 16-bit or 4-bit. However, training also requires optimizer states, gradients, and temporary buffers, which add much more on top of the weights.
Beginners will find LoRA or QLoRA fine-tuning a helpful solution. You train only small adapter layers and freeze the original weights rather than updating all 20B parameters. The adapters add very little, and the frozen base model may only require about 10 GB with 4-bit quantization. Running on a single high-end GPU (such as a 48 GB or 80 GB card) is feasible because this approach typically fits within 30 to 50 GB of GPU memory. LoRA/QLoRA is far more practical than full fine-tuning, which is why most people working with 20B+ models use it.
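As a quick back-of-the-envelope check, the numbers above follow directly from bytes-per-parameter arithmetic. A rough sketch (weights only; optimizer states, gradients, and activations come on top):

# Rough weight-only memory estimates for a 20B-parameter model.
params = 20e9

for precision, bytes_per_param in [("fp32", 4), ("bf16/fp16", 2), ("4-bit", 0.5)]:
    gigabytes = params * bytes_per_param / 1e9
    print(f"{precision:>9}: ~{gigabytes:.0f} GB for the weights alone")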
To understand the full process of fine-tuning the GPT-OSS-20B model, we'll carefully go over each of these steps.
For the hardware configuration, we'll use RunPod, a GPU cloud provider. You can access RunPod from the following link: https://console.runpod.io/

We'll be using the H100 SXM GPU, which has 80 GB of VRAM.

To be safe for our pod environment, we will also increase the size of the Container Disk and Volume Disk to 200 GB in the template.

We can now deploy our pod after overriding these settings, which may take up to two minutes to set up. Next, we can select the Jupyter Notebook option, which will take us to a new tab with a Jupyter notebook environment very similar to the ones we work on locally.

Setting Up for Fine-Tuning
The Jupyter Notebook environment is very easy to use: we can open .ipynb, .py, and other types of files, along with the terminal.

Step 1: Setup
Before diving into fine-tuning, let's set up a clean environment to avoid dependency issues. We'll use uv, a modern Python package manager that makes it easy to handle virtual environments and installations.
Create a Virtual Environment
Open your terminal and run the following commands:
# Install uv if not already installed
pip install uv

# Create a virtual environment
uv venv

# Activate the environment
source .venv/bin/activate
Additionally, if needed, you can run these commands too:
apt-get update && apt-get upgrade -y

Install the Dependencies
With the virtual environment activated, the next step is to install all the required libraries. These give us the tools to load the model, apply parameter-efficient fine-tuning, and train effectively on GPUs.
Run the following commands in a notebook cell or your terminal:
# Install PyTorch with CUDA 12.8 support
%pip install torch --index-url https://download.pytorch.org/whl/cu128

# Install Hugging Face libraries and other tools
%pip install "trl>=0.20.0" "peft>=0.17.0" "transformers>=4.55.0" trackio
Here's what each of these does:
- torch -> The deep learning framework powering our training.
- trl -> Hugging Face's library for training with reinforcement learning (great for alignment tasks).
- peft -> Parameter-Efficient Fine-Tuning, enabling techniques like LoRA.
- transformers -> Core library for working with LLMs.
- trackio -> Lightweight experiment tracking to monitor training progress.
Once these are installed, we're ready to move on to dataset preparation.
We will also log in to Hugging Face using an access token, which we can get from our Profile Settings.
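One way to do this from a notebook is with the huggingface_hub helper; a minimal sketch that prompts you to paste the token you created:

from huggingface_hub import login

# Prompts for the access token created under https://huggingface.co/settings/tokens
login()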

If you get any issues with respect to git, you can run this command:
!git config --global credential.helper store
Step 2: Prepare the Dataset
Earlier, we discussed how the gpt-oss models, despite their efficiency, are not particularly strong in deep reasoning and multi-step logic. To address this, we'll fine-tune gpt-oss-20B on a specialized dataset that strengthens its reasoning ability.
For this, we'll use the Multilingual-Thinking dataset available on Hugging Face. This dataset is designed to test and train reasoning skills across multiple languages, making it an ideal choice for improving both logical accuracy and cross-lingual generalization.
Downloading the Dataset
We will probably be utilizing Multilingual-Considering, which is a reasoning dataset the place the chain-of-thought has been translated into a number of languages, akin to French, Spanish, and German. By fine-tuning openai/gpt-oss-20b on this dataset, it would study to generate reasoning steps in these languages, and thus its reasoning course of could be interpreted by customers who communicate these languages.
We will probably be utilizing solely the messages column of this dataset for our coaching.
Be aware:- You can also make a dataset by yourself, which ought to be much like this format and enclosed inside a dictionary that has key-value pairs, which might permit the mannequin to know what the query is and the suitable reply to this query.
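For reference, a custom dataset in this style is just a collection of rows, each carrying a "messages" list of role/content dictionaries. The example below is hypothetical; inspect dataset[0]["messages"] to see the exact fields Multilingual-Thinking uses:

# Hypothetical custom example in the same chat-style format
custom_example = {
    "messages": [
        {"role": "system", "content": "reasoning language: French"},
        {"role": "user", "content": "What is 12 * 7?"},
        {"role": "assistant", "content": "12 * 7 = 84."},
    ]
}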
We can fetch the dataset directly using the Hugging Face datasets library:
https://huggingface.co/datasets/HuggingFaceH4/Multilingual-Thinking
from datasets import load_dataset

dataset = load_dataset("HuggingFaceH4/Multilingual-Thinking", split="train")
dataset
This is a small dataset of 1,000 examples, but that is usually more than sufficient for models like openai/gpt-oss-20b, which have undergone extensive post-training.

Step 3: Tokenize and Format the Dataset
Before training, we need to process the dataset into a format the model understands. Since gpt-oss-20B is a chat-style model, we'll use its chat template to convert the dataset into conversational text that can be tokenized.
Load the Tokenizer
We start by loading the tokenizer of the gpt-oss-20B model:
from transformers import AutoTokenizer

# Load the tokenizer for GPT-OSS-20B
tokenizer = AutoTokenizer.from_pretrained("openai/gpt-oss-20b")
Apply Chat Template
Each example in the multilingual reasoning dataset contains structured messages. We can use the tokenizer's chat template to convert these into a plain-text conversation that the model can train on:
# Take the first example from the dataset
messages = dataset[0]["messages"]

# Convert structured messages into a plain-text conversation
conversation = tokenizer.apply_chat_template(messages, tokenize=False)
print(conversation)
The gpt-oss models were trained on the Harmony response format for defining conversation structures, generating reasoning output, and structuring function calls.
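If you want to preview the templated text for the whole dataset, a quick sketch is below. This is purely for inspection; TRL's SFTTrainer can apply the chat template itself when it sees a "messages" column:

# Map every example through the chat template and look at the first result
def to_text(example):
    return {"text": tokenizer.apply_chat_template(example["messages"], tokenize=False)}

preview = dataset.map(to_text)
print(preview[0]["text"][:500])  # first 500 characters of the templated conversation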
Step 4: Prepare the Model
Now that our dataset is ready, let's prepare the gpt-oss-20B model for fine-tuning. Since training a 20B-parameter model directly would be very resource-intensive, we'll make use of two key techniques:
- Quantization – reduces memory usage and speeds up inference by storing weights in lower precision (here we use MXFP4 quantization, which is used heavily for the OpenAI gpt-oss models).
- LoRA (Low-Rank Adaptation) – enables parameter-efficient fine-tuning by training only small adapter layers while keeping most of the model frozen.
Read more: Fine-Tuning LLMs with LoRA
Load the Base Model with Quantization
What is MXFP4?
- MXFP4 (Microscaling 4-bit Floating Point) is a quantization format developed to reduce memory usage and improve inference speed in large-scale autoregressive models like gpt-oss-20B.
- Unlike plain integer quantization (INT8/INT4), MXFP4 stores weights as 4-bit floating-point values that share a scaling factor per small block, which preserves more of the original model's numerical precision.
Why GPT models specifically?
- GPT-style models (decoder-only transformers) are extremely weight-heavy, especially in the attention and feed-forward layers.
- MXFP4 is well suited to these architectures because it targets the linear layers and attention projections, which dominate the memory footprint.
Advantages
- Memory Efficient: Reduces VRAM requirements massively (20B-parameter models fit on fewer GPUs).
- Speed: Enables faster inference by lowering precision without losing much quality.
- Accuracy Retention: Performs better than naïve INT4 quantization, especially on long-context reasoning tasks, where precision matters more.
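To build intuition before loading the real thing, here is a toy sketch of the block-scaling idea behind formats like MXFP4: a small group of weights shares one scale, and each element is stored on a coarse low-bit grid. This is not the actual MXFP4 encoding, just an illustration of why shared per-block scales retain more precision than one global scale:

import numpy as np

# Toy block-scaled quantization: NOT the real MXFP4 format, just the concept.
weights = np.random.randn(64).astype(np.float32)
block_size = 32

blocks, scales = [], []
for block in weights.reshape(-1, block_size):
    scale = np.abs(block).max() / 7.0            # one shared scale per block
    q = np.clip(np.round(block / scale), -7, 7)  # each value snapped to a coarse grid
    blocks.append(q)
    scales.append(scale)

dequantized = np.concatenate([q * s for q, s in zip(blocks, scales)])
print("max abs reconstruction error:", np.abs(weights - dequantized).max())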
import torch
from transformers import AutoModelForCausalLM, Mxfp4Config

# Configure MXFP4 quantization
quantization_config = Mxfp4Config(dequantize=True)

# Model kwargs for efficient training
model_kwargs = dict(
    attn_implementation="eager",
    torch_dtype=torch.bfloat16,
    quantization_config=quantization_config,
    use_cache=False,
    device_map="auto",
)

# Load GPT-OSS-20B with quantization
model = AutoModelForCausalLM.from_pretrained("openai/gpt-oss-20b", **model_kwargs)
Quick Test: Generating a Response
Before fine-tuning, let's make sure the model is working:
messages = [ {"role": "user", "content": "¿Cuál es la capital de Australia?"}, ] input_ids = tokenizer.apply_chat_template( messages, add_generation_prompt=True, return_tensors="pt", ).to(mannequin.machine) output_ids = mannequin.generate(input_ids, max_new_tokens=512) response = tokenizer.batch_decode(output_ids)[0] print(response)
At this stage, the model should return a response in Spanish (answer: Canberra).
Add LoRA Adapters
Now we integrate LoRA adapters to enable parameter-efficient fine-tuning.
from peft import LoraConfig, get_peft_model

# LoRA configuration
peft_config = LoraConfig(
    r=8,  # rank
    lora_alpha=16,
    target_modules="all-linear",
    target_parameters=[
        # Specific expert layers we want to adapt -- you can edit this to target other layers too
        "7.mlp.experts.gate_up_proj",
        "7.mlp.experts.down_proj",
        "15.mlp.experts.gate_up_proj",
        "15.mlp.experts.down_proj",
        "23.mlp.experts.gate_up_proj",
        "23.mlp.experts.down_proj",
    ],
)

# Apply LoRA to the model
peft_model = get_peft_model(model, peft_config)

# Print trainable parameters
peft_model.print_trainable_parameters()

We can target other model layers as well. You can check the model parameter names via this link – https://huggingface.co/openai/gpt-oss-20b/blob/main/model.safetensors.index.json
This setup ensures that only a small fraction of the model's parameters will be updated during training, keeping GPU memory requirements manageable while still allowing the model to learn reasoning skills.
Step 5: Fine-Tuning
With the model prepared and LoRA adapters applied, we're ready to fine-tune gpt-oss-20B on the multilingual reasoning dataset. For this, we'll use Hugging Face's TRL (Transformer Reinforcement Learning) library, which provides a simple SFTTrainer class for supervised fine-tuning.
Define the Training Configuration
We'll configure the training with a learning rate, batch size, logging frequency, and scheduler.
from trl import SFTConfig

# Training arguments
training_args = SFTConfig(
    learning_rate=2e-4,
    gradient_checkpointing=True,
    num_train_epochs=1,
    logging_steps=1,
    per_device_train_batch_size=4,
    gradient_accumulation_steps=4,
    max_length=2048,  # reduce this if you want to make training more lightweight
    warmup_ratio=0.03,
    lr_scheduler_type="cosine_with_min_lr",
    lr_scheduler_kwargs={"min_lr_rate": 0.1},
    output_dir="gpt-oss-20b-multilingual-reasoner",
    report_to="trackio",  # Logs training metrics
    push_to_hub=True,     # Push results to the Hugging Face Hub
)

If you wish to track the training logs elsewhere, you can also go with Weights & Biases (WandB), as shown in the sketch below.
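A minimal sketch of that change, assuming you have installed wandb and run wandb login (the run name is just an example):

from trl import SFTConfig

# Same training arguments as above -- only the reporting backend changes
training_args = SFTConfig(
    output_dir="gpt-oss-20b-multilingual-reasoner",
    report_to="wandb",                             # log metrics to Weights & Biases
    run_name="gpt-oss-20b-multilingual-reasoner",  # example run name
    # ...keep the learning rate, batch size, and scheduler settings from above...
)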
Initialize the Trainer
from trl import SFTTrainer

# Create the trainer
trainer = SFTTrainer(
    model=peft_model,
    args=training_args,
    train_dataset=dataset,
    processing_class=tokenizer,
)

# Start fine-tuning
trainer.train()

Monitor Logs
You can monitor your training progress with Trackio:
!trackio show --project "gpt-oss-20b-multilingual-reasoner"
or in Python:
import trackio

trackio.show(project="gpt-oss-20b-multilingual-reasoner")
Training Time
On a single H100 GPU, training takes about 17 to 18 minutes. On less powerful hardware, it may take longer depending on GPU memory and compute speed.
Save and Push the Model
Once training completes, save the fine-tuned model locally and push it to the Hugging Face Hub:
# Save the model locally
trainer.save_model(training_args.output_dir)

# Push to the Hugging Face Hub
trainer.push_to_hub(dataset_name="skhamzah123/GPT-OSS-20B_FT")
Now your model is live and shareable, ready for reasoning tasks in multiple languages.

Step 6: Inference
After fine-tuning, we can now generate reasoning responses in multiple languages using our gpt-oss-20B multilingual model. We first load the base gpt-oss-20B model, merge it with our fine-tuned LoRA adapters, and then generate responses using a structured chat template.
I'd suggest restarting your kernel before running these cells, since a lot of memory is already held in RAM and it might cause your kernel to crash.
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PeftModel

# Load the tokenizer
tokenizer = AutoTokenizer.from_pretrained("openai/gpt-oss-20b")

# Load the base model
model_kwargs = dict(attn_implementation="eager", torch_dtype="auto", use_cache=True, device_map="auto")
base_model = AutoModelForCausalLM.from_pretrained("openai/gpt-oss-20b", **model_kwargs).cuda()

# Merge the fine-tuned LoRA weights with the base model
peft_model_id = "skhamzah123/gpt-oss-20b-multilingual-reasoner"
model = PeftModel.from_pretrained(base_model, peft_model_id)
model = model.merge_and_unload()

# Define the reasoning language and prompt
REASONING_LANGUAGE = "Hindi"  # edit this to any other language
SYSTEM_PROMPT = f"reasoning language: {REASONING_LANGUAGE}"
USER_PROMPT = "¿Cuál es la capital de Australia?"

messages = [
    {"role": "system", "content": SYSTEM_PROMPT},
    {"role": "user", "content": USER_PROMPT},
]

input_ids = tokenizer.apply_chat_template(
    messages,
    add_generation_prompt=True,
    return_tensors="pt",
).to(model.device)

gen_kwargs = {"max_new_tokens": 512, "do_sample": True, "temperature": 0.6}
output_ids = model.generate(input_ids, **gen_kwargs)
response = tokenizer.batch_decode(output_ids)[0]
print(response)
We use eager attention / Flash Attention to keep inference lightweight and fast, reducing memory usage while still handling long sequences efficiently. The merge_and_unload() step combines the LoRA adapters with the base model so that inference runs without extra adapter overhead. By specifying the reasoning language in the system prompt, the model can generate step-by-step reasoning in multiple languages, demonstrating the effectiveness of multilingual fine-tuning.
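One small convenience, continuing from the generation cell above: the decoded response includes the prompt tokens, so you can slice them off to print only the newly generated reasoning and answer.

# Keep only the tokens generated after the prompt
new_tokens = output_ids[0][input_ids.shape[-1]:]
print(tokenizer.decode(new_tokens, skip_special_tokens=True))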
Conclusion
Fine-tuning gpt-oss-20B demonstrates how open-source large language models can be adapted to perform complex reasoning across multiple languages while remaining memory- and compute-efficient. By leveraging techniques like LoRA for parameter-efficient fine-tuning and MXFP4 quantization, we were able to enhance reasoning capabilities without requiring massive GPU resources.
Using the Multilingual-Thinking dataset allowed the model to learn step-by-step logic in different languages, making it more versatile and better aligned for real-world applications. With careful dataset preparation, model configuration, and inference optimization (eager/Flash Attention), OSS models can be safe, accurate, and performant, bridging the gap between open-source flexibility and practical utility.
This workflow not only highlights the strength of open-source LLMs but also provides a practical blueprint for anyone looking to fine-tune large models for reasoning, alignment, and multilingual capabilities.
Many of the resources in this article were drawn from the OpenAI cookbook. Please refer to it for more details.
Frequently Asked Questions
Q. What hardware do I need to fine-tune gpt-oss-20B?
A. With LoRA or QLoRA, you can fine-tune it on a single 80 GB GPU (like an H100). Full fine-tuning, however, requires 300 GB+ of GPU memory and multi-GPU setups.
Q. How does MXFP4 compare to standard INT4 quantization?
A. MXFP4 preserves better numerical precision, especially for long-context reasoning, while still reducing memory and speeding up inference compared to standard INT4 quantization.
Q. Can I fine-tune the model on my own dataset?
A. Yes. Just format your dataset in an instruction-response style (as dictionaries with "question" and "answer" pairs) so the model can learn structured reasoning from it.