Qwen models, developed by Alibaba, have shown strong performance in both code completion and instruction-following tasks. In this blog, we'll show how to register and deploy Qwen models on Databricks using an approach similar to the one used for Llama-based architectures. By following these steps, you can take advantage of Databricks' foundation model (Provisioned Throughput) endpoints, which benefit from low latency and high throughput.
Table of Contents
- Motivation: Why Serve Qwen Models on Databricks?
- The Core Idea
- Implementation: Annotated Code Walkthrough
- Performance and Limitations
- Summary and Next Steps
Motivation: Why Serve Qwen Models on Databricks?
For many enterprise workloads, Databricks is a one-stop platform to train, register, and serve large language models (LLMs). With Databricks Mosaic AI Model Serving, you can easily deploy fine-tuned or base models and use them for real-time or batch inference tasks.
The recently released Qwen 2.5 series of models shows strong performance in code completion and instruction tasks. At the time of their release, Qwen 2.5 models beat similarly sized models on standard benchmarks such as MMLU, ARC-C, MATH, and HumanEval, as well as on multilingual benchmarks such as Multi-Exam and Multi-Understanding. Qwen 2.5 Coder models show similar gains on coding benchmarks. This gives customers strong motivation for deploying these models in Databricks Model Serving to power their use cases.
Serving a Qwen model on Databricks involves four steps:
- Run a notebook to convert the Qwen model files to be compatible with the Llama architecture and Databricks model serving
- Register the Qwen model in Unity Catalog
- Deploy the registered model in Databricks Foundation Model Serving
- Conduct quality testing on the deployment, such as manual testing or running standard benchmarks directly against the endpoint
The Core Idea
Databricks foundation model serving provides optimized performance for models such as Meta's Llama models. Customers can deploy these models with provisioned throughput and achieve low latency and high throughput. While the Qwen models' underlying structure is very similar to that of the Llama models, certain modifications are required in order to take advantage of Databricks' model serving infrastructure. The steps below explain how to make the necessary changes.
Implementation: Annotated Code Walkthrough
Part 1) Rewrite Qwen's weights and config to be consistent with Llama models.
The steps in modify_qwen.py take a Qwen2.5 model and rewrite it to be consistent with the Llama architecture that is optimized for provisioned throughput on Databricks. Here are the key steps in the code (a condensed sketch follows the list):
- Load Qwen State Dict: Collect the `.safetensors` files from the original Qwen directory.
- Copy & Adjust Weights: Insert zero biases for attention outputs where Llama expects them.
- Rewrite the Config: Update fields like `"architectures"`, set `"model_type"` to `"llama"`, and remove Qwen-specific flags.
- Copy Tokenizer Files: Ensure we carry over `tokenizer.json`, `merges.txt`, and so on.
- Create Final Output Folder: The files in the new directory make it look like a standard Llama model.
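To make these steps concrete, here is a condensed sketch of the conversion. It is a simplified illustration, not modify_qwen.py itself: the function name, the exact set of Qwen-specific flags, and the handling of sharded-checkpoint index files are our assumptions.

```python
import json
import os
import shutil

import torch
from safetensors.torch import load_file, save_file


def llamafy_qwen(qwen_dir: str, out_dir: str) -> None:
    """Rewrite a Qwen2.5 checkpoint so it loads as a Llama model (sketch)."""
    os.makedirs(out_dir, exist_ok=True)

    # 1) Copy & adjust weights: Qwen ships no bias on the attention output
    #    projection, so insert zeros where the Llama layout expects one.
    #    (For sharded checkpoints the weight index JSON must be updated
    #    too; omitted here for brevity.)
    for fname in sorted(os.listdir(qwen_dir)):
        if not fname.endswith(".safetensors"):
            continue
        state = load_file(os.path.join(qwen_dir, fname))
        for name in list(state):
            if name.endswith("self_attn.o_proj.weight"):
                weight = state[name]
                state[name.replace(".weight", ".bias")] = torch.zeros(
                    weight.shape[0], dtype=weight.dtype
                )
        save_file(state, os.path.join(out_dir, fname))

    # 2) Rewrite config.json so serving stacks treat this as a Llama model.
    with open(os.path.join(qwen_dir, "config.json")) as f:
        config = json.load(f)
    config["architectures"] = ["LlamaForCausalLM"]
    config["model_type"] = "llama"
    config["attention_bias"] = True  # Qwen uses q/k/v biases
    for qwen_flag in ("use_sliding_window", "sliding_window", "max_window_layers"):
        config.pop(qwen_flag, None)  # assumed Qwen-specific flags
    with open(os.path.join(out_dir, "config.json"), "w") as f:
        json.dump(config, f, indent=2)

    # 3) Carry over the tokenizer files unchanged.
    for fname in ("tokenizer.json", "tokenizer_config.json",
                  "vocab.json", "merges.txt"):
        src = os.path.join(qwen_dir, fname)
        if os.path.exists(src):
            shutil.copy(src, out_dir)
```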
At the end of this step, you will have a Llama-compatible Qwen model. You can load the model in vLLM, which should treat it as a Llama model and be able to generate code or follow instructions, depending on which model you used.
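For instance, a quick smoke test with vLLM might look like the following (the local path is hypothetical):

```python
from vllm import LLM, SamplingParams

# The converted checkpoint advertises model_type "llama" in its config,
# so vLLM loads it with its Llama implementation.
llm = LLM(model="/local_disk0/qwen-as-llama")  # hypothetical output path

params = SamplingParams(temperature=0.0, max_tokens=64)
outputs = llm.generate(["def quicksort(arr):"], params)
print(outputs[0].outputs[0].text)
```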
Tip: You can use huggingface_hub.snapshot_download to fetch one of the Qwen models, such as Qwen/Qwen2.5-Coder-7B-Instruct, from Hugging Face to a local directory before performing the conversion.
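A minimal download step might look like this (the destination directory is illustrative):

```python
from huggingface_hub import snapshot_download

# Fetch the original Qwen checkpoint to local disk before conversion.
local_path = snapshot_download(
    repo_id="Qwen/Qwen2.5-Coder-7B-Instruct",
    local_dir="/local_disk0/qwen2.5-coder-7b-instruct",  # illustrative path
)
```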
Part 2) Register and Serve Qwen on Databricks
Next we'll cover how to log and serve the "Qwen as Llama" model on Databricks. This is handled by register_qwen.py. The steps here ensure that the model has the configuration that model serving expects for a Llama model. The key steps (a sketch of the final logging step follows the list):
- Specifying the path to the converted model from earlier.
- Modifying tokenizer configs (specifically, removing `chat_template` and setting `tokenizer_class`).
- Adjusting `config.json` to reflect Llama-compatible sequence lengths.
- Updating the model with Llama-like metadata before logging.
- Registering the model with MLflow so it can be served on a GPU endpoint.
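As a rough sketch of that final logging step (the catalog, schema, paths, and chosen inference task are illustrative assumptions; the actual register_qwen.py also performs the tokenizer and metadata fixes listed above):

```python
import mlflow
from transformers import AutoModelForCausalLM, AutoTokenizer

mlflow.set_registry_uri("databricks-uc")  # register into Unity Catalog

model_path = "/local_disk0/qwen-as-llama"  # converted model from Part 1
model = AutoModelForCausalLM.from_pretrained(model_path)
tokenizer = AutoTokenizer.from_pretrained(model_path)

with mlflow.start_run():
    mlflow.transformers.log_model(
        transformers_model={"model": model, "tokenizer": tokenizer},
        artifact_path="model",
        task="llm/v1/completions",  # completion-style serving for the Coder model
        registered_model_name="main.default.qwen25_coder_7b",  # illustrative name
    )
```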
Once this notebook has run, the model will be registered in Unity Catalog. Navigate to the model and click "Serve this model" to set up the endpoint. You should see the option to configure the endpoint with provisioned throughput at different tokens/second rates.
Testing the Endpoint
Once the endpoint is ready, you can run some basic tests to verify that it's working properly. Suppose we have deployed the Qwen2.5-Coder-7B model after performing the conversion and registration above. This model can either complete a piece of code or perform fill-in-the-middle. Let's use it to complete a simple sorting function. Under the "Use" dropdown, click "Query" and enter a request such as the following:
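A representative completion request looks like the body below; the workspace URL, endpoint name, and token are placeholders, and you can paste the same body as JSON into the Query UI or send it programmatically:

```python
import requests

# Placeholder workspace URL, endpoint name, and token; substitute your own.
url = "https://<workspace>.databricks.com/serving-endpoints/qwen-coder/invocations"
headers = {"Authorization": "Bearer <token>"}

body = {
    "prompt": "def bubble_sort(arr):\n",
    "max_tokens": 128,
    "temperature": 0.0,
}
print(requests.post(url, headers=headers, json=body).json())
```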
The text in the response contains the rest of the implementation.
For a more quantitative approach, you could generate completions for the HumanEval tasks, then run its evaluation to get the pass@1 metric and compare against the published results.
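A sketch of that loop, assuming OpenAI's human-eval package and the same placeholder endpoint as above (the response shape assumes a completions-style endpoint):

```python
import requests
from human_eval.data import read_problems, write_jsonl

URL = "https://<workspace>.databricks.com/serving-endpoints/qwen-coder/invocations"
HEADERS = {"Authorization": "Bearer <token>"}


def complete(prompt: str) -> str:
    body = {"prompt": prompt, "max_tokens": 512, "temperature": 0.0}
    resp = requests.post(URL, headers=HEADERS, json=body)
    # Assumes an OpenAI-style completions response shape.
    return resp.json()["choices"][0]["text"]


problems = read_problems()
samples = [
    {"task_id": task_id, "completion": complete(problem["prompt"])}
    for task_id, problem in problems.items()
]
write_jsonl("samples.jsonl", samples)
# Then run: evaluate_functional_correctness samples.jsonl  (from the human-eval repo)
```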
Performance and Limitations
- Manual Chat Formatting: Since we remove Qwen's built-in chat template, you must manually format system/user/assistant messages in your client code. This ensures the model can still interpret conversation turns properly (see the sketch after this list).
- Max Position Embeddings: We set `max_position_embeddings` to 16000 tokens to fit within certain Databricks constraints. If Qwen originally supported more, you may lose some maximum context length; however, you still gain provisioned throughput support.
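Assuming Qwen 2.5's ChatML-style turn markers (verify against your model's original tokenizer config), a minimal client-side formatter might look like:

```python
def format_chat(messages: list[dict]) -> str:
    """Render system/user/assistant turns in ChatML, as Qwen expects."""
    parts = [
        f"<|im_start|>{m['role']}\n{m['content']}<|im_end|>\n"
        for m in messages
    ]
    parts.append("<|im_start|>assistant\n")  # cue the model to respond
    return "".join(parts)


prompt = format_chat([
    {"role": "system", "content": "You are a helpful coding assistant."},
    {"role": "user", "content": "Write a function that reverses a string."},
])
```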
Summary and Next Steps
While Databricks does not currently support Qwen models directly on provisioned throughput model serving, the technique above lets you register and serve these models successfully by aligning them with the Llama models' architecture. This workaround is particularly useful if your team needs Qwen's capabilities but also wants the convenience of Databricks model serving endpoints and provisioned throughput.
Key Takeaway
- The Qwen and Llama models share enough architectural similarities that, with a few minor modifications (namely, to the tokenizer config and model metadata), Databricks' model serving infrastructure can readily serve Qwen models using provisioned throughput.
Future Considerations
- Keep an eye out for official Qwen support on Databricks model serving.
- Evaluate the performance overhead of the reduced context size.
- If you rely on chat-style prompting, remember to format your prompts manually on the client side.
Acknowledgments
- hiyouga's llamafy_qwen.py for the initial example that provided the basis for the Qwen conversion.
- The Databricks engineering team for clarifying the internal serving constraints.
- All the community members who tested and refined the approach.