Hey r/MachineLearning! Last week, Microsoft released Phi-4, a 14B open-source model that rivals OpenAI's GPT-4o-mini. I managed to find & fix 4 bugs impacting its output quality. You might remember me from previously fixing 8 bugs in Google's Gemma model! :)
I'm going to walk you through how I found & fixed the bugs. Phi-4's benchmarks were amazing, but many users reported weird or just plain wrong outputs. Since I maintain the open-source project 'Unsloth' (fine-tuning LLMs 2x faster with 70% less VRAM) with my brother, I first tested Phi-4 for inference and found many errors. Our GitHub repo: https://github.com/unslothai/unsloth
This time, the model had no implementation issues (unlike Gemma 2) but did have problems in its uploaded tokenizer and chat template configs. On my first inference run, I found an extra <|endoftext|> token appended after <|im_end|>, which is obviously incorrect (two EOS tokens are never a good idea). During more runs, I found an extra assistant prompt being added by default, which is once again incorrect. And lastly, from past experience with Unsloth's bug fixes, I already knew fine-tuning would break when I read the padding setup.
These bugs caused a drop in Phi-4's accuracy and also broke fine-tuning runs. Our fixes are now under review by Microsoft to be officially added to the Hugging Face repo. We uploaded the fixed versions to https://huggingface.co/unsloth/phi-4-GGUF
Here’s a breakdown of the bugs and their fixes:
1. Tokenizer bug fixes
The Phi-4 tokenizer interestingly uses <|endoftext|> as the BOS (beginning of sequence), EOS (end of sequence) and PAD (padding) token. The main issue is that the EOS token is wrong - it should be <|im_end|>. Otherwise, you will get <|im_end|><|endoftext|> in generations.
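If you want to apply this fix locally before pulling our uploads, here's a minimal sketch using the standard transformers tokenizer API. The actual fix lives in the uploaded tokenizer_config.json; this is just the runtime equivalent, and the repo name below is the original upload:

```python
# Minimal sketch: override Phi-4's EOS token at load time with transformers.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("microsoft/phi-4")
print(tokenizer.eos_token)        # "<|endoftext|>" in the original upload

# EOS should be the chat turn terminator, not the end-of-text token
tokenizer.eos_token = "<|im_end|>"
print(tokenizer.eos_token_id)     # now resolves to <|im_end|>'s id

# Pass the new id to generate() so decoding actually stops there, e.g.
# model.generate(..., eos_token_id=tokenizer.eos_token_id)
```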
2. Fine-tuning bug fixes
The padding token should be a designated pad token like in Llama (<|finetune_right_pad_id|>), or we can use an untrained token - for example we use <|dummy_87|> - which fixes infinite generations and broken outputs.
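A minimal sketch of the pad token change, again assuming the transformers tokenizer API (<|dummy_87|> is simply one of Phi-4's unused placeholder tokens):

```python
# Minimal sketch: give Phi-4 a dedicated, untrained pad token.
# If pad == eos, data collators mask every EOS in the labels as padding,
# so the model never learns to stop - hence the infinite generations.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("microsoft/phi-4")
print(tokenizer.pad_token)             # "<|endoftext|>" == EOS in the original upload

tokenizer.pad_token = "<|dummy_87|>"   # any unused, untrained token works
print(tokenizer.pad_token_id)
```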
3. Chat template issues
The Phi-4 tokenizer's chat template always adds an assistant prompt - it should only do this when add_generation_prompt is set. Most LLM serving libraries expect the assistant prompt not to be added automatically, so this can cause issues during serving.
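To see the difference, here's a minimal sketch of the expected behaviour with transformers' apply_chat_template (the repo name is the original upload; with the fixed template the assistant header only appears when you ask for it):

```python
# Minimal sketch: the assistant header should only appear when requested.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("microsoft/phi-4")
messages = [{"role": "user", "content": "Hello!"}]

# Training / eval data: no trailing assistant header should be appended
text = tokenizer.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=False
)

# Inference: explicitly ask for the generation prompt
prompt = tokenizer.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True
)

print(text.endswith("<|im_start|>assistant<|im_sep|>"))    # should be False after the fix
print(prompt.endswith("<|im_start|>assistant<|im_sep|>"))  # should be True
```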
We dive deeper into the bugs in our blog: https://unsloth.ai/blog/phi4
Do our Fixes Work?
Yes! Our fixed Phi-4 uploads show clear performance gains, with even better scores than Microsoft's original uploads on the Open LLM Leaderboard.
https://preview.redd.it/d8hew26e06ce1.png?width=2366&format=png&auto=webp&s=173c23feacc625566271470839fe7a5e25eb860e
Some redditors even tested our fixes and shared greatly improved results:
https://preview.redd.it/qx50pkq706ce1.png?width=1579&format=png&auto=webp&s=437da2cabdbf98ef5a8b8cbdc5592907a20e2316
https://preview.redd.it/sw1o3a3yt4de1.png?width=2326&format=png&auto=webp&s=fc6bfc45d14134d45f332ba58bbd1de049f5776b
We also made a Colab notebook to fine-tune Phi-4 completely for free using Google's free Tesla T4 (16GB) GPUs: https://colab.research.google.com/github/unslothai/notebooks/blob/main/nb/Phi_4-Conversational.ipynb
Thank you for reading this long post and I hope you all found this insightful! If you have any questions, please feel free to ask! :)
How I found the bugs:
- I first downloaded the original Phi-4 from https://huggingface.co/microsoft/phi-4 and tested inference. Weirdly, I found <|im_start|>assistant<|im_sep|> being appended at the end even with add_generation_prompt = False in Hugging Face, so I theorized there was a chat template problem. Adding assistant prompts by default can break serving libraries.
- And yes, https://huggingface.co/microsoft/phi-4/blob/f957856cd926f9d681b14153374d755dd97e45ed/tokenizer_config.json#L774 did add the assistant prompt by default - I fixed this first!
- I then found <|endoftext|> being used for the BOS, EOS and PAD tokens, which is a common issue amongst models. I ignored the BOS, since Phi-4 did not have one anyway, but changed the PAD token to <|dummy_87|>. You can select any of the unused dummy tokens since they're untrained. This counteracts infinite generations during finetuning.
- For Llama-fication, I used torch.allclose to confirm all tensors are in fact equivalent. I also ran some fake random data through both models to check that all activations are mostly similar (see the sketch after this list). I also uploaded the model to the HF Open LLM Leaderboard to confirm the original Phi-4 arch and the new Llama-fied model are equivalent.
- Finally I verified all finetuning runs with Unsloth in a Colab Notebook to confirm all runs were correct.
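For the activation check in the Llama-fication bullet above, here's a rough sketch of the idea, assuming both checkpoints load through transformers. Repo names, sequence length and tolerance are illustrative, not the exact script I ran:

```python
# Rough sketch: run identical random token ids through both checkpoints
# and compare the logits with torch.allclose.
import torch
from transformers import AutoModelForCausalLM

orig = AutoModelForCausalLM.from_pretrained("microsoft/phi-4", torch_dtype=torch.float32)
llamafied = AutoModelForCausalLM.from_pretrained("unsloth/phi-4", torch_dtype=torch.float32)
orig.eval(); llamafied.eval()

torch.manual_seed(0)
fake_ids = torch.randint(0, orig.config.vocab_size, (1, 64))

with torch.no_grad():
    logits_a = orig(fake_ids).logits
    logits_b = llamafied(fake_ids).logits

# Per-tensor weight checks work the same way with torch.allclose, once fused
# projections (e.g. qkv_proj) are split to match the Llama layout.
print(torch.allclose(logits_a, logits_b, atol=1e-4))
print((logits_a - logits_b).abs().max())
```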