nl 5 hours ago [-]
If you are going to go to the bother of fine tuning for trivial problems like subject classification then I think you'll find Scikit Learn with a SGDClassifier on 2-grams will do probably just as well and be under 1MB for the trained classifier.
You can train it in under a minute, and it will work perfectly well on embedded devices.
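To make the n-gram idea concrete, here is a rough, dependency-free sketch. It is not the scikit-learn pipeline itself: it swaps in a hand-rolled multiclass perceptron over word 1-gram and 2-gram counts, which is in the same family of linear models trained with per-example updates. In scikit-learn the equivalent would be a count vectorizer with `ngram_range=(1, 2)` feeding an `SGDClassifier`. The training sentences, labels, and function names below are made up for illustration.

```python
from collections import Counter, defaultdict

def ngrams(text):
    """Bag of word 1-grams and 2-grams for one text."""
    tokens = text.lower().split()
    feats = Counter(tokens)
    feats.update(" ".join(pair) for pair in zip(tokens, tokens[1:]))
    return feats

def score(weights_for_label, feats):
    """Dot product between a label's weight vector and the feature counts."""
    return sum(weights_for_label.get(f, 0.0) * v for f, v in feats.items())

def predict_feats(weights, feats):
    # Labels are visited in sorted order so ties resolve deterministically.
    return max(sorted(weights), key=lambda label: score(weights[label], feats))

def train(examples, epochs=5):
    """Multiclass perceptron: update weights only when a prediction is wrong."""
    labels = sorted({label for _, label in examples})
    weights = {label: defaultdict(float) for label in labels}
    for _ in range(epochs):
        for text, gold in examples:
            feats = ngrams(text)
            pred = predict_feats(weights, feats)
            if pred != gold:
                for f, v in feats.items():
                    weights[gold][f] += v
                    weights[pred][f] -= v
    return weights

def predict(weights, text):
    return predict_feats(weights, ngrams(text))

# Hypothetical toy data, only to show the shape of the approach.
EXAMPLES = [
    ("when did we replace the pool pump", "pool"),
    ("pool filter needs cleaning", "pool"),
    ("car oil change due", "car"),
    ("rotate the car tires", "car"),
]
weights = train(EXAMPLES)
print(predict(weights, "replace pool filter"))
print(predict(weights, "car tires need rotating"))
```

With a real dataset you would hold out a test split and compare against the fine-tuned model before drawing any conclusions.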
Small LLMs are good choices for text classification in two cases:
- If you need to provide in-context examples and classify based on them.
- Your classification goes beyond simple subject-type classifiers. For example, multiple choice question answering is classification where a small LLM will work but traditional ML methods won't.
djsjajah 4 hours ago [-]
Not with 800 examples. If you are going to consider an ngram model, I think you are better off getting a frontier llm to write you an absurd regex.
nl 30 minutes ago [-]
Hmm maybe. Turns out the author trained a logistic-regression classifier on the embeddings too, but didn't report the results:
there are models between 2-grams and 600m param models that would be good options. i don't expect a 2-gram to do very well here. also i'm not sure why this model isn't a fine choice if it solves their problem
throwa356262 1 hours ago [-]
What would you suggest instead?
deepsquirrelnet 4 hours ago [-]
If you want to go deeper on language models, try these project ideas:
- GEPA prompt tuning Qwen 0.6B (or GEPA, then GRPO)
- Use an embedding model and train a classifier (MLP, logistic, svm)
- Use a larger LLM to generate a synthetic dataset (beware of lack of diversity, mine "seed text" from real sources first)
- Synthetically generate "hard examples" where more than one category may be valid and DPO tune your preferred responses
mickael-kerjean 4 hours ago [-]
If you are interested in a small language model to fine-tune, gemma3:270m is quite interesting for its size
zwaps 2 hours ago [-]
Has anyone recently compared something like ModernBERT plus a classifier vs. full or LoRA fine-tuning of a small LM like Qwen?
kamranjon 4 minutes ago [-]
I have! I recently compared Gemma 1b to ModernBERT Large for a binary classification task and ModernBERT was the clear winner. It learned faster and performed the task better by a significant margin by the end of training. It seems the bidirectional encoder only architecture works really well for classification tasks, and I think it is related to being bidirectional whereas decoder only models like Gemma (or Qwen) can only “look backwards”. I used a mixture of FFT and LoRA as well as a mixture of CE Loss and SupCon Loss.
pj_mukh 2 hours ago [-]
“As an example, the question “When did we replace our pool pump?” will be mapped to a category called “pool” before querying the Index database.”
Cool write up! Really appreciate it but incidentally how does this categorization help you get better retrieval results?
mettamage 1 hours ago [-]
Categorization lets you choose a retrieval strategy based on the category.
throwa356262 1 hours ago [-]
Are 0.6b models useful without fine tuning?
Half the time I ask qwen 0.6b "what is 1 + 2?" it ends up in a thinking loop of "but wait, the user is asking me to ..."
kamranjon 1 minutes ago [-]
A fun thing I do with Qwen 3.5 0.8b is to take a screenshot of the Hacker News homepage and ask it to give me a JSON representation of the data, and it does surprisingly well. With a well structured prompt I think it could be made a pretty reliable tool for that type of task out of the box.
electroglyph 3 hours ago [-]
existing embedding models like alibaba's modernbert tune or one of the jina v5s would probably map query to category automatically. (i.e. store embeddings of each category and calculate cosine sim for each incoming query vs. categories and pick the closest)
also, you could stick a classifier head on a BERT model as another option.
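A minimal sketch of the nearest-category idea, in case it helps. The `embed` function here is a hypothetical stand-in (plain word counts) so the example runs without any model; in practice it would be a call to whatever embedding model you use. The category descriptions are also invented for illustration.

```python
import math
from collections import Counter

def embed(text):
    """Hypothetical stand-in for an embedding model: sparse word counts."""
    return Counter(text.lower().replace("?", "").split())

def cosine(a, b):
    """Cosine similarity between two sparse vectors."""
    dot = sum(a[k] * b[k] for k in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

# Assumed category descriptions; embed them once and store the vectors.
CATEGORIES = {
    "pool": "pool pump filter chlorine",
    "car": "car oil tires engine",
    "garden": "garden lawn mower plants",
}
CATEGORY_VECTORS = {name: embed(desc) for name, desc in CATEGORIES.items()}

def classify(query):
    """Pick the category whose stored vector is closest to the query."""
    q = embed(query)
    return max(CATEGORY_VECTORS, key=lambda name: cosine(q, CATEGORY_VECTORS[name]))

print(classify("When did we replace our pool pump?"))
```

One nice property of this setup is that adding a category only means storing one more vector, with no retraining.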
doubtfuluser 2 hours ago [-]
But why use a decoder model instead of a BERT-based model? For pure classification that should be easier to train and work quite well
abhashanand1501 2 hours ago [-]
Do small language models run on CPUs, or do you still need a GPU to run them?
nextaccountic 3 hours ago [-]
> The model invents new categories (e.g. apartments) and doesn’t stick to the provided list of allowed categories
Can this specific failure mode be solved by providing a grammar that the output must adhere to? (Not sure if Qwen has this feature; it's used e.g. to ensure the output is parseable JSON)
thomascountz 15 minutes ago [-]
Yes, you can use constrained decoding such as logit masking: set the logits of all invalid tokens in the vocabulary to -inf, which effectively removes them from selection. I believe llama.cpp exposes this by accepting a formatted grammar.
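Roughly, a single masked decoding step looks like the sketch below. The four-token vocabulary and the logit values are made up, and a real label usually spans several tokens, so an actual implementation applies the mask at every step.

```python
import math

def mask_logits(logits, allowed_ids):
    """Keep logits for allowed token ids, set everything else to -inf."""
    allowed = set(allowed_ids)
    return [x if i in allowed else -math.inf for i, x in enumerate(logits)]

def softmax(logits):
    """Numerically stable softmax; exp(-inf) evaluates to 0.0."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical toy vocabulary; "apartments" is not an allowed category.
VOCAB = ["pool", "car", "garden", "apartments"]
logits = [1.0, 0.5, 0.2, 3.0]   # the raw model prefers the invalid token
allowed_ids = [0, 1, 2]

probs = softmax(mask_logits(logits, allowed_ids))
choice = VOCAB[max(range(len(probs)), key=probs.__getitem__)]
print(choice, probs)
```

The masked token ends up with probability exactly zero, so it cannot be picked by greedy decoding or by sampling.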
nl 3 hours ago [-]
It can.
It's something that is implemented by the thing that runs the model (e.g. llama.cpp) rather than by the model itself.
Note that it is hard to make work if you turn thinking on because the grammar gets complicated quickly (I don't recall if Qwen 0.6B can do thinking).
aesthesia 1 hours ago [-]
Thinking shouldn't be too hard to deal with---just let the model generate freely until it hits a </think> token, then do constrained decoding, right?
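Something like this sketch is what I have in mind. To keep it runnable without a model, the "model" is just a fixed list of per-step logit rows, which is an assumption for illustration; a real decoder would compute each row from the tokens generated so far. The vocabulary and label set are also made up.

```python
import math

END_THINK = "</think>"
VOCAB = ["but", "wait", END_THINK, "pool", "car", "apartments"]
LABELS = {"pool", "car"}

def allowed_tokens(generated):
    """Unconstrained while thinking; only labels once </think> has appeared."""
    return LABELS if END_THINK in generated else set(VOCAB)

def decode(step_logits):
    """Greedy decode with the constraint switched on after </think>."""
    generated = []
    for logits in step_logits:
        allowed = allowed_tokens(generated)
        scored = [
            (value if token in allowed else -math.inf, token)
            for token, value in zip(VOCAB, logits)
        ]
        best = max(scored)[1]
        generated.append(best)
        # Stop once a label has been emitted in the constrained phase.
        if best in LABELS and END_THINK in generated:
            break
    return generated

# Hypothetical logits: the last row would pick "apartments" if unconstrained.
STEPS = [
    [2.0, 1.0, 0.0, 0.0, 0.0, 0.0],
    [1.0, 2.0, 0.0, 0.0, 0.0, 0.0],
    [0.0, 0.0, 2.0, 0.0, 0.0, 0.0],
    [0.0, 0.0, 0.0, 1.0, 0.5, 3.0],
]
print(decode(STEPS))
```

The remaining risk is a model that never emits `</think>`, so a real version would want a token budget that forces the switch.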
jszymborski 5 hours ago [-]
I think the Qwen 0.6B is so cool. It is super fast and as illustrated here it has a clear niche, esp. when fine-tuned.
I'm also interested in it as a student for distillation.
https://github.com/thelgevold/fine-tuned-classifier/blob/mai...
- Zero-shot encoders like tasksource or GliNER
- Natural language inference: https://huggingface.co/blog/dleemiller/nli-xenc-ways-to-use
- GRPO training