This is an open-ended question.
I’m not looking for a specific answer, just what people know about this topic.
I’ve asked this question on Huggingface discord as well.
But hey, asking on Lemmy is always good, right? No need to answer here. This is essentially a repost.
This might serve as an “update” of sorts from the previous post: https://lemmy.world/post/19509682
//----//
Question:
The FLUX model uses a combination of CLIP and T5 to create a text_encoding.
CLIP is capable of doing both image_encoding and text_encoding.
The T5 model seems to be strictly text-to-text.
So I can’t use T5 to create image_encodings, right?
https://huggingface.co/docs/transformers/model_doc/t5
But nonetheless, the T5 encoder is used in text-to-image generation.
So surely, there must be good uses for the T5 in creating a better CLIP interrogator?
Ideas/examples on how to do this?
I have 0% knowledge of T5, so feel free to just send me a link someplace if you don’t want to type out an essay.
//----//
For context:
I’m making my own version of a CLIP interrogator: https://colab.research.google.com/#fileId=https%3A//huggingface.co/codeShare/JupyterNotebooks/blob/main/sd_token_similarity_calculator.ipynb
Key difference is that this one samples the CLIP-vit-large-patch14 tokens directly instead of using pre-written prompts.
I text_encode the tokens individually and store them in a list for later use.
I’m using the method shown in this paper, “NND” (nearest neighbor decoding).
Methods for making better CLIP interrogators: https://arxiv.org/pdf/2303.03032
T5 encoder paper: https://arxiv.org/pdf/1910.10683
Example from the notebook where I’m using the NND method on 49K CLIP tokens (Roman girl image):
Most similar suffix tokens: "vfx "
Most similar prefix tokens: "imperi-"
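To make the lookup step concrete, here is a rough sketch of the nearest-neighbor idea as I understand it. It is not the notebook's actual code. I'm assuming cosine similarity as the ranking measure, and the tiny 3-dimensional vectors, the "banana " token, and the function names are all made up for illustration (real CLIP encodings are much larger and come from the model). It's plain Python so it runs without downloading anything.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # treat a zero vector as "not similar to anything"
    return dot / (norm_a * norm_b)


def nearest_tokens(query, token_encodings, top_k=3):
    """Rank the stored token encodings by similarity to the query encoding."""
    scored = [
        (token, cosine_similarity(query, vec))
        for token, vec in token_encodings.items()
    ]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_k]


# Hypothetical toy data: stand-ins for the pre-computed text encodings
# of individual tokens (in the notebook these would come from CLIP).
token_encodings = {
    "imperi-": [0.9, 0.1, 0.0],
    "vfx ": [0.1, 0.9, 0.1],
    "banana ": [0.0, 0.1, 0.9],
}

# Hypothetical stand-in for the image_encoding of the input image.
image_encoding = [0.8, 0.3, 0.05]

ranked = nearest_tokens(image_encoding, token_encodings, top_k=2)
for token, score in ranked:
    print(f"{token!r}: {score:.3f}")
```

The point is that the token encodings only need to be computed once; after that, each new image is just one encoding plus a similarity ranking over the stored list.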
qwen2.5 just came out, and looks amazing. You can try it with ollama.
Wow, yeah, I found a demo here: https://huggingface.co/spaces/Qwen/Qwen2.5
A whole host of LLM models seem to have been released. Thanks for the tip!
I’ll see if I can turn them into something useful 👍