CLIPTextEncodeFlux is an advanced text encoding node in ComfyUI, specifically designed for the Flux architecture. It uses a dual-encoder mechanism (CLIP-L and T5XXL) to process both structured keywords and detailed natural language descriptions, providing the Flux model with more accurate and comprehensive text understanding for improved text-to-image generation quality.
This node is based on a dual-encoder collaboration mechanism:
- The
clip_linput is processed by the CLIP-L encoder, extracting style, theme, and other keyword features—ideal for concise descriptions. - The
t5xxlinput is processed by the T5XXL encoder, which excels at understanding complex and detailed natural language scene descriptions. - The outputs from both encoders are fused, and combined with the
guidanceparameter to generate unified conditioning embeddings (CONDITIONING) for downstream Flux sampler nodes, controlling how closely the generated content matches the text description.
Inputs
| Parameter | Data Type | Input Method | Default | Range | Description |
|---|---|---|---|---|---|
clip | CLIP | Node input | None | - | Must be a CLIP model supporting the Flux architecture, including both CLIP-L and T5XXL encoders |
clip_l | STRING | Text box | None | Up to 77 tokens | Suitable for concise keyword descriptions, such as style or theme |
t5xxl | STRING | Text box | None | Nearly unlimited | Suitable for detailed natural language descriptions, expressing complex scenes and details |
guidance | FLOAT | Slider | 3.5 | 0.0 - 100.0 | Controls the influence of text conditions on the generation process; higher values mean stricter adherence to the text |
Outputs
| Output Name | Data Type | Description |
|---|---|---|
CONDITIONING | CONDITIONING | Contains the fused embeddings from both encoders and the guidance parameter, used for conditional image generation |
Usage Examples
Prompt Examples
-
clip_l input (keyword style):
- Use structured, concise keyword combinations
- Example:
masterpiece, best quality, portrait, oil painting, dramatic lighting - Focus on style, quality, and main subject
-
t5xxl input (natural language description):
- Use complete, fluent scene descriptions
- Example:
A highly detailed portrait in oil painting style, featuring dramatic chiaroscuro lighting that creates deep shadows and bright highlights, emphasizing the subject's features with renaissance-inspired composition. - Focus on scene details, spatial relationships, and lighting effects
Notes
- Make sure to use a CLIP model compatible with the Flux architecture
- It is recommended to fill in both
clip_landt5xxlto leverage the dual-encoder advantage - Note the 77-token limit for
clip_l - Adjust the
guidanceparameter based on the generated results