DeepSeek Unveils Janus-Pro: A New AI Image Generator

DeepSeek launches Janus-Pro—an open-source AI image model competing with Dall-E 3 and Stable Diffusion. Despite mixed test results, it aims to deliver high-quality visuals, improved text rendering, and robust multimodal features.

By magicteam

February 14, 2025

A young woman’s face split down the middle, with one side set against a dark background and the other transitioning into a lush, green landscape.

“A striking half-and-half composition merging a serene countryside backdrop with the natural beauty of a woman’s face.”


DeepSeek has garnered global attention in recent days for its open-source R1 model, a more affordable alternative to OpenAI’s o1. Before the excitement around R1 had waned, the Chinese startup introduced another open-source model, the image generator Janus-Pro. According to DeepSeek, this new system surpasses OpenAI’s Dall-E 3 and Stability AI’s Stable Diffusion in several benchmark tests. The question is whether Janus-Pro truly lives up to these claims or is simply another AI model riding the hype wave.

What Is Janus-Pro?

Janus-Pro can both understand images and generate them from text prompts. Built as an enhanced version of the original Janus model, Janus-Pro incorporates improved training methods, a larger dataset, and a larger model. According to DeepSeek, it produces more stable outputs from short prompts, with higher visual quality, finer detail, and a limited ability to render text within images.

Demonstrating Performance

Prompt: “The face of a beautiful girl”
- Comparison images published by DeepSeek suggest that Janus-Pro 7B yields more convincing facial features than the older Janus release.
Prompt: “A clear image of a blackboard with a clean, dark green surface and the word ‘Hello’ written precisely and legibly in the center with bold, white chalk letters.”
- The Janus-Pro version appears to handle text within images more effectively than its predecessor, although it may still face limitations.

Janus-Pro is available in two sizes—1 billion and 7 billion parameters—both generating images at a 384×384 resolution. Commercial users can access it under a permissive license.

Technical Overview

Janus-Pro distinguishes between multimodal understanding (analyzing images) and visual generation (creating images), aiming to prevent conflicts between these two tasks.

Multimodal Understanding
- SigLIP Encoder: Extracts high-dimensional semantic features from images.
- Understanding Adaptor: Maps these semantic features to the large language model’s (LLM) input space.
Visual Generation
- VQ Tokenizer: Converts images into discrete IDs.
- Generation Adaptor: Maps the codebook embeddings for those token IDs into the LLM’s input space, so the LLM can predict image tokens that are then decoded into the final image.
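
To make the "discrete IDs" idea concrete, here is a minimal sketch of vector quantization, the general technique behind a VQ tokenizer: each feature vector is replaced by the index of its nearest codebook entry. The codebook, the feature vectors, and the function names are toy values invented for illustration; they are not taken from Janus-Pro, whose real tokenizer uses a learned codebook and a neural encoder.

```python
# Toy vector quantization: map each feature vector to the ID of its nearest
# codebook entry. All values below are made up for illustration.

def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def quantize(vectors, codebook):
    """Return one discrete ID per vector: the index of the closest codebook entry."""
    ids = []
    for v in vectors:
        distances = [squared_distance(v, entry) for entry in codebook]
        ids.append(distances.index(min(distances)))
    return ids


def lookup(ids, codebook):
    """Map IDs back to codebook embeddings (the input an adaptor layer would receive)."""
    return [codebook[i] for i in ids]


# Hypothetical 4-entry codebook of 2-dimensional embeddings.
CODEBOOK = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (1.0, 1.0)]

# Hypothetical features for four image patches.
patch_features = [(0.1, 0.2), (0.9, 0.1), (0.2, 0.8), (0.7, 0.9)]

token_ids = quantize(patch_features, CODEBOOK)
print(token_ids)  # [0, 1, 2, 3]
```

Once an image is a sequence of integer IDs, a language model can predict those IDs one at a time in the same way it predicts text tokens.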

Benchmark Scores

GenEval: Janus-Pro 7B reportedly scores 0.80, outperforming Dall-E 3 and Stable Diffusion 3 Medium.
DPG-Bench: Janus-Pro 7B reportedly achieves 84.19, surpassing the other methods in DeepSeek’s comparison and indicating strong capability in following dense text-to-image instructions.

How Does Janus-Pro Compare to Dall-E 3 or Stable Diffusion?

The benchmark results reported by DeepSeek suggest that Janus-Pro outperforms Dall-E 3 and Stable Diffusion. However, sample side-by-side comparisons often show that Dall-E 3 produces more accurate faces, body proportions, and text in images:

Prompt: “A photo of a herd of red sheep on a green field.”
- The Dall-E 3 output appeared more coherent than the Janus-Pro image.
Prompt: “A beautiful 35 year old woman of average build wearing a pink tulle dress sits on the ground in front of the Eiffel Tower…”
- Janus-Pro struggled with proportions, whereas Dall-E 3 displayed more precise visual details.
Prompt: “An image of a little boy holding a white board with the text ‘AI is awesome!’”
- Dall-E 3 produced clearer text, while Janus-Pro’s letters were somewhat distorted.

It is possible that fine-tuning or different parameter settings might improve Janus-Pro’s outputs. With default settings, however, Dall-E 3 often seems to provide more polished results.

For those seeking a stronger AI image generator, Flux Pro 1.1 Ultra, available through Flux Labs AI, is frequently cited as among the best. This open-weight model allows custom fine-tuning on user-provided images.

Getting Started with Janus-Pro

DeepSeek has made the Janus models freely available on Hugging Face, supporting broader academic and commercial research:

- Janus-1.3B
- JanusFlow-1.3B
- Janus-Pro-1B
- Janus-Pro-7B

Note that Janus-Pro 7B uses nearly 15 GB of memory. For those who do not wish to run the model locally, a Gradio demo is provided on Hugging Face, enabling text-to-image generation and image captioning directly in the browser.
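
A rough back-of-the-envelope check helps explain the memory figure. The sketch below estimates the memory needed for model weights alone, assuming a nominal 7 billion parameters stored at 2 bytes each (16-bit precision). Both the precision and the rounded parameter counts are assumptions, and the function name is hypothetical; activations and framework overhead come on top of the weights, which would be consistent with a total of nearly 15 GB.

```python
def weight_memory_gb(num_params, bytes_per_param):
    """Approximate memory for model weights only, in GB (1 GB = 1e9 bytes)."""
    return num_params * bytes_per_param / 1e9


# Nominal parameter counts; the exact counts differ slightly.
MODELS = [("Janus-Pro-1B", 1e9), ("Janus-Pro-7B", 7e9)]

# Assumed storage precisions and their size in bytes per parameter.
PRECISIONS = [("32-bit", 4), ("16-bit", 2)]

for name, params in MODELS:
    for label, nbytes in PRECISIONS:
        print(f"{name} at {label}: ~{weight_memory_gb(params, nbytes):.0f} GB")
```

By the same estimate, the 1B variant needs about 2 GB for its weights at 16-bit precision, which makes it the more practical choice on consumer hardware.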

Example: Multimodal Understanding

Users can upload an image and prompt Janus-Pro to explain it. For instance, providing a “buff Doge vs. Cheems” meme yields an explanation of how the two Doges represent advanced and simple visual encoding approaches, respectively. This feature has potential for auto-captioning or generating alternative text.

Sample Code Snippet

DeepSeek offers an inference script to generate images from text. The process involves:

- Loading the Janus-Pro-7B model into memory.
- Encoding text prompts using VLChatProcessor.
- Storing output tokens for each generated image.
- Decoding tokens into a final 384×384 image.

Users can adapt this script for custom workflows or integrate Janus-Pro into existing pipelines.
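
The shape of that process can be sketched without the real model. The toy example below is not DeepSeek’s script and does not call the Janus library: every function is a hypothetical stand-in, the "model" samples tokens at random, and the 16-pixel patch size and 256-entry vocabulary are assumptions chosen for illustration. It only shows how a prompt becomes a fixed-length token sequence that is then laid out as a 384×384 image.

```python
import random

IMG_SIZE = 384    # output resolution stated in the article
PATCH_SIZE = 16   # assumption: each token covers a 16x16 pixel patch
VOCAB_SIZE = 256  # toy codebook size, chosen only for this sketch
GRID = IMG_SIZE // PATCH_SIZE   # 24 patches per side
TOKENS_PER_IMAGE = GRID * GRID  # 24 * 24 = 576 tokens


def encode_prompt(prompt):
    """Stand-in for the text processor: turn a prompt into integer IDs."""
    return [ord(ch) % VOCAB_SIZE for ch in prompt]


def next_token(context, rng):
    """Stand-in for one model step. A real model would compute logits from
    the context; this toy version ignores it and samples uniformly."""
    return rng.randrange(VOCAB_SIZE)


def generate_image_tokens(prompt_ids, rng):
    """Autoregressively produce and store one token per image patch."""
    context = list(prompt_ids)
    image_tokens = []
    for _ in range(TOKENS_PER_IMAGE):
        tok = next_token(context, rng)
        image_tokens.append(tok)
        context.append(tok)  # each new token conditions the next step
    return image_tokens


def decode_tokens(image_tokens):
    """Stand-in decoder: paint each token as a flat grayscale patch."""
    image = [[0] * IMG_SIZE for _ in range(IMG_SIZE)]
    for idx, tok in enumerate(image_tokens):
        row, col = divmod(idx, GRID)
        for y in range(row * PATCH_SIZE, (row + 1) * PATCH_SIZE):
            for x in range(col * PATCH_SIZE, (col + 1) * PATCH_SIZE):
                image[y][x] = tok
    return image


rng = random.Random(0)  # fixed seed so the toy run is repeatable
prompt_ids = encode_prompt("A photo of a herd of red sheep on a green field.")
tokens = generate_image_tokens(prompt_ids, rng)
image = decode_tokens(tokens)
print(len(tokens), len(image), len(image[0]))  # 576 384 384
```

In a real pipeline, the random sampler would be replaced by the language model’s predictions and the flat-patch decoder by the VQ tokenizer’s learned decoder.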

Final Thoughts

Although DeepSeek promotes Janus-Pro as a competitor to Dall-E 3, real-world testing indicates that Janus-Pro may lag behind in generating consistently high-quality images. Its 384×384 resolution and associated reconstruction losses can result in outputs with less detail than some might expect. Nonetheless, Janus-Pro’s open-source availability underscores DeepSeek’s intent to innovate and drive competition in the AI image arena. As the company continues refining its technology, its commitment to accessible, open development could pose a disruptive force in the broader marketplace.