DeepSeek Unveils Janus-Pro: A New AI Image Generator
DeepSeek has garnered global attention in recent days for its open-source R-1 model, a more affordable alternative to OpenAI’s o1. Even before the excitement around R-1 has waned, the Chinese startup has introduced yet another open-source AI image model, Janus-Pro. According to DeepSeek, this new system surpasses OpenAI’s Dall-E 3 and Stability AI’s Stable Diffusion in several benchmark tests. The question is whether Janus-Pro truly lives up to these claims or if it is simply another AI model riding the hype wave.
What Is Janus-Pro?
Janus-Pro can both understand and generate images from text prompts. Built as an enhanced version of the original Janus model, Janus-Pro incorporates improved training methods, a larger dataset, and a more extensive architecture. Notably, it produces more stable outputs in response to short prompts and claims to deliver higher visual quality, greater detail, and limited text-generation capabilities within images.
Demonstrating Performance
- Prompt: “The face of a beautiful girl”
- Comparison images published by DeepSeek suggest that Janus-Pro 7B yields more convincing facial features than the older Janus release.
- Prompt: “A clear image of a blackboard with a clean, dark green surface and the word ‘Hello’ written precisely and legibly in the center with bold, white chalk letters.”
- The Janus-Pro version appears to handle text within images more effectively than its predecessor, although it may still face limitations.
Janus-Pro is available in two sizes—1 billion and 7 billion parameters—both generating images at a 384×384 resolution. Commercial users can access it under a permissive license.
Technical Overview
Janus-Pro distinguishes between multimodal understanding (analyzing images) and visual generation (creating images), aiming to prevent conflicts between these two tasks.
- Multimodal Understanding
- SigLIP Encoder: Extracts high-dimensional semantic features from images.
- Understanding Adaptor: Maps these semantic features to the large language model’s (LLM) input space.
- Visual Generation
- VQ Tokenizer: Converts images into discrete IDs.
- Generation Adaptor: Translates those token IDs back into the LLM’s input space for final image creation.
Benchmark Scores
- GenEval: Janus-Pro 7B reportedly scores 0.80, outperforming Dall-E 3 and Stable Diffusion 3 Medium.
- DPG-Bench: Achieves 84.19, surpassing other methods and indicating strong capability in following dense text-to-image instructions.
How Does Janus-Pro Compare to Dall-E 3 or Stable Diffusion?
DeepSeek’s internal benchmarks suggest that Janus-Pro outperforms Dall-E 3 and Stable Diffusion. However, sample side-by-side comparisons often show that Dall-E 3 produces more accurate faces, body proportions, and text in images:
- Prompt: “A photo of a herd of red sheep on a green field.”
- The Dall-E 3 output appeared more coherent than the Janus-Pro image.
- Prompt: “A beautiful 35 year old woman of average build wearing a pink tulle dress sits on the ground in front of the Eiffel Tower…”
- Janus-Pro struggled with proportions, whereas Dall-E 3 displayed more precise visual details.
- Prompt: “An image of a little boy holding a white board with the text ‘AI is awesome!’”
- Dall-E 3 produced clearer text, while Janus-Pro’s letters were somewhat distorted.
It is possible that specific fine-tuning or parameters might improve Janus-Pro’s outputs. By default settings, however, Dall-E 3 often seems to provide more polished results.
For those seeking a superior AI image generator, the Flux Pro 1.1 Ultra within Flux Labs AI is frequently cited as among the best. This open-weight model allows custom fine-tuning on user-provided images.
Getting Started with Janus-Pro
DeepSeek has made Janus models freely available on HuggingFace, supporting broader academic and commercial research:
- Janus-1.3B
- JanusFlow-1.3B
- Janus-Pro-1B
- Janus-Pro-7B
Note that Janus-Pro 7B uses nearly 15GB of memory. For those not wishing to run the model locally, a Gradio demo is provided on HuggingFace, enabling text-to-image and image captioning directly in the browser.
Example: Multimodal Understanding
Users can upload an image and prompt Janus-Pro to explain it. For instance, providing a “buff Doge vs. Cheems” meme yields a breakdown of how each Doge represents advanced or simple visual encoding approaches, respectively. This feature has potential for auto-captioning or generating alternative text.
Sample Code Snippet
DeepSeek offers an inference script to generate images from text. The process involves:
- Loading the Janus-Pro-7B model into memory.
- Encoding text prompts using VLChatProcessor.
- Storing output tokens for each generated image.
- Decoding tokens into a final 384×384 image.
Users can adapt this script for custom workflows or integrate Janus-Pro into existing pipelines.
Final Thoughts
Although DeepSeek promotes Janus-Pro as a competitor to Dall-E 3, real-world testing indicates that Janus-Pro may lag behind in generating consistently high-quality images. Its 384×384 resolution and associated reconstruction losses can result in outputs with less detail than some might expect. Nonetheless, Janus-Pro’s open-source availability underscores DeepSeek’s intent to innovate and drive competition in the AI image arena. As the company continues refining its technology, its commitment to accessible, open development could pose a disruptive force in the broader marketplace.