GLM-Image is the world's first open-source industrial-grade autoregressive image generation model, achieving 91.16% word accuracy on the CVTG-2K benchmark and ranking #1 among open-source models.
GLM-Image is perfect for posters, presentations, infographics, and any scenario requiring precise text and complex information.
MIT licensed · API from $0.014/image · Self-hosting available
Generate images with AI models, with support for both text-to-image and image-to-image tasks.
GLM-Image is ZhipuAI's flagship image generation model, built on a hybrid architecture that pairs a 9B autoregressive generator with a 7B diffusion decoder. It represents a breakthrough in 'cognitive generation' technology: the first open-source industrial-grade discrete autoregressive image model with exceptional text rendering capabilities.
GLM-Image features a specialized Glyph Encoder that dramatically improves text generation accuracy. It achieves 91.16% word accuracy on the CVTG-2K benchmark, ranking #1 among all open-source models, and renders both Chinese (97.88%) and English (95.24%) text on LongText-Bench with clear, complete strokes.
GLM-Image is optimized for posters, presentations, infographics, and scenarios requiring accurate semantic understanding. The autoregressive module leverages GLM-4-9B's language capabilities for superior instruction following and complex information expression.
GLM-Image combines a 9B-parameter autoregressive generator (initialized from GLM-4-9B-0414) with a 7B-parameter single-stream DiT diffusion decoder. The AR module handles semantic understanding while the diffusion module focuses on high-fidelity detail restoration.
GLM-Image supports both text-to-image and image-to-image tasks in a single model. Capabilities include image editing, style transfer, identity-preserving generation for people and objects, and multi-subject consistency.
GLM-Image excels where traditional diffusion models struggle: precise text rendering, semantic understanding, and knowledge-dense visual content. Post-training with GRPO reinforcement learning ensures both aesthetic quality and text accuracy.
Multiple integration options for different use cases — from quick API calls to full local deployment with transformers and diffusers.
Describe your image with detailed text including subject, style, composition, and any text you want rendered. For best results, use quotation marks for text content and specify layout structure clearly.
Set resolution (512px-2048px, multiples of 32), aspect ratio (1:1, 3:4, 4:3, 16:9), and inference parameters. Recommended sizes include 1280×1280, 1568×1056, and 1472×1088 for optimal results.
The 9B autoregressive module first generates semantic tokens (roughly 256, expanding to 1K-4K), then the 7B diffusion decoder renders high-resolution details. Use a guidance_scale between 1.5 and 7.5 depending on prompt specificity.
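To get an intuition for how the token count scales with resolution, here is a small sketch. It assumes one visual token per 32×32 patch; that patch size is an illustrative assumption, not a published spec, but it reproduces the quoted 1K-4K range for supported resolutions.

```python
def semantic_token_estimate(width: int, height: int, patch: int = 32) -> int:
    """Estimate the expanded visual token count for a given resolution,
    assuming one token per 32x32 patch (an illustrative assumption)."""
    if width % patch or height % patch:
        raise ValueError("width and height must be multiples of 32")
    return (width // patch) * (height // patch)

# Under this assumption: 1024x1024 -> 1024 tokens, 2048x2048 -> 4096 tokens,
# consistent with the 1K-4K range above.
```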
Export your generated images in high resolution. For image-to-image tasks, upload reference images for editing, style transfer, or identity-preserving modifications. Iterate quickly with the same prompt recipe.
Industrial-grade capabilities backed by benchmark-leading performance. Built for commercial applications requiring reliable text rendering and semantic precision.
Autoregressive generator (9B params) initialized from GLM-4-9B with visual token vocabulary. Diffusion decoder (7B params) uses single-stream DiT for latent-space image decoding with Glyph Encoder.
CVTG-2K benchmark word accuracy of 0.9116, NED score of 0.9557. LongText-Bench: 95.24% English, 97.88% Chinese. #1 among open-source models for text rendering.
Generate images from 512px to 2048px (width and height must be multiples of 32). Supports 1:1, 3:4, 4:3, 16:9, and custom aspect ratios for any use case.
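A minimal helper for working within these constraints: it rounds arbitrary requested dimensions to the nearest multiple of 32 and clamps them into the supported 512-2048px range. The function name is our own; only the constraints come from the spec above.

```python
def snap_to_grid(width: int, height: int, step: int = 32,
                 lo: int = 512, hi: int = 2048) -> tuple[int, int]:
    """Round requested dimensions to the nearest multiple of 32
    and clamp them into the supported 512-2048px range."""
    def snap(v: int) -> int:
        v = round(v / step) * step      # nearest multiple of `step`
        return max(lo, min(hi, v))      # clamp into [lo, hi]
    return snap(width), snap(height)

# e.g. snap_to_grid(1570, 1050) -> (1568, 1056), one of the recommended sizes
```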
Unified model supports generation from text prompts, image editing, style transfer, background replacement, multi-subject fusion, and identity-preserving generation.
Post-training uses GRPO algorithm with modular feedback: OCR reward for text accuracy, aesthetic reward for visual quality, and AIGC detector as adversarial reward.
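The modular-feedback idea can be illustrated with a toy reward combination. The linear form, the weights, and the function name below are assumptions for illustration only, not the published training objective; the three signals are the ones named above.

```python
def combined_reward(ocr_score: float, aesthetic_score: float,
                    detector_score: float,
                    w_ocr: float = 0.5, w_aes: float = 0.3,
                    w_adv: float = 0.2) -> float:
    """Toy weighted combination of the modular GRPO feedback signals.

    ocr_score       -- text-rendering accuracy in [0, 1], higher is better
    aesthetic_score -- visual-quality score in [0, 1], higher is better
    detector_score  -- AIGC detector's 'looks generated' probability in [0, 1];
                       used adversarially, so lower is better
    The weights and linear form are illustrative assumptions.
    """
    return (w_ocr * ocr_score
            + w_aes * aesthetic_score
            + w_adv * (1.0 - detector_score))
```

A sample that renders text perfectly, looks good, and fools the detector scores the maximum reward; a sample the detector flags as obviously generated is penalized.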
Available on Hugging Face (zai-org/GLM-Image) and ModelScope. Full commercial use permitted. Self-host with transformers + diffusers or use the ZhipuAI cloud API.
GLM-Image leads open-source models in text rendering accuracy while matching mainstream latent diffusion methods in general image quality.
CVTG-2K benchmark word accuracy for complex visual text generation. #1 among all open-source models.
LongText-Bench Chinese score. Evaluates long and multi-line text rendering across 8 text-intensive scenarios.
LongText-Bench English score. Clear, complete strokes with minimal character errors for professional materials.
From commercial posters to technical diagrams, GLM-Image excels in scenarios requiring accurate text and semantic understanding.
Generate holiday posters, promotional banners, and sale advertisements with precise brand text and product information. Complete compositions with clear visual hierarchy in seconds.
Marketing Teams
Commercial Posters & Ads
Create presentation visuals with titles, data representations, and concept diagrams. Clear, readable text with professional layouts for quarterly reports and product launches.
Business Professionals
Presentations & Slides
Generate educational illustrations, knowledge charts, and process diagrams. Accurately express professional knowledge and help readers understand abstract concepts.
Educators
Infographics & Data Viz
Quickly generate social media images, video thumbnails, and cover art. Supports multiple styles and aspect ratios for different platform requirements.
Content Creators
Social Media Graphics
Perform style conversion, background replacement, and portrait enhancement on existing photos. Maintains subject consistency for professional-grade results.
Designers
Image Editing & Style Transfer
Integrate via ZhipuAI API or self-host with Hugging Face transformers and diffusers. MIT license allows full commercial use with flexible deployment options.
Developers
API Integration & Self-Hosting
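For API integration, a request body might be assembled as sketched below. The field names and the "glm-image" model identifier are placeholders; consult the ZhipuAI API reference for the actual endpoint URL and parameter names. Only the resolution constraints come from the spec above.

```python
def build_generation_payload(prompt: str, width: int = 1280, height: int = 1280,
                             guidance_scale: float = 3.5) -> dict:
    """Assemble a request body for an image-generation endpoint.

    Field names are placeholders, not the documented ZhipuAI schema;
    the dimension checks mirror the 512-2048px / multiple-of-32 constraints.
    """
    if not (512 <= width <= 2048 and 512 <= height <= 2048):
        raise ValueError("dimensions must be within 512-2048px")
    if width % 32 or height % 32:
        raise ValueError("dimensions must be multiples of 32")
    return {
        "model": "glm-image",          # placeholder model identifier
        "prompt": prompt,
        "size": f"{width}x{height}",
        "guidance_scale": guidance_scale,
    }
```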
Get updates on new features, benchmark results, and deployment optimizations. No spam — just technical updates that matter.
Technical details, deployment options, and usage guidelines for GLM-Image.
Need more help? Check the Hugging Face model card or contact support.
Experience industry-leading text rendering and knowledge-intensive image generation. Open-source, MIT licensed, and ready for production.