Powered by ZhipuAI's 9B+7B Hybrid Architecture

GLM-Image: Industry-Leading Text Rendering in AI Image Generation

GLM-Image is the world's first open-source industrial-grade autoregressive image generation model, achieving 91.16% word accuracy on the CVTG-2K benchmark and ranking #1 among open-source models.
It is ideal for posters, presentations, infographics, and any scenario that requires precise text and complex information.

MIT licensed · API from $0.014/image · Self-hosting available

AI Image Generator

Generate images with AI models, supporting both text-to-image and image-to-image workflows.

What is GLM-Image

GLM-Image is ZhipuAI's flagship image generation model, built on a hybrid architecture that pairs a 9B autoregressive generator with a 7B diffusion decoder. It represents a breakthrough in 'cognitive generation' technology: the first open-source industrial-grade discrete autoregressive image model with exceptional text rendering capabilities.

Superior Text Rendering

GLM-Image features a specialized Glyph Encoder that dramatically improves text generation accuracy. It achieves 91.16% word accuracy on the CVTG-2K benchmark, ranking #1 among open-source models, and supports both Chinese (97.88%) and English (95.24%) text with clear, complete strokes.

Knowledge-Intensive Generation

GLM-Image is optimized for posters, presentations, infographics, and scenarios requiring accurate semantic understanding. The autoregressive module leverages GLM-4-9B's language capabilities for superior instruction following and complex information expression.

Hybrid Architecture

GLM-Image combines a 9B-parameter autoregressive generator (initialized from GLM-4-9B-0414) with a 7B-parameter single-stream DiT diffusion decoder. The AR module handles semantic understanding while the diffusion module focuses on high-fidelity detail restoration.

Unified Generation Pipeline

GLM-Image supports both text-to-image and image-to-image tasks in a single model. Capabilities include image editing, style transfer, identity-preserving generation for people and objects, and multi-subject consistency.

Why Choose GLM-Image

GLM-Image excels where traditional diffusion models struggle: precise text rendering, semantic understanding, and knowledge-dense visual content. Post-training with GRPO reinforcement learning ensures both aesthetic quality and text accuracy.

The integrated Glyph Encoder text module significantly reduces the common 'forgetting characters while writing' problem. Unlike pure diffusion models, GLM-Image maintains text coherence throughout the generation process, producing clear, complete strokes with minimal character errors.

How to Use GLM-Image

Multiple integration options for different use cases — from quick API calls to full local deployment with transformers and diffusers.

Step 1: Write Your Prompt

Describe your image with detailed text including subject, style, composition, and any text you want rendered. For best results, use quotation marks for text content and specify layout structure clearly.
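
For instance, a poster prompt might look like the sketch below; the wording is purely illustrative, not a required template.

prompt = (
    "A minimalist winter sale poster, warm red and gold palette, "
    'large headline text "WINTER SALE" centered at the top, '
    'subtitle "Up to 50% off, Dec 20-31" just below it, '
    "clean layout with a product photo in the lower half"
)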

Step 2: Configure Settings

Set resolution (512px-2048px, multiples of 32), aspect ratio (1:1, 3:4, 4:3, 16:9), and inference parameters. Recommended sizes include 1280×1280, 1568×1056, and 1472×1088 for optimal results.
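
If you script your generations, a small helper like this hypothetical sketch (not part of the model's tooling) can snap a requested size onto the supported grid before sending it to the model.

def snap_resolution(width: int, height: int, step: int = 32,
                    lo: int = 512, hi: int = 2048) -> tuple[int, int]:
    """Clamp a requested size to 512-2048 px and round down to a multiple of 32."""
    def fix(value: int) -> int:
        value = max(lo, min(hi, value))
        return (value // step) * step
    return fix(width), fix(height)

# A 16:9 request of 1500x844 becomes 1472x832.
print(snap_resolution(1500, 844))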

Step 3: Generate with AI

The 9B autoregressive module first generates semantic tokens (~256, expanding to 1K-4K), then the 7B diffusion decoder renders high-resolution details. Use a guidance_scale between 1.5 and 7.5 depending on prompt specificity.
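
The snippet below is a minimal local-inference sketch using Hugging Face diffusers. The use of DiffusionPipeline, the parameter names, and the loading details are assumptions based on common diffusers conventions; check the model card at zai-org/GLM-Image for the exact code.

import torch
from diffusers import DiffusionPipeline  # assumes GLM-Image exposes a diffusers pipeline

# Model ID from the Hugging Face page; dtype and device placement are typical defaults.
pipe = DiffusionPipeline.from_pretrained("zai-org/GLM-Image", torch_dtype=torch.bfloat16)
pipe.to("cuda")

image = pipe(
    prompt=prompt,        # the prompt string written in Step 1
    width=1280,
    height=1280,
    guidance_scale=3.5,   # pick a value between 1.5 and 7.5 based on prompt specificity
).images[0]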

Step 4: Download & Iterate

Export your generated images in high resolution. For image-to-image tasks, upload reference images for editing, style transfer, or identity-preserving modifications. Iterate quickly with the same prompt recipe.
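
Saving the result and running an edit pass could look like the hedged sketch below; the image argument and editing entry point are assumptions about the image-to-image interface, so verify them against the model card.

from diffusers.utils import load_image

image.save("poster_v1.png")              # export the text-to-image result from Step 3

reference = load_image("poster_v1.png")  # local path or URL of the image to edit
edited = pipe(
    prompt="Replace the background with a clean studio gradient, keep the headline text unchanged",
    image=reference,                     # assumed image-to-image input argument
    guidance_scale=3.0,
).images[0]
edited.save("poster_v2.png")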

Technical Specifications

Industrial-grade capabilities backed by benchmark-leading performance. Built for commercial applications requiring reliable text rendering and semantic precision.

9B + 7B Hybrid Architecture

Autoregressive generator (9B params) initialized from GLM-4-9B with visual token vocabulary. Diffusion decoder (7B params) uses single-stream DiT for latent-space image decoding with Glyph Encoder.

91.16% Text Accuracy

CVTG-2K benchmark word accuracy of 0.9116, NED score of 0.9557. LongText-Bench: 95.24% English, 97.88% Chinese. #1 among open-source models for text rendering.

Flexible Resolution

Generate images from 512px to 2048px (width and height must be multiples of 32). Supports 1:1, 3:4, 4:3, 16:9, and custom aspect ratios for any use case.

Text-to-Image & Image-to-Image

Unified model supports generation from text prompts, image editing, style transfer, background replacement, multi-subject fusion, and identity-preserving generation.

GRPO Reinforcement Learning

Post-training uses GRPO algorithm with modular feedback: OCR reward for text accuracy, aesthetic reward for visual quality, and AIGC detector as adversarial reward.

Open Source (MIT License)

Available on Hugging Face (zai-org/GLM-Image) and ModelScope. Full commercial use permitted. Self-host with transformers + diffusers or use the ZhipuAI cloud API.
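
For the cloud API route, a hedged sketch using the zhipuai Python SDK's image-generation call is shown below; the "glm-image" model identifier and the response fields are assumptions, so confirm the released names in the official API documentation.

import os
from zhipuai import ZhipuAI  # pip install zhipuai

client = ZhipuAI(api_key=os.environ["ZHIPUAI_API_KEY"])

# The model name here is an assumption; substitute the identifier published in the API docs.
response = client.images.generations(
    model="glm-image",
    prompt='Conference poster, headline "AI Summit 2025", clean grid layout, blue and white palette',
)
print(response.data[0].url)  # assumed response shape for image-generation results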

Benchmark Performance

GLM-Image leads open-source models in text rendering accuracy while matching mainstream latent diffusion methods in general image quality.

91.16% CVTG-2K benchmark word accuracy for complex visual text generation. #1 among all open-source models.

97.88% LongText-Bench Chinese score. Evaluates long and multi-line text rendering across 8 text-intensive scenarios.

95.24% LongText-Bench English score. Clear, complete strokes with minimal character errors for professional materials.

Use Cases

From commercial posters to technical diagrams, GLM-Image excels in scenarios requiring accurate text and semantic understanding.

Marketing Teams · Commercial Posters & Ads

Generate holiday posters, promotional banners, and sale advertisements with precise brand text and product information. Complete compositions with clear visual hierarchy in seconds.

Business Professionals · Presentations & Slides

Create presentation visuals with titles, data representations, and concept diagrams. Clear, readable text with professional layouts for quarterly reports and product launches.

Educators · Infographics & Data Viz

Generate educational illustrations, knowledge charts, and process diagrams. Accurately express professional knowledge and help readers understand abstract concepts.

Content Creators · Social Media Graphics

Quickly generate social media images, video thumbnails, and cover art. Supports multiple styles and aspect ratios for different platform requirements.

Designers · Image Editing & Style Transfer

Perform style conversion, background replacement, and portrait enhancement on existing photos. Maintains subject consistency for professional-grade results.

Developers · API Integration & Self-Hosting

Integrate via the ZhipuAI API or self-host with Hugging Face transformers and diffusers. The MIT license allows full commercial use with flexible deployment options.

Stay Updated

Get updates on new features, benchmark results, and deployment optimizations. No spam — just technical updates that matter.

Frequently Asked Questions

Technical details, deployment options, and usage guidelines for GLM-Image.

Need more help? Check the Hugging Face model card or contact support

Start Generating with GLM-Image

Experience industry-leading text rendering and knowledge-intensive image generation. Open-source, MIT licensed, and ready for production.

GLM-Image: Open-Source AI Image Generator with Accurate Text