🔥 News

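2025.06.26 🌟 We are proud to introduce Kwai Keye-VL, a cutting-edge multimodal large language model built by the Kwai Keye team at Kuaishou. As a core AI product in Kuaishou's technology ecosystem, Keye excels at video understanding, visual perception, and reasoning tasks, setting new performance benchmarks. Our team is working to push the boundaries of what is possible, so stay tuned for further updates!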

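Quickstart

Below, we give simple examples of how to use Kwai Keye-VL with 🤗 Transformers.

The code for Kwai Keye-VL has been integrated into the latest Hugging Face transformers library. We recommend building it from source with the following command: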

pip install git+https://github.com/huggingface/transformers accelerate

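We provide a toolkit that lets you handle various kinds of visual input as conveniently as calling an API. It supports base64-encoded data, URLs, and interleaved images and videos. You can install it with the following command: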

# It's highly recommended to use the `[decord]` feature for faster video loading.
pip install "keye-vl-utils[decord]==1.0.0"

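If you are not on Linux, you may not be able to install decord from PyPI. In that case, run pip install keye-vl-utils instead; the toolkit will then fall back to torchvision for video processing. You can still install decord from source to have it used when loading videos.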
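To see which video backend is likely to be available on your machine, you can check whether decord is importable. This is only a minimal sketch: `pick_video_backend` is a hypothetical helper, not part of keye_vl_utils, and it assumes nothing beyond the fallback order described above (decord first, then torchvision).

```python
import importlib.util


def pick_video_backend(preferred=("decord", "torchvision")):
    """Return the first module name in `preferred` that can be imported, else None."""
    for name in preferred:
        # find_spec returns None when a top-level module is not installed.
        if importlib.util.find_spec(name) is not None:
            return name
    return None


print("video backend candidate:", pick_video_backend())
```

If this prints `torchvision` or `None`, video loading may be slower or unavailable until decord or torchvision is installed.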

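Using 🤗 Transformers to Chat

The following code snippet shows how to use the chat model with transformers and keye_vl_utils.

Following Qwen3, we also provide a soft-switch mechanism that lets users control the model's behavior dynamically. Append /think or /no_think to the user prompt, or add neither, to switch the model's thinking mode.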

from transformers import AutoModel, AutoTokenizer, AutoProcessor
from keye_vl_utils import process_vision_info

# default: Load the model on the available device(s)
model_path = "Kwai-Keye/Keye-VL-8B-Preview"
model = AutoModel.from_pretrained(
    model_path,
    torch_dtype="auto",
    device_map="auto",
    trust_remote_code=True,
)

# We recommend enabling flash_attention_2 for better acceleration and memory saving,
# especially in multi-image and video scenarios.
# (This variant needs `import torch`; KeyeForConditionalGeneration is not imported above.)
# model = KeyeForConditionalGeneration.from_pretrained(
#     "Kwai-Keye/Keye-VL-8B-Preview",
#     torch_dtype=torch.bfloat16,
#     attn_implementation="flash_attention_2",
#     device_map="auto",
# )

# default processor
processor = AutoProcessor.from_pretrained(model_path, trust_remote_code=True)

# The default range for the number of visual tokens per image in the model is 4-16384.
# You can set min_pixels and max_pixels according to your needs, such as a token range
# of 256-1280, to balance performance and cost.
# min_pixels = 256*28*28
# max_pixels = 1280*28*28
# processor = AutoProcessor.from_pretrained(model_path, min_pixels=min_pixels, max_pixels=max_pixels, trust_remote_code=True)

# Use ONE of the three `messages` definitions below; if all are kept, the last one wins.

# Non-Thinking Mode
messages = [
    {
        "role": "user",
        "content": [
            {
                "type": "image",
                "image": "https://s1-11508.kwimgs.com/kos/nlav11508/mllm_all/ziran_jiafeimao_11.jpg",
            },
            {"type": "text", "text": "Describe this image./no_think"},
        ],
    }
]

# Auto-Thinking Mode
messages = [
    {
        "role": "user",
        "content": [
            {
                "type": "image",
                "image": "https://s1-11508.kwimgs.com/kos/nlav11508/mllm_all/ziran_jiafeimao_11.jpg",
            },
            {"type": "text", "text": "Describe this image."},
        ],
    }
]

# Thinking Mode
messages = [
    {
        "role": "user",
        "content": [
            {
                "type": "image",
                "image": "https://s1-11508.kwimgs.com/kos/nlav11508/mllm_all/ziran_jiafeimao_11.jpg",
            },
            {"type": "text", "text": "Describe this image./think"},
        ],
    }
]

# Preparation for inference
text = processor.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True
)
image_inputs, video_inputs = process_vision_info(messages)
inputs = processor(
    text=[text],
    images=image_inputs,
    videos=video_inputs,
    padding=True,
    return_tensors="pt",
)
inputs = inputs.to("cuda")

# Inference: Generation of the output
generated_ids = model.generate(**inputs, max_new_tokens=1024)
# Drop the prompt tokens so that only the newly generated tokens are decoded.
generated_ids_trimmed = [
    out_ids[len(in_ids) :] for in_ids, out_ids in zip(inputs.input_ids, generated_ids)
]
output_text = processor.batch_decode(
    generated_ids_trimmed, skip_special_tokens=True, clean_up_tokenization_spaces=False
)
print(output_text)
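Since the three modes differ only in the suffix appended to the prompt, the message can be built by a small helper. The sketch below is illustrative: `build_image_message` and its `mode` argument are hypothetical names introduced here, not part of transformers or keye_vl_utils.

```python
def build_image_message(image, prompt, mode="auto"):
    """Build a single-turn user message with the soft-switch suffix for `mode`.

    mode: "auto" (no suffix), "think" (/think) or "no_think" (/no_think).
    """
    suffixes = {"auto": "", "think": "/think", "no_think": "/no_think"}
    if mode not in suffixes:
        raise ValueError(f"unknown mode: {mode!r}")
    return [
        {
            "role": "user",
            "content": [
                {"type": "image", "image": image},
                {"type": "text", "text": prompt + suffixes[mode]},
            ],
        }
    ]


messages = build_image_message(
    "https://s1-11508.kwimgs.com/kos/nlav11508/mllm_all/ziran_jiafeimao_11.jpg",
    "Describe this image.",
    mode="no_think",
)
print(messages[0]["content"][1]["text"])  # Describe this image./no_think
```

The returned list can be passed to `processor.apply_chat_template` and `process_vision_info` exactly like the hand-written messages above.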

Video inference

# Use ONE of the three `messages` definitions below; if all are kept, the last one wins.

# Messages containing a list of images as a video and a text query
messages = [
    {
        "role": "user",
        "content": [
            {
                "type": "video",
                "video": [
                    "file:///path/to/frame1.jpg",
                    "file:///path/to/frame2.jpg",
                    "file:///path/to/frame3.jpg",
                    "file:///path/to/frame4.jpg",
                ],
            },
            {"type": "text", "text": "Describe this video."},
        ],
    }
]

# Messages containing a local video path and a text query
messages = [
    {
        "role": "user",
        "content": [
            {
                "type": "video",
                "video": "file:///path/to/video1.mp4",
                "max_pixels": 360 * 420,
                "fps": 1.0,
            },
            {"type": "text", "text": "Describe this video."},
        ],
    }
]

# Messages containing a video url and a text query
messages = [
    {
        "role": "user",
        "content": [
            {
                "type": "video",
                "video": "http://s2-11508.kwimgs.com/kos/nlav11508/MLLM/videos_caption/98312843263.mp4",
            },
            {"type": "text", "text": "Describe this video."},
        ],
    }
]

# In Keye-VL, frame rate information is also input into the model to align with absolute time.
# Preparation for inference
text = processor.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True
)
image_inputs, video_inputs, video_kwargs = process_vision_info(messages, return_video_kwargs=True)
inputs = processor(
    text=[text],
    images=image_inputs,
    videos=video_inputs,
    padding=True,
    return_tensors="pt",
    **video_kwargs,
)
inputs = inputs.to("cuda")

# Inference
generated_ids = model.generate(**inputs, max_new_tokens=128)
generated_ids_trimmed = [
    out_ids[len(in_ids) :] for in_ids, out_ids in zip(inputs.input_ids, generated_ids)
]
output_text = processor.batch_decode(
    generated_ids_trimmed, skip_special_tokens=True, clean_up_tokenization_spaces=False
)
print(output_text)
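The `fps` field asks for frames to be sampled at roughly that rate. As a rough illustration of the arithmetic only, the sketch below picks evenly spaced frame indices for a target rate. `sample_frame_indices` is a hypothetical helper; it is not the sampling code used by keye_vl_utils, which may clamp or round the frame count differently.

```python
def sample_frame_indices(total_frames, native_fps, target_fps=1.0):
    """Return evenly spaced frame indices approximating `target_fps`."""
    if total_frames <= 0 or native_fps <= 0 or target_fps <= 0:
        raise ValueError("total_frames, native_fps and target_fps must be positive")
    duration_s = total_frames / native_fps
    # Keep at least one frame, even for clips shorter than 1 / target_fps seconds.
    n_samples = max(1, int(duration_s * target_fps))
    step = total_frames / n_samples
    return [min(total_frames - 1, int(i * step)) for i in range(n_samples)]


# A 10-second clip recorded at 30 fps, sampled at 1 fps, keeps every 30th frame.
print(sample_frame_indices(300, 30.0, 1.0))
```

The selected frames could then be saved as images and passed as a frame list, as in the first `messages` example above.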

Batch inference

# Sample messages for batch inference
messages1 = [
    {
        "role": "user",
        "content": [
            {"type": "image", "image": "file:///path/to/image1.jpg"},
            {"type": "image", "image": "file:///path/to/image2.jpg"},
            {"type": "text", "text": "What are the common elements in these pictures?"},
        ],
    }
]
messages2 = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "Who are you?"},
]
# Combine messages for batch processing
messages = [messages1, messages2]

# Preparation for batch inference
texts = [
    processor.apply_chat_template(msg, tokenize=False, add_generation_prompt=True)
    for msg in messages
]
image_inputs, video_inputs = process_vision_info(messages)
inputs = processor(
    text=texts,
    images=image_inputs,
    videos=video_inputs,
    padding=True,
    return_tensors="pt",
)
inputs = inputs.to("cuda")

# Batch Inference
generated_ids = model.generate(**inputs, max_new_tokens=128)
generated_ids_trimmed = [
    out_ids[len(in_ids) :] for in_ids, out_ids in zip(inputs.input_ids, generated_ids)
]
output_texts = processor.batch_decode(
    generated_ids_trimmed, skip_special_tokens=True, clean_up_tokenization_spaces=False
)
print(output_texts)
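The trimming step in the snippets above relies on `generate` returning each prompt followed by the newly generated tokens, with all prompts padded to the same length. The sketch below reproduces that slicing on plain Python lists; the token ids are made up for illustration, and `trim_prompt_tokens` is a hypothetical name.

```python
def trim_prompt_tokens(input_ids, generated_ids):
    """Drop the leading prompt tokens from each generated sequence."""
    return [out[len(inp):] for inp, out in zip(input_ids, generated_ids)]


# Two prompts padded to the same length (0 stands in for a pad token id).
input_ids = [[101, 7, 8, 9], [0, 0, 101, 5]]
# Each output row is its prompt followed by two new tokens.
generated_ids = [[101, 7, 8, 9, 42, 43], [0, 0, 101, 5, 77, 2]]

print(trim_prompt_tokens(input_ids, generated_ids))  # [[42, 43], [77, 2]]
```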

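Image resolution for performance boost

The model supports a wide range of input resolutions. By default it processes inputs at their native resolution; higher resolutions can improve performance at the cost of more computation. You can set the minimum and maximum number of pixels (for example, a token count range of 256-1280) to find a configuration that balances speed and memory usage.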

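min_pixels = 256 * 28 * 28
max_pixels = 1280 * 28 * 28
# trust_remote_code=True is passed here for consistency with the processor loaded earlier.
processor = AutoProcessor.from_pretrained(
    "Kwai-Keye/Keye-VL-8B-Preview",
    min_pixels=min_pixels,
    max_pixels=max_pixels,
    trust_remote_code=True,
)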
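The `256 * 28 * 28` convention implies that one visual token corresponds to about 28 x 28 = 784 pixels. Under that reading, which is inferred from the numbers above rather than stated by the library, the pixel budget for a token range can be computed as follows. `pixel_budget` and `approx_tokens` are hypothetical helpers.

```python
PIXELS_PER_TOKEN = 28 * 28  # 784, inferred from min_pixels = 256 * 28 * 28


def pixel_budget(min_tokens, max_tokens):
    """Convert a visual-token range into (min_pixels, max_pixels)."""
    return min_tokens * PIXELS_PER_TOKEN, max_tokens * PIXELS_PER_TOKEN


def approx_tokens(height, width):
    """Approximate visual-token count for an image of the given size."""
    return (height * width) // PIXELS_PER_TOKEN


print(pixel_budget(256, 1280))  # (200704, 1003520)
print(approx_tokens(280, 420))  # 150
```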

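In addition, we provide two ways to control the size of the images fed to the model at a finer granularity:

Define min_pixels and max_pixels: images are resized, keeping their aspect ratio, so that the total pixel count falls within the range from min_pixels to max_pixels.
Specify exact dimensions: set resized_height and resized_width directly. These values are rounded to the nearest multiple of 28.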

# resized_height and resized_width
messages = [
    {
        "role": "user",
        "content": [
            {
                "type": "image",
                "image": "file:///path/to/your/image.jpg",
                "resized_height": 280,
                "resized_width": 420,
            },
            {"type": "text", "text": "Describe this image."},
        ],
    }
]
# min_pixels and max_pixels
messages = [
    {
        "role": "user",
        "content": [
            {
                "type": "image",
                "image": "file:///path/to/your/image.jpg",
                "min_pixels": 50176,
                "max_pixels": 50176,
            },
            {"type": "text", "text": "Describe this image."},
        ],
    }
]
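To make the resizing rule concrete, here is a minimal sketch of one way to fit an image into a pixel range while keeping its aspect ratio approximately and snapping both sides to multiples of 28. It is an assumption-laden illustration: `fit_to_pixel_range` is a hypothetical function, and the actual resizing logic in keye_vl_utils may differ in its rounding details.

```python
import math


def fit_to_pixel_range(height, width, min_pixels, max_pixels, factor=28):
    """Return (h, w), both multiples of `factor`, with min_pixels <= h * w <= max_pixels."""
    # Start by snapping each side to the nearest multiple of `factor`.
    h = max(factor, round(height / factor) * factor)
    w = max(factor, round(width / factor) * factor)
    if h * w > max_pixels:
        # Too large: shrink both sides by the same scale, rounding down.
        scale = math.sqrt((height * width) / max_pixels)
        h = max(factor, math.floor(height / scale / factor) * factor)
        w = max(factor, math.floor(width / scale / factor) * factor)
    elif h * w < min_pixels:
        # Too small: enlarge both sides by the same scale, rounding up.
        scale = math.sqrt(min_pixels / (height * width))
        h = math.ceil(height * scale / factor) * factor
        w = math.ceil(width * scale / factor) * factor
    return h, w


print(fit_to_pixel_range(4000, 3000, 256 * 28 * 28, 1280 * 28 * 28))
```

Rounding down when shrinking and up when enlarging keeps the result inside the range for typical inputs; very narrow ranges or extreme aspect ratios may need extra handling.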

👀 Architecture and Training Strategy

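The Kwai Keye-VL architecture is based on the Qwen3-8B language model and incorporates a vision encoder initialized from the open-source SigLIP. It supports native dynamic resolution: each image is divided into a sequence of 14x14 patches, which preserves its original aspect ratio, and a simple MLP layer then maps and merges the visual tokens. The model uses 3D rotary position embedding (RoPE) to process text, image, and video information in a unified way, and establishes a one-to-one correspondence between position encoding and absolute time so that temporal changes in video are perceived precisely.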
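As a worked example of the patch arithmetic, the sketch below counts the 14x14 patches for an image and relates them to visual tokens. The relation of one token per 28 x 28 pixels is inferred from the `min_pixels = 256 * 28 * 28` convention used earlier in this document; treating a token as a 2x2 group of patches is an assumption, not something the text states. `patch_grid` is a hypothetical helper.

```python
PATCH_SIZE = 14   # patch side length in pixels, from the architecture description
TOKEN_SIDE = 28   # ASSUMPTION: one visual token covers 28 x 28 pixels


def patch_grid(height, width):
    """Return (rows, cols) of 14x14 patches for an image whose sides are multiples of 14."""
    if height % PATCH_SIZE or width % PATCH_SIZE:
        raise ValueError("height and width must be multiples of 14")
    return height // PATCH_SIZE, width // PATCH_SIZE


rows, cols = patch_grid(280, 420)
num_patches = rows * cols
patches_per_token = (TOKEN_SIDE // PATCH_SIZE) ** 2
print(rows, cols, num_patches, num_patches // patches_per_token)  # 20 30 600 150
```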

🌟 Pre-Train

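The Kwai Keye pre-training pipeline, which follows a four-stage progressive strategy: image-text matching, ViT-LLM alignment, multi-task pre-training, and annealing with model merging.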

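Pre-training data: massive, high-quality, diverse

Diversity: covers data types such as image-text pairs, video, and pure text, with tasks including fine-grained description, OCR, question answering, and object localization.
High quality: data is filtered using CLIP scores and VLM discriminators, and MinHash deduplication is used to prevent data leakage.
Self-built datasets: high-quality internal datasets were built specifically, especially for detailed captions and Chinese OCR, to compensate for gaps in open-source data.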
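For readers unfamiliar with MinHash, the sketch below shows the general idea of near-duplicate filtering on captions using only the standard library. It is a generic illustration, not the team's pipeline: the shingle size, number of hash functions, and the 0.8 threshold are arbitrary assumptions.

```python
import hashlib


def shingles(text, k=3):
    """Split text into a set of lowercase word k-grams."""
    words = text.lower().split()
    if len(words) <= k:
        return {" ".join(words)}
    return {" ".join(words[i:i + k]) for i in range(len(words) - k + 1)}


def minhash_signature(items, num_hashes=64):
    """For each seeded hash function, keep the minimum hash value over all items."""
    signature = []
    for seed in range(num_hashes):
        prefix = f"{seed}:".encode("utf-8")
        signature.append(min(
            int.from_bytes(hashlib.sha1(prefix + item.encode("utf-8")).digest()[:8], "big")
            for item in items
        ))
    return signature


def estimated_jaccard(sig_a, sig_b):
    """The fraction of matching positions estimates the Jaccard similarity."""
    return sum(a == b for a, b in zip(sig_a, sig_b)) / len(sig_a)


def deduplicate(captions, threshold=0.8):
    """Keep a caption only if it is not a near-duplicate of one already kept."""
    kept, kept_sigs = [], []
    for caption in captions:
        sig = minhash_signature(shingles(caption))
        if all(estimated_jaccard(sig, other) < threshold for other in kept_sigs):
            kept.append(caption)
            kept_sigs.append(sig)
    return kept


print(deduplicate([
    "a cat sits on the mat",
    "A cat sits on the mat",
    "a dog runs in the park",
]))
```

This version compares each caption against every kept signature, which is quadratic; large-scale pipelines usually add locality-sensitive hashing on top of the signatures to avoid that.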

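Training pipeline: four-stage progressive optimization
Kwai Keye-VL adopts a four-stage progressive training strategy:

Stage 0 (visual pre-training): continue pre-training the vision encoder so that it adapts to the internal data distribution and supports dynamic resolution.
Stage 1 (cross-modal alignment): freeze the backbone model and train only the MLP to establish robust image-text alignment at low cost.
Stage 2 (multi-task pre-training): unlock all parameters to comprehensively improve the model's visual understanding.
Stage 3 (annealing): fine-tune on high-quality data to further improve the model's fine-grained understanding.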
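The freeze and unfreeze schedule described in these stages can be summarized as a small lookup table. This is a sketch under stated assumptions: the module names are hypothetical, and the text does not say which parameters are trainable in Stage 3, so that entry is a guess.

```python
MODULES = ("vision_encoder", "mlp_projector", "language_model")  # hypothetical names

STAGE_TRAINABLE = {
    0: {"vision_encoder"},   # visual pre-training
    1: {"mlp_projector"},    # backbone frozen, only the MLP is trained
    2: set(MODULES),         # all parameters unlocked
    3: set(MODULES),         # ASSUMPTION: not specified in the text
}


def frozen_modules(stage):
    """Return the modules that stay frozen in the given stage."""
    return set(MODULES) - STAGE_TRAINABLE[stage]


for stage in sorted(STAGE_TRAINABLE):
    print(stage, sorted(STAGE_TRAINABLE[stage]), sorted(frozen_modules(stage)))
```

In a PyTorch training script, a table like this would typically drive calls that set `requires_grad` on each module's parameters.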

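Finally, Kwai Keye-VL explores a homogeneous-heterogeneous merging technique: the parameters of models annealed with different data mixtures are averaged, which preserves their varied capabilities while reducing model bias and thereby improves robustness.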
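Parameter averaging itself is simple to express. The sketch below averages checkpoints represented as plain dictionaries of float lists; real checkpoints would hold tensors, and the uniform weighting is an assumption, since the text does not say how the annealed models are weighted. `average_checkpoints` is a hypothetical helper.

```python
def average_checkpoints(state_dicts, weights=None):
    """Return the element-wise weighted average of checkpoints with identical keys."""
    if not state_dicts:
        raise ValueError("need at least one checkpoint")
    keys = set(state_dicts[0])
    if any(set(sd) != keys for sd in state_dicts[1:]):
        raise ValueError("all checkpoints must have the same parameter names")
    if weights is None:
        # ASSUMPTION: uniform weights when none are given.
        weights = [1.0 / len(state_dicts)] * len(state_dicts)
    merged = {}
    for key in state_dicts[0]:
        # zip(*) groups the i-th value of this parameter across all checkpoints.
        columns = zip(*(sd[key] for sd in state_dicts))
        merged[key] = [sum(w * v for w, v in zip(weights, col)) for col in columns]
    return merged


ckpt_a = {"layer.weight": [1.0, 2.0]}  # e.g. annealed on data mixture A
ckpt_b = {"layer.weight": [3.0, 6.0]}  # e.g. annealed on data mixture B
print(average_checkpoints([ckpt_a, ckpt_b]))  # {'layer.weight': [2.0, 4.0]}
```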

📈 Experimental Results

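Keye-VL-8B shows strong perception capabilities, with performance competitive with leading models.
Keye-VL-8B is particularly strong at video understanding. Across a range of public video benchmarks, including Video-MME, Video-MMMU, TempCompass, LongVideoBench, and MMVU, it clearly outperforms other top models of comparable size.
On evaluation sets that require complex logical reasoning and mathematical problem solving, such as WeMath, MathVerse, and LogicVista, Kwai Keye-VL-8B performs strongly, highlighting its capacity for logical deduction and for solving complex quantitative problems.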