Running Gemma 4 Audio Transcription with MLX: Summary

Source: Simon Willison’s Weblog Date: 2026-04-12 Original link: https://simonwillison.net/2026/Apr/12/mlx-audio/

Overview

Simon Willison shares a way to transcribe audio entirely locally on macOS. The approach runs Google's Gemma 4 E2B model (10.28 GB) through MLX, a framework optimized for Apple Silicon, and calls no cloud API.

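Key points

Runs as a single command: uv run fetches the dependencies on the fly, so Gemma 4 can be invoked for transcription without a manual install step
Fully local inference: speech recognition runs on the Mac itself, accelerated by MLX on Apple Silicon, and needs no network connection once the model has been downloaded
Moderate model size: the Gemma 4 E2B model is 10.28 GB, small enough to run on a modern Mac
Accuracy has room to improve: on a 14-second test clip the model produced a transcription, but some words were misrecognized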
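
One way to put a number on "some words were misrecognized" is word error rate (WER), a standard speech recognition metric: the word-level edit distance between a reference transcript and the model output, divided by the number of reference words. The post does not report a WER, so the sketch below is only an illustration; the function name is hypothetical and any strings passed to it would be your own reference and output text.

```python
def word_error_rate(reference, hypothesis):
    """Word-level edit distance divided by the number of reference words."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words seen so far
    # (excluding the current one) and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]  # deleting all i reference words matches an empty hypothesis
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1
            insertion = curr[j - 1] + 1
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1] / len(ref)
```

This version only lowercases and splits on whitespace. Punctuation and number formatting are not normalized, so differences there count as errors.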

Tech stack

Model: Google Gemma 4 E2B (google/gemma-4-e2b-it)
Inference framework: MLX + mlx-vlm
Package management: uv (the Python package manager from Astral)
Runtime environment: macOS + Apple Silicon

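Use cases

Developers who need offline speech-to-text
Users who care about data privacy and do not want audio sent to the cloud
Enthusiasts who want a quick look at Gemma 4's multimodal capabilities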

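Limitations

The recipe targets macOS only, because it depends on the MLX framework
Transcription accuracy may trail dedicated ASR models such as Whisper
The model is large (10.28 GB) and has to be downloaded on first run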
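
The first and last limitations can be checked before starting the download. The following is a minimal sketch, not part of the original post: the function names are hypothetical, the 10.28 GB figure is read as decimal gigabytes, and checking free space under the home directory assumes the model cache lives on that volume.

```python
import platform
import shutil
from pathlib import Path

# 10.28 GB as quoted in the post; treating it as decimal gigabytes is an assumption.
MODEL_SIZE_BYTES = int(10.28 * 1000**3)


def preflight_problems(system, machine, free_bytes, needed_bytes=MODEL_SIZE_BYTES):
    """Return a list of reasons the recipe is unlikely to work (empty if none)."""
    problems = []
    if system != "Darwin":  # platform.system() reports "Darwin" on macOS
        problems.append("not macOS")
    if machine != "arm64":  # platform.machine() reports "arm64" on Apple Silicon
        problems.append("not Apple Silicon")
    if free_bytes < needed_bytes:
        problems.append("not enough free disk space for the model download")
    return problems


def check_this_machine(cache_dir=None):
    """Run the checks against the current machine.

    cache_dir is hypothetical: pass the directory on the volume where the
    model will be stored. Defaults to the home directory.
    """
    target = Path(cache_dir) if cache_dir else Path.home()
    free = shutil.disk_usage(target).free
    return preflight_problems(platform.system(), platform.machine(), free)
```

Calling check_this_machine() returns an empty list when all three checks pass. A Python process running under Rosetta reports x86_64 and would be flagged as not Apple Silicon.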

Gemma 4 Audio with MLX

Source: Simon Willison’s Weblog Date: 2026-04-12

Thanks to a tip from Rahim Nathwani, here’s a uv run recipe for transcribing an audio file on macOS using the 10.28 GB Gemma 4 E2B model with MLX and mlx-vlm:

uv run --python 3.13 --with mlx_vlm --with torchvision --with gradio \
  mlx_vlm.generate \
  --model google/gemma-4-e2b-it \
  --audio file.wav \
  --prompt "Transcribe this audio" \
  --max-tokens 500 \
  --temperature 1.0

The author tested this on a 14-second WAV file. The model produced a transcription, though some words were misinterpreted.
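
To run the same recipe over several files from a script, the command can be wrapped in Python. This is a rough sketch rather than anything from the post: build_command and transcribe are hypothetical names, the flags are copied verbatim from the command above, and the returned text is whatever the tool prints to stdout, which may include more than the transcript itself.

```python
import subprocess


def build_command(audio_path, prompt="Transcribe this audio",
                  max_tokens=500, temperature=1.0):
    """Return the uv run argument list from the recipe for one audio file."""
    return [
        "uv", "run", "--python", "3.13",
        "--with", "mlx_vlm", "--with", "torchvision", "--with", "gradio",
        "mlx_vlm.generate",
        "--model", "google/gemma-4-e2b-it",
        "--audio", str(audio_path),
        "--prompt", prompt,
        "--max-tokens", str(max_tokens),
        "--temperature", str(temperature),
    ]


def transcribe(audio_path):
    """Run the recipe and return the raw stdout of the command.

    Requires uv on PATH; raises CalledProcessError if the command fails.
    """
    result = subprocess.run(
        build_command(audio_path),
        capture_output=True,
        text=True,
        check=True,
    )
    return result.stdout
```

Passing the arguments as a list avoids shell quoting problems with file names or prompts that contain spaces.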

Tags: python, ai, generative-ai, llms, uv, mlx, gemma, speech-to-text
