Junyang Lin (林俊旸) publishes his first long post since leaving: from "thinking longer" to "thinking in order to act"

Last edited March 27

🔗 Original link: https://mp.weixin.qq.com/s/omLaxHMx...

林俊旸 · 赛博禅心 · March 26, 2026, 20:42 · Beijing

NOTE

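After posting "me stepping down. bye my beloved qwen" in the early hours of March 4, Junyang Lin (林俊旸) went silent on social media for three weeks.

Today he published his first long post since leaving, on X (Twitter).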


https://x.com/JustinLin610/status/2037116325210829168

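In the post he does not discuss why he left, and he does not respond to rumors about where he is going next. The whole piece does one thing: it sets out his judgment about the direction of AI's next phase.

From "make the model think longer" to "make the model think while it acts."

The full original text follows, paired paragraph by paragraph with the Chinese rendering.

Opening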

The last two years reshaped how we evaluate models and what we expect from them. OpenAI's o1 showed that "thinking" could be a first-class capability, something you train for and expose to users. DeepSeek-R1 proved that reasoning-style post-training could be reproduced and scaled outside the original labs. OpenAI described o1 as a model trained with reinforcement learning to "think before it answers." DeepSeek positioned R1 as an open reasoning model competitive with o1.​

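Over the past two years, the industry's standards for judging models, and its expectations of them, have changed. OpenAI's o1 showed everyone that "thinking" can itself be a trained capability. DeepSeek-R1 followed closely, proving that reasoning-style post-training can be reproduced and scaled outside the original labs. OpenAI defined o1 as an RL model that "thinks before it answers," while DeepSeek positioned R1 as an open-source reasoning model that can go head to head with o1.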

That phase mattered. But the first half of 2025 was mostly about reasoning thinking: how to make models spend more inference-time compute, how to train them with stronger rewards, how to expose or control that extra reasoning effort. The question now is what comes next. I believe the answer is agentic thinking: thinking in order to act, while interacting with an environment, and continuously updating plans based on feedback from the world.​

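That phase mattered, but the first half of 2025 mostly circled one question: how to make the model think a little longer at inference time. Now it is time to ask what comes next. My judgment is agentic thinking (智能体式思考): thinking in order to act, thinking while interacting with an environment, and continually revising the plan based on real feedback.

1. What o1 and R1 really taught us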

The first wave of reasoning models taught us that if we want to scale reinforcement learning in language models, we need feedback signals that are deterministic, stable, and scalable. Math, code, logic, and other verifiable domains became central because rewards in these settings are much stronger than generic preference supervision. They let RL optimize for correctness rather than plausibility. Infrastructure became critical.​

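The first wave of reasoning models taught us one thing: to get reinforcement learning running on language models, the feedback signal has to be deterministic, stable, and scalable. Domains where right and wrong can be verified, such as math, code, and logic, became the main battleground for RL. In these settings the reward signal is far higher in quality than having human annotators vote on which answer is better. It asks whether the answer is correct, not whether it merely looks plausible.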
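To make "correct rather than plausible" concrete, here is a minimal sketch of a deterministic reward for a math-style task. It is an illustration, not the reward used by any lab: the `Answer:` marker, the function names, and the exact-match rule are all assumptions introduced for this example.

```python
from fractions import Fraction


def parse_number(text):
    """Parse an integer, decimal, or a/b fraction; return None if parsing fails."""
    try:
        return Fraction(text.strip())
    except (ValueError, ZeroDivisionError):
        return None


def extract_final_answer(completion):
    """Return the text after the last 'Answer:' marker (a hypothetical output convention)."""
    marker = "Answer:"
    idx = completion.rfind(marker)
    if idx == -1:
        return None
    return completion[idx + len(marker):].strip()


def verifiable_reward(completion, reference):
    """Return 1.0 only when the final answer equals the reference exactly, else 0.0."""
    answer = extract_final_answer(completion)
    if answer is None:
        return 0.0  # no final answer given, so nothing can be verified
    got, want = parse_number(answer), parse_number(reference)
    if got is None or want is None:
        return 0.0
    return 1.0 if got == want else 0.0
```

The same completion always receives the same score, and a fluent but wrong answer scores zero, which is the property the paragraph above contrasts with preference voting.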

Once a model is trained to reason through longer trajectories, RL stops being a lightweight add-on to supervised fine-tuning. It becomes a systems problem. You need rollouts at scale, high-throughput verification, stable policy updates, efficient sampling. The emergence of reasoning models was as much an infra story as a modeling story. OpenAI described o1 as a reasoning line trained with RL, and DeepSeek R1 later reinforced that direction by showing how much dedicated algorithmic and infrastructure work reasoning-based RL demands. The first big transition: from scaling pretraining to scaling post-training for reasoning.​

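Once a model starts training on longer reasoning trajectories, RL is no longer a thin layer added on top of SFT. It becomes a systems engineering problem. You need large-scale rollouts, high-throughput verification, and stable policy updates. The birth of reasoning models is, at bottom, an infrastructure story. The first big transition: from scaling pretraining to scaling post-training for reasoning.

2. The real problem was never "merging thinking and instruct"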

At the beginning of 2025, many of us in Qwen team had an ambitious picture in mind. The ideal system would unify thinking and instruct modes. It would support adjustable reasoning effort, similar in spirit to low / medium / high reasoning settings. Better still, it would automatically infer the appropriate amount of reasoning from the prompt and context, so the model could decide when to answer immediately, when to think longer, and when to spend much more computation on a truly difficult problem.​

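At the start of 2025, we on the Qwen team had a big ambition: build one unified system that merges thinking mode and instruct mode. Users could adjust reasoning effort across three levels: low, medium, and high. Better still, the model would judge for itself how long a problem deserves, answering easy ones directly and spending more compute on hard ones.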
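The idea of adjustable or automatically inferred reasoning effort can be sketched as a mapping from an effort level to a thinking-token budget. Everything here is hypothetical: the budget numbers, the extra `none` level for answer-immediately cases, and the keyword heuristic (a stand-in for what would in practice be a learned difficulty estimate) are assumptions for illustration and do not describe how Qwen implements this.

```python
from dataclasses import dataclass
from typing import Optional

# Hypothetical token budgets per effort level; the numbers are illustrative only.
EFFORT_BUDGETS = {"none": 0, "low": 1024, "medium": 8192, "high": 32768}


@dataclass
class ThinkingPlan:
    effort: str
    max_thinking_tokens: int


def infer_effort(prompt: str) -> str:
    """Toy heuristic standing in for a learned estimate of problem difficulty."""
    text = prompt.lower()
    if any(cue in text for cue in ("prove", "derive", "debug", "optimize")):
        return "high"
    if len(text.split()) > 40:
        return "medium"
    if text.startswith(("rewrite", "label", "extract", "translate")):
        return "none"  # routine instruct-style task: answer immediately
    return "low"


def plan_thinking(prompt: str, effort: Optional[str] = None) -> ThinkingPlan:
    """Use the caller's effort setting if given; otherwise infer one from the prompt."""
    chosen = effort if effort is not None else infer_effort(prompt)
    if chosen not in EFFORT_BUDGETS:
        raise ValueError(f"unknown effort level: {chosen!r}")
    return ThinkingPlan(chosen, EFFORT_BUDGETS[chosen])
```

An explicit `effort` argument models the user-facing toggle, while the `infer_effort` path models the more ambitious goal of letting the system decide how much computation a prompt deserves.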

Conceptually, this was the right direction. Qwen3 was one of the clearest public attempts. It introduced "hybrid thinking modes," supported both thinking and non-thinking behavior in one family, emphasized controllable thinking budgets, and described a four-stage post-training pipeline that explicitly included "thinking mode fusion" after long-CoT cold start and reasoning RL.​

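The direction was right. Qwen3 was one of the clearest public attempts in the industry: it introduced "hybrid thinking modes," supported both thinking and non-thinking states within one model family, and used a four-stage post-training pipeline with a dedicated "thinking mode fusion" stage.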

But merging is much easier to describe than to execute well. The hard part is data. When people talk about merging thinking and instruct, they often think first about model-side compatibility: can one checkpoint support both modes, can one chat template switch between them, can one serving stack expose the right toggles. The deeper issue is that the data distributions and behavioral objectives of the two modes are substantially different.​

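But doing it is much harder than describing it. The hard part is data. When people discuss merging, their first reaction is usually a model-side question: can one checkpoint hold both modes at once. The real trouble lies deeper: the data distributions and behavioral objectives the two modes require are fundamentally different.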

We did not get everything right when trying to balance model merging with improving the quality and diversity of post-training data. During that revision process, we also paid close attention to how users were actually engaging with thinking and instruct modes. A strong instruct model is typically rewarded for directness, brevity, formatting compliance, low latency on repetitive, high-volume enterprise tasks such as rewriting, labeling, templated support, structured extraction, and operational QA. A strong thinking model is rewarded for spending more tokens on difficult problems, maintaining coherent intermediate structure, exploring alternative paths, and preserving enough internal computation to meaningfully improve final correctness.​

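We did not get all of this right. Throughout the process we kept watching how users actually used the two modes. A good instruct model is crisp: short replies, well-behaved formatting, and low latency, suited to the high-volume rewriting, labeling, and templated support that enterprises run. A good thinking model is the opposite: it needs to spend more tokens on hard problems, explore different paths, and preserve enough internal computation to genuinely improve final accuracy.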

These two behavior profiles pull against each other. If the merged data is not carefully curated, the result is usually mediocre in both directions: the "thinking" behavior becomes noisy, bloated, or insufficiently decisive, while the "instruct" behavior becomes less crisp, less reliable, and more expensive than what commercial users actually want.​

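The two behavior profiles pull against each other by nature. If the data is not curated well, both sides turn mediocre: thinking mode becomes verbose, bloated, and indecisive, while instruct mode becomes less crisp, less stable, and more expensive than customers actually need.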

Separation remained attractive in practice. Later in 2025, after the initial hybrid framing of Qwen3, the 2507 line shipped distinct Instruct and Thinking updates, including separate 30B and 235B variants. In commercial deployment, a large number of customers still wanted high-throughput, low-cost, highly steerable instruct behavior for batch operations. For those scenarios, merging wasn't obviously a benefit. Separating the lines allowed teams to focus on solving the data and training problems of each mode more cleanly.​
