arcee-ai/Trinity-Large-Thinking
Arcee AI's reasoning-focused sparse MoE (AfmoeForCausalLM) with structured <think> traces and agentic tool use
Overview
Trinity-Large-Thinking is Arcee AI's reasoning-focused Trinity Large checkpoint: a sparse MoE model designed for long-horizon planning, tool use, and multi-step agent workflows. It uses the AfmoeForCausalLM architecture and emits explicit reasoning traces inside <think>...</think> blocks.
For multi-turn chat and agentic loops, reasoning tokens should be preserved across turns as part of the working state.
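When the server is launched with the reasoning parser (see below), vLLM splits the trace out for you. If you ever consume raw completion text without it, the split can be done manually — a minimal sketch, assuming the model wraps its trace in a single <think>...</think> block:

```python
import re

def split_think(text: str) -> tuple[str, str]:
    """Split a raw completion into (reasoning, answer).

    Assumes at most one <think>...</think> block; returns an empty
    reasoning string when no block is present.
    """
    m = re.search(r"<think>(.*?)</think>", text, flags=re.DOTALL)
    if not m:
        return "", text.strip()
    reasoning = m.group(1).strip()
    answer = (text[:m.start()] + text[m.end():]).strip()
    return reasoning, answer
```

This is only a fallback; the server-side parser described in the launch flags below is the recommended path.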
Prerequisites
- vLLM >= 0.11.1
- Hardware: multi-GPU recommended for production deployments
Install vLLM
uv venv
source .venv/bin/activate
uv pip install -U vllm openai --torch-backend auto
Launch command
vllm serve arcee-ai/Trinity-Large-Thinking \
--dtype bfloat16 \
--reasoning-parser deepseek_r1 \
--enable-auto-tool-choice \
--tool-call-parser qwen3_coder
Why these flags:
- --reasoning-parser deepseek_r1 extracts <think>...</think> into message.reasoning.
- --enable-auto-tool-choice lets the model decide when to call tools.
- --tool-call-parser qwen3_coder converts tool calls into OpenAI-style tool_calls.
- --dtype bfloat16 matches the recommended serving dtype.
Add parallelism flags (--tensor-parallel-size, --data-parallel-size, or
--enable-expert-parallel) for your hardware. Lower --max-model-len if you don't
need the full long-context config.
Validation Request
from openai import OpenAI
client = OpenAI(api_key="EMPTY", base_url="http://localhost:8000/v1")
model = client.models.list().data[0].id
tools = [{
    "type": "function",
    "function": {
        "name": "get_weather",
        "description": "Get the current weather for a location.",
        "parameters": {
            "type": "object",
            "properties": {"location": {"type": "string"}},
            "required": ["location"],
        },
    },
}]

response = client.chat.completions.create(
    model=model,
    messages=[{"role": "user", "content": "What is the weather in Paris right now?"}],
    tools=tools,
    tool_choice="auto",
)
msg = response.choices[0].message
reasoning = getattr(msg, "reasoning", None) or getattr(msg, "reasoning_content", None)
print("reasoning:", reasoning)
print("content:", msg.content)
print("tool_calls:", msg.tool_calls)
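When a tool call comes back, its arguments arrive as a JSON string and must be parsed before dispatch, and the result goes back as a tool-role message keyed to the call's id. A minimal sketch, with get_weather stubbed locally (an assumption — a real agent would query a weather service):

```python
import json

def get_weather(location: str) -> str:
    # Stubbed result for illustration only.
    return f"Sunny, 21°C in {location}"

TOOLS = {"get_weather": get_weather}

def run_tool_call(tool_call) -> dict:
    """Execute one OpenAI-style tool call and wrap the result as a
    tool-role message referencing the call's id."""
    args = json.loads(tool_call.function.arguments)  # arguments are a JSON string
    result = TOOLS[tool_call.function.name](**args)
    return {"role": "tool", "tool_call_id": tool_call.id, "content": result}
```

Append the returned dict to your messages list after the assistant message, as described in the next section.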
Preserving Reasoning Across Turns
Pass reasoning back as reasoning on assistant messages:
# `messages` is the running conversation list sent in the previous request.
assistant_msg = {"role": "assistant", "content": msg.content or ""}
if reasoning:
    assistant_msg["reasoning"] = reasoning
if msg.tool_calls:
    assistant_msg["tool_calls"] = [
        {"id": tc.id, "type": "function",
         "function": {"name": tc.function.name, "arguments": tc.function.arguments}}
        for tc in msg.tool_calls
    ]
messages.append(assistant_msg)
Rules:
- Pass reasoning back as reasoning (even if your client exposes it as reasoning_content).
- Keep content as an empty string (not null) on tool-only turns.
- Append the assistant message before tool-result messages.
- Use /v1/chat/completions for structured reasoning output.
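The rules above can be folded into one helper. A sketch (the function name is illustrative) that builds the assistant turn correctly in all three cases:

```python
def make_assistant_turn(msg, reasoning=None) -> dict:
    """Build an assistant message that follows the rules above:
    content is "" (never None) on tool-only turns, the trace rides
    along under the `reasoning` key, and tool calls keep their
    OpenAI shape."""
    turn = {"role": "assistant", "content": msg.content or ""}
    if reasoning:
        turn["reasoning"] = reasoning
    if getattr(msg, "tool_calls", None):
        turn["tool_calls"] = [
            {"id": tc.id, "type": "function",
             "function": {"name": tc.function.name,
                          "arguments": tc.function.arguments}}
            for tc in msg.tool_calls
        ]
    return turn
```

Call it after every model response and append the result to your messages list before any tool-result messages.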
Troubleshooting
- No reasoning: start the server with --reasoning-parser deepseek_r1 and use /v1/chat/completions.
- Tool calls returned as plain text: enable --enable-auto-tool-choice and --tool-call-parser qwen3_coder.
- Model loses coherence after tool turns: preserve reasoning on each assistant turn; don't set content to null.
- OOM: lower --max-model-len, scale parallelism, or use a local checkpoint path.