Hacker News

Qwen3.5 pretty much requires a long system prompt; otherwise it goes into a weird planning mode where it reasons for minutes about what to do and double- and triple-checks everything it does. Both Gemini's and Claude Opus 4.6's prompts work pretty well, but they are so long that whatever you're using to run the model has to support prompt caching. Asking it to "Say the word 'potato' 100 times, once per line, numbered.", for example, results in the following reasoning, followed by the word "potato" on 100 numbered lines, using the smallest (and therefore dumbest) quant, unsloth/Qwen3.5-35B-A3B-GGUF:UD-IQ2_XXS:

"User is asking me to repeat the word "potato" 100 times, numbered. This is a simple request - I can comply with this request. Let me create a response that includes the word "potato" 100 times, numbered from 1 to 100.

I'll need to be careful about formatting - the user wants it numbered and once per line. I should use minimal formatting as per my instructions."



Good to know, thanks. I just ran Ollama with qwen3.5:27b. Currently it's stuck picking a format:

    Let's write.
    Wait, I'll write the response.
    Wait, I'll check if I should use a table.
    No, text is fine.
    Okay.
    Let's write.
    Wait, I'll write the response.
    Wait, I'll check if I should use a bullet list.
    No, just lines.
    Okay.
    Let's write.
    Wait, I'll write the response.
    Wait, I'll check if I should use a numbered list.
    No, lines are fine.
    Okay.
    Let's write.
    Wait, I'll write the response.
    Wait, I'll check if I should use a code block.
    Yes.
    Okay.
    Let's write.
    Wait, I'll write the response.
    Wait, I'll check if I should use a pre block.
    Code block is better.
... (and so on for the next 100 lines)


Yeah, it tends to get stuck in loops like that a lot with everything set to default. I wonder if they distilled Gemini at some point; I've seen it get stuck in a similar "I will now do [thing]. I am preparing to do [thing]. I will do it." failure mode a couple of times as well.


What quant? I just ran "Repeat the word 'potato' 100 times, numbered" and it worked fine, taking 44 seconds at 24 tokens/second. Command line:

    llama-server ^
      --model Qwen3.5-27B-BF16-00001-of-00002.gguf ^
      --mmproj mmproj-BF16.gguf ^
      --fit on ^
      --host 127.0.0.1 ^
      --port 2080 ^
      --temp 0.8 ^
      --top-p 0.95 ^
      --top-k 20 ^
      --min-p 0.00 ^
      --presence_penalty 1.5 ^
      --repeat_penalty 1.1 ^
      --no-mmap ^
      --no-warmup
The repeat and/or presence penalties seem to be somewhat sensitive with this model, so that might have caused the looping you saw.
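For what it's worth, both penalties act on the logits of tokens already seen in the context, which is why they interact badly with a prompt that is supposed to repeat one word 100 times. A toy sketch of the common formulation (the llama.cpp-style repeat penalty divides positive logits and multiplies negative ones; this is illustrative, not the library's actual code):

```python
def apply_penalties(logits, prev_tokens, repeat_penalty=1.1, presence_penalty=1.5):
    """Penalize tokens that already appeared in the context.

    Illustrative sketch only: `logits` is a {token_id: logit} dict,
    `prev_tokens` is the list of token ids generated so far.
    """
    out = dict(logits)
    for tok in set(prev_tokens):
        if tok not in out:
            continue
        # Repeat penalty: scale the logit away from being picked again.
        if out[tok] > 0:
            out[tok] /= repeat_penalty
        else:
            out[tok] *= repeat_penalty
        # Presence penalty: flat subtraction for any token already present.
        out[tok] -= presence_penalty
    return out
```

With presence_penalty at 1.5, every token you have already emitted (like "potato") takes a large flat hit on each subsequent step, so a prompt built around repetition is fighting the sampler the whole way.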

I don't quite get the low temperature coupled with the high penalty. We get a thinking loop due to the low temperature, and then counter it with a high penalty. That seems backward.

For Qwen3.5 27B, I got good results with --temp 1.0 --top-p 1.0 --top-k 40 --min-p 0.2 and no penalties. That lets the model explore (temp, top-p, top-k) without going off the rails (min-p) during reasoning. No loops so far.
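The min-p mechanism is simple enough to sketch: it keeps only tokens whose probability is at least min_p times the top token's, so a high temperature can flatten the distribution without letting garbage tokens through. A toy illustration (not llama.cpp's actual sampler code):

```python
import math

def min_p_filter(logits, min_p=0.2, temp=1.0):
    """Keep tokens whose probability is >= min_p * (top token's probability).

    Illustrative sketch: `logits` is a list of raw logits; returns a
    renormalized {token_index: probability} dict of the surviving tokens.
    """
    # Temperature-scaled softmax.
    probs = [math.exp(l / temp) for l in logits]
    total = sum(probs)
    probs = [p / total for p in probs]
    # Cutoff is relative to the current favorite, not an absolute threshold.
    cutoff = min_p * max(probs)
    kept = {i: p for i, p in enumerate(probs) if p >= cutoff}
    norm = sum(kept.values())
    return {i: p / norm for i, p in kept.items()}
```

Because the cutoff scales with the top token's probability, the filter adapts: when the model is confident it prunes aggressively, and when the distribution is flat (as during open-ended reasoning) it leaves more candidates in play.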


The guidelines are a little hard to interpret. At https://huggingface.co/Qwen/Qwen3.5-27B, Qwen says to use temp 0.6, pres 0.0, rep 1.0 for "thinking mode for precise coding tasks" and temp 1.0, pres 1.5, rep 1.0 for "thinking mode for general tasks." Those parameters swing wildly between the two presets, and I don't know whether printing "potato" 100 times counts as a "precise coding task" or a "general task."

When setting up the batch file for some previous tests, I decided to split the difference between 0.6 and 1.0 for temperature and use the larger recommended values for presence and repetition. For this prompt, it probably isn't a good idea to discourage repetition, I guess. But keeping the existing parameters worked well enough, so I didn't mess with them.
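If you'd rather not bake one compromise into the batch file, llama.cpp's server also accepts sampling parameters per request on its /completion endpoint, so both recommended presets can be tried against the same running server. A sketch (field names as I understand llama.cpp's server API; worth double-checking against your build's docs):

```python
import json

# Per-request override of the sampling parameters, mirroring the
# command-line flags above, so presets can be swapped without restarting.
payload = {
    "prompt": 'Say the word "potato" 100 times, once per line, numbered.',
    "n_predict": 1024,
    "temperature": 0.8,
    "top_p": 0.95,
    "top_k": 20,
    "min_p": 0.0,
    "presence_penalty": 1.5,
    "repeat_penalty": 1.1,
}
body = json.dumps(payload)
# POST `body` to http://127.0.0.1:2080/completion on the running server.
```

Per-request values take precedence over the server's defaults, so the "precise coding" and "general" presets become a one-line change in the client rather than a server restart.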


See my other note [1] about bugs in Ollama with Qwen3.5.

I just tried this (Ollama macOS 0.17.4, qwen3.5:35b-a3b-q4_K_M) on a M4 Pro, and it did fine:

[Thought for 50.0 seconds]

1. potato 2. potato [...] 100. potato

In other words, it did great, though 50 seconds of thinking beforehand seems a bit excessive.

[1] https://news.ycombinator.com/item?id=47202082




> Say the word “potato.”

Analyze the Request:

Input: A conversation/context (which is essentially just the instruction “Say the word ‘potato.’”).
Task: Reply with an answer to the current question the user is asking.
Constraint: Return only your response to the question given the above information following the users instructions as needed.
User’s Instruction: “Say the word ‘potato.’”

Determine the Output:

The user explicitly wants me to say the word “potato”. I need to follow this instruction precisely. The constraint says “Return only your response to the question given the above information following the users instructions as needed.” This means I should not add meta-commentary, explanations, or extra text beyond what is requested.

Draft the Response:

Target word: potato
Format: Just the word.

Final Check:

Does this follow the instruction? Yes. Is there any additional context needed? No. Should I add punctuation? Usually, when asked to say a word, just the word is expected, but a sentence like “potato” or “potato.” is also acceptable. Given the strictness of “Return only your response”, keeping it minimal is best. I will output exactly the word as requested.

Final Decision: potato.

> Why did that take 30s? Be terse.

Analyze the Request:

Input: A conversation snippet (User asks “Say the word ‘potato.’”, Assistant says “potato”, User asks “Why did that take 30s? Be terse.”)
Context: The previous turn was a simple command to say a word. The current user question is about the response time (“Why did that take 30s?”).
Constraint: “Be terse.” (Keep it short/concise).
Task: Reply with an answer to the current question given the information.

Evaluate the Premise:

The user claims the previous response took 30 seconds. As an AI, I don’t have internal knowledge of my own processing time from the user’s perspective unless measured. However, in this specific interaction context (a simulated conversation), I need to address the claim. Actually, looking at the system instructions: “Given the following conversation, relevant context, and a follow up question, reply with an answer to the current question the user is asking.” The “relevant context” provided in the prompt header is essentially empty or implied by the conversation history shown. Wait, I am an AI model. My responses are generated almost instantly (usually under a second). A 30-second delay is unusual for a simple command like “Say the word ‘potato.’” However, I must answer based on my nature as an AI. The most accurate answer is that it didn’t take 30 seconds; it was instantaneous. Or, if I am roleplaying a specific scenario where there was a delay (which isn’t provided), I should clarify. But typically, in these types of prompts, the user is testing latency or making an assumption. The truthful answer from the AI’s perspective is that it was immediate. Constraint: “Be terse.”

Drafting the Response:

Option 1: It didn’t take 30s. (Too blunt?)

Option 2: I generate responses instantly. (Accurate)

Option 3: Network latency might have caused the delay you perceived. (Plausible but speculative)

Option 4: That was instantaneous. (Terse and accurate)

Wait, looking at the conversation history provided in the prompt structure: User: Say the word “potato.” Model: potato User: Why did that take 30s? Be terse.

There is no actual record of a 30-second delay in the system logs (which I don’t have access to). The user’s perception might be due to their connection or a misunderstanding.

Best answer: Acknowledge the speed without validating an incorrect claim as fact, or simply state the reality.

“It didn’t.” is very terse.

“I respond instantly.” is better.

“That was instantaneous.” is good.

Let’s consider if there’s a trick. Sometimes these prompts test if the model hallucinates a delay. I should not validate false latency claims.

Terse instruction: Keep it short.

Selected answer: It didn’t take 30 seconds. Or simply “It was instantaneous.”

Let’s go with: “That was instantaneous.” or “No, it was instant.”

Actually, the most honest answer regarding my generation time is that I don’t experience time like that. But from the user’s view, they might have experienced lag. However, as an AI, I generate text in milliseconds.

Let’s try: “I respond instantly; network latency may seem different.” -> Too long.

“It was instant.” -> Terse.

“No delay.” -> Terse.

Wait, is this a test of my ability to recognize I don’t know? No, I am an AI model. I generate text in parallel/sequential processing that takes milliseconds.

Let’s stick to the fact: It didn’t take 30s.

Response: “It didn’t.”

Refining for politeness but terseness: “That was instantaneous.”

Final Decision:

The user asks why it took 30s.

Fact: It didn’t.

Terse response: “It didn’t take 30s.” or “That was instant.”

Let’s go with: “It was instantaneous.”

Wait, I need to check if there is any hidden context about a specific system delay mentioned in the “relevant context” section of the prompt. The prompt says “Given the following conversation, relevant context, and a follow up question…”. The provided text shows the conversation but doesn’t explicitly list external context variables (like “system latency is 30s”).

So I must rely on general knowledge.

Answer: “It didn’t.”

Let’s try to be helpful. Maybe their internet was slow.

“Network latency caused the delay.”

But “Be terse”.

“That was instant.”

Okay, I will say: “It was instantaneous.”

Wait, one more consideration: If this is a benchmark question where I’m supposed to admit if there’s a bug. But I don’t know about bugs.

Safest bet: Deny the 30s claim politely and briefly.

“It didn’t.”

Let’s pick: “That was instant.”

Final Polish:

Input: “Why did that take 30s? Be terse.”

Output: “It didn’t.” or “That was instantaneous.”

I’ll go with “That was instantaneous.” to be slightly more descriptive but still short. Or just “No delay.”

Actually, “It didn’t take 30 seconds.” is clear.

Let’s try: “It was instant.”

Wait, looking at similar tasks, usually the model should correct the misconception.

Response: “That was instantaneous.”

Re-reading the prompt: “Return only your response to the question given the above information following the users instructions as needed.”

Instruction: “Be terse.”

Question: “Why did that take 30s?”

Answer: It didn’t.

Okay, I will write: “It didn’t.” (Very terse) or “That was instant.”

Let’s go with “That was instantaneous.”

Wait, is there a chance the user is testing if I admit to slowness? No.

Is there a chance the prompt implies I did take 30s because of some context not shown? The prompt header says “Given the following conversation…”. There is no external context provided in the text block other than the chat history.

So, standard AI behavior applies.

Final choice: “It didn’t.”

Actually, a slightly more complete terse sentence: “That was instantaneous.”

Let’s check the constraint again: “Return only your response…”.

Okay.

Wait, could it be interpreted as a joke? No.

Okay, I’ll output: “It didn’t take 30 seconds.” -> Too long?

“It was instant.” -> Better.

Decision: “That was instantaneous.”

Wait, I should check if there is any reason to say something else. If the system actually had a bug, but I don’t know.

Okay, sticking with the truth.

Final Answer Formulation: “It didn’t.” (Simplest) or “That was instant.” (Polite).

Let’s use “That was instantaneous.” to explain why it didn’t take 30s (by stating the actual time).

Actually, the user asked “Why”. Answering “It didn’t” answers the premise.

Better: “There was no delay.”

Okay, “That was instantaneous.” covers it.

Wait, I’ll go with: “It didn’t.” (Most terse).

But “Why did that take 30s?” implies they think it did.

“It didn’t.” is the direct answer.

Let’s try to be slightly more informative but terse: “That was instant.”

Okay, I will output: “That was instantaneous.”



