Hi HN, happy to see this here! I highly recommend to take a look at the technica...

amrrs · on Jan 23, 2025

For those who don't know, he is the gg of `gguf`. Thank you for all your contributions! Literally the core of Ollama, LMStudio, Jan and multiple other apps!

kennethologist · on Jan 24, 2025

A. Legend. Thanks for having DeepSeek available so quickly in LM Studio.

sergiotapia · on Jan 23, 2025

well hot damn! killing it!

halyconWays · on Jan 23, 2025

[flagged]

kamranjon · on Jan 23, 2025

They collaborate! Her name is Justine Tunney - she took her “execute everywhere” work with Cosmopolitan to make Llamafile using the llama.cpp work that Georgi has done.

halyconWays · on Feb 2, 2025

She actually stole that code from a user named slaren and was personally banned by Gerg from the llama.cpp repo for about a year because of it. Also it was just lazy loading the weights, it wasn't actually a 50% reduction.

https://news.ycombinator.com/item?id=35411909

kamranjon · on Feb 8, 2025

That seems like a false narrative, which is strange because you could have just read the explanation from Jart a little further down in the thread:

https://news.ycombinator.com/item?id=35413289

madeforhnyo · on Jan 23, 2025

Someone did? Could you pls share a link?

bangaladore · on Jan 23, 2025

Quick testing on vscode to see if I'd consider replacing Copilot with this. Biggest showstopper right now for me is that the output length is quite small. The default length is set to 256, but even if I up it to 4096, I'm not getting any larger chunks of code.

Is this because of a max latency setting, or the internal prompt, or am I doing something wrong? Or is it only really made to autocomplete lines and not blocks like Copilot will?

Thanks :)

ggerganov · on Jan 23, 2025

There are 4 stopping criteria atm:

- Generation time exceeded (configurable in the plugin config)

- Number of tokens exceeded (not the case since you increased it)

- Indentation - stops generating if the next line has shorter indent than the first line

- Small probability of the sampled token

Most likely you are hitting the last criterion. It's something that should be improved in some way, but I am not very sure how. Currently, it is using a very basic token sampling strategy with a custom threshold logic to stop generating when the token probability is too low. Likely this logic is too conservative.
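
For illustration only, here is a minimal Python sketch of how the four stopping criteria listed above could be combined. It is not the plugin's actual code: the function names, the `(text, probability)` token stream, and the default thresholds (`max_ms`, `min_prob`) are all assumptions made up for this example.

```python
import time
from typing import Iterable, Tuple


def indent_of(line: str) -> int:
    """Number of leading spaces (tabs are ignored in this sketch)."""
    return len(line) - len(line.lstrip(" "))


def dedented(text: str) -> bool:
    """True once the newest line is known to be indented less than the first."""
    lines = text.split("\n")
    if len(lines) < 2 or not lines[-1].strip():
        return False  # single line so far, or the new line's indent is unknown
    return indent_of(lines[-1]) < indent_of(lines[0])


def complete(
    tokens: Iterable[Tuple[str, float]],
    max_tokens: int = 256,     # hypothetical default
    max_ms: float = 1000.0,    # hypothetical default
    min_prob: float = 0.1,     # hypothetical threshold
) -> str:
    """Consume (text, probability) pairs until one stopping criterion fires."""
    start = time.monotonic()
    out = ""
    for n, (text, prob) in enumerate(tokens):
        if (time.monotonic() - start) * 1000.0 > max_ms:
            break  # 1. generation time exceeded
        if n >= max_tokens:
            break  # 2. number of tokens exceeded
        if dedented(out + text):
            # 3. next line is indented less than the first: drop that line
            return (out + text).rsplit("\n", 1)[0]
        if prob < min_prob:
            break  # 4. sampled token has a small probability
        out += text
    return out
```

With a fixed `min_prob`, a single uncertain token ends the whole suggestion, which would be consistent with short outputs even when the token limit is raised.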

bangaladore · on Jan 23, 2025

Hmm, interesting.

I didn't catch t_max_predict_ms and upped that to 5000ms for fun. Doesn't seem to make a difference, so I'm guessing you are right.

eklavya · on Jan 24, 2025

Thanks for sharing the vscode link. After trying I have disabled the continue.dev extension and ollama. For me this is wayyyyy faster.

jerpint · on Jan 23, 2025

Thank you for all of your incredible contributions!

liuliu · on Jan 23, 2025

KV cache shifting is interesting!

Just curious: how much of your code nowadays is completed by an LLM?

ggerganov · on Jan 23, 2025

Yes, I think it is surprising that it works.

I think a fairly large amount, though I can't give a good number. I have been using GitHub Copilot from the very early days, and with the release of Qwen Coder last year I have fully switched to using local completions. I don't use the chat workflow to code though, only FIM.

menaerus · on Jan 24, 2025

Interesting approach.

Am I correct to understand that you're basically minimizing the latencies and required compute/mem-bw by avoiding the KV cache? And encoding the (local) context in the input tokens instead?

I ask this because you set the prompt/context size to 0 (--ctx-size 0) and the batch size to 1024 (-b 1024). The former would mean that llama.cpp will only be using the context that is already encoded in the model itself, with no local (code) context besides the one provided in the input tokens, but perhaps I misunderstood something.

Thanks for your contributions and obviously the large amount of time you take to document your work!

ggerganov · on Jan 24, 2025

The primary tricks for reducing the latency are around context reuse, meaning that the computed KV cache of tokens from previous requests is reused for new requests and thus computation is saved.

To get high-quality completions, you need to provide a large context of your codebase so that the generated suggestion is more in line with your style and implementation logic. However, naively increasing the context will quickly hit a computation limit because each request would need to compute (a.k.a. prefill) a lot of tokens.

The KV cache shifting used here is an approach to reuse the cache of old tokens by "shifting" them to new absolute positions in the new context. This way a request that would normally require a context of, let's say, 10k tokens could be processed more quickly by computing just, let's say, 500 tokens and reusing the cache of the other 9.5k tokens, thus cutting the compute ~10 fold.

The --ctx-size 0 CLI arg simply tells the server to allocate memory buffers for the maximum context size supported by the model. For the case of Qwen Coder models, this corresponds to 32k tokens.

The batch sizes are related to how much local context around your cursor to use, along with the global context from the ring buffer. This is described in more detail in the links, but simply put: decreasing the batch size will make the completion faster, but with less quality.
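
As a rough illustration of the context reuse idea described above, here is a toy Python sketch. It only does the bookkeeping: chunks whose KV entries are already cached are marked as "shift" (moved to a new position), and everything else is marked as "compute". The `Chunk` type, the `plan_request` function, and the chunk granularity are assumptions for this example; the real server operates on the actual KV cache tensors and this sketch says nothing about how the shift itself is done.

```python
from typing import Dict, List, Optional, Tuple

Chunk = Tuple[int, ...]                      # a run of token ids
Step = Tuple[str, Optional[int], int, int]   # (action, old_pos, new_pos, n_tokens)


def plan_request(cache: Dict[Chunk, int], chunks: List[Chunk]) -> Tuple[List[Step], int, int]:
    """Decide, chunk by chunk, whether cached KV entries can be reused.

    `cache` maps a chunk to the absolute position it had in the previous
    request. Returns the plan plus the reused and computed token counts.
    """
    plan: List[Step] = []
    pos = reused = computed = 0
    for chunk in chunks:
        if chunk in cache:
            # Already computed earlier: reuse it at its new absolute position.
            plan.append(("shift", cache[chunk], pos, len(chunk)))
            reused += len(chunk)
        else:
            # Never seen before: these tokens must go through prefill.
            plan.append(("compute", None, pos, len(chunk)))
            computed += len(chunk)
        pos += len(chunk)
    return plan, reused, computed


def make_chunk(i: int, size: int = 500) -> Chunk:
    """Hypothetical helper: a distinct chunk of `size` token ids."""
    return tuple(range(i * size, (i + 1) * size))


# Previous request: 20 chunks of 500 tokens (10k tokens), all cached.
previous = [make_chunk(i) for i in range(20)]
cache = {chunk: i * 500 for i, chunk in enumerate(previous)}

# New request: the oldest chunk is dropped and one new chunk is appended.
current = previous[1:] + [make_chunk(20)]
plan, reused, computed = plan_request(cache, current)
```

In this made-up scenario the new 10k-token request reuses 9,500 cached tokens at shifted positions and only prefills the 500 new ones, matching the numbers used in the explanation above.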

menaerus · on Jan 24, 2025

Ok, so --ctx-size with a value != 0 means that we can override the default model context size. Since for obvious computation cost reasons we cannot use a fresh 32k context for each request, the trick you use is to take the 1k context (the batch that includes local and semi-local code) and enrich it with the previous model responses by keeping them in, and feeding them from, the KV cache? And to increase the correlation between the current request and previous responses, you do the shifting in the KV cache?

ggerganov · on Jan 24, 2025

Yes, exactly. You can set --ctx-size to a smaller value if you know that you will not hit the limit of 32k - this will save you VRAM.

To control how much global context to keep in the ring buffer (i.e. the context that is being reused to enrich the local context), you can adjust the "ring_n_chunks" and "ring_chunk_size". With the default settings, this amounts to about 8k tokens of context on our codebases when the ring buffer is full, which is a conservative setting. Increasing these numbers will make the context bigger and improve the quality, but will cost performance.
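
To make the role of the two settings concrete, here is a small Python sketch of a chunk ring buffer. It is only a guess at the general shape, not the plugin's implementation: the `ChunkRing` class is hypothetical, and it assumes the chunk size is measured in lines and that the oldest chunk is evicted first.

```python
from collections import deque
from typing import Deque, List


class ChunkRing:
    """Keeps at most `ring_n_chunks` chunks of at most `ring_chunk_size` lines."""

    def __init__(self, ring_n_chunks: int, ring_chunk_size: int) -> None:
        self.ring_chunk_size = ring_chunk_size
        # A deque with maxlen silently drops the oldest chunk when full.
        self.chunks: Deque[List[str]] = deque(maxlen=ring_n_chunks)

    def add(self, lines: List[str]) -> None:
        """Store a chunk (truncated to the size limit), skipping duplicates."""
        chunk = lines[: self.ring_chunk_size]
        if chunk and chunk not in self.chunks:
            self.chunks.append(chunk)

    def extra_context(self) -> str:
        """Global context to send along with the local context."""
        return "\n".join("\n".join(chunk) for chunk in self.chunks)
```

The upper bound on the global context is `ring_n_chunks * ring_chunk_size` lines, which is why raising either number grows the context and the amount of work per request.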

There are a few other tricks to reduce the compute for the local context (i.e. the 1k batch of tokens), so that in practice, a smaller amount is processed. This further saves compute during the prefill.

menaerus · on Jan 24, 2025

Since qwen 2.5 turbo with 1M context size is advertised to be able to crunch ~30k LoC, I guess we can say then that the 32k qwen 2.5 model is capable of ~960 LoC and therefore 32k model with an upper bound of context set to 8k is capable of ~250 LoC?

Not bad.
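
The estimate above is a simple linear scaling, which can be written out as a short Python sketch. The only inputs are the figures quoted in the comment (1M tokens for roughly 30k LoC); the function name is made up, and real token-per-line ratios vary by language and code style.

```python
def loc_capacity(context_tokens: int, ref_tokens: int = 1_000_000, ref_loc: int = 30_000) -> float:
    """Scale the quoted 1M-token / ~30k-LoC figure linearly to another context size."""
    return context_tokens * ref_loc / ref_tokens


full_context = loc_capacity(32_000)   # 960.0 lines for a 32k context
ring_context = loc_capacity(8_000)    # 240.0 lines for an 8k context
```

So the linear estimate gives about 960 LoC for 32k tokens and about 240 LoC for 8k tokens, close to the ~250 quoted above.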

gloflo · on Jan 23, 2025

What is FIM?

jjnoakes · on Jan 23, 2025

Fill-in-the-middle. If your cursor is in the middle of a file instead of at the end, then the LLM will consider text after the cursor in addition to the text before the cursor. Some LLMs can only look before the cursor; for coding, ones that can FIM work better (for me at least).

rav · on Jan 23, 2025

FIM is "fill in middle", i.e. completion in a text editor using context on both sides of the cursor.
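
A minimal Python sketch of what a FIM prompt can look like. It assumes the Qwen2.5-Coder style special tokens (`<|fim_prefix|>`, `<|fim_suffix|>`, `<|fim_middle|>`); other models use different tokens, and the helper names here are made up for the example.

```python
from typing import Tuple


def split_at_cursor(text: str, cursor: int) -> Tuple[str, str]:
    """Split the buffer into the text before and after the cursor offset."""
    return text[:cursor], text[cursor:]


def build_fim_prompt(prefix: str, suffix: str) -> str:
    """Assemble a fill-in-the-middle prompt (Qwen-style tokens assumed).

    The model is expected to generate the text that belongs between
    `prefix` and `suffix`, right after the final <|fim_middle|> token.
    """
    return f"<|fim_prefix|>{prefix}<|fim_suffix|>{suffix}<|fim_middle|>"


source = "def add(a, b):\n    return \n\nprint(add(1, 2))\n"
cursor = source.index("return ") + len("return ")
prefix, suffix = split_at_cursor(source, cursor)
prompt = build_fim_prompt(prefix, suffix)
```

Here the model sees both the function header before the cursor and the `print(add(1, 2))` call after it, which is the extra context a plain left-to-right completion would not have.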

LoganDark · on Jan 24, 2025

llama.cpp supports FIM?

attentive · on Jan 23, 2025

Is it correct to assume this plugin won't work with ollama?

If so, what's ollama missing?

mistercheph · on Jan 24, 2025

this plugin is designed specifically for the llama.cpp server api, if you want copilot like features with ollama, you can use an ollama instance as a drop-in replacement for github copilot with this plugin: https://github.com/bernardo-bruning/ollama-copilot

There is also https://github.com/olimorris/codecompanion.nvim which doesn't have text completion, but supports a lot of other AI editor workflows that I believe are inspired by Zed and supports ollama out of the box

nancyp · on Jan 23, 2025

TIL: Vim has its own language. Thanks Georgi for llama.cpp!

nacs · on Jan 23, 2025

Vim is incredibly extensible.

You can use C or Vimscript, but programs like Neovim support Lua as well, which makes it really easy to make plugins.

halyconWays · on Jan 23, 2025

Please make one for JetBrains' IDEs!