My First Real Encounter With VRAM Limits

Getting a local model running on the RTX 2060 felt like a breakthrough. Then I started asking the card to do more, and VRAM made it clear who was actually in charge here.

After I got my first local LLM running, I made the completely reasonable mistake of assuming the hard part was over. The model loaded. Tokens appeared. The machine did not burst into flames. No smoke came out of the case.

Naturally, I interpreted this as permission to go bigger and better.

I started trying more. Different models. Coding models. Chat models. Bigger models. Bigger is always better, right? Well... turns out that wasn't exactly right. These newer and bigger models started choking. Hard.

At first, I thought the whole problem came down to model size.

A 3B model should be easy. A 7B model should be a little harder. A 20B model should be where the machine looks me directly in the eye and asks why I am like this. But it should at least work, in theory, right? Remember quantization? That should have fixed everything!

But it didn't. Not at all.

And that, kids, is how I met ~~your mother~~ the context window.

The Working Pile

To make life easier on myself, I wrote batch scripts that save each configuration and run it for me, instead of typing launch commands into the console every time I wanted to start up a different model or configuration (of which I had accumulated many).

These scripts are simple. They are not polished documentation, and they're not amazing feats of architecture. They are just little snapshots of me throwing models at an RTX 2060 and seeing what survived.

One that worked was Qwen2.5-Coder-7B, a dense model with 7.61 billion parameters and 28 layers, which I downloaded as a Q4_K_M quantization:


docker run --rm -it --gpus all ^
  -p 8082:8080 ^
  -v F:\LocalAI\models\gguf:/models ^
  ghcr.io/ggml-org/llama.cpp:server-cuda ^
  -m /models/Qwen2.5-Coder-7B-Instruct-Q4_K_M.gguf ^
  --host 0.0.0.0 ^
  --port 8080 ^
  -ngl 99 ^
  -c 4096

Note that I published the container's port 8080 on host port 8082 because I was already running a separate service on 8080.

That bad boy was getting somewhere between 15 and 18 tokens per second. That's not blazing fast, but it is certainly workable! But... something was off. After one or two questions, it would start hallucinating like crazy, and it would get itself into a loop where it would just print the same word over and over and over until I killed the process. But why?

Context.

Well, probably. There are a lot of things that could cause that. But in this case, context was a driving factor. That -c 4096 flag was setting the context size to 4096 tokens. Think of the context as the short-term memory of the conversation you're having with the model. A longer context means it remembers more of what you have already talked about. That's why, in a long chat, you'll sometimes find yourself telling the model things you already told it before. That tidbit had rolled out of the context window.
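To make the "rolled out of the window" idea concrete, here is a minimal Python sketch of a rolling window over chat history. It is only an illustration: the word-count "tokenizer" and the function names are made up for this example, and how a real server behaves when the context fills up varies by tool and settings.

```python
from collections import deque


def rough_token_count(text: str) -> int:
    """Crude stand-in for a real tokenizer: one 'token' per whitespace-separated word."""
    return len(text.split())


def trim_to_window(messages: list[str], max_tokens: int) -> list[str]:
    """Keep the most recent messages whose combined rough token count fits the window."""
    kept: deque[str] = deque()
    used = 0
    # Walk backwards from the newest message and stop once the budget is spent.
    for message in reversed(messages):
        cost = rough_token_count(message)
        if used + cost > max_tokens:
            break
        kept.appendleft(message)
        used += cost
    return list(kept)


history = [
    "My project is written in Rust.",
    "Please add logging to the parser.",
    "Now write tests for the new logging.",
]

# With a tiny 14-"token" window, the oldest message no longer fits.
window = trim_to_window(history, max_tokens=14)
print(window)
```

In this toy run, the first message (the one that says the project is in Rust) is the one that falls out, which is exactly the kind of detail I kept having to repeat.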

It turns out, especially with a coding model, 4096 tokens is not enough context for the way I wanted to use it. It can work fine for short conversations on a chat model, but not here. No big deal though, right? Let's just bump that up. What could go wrong?


docker run --rm -it --gpus all ^
  -p 8082:8080 ^
  -v F:\LocalAI\models\gguf:/models ^
  ghcr.io/ggml-org/llama.cpp:server-cuda ^
  -m /models/Qwen2.5-Coder-7B-Instruct-Q4_K_M.gguf ^
  --host 0.0.0.0 ^
  --port 8080 ^
  -ngl 99 ^
  -c 32768

Well... That didn't work. Tokens per second dropped to 1. Sometimes 2.

Oh.

The Context Window Ambush

The logs from that phase tell the story pretty clearly. The RTX 2060 has 6GB of VRAM total, and when I tried to push Qwen2.5-Coder-7B with the full 32k context window, it needed more GPU memory than the card could offer. That forced part of the workload out of VRAM and into system memory.

The model weights aren't the whole story. The model also needs room for the conversation itself. The longer the context window, the more working memory the model needs while it generates. Llama.cpp even suggested cutting the context back down to 4096 tokens, which would save roughly 1.6GB of VRAM.
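That working memory is mostly the key/value cache, and its size can be roughly estimated. The sketch below is back-of-the-envelope Python, not anything llama.cpp prints. The 28 layers come from the model description above; the 4 key/value heads and head dimension of 128 are my assumption based on the published Qwen2.5-7B configuration, and the 2 bytes per value assumes an unquantized 16-bit cache. It ignores compute buffers and other overhead.

```python
def kv_cache_bytes(
    n_tokens: int,
    n_layers: int = 28,        # from the model description
    n_kv_heads: int = 4,       # assumption: published Qwen2.5-7B config
    head_dim: int = 128,       # assumption: published Qwen2.5-7B config
    bytes_per_value: int = 2,  # assumption: 16-bit KV cache
) -> int:
    """Rough KV cache size: one key and one value vector per layer, per KV head, per token."""
    keys_and_values = 2
    return keys_and_values * n_layers * n_kv_heads * head_dim * bytes_per_value * n_tokens


small = kv_cache_bytes(4096)
large = kv_cache_bytes(32768)
saved_gb = (large - small) / 1e9

print(f"4k: {small / 2**20:.0f} MiB, 32k: {large / 2**20:.0f} MiB, saved ~{saved_gb:.2f} GB")
# prints: 4k: 224 MiB, 32k: 1792 MiB, saved ~1.64 GB
```

Under those assumptions, the gap between the two context sizes lands right around the 1.6GB figure from the logs.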

But what was the big deal? I have 64GB of system RAM, wasn't I fine?

Well, yes, but also no.

Yes, it successfully offloaded everything it couldn't fit into system RAM. But system RAM is nowhere near as fast as VRAM, and the overhead of reaching out of the GPU across to system RAM, waiting for the data, and then sending it back is a massive bottleneck. So, does it work? Absolutely. Is it painfully slow? You bet your ass it is.

The 2060 Was Doing Its Best

I do not want to sound unfair to the RTX 2060. That card did more than I expected.

It ran several of the 3B local language models just fine. It handled quantized 7B experiments, and it did actually maintain a usable token generation rate on some of the later mixture-of-experts models I tried out. It let me test infrastructure, launch servers, compare behavior, and learn the shape of the problem without renting someone else’s machine. That was useful and important.

But the 2060 also taught me where enthusiasm stops and resource budgets begin.

Larger models pushed the card hard. Longer context windows made everything heavier. System RAM spillover felt awful. Load times stretched. Token generation slowed down. The machine developed the emotional energy of an old dot matrix printer asked to participate in a blockchain startup.

It could usually make things run.

It could not always make them run well.

The Setup Becomes a Tradeoff Machine

The important lesson is not “small GPUs are bad.” That would be too easy, and also wrong.

The real lesson is that VRAM is a budget. Every decision spends from it. Model size spends from it. Context length spends from it. Moving part of the workload back to the CPU buys some back. So does quantization.
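Here is one way to sketch that budget in Python. Treat every number as a placeholder: the weights size and the reserved overhead are hypothetical values I picked for illustration (check your own GGUF file size and what your desktop is already using), and the per-token cache cost assumes 28 layers, 4 key/value heads, a head dimension of 128, and a 16-bit cache.

```python
def max_context_tokens(
    vram_bytes: int,
    weights_bytes: int,
    reserved_bytes: int,
    kv_bytes_per_token: int,
) -> int:
    """How many tokens of KV cache fit after weights and reserved overhead are paid for."""
    free = vram_bytes - weights_bytes - reserved_bytes
    if free <= 0:
        return 0
    return free // kv_bytes_per_token


# Assumption: keys + values * layers * KV heads * head dim * 2 bytes each.
KV_BYTES_PER_TOKEN = 2 * 28 * 4 * 128 * 2

budget = max_context_tokens(
    vram_bytes=6 * 2**30,          # a 6GB card
    weights_bytes=4_700_000_000,   # hypothetical: size of a 7B Q4_K_M file
    reserved_bytes=1_000_000_000,  # hypothetical: OS, browser, compute buffers
    kv_bytes_per_token=KV_BYTES_PER_TOKEN,
)

print(f"Roughly {budget} tokens of context fit in the remaining VRAM")
```

With placeholder numbers like these, the answer comes out well below 32768, which lines up with what the card was telling me. Change any one input and the whole budget moves.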

But then there's the rest of the machine. Your operating system uses some of it. Windows demands a lot. So does Chrome, and VS Code, and Discord, and on and on and on.

You have to stop asking, “What is the biggest model I can run?” and instead start asking, “What work am I actually trying to do?”

After this phase, every launch script started looking like a set of compromises.

The model mattered. The quant mattered. The context window mattered. The number of GPU layers mattered. The machine’s remaining memory mattered. The patience of the person sitting in front of the keyboard definitely mattered.

I had been saying things like Q4_K_M without fully appreciating what those characters were doing for me. They were not just part of the filename. They were the difference between a useful experiment and a very warm box full of regret. Turns out, I needed to learn more about how that part actually worked.

Current Thoughts

I'm not going to tell you that the RTX 2060 is secretly some sleeper card that's powerful enough to do everything. It does not have hidden reserves of magical compute waiting under the fan shroud. But it is a solid little card! Don't write off your 2060s just because they are older and have very little VRAM to work with.

The models and tooling have improved to the point where it can absolutely run a little local model for small tasks. Later on, I even got it working with ComfyUI doing image and simple video generation. It wasn't fast, but it did it.

Local AI is within reach. Even with hardware as humble as the NVIDIA RTX 2060.