Build a Local AI Coding Agent via Llama.cpp and VSCode for Agentic Programming

Build a Local AI Coding Agent with Qwen2.5-Coder 14B / Gemma4, llama.cpp & VS Code (Ubuntu Tutorial)

Agentic programming is quickly becoming one of the most exciting areas in AI development. Instead of simple autocomplete, modern coding agents can understand projects, edit multiple files, refactor code, debug applications, and help automate software development workflows.

In this tutorial, we will build a fully local AI coding assistant using:

Qwen2.5-Coder 14B or Gemma4 E4B
llama.cpp
CUDA GPU acceleration
VS Code
Kilo Code extension

Everything will run locally on your own GPU without requiring cloud APIs or subscriptions.

What You Will Build

By the end of this tutorial, you will have:

A local OpenAI-compatible AI server
Full GPU accelerated inference using llama.cpp
Qwen2.5-Coder 14B or Gemma4 E4B running locally
VS Code connected to your local coding model
An agentic AI coding workflow

This setup is ideal for:

Software developers
AI enthusiasts
Privacy-focused workflows
Offline AI development
Building coding assistants

System Requirements

For this tutorial I used:

Ubuntu Linux
NVIDIA RTX 5070 12GB GPU
CUDA Toolkit installed
Python 3
VS Code

Recommended GPU VRAM:

Model	Recommended VRAM
Qwen2.5-Coder 14B Q4_K_M	10GB–12GB
Qwen2.5-Coder 14B Q5_K_M	12GB+

Step 1 — Install CUDA Drivers

First ensure your NVIDIA drivers and CUDA toolkit are properly installed.

Verify using:

nvidia-smi

And:

nvcc --version

If your GPU appears correctly, you are ready to continue.

Step 2 — Clone and Build llama.cpp

Clone the official llama.cpp repository:

git clone https://github.com/ggml-org/llama.cpp
cd llama.cpp

Build with CUDA acceleration enabled:

cmake -B build \
-DGGML_CUDA=ON

cmake --build build -j

After compilation finishes, test it:

./build/bin/llama-cli --help

If you see the help output, llama.cpp is installed correctly.

Step 3 — Install Hugging Face CLI

We will use the Hugging Face CLI to download the GGUF model.

Install it:

pip install -U "huggingface_hub[cli]"

hf auth login

Step 4 — Download Qwen2.5-Coder 14B GGUF

We will use the official GGUF repository from Qwen.

Download the Q4_K_M quantized model:

hf download \
Qwen/Qwen2.5-Coder-14B-Instruct-GGUF \
qwen2.5-coder-14b-instruct-q4_k_m.gguf \
--local-dir ~/AI/models

This quantization provides an excellent balance between:

Performance
VRAM usage
Coding quality
Speed

Step 5 — Run the Model with Full GPU Offloading

Now launch the model using llama-server.

Recommended command:

~/llama.cpp/build/bin/llama-server \
-m ~/AI/models/qwen2.5-coder-14b-instruct-q4_k_m.gguf \
-ngl 999 \
-c 8192 \
--host 0.0.0.0 \
--port 8080

Explanation:

Parameter	Meaning
`-ngl 999`	Full GPU offloading
`-c 8192`	8K context length
`--flash-attn`	Enables Flash Attention
`--host 0.0.0.0`	Allows LAN access
`--port 8080`	Runs server on port 8080

Why 16K Context Crashed on a 12GB GPU

Initially I tried running:

-c 16384

However, the server crashed with a CUDA out-of-memory error.

The reason is that larger context lengths require significantly more KV cache VRAM.

Approximate memory usage:

Context Size	Extra VRAM Usage
4K	Low
8K	Moderate
16K	Very High

The 14B model weights already consume most of the GPU memory, and the 16K KV cache pushed total VRAM usage beyond the available limit.

Reducing context to 8192 solved the issue while still providing good coding performance.

Step 6 — Open the Local Server

Once running successfully, open:

http://localhost:8080

The OpenAI-compatible API endpoint is:

http://localhost:8080/v1

This endpoint can be connected to AI coding extensions and applications.

Step 7 — Install Visual Studio Code

Download and install VS Code.

After installation, open the Extensions marketplace.

Step 8 — Install the Kilo Code Extension

Search for:

Kilo Code

Install the extension.

Kilo Code allows VS Code to connect to local AI models and enables agentic coding workflows.

Step 9 — Configure Kilo Code

Inside the extension settings configure:

Provider

OpenAI Compatible

Base URL

http://localhost:8080/v1

Model Name

qwen2.5-coder

API Key

anything

The API key is ignored locally by llama.cpp.

What is Agentic Programming?

Traditional autocomplete simply predicts the next few tokens.

Agentic programming is much more advanced.

Modern AI coding agents can:

Understand project structure
Edit multiple files
Plan implementation steps
Refactor applications
Debug errors
Generate components
Explain codebases
Maintain context over long workflows

This transforms VS Code into a collaborative AI development environment.

Performance of Qwen2.5-Coder 14B

Qwen2.5-Coder performs extremely well for:

Python
JavaScript
TypeScript
React
APIs
Refactoring
Multi-file reasoning
AI coding agents

Compared to smaller local models, the 14B variant provides:

Better reasoning
Better code quality
More accurate refactoring
Stronger instruction following

For agentic workflows, this makes a noticeable difference.

Example Agentic Workflow

Some practical examples:

Generate a modern React landing page
Refactor an existing Django project
Create API routes automatically
Explain complex codebases
Fix frontend bugs
Generate Tailwind layouts
Build automation scripts

Because everything runs locally, responses are fast and private.

Final Thoughts

Running powerful coding models locally is becoming easier than ever.

With:

Qwen2.5-Coder 14B
llama.cpp
CUDA acceleration
VS Code integrations

You can build a professional AI coding assistant entirely on your own hardware.

This setup delivers:

Privacy
Offline capability
No subscription fees
Full control
Excellent coding performance

As local AI models continue improving, agentic programming workflows will become increasingly powerful for developers.

Useful Commands Summary

Launch Server

~/llama.cpp/build/bin/llama-server \
-m ~/AI/models/qwen2.5-coder-14b-instruct-q4_k_m.gguf \
-ngl 999 \
-c 8192 \
--flash-attn \
--host 0.0.0.0 \
--port 8080

Verify GPU

nvidia-smi

Verify CUDA

nvcc --version

Download Model

hf download \
Qwen/Qwen2.5-Coder-14B-Instruct-GGUF \
qwen2.5-coder-14b-instruct-q4_k_m.gguf \
--local-dir ~/AI/models

Conclusion

Local AI development is entering an exciting era.

If you have a modern NVIDIA GPU, you can now run advanced coding agents directly on your desktop and integrate them seamlessly into your development workflow.

Gemma4 and Qwen2.5-Coder combined with llama.cpp and VS Code creates an impressive fully local AI programming environment suitable for both hobbyists and professional developers.

Happy coding.