Build a Local AI Coding Agent via Llama.cpp and VSCode for Agentic Programming

Build a Local AI Coding Agent with Qwen2.5-Coder 14B / Gemma4, llama.cpp & VS Code (Ubuntu Tutorial)

Agentic programming is quickly becoming one of the most exciting areas in AI development. Instead of simple autocomplete, modern coding agents can understand projects, edit multiple files, refactor code, debug applications, and help automate software development workflows.

In this tutorial, we will build a fully local AI coding assistant using:

  • Qwen2.5-Coder 14B or Gemma4 E4B
  • llama.cpp
  • CUDA GPU acceleration
  • VS Code
  • Kilo Code extension

Everything will run locally on your own GPU without requiring cloud APIs or subscriptions.


What You Will Build

By the end of this tutorial, you will have:

  • A local OpenAI-compatible AI server
  • Full GPU accelerated inference using llama.cpp
  • Qwen2.5-Coder 14B or Gemma4 E4B running locally
  • VS Code connected to your local coding model
  • An agentic AI coding workflow

This setup is ideal for:

  • Software developers
  • AI enthusiasts
  • Privacy-focused workflows
  • Offline AI development
  • Building coding assistants

System Requirements

For this tutorial I used:

  • Ubuntu Linux
  • NVIDIA RTX 5070 12GB GPU
  • CUDA Toolkit installed
  • Python 3
  • VS Code

Recommended GPU VRAM:

Model Recommended VRAM
Qwen2.5-Coder 14B Q4_K_M 10GB–12GB
Qwen2.5-Coder 14B Q5_K_M 12GB+

Step 1 — Install CUDA Drivers

First ensure your NVIDIA drivers and CUDA toolkit are properly installed.

Verify using:

nvidia-smi

And:

nvcc --version

If your GPU appears correctly, you are ready to continue.


Step 2 — Clone and Build llama.cpp

Clone the official llama.cpp repository:

git clone https://github.com/ggml-org/llama.cpp
cd llama.cpp

Build with CUDA acceleration enabled:

cmake -B build \
-DGGML_CUDA=ON

cmake --build build -j

After compilation finishes, test it:

./build/bin/llama-cli --help

If you see the help output, llama.cpp is installed correctly.


Step 3 — Install Hugging Face CLI

We will use the Hugging Face CLI to download the GGUF model.

Install it:

pip install -U "huggingface_hub[cli]"

Login to your Hugging Face account:

hf auth login

Step 4 — Download Qwen2.5-Coder 14B GGUF

We will use the official GGUF repository from Qwen.

Download the Q4_K_M quantized model:

hf download \
Qwen/Qwen2.5-Coder-14B-Instruct-GGUF \
qwen2.5-coder-14b-instruct-q4_k_m.gguf \
--local-dir ~/AI/models

This quantization provides an excellent balance between:

  • Performance
  • VRAM usage
  • Coding quality
  • Speed

Step 5 — Run the Model with Full GPU Offloading

Now launch the model using llama-server.

Recommended command:

~/llama.cpp/build/bin/llama-server \
-m ~/AI/models/qwen2.5-coder-14b-instruct-q4_k_m.gguf \
-ngl 999 \
-c 8192 \
--host 0.0.0.0 \
--port 8080

Explanation:

Parameter Meaning
-ngl 999 Full GPU offloading
-c 8192 8K context length
--flash-attn Enables Flash Attention
--host 0.0.0.0 Allows LAN access
--port 8080 Runs server on port 8080

Why 16K Context Crashed on a 12GB GPU

Initially I tried running:

-c 16384

However, the server crashed with a CUDA out-of-memory error.

The reason is that larger context lengths require significantly more KV cache VRAM.

Approximate memory usage:

Context Size Extra VRAM Usage
4K Low
8K Moderate
16K Very High

The 14B model weights already consume most of the GPU memory, and the 16K KV cache pushed total VRAM usage beyond the available limit.

Reducing context to 8192 solved the issue while still providing good coding performance.


Step 6 — Open the Local Server

Once running successfully, open:

http://localhost:8080

The OpenAI-compatible API endpoint is:

http://localhost:8080/v1

This endpoint can be connected to AI coding extensions and applications.


Step 7 — Install Visual Studio Code

Download and install VS Code.

After installation, open the Extensions marketplace.


Step 8 — Install the Kilo Code Extension

Search for:

Kilo Code

Install the extension.

Kilo Code allows VS Code to connect to local AI models and enables agentic coding workflows.


Step 9 — Configure Kilo Code

Inside the extension settings configure:

Provider

OpenAI Compatible

Base URL

http://localhost:8080/v1

Model Name

qwen2.5-coder

API Key

anything

The API key is ignored locally by llama.cpp.


What is Agentic Programming?

Traditional autocomplete simply predicts the next few tokens.

Agentic programming is much more advanced.

Modern AI coding agents can:

  • Understand project structure
  • Edit multiple files
  • Plan implementation steps
  • Refactor applications
  • Debug errors
  • Generate components
  • Explain codebases
  • Maintain context over long workflows

This transforms VS Code into a collaborative AI development environment.


Performance of Qwen2.5-Coder 14B

Qwen2.5-Coder performs extremely well for:

  • Python
  • JavaScript
  • TypeScript
  • React
  • APIs
  • Refactoring
  • Multi-file reasoning
  • AI coding agents

Compared to smaller local models, the 14B variant provides:

  • Better reasoning
  • Better code quality
  • More accurate refactoring
  • Stronger instruction following

For agentic workflows, this makes a noticeable difference.


Example Agentic Workflow

Some practical examples:

  • Generate a modern React landing page
  • Refactor an existing Django project
  • Create API routes automatically
  • Explain complex codebases
  • Fix frontend bugs
  • Generate Tailwind layouts
  • Build automation scripts

Because everything runs locally, responses are fast and private.


Final Thoughts

Running powerful coding models locally is becoming easier than ever.

With:

  • Qwen2.5-Coder 14B
  • llama.cpp
  • CUDA acceleration
  • VS Code integrations

You can build a professional AI coding assistant entirely on your own hardware.

This setup delivers:

  • Privacy
  • Offline capability
  • No subscription fees
  • Full control
  • Excellent coding performance

As local AI models continue improving, agentic programming workflows will become increasingly powerful for developers.


Useful Commands Summary

Launch Server

~/llama.cpp/build/bin/llama-server \
-m ~/AI/models/qwen2.5-coder-14b-instruct-q4_k_m.gguf \
-ngl 999 \
-c 8192 \
--flash-attn \
--host 0.0.0.0 \
--port 8080

Verify GPU

nvidia-smi

Verify CUDA

nvcc --version

Download Model

hf download \
Qwen/Qwen2.5-Coder-14B-Instruct-GGUF \
qwen2.5-coder-14b-instruct-q4_k_m.gguf \
--local-dir ~/AI/models

Conclusion

Local AI development is entering an exciting era.

If you have a modern NVIDIA GPU, you can now run advanced coding agents directly on your desktop and integrate them seamlessly into your development workflow.

Gemma4 and Qwen2.5-Coder combined with llama.cpp and VS Code creates an impressive fully local AI programming environment suitable for both hobbyists and professional developers.

Happy coding.