Build a Local AI Coding Agent via Llama.cpp and VSCode for Agentic Programming
Build a Local AI Coding Agent with Qwen2.5-Coder 14B / Gemma4, llama.cpp & VS Code (Ubuntu Tutorial)
Agentic programming is quickly becoming one of the most exciting areas in AI development. Instead of simple autocomplete, modern coding agents can understand projects, edit multiple files, refactor code, debug applications, and help automate software development workflows.
In this tutorial, we will build a fully local AI coding assistant using:
- Qwen2.5-Coder 14B or Gemma4 E4B
- llama.cpp
- CUDA GPU acceleration
- VS Code
- Kilo Code extension
Everything will run locally on your own GPU without requiring cloud APIs or subscriptions.
What You Will Build
By the end of this tutorial, you will have:
- A local OpenAI-compatible AI server
- Full GPU accelerated inference using llama.cpp
- Qwen2.5-Coder 14B or Gemma4 E4B running locally
- VS Code connected to your local coding model
- An agentic AI coding workflow
This setup is ideal for:
- Software developers
- AI enthusiasts
- Privacy-focused workflows
- Offline AI development
- Building coding assistants
System Requirements
For this tutorial I used:
- Ubuntu Linux
- NVIDIA RTX 5070 12GB GPU
- CUDA Toolkit installed
- Python 3
- VS Code
Recommended GPU VRAM:
| Model | Recommended VRAM |
|---|---|
| Qwen2.5-Coder 14B Q4_K_M | 10GB–12GB |
| Qwen2.5-Coder 14B Q5_K_M | 12GB+ |
Step 1 — Install CUDA Drivers
First ensure your NVIDIA drivers and CUDA toolkit are properly installed.
Verify using:
nvidia-smi
And:
nvcc --version
If your GPU appears correctly, you are ready to continue.
Step 2 — Clone and Build llama.cpp
Clone the official llama.cpp repository:
git clone https://github.com/ggml-org/llama.cpp
cd llama.cpp
Build with CUDA acceleration enabled:
cmake -B build \
-DGGML_CUDA=ON
cmake --build build -j
After compilation finishes, test it:
./build/bin/llama-cli --help
If you see the help output, llama.cpp is installed correctly.
Step 3 — Install Hugging Face CLI
We will use the Hugging Face CLI to download the GGUF model.
Install it:
pip install -U "huggingface_hub[cli]"
Login to your Hugging Face account:
hf auth login
Step 4 — Download Qwen2.5-Coder 14B GGUF
We will use the official GGUF repository from Qwen.
Download the Q4_K_M quantized model:
hf download \
Qwen/Qwen2.5-Coder-14B-Instruct-GGUF \
qwen2.5-coder-14b-instruct-q4_k_m.gguf \
--local-dir ~/AI/models
This quantization provides an excellent balance between:
- Performance
- VRAM usage
- Coding quality
- Speed
Step 5 — Run the Model with Full GPU Offloading
Now launch the model using llama-server.
Recommended command:
~/llama.cpp/build/bin/llama-server \
-m ~/AI/models/qwen2.5-coder-14b-instruct-q4_k_m.gguf \
-ngl 999 \
-c 8192 \
--host 0.0.0.0 \
--port 8080
Explanation:
| Parameter | Meaning |
|---|---|
-ngl 999 |
Full GPU offloading |
-c 8192 |
8K context length |
--flash-attn |
Enables Flash Attention |
--host 0.0.0.0 |
Allows LAN access |
--port 8080 |
Runs server on port 8080 |
Why 16K Context Crashed on a 12GB GPU
Initially I tried running:
-c 16384
However, the server crashed with a CUDA out-of-memory error.
The reason is that larger context lengths require significantly more KV cache VRAM.
Approximate memory usage:
| Context Size | Extra VRAM Usage |
|---|---|
| 4K | Low |
| 8K | Moderate |
| 16K | Very High |
The 14B model weights already consume most of the GPU memory, and the 16K KV cache pushed total VRAM usage beyond the available limit.
Reducing context to 8192 solved the issue while still providing good coding performance.
Step 6 — Open the Local Server
Once running successfully, open:
http://localhost:8080
The OpenAI-compatible API endpoint is:
http://localhost:8080/v1
This endpoint can be connected to AI coding extensions and applications.
Step 7 — Install Visual Studio Code
Download and install VS Code.
After installation, open the Extensions marketplace.
Step 8 — Install the Kilo Code Extension
Search for:
Kilo Code
Install the extension.
Kilo Code allows VS Code to connect to local AI models and enables agentic coding workflows.
Step 9 — Configure Kilo Code
Inside the extension settings configure:
Provider
OpenAI Compatible
Base URL
http://localhost:8080/v1
Model Name
qwen2.5-coder
API Key
anything
The API key is ignored locally by llama.cpp.
What is Agentic Programming?
Traditional autocomplete simply predicts the next few tokens.
Agentic programming is much more advanced.
Modern AI coding agents can:
- Understand project structure
- Edit multiple files
- Plan implementation steps
- Refactor applications
- Debug errors
- Generate components
- Explain codebases
- Maintain context over long workflows
This transforms VS Code into a collaborative AI development environment.
Performance of Qwen2.5-Coder 14B
Qwen2.5-Coder performs extremely well for:
- Python
- JavaScript
- TypeScript
- React
- APIs
- Refactoring
- Multi-file reasoning
- AI coding agents
Compared to smaller local models, the 14B variant provides:
- Better reasoning
- Better code quality
- More accurate refactoring
- Stronger instruction following
For agentic workflows, this makes a noticeable difference.
Example Agentic Workflow
Some practical examples:
- Generate a modern React landing page
- Refactor an existing Django project
- Create API routes automatically
- Explain complex codebases
- Fix frontend bugs
- Generate Tailwind layouts
- Build automation scripts
Because everything runs locally, responses are fast and private.
Final Thoughts
Running powerful coding models locally is becoming easier than ever.
With:
- Qwen2.5-Coder 14B
- llama.cpp
- CUDA acceleration
- VS Code integrations
You can build a professional AI coding assistant entirely on your own hardware.
This setup delivers:
- Privacy
- Offline capability
- No subscription fees
- Full control
- Excellent coding performance
As local AI models continue improving, agentic programming workflows will become increasingly powerful for developers.
Useful Commands Summary
Launch Server
~/llama.cpp/build/bin/llama-server \
-m ~/AI/models/qwen2.5-coder-14b-instruct-q4_k_m.gguf \
-ngl 999 \
-c 8192 \
--flash-attn \
--host 0.0.0.0 \
--port 8080
Verify GPU
nvidia-smi
Verify CUDA
nvcc --version
Download Model
hf download \
Qwen/Qwen2.5-Coder-14B-Instruct-GGUF \
qwen2.5-coder-14b-instruct-q4_k_m.gguf \
--local-dir ~/AI/models
Conclusion
Local AI development is entering an exciting era.
If you have a modern NVIDIA GPU, you can now run advanced coding agents directly on your desktop and integrate them seamlessly into your development workflow.
Gemma4 and Qwen2.5-Coder combined with llama.cpp and VS Code creates an impressive fully local AI programming environment suitable for both hobbyists and professional developers.
Happy coding.
Popular Tags