Running AI Models Locally via Llama.cpp in fresh Ubuntu
Running AI Models via llama.cpp in Fresh Ubuntu (CUDA 13.1 + RTX 5070 Setup)
Artificial Intelligence is becoming more accessible than ever, and with powerful GPUs like the RTX 5070, you can now run advanced AI models completely locally on your Ubuntu machine without relying on cloud services.
In this guide, I’ll show you how I transformed a fresh Ubuntu installation into a full local AI workstation using:
- NVIDIA CUDA 13.1
- llama.cpp
- RTX 5070 GPU acceleration
- Hugging Face GGUF models
- Gemma 4 Vision support
- Localhost AI chat server
By the end of this tutorial, you’ll be able to run multimodal AI models directly on your Linux machine with full GPU offloading.
Why Run AI Models Locally?
Running models locally gives several advantages:
- Complete privacy
- No API costs
- Lower latency
- Offline usage
- Full control over models
- Open-source ecosystem
With modern quantized GGUF models, even consumer GPUs can deliver impressive AI performance.
System Used in This Tutorial
Hardware
- NVIDIA RTX 5070
- 12GB VRAM
- SSD storage
Operating System
- Ubuntu 26.04 LTS
Installing NVIDIA Drivers on Ubuntu
First, detect available NVIDIA drivers:
ubuntu-drivers devices
Ubuntu recommended:
nvidia-driver-595-open
Install it:
sudo apt install nvidia-driver-595-open
Reboot after installation.
Verifying GPU Installation
Check that the GPU is detected properly:
nvidia-smi
If everything is working correctly, you should see your graphics card like RTX 5070 listed.
Installing CUDA 13.1
Ubuntu repositories may not contain the latest CUDA packages, so install CUDA directly from NVIDIA.
Download CUDA from:
https://developer.nvidia.com/cuda-downloads
After installation, verify CUDA:
nvcc --version
Fixing CUDA + GCC Compatibility Issues
While building llama.cpp, Ubuntu 26.04 introduced a compatibility issue involving:
rsqrt
rsqrtf
inside CUDA math headers.
The fix involved:
- installing GCC 14
- explicitly setting CUDA host compiler
- patching CUDA headers with
noexcept(true)
This resolved the build errors successfully.
Building llama.cpp with CUDA Support
Clone the repository:
git clone https://github.com/ggerganov/llama.cpp
cd llama.cpp
Configure build:
cmake -B build \
-DGGML_CUDA=ON \
-DCMAKE_BUILD_TYPE=Release \
-DCMAKE_C_COMPILER=/usr/bin/gcc-14 \
-DCMAKE_CXX_COMPILER=/usr/bin/g++-14 \
-DCMAKE_CUDA_HOST_COMPILER=/usr/bin/g++-14 \
-DCMAKE_CUDA_COMPILER=/usr/local/cuda/bin/nvcc \
-DCMAKE_CUDA_FLAGS="--compiler-bindir=/usr/bin/g++-14"
Compile:
cmake --build build -j$(nproc)
After successful compilation, binaries like:
llama-cli
llama-server
become available.
Organizing AI Models
I created a separate folder for models:
mkdir -p ~/AI/models
Keeping models separate from the source code helps maintain a cleaner setup.
Installing Hugging Face CLI
Using pipx is the cleanest method:
sudo apt install pipx
pipx ensurepath
source ~/.bashrc
Install Hugging Face CLI:
pipx install huggingface_hub
Downloading Gemma 4 GGUF Model
I used Gemma 4 E4B GGUF from Hugging Face.
Download command:
hf download \
lmstudio-community/gemma-4-E4B-it-GGUF \
gemma-4-E4B-it-Q4_K_M.gguf \
--local-dir ~/AI/models
Running Gemma 4 in Terminal Mode
Launch with full GPU offload:
~/llama.cpp/build/bin/llama-cli \
-m ~/AI/models/gemma-4-E4B-it-Q4_K_M.gguf \
-ngl 999 \
-c 16384
Explanation:
-ngl 999→ full GPU offload-c 16384→ 16K context size
Running AI Models on Localhost
One of the best features of llama.cpp is the built-in local server.
Start the server:
~/llama.cpp/build/bin/llama-server \
-m ~/AI/models/gemma-4-E4B-it-Q4_K_M.gguf \
-ngl 999 \
-c 16384 \
--host 0.0.0.0 \
--port 8080
Open browser:
http://localhost:8080
Now you have a complete local AI chat interface running entirely on your machine.
Adding Vision Support (Multimodal AI)
To enable image understanding, download the projection model:
hf download \
lmstudio-community/gemma-4-E4B-it-GGUF \
mmproj-gemma-4-E4B-it-BF16.gguf \
--local-dir ~/AI/models
Run server with vision support:
~/llama.cpp/build/bin/llama-server \
-m ~/AI/models/gemma-4-E4B-it-Q4_K_M.gguf \
--mmproj ~/AI/models/mmproj-gemma-4-E4B-it-BF16.gguf \
-ngl 999 \
-c 16384 \
--host 0.0.0.0 \
--port 8080
You can now upload images directly into the localhost chat UI.
GPU Monitoring
Monitor GPU usage live:
watch -n 1 nvidia-smi
You should see:
- VRAM allocation
- CUDA utilization
- active GPU inference
Performance on RTX 5070
The RTX 5070 handled Gemma 4 E4B extremely well.
Highlights:
- smooth inference
- full GPU offloading
- responsive localhost chat
- stable 16K context
- efficient VRAM usage
This setup is powerful enough for:
- coding assistants
- local AI chat
- vision analysis
- multimodal workflows
- offline AI tools
Final Thoughts
Local AI on Linux has become incredibly powerful.
With:
- Ubuntu
- CUDA
- llama.cpp
- GGUF models
- RTX GPUs
you can now create your own private AI workstation completely independent of cloud providers.
The combination of:
- full GPU acceleration
- open-source models
- localhost APIs
- multimodal support
makes Linux one of the best platforms for AI enthusiasts and developers.
Watch the Full Video Tutorial
If you want the complete step-by-step walkthrough, including:
- CUDA installation
- fixing compiler issues
- building llama.cpp
- enabling Gemma Vision
- localhost setup
watch the full video on my YouTube channel.
https://youtu.be/aaOLfjlvIPU
Popular Tags