Running AI Models Locally via Llama.cpp in fresh Ubuntu

Running AI Models via llama.cpp in Fresh Ubuntu (CUDA 13.1 + RTX 5070 Setup)

Artificial Intelligence is becoming more accessible than ever, and with powerful GPUs like the RTX 5070, you can now run advanced AI models completely locally on your Ubuntu machine without relying on cloud services.

In this guide, I’ll show you how I transformed a fresh Ubuntu installation into a full local AI workstation using:

NVIDIA CUDA 13.1
llama.cpp
RTX 5070 GPU acceleration
Hugging Face GGUF models
Gemma 4 Vision support
Localhost AI chat server

By the end of this tutorial, you’ll be able to run multimodal AI models directly on your Linux machine with full GPU offloading.

Why Run AI Models Locally?

Running models locally gives several advantages:

Complete privacy
No API costs
Lower latency
Offline usage
Full control over models
Open-source ecosystem

With modern quantized GGUF models, even consumer GPUs can deliver impressive AI performance.

System Used in This Tutorial

Hardware

NVIDIA RTX 5070
12GB VRAM
SSD storage

Operating System

Ubuntu 26.04 LTS

Installing NVIDIA Drivers on Ubuntu

First, detect available NVIDIA drivers:

ubuntu-drivers devices

Ubuntu recommended:

nvidia-driver-595-open

Install it:

sudo apt install nvidia-driver-595-open

Reboot after installation.

Verifying GPU Installation

Check that the GPU is detected properly:

nvidia-smi

If everything is working correctly, you should see your graphics card like RTX 5070 listed.

Installing CUDA 13.1

Ubuntu repositories may not contain the latest CUDA packages, so install CUDA directly from NVIDIA.

Download CUDA from:

https://developer.nvidia.com/cuda-downloads

After installation, verify CUDA:

nvcc --version

Fixing CUDA + GCC Compatibility Issues

While building llama.cpp, Ubuntu 26.04 introduced a compatibility issue involving:

rsqrt
rsqrtf

inside CUDA math headers.

The fix involved:

installing GCC 14
explicitly setting CUDA host compiler
patching CUDA headers with noexcept(true)

This resolved the build errors successfully.

Building llama.cpp with CUDA Support

Clone the repository:

git clone https://github.com/ggerganov/llama.cpp
cd llama.cpp

Configure build:

cmake -B build \
-DGGML_CUDA=ON \
-DCMAKE_BUILD_TYPE=Release \
-DCMAKE_C_COMPILER=/usr/bin/gcc-14 \
-DCMAKE_CXX_COMPILER=/usr/bin/g++-14 \
-DCMAKE_CUDA_HOST_COMPILER=/usr/bin/g++-14 \
-DCMAKE_CUDA_COMPILER=/usr/local/cuda/bin/nvcc \
-DCMAKE_CUDA_FLAGS="--compiler-bindir=/usr/bin/g++-14"

Compile:

cmake --build build -j$(nproc)

After successful compilation, binaries like:

llama-cli
llama-server

become available.

Organizing AI Models

I created a separate folder for models:

mkdir -p ~/AI/models

Keeping models separate from the source code helps maintain a cleaner setup.

Installing Hugging Face CLI

Using pipx is the cleanest method:

sudo apt install pipx
pipx ensurepath
source ~/.bashrc

Install Hugging Face CLI:

pipx install huggingface_hub

Downloading Gemma 4 GGUF Model

I used Gemma 4 E4B GGUF from Hugging Face.

Download command:

hf download \
lmstudio-community/gemma-4-E4B-it-GGUF \
gemma-4-E4B-it-Q4_K_M.gguf \
--local-dir ~/AI/models

Running Gemma 4 in Terminal Mode

Launch with full GPU offload:

~/llama.cpp/build/bin/llama-cli \
-m ~/AI/models/gemma-4-E4B-it-Q4_K_M.gguf \
-ngl 999 \
-c 16384

Explanation:

-ngl 999 → full GPU offload
-c 16384 → 16K context size

Running AI Models on Localhost

One of the best features of llama.cpp is the built-in local server.

Start the server:

~/llama.cpp/build/bin/llama-server \
-m ~/AI/models/gemma-4-E4B-it-Q4_K_M.gguf \
-ngl 999 \
-c 16384 \
--host 0.0.0.0 \
--port 8080

Open browser:

http://localhost:8080

Now you have a complete local AI chat interface running entirely on your machine.

Adding Vision Support (Multimodal AI)

To enable image understanding, download the projection model:

hf download \
lmstudio-community/gemma-4-E4B-it-GGUF \
mmproj-gemma-4-E4B-it-BF16.gguf \
--local-dir ~/AI/models

Run server with vision support:

~/llama.cpp/build/bin/llama-server \
-m ~/AI/models/gemma-4-E4B-it-Q4_K_M.gguf \
--mmproj ~/AI/models/mmproj-gemma-4-E4B-it-BF16.gguf \
-ngl 999 \
-c 16384 \
--host 0.0.0.0 \
--port 8080

You can now upload images directly into the localhost chat UI.

GPU Monitoring

Monitor GPU usage live:

watch -n 1 nvidia-smi

You should see:

VRAM allocation
CUDA utilization
active GPU inference

Performance on RTX 5070

The RTX 5070 handled Gemma 4 E4B extremely well.

Highlights:

smooth inference
full GPU offloading
responsive localhost chat
stable 16K context
efficient VRAM usage

This setup is powerful enough for:

coding assistants
local AI chat
vision analysis
multimodal workflows
offline AI tools

Final Thoughts

Local AI on Linux has become incredibly powerful.

With:

Ubuntu
CUDA
llama.cpp
GGUF models
RTX GPUs

you can now create your own private AI workstation completely independent of cloud providers.

The combination of:

full GPU acceleration
open-source models
localhost APIs
multimodal support

makes Linux one of the best platforms for AI enthusiasts and developers.

Watch the Full Video Tutorial

If you want the complete step-by-step walkthrough, including:

CUDA installation
fixing compiler issues
building llama.cpp
enabling Gemma Vision
localhost setup

watch the full video on my YouTube channel.

https://youtu.be/aaOLfjlvIPU