Running AI Models Locally via Llama.cpp in fresh Ubuntu

Running AI Models via llama.cpp in Fresh Ubuntu (CUDA 13.1 + RTX 5070 Setup)

Artificial Intelligence is becoming more accessible than ever, and with powerful GPUs like the RTX 5070, you can now run advanced AI models completely locally on your Ubuntu machine without relying on cloud services.

In this guide, I’ll show you how I transformed a fresh Ubuntu installation into a full local AI workstation using:

  • NVIDIA CUDA 13.1
  • llama.cpp
  • RTX 5070 GPU acceleration
  • Hugging Face GGUF models
  • Gemma 4 Vision support
  • Localhost AI chat server

By the end of this tutorial, you’ll be able to run multimodal AI models directly on your Linux machine with full GPU offloading.


Why Run AI Models Locally?

Running models locally gives several advantages:

  • Complete privacy
  • No API costs
  • Lower latency
  • Offline usage
  • Full control over models
  • Open-source ecosystem

With modern quantized GGUF models, even consumer GPUs can deliver impressive AI performance.


System Used in This Tutorial

Hardware

  • NVIDIA RTX 5070
  • 12GB VRAM
  • SSD storage

Operating System

  • Ubuntu 26.04 LTS

Installing NVIDIA Drivers on Ubuntu

First, detect available NVIDIA drivers:

ubuntu-drivers devices

Ubuntu recommended:

nvidia-driver-595-open

Install it:

sudo apt install nvidia-driver-595-open

Reboot after installation.


Verifying GPU Installation

Check that the GPU is detected properly:

nvidia-smi

If everything is working correctly, you should see your graphics card like RTX 5070 listed.


Installing CUDA 13.1

Ubuntu repositories may not contain the latest CUDA packages, so install CUDA directly from NVIDIA.

Download CUDA from:

https://developer.nvidia.com/cuda-downloads

After installation, verify CUDA:

nvcc --version

Fixing CUDA + GCC Compatibility Issues

While building llama.cpp, Ubuntu 26.04 introduced a compatibility issue involving:

rsqrt
rsqrtf

inside CUDA math headers.

The fix involved:

  • installing GCC 14
  • explicitly setting CUDA host compiler
  • patching CUDA headers with noexcept(true)

This resolved the build errors successfully.


Building llama.cpp with CUDA Support

Clone the repository:

git clone https://github.com/ggerganov/llama.cpp
cd llama.cpp

Configure build:

cmake -B build \
-DGGML_CUDA=ON \
-DCMAKE_BUILD_TYPE=Release \
-DCMAKE_C_COMPILER=/usr/bin/gcc-14 \
-DCMAKE_CXX_COMPILER=/usr/bin/g++-14 \
-DCMAKE_CUDA_HOST_COMPILER=/usr/bin/g++-14 \
-DCMAKE_CUDA_COMPILER=/usr/local/cuda/bin/nvcc \
-DCMAKE_CUDA_FLAGS="--compiler-bindir=/usr/bin/g++-14"

Compile:

cmake --build build -j$(nproc)

After successful compilation, binaries like:

llama-cli
llama-server

become available.


Organizing AI Models

I created a separate folder for models:

mkdir -p ~/AI/models

Keeping models separate from the source code helps maintain a cleaner setup.


Installing Hugging Face CLI

Using pipx is the cleanest method:

sudo apt install pipx
pipx ensurepath
source ~/.bashrc

Install Hugging Face CLI:

pipx install huggingface_hub

Downloading Gemma 4 GGUF Model

I used Gemma 4 E4B GGUF from Hugging Face.

Download command:

hf download \
lmstudio-community/gemma-4-E4B-it-GGUF \
gemma-4-E4B-it-Q4_K_M.gguf \
--local-dir ~/AI/models

Running Gemma 4 in Terminal Mode

Launch with full GPU offload:

~/llama.cpp/build/bin/llama-cli \
-m ~/AI/models/gemma-4-E4B-it-Q4_K_M.gguf \
-ngl 999 \
-c 16384

Explanation:

  • -ngl 999 → full GPU offload
  • -c 16384 → 16K context size

Running AI Models on Localhost

One of the best features of llama.cpp is the built-in local server.

Start the server:

~/llama.cpp/build/bin/llama-server \
-m ~/AI/models/gemma-4-E4B-it-Q4_K_M.gguf \
-ngl 999 \
-c 16384 \
--host 0.0.0.0 \
--port 8080

Open browser:

http://localhost:8080

Now you have a complete local AI chat interface running entirely on your machine.


Adding Vision Support (Multimodal AI)

To enable image understanding, download the projection model:

hf download \
lmstudio-community/gemma-4-E4B-it-GGUF \
mmproj-gemma-4-E4B-it-BF16.gguf \
--local-dir ~/AI/models

Run server with vision support:

~/llama.cpp/build/bin/llama-server \
-m ~/AI/models/gemma-4-E4B-it-Q4_K_M.gguf \
--mmproj ~/AI/models/mmproj-gemma-4-E4B-it-BF16.gguf \
-ngl 999 \
-c 16384 \
--host 0.0.0.0 \
--port 8080

You can now upload images directly into the localhost chat UI.


GPU Monitoring

Monitor GPU usage live:

watch -n 1 nvidia-smi

You should see:

  • VRAM allocation
  • CUDA utilization
  • active GPU inference

Performance on RTX 5070

The RTX 5070 handled Gemma 4 E4B extremely well.

Highlights:

  • smooth inference
  • full GPU offloading
  • responsive localhost chat
  • stable 16K context
  • efficient VRAM usage

This setup is powerful enough for:

  • coding assistants
  • local AI chat
  • vision analysis
  • multimodal workflows
  • offline AI tools

Final Thoughts

Local AI on Linux has become incredibly powerful.

With:

  • Ubuntu
  • CUDA
  • llama.cpp
  • GGUF models
  • RTX GPUs

you can now create your own private AI workstation completely independent of cloud providers.

The combination of:

  • full GPU acceleration
  • open-source models
  • localhost APIs
  • multimodal support

makes Linux one of the best platforms for AI enthusiasts and developers.


Watch the Full Video Tutorial

If you want the complete step-by-step walkthrough, including:

  • CUDA installation
  • fixing compiler issues
  • building llama.cpp
  • enabling Gemma Vision
  • localhost setup

watch the full video on my YouTube channel.

https://youtu.be/aaOLfjlvIPU