Getting Started with Local AI: Run LLMs on Your Own Hardware
Step-by-step guide to running AI models locally. Install Ollama, pick the right model for your hardware, and start generating in under 5 minutes.
Running AI on your own hardware used to require a PhD, a rack of servers, and months of setup. Today, you can go from zero to chatting with a large language model in under five minutes — no API keys, no data leaving your machine, no per-token costs.
Whether you care about privacy, want offline access on a plane, or just love tinkering, local AI is more accessible than ever. This guide walks you through everything: checking your hardware, installing a runtime, picking your first model, and actually running it. By the end, you will have a working local AI setup and a clear picture of where to go next.
What You Need
You do not need a supercomputer. What matters most is how much memory your GPU (or Mac) has available. Here is a practical breakdown:
Minimum — 8 GB VRAM GPU or 16 GB RAM Mac
This gets you into the game. You can comfortably run 3B to 7B parameter models, which are genuinely useful for chatting, summarizing, and light coding help. An RTX 3060 (12 GB), an M1 MacBook Pro, or even an older GTX 1080 Ti (11 GB) all fall in this tier.
Recommended — 12–24 GB VRAM
This is the sweet spot for most people. An RTX 3090 (24 GB), RTX 4080 (16 GB), or M2 Pro Mac fits here. You can run 8B to 14B models with excellent quality and reasonable speed, and even push into 30B territory with aggressive quantization.
Ideal — 24 GB+ VRAM or 32 GB+ unified memory
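At this level you can run the largest open models, 70B class and above, that genuinely rival cloud APIs. A 70B model needs roughly 40 GB at 4-bit quantization, so it fits entirely in memory on a Mac with 64 GB or more of unified memory, such as an M3 Ultra Mac Studio with 96 GB. A single 24 GB card such as the RTX 4090 or RTX 3090 Ti runs 30B-class models comfortably and can run 70B models only with partial CPU offload or very aggressive quantization.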
Not sure where your hardware falls? Use our VRAM calculator — it auto-detects your GPU and tells you exactly which models will fit.
Step 1: Check Your Hardware
Before downloading anything, spend 60 seconds understanding your system. The key question: how much VRAM (or unified memory) do you have?
Quick ways to find out:
- Windows: Open Task Manager → Performance → GPU. You will see "Dedicated GPU memory."
- macOS: Apple menu → About This Mac → More Info → System Report → Graphics. Unified memory is shared between the CPU and GPU, so most of your total RAM counts; macOS keeps a portion back for the system.
- Linux: Run nvidia-smi for NVIDIA cards, or rocm-smi for AMD.
Alternatively, use our hardware browser to find your exact GPU or Mac model and see pre-computed capability profiles. Our VRAM calculator goes further — it auto-detects your GPU via WebGPU and shows you which models fit, which are marginal, and which to skip entirely.
One number to remember: your VRAM in gigabytes. Everything else follows from that.
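If you would rather read that number from a script, here is a minimal sketch for NVIDIA cards. It assumes nvidia-smi is on your PATH and uses its CSV query mode; the helper names are illustrative, and AMD or Apple hardware would need a different probe.

```python
import shutil
import subprocess


def parse_nvidia_smi(csv_text: str) -> list:
    """Parse 'name, memory.total' CSV rows (memory in MiB) into (name, GiB) pairs."""
    gpus = []
    for line in csv_text.strip().splitlines():
        name, mem_mib = (part.strip() for part in line.rsplit(",", 1))
        gpus.append((name, round(float(mem_mib) / 1024, 1)))
    return gpus


def detect_nvidia_vram() -> list:
    """Return detected NVIDIA GPUs, or an empty list if nvidia-smi is unavailable."""
    if shutil.which("nvidia-smi") is None:
        return []
    try:
        result = subprocess.run(
            ["nvidia-smi", "--query-gpu=name,memory.total",
             "--format=csv,noheader,nounits"],
            capture_output=True, text=True, check=True,
        )
    except subprocess.CalledProcessError:
        return []  # driver problem or no usable GPU
    return parse_nvidia_smi(result.stdout)


if __name__ == "__main__":
    for name, vram_gb in detect_nvidia_vram():
        print(f"{name}: {vram_gb} GB VRAM")
```

An empty result means either no NVIDIA GPU or no working driver, in which case the manual checks above still apply.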
Step 2: Install a Runtime
A "runtime" is the software that loads an AI model and runs inference. Think of it as the engine — the model is the fuel. There are several good options; here is what each excels at:
Ollama (Recommended for Most Users)
Ollama is the fastest path from zero to running. It handles model downloads, ships each model with a sensible default quantization, detects your GPU, and serves a local API, all behind a single command-line interface.
# macOS / Linux
curl -fsSL https://ollama.com/install.sh | sh
# Windows: download the installer from ollama.com
After installation, Ollama runs as a background service. You interact with it through the ollama CLI or its REST API at http://localhost:11434, which also exposes OpenAI-compatible endpoints under /v1. It automatically uses your GPU if available and falls back to CPU otherwise.
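A quick way to confirm the background service is up is to ask it which models are downloaded. The sketch below assumes the default address and Ollama's native /api/tags endpoint; the function names are illustrative.

```python
import json
import urllib.request
from typing import Optional

OLLAMA_URL = "http://localhost:11434"  # default address; adjust if you changed it


def model_names(tags_response: dict) -> list:
    """Extract model names from the JSON body returned by GET /api/tags."""
    return [m["name"] for m in tags_response.get("models", [])]


def list_local_models(base_url: str = OLLAMA_URL, timeout: float = 2.0) -> Optional[list]:
    """Return downloaded model names, or None if the server is unreachable."""
    try:
        with urllib.request.urlopen(f"{base_url}/api/tags", timeout=timeout) as resp:
            return model_names(json.load(resp))
    except (OSError, ValueError):  # connection refused, timeout, or invalid JSON
        return None


if __name__ == "__main__":
    models = list_local_models()
    if models is None:
        print("Ollama does not appear to be running on", OLLAMA_URL)
    else:
        print("Downloaded models:", models or "(none yet)")
```

A fresh install returns an empty list, which is expected until you pull your first model.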
Why Ollama wins for beginners: you never have to think about model formats, quantization levels, or GPU configuration. It just works.
LM Studio
LM Studio is the best option if you prefer a graphical interface. It provides a drag-and-drop model browser, a built-in chat UI, and one-click downloads from Hugging Face. If the terminal feels intimidating, start here.
Download it from lmstudio.ai. It is free for personal use on macOS, Windows, and Linux.
llama.cpp
The engine that powers most local AI tooling under the hood. Using llama.cpp directly gives you maximum control and performance: you manage model files manually and pass command-line flags to tune every parameter. It is ideal if you want to squeeze the last bit of speed out of your hardware or run on unusual setups.
# Build from source (requires cmake and a C++ compiler)
git clone https://github.com/ggerganov/llama.cpp
cd llama.cpp
cmake -B build -DGGML_CUDA=ON # or GGML_METAL for Apple Silicon
cmake --build build --config Release
vLLM
vLLM is designed for serving models to multiple users simultaneously. It has PagedAttention and continuous batching built in, making it the right choice if you are building an API service rather than chatting interactively. It requires a CUDA GPU and is more complex to set up, but delivers exceptional throughput.
Bottom line for getting started: install Ollama. You can always add other runtimes later.
Step 3: Pick Your First Model
Choosing your first model is where people get stuck — there are thousands to pick from. Here is a practical map.
By Hardware
| Your VRAM | Recommended First Models |
|---|---|
| 8 GB | ollama run llama3.2:3b or ollama run phi4-mini |
| 12 GB | ollama run llama3.1:8b or ollama run qwen3:8b |
| 16 GB | ollama run qwen3:14b or ollama run phi4:14b |
| 24 GB | ollama run qwen3:30b-a3b or ollama run deepseek-r1:32b |
These recommendations lean toward models that punch above their weight at each tier — fast, capable, and well-supported in Ollama's model library.
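The table can be read as a simple lookup: find the largest tier your VRAM reaches and take that row. Here is a small sketch of that rule, with the thresholds and model tags copied from the table; the function name is illustrative.

```python
# Thresholds (GB of VRAM) and model tags copied from the table above,
# ordered from the largest tier down.
FIRST_MODELS = [
    (24, ["qwen3:30b-a3b", "deepseek-r1:32b"]),
    (16, ["qwen3:14b", "phi4:14b"]),
    (12, ["llama3.1:8b", "qwen3:8b"]),
    (8, ["llama3.2:3b", "phi4-mini"]),
]


def first_models(vram_gb: float) -> list:
    """Return the table row for the largest tier that vram_gb reaches."""
    for min_vram, models in FIRST_MODELS:
        if vram_gb >= min_vram:
            return models
    return []  # below 8 GB the table makes no recommendation


print(first_models(12))  # ['llama3.1:8b', 'qwen3:8b']
```

A card that falls between tiers, such as a 10 GB GPU, gets the row below it, which leaves headroom for the KV cache.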
By Use Case
General chatting and Q&A
Llama 3.1 8B is the workhorse of local AI. Meta's instruction-tuned model is friendly, capable, and runs well on mid-range hardware. Qwen 3 8B is a strong alternative with excellent multilingual support.
ollama run llama3.1:8b
Coding assistance
Qwen 3 Coder 30B A3B is a mixture-of-experts model that fits in 24 GB of VRAM despite its size. Devstral Small 24B from Mistral is another excellent choice, purpose-built for agentic coding tasks.
ollama run qwen3-coder:30b
Reasoning and problem-solving
DeepSeek R1 comes in multiple sizes and is one of the best reasoning models available locally. The 7B version runs on modest hardware; the 32B version (if your VRAM allows) delivers results that rival much larger models.
ollama run deepseek-r1:7b
Not sure which to pick? Browse all models with filtering by VRAM requirement, use case, and provider. Our fit scores tell you at a glance whether a model will run comfortably, marginally, or not at all on your hardware.
Step 4: Run It
With Ollama installed and a model in mind, you are one command away from a working local AI:
# Download the model and start an interactive chat
ollama run llama3.1:8b
The first time you run this command, Ollama downloads the model. For Llama 3.1 8B in Q4 quantization, that is about 4.7 GB. On a typical broadband connection, expect 3–5 minutes.
Once downloaded, you get an interactive prompt:
>>> Tell me about the Rust programming language
Type your message and press Enter. The model starts generating immediately.
Rust is a systems programming language focused on three goals: safety, speed,
and concurrency. It accomplishes these goals without a garbage collector...
What is happening under the hood: Ollama loads the GGUF model weights into your GPU's VRAM (or system RAM if no GPU is detected), initializes the KV cache, and begins autoregressive token generation. Each token is sampled from the probability distribution produced by a forward pass through the transformer. The speed you see — tokens per second — is determined primarily by your memory bandwidth, not raw compute.
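The memory-bandwidth point can be turned into a rough ceiling. Generating one token requires reading essentially all of the weights once, so bandwidth divided by model size gives an upper bound on tokens per second. This is a back-of-the-envelope sketch, not a benchmark: the 936 GB/s figure is the RTX 3090's published memory bandwidth, and real speeds land below the bound.

```python
def max_tokens_per_second(bandwidth_gb_s: float, model_size_gb: float) -> float:
    """Upper bound on speed: each generated token reads all weights from memory once."""
    return bandwidth_gb_s / model_size_gb


# Illustrative inputs: roughly 936 GB/s of memory bandwidth (RTX 3090) and the
# ~4.7 GB Llama 3.1 8B Q4 download mentioned in this step.
print(round(max_tokens_per_second(936, 4.7)))  # 199
```

The same formula explains why CPU inference is slow: typical desktop system memory offers a small fraction of that bandwidth.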
To exit the chat, type /bye or press Ctrl+D.
Other useful Ollama commands:
# List downloaded models
ollama list
# Download a model without starting chat
ollama pull mistral:7b
# Remove a model
ollama rm llama3.1:8b
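# Run a single prompt non-interactively (the quoted text is the user message, not a system prompt)
ollama run llama3.1:8b "Summarize what a KV cache does in one sentence."
# To set a system prompt, start a chat and enter: /set system "You are a concise coding assistant."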
# Use the API directly (OpenAI-compatible)
curl http://localhost:11434/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "llama3.1:8b",
"messages": [{"role": "user", "content": "Hello!"}]
}'
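The same request can be made from a script. The sketch below mirrors the curl call using only the Python standard library and shows where a system prompt goes in the OpenAI-style message list. The function names are illustrative, and the call itself assumes a running Ollama server with the model already pulled.

```python
import json
import urllib.request

API_URL = "http://localhost:11434/v1/chat/completions"


def build_chat_payload(model: str, user_message: str, system_prompt: str = "") -> dict:
    """Build an OpenAI-style chat request body; the system message is optional."""
    messages = []
    if system_prompt:
        messages.append({"role": "system", "content": system_prompt})
    messages.append({"role": "user", "content": user_message})
    return {"model": model, "messages": messages}


def chat(model: str, user_message: str, system_prompt: str = "") -> str:
    """Send one chat turn to the local server and return the reply text."""
    body = json.dumps(build_chat_payload(model, user_message, system_prompt)).encode()
    request = urllib.request.Request(
        API_URL, data=body, headers={"Content-Type": "application/json"}
    )
    with urllib.request.urlopen(request) as resp:
        return json.load(resp)["choices"][0]["message"]["content"]


# Requires a running Ollama server with the model already pulled:
# print(chat("llama3.1:8b", "Hello!", "You are a concise coding assistant."))
```

Because the endpoint follows the OpenAI request shape, existing OpenAI client code can usually be pointed at it by changing only the base URL.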
Understanding Model Names
Ollama's model library uses a consistent naming pattern that is worth understanding so you can navigate it confidently.
The number (3B, 8B, 14B, 70B): This is the parameter count, in billions. More parameters generally means higher quality and more knowledge, but also more VRAM required and slower generation. A 70B model is not 10x better than a 7B, because quality improves with diminishing returns as size grows, but the difference is noticeable, especially on complex tasks.
Capability suffixes:
- latest (or no suffix): the default variant, usually a balanced instruct-tuned model
- instruct: fine-tuned for following instructions and chat; what you want for most tasks
- code: specialized for programming tasks; often outperforms general models on code
- vision: supports image inputs alongside text
Quantization tags (more on these below):
- q4_K_M: 4-bit quantization, medium variant; popular default
- q8_0: 8-bit quantization; higher quality, more VRAM
- fp16: full 16-bit precision; highest quality, roughly 4x the VRAM of Q4
So ollama run llama3.1:8b-instruct-q4_K_M means: Llama 3.1, 8 billion parameters, instruction-tuned, 4-bit quantized. Ollama picks a sensible default when you do not specify a tag.
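To make the pattern concrete, here is a small parser that splits a name of the shape family:size-variant-quant into labeled parts. It is a sketch of the convention described above, not an official grammar: the regular expressions are assumptions, and tags that do not follow the pattern will be labeled loosely.

```python
import re

# Assumed shapes: sizes look like "8b" or "1.5b"; quant tags look like "q4_K_M" or "fp16".
SIZE_PATTERN = re.compile(r"^\d+(\.\d+)?[bm]$", re.IGNORECASE)
QUANT_PATTERN = re.compile(r"^(q\d\w*|fp16|f16|bf16)$", re.IGNORECASE)


def parse_model_tag(name: str) -> dict:
    """Split 'family:size-variant-quant' into its labeled parts."""
    family, _, tag = name.partition(":")
    parts = {"family": family, "size": None, "variant": None, "quant": None}
    pieces = tag.split("-") if tag else []
    for piece in pieces:
        if SIZE_PATTERN.match(piece) and parts["size"] is None:
            parts["size"] = piece
        elif QUANT_PATTERN.match(piece):
            parts["quant"] = piece
        else:
            parts["variant"] = piece  # e.g. "instruct", or "a3b" in MoE names
    return parts


print(parse_model_tag("llama3.1:8b-instruct-q4_K_M"))
# {'family': 'llama3.1', 'size': '8b', 'variant': 'instruct', 'quant': 'q4_K_M'}
```

A bare name such as llama3.1 yields only the family, which corresponds to letting Ollama use its default tag.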
What About Quantization?
Quantization is how the community fits large models into practical amounts of memory. The core idea: instead of storing each model weight as a 16-bit float (2 bytes), you compress it to 4 bits (0.5 bytes). A model that needed 16 GB now needs 4 GB.
The trade-off is a small quality reduction. For most everyday tasks, 4-bit quantization is hard to tell apart from full precision. For highly technical or mathematical reasoning you may notice the difference, in which case move up to Q6 or Q8 if your VRAM allows.
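The arithmetic behind that compression is simple enough to check yourself: parameters times bits per weight, divided by 8 bits per byte. The sketch below covers the weights only and ignores the KV cache and runtime overhead; the function name is illustrative.

```python
def weights_gb(params_billion: float, bits_per_weight: float) -> float:
    """Approximate size of the weights alone: parameters x bits per weight / 8."""
    return params_billion * bits_per_weight / 8


print(weights_gb(8, 16))  # 16.0 (GB at 16-bit)
print(weights_gb(8, 4))   # 4.0 (GB at 4-bit)
```

Real 4-bit files come out somewhat larger than this estimate because formats such as Q4_K_M store scaling data alongside the weights and keep some tensors at higher precision.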
Practical guide to picking a quant:
| Quantization | Approx. Size (8B model) | Quality | When to Use |
|---|---|---|---|
| Q4_K_M | ~4.7 GB | Very good | Default choice; best size-quality tradeoff |
| Q5_K_M | ~5.7 GB | Better | When you have VRAM to spare |
| Q6_K | ~6.6 GB | Near lossless | For quality-sensitive tasks |
| Q8_0 | ~8.5 GB | Excellent | Maximum quality at 8-bit |
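Ollama does not pick a quantization based on your available memory. Each model tag maps to a fixed build, and the default tag is typically a 4-bit quantization such as Q4_K_M. To use a different level, name it explicitly in the tag when you run ollama pull or ollama run. You do not have to think about this to get started.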
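If you want to choose a level yourself, one reasonable rule is to take the highest-quality quant whose file fits in VRAM after setting aside some headroom. The sketch below applies that rule to the sizes quoted in the table above. The 1.5 GB reserve for the KV cache and runtime overhead is an assumption, not a measured value.

```python
# Sizes in GB copied from the table above, ordered from highest to lowest quality.
QUANT_SIZES_GB = [("Q8_0", 8.5), ("Q6_K", 6.6), ("Q5_K_M", 5.7), ("Q4_K_M", 4.7)]


def best_quant(vram_gb: float, reserve_gb: float = 1.5):
    """Return the highest-quality quant that fits after the reserve, or None.

    reserve_gb is an assumed allowance for the KV cache and runtime overhead.
    """
    budget = vram_gb - reserve_gb
    for name, size_gb in QUANT_SIZES_GB:
        if size_gb <= budget:
            return name
    return None


print(best_quant(8))   # Q5_K_M
print(best_quant(12))  # Q8_0
```

Long context windows need a larger reserve, so treat the result as a starting point and step down one level if loading fails.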
For a deeper dive — including how quantization works mathematically and how to compare GGUF quant types — read our GGUF Quantization Explained guide.
Common Issues and Fixes
"Out of memory" or model fails to load
Your model is too large for your VRAM. Solutions in order of preference:
- Try a smaller quantization: if you are running Q8, switch to Q4_K_M
- Try a smaller model: if 13B is failing, try 7B or 8B
- Enable GPU offloading with CPU fallback (available in llama.cpp directly — Ollama does this automatically)
# Ollama auto-manages this, but for llama.cpp you can specify layers to offload:
./llama-cli -m model.gguf --n-gpu-layers 33
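To choose a starting value for --n-gpu-layers, you can estimate how many layers fit in the VRAM you have free. This is a rough sketch that assumes all layers are the same size and reserves 1 GB for the KV cache and buffers. Both are simplifications, so treat the result as a first guess to adjust.

```python
import math


def gpu_layers_that_fit(model_gb: float, n_layers: int,
                        free_vram_gb: float, reserve_gb: float = 1.0) -> int:
    """Estimate a --n-gpu-layers value, assuming equally sized layers."""
    per_layer_gb = model_gb / n_layers
    usable_gb = max(free_vram_gb - reserve_gb, 0.0)
    return min(n_layers, math.floor(usable_gb / per_layer_gb))


# Illustrative: a 4.7 GB model with 33 offloadable layers on a card with 4 GB free.
print(gpu_layers_that_fit(4.7, 33, 4.0))  # 21
```

If the model still fails to load at the estimated value, lower the number by a few layers; the remaining layers run on the CPU.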
"Generation is extremely slow" (1-2 tokens per second)
Your model is likely running on CPU instead of GPU. Check:
# For NVIDIA, verify GPU is being used:
nvidia-smi
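# Show loaded models and whether they run on GPU or CPU (PROCESSOR column):
ollama ps
# On Linux with systemd, the server logs are in the journal:
journalctl -u ollama --no-pager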
If Ollama is not using your GPU, ensure you have the right drivers installed. For NVIDIA: CUDA 12+. For AMD on Linux: ROCm 6+.
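To put a number on "slow", you can measure generation speed directly. The sketch below assumes Ollama's native /api/generate endpoint, whose non-streaming response reports eval_count (tokens generated) and eval_duration (in nanoseconds). The function names are illustrative, and the call needs a running server with the model pulled.

```python
import json
import urllib.request


def tokens_per_second(response: dict) -> float:
    """Compute generation speed from eval_count and eval_duration (nanoseconds)."""
    return response["eval_count"] / (response["eval_duration"] / 1e9)


def measure_speed(model: str, prompt: str = "Say hello.") -> float:
    """Run one non-streaming generation against the local server and time it."""
    body = json.dumps({"model": model, "prompt": prompt, "stream": False}).encode()
    request = urllib.request.Request(
        "http://localhost:11434/api/generate",
        data=body, headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(request) as resp:
        return tokens_per_second(json.load(resp))


# Requires a running Ollama server with the model already pulled:
# print(f"{measure_speed('llama3.1:8b'):.1f} tokens/s")
```

A result in the low single digits on a machine with a capable GPU is a strong hint that the model is running on the CPU.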
"The model's answers seem low quality or confused"
A few things to try:
- Switch to a higher-precision quantization (Q6 or Q8 instead of Q4)
- Try a larger parameter count if your VRAM allows
- Check that you are using an instruct-tuned model, not a base model
- Write a clearer system prompt
"Ollama is not found after installation"
On Linux, Ollama installs to /usr/local/bin/ollama. If your shell cannot find it, run:
source ~/.bashrc # or ~/.zshrc
# or
export PATH=$PATH:/usr/local/bin
Privacy: The Real Advantage of Local AI
It is worth pausing on what "local" actually means for your data. When you use a cloud AI service, your prompts travel to a remote server, get processed, and a response comes back. Your conversation may be logged, used for training, or subject to the provider's data retention policies.
With local AI, nothing leaves your machine. The model weights sit on your disk and inference happens on your own GPU. There is no remote API call and no third-party telemetry or usage log; any chat history or server logs stay on your own disk. This makes local AI the right choice for:
- Working with confidential documents or proprietary code
- Healthcare or legal use cases with strict data requirements
- Offline environments (flights, remote locations, air-gapped networks)
- Anyone who simply does not want their prompts analyzed
Once your model is downloaded, you can disable your internet connection entirely and everything keeps working.
Next Steps
You now have a working local AI setup. Here is how to go deeper:
Explore the model landscape
Browse all models on WillItRunAI — filtered by your hardware, sorted by fit score. Every model page shows VRAM requirements, generation speed estimates, and direct Ollama commands.
Find the right model for your task
Use our VRAM calculator to explore different workloads — coding, reasoning, RAG, long-context — and see which models perform best at each. The calculator auto-detects your GPU and scores every model against your specific hardware.
Compare models side by side
Not sure whether to run Llama 3.1 or Qwen 3 on your hardware? Our compare tool shows both models side by side: VRAM usage, fit score, benchmark rankings, and capability differences.
Go deeper on hardware
If you are thinking about upgrading your GPU or Mac to unlock more models, our hardware browser shows capability profiles for every major GPU — from consumer cards to data center hardware.
Learn more
- VRAM Requirements for Popular AI Models — a comprehensive reference table
- GGUF Quantization Explained — how quantization works and which variant to pick
Frequently Asked Questions
Can I run AI models without a GPU?
Yes, but it will be slow. CPU-only inference works with llama.cpp and Ollama, but expect 1–5 tokens per second versus 20–100+ tokens per second on a GPU. Apple Silicon Macs are the notable exception: their unified memory architecture lets the GPU and CPU share the same pool, so a 16 GB M2 MacBook Pro delivers genuinely usable speeds without a discrete GPU.
Is Ollama free?
Yes. Ollama is free and open source (MIT license). It handles model downloading, default quantization, and serving automatically. Running models locally has no usage limits and requires no account.
How long does it take to download an AI model?
Download time depends on model size and your internet speed. A 7B Q4 model (~4.7 GB) takes 2–5 minutes on a typical broadband connection. A 70B model (~40 GB) can take 30–60 minutes. Models are cached locally after the first download — subsequent runs start instantly.
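You can estimate this for your own connection with a few lines of arithmetic. The sketch below assumes decimal gigabytes and a sustained speed in megabits per second. The 150 Mbps figure is only an illustrative stand-in for "typical broadband".

```python
def download_minutes(size_gb: float, speed_mbps: float) -> float:
    """Minutes to download size_gb (decimal gigabytes) at speed_mbps (megabits/s)."""
    megabits = size_gb * 8 * 1000
    return megabits / speed_mbps / 60


print(round(download_minutes(4.7, 150), 1))  # 4.2
print(round(download_minutes(40, 150), 1))   # 35.6
```

Real downloads rarely sustain the advertised line rate, so expect somewhat longer times in practice.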
What is the easiest way to run AI locally?
Install Ollama, then run ollama run llama3.1:8b. Two steps, and you are chatting. LM Studio is another beginner-friendly option if you prefer a graphical interface over the terminal.
Do I need internet to run local AI?
You need internet only for the initial model download. Once the weights are on your disk, everything runs 100% offline. This is one of the biggest practical advantages of local AI.
Running AI locally is one of those things that feels like magic the first time it works. A model that would cost dollars per hour in API fees, running for free on hardware you already own, with complete privacy. Get Ollama installed, run your first model, and see what is possible — you might not go back to cloud APIs for everything.