AI/ML Engineering

Self-Hosted ChatGPT: Run Open WebUI with Local LLMs (Complete Guide)

Deploy a private ChatGPT alternative with Open WebUI and Ollama. Complete Docker Compose setup with model selection, RAG document upload, web search, multi-user config, and security hardening.

Abhishek Patel · 11 min read


Why Self-Host a ChatGPT Alternative?

Every message you send to ChatGPT, Claude, or Gemini travels to someone else's server. For personal use, that's a reasonable trade-off. For a team handling proprietary code, medical records, legal documents, or internal strategy -- it's a non-starter. Self-hosting gives you complete data sovereignty: every conversation stays on hardware you control, with zero API costs after the initial setup.

Open WebUI is the leading open-source frontend for local LLMs. It connects to Ollama (or any OpenAI-compatible API), supports multiple users, and ships with features that rival commercial offerings -- RAG document upload, web browsing, image generation, conversation branching, and more. I've been running it for my team for six months, and it has replaced our ChatGPT Team subscription entirely.

What Is Open WebUI?

Definition: Open WebUI is an open-source, self-hosted web interface for interacting with large language models. It connects to model backends like Ollama, llama.cpp, or any OpenAI-compatible API, providing a ChatGPT-like experience with multi-user support, conversation history, document upload (RAG), web search, and administrative controls -- all running on your own infrastructure.

Think of Open WebUI as the frontend and Ollama as the backend. Ollama handles downloading, quantizing, and serving models. Open WebUI provides the chat interface, user management, and power features. Together, they form a complete self-hosted ChatGPT replacement.

Prerequisites and Hardware Requirements

Before you start, here's what you need:

| Component | Minimum | Recommended | For Teams (5-10 users) |
|---|---|---|---|
| CPU | 4 cores | 8+ cores (AVX2 support) | 16+ cores |
| RAM | 16GB | 32GB | 64GB+ |
| GPU (optional) | None (CPU-only works) | 8GB+ VRAM (RTX 3060/4060) | 24GB+ VRAM (RTX 4090) |
| Storage | 50GB SSD | 200GB NVMe SSD | 500GB+ NVMe SSD |
| Docker | Docker Engine 20.10+ | Docker Engine 24+ | Docker Engine 24+ |
| OS | Linux, macOS, WSL2 | Linux (Ubuntu 22.04+) | Linux (Ubuntu 22.04+) |

Watch out: Without a GPU, you're limited to smaller models (7B-14B parameters) at slower speeds. A 7B model on CPU generates around 14-18 tokens/sec on a modern desktop -- usable, but noticeably slower than the instant feel of ChatGPT. If you plan to serve multiple users or run 70B+ models, a dedicated GPU is strongly recommended.

Step 1: Install Ollama

Ollama is the model runtime. Install it first:

# Linux / WSL2
curl -fsSL https://ollama.com/install.sh | sh

# macOS (via Homebrew)
brew install ollama

# Verify installation
ollama --version

Ollama runs as a system service and exposes an API on port 11434 by default. On Linux, it starts automatically via systemd. On macOS, it runs as a background application.
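If you'd rather script against that API than use the CLI, Ollama's /api/generate endpoint accepts plain JSON. A minimal Python sketch using only the standard library -- the model name is whichever one you pull in Step 2, and the request itself only succeeds while Ollama is running locally:

```python
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434"  # Ollama's default API port

def build_generate_payload(model: str, prompt: str) -> bytes:
    """JSON body for Ollama's /api/generate endpoint (non-streaming)."""
    return json.dumps({"model": model, "prompt": prompt, "stream": False}).encode()

def generate(model: str, prompt: str) -> str:
    """Send a generate request and return the model's response text."""
    req = urllib.request.Request(
        f"{OLLAMA_URL}/api/generate",
        data=build_generate_payload(model, prompt),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read())["response"]

# Requires a running Ollama instance with the model already pulled:
# print(generate("qwen3:8b", "Explain Docker volumes in two sentences."))
```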

Step 2: Pull Models

Download the models you want to use. Here are the best options by use case and hardware tier:

| Model | Parameters | VRAM / RAM (Q4) | Best For | Speed (RTX 4090) |
|---|---|---|---|---|
| Llama 4 Scout | 17B active (109B total MoE) | ~12GB | General purpose, multilingual | ~55 t/s |
| Qwen 3 8B | 8B | ~5GB | Coding, reasoning, multilingual | ~95 t/s |
| Qwen 3 32B | 32B | ~20GB | Complex reasoning, analysis | ~35 t/s |
| Mistral Small 3.1 | 24B | ~15GB | Balanced quality/speed, tool use | ~45 t/s |
| DeepSeek-R1 14B | 14B | ~9GB | Math, reasoning, chain-of-thought | ~65 t/s |
| Llama 3.3 70B | 70B | ~42GB | Maximum quality (needs big GPU) | ~18 t/s |
| Gemma 3 4B | 4B | ~3GB | Lightweight, fast responses | ~130 t/s |

# Pull models -- each download runs once, models are cached locally
ollama pull qwen3:8b
ollama pull llama4-scout
ollama pull mistral-small3.1
ollama pull deepseek-r1:14b

# List downloaded models
ollama list

# Quick test
ollama run qwen3:8b "Explain Docker volumes in two sentences."

Pro tip: Start with Qwen 3 8B. It punches well above its weight class for coding and general tasks, runs on modest hardware, and generates tokens fast. Add larger models later once you've confirmed your hardware handles them well. You can switch models mid-conversation in Open WebUI.

Step 3: Deploy with Docker Compose

The production setup uses Docker Compose to orchestrate Open WebUI, Ollama, SearXNG (for web search), and ChromaDB (for RAG document storage). Here's the complete stack:

# docker-compose.yml
version: "3.8"

services:
  ollama:
    image: ollama/ollama:latest
    container_name: ollama
    restart: unless-stopped
    ports:
      - "11434:11434"
    volumes:
      - ollama-data:/root/.ollama
    # Uncomment for NVIDIA GPU passthrough:
    # deploy:
    #   resources:
    #     reservations:
    #       devices:
    #         - driver: nvidia
    #           count: all
    #           capabilities: [gpu]
    environment:
      - OLLAMA_NUM_PARALLEL=4        # concurrent requests
      - OLLAMA_MAX_LOADED_MODELS=2   # models kept in memory
      - OLLAMA_KEEP_ALIVE=10m        # unload idle models after 10 min

  open-webui:
    image: ghcr.io/open-webui/open-webui:main
    container_name: open-webui
    restart: unless-stopped
    ports:
      - "3000:8080"
    volumes:
      - open-webui-data:/app/backend/data
    environment:
      - OLLAMA_BASE_URL=http://ollama:11434
      - WEBUI_AUTH=true
      - ENABLE_SIGNUP=false
      - DEFAULT_USER_ROLE=user
      - ENABLE_RAG_WEB_SEARCH=true
      - RAG_WEB_SEARCH_ENGINE=searxng
      - SEARXNG_QUERY_URL=http://searxng:8080/search?q=<query>&format=json
      - RAG_EMBEDDING_MODEL=sentence-transformers/all-MiniLM-L6-v2
      - CHROMA_HTTP_HOST=chromadb
      - CHROMA_HTTP_PORT=8000
    depends_on:
      - ollama
      - searxng
      - chromadb

  searxng:
    image: searxng/searxng:latest
    container_name: searxng
    restart: unless-stopped
    volumes:
      - searxng-data:/etc/searxng
    environment:
      - SEARXNG_BASE_URL=http://localhost:8080/
    ports:
      - "8888:8080"

  chromadb:
    image: chromadb/chroma:latest
    container_name: chromadb
    restart: unless-stopped
    volumes:
      - chroma-data:/chroma/chroma
    ports:
      - "8000:8000"
    environment:
      - ANONYMIZED_TELEMETRY=false

volumes:
  ollama-data:
  open-webui-data:
  searxng-data:
  chroma-data:

# Start the entire stack
docker compose up -d

# Check all services are healthy
docker compose ps

# View logs
docker compose logs -f open-webui

Open your browser to http://localhost:3000. The first user to register becomes the admin. If you set ENABLE_SIGNUP=false, the admin creates additional accounts manually through the admin panel.

GPU Passthrough Setup

For NVIDIA GPUs, you need the NVIDIA Container Toolkit installed on the host:

# Install NVIDIA Container Toolkit (Ubuntu/Debian)
curl -fsSL https://nvidia.github.io/libnvidia-container/gpgkey \
  | sudo gpg --dearmor -o /usr/share/keyrings/nvidia-container-toolkit-keyring.gpg

curl -s -L https://nvidia.github.io/libnvidia-container/stable/deb/nvidia-container-toolkit.list \
  | sed 's#deb https://#deb [signed-by=/usr/share/keyrings/nvidia-container-toolkit-keyring.gpg] https://#g' \
  | sudo tee /etc/apt/sources.list.d/nvidia-container-toolkit.list

sudo apt-get update && sudo apt-get install -y nvidia-container-toolkit
sudo nvidia-ctk runtime configure --runtime=docker
sudo systemctl restart docker

# Verify GPU access inside Docker
docker run --rm --gpus all nvidia/cuda:12.4.0-base-ubuntu22.04 nvidia-smi

Then uncomment the deploy section in the Ollama service definition and restart with docker compose up -d.

Model Preloading and Warm-Up

By default, Ollama loads a model into memory on first request, which adds 5-30 seconds of latency. To preload your primary model at boot:

# Add to crontab or a systemd timer (a cron entry must be a single line)
@reboot sleep 30 && curl -s http://localhost:11434/api/generate -d '{"model":"qwen3:8b","prompt":"warmup","stream":false}' > /dev/null

# Or create a preload script
#!/bin/bash
MODELS=("qwen3:8b" "mistral-small3.1")
for model in "${MODELS[@]}"; do
  echo "Preloading $model..."
  curl -s http://localhost:11434/api/generate \
    -d "{\"model\":\"$model\",\"prompt\":\"hello\",\"stream\":false}" > /dev/null
done

Multi-User Configuration

Open WebUI supports full multi-user setups with role-based access. Key admin settings to configure after first login:

  • User roles: Admin, User, and Pending. Set DEFAULT_USER_ROLE=pending to require admin approval for new accounts.
  • Model permissions: Restrict which models specific users can access. Useful for limiting expensive large models to senior team members.
  • Shared conversations: Users can share conversation links within the instance. Admins can view all conversations if needed.
  • Custom model presets: Create system-prompt templates (e.g., "Code Reviewer," "Technical Writer") that users can select as conversation modes.
  • LDAP / OAuth: Integrate with existing identity providers for single sign-on. Supports Google, Microsoft, GitHub, and generic OIDC providers.

Power Features Worth Configuring

RAG: Document Upload and Querying

Open WebUI's RAG pipeline lets users upload PDFs, Word documents, text files, and web pages directly into a conversation. Documents are chunked, embedded, and stored in ChromaDB. When a user asks a question, relevant chunks are retrieved and injected into the model's context. This turns your local LLM into a knowledge base that can answer questions about your specific documents -- no data leaves your server.

# Fine-tune RAG in docker-compose environment variables
- RAG_EMBEDDING_MODEL=sentence-transformers/all-MiniLM-L6-v2
- RAG_CHUNK_SIZE=1000
- RAG_CHUNK_OVERLAP=200
- RAG_TOP_K=5
- RAG_RELEVANCE_THRESHOLD=0.3
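Under the hood, uploaded documents are split into overlapping chunks before embedding. A toy character-based sketch of how RAG_CHUNK_SIZE and RAG_CHUNK_OVERLAP interact -- Open WebUI's real splitter is smarter about token and sentence boundaries, so this is illustration only:

```python
def chunk_text(text: str, chunk_size: int = 1000, overlap: int = 200) -> list[str]:
    """Split text into chunk_size-character pieces, each sharing
    `overlap` characters with the previous chunk for context continuity."""
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break
    return chunks

# A 2500-character document yields 3 chunks: [0:1000], [800:1800], [1600:2500]
chunks = chunk_text("x" * 2500)
```

Larger overlap means fewer "cut off mid-thought" chunks at the cost of more storage and slightly noisier retrieval.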

Web Browsing via SearXNG

With SearXNG integrated, users can toggle web search per message. The system queries SearXNG, retrieves top results, scrapes content, and feeds it to the model as context. This gives your local LLM access to current information without sending your prompts to Google or Bing. SearXNG itself is a meta-search engine that aggregates results from multiple providers anonymously.

Image Generation

Open WebUI supports image generation via AUTOMATIC1111's Stable Diffusion WebUI or ComfyUI backends. Configure the connection in admin settings, and users can generate images directly in chat. For teams that need image generation without sending prompts to DALL-E or Midjourney, this is a complete local alternative.

Conversation Branching

One of the most underrated features: you can branch a conversation at any point, creating alternative response paths. Ask the same question to different models, or explore multiple approaches to a problem without losing the original thread. Each branch maintains its own history and can be continued independently.

Security: Reverse Proxy and Access Control

Never expose Open WebUI directly to the internet. Place it behind a reverse proxy with TLS termination:

# Caddyfile (simplest option)
chat.yourdomain.com {
    reverse_proxy localhost:3000
}

# Nginx alternative
# Note: limit_req_zone must be declared at the http {} level, outside server {}
limit_req_zone $binary_remote_addr zone=chat:10m rate=10r/s;

server {
    listen 443 ssl http2;
    server_name chat.yourdomain.com;

    ssl_certificate     /etc/letsencrypt/live/chat.yourdomain.com/fullchain.pem;
    ssl_certificate_key /etc/letsencrypt/live/chat.yourdomain.com/privkey.pem;

    location / {
        # Rate limiting
        limit_req zone=chat burst=20 nodelay;

        proxy_pass http://localhost:3000;
        proxy_set_header Host $host;
        proxy_set_header X-Real-IP $remote_addr;
        proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;
        proxy_set_header X-Forwarded-Proto $scheme;

        # WebSocket support (required for streaming responses)
        proxy_http_version 1.1;
        proxy_set_header Upgrade $http_upgrade;
        proxy_set_header Connection "upgrade";
    }
}

Additional security measures worth implementing:

  • Firewall rules: Block direct access to ports 11434 (Ollama), 8000 (ChromaDB), and 8888 (SearXNG) from external networks. Only Open WebUI should communicate with these services.
  • Network isolation: Use a dedicated Docker network so backend services are not reachable from other containers or the host network.
  • Backup: Schedule regular backups of the open-webui-data volume. This contains all conversations, user accounts, and uploaded documents.
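One way to implement the isolation point above, sketched against the Compose file from Step 3 (the network name is my own; adjust to taste): publish only Open WebUI's port and drop the `ports:` mappings from the backend services, which can still reach each other by service name over the shared network.

```yaml
# Sketch: only Open WebUI publishes a host port; backend services
# are reachable solely over the internal Compose network.
networks:
  backend:

services:
  open-webui:
    networks: [backend]
    ports:
      - "3000:8080"          # the only published port
  ollama:
    networks: [backend]       # no ports: -- unreachable from outside Docker
  chromadb:
    networks: [backend]
  searxng:
    networks: [backend]       # still has outbound internet access via NAT
```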

Alternatives to Open WebUI

Open WebUI is the most feature-complete option, but these are worth evaluating depending on your needs:

| Project | Strengths | Best For | Limitations |
|---|---|---|---|
| LibreChat | Multi-provider (OpenAI, Anthropic, local), plugin system | Teams using both cloud and local models | More complex setup, heavier resource use |
| LobeChat | Polished UI, plugin marketplace, TTS/STT | Consumer-grade experience | Less focus on local-first deployment |
| text-generation-webui | Maximum model control, quantization options | ML engineers, model experimentation | Complex UI, single-user oriented |
| Jan | Desktop app, offline-first, simple | Individual users, non-technical | No multi-user, limited admin controls |
| AnythingLLM | Strong RAG focus, workspace-based | Document Q&A use cases | Smaller community, fewer integrations |

Frequently Asked Questions

How does Open WebUI compare to ChatGPT in terms of quality?

It depends entirely on the model you run behind it. A Qwen 3 8B or Llama 4 Scout will handle most general tasks competently -- summarization, coding assistance, writing, Q&A -- at quality roughly comparable to GPT-3.5. For GPT-4-level quality, you need 70B+ parameter models, which require 48GB+ VRAM. The interface itself matches or exceeds ChatGPT's feature set, especially with RAG and branching capabilities.

Can I connect Open WebUI to cloud APIs like OpenAI or Anthropic?

Yes. Open WebUI supports any OpenAI-compatible API endpoint. Set the OPENAI_API_BASE_URL and OPENAI_API_KEY environment variables to connect to OpenAI, Anthropic (via a proxy), Groq, Together AI, or any other provider. You can run local models via Ollama and cloud models simultaneously, letting users choose per conversation.
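For example, pointing the open-webui service at OpenAI alongside Ollama might look like this in docker-compose.yml (the API key shown is a placeholder):

```yaml
  open-webui:
    environment:
      - OLLAMA_BASE_URL=http://ollama:11434
      - OPENAI_API_BASE_URL=https://api.openai.com/v1
      - OPENAI_API_KEY=sk-your-key-here   # placeholder -- load from a secret store
```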

What happens when multiple users query the same model at once?

Ollama handles concurrent requests by queuing them. The OLLAMA_NUM_PARALLEL setting controls how many requests are processed simultaneously (default is 1). Set it to 2-4 for small teams. Each parallel request consumes additional memory for its KV cache, so monitor RAM/VRAM usage. With a 7B model on a 24GB GPU, you can comfortably handle 4 parallel requests.
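To see why parallelism costs memory, here's a rough per-request KV-cache estimate. The layer and head counts below are illustrative of a 7B-class model with grouped-query attention, not taken from any specific checkpoint:

```python
def kv_cache_bytes(n_layers: int, n_kv_heads: int, head_dim: int,
                   ctx_len: int, dtype_bytes: int = 2) -> int:
    """Approximate KV-cache size for one request: two tensors (K and V)
    per layer, each holding n_kv_heads * head_dim values per token."""
    return 2 * n_layers * n_kv_heads * head_dim * ctx_len * dtype_bytes

# Illustrative 7B-class model, fp16 cache, full 8192-token context:
per_request = kv_cache_bytes(n_layers=32, n_kv_heads=8, head_dim=128, ctx_len=8192)
print(per_request / 2**30, "GiB")  # 1.0 GiB per concurrent request
```

Four parallel requests at full context would add roughly 4 GiB on top of the model weights, which is why a 24GB GPU handles OLLAMA_NUM_PARALLEL=4 comfortably with a 7B model.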

How much disk space do models consume?

Quantized models at Q4 precision use roughly 0.5-0.6 GB per billion parameters. A 7B model is about 4.5GB, a 14B model is 9GB, and a 70B model is 42GB. Ollama stores models in its data directory and deduplicates shared layers across model variants. Plan for 50-100GB if you want 3-5 models available locally.
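That rule of thumb in code (0.6 GB per billion parameters is the upper end of the estimate):

```python
def q4_disk_gb(params_billion: float, gb_per_billion: float = 0.6) -> float:
    """Rule-of-thumb Q4 model size: ~0.5-0.6 GB per billion parameters."""
    return params_billion * gb_per_billion

print(q4_disk_gb(70))  # roughly 42 GB, matching the 70B figure above
```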

Can I fine-tune models through Open WebUI?

Not directly. Open WebUI is an inference frontend, not a training tool. However, you can create custom Modelfiles in Ollama that set system prompts, temperature, and other parameters to tailor model behavior. For actual fine-tuning, use tools like Unsloth, Axolotl, or the Hugging Face TRL library, then import the resulting model into Ollama with ollama create.
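A minimal Modelfile sketch -- the preset name and system prompt here are made up for illustration:

```
# Modelfile -- hypothetical "code-reviewer" preset built on qwen3:8b
FROM qwen3:8b
PARAMETER temperature 0.2
SYSTEM You are a meticulous code reviewer. Flag bugs, security issues, and unclear naming.
```

Register it with `ollama create code-reviewer -f Modelfile` and it shows up as a selectable model in Open WebUI.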

Is this setup suitable for production use in a company?

For internal tools serving 5-20 users, absolutely. I've seen teams run this stack reliably for months with proper monitoring and backups. For customer-facing production at scale, you'll want more robust infrastructure: load balancing across multiple Ollama instances, dedicated model serving with vLLM or TGI, proper observability, and an SLA-backed hosting environment. Open WebUI is best suited for internal productivity tooling.

How do I update Open WebUI and Ollama?

With Docker Compose, updates are straightforward. Pull the latest images and recreate the containers. Your data persists in Docker volumes, so updates don't affect conversations or settings. Check the release notes before major version upgrades -- breaking changes are rare but do happen.

docker compose pull
docker compose up -d

From Setup to Daily Driver

The gap between commercial AI chat products and self-hosted alternatives has closed dramatically. Open WebUI with Ollama gives you a ChatGPT-equivalent experience where every byte of data stays on your hardware. The setup takes under an hour, the models keep improving every few months, and you eliminate recurring API costs entirely. Start with a single model on whatever hardware you have, add models and features as your needs grow, and stop worrying about who's reading your conversations.


Written by

Abhishek Patel

Infrastructure engineer with 10+ years building production systems on AWS, GCP, and bare metal. Writes practical guides on cloud architecture, containers, networking, and Linux for developers who want to understand how things actually work under the hood.
