
Open-source frontier models, battle-tested inference engines, and a browser-based interface that lets anyone on your team start prompting in minutes.
The DeepSeek V4 family represents the current frontier of open-source large language models. MIT licensed, fully open weights, and commercially unrestricted.
| Model | Parameters | Active Parameters | Context Window | License | VRAM Required |
|---|---|---|---|---|---|
| DeepSeek V4-Pro | 1.6 Trillion | ~160B (MoE) | 128K | MIT | ~800GB+ (multi-node) |
| DeepSeek V4-Flash | 284 Billion | ~13B (MoE) | 1M | MIT | ~282GB (unquantized) |
| DeepSeek V3 | 671 Billion | ~37B (MoE) | 128K | MIT | ~350GB+ (quantized) |
Every Island Mountain Summit Base and Summit Ridge system ships with these models installed, configured, and tested.
**DeepSeek V4-Flash** - ~80-100GB VRAM (quantized)
284B parameter mixture-of-experts model with a 1M token context window. Runs quantized on the Summit Base tier for efficient local inference. Exceptional at code analysis, long-document processing, and complex reasoning tasks. The full-quality version is available on the Summit Pinnacle tier.

**Llama 3.1** - 40-48GB VRAM
Meta's general-purpose workhorse. Strong across writing, summarization, question answering, and conversational tasks. The most capable general-purpose model in the stack.

**Mixtral** - ~80GB VRAM
Mistral's mixture-of-experts model with strong multilingual capability. Excellent for organizations that work across languages or need efficient multi-task inference.
No command line required. Open your browser, pick a model, start prompting.
Access from any device on your network. Chrome, Firefox, Safari, Edge - any modern browser works.
Switch between DeepSeek V4-Flash, Llama 3.1, Mixtral, or any model with a dropdown. No config files.
Full chat history stored locally on the system. Search, organize, and reference past conversations.
User management, model access controls, and usage monitoring. Set up teams with different model permissions.
A single Island Mountain server supports multiple simultaneous users. This is not a theoretical claim - it is how the vLLM inference engine operates at the architecture level.
vLLM uses tensor parallelism to distribute model weights across both GPUs in the system. On a dual-A100 or dual-H100 configuration, a 284-billion-parameter mixture-of-experts model like DeepSeek V4-Flash splits its attention layers and feed-forward blocks across both cards. Each inference request is processed using the combined memory and compute of both GPUs, not queued behind the previous request.
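As a concrete illustration, a model can be launched with tensor parallelism through vLLM's OpenAI-compatible server roughly like this. The model path and port are placeholders, and exact flags can vary between vLLM releases; delivered systems arrive pre-configured, so this is for reference only.

```bash
# Illustrative sketch (not the exact Island Mountain configuration): serve a
# model sharded across both GPUs with vLLM tensor parallelism. The model path
# and port are placeholders; flag names can differ between vLLM releases.
vllm serve /models/deepseek-v4-flash \
  --tensor-parallel-size 2 \
  --max-model-len 131072 \
  --port 8000
```

With `--tensor-parallel-size 2`, each request runs against layers sharded across both cards instead of queuing behind a single GPU; `--max-model-len` simply caps the context length to fit the available VRAM.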
In practice, a Summit Base-tier system comfortably handles 5-8 concurrent users running inference queries through the Open WebUI interface, depending on model size, context length, and response length. The Summit Ridge tier serves 10-15 simultaneous users. Throughput scales with GPU memory bandwidth - the H100's 3.35 TB/s HBM3 bandwidth delivers roughly 1.7x the concurrent throughput of the A100's 2.0 TB/s HBM2e.
For organizations running department-wide AI access - a law firm's attorneys, a tribal government's planning team, a medical practice's clinicians - this is not a shared-terminal bottleneck. It is genuine concurrent access with response times measured in seconds, not minutes.
Open WebUI is a self-hosted web application. The "web" refers to the browser-based interface, not an internet dependency. It runs on the server, serves its UI over your local network, and users access it from any device on the LAN. No cloud account. No external API calls for inference.
But Open WebUI ships with features that reach outbound by default. Island Mountain disables every one of them before your system leaves our facility.
- `OFFLINE_MODE=True` - Master switch. Disables all outbound network calls, including version checks and remote resource fetching. This is the primary air-gap control.
- `HF_HUB_OFFLINE=1` - Prevents the Hugging Face Hub library from attempting to download embedding models or tokenizers. All required models are pre-loaded on local NVMe storage during our build process.
- `ENABLE_COMMUNITY_SHARING=False` - Removes the ability to share prompts, configurations, or presets with the Open WebUI community platform. No data leaves the system through this channel.
- `ANONYMIZED_TELEMETRY=False` - Disables all telemetry collection. No usage data, anonymized or otherwise, is transmitted to any external endpoint.
- `ENABLE_RAG_WEB_SEARCH=False` - Disables web search integration in the document retrieval pipeline. RAG operates entirely against locally stored documents.
- `SAFE_MODE=True` - Disables community-contributed tools and functions that might contain network calls. Only pre-audited, locally installed tools are available.
- `ENABLE_SIGNUP=False` - Locks user registration after initial admin account creation during setup. New users are added only by the administrator.

Every environment variable is set in a single configuration file on the system. Your network security team can inspect it directly: run `docker inspect [container_name]` to view all active environment variables, or check the `.env` file in the Open WebUI deployment directory.
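For reference, here is an illustrative excerpt of what those settings look like in the `.env` file, using the exact variable names listed above:

```bash
# Illustrative excerpt of the air-gap settings in the Open WebUI .env file.
# A delivered system's file also contains deployment-specific entries not shown here.
OFFLINE_MODE=True
HF_HUB_OFFLINE=1
ENABLE_COMMUNITY_SHARING=False
ANONYMIZED_TELEMETRY=False
ENABLE_RAG_WEB_SEARCH=False
SAFE_MODE=True
ENABLE_SIGNUP=False
```

To confirm the running container picked them up, `docker inspect --format '{{json .Config.Env}}' [container_name]` prints the active environment in a single line.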
For complete verification: connect the system to a monitored network segment and run inference queries for 24 hours. Monitor outbound traffic. There should be zero outbound connections initiated by the server. If your security policy requires it, configure host-level firewall rules (iptables/nftables) to block all outbound traffic as a belt-and-suspenders measure.
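A minimal sketch of that belt-and-suspenders firewall policy using standard iptables, plus a passive check for outbound connection attempts, is shown below. The interface name and IP address are placeholders; adapt the rules and any permitted exceptions to your own environment before applying them.

```bash
# Illustrative host-level policy: block traffic the server initiates outbound
# while still allowing replies to sessions started by LAN users (so the Web UI
# keeps working). Add the allow rules first so an active SSH session is not cut
# off, then flip the default policy to drop.
sudo iptables -A OUTPUT -o lo -j ACCEPT
sudo iptables -A OUTPUT -m conntrack --ctstate ESTABLISHED,RELATED -j ACCEPT
sudo iptables -P OUTPUT DROP

# During the 24-hour verification window, log any outbound connection attempts.
# Run this on the monitoring host (or on the server before the drop policy is
# applied); the interface and server IP are placeholders.
sudo tcpdump -i eth0 -n 'tcp[tcpflags] & tcp-syn != 0 and src host 192.168.1.50'
```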
Island Mountain configures these settings during the 72-hour burn-in process. The system is tested in its air-gapped configuration before delivery, not just in its connected configuration.
Every system goes through the same four-phase process before it ships.
GPUs sourced from verified enterprise resellers with documented provenance. Every component tested individually before assembly begins.
15-phase build process, including chassis prep, CPU/RAM install, NVMe configuration, GPU seating and power, BIOS optimization, OS install, CUDA toolkit, inference engine, model deployment, and Open WebUI setup.
Continuous stress testing across all GPUs with automated temperature, error rate, and performance monitoring. Automated alerting on any anomaly. Systems that fail burn-in don't ship.
Full performance benchmark suite across all installed models. Delivery manifest documenting every component serial number, test result, and configuration detail.
IT directors need to know what happens after delivery. Here's the full lifecycle story.
New open-source models are released regularly. Updating is a download-and-configure process through Ollama or the Open WebUI interface. We publish tested model compatibility lists for each hardware tier so you know what runs before you install it. For the first 30 days, we'll walk you through model updates directly.
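If you briefly connect the system (or point it at an internal mirror) for an update, the Ollama side of the process comes down to a couple of commands; the model tag below is illustrative only.

```bash
# Pull a new or updated model. The tag is a placeholder; check the tested
# compatibility list for your hardware tier before pulling.
ollama pull llama3.1:70b

# Confirm what is installed locally before taking the system back offline.
ollama list
```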
The system runs Ubuntu Server LTS with standard package management. OS-level security patches are applied through apt update && apt upgrade, the same process your IT team uses on any Linux server. CUDA drivers and inference engine updates are managed through NVIDIA's official repositories.
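A typical patch cycle looks like this; the commands are shown for illustration, and your team's standard change process applies.

```bash
# Standard Ubuntu patch cycle, run from the console or over SSH on the LAN.
# Packages can come from a local/offline mirror if the system never leaves
# the air gap.
sudo apt update && sudo apt upgrade

# After a driver or CUDA update, confirm both GPUs are visible and the
# reported driver/CUDA versions are what you expect.
nvidia-smi
```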
Every system ships with a 1-year hardware warranty covering all components. GPU failures are handled through supplier RMA agreements with documented replacement timelines. We maintain a 20% warranty reserve per unit to ensure we can cover replacements without delay. Extended warranty options are available at purchase.
When next-generation models demand more VRAM or faster memory bandwidth, your existing GPUs are credited at current secondary market value toward replacement cards. This is a physical hardware swap, not a firmware toggle. Upgrade pricing depends on the replacement GPU market price minus your trade-in credit.
All systems require a dedicated 208V/30A circuit (NEMA L6-30R), standard in server rooms and data closets. Average power draw is 1.5-2.5 kW under typical inference loads. The system runs at standard server room temperatures (64-80°F / 18-27°C). No specialized cooling infrastructure required beyond normal HVAC.
After the 30-day included support period, ongoing support is available on a per-incident or annual retainer basis. This covers model configuration, performance tuning, troubleshooting, and remote diagnostics. You always have direct access to the person who built the system.
Tensor parallelism splits a model across multiple GPUs so they process different parts of each computation simultaneously. Without it, one GPU works while the other sits idle. With vLLM's tensor parallelism, model layers are distributed across both GPUs, cutting inference latency roughly in half. On Island Mountain systems, each H100's 3.35 TB/s of HBM3 memory bandwidth, together with the direct GPU-to-GPU interconnect, enables near-linear scaling. Pre-configured on all Island Mountain systems.
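A simple way to see this on a running system is to watch both GPUs while a query executes from the Web UI:

```bash
# Refresh GPU status every second while a query runs. With tensor parallelism,
# both GPUs show model weights resident and utilization rising together during
# generation; without it, one card would sit idle.
watch -n 1 nvidia-smi
```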
The Summit Base tier runs V4-Flash quantized (80-100GB VRAM), which gives you the 284B parameter model in a more efficient form. For the full-quality, unquantized version (282GB VRAM), you need the Summit Pinnacle tier (2x H200 141GB GPUs, 282GB total), launching Q3 2026. Quantization preserves reasoning capability while reducing model size significantly.
Open WebUI manages concurrent users through a request queue. vLLM's continuous batching groups compatible requests to maximize GPU throughput. Each user sees their own conversation interface - no cross-user data leakage. The Summit Base tier (2x H100) handles 5-8 simultaneous users comfortably. The Summit Ridge tier handles 10-15 simultaneous users for typical business tasks.
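To see continuous batching at work, you can point a handful of simultaneous requests at the local inference endpoint. The sketch below assumes vLLM's OpenAI-compatible server on its default port 8000; the model name and prompt are placeholders.

```bash
# Fire five requests at once at the local vLLM endpoint; continuous batching
# interleaves them on the GPUs instead of serving them strictly one at a time.
# Port, model name, and prompt are placeholders for illustration.
for i in 1 2 3 4 5; do
  curl -s http://localhost:8000/v1/chat/completions \
    -H "Content-Type: application/json" \
    -d '{"model": "deepseek-v4-flash",
         "messages": [{"role": "user", "content": "Summarize this policy memo."}]}' &
done
wait
```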
DeepSeek V4-Flash (284B parameters, 13B active via mixture-of-experts) represents the next generation of efficient large-scale inference. With a 1M token context window, it's built for tasks that shorter-context models can't handle: full-codebase analysis, book-length document processing, and extended research sessions.
Summit Base Tier (2x H100, $75K-$85K): V4-Flash runs quantized. This preserves reasoning capability while reducing the memory footprint to 80-100GB, enabling efficient local inference on a single dual-GPU system. Ideal for organizations that need V4-Flash capability without the Summit Pinnacle tier investment.
Summit Pinnacle Tier ($350K-$400K, Q3 2026): V4-Flash at full, unquantized quality requires approximately 282GB of VRAM - dual H200 141GB GPUs provide exactly this. This configuration unlocks V4-Flash's maximum reasoning and long-context capabilities for enterprise workloads.
We validate quantization carefully before shipping. When the Summit Pinnacle tier launches, full-quality V4-Flash support will be real, tested, and documented - not aspirational.
Or call directly: 1-801-609-1130