
Open-source frontier models, battle-tested inference engines, and a browser-based interface that lets anyone on your team start prompting in minutes.
The DeepSeek V4 family represents the current frontier of open-source large language models. MIT licensed, fully open weights, and commercially unrestricted.
| Model | Parameters | Active Parameters | Context Window | License | VRAM Required |
|---|---|---|---|---|---|
| DeepSeek V4-Pro | 1.6 Trillion | ~160B (MoE) | 128K | MIT | ~800GB+ (multi-node) |
| DeepSeek V4-Flash | 284 Billion | ~13B (MoE) | 1M | MIT | ~282GB (unquantized) |
| DeepSeek V3 | 671 Billion | ~37B (MoE) | 128K | MIT | ~350GB+ (quantized) |
Every Island Mountain Summit Base and Summit Ridge system ships with these models installed, configured, and tested.
**DeepSeek V4-Flash** - ~80-100GB VRAM (quantized)
284B parameter mixture-of-experts model with a 1M token context window. Runs quantized on the Summit Base tier for efficient local inference. Exceptional at code analysis, long-document processing, and complex reasoning tasks. The full-quality version is available on the Summit Pinnacle tier.

**Llama 3.1** - 40-48GB VRAM
Meta's general-purpose workhorse. Strong across writing, summarization, question answering, and conversational tasks. The most capable general-purpose model in the stack.

**Mixtral** - ~80GB VRAM
Mistral's mixture-of-experts model with strong multilingual capability. Excellent for organizations that work across languages or need efficient multi-task inference.
No command line required. Open your browser, pick a model, start prompting.
Access from any device on your network. Chrome, Firefox, Safari, Edge - any modern browser works.
Switch between DeepSeek V4-Flash, Llama 3.1, Mixtral, or any model with a dropdown. No config files.
Full chat history stored locally on the system. Search, organize, and reference past conversations.
User management, model access controls, and usage monitoring. Set up teams with different model permissions.
A single Island Mountain server supports multiple simultaneous users. This is not a theoretical claim - it is how the vLLM inference engine operates at the architecture level.
vLLM uses tensor parallelism to distribute model weights across both GPUs in the system. On a dual-A100 or dual-H100 configuration, a 284-billion-parameter mixture-of-experts model like DeepSeek V4-Flash splits its attention layers and feed-forward blocks across both cards. Each inference request is processed using the combined memory and compute of both GPUs, not queued behind the previous request.
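As a concrete illustration, a model can be launched with tensor parallelism through vLLM's OpenAI-compatible server roughly like this. The model path and port are placeholders, and exact flags can vary between vLLM releases; delivered systems arrive pre-configured, so this is for reference only.

```bash
# Illustrative sketch (not the exact Island Mountain configuration): serve a
# model sharded across both GPUs with vLLM tensor parallelism. The model path
# and port are placeholders; flag names can differ between vLLM releases.
vllm serve /models/deepseek-v4-flash \
  --tensor-parallel-size 2 \
  --max-model-len 131072 \
  --port 8000
```

With `--tensor-parallel-size 2`, each request runs against layers sharded across both cards instead of queuing behind a single GPU; `--max-model-len` simply caps the context length to fit the available VRAM.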
In practice, a Summit Base-tier system comfortably handles 5-8 concurrent users running inference queries through the Open WebUI interface, depending on model size, context length, and response length. The Summit Ridge tier serves 10-15 simultaneous users. Throughput scales with GPU memory bandwidth - the H100's 3.35 TB/s HBM3 bandwidth delivers roughly 1.7x the concurrent throughput of the A100's 2.0 TB/s HBM2e.
For organizations running department-wide AI access - a law firm's attorneys, a tribal government's planning team, a medical practice's clinicians - this is not a shared-terminal bottleneck. It is genuine concurrent access with response times measured in seconds, not minutes.
Open WebUI is a self-hosted web application. The "web" refers to the browser-based interface, not an internet dependency. It runs on the server, serves its UI over your local network, and users access it from any device on the LAN. No cloud account. No external API calls for inference.
But Open WebUI ships with features that reach outbound by default. Island Mountain disables every one of them before your system leaves our facility.
- `OFFLINE_MODE=True` - Master switch. Disables all outbound network calls, including version checks and remote resource fetching. This is the primary air-gap control.
- `HF_HUB_OFFLINE=1` - Prevents the Hugging Face Hub library from attempting to download embedding models or tokenizers. All required models are pre-loaded on local NVMe storage during our build process.
- `ENABLE_COMMUNITY_SHARING=False` - Removes the ability to share prompts, configurations, or presets with the Open WebUI community platform. No data leaves the system through this channel.
- `ANONYMIZED_TELEMETRY=False` - Disables all telemetry collection. No usage data, anonymized or otherwise, is transmitted to any external endpoint.
- `ENABLE_RAG_WEB_SEARCH=False` - Disables web search integration in the document retrieval pipeline. RAG operates entirely against locally stored documents.
- `SAFE_MODE=True` - Disables community-contributed tools and functions that might contain network calls. Only pre-audited, locally installed tools are available.
- `ENABLE_SIGNUP=False` - Locks user registration after initial admin account creation during setup. New users are added only by the administrator.

Every environment variable is set in a single configuration file on the system. Your network security team can inspect it directly: run `docker inspect [container_name]` to view all active environment variables, or check the `.env` file in the Open WebUI deployment directory.
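For reference, here is an illustrative excerpt of what those settings look like in the `.env` file, using the exact variable names listed above:

```bash
# Illustrative excerpt of the air-gap settings in the Open WebUI .env file.
# A delivered system's file also contains deployment-specific entries not shown here.
OFFLINE_MODE=True
HF_HUB_OFFLINE=1
ENABLE_COMMUNITY_SHARING=False
ANONYMIZED_TELEMETRY=False
ENABLE_RAG_WEB_SEARCH=False
SAFE_MODE=True
ENABLE_SIGNUP=False
```

To confirm the running container picked them up, `docker inspect --format '{{json .Config.Env}}' [container_name]` prints the active environment in a single line.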
For complete verification: connect the system to a monitored network segment and run inference queries for 24 hours. Monitor outbound traffic. There should be zero outbound connections initiated by the server. If your security policy requires it, configure host-level firewall rules (iptables/nftables) to block all outbound traffic as a belt-and-suspenders measure.
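A minimal sketch of that belt-and-suspenders firewall policy using standard iptables, plus a passive check for outbound connection attempts, is shown below. The interface name and IP address are placeholders; adapt the rules and any permitted exceptions to your own environment before applying them.

```bash
# Illustrative host-level policy: block traffic the server initiates outbound
# while still allowing replies to sessions started by LAN users (so the Web UI
# keeps working). Add the allow rules first so an active SSH session is not cut
# off, then flip the default policy to drop.
sudo iptables -A OUTPUT -o lo -j ACCEPT
sudo iptables -A OUTPUT -m conntrack --ctstate ESTABLISHED,RELATED -j ACCEPT
sudo iptables -P OUTPUT DROP

# During the 24-hour verification window, log any outbound connection attempts.
# Run this on the monitoring host (or on the server before the drop policy is
# applied); the interface and server IP are placeholders.
sudo tcpdump -i eth0 -n 'tcp[tcpflags] & tcp-syn != 0 and src host 192.168.1.50'
```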
Island Mountain configures these settings during the 72-hour burn-in process. The system is tested in its air-gapped configuration before delivery, not just in its connected configuration.
Every system goes through the same four-phase process before it ships.
GPUs sourced from verified enterprise resellers with documented provenance. Every component tested individually before assembly begins.
15-phase build process, including chassis prep, CPU/RAM install, NVMe configuration, GPU seating and power, BIOS optimization, OS install, CUDA toolkit, inference engine, model deployment, and Open WebUI setup.
Continuous stress testing across all GPUs with automated temperature, error rate, and performance monitoring. Automated alerting on any anomaly. Systems that fail burn-in don't ship.
Full performance benchmark suite across all installed models. Delivery manifest documenting every component serial number, test result, and configuration detail.
IT directors need to know what happens after delivery. Here's the full lifecycle story.
New open-source models are released regularly. Updating is a download-and-configure process through Ollama or the Open WebUI interface. We publish tested model compatibility lists for each hardware tier so you know what runs before you install it. For the first 30 days, we'll walk you through model updates directly.
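If you briefly connect the system (or point it at an internal mirror) for an update, the Ollama side of the process comes down to a couple of commands; the model tag below is illustrative only.

```bash
# Pull a new or updated model. The tag is a placeholder; check the tested
# compatibility list for your hardware tier before pulling.
ollama pull llama3.1:70b

# Confirm what is installed locally before taking the system back offline.
ollama list
```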
The system runs Ubuntu Server LTS with standard package management. OS-level security patches are applied through apt update && apt upgrade, the same process your IT team uses on any Linux server. CUDA drivers and inference engine updates are managed through NVIDIA's official repositories.
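A typical patch cycle looks like this; the commands are shown for illustration, and your team's standard change process applies.

```bash
# Standard Ubuntu patch cycle, run from the console or over SSH on the LAN.
# Packages can come from a local/offline mirror if the system never leaves
# the air gap.
sudo apt update && sudo apt upgrade

# After a driver or CUDA update, confirm both GPUs are visible and the
# reported driver/CUDA versions are what you expect.
nvidia-smi
```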
Every system ships with a 1-year hardware warranty covering all components. GPU failures are handled through supplier RMA agreements with documented replacement timelines. We maintain a 20% warranty reserve per unit to ensure we can cover replacements without delay. Extended warranty options are available at purchase.
When next-generation models demand more VRAM or faster memory bandwidth, your existing GPUs are credited at current secondary market value toward replacement cards. This is a physical hardware swap, not a firmware toggle. Upgrade pricing depends on the replacement GPU market price minus your trade-in credit.
All systems require a dedicated 208V/30A circuit (NEMA L6-30R), standard in server rooms and data closets. Average power draw is 1.5-2.5 kW under typical inference loads. The system runs at standard server room temperatures (64-80°F / 18-27°C). No specialized cooling infrastructure required beyond normal HVAC.
After the 30-day included support period, ongoing support is available on a per-incident or annual retainer basis. This covers model configuration, performance tuning, troubleshooting, and remote diagnostics. You always have direct access to the person who built the system.
Tensor parallelism splits a model across multiple GPUs so they process different parts of each computation simultaneously. Without it, one GPU works while the other sits idle. With vLLM's tensor parallelism, model layers are distributed across both GPUs, cutting inference latency roughly in half. On Island Mountain systems, each H100's 3.35 TB/s of HBM3 memory bandwidth, together with the direct GPU-to-GPU interconnect, enables near-linear scaling. Pre-configured on all Island Mountain systems.
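A simple way to see this on a running system is to watch both GPUs while a query executes from the Web UI:

```bash
# Refresh GPU status every second while a query runs. With tensor parallelism,
# both GPUs show model weights resident and utilization rising together during
# generation; without it, one card would sit idle.
watch -n 1 nvidia-smi
```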
The Summit Base tier runs V4-Flash quantized (80-100GB VRAM), which gives you the 284B parameter model in a more efficient form. For the full-quality, unquantized version (282GB VRAM), you need the Summit Pinnacle tier (2x H200 141GB GPUs, 282GB total), launching Q3 2026. Quantization preserves reasoning capability while reducing model size significantly.
Open WebUI manages concurrent users through a request queue. vLLM's continuous batching groups compatible requests to maximize GPU throughput. Each user sees their own conversation interface - no cross-user data leakage. The Summit Base tier (2x H100) handles 5-8 simultaneous users comfortably. The Summit Ridge tier handles 10-15 simultaneous users for typical business tasks.
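To see continuous batching at work, you can point a handful of simultaneous requests at the local inference endpoint. The sketch below assumes vLLM's OpenAI-compatible server on its default port 8000; the model name and prompt are placeholders.

```bash
# Fire five requests at once at the local vLLM endpoint; continuous batching
# interleaves them on the GPUs instead of serving them strictly one at a time.
# Port, model name, and prompt are placeholders for illustration.
for i in 1 2 3 4 5; do
  curl -s http://localhost:8000/v1/chat/completions \
    -H "Content-Type: application/json" \
    -d '{"model": "deepseek-v4-flash",
         "messages": [{"role": "user", "content": "Summarize this policy memo."}]}' &
done
wait
```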
DeepSeek V4-Flash (284B parameters, 13B active via mixture-of-experts) represents the next generation of efficient large-scale inference. With a 1M token context window, it's built for tasks that shorter-context models can't handle: full-codebase analysis, book-length document processing, and extended research sessions.
Summit Base Tier (2x H100, $75K-$85K): V4-Flash runs quantized. This preserves reasoning capability while reducing the memory footprint to 80-100GB, enabling efficient local inference on a single dual-GPU system. Ideal for organizations that need V4-Flash capability without the Summit Pinnacle tier investment.
Summit Pinnacle Tier ($350K-$400K, Q3 2026): V4-Flash at full, unquantized quality requires approximately 282GB of VRAM - dual H200 141GB GPUs provide exactly this. This configuration unlocks V4-Flash's maximum reasoning and long-context capabilities for enterprise workloads.
We validate quantization carefully before shipping. When the Summit Pinnacle tier launches, full-quality V4-Flash support will be real, tested, and documented - not aspirational.
Or call directly: 1-801-609-1130