Going Inference-Token Free: Local Models for Enterprise Knowledge Work

We’re building a knowledge chatbot that reads from SharePoint and uploaded files. Even at this early stage, an uncomfortable question is taking shape: are premium cloud models overkill for most enterprise work?

Two kinds of AI workload

Most organizations have a split. On one side, genuinely hard problems: deep research, complex analysis, tasks that demand the most capable AI available. On the other, the everyday bulk: “What is the travel reimbursement policy?”, “How do I set up a new vendor account?”, “Where is the latest quarterly report?”

That second category is large, repetitive, and currently expensive when served through cloud AI providers who charge per query. We wanted to know whether freely available local models, specifically Qwen and Gemma, could handle it well enough to eliminate those per-query costs entirely.

Early results suggest they can.

What the chatbot does

The system answers staff questions using only approved internal documents. It pulls content from our SharePoint Online sites and from files uploaded manually by administrators (PDFs, Word documents, plain text, and Markdown). When a user asks a question, the system searches this document library, retrieves the most relevant passages, and uses an AI model to compose an answer grounded in that evidence. This pattern is known as Retrieval-Augmented Generation (RAG). For questions that need more depth, the system can also read the full source document before answering.

Users can choose which AI model powers their conversation. The system supports both local models running on our own hardware (at zero per-query cost) and cloud-hosted models when higher capability is needed. This flexibility let us compare local and cloud models side by side on the same questions from the same users.

What makes it work with smaller models

Several design choices compensate for the fact that local models are less powerful than their cloud counterparts.

Smart search, not just AI smarts. The system combines two different search techniques to find the right document passages. One matches meaning (semantic search), the other matches keywords (traditional text search). Together, they catch relevant content that either method alone would miss. If the first search does not return enough evidence, the system automatically rephrases the question and tries again.
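
To make the idea concrete, here is a minimal sketch of combining two ranked result lists and retrying with a rephrased question. It assumes reciprocal rank fusion as the merging method, which is one common choice and not necessarily the one our system uses. The function names, the min_hits threshold, and the three callables (semantic_search, keyword_search, rephrase) are hypothetical stand-ins for the real components.

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings, k=60):
    """Merge several ranked lists of passage IDs into a single ranking.

    Each passage earns 1 / (k + rank) from every list it appears in,
    so passages found by both searches rise to the top.
    """
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, passage_id in enumerate(ranking, start=1):
            scores[passage_id] += 1.0 / (k + rank)
    # Sort by descending score; break ties by ID for stable output.
    return sorted(scores, key=lambda pid: (-scores[pid], pid))


def hybrid_search(question, semantic_search, keyword_search, rephrase,
                  min_hits=3, max_attempts=2):
    """Run both searches, fuse the results, and retry once if evidence is thin.

    The three callables are assumptions for illustration: each search takes
    a query string and returns a ranked list of passage IDs, and rephrase
    takes a query string and returns a reworded one.
    """
    query = question
    fused = []
    for attempt in range(max_attempts):
        fused = reciprocal_rank_fusion(
            [semantic_search(query), keyword_search(query)]
        )
        if len(fused) >= min_hits:
            break
        if attempt + 1 < max_attempts:
            query = rephrase(query)
    return fused
```

Rank-based fusion has the practical advantage that the two searches do not need comparable scores, only an ordering, which makes it easy to swap either search engine out.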

Document structure preservation. When documents are broken into searchable passages, the system preserves the heading hierarchy (for example, “Section 3 > Subsection 3.2 > Paragraph”). This context helps the AI understand where a passage fits within its source document, which reduces the chance of misleading answers.
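
As an illustration, the sketch below splits a Markdown document into passages and attaches the heading breadcrumb to each one. It assumes Markdown-style "#" headings and a simple one-passage-per-section split; the function name and the output shape are hypothetical, and other formats such as Word or PDF would need their own heading extraction.

```python
import re

# Matches Markdown headings such as "## Reimbursement".
HEADING = re.compile(r"^(#{1,6})\s+(.*)$")


def chunk_with_heading_path(markdown_text):
    """Split Markdown into passages, each tagged with its heading breadcrumb."""
    path = []     # (level, title) pairs for the headings currently open
    buffer = []   # body lines collected for the current passage
    chunks = []

    def flush():
        # Emit the buffered body text, if any, under the current breadcrumb.
        text = "\n".join(buffer).strip()
        if text:
            breadcrumb = " > ".join(title for _, title in path)
            chunks.append({"path": breadcrumb, "text": text})
        buffer.clear()

    for line in markdown_text.splitlines():
        match = HEADING.match(line)
        if match:
            flush()
            level = len(match.group(1))
            # A new heading closes any open heading at the same or deeper level.
            while path and path[-1][0] >= level:
                path.pop()
            path.append((level, match.group(2).strip()))
        else:
            buffer.append(line)
    flush()
    return chunks
```

The breadcrumb can then be prepended to the passage text before indexing, so that both the search step and the model see where the passage sits in its document.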

Built-in answer review. Before any response reaches the user, an independent review step checks whether the answer is actually supported by the retrieved documents, whether citations are accurate, and whether the response stays within scope. If the review finds problems, the system either retries or declines to answer rather than delivering something unreliable.
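
The control flow of such a review gate might look like the sketch below. It is a simplified illustration, not our implementation: the Draft type, the decline message, and the generate and is_supported callables are hypothetical, and the scope check is omitted for brevity. In practice the support check would be carried out by a separate reviewer, for example a second model call.

```python
from dataclasses import dataclass


@dataclass
class Draft:
    text: str
    citations: list  # IDs of the passages the answer claims to rely on


DECLINE = "I could not find a reliable answer in the approved documents."


def citations_are_valid(draft, retrieved_ids):
    """Every citation must point at a passage that was actually retrieved."""
    return bool(draft.citations) and set(draft.citations) <= set(retrieved_ids)


def answer_with_review(question, retrieved_ids, generate, is_supported,
                       max_attempts=2):
    """Return a draft only if it passes review; otherwise retry, then decline.

    generate(question) returns a Draft; is_supported(draft) returns True when
    the reviewer judges the answer to be backed by the evidence. Both are
    assumptions standing in for the real model calls.
    """
    for _ in range(max_attempts):
        draft = generate(question)
        if citations_are_valid(draft, retrieved_ids) and is_supported(draft):
            return draft.text
    return DECLINE
```

Keeping the cheap, deterministic citation check ahead of the model-based check means obviously broken drafts are rejected without spending a second inference call.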

Visual content made searchable. Many internal documents, from user guides with annotated screenshots to scientific papers with graphs, rely heavily on images. A vision-capable local model reads these images and converts them into written descriptions that become part of the searchable text. A screenshot of a configuration screen becomes step-by-step instructions. A chart becomes a prose summary of its findings. This means staff can find information locked inside figures and diagrams, not just body text.

Organization-aware access control. Documents are tagged as either organization-wide or specific to a particular office or department. A staff member in the Nairobi office sees global policies plus Nairobi-specific guidance, but not internal procedures meant for the Geneva office. The same chatbot serves everyone, but each person sees only what is relevant and permitted for them.
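
A minimal sketch of this filter follows. It assumes each passage carries a single office tag, with None meaning organization-wide; the Passage type and field names are hypothetical, and a real deployment would likely also apply the filter inside the search query itself.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Passage:
    doc_id: str
    office: Optional[str]  # None means the document is organization-wide


def visible_passages(passages, user_office):
    """Keep organization-wide passages plus those tagged for the user's office."""
    return [p for p in passages if p.office is None or p.office == user_office]


library = [
    Passage("global-travel-policy", None),
    Passage("nairobi-visitor-guide", "Nairobi"),
    Passage("geneva-internal-procedures", "Geneva"),
]

nairobi_view = visible_passages(library, "Nairobi")
```

Because the filter runs before any passage reaches the model, a document the user cannot see never enters the prompt in the first place.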

Highly configurable without developer involvement. Administrators can tune search behavior, adjust security settings, manage document sources, configure escalation contacts, and customize suggested prompts for each knowledge domain, all through a web-based admin panel. No code changes or redeployment required. This is especially important when working with local models, because different models perform differently and the surrounding system needs to be adjustable to get the best out of each one.

These architectural compensations work best when the source documents themselves are well-structured. Heading hierarchy, terminology consistency, and section design directly affect whether the system retrieves the right content and returns a complete answer. Poor document structure is one of the most common and least visible causes of retrieval failure.

For those who create or review policy documents and user guides, we have published a practical authoring guide that shows how these choices play out in practice and what to do differently.

Keeping it secure

Running AI on internal documents demands proper safeguards, regardless of whether the model is local or in the cloud. We built a multi-layered security framework that protects against prompt injection (attempts to trick the AI into ignoring its instructions or leaking information).

The defense operates at multiple levels: pattern-based detection catches known attack signatures, a dedicated AI classifier identifies subtler manipulation attempts, and technical controls verify that the AI has not been influenced by tampered content in the documents themselves. Importantly, most of these protections work independently of which AI model is generating answers, meaning they are just as effective with a free local model as with a premium cloud service.
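
The first two layers can be sketched as follows. The patterns shown are illustrative examples only, not our actual rule set, and the classifier is a hypothetical callable standing in for the dedicated AI classifier. The point is the layering: a cheap pattern pass runs first, and the classifier handles what the patterns miss.

```python
import re

# Illustrative signatures only; a real rule set would be broader and maintained over time.
SUSPICIOUS_PATTERNS = [
    re.compile(r"ignore\s+(all\s+)?(previous|prior|above)\s+instructions", re.I),
    re.compile(r"reveal\s+(your\s+)?(system\s+)?prompt", re.I),
    re.compile(r"you\s+are\s+now\s+", re.I),
]


def matched_patterns(text):
    """Return the source of every known attack signature found in the text."""
    return [p.pattern for p in SUSPICIOUS_PATTERNS if p.search(text)]


def screen_input(text, classifier=None):
    """Return "block" or "allow" for a user message or a retrieved passage.

    classifier is an optional callable (an assumption for this sketch) that
    returns True when it judges the text to be a manipulation attempt.
    """
    if matched_patterns(text):
        return "block"
    if classifier is not None and classifier(text):
        return "block"
    return "allow"
```

Neither layer depends on which model generates the final answer, which is why the same screening applies equally to local and cloud models.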

On top of this, the access control system described above acts as a structural safeguard. Even if someone found a way past the AI-level defenses, they still cannot surface documents their role does not grant access to.

Where it struggles

This is not a solved problem, and honesty about the limitations matters.

Speed. Local models on modest hardware generate text noticeably slower than cloud APIs. For a single person asking a question, the wait is reasonable.

Scaling for simultaneous users. A single local AI server becomes a bottleneck when demand spikes. Requests queue up, response times climb, and the user experience suffers.

One path forward starts at the edge: local LLMs running directly on users’ devices, which keeps inference fast, private, and entirely on the user’s own machine. Where edge compute falls short for larger models, heavier workloads, or stricter consistency requirements, dedicated servers closer to each office or team offer a middle ground without routing everything back through a central node. The architecture already supports multiple AI providers, so either direction would be an extension, not a rebuild. Both remain areas we’re actively exploring.

What this suggests

For the routine knowledge work that makes up most of an organization’s AI queries, local models are proving capable enough. In our RAG pipeline, the heavy lifting is done by retrieval, the answer review process, and the security framework. The AI model’s job is to turn retrieved evidence into clear natural language, and models like Qwen and Gemma handle that well.

This does not replace cloud AI for advanced research or complex reasoning tasks. But it suggests organizations can be strategic about where they spend on AI: premium models for premium problems, local inference for the high-volume everyday questions.

This is a work in progress, with open questions around performance at scale and distribution strategy. But the early results are encouraging enough to keep building, and the potential cost savings for operational AI use cases are significant.