AI #RAG #multi-tenant #SaaS

Multi-Tenant RAG in 2026: Building Secure Retrieval-Augmented Generation for SaaS

One RAG system, many customers, strict isolation. A practical guide to multi-tenant architecture patterns, data isolation, and cost management.

15 min · January 4, 2026 · Updated January 27, 2026

TL;DR

Multi-tenant RAG serves multiple customers from one system while maintaining strict data isolation.
Three isolation models: Silo (separate index per tenant), Pool (shared index with filters), Bridge (hybrid).
Choose Silo for enterprise (strongest isolation), Pool for SMB (cost-efficient), Bridge for mixed customer base.
Security is non-negotiable: encrypt per tenant, filter every query by tenant, audit access, and rate-limit to prevent noisy-neighbor effects.
The RAG pipeline: Ingest → Chunk → Embed → Index → Retrieve → Generate, with tenant context at every stage.
Major clouds (AWS Bedrock Knowledge Bases, Azure OpenAI On Your Data) offer managed RAG services; evaluate build vs. buy carefully.

What Is Multi-Tenant RAG

RAG (Retrieval-Augmented Generation) lets LLMs reason over proprietary data by retrieving relevant context before generating responses. Multi-tenant RAG does this for multiple customers sharing infrastructure:

Aspect	Single-Tenant RAG	Multi-Tenant RAG
Data isolation	Inherent	Must be enforced
Cost	Higher per customer	Shared across customers
Management	Simple	Complex
Scaling	Linear	Economies of scale
Customization	Full	Per-tenant configuration

Isolation Models

Silo Model: Separate Index Per Tenant

Tenant A ──► Index A ──► LLM
Tenant B ──► Index B ──► LLM
Tenant C ──► Index C ──► LLM

Pros	Cons
Strongest isolation	Higher cost
Independent scaling	More infrastructure
Tenant-specific tuning	Management overhead
Simpler compliance	Resource underutilization

Best for: Enterprise customers, regulated industries, high-value accounts.

Pool Model: Shared Index with Filters

Tenant A ─┐
Tenant B ─┼──► Shared Index ──► LLM
Tenant C ─┘
          (with tenant_id filter)

Pros	Cons
Cost-efficient	Weaker isolation
Simpler management	Noisy neighbor risk
Better resource utilization	Compliance concerns
Easy onboarding	Limited customization

Best for: SMB customers, non-sensitive data, freemium tiers.

Bridge Model: Hybrid Approach

Enterprise A ──► Dedicated Index ──► LLM
SMB Tenants ──► Shared Index ────► LLM

Pros	Cons
Right-sized isolation	More complex routing
Tiered pricing support	Multiple code paths
Flexible growth	Migration complexity

Best for: Mixed customer base, tiered product offerings.

The Multi-Tenant RAG Pipeline

Stage 1: Ingestion

class TenantAwareIngestion:
    def ingest(
        self, 
        tenant_id: str, 
        document: Document, 
        config: TenantConfig
    ) -> IngestionResult:
        # Validate tenant permissions
        if not self.can_ingest(tenant_id, document.type):
            raise PermissionError(f"Tenant {tenant_id} cannot ingest {document.type}")
        
        # Apply tenant-specific extraction
        extracted = self.extract(
            document,
            config.extraction_settings,
        )
        
        # Chunk with tenant configuration
        chunks = self.chunk(
            extracted,
            chunk_size=config.chunk_size,
            overlap=config.chunk_overlap,
        )
        
        # Tag with tenant metadata
        for chunk in chunks:
            chunk.metadata['tenant_id'] = tenant_id
            chunk.metadata['ingested_at'] = now()
            chunk.metadata['document_id'] = document.id
        
        return IngestionResult(chunks=chunks, tenant_id=tenant_id)
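
The ingestion code above delegates to a chunk step that takes a chunk size and an overlap. As a minimal sketch of what that step might do, the function below splits text into fixed-size character windows that share an overlap. The name chunk_text and the character-based splitting are assumptions for illustration; the article does not specify how chunking is implemented, and real pipelines often split on tokens or sentences instead.

```python
from typing import List


def chunk_text(text: str, chunk_size: int, overlap: int) -> List[str]:
    """Split text into windows of chunk_size characters sharing `overlap` characters.

    Hypothetical helper: character-based splitting is an assumption, not the
    article's method.
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must satisfy 0 <= overlap < chunk_size")

    step = chunk_size - overlap  # how far each window advances
    chunks: List[str] = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # this window already reaches the end of the text
    return chunks


if __name__ == "__main__":
    print(chunk_text("abcdefghij", chunk_size=4, overlap=2))
```

Rejecting an overlap that is equal to or larger than the chunk size matters here: with a step of zero the loop would never advance.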

Stage 2: Embedding

class TenantAwareEmbedding:
    def embed(
        self, 
        chunks: List[Chunk], 
        tenant_id: str
    ) -> List[Vector]:
        # Get tenant embedding model (if customized)
        model = self.get_model(tenant_id)
        
        vectors = []
        for chunk in chunks:
            embedding = model.embed(chunk.text)
            vectors.append(Vector(
                id=chunk.id,
                values=embedding,
                metadata={
                    **chunk.metadata,
                    'tenant_id': tenant_id,  # Always include
                },
            ))
        
        return vectors

Stage 3: Indexing

class TenantAwareIndexing:
    def __init__(self, isolation_model: str):
        self.isolation_model = isolation_model
    
    def index(
        self, 
        vectors: List[Vector], 
        tenant_id: str
    ):
        if self.isolation_model == 'silo':
            # Dedicated index per tenant
            index = self.get_or_create_index(tenant_id)
            index.upsert(vectors)
            
        elif self.isolation_model == 'pool':
            # Shared index, tenant in metadata
            self.shared_index.upsert(vectors)
            
        elif self.isolation_model == 'bridge':
            # Route based on tenant tier
            if self.is_enterprise(tenant_id):
                index = self.get_or_create_index(tenant_id)
                index.upsert(vectors)
            else:
                self.shared_index.upsert(vectors)

Stage 4: Retrieval

class TenantAwareRetrieval:
    def retrieve(
        self, 
        query: str, 
        tenant_id: str,
        k: int = 5
    ) -> List[Chunk]:
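        # Embed the query with the same model used for this tenant's documents
        query_vector = self.get_model(tenant_id).embed(query)

        # Under 'bridge', enterprise tenants use a dedicated index, as in 'silo'
        dedicated = self.isolation_model == 'silo' or (
            self.isolation_model == 'bridge' and self.is_enterprise(tenant_id)
        )

        if dedicated:
            # Query tenant's dedicated index
            index = self.get_index(tenant_id)
            results = index.query(query_vector, top_k=k)

        elif self.isolation_model in ('pool', 'bridge'):
            # Query shared index WITH TENANT FILTER
            results = self.shared_index.query(
                query_vector,
                top_k=k,
                filter={'tenant_id': {'$eq': tenant_id}},  # Critical!
            )

        else:
            # Fail loudly instead of returning an unbound result
            raise ValueError(f"Unknown isolation model: {self.isolation_model}")

        return results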
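
To make the pool-model filter concrete, here is a toy in-memory shared index that restricts candidates to one tenant before ranking them. Everything in it (StoredVector, InMemoryPoolIndex, cosine_similarity) is a hypothetical stand-in for a real vector database, written only to show the filter-then-rank order.

```python
import math
from dataclasses import dataclass
from typing import Dict, List, Tuple


@dataclass
class StoredVector:
    id: str
    values: List[float]
    tenant_id: str


def cosine_similarity(a: List[float], b: List[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


class InMemoryPoolIndex:
    """Toy shared index (an assumption for illustration, not a real product API)."""

    def __init__(self) -> None:
        # Keyed by (tenant_id, vector id) so two tenants can reuse the same id
        self._vectors: Dict[Tuple[str, str], StoredVector] = {}

    def upsert(self, vectors: List[StoredVector]) -> None:
        for vector in vectors:
            if not vector.tenant_id:
                raise ValueError(f"vector {vector.id!r} has no tenant_id")
            self._vectors[(vector.tenant_id, vector.id)] = vector

    def query(
        self, query_vector: List[float], tenant_id: str, top_k: int = 5
    ) -> List[Tuple[str, float]]:
        if not tenant_id:
            raise ValueError("tenant_id is required for every query")
        # Filter first, then rank: other tenants' vectors are never scored
        candidates = [v for v in self._vectors.values() if v.tenant_id == tenant_id]
        scored = [(v.id, cosine_similarity(query_vector, v.values)) for v in candidates]
        scored.sort(key=lambda pair: pair[1], reverse=True)
        return scored[:top_k]


if __name__ == "__main__":
    index = InMemoryPoolIndex()
    index.upsert([
        StoredVector("doc-1", [1.0, 0.0], "tenant-a"),
        StoredVector("doc-1", [1.0, 0.0], "tenant-b"),
    ])
    print(index.query([1.0, 0.0], "tenant-a"))
```

Making tenant_id a required argument of query, instead of an optional filter, means a caller cannot forget it by accident.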

Stage 5: Generation

class TenantAwareGeneration:
    def generate(
        self, 
        query: str, 
        context: List[Chunk],
        tenant_id: str
    ) -> Response:
        # Get tenant prompt template
        template = self.get_template(tenant_id)
        
        # Build prompt with context
        prompt = template.format(
            query=query,
            context=self.format_context(context),
        )
        
        # Get tenant LLM configuration
        llm_config = self.get_llm_config(tenant_id)
        
        # Generate with tenant settings
        response = self.llm.generate(
            prompt,
            model=llm_config.model,
            temperature=llm_config.temperature,
            max_tokens=llm_config.max_tokens,
        )
        
        # Log for tenant
        self.log(tenant_id, query, response)
        
        return response

Security Implementation

Mandatory: Tenant Filter on Every Query

from typing import List, Optional

def query_with_tenant_filter(
    self,
    vector: List[float],
    tenant_id: str,
    additional_filters: Optional[dict] = None
) -> List[Result]:
    # Never allow a query without a tenant: reject a missing or empty tenant_id
    # before any filter is built
    if not tenant_id:
        raise SecurityError("Tenant filter bypassed")

    # ALWAYS include tenant filter
    base_filter = {'tenant_id': {'$eq': tenant_id}}

    if additional_filters:
        # Combine with AND so extra filters can only narrow this tenant's results
        combined_filter = {
            '$and': [base_filter, additional_filters]
        }
    else:
        combined_filter = base_filter

    return self.index.query(vector, filter=combined_filter)
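
One way to harden the filter-building step is to reject any caller-supplied filter that mentions tenant_id at all, at any nesting depth. The sketch below assumes the dictionary filter syntax used in this article ($eq, $and); the function names mentions_field and build_tenant_filter are hypothetical.

```python
from typing import Any, Dict, Optional


def mentions_field(filter_expr: Any, field: str) -> bool:
    """Return True if `field` appears as a key anywhere in a nested filter."""
    if isinstance(filter_expr, dict):
        return any(
            key == field or mentions_field(value, field)
            for key, value in filter_expr.items()
        )
    if isinstance(filter_expr, (list, tuple)):
        return any(mentions_field(item, field) for item in filter_expr)
    return False


def build_tenant_filter(
    tenant_id: str,
    additional_filters: Optional[Dict[str, Any]] = None,
) -> Dict[str, Any]:
    """Hypothetical helper: build a filter that is always scoped to one tenant."""
    if not isinstance(tenant_id, str) or not tenant_id:
        raise ValueError("tenant_id must be a non-empty string")

    base = {"tenant_id": {"$eq": tenant_id}}
    if not additional_filters:
        return base

    # Callers may narrow results but may not express their own tenant conditions
    if mentions_field(additional_filters, "tenant_id"):
        raise ValueError("caller filters may not reference tenant_id")
    return {"$and": [base, additional_filters]}


if __name__ == "__main__":
    print(build_tenant_filter("tenant-a", {"doc_type": {"$eq": "pdf"}}))
```

Because the tenant clause is joined with AND, a caller filter cannot widen the result set. Rejecting tenant_id in caller filters is an extra guard that also surfaces suspicious requests early.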

Per-Tenant Encryption

class TenantEncryption:
    def __init__(self, key_manager):
        self.key_manager = key_manager
    
    def encrypt_chunk(self, chunk: Chunk, tenant_id: str) -> EncryptedChunk:
        # Get tenant-specific key
        key = self.key_manager.get_key(tenant_id)
        
        # Encrypt chunk content
        encrypted_text = encrypt(chunk.text, key)
        
        return EncryptedChunk(
            id=chunk.id,
            encrypted_text=encrypted_text,
            metadata=chunk.metadata,  # Metadata can remain plain for filtering
        )
    
    def decrypt_chunk(self, encrypted: EncryptedChunk, tenant_id: str) -> Chunk:
        key = self.key_manager.get_key(tenant_id)
        text = decrypt(encrypted.encrypted_text, key)
        return Chunk(id=encrypted.id, text=text, metadata=encrypted.metadata)
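
The encryption class above depends on a key manager that exposes get_key, and the onboarding and offboarding code later calls create_key and delete_key. A minimal in-memory sketch of that interface follows. InMemoryKeyManager is a hypothetical name, the methods are synchronous for simplicity, and a production system would more likely delegate to a managed key service than hold keys in process memory.

```python
import secrets
from typing import Dict


class InMemoryKeyManager:
    """Illustrative per-tenant key store (assumption: not the article's implementation)."""

    def __init__(self) -> None:
        self._keys: Dict[str, bytes] = {}

    def create_key(self, tenant_id: str) -> bytes:
        if tenant_id in self._keys:
            raise ValueError(f"key already exists for tenant {tenant_id!r}")
        key = secrets.token_bytes(32)  # 256-bit random key, unique per tenant
        self._keys[tenant_id] = key
        return key

    def get_key(self, tenant_id: str) -> bytes:
        try:
            return self._keys[tenant_id]
        except KeyError:
            raise KeyError(f"no key for tenant {tenant_id!r}") from None

    def delete_key(self, tenant_id: str) -> None:
        # Once the key is gone, data encrypted under it can no longer be decrypted
        self._keys.pop(tenant_id, None)


if __name__ == "__main__":
    manager = InMemoryKeyManager()
    manager.create_key("tenant-a")
    print(len(manager.get_key("tenant-a")))
```

Keeping one key per tenant means that deleting a tenant's key at offboarding also renders any leftover encrypted copies of its chunks unreadable.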

Access Control

class TenantAccessControl:
    def can_read(self, user: User, document: Document) -> bool:
        # User must belong to document's tenant
        if user.tenant_id != document.tenant_id:
            return False
        
        # Check document-level permissions
        return user.has_permission('read', document)
    
    def can_ingest(self, user: User, tenant_id: str) -> bool:
        # User must belong to tenant
        if user.tenant_id != tenant_id:
            return False
        
        # Check role permissions
        return user.role in ['admin', 'editor']

Cost Management

Per-Tenant Metrics

class TenantUsageTracker:
    def track(
        self, 
        tenant_id: str, 
        operation: str,
        metrics: dict
    ):
        self.metrics_store.record({
            'tenant_id': tenant_id,
            'operation': operation,
            'timestamp': now(),
            **metrics,
        })
    
    def get_usage(self, tenant_id: str, period: str) -> Usage:
        records = self.metrics_store.query(
            tenant_id=tenant_id,
            period=period,
        )
        
        return Usage(
            embeddings_created=sum(r.get('embeddings', 0) for r in records),
            queries=sum(r.get('queries', 0) for r in records),
            tokens_used=sum(r.get('tokens', 0) for r in records),
            storage_bytes=self.get_storage(tenant_id),
        )

Rate Limiting

class TenantRateLimiter:
    def check(self, tenant_id: str, operation: str) -> bool:
        # Get tenant limits
        limits = self.get_limits(tenant_id)
        
        # Check current usage
        current = self.get_current_usage(tenant_id, operation)
        
        if current >= limits.get(operation, float('inf')):
            raise RateLimitExceeded(
                f"Tenant {tenant_id} exceeded {operation} limit"
            )
        
        # Increment usage
        self.increment(tenant_id, operation)
        return True
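
The limiter above checks usage and increments it in two separate steps, so two concurrent requests could both pass the check. As a sketch of one alternative, the token bucket below performs the check and the decrement under a single lock. TenantTokenBucket and its parameters are hypothetical, it returns a boolean instead of raising, and it only coordinates threads within one process; a multi-process deployment would need a shared store.

```python
import threading
import time
from typing import Callable, Dict, Tuple


class TenantTokenBucket:
    """Per-tenant, per-operation token bucket (illustrative sketch)."""

    def __init__(
        self,
        capacity: float,
        refill_per_second: float,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self.capacity = capacity
        self.refill_per_second = refill_per_second
        self._clock = clock  # injectable so the behavior can be tested without sleeping
        self._lock = threading.Lock()
        # (tenant_id, operation) -> (tokens remaining, time of last update)
        self._buckets: Dict[Tuple[str, str], Tuple[float, float]] = {}

    def allow(self, tenant_id: str, operation: str) -> bool:
        key = (tenant_id, operation)
        with self._lock:  # check and decrement happen atomically
            now = self._clock()
            tokens, last = self._buckets.get(key, (self.capacity, now))
            # Refill in proportion to elapsed time, never above capacity
            tokens = min(self.capacity, tokens + (now - last) * self.refill_per_second)
            if tokens < 1:
                self._buckets[key] = (tokens, now)
                return False
            self._buckets[key] = (tokens - 1, now)
            return True


if __name__ == "__main__":
    limiter = TenantTokenBucket(capacity=2, refill_per_second=1)
    print([limiter.allow("tenant-a", "query") for _ in range(3)])
```

Each tenant gets its own bucket per operation, so a burst from one tenant drains only that tenant's tokens.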

Tenant Onboarding/Offboarding

Onboarding

async def onboard_tenant(
    self, 
    tenant_id: str, 
    config: TenantConfig
) -> OnboardingResult:
    # Create tenant configuration
    await self.config_store.create(tenant_id, config)
    
    # Create tenant encryption key
    await self.key_manager.create_key(tenant_id)
    
    if config.isolation_model == 'silo':
        # Create dedicated index
        await self.index_manager.create_index(
            name=f"tenant-{tenant_id}",
            dimension=config.embedding_dimension,
        )
    
    # Initialize usage tracking
    await self.usage_tracker.initialize(tenant_id)
    
    return OnboardingResult(
        tenant_id=tenant_id,
        status='active',
    )

Offboarding

async def offboard_tenant(
    self, 
    tenant_id: str
) -> OffboardingResult:
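    # Look up the tenant's own isolation model (as stored at onboarding) so that
    # dedicated indexes are also removed when tenants use different models
    config = await self.config_store.get(tenant_id)

    # Delete all tenant data
    if config.isolation_model == 'silo':
        await self.index_manager.delete_index(f"tenant-{tenant_id}")
    else:
        await self.shared_index.delete(
            filter={'tenant_id': {'$eq': tenant_id}}
        )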
    
    # Delete encryption key
    await self.key_manager.delete_key(tenant_id)
    
    # Delete configuration
    await self.config_store.delete(tenant_id)
    
    # Archive usage records (for billing)
    await self.usage_tracker.archive(tenant_id)
    
    return OffboardingResult(tenant_id=tenant_id, status='deleted')

Implementation Checklist

Security

Architecture

Choose isolation model (silo/pool/bridge)
Design ingestion pipeline with tenant context
Implement tenant-aware retrieval
Configure per-tenant LLM settings
Set up usage tracking

Operations

FAQ

Which isolation model should I use?

Silo for enterprise, regulated, or high-value customers. Pool for SMB, freemium, or non-sensitive data. Bridge if you have both.

How do I prevent one tenant from seeing another’s data?

Always filter by tenant_id on queries. Validate this in code reviews. Write automated tests that attempt cross-tenant access.
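
An automated cross-tenant test can be sketched as a small checker that runs every query as every tenant and reports any returned chunk owned by someone else. The function find_cross_tenant_leaks, the retrieve signature, and the sample corpus are all assumptions for illustration; in a real suite the retrieve callable would wrap the actual retrieval path.

```python
from typing import Callable, Dict, Iterable, List

# Assumed shape: retrieve(query, tenant_id) returns chunks as dicts with
# "id" and "tenant_id" keys
RetrieveFn = Callable[[str, str], List[Dict[str, str]]]


def find_cross_tenant_leaks(
    retrieve: RetrieveFn,
    tenant_ids: Iterable[str],
    queries: Iterable[str],
) -> List[str]:
    """Run every query as every tenant and describe results owned by another tenant."""
    leaks: List[str] = []
    query_list = list(queries)
    for tenant_id in tenant_ids:
        for query in query_list:
            for chunk in retrieve(query, tenant_id):
                owner = chunk.get("tenant_id")
                if owner != tenant_id:
                    leaks.append(
                        f"{tenant_id!r} received chunk {chunk.get('id')!r} "
                        f"owned by {owner!r} for query {query!r}"
                    )
    return leaks


# Sample data and two fake retrievers, one correct and one deliberately broken
CORPUS = [
    {"id": "a-1", "tenant_id": "tenant-a", "text": "alpha pricing"},
    {"id": "b-1", "tenant_id": "tenant-b", "text": "beta pricing"},
]


def filtered_retrieve(query: str, tenant_id: str) -> List[Dict[str, str]]:
    return [c for c in CORPUS if c["tenant_id"] == tenant_id and query in c["text"]]


def unfiltered_retrieve(query: str, tenant_id: str) -> List[Dict[str, str]]:
    return [c for c in CORPUS if query in c["text"]]  # bug: ignores tenant_id


if __name__ == "__main__":
    tenants = ["tenant-a", "tenant-b"]
    print(find_cross_tenant_leaks(filtered_retrieve, tenants, ["pricing"]))
    print(find_cross_tenant_leaks(unfiltered_retrieve, tenants, ["pricing"]))
```

Using queries that match documents in several tenants, as "pricing" does here, is what gives the test a chance to catch a missing filter.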

What about performance with many tenants?

The Pool model scales to many tenants more cheaply because resources are shared, but tenants can affect each other's latency. The Silo model costs more per tenant but performs more predictably. Consider caching frequently accessed chunks.
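
If you add a cache, its keys need tenant scoping too; otherwise a cached result for one tenant could be served to another. The sketch below keys entries by tenant, normalized query, and result count. TenantScopedCache and its methods are hypothetical names, and the unbounded in-memory dictionary is a simplification.

```python
from typing import Callable, Dict, List, Tuple


def normalize_query(query: str) -> str:
    """Lowercase and collapse whitespace so trivially different queries share an entry."""
    return " ".join(query.lower().split())


class TenantScopedCache:
    """Retrieval-result cache whose keys always include the tenant (illustrative sketch)."""

    def __init__(self) -> None:
        self._store: Dict[Tuple[str, str, int], List[str]] = {}

    def get_or_compute(
        self,
        tenant_id: str,
        query: str,
        top_k: int,
        compute: Callable[[], List[str]],
    ) -> List[str]:
        key = (tenant_id, normalize_query(query), top_k)
        if key not in self._store:
            self._store[key] = compute()  # cache miss: run the real retrieval
        return self._store[key]

    def invalidate_tenant(self, tenant_id: str) -> None:
        # Call after a tenant ingests or deletes documents, or is offboarded
        for key in [k for k in self._store if k[0] == tenant_id]:
            del self._store[key]


if __name__ == "__main__":
    cache = TenantScopedCache()
    print(cache.get_or_compute("tenant-a", "Refund policy", 5, lambda: ["chunk-1"]))
```

Putting tenant_id first in a tuple key keeps tenants in separate key spaces and makes per-tenant invalidation a simple scan.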

Should I use managed services or build my own?

Use managed services (AWS Bedrock Knowledge Bases, Azure OpenAI On Your Data) if you want faster time-to-market. Build your own if you need custom isolation, pricing, or features.

How do I handle tenant-specific customization?

Store per-tenant configuration: chunking strategy, embedding model, prompt templates, LLM settings. Apply at each pipeline stage.
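
A per-tenant configuration record might look like the sketch below. The field names follow the attributes this article's code reads from TenantConfig (isolation_model, chunk_size, chunk_overlap, embedding_dimension, extraction_settings); the default values and the validation rules are placeholders chosen for illustration.

```python
from dataclasses import dataclass, field, replace
from typing import Any, Dict

ISOLATION_MODELS = {"silo", "pool", "bridge"}


@dataclass(frozen=True)
class TenantConfig:
    # Defaults are hypothetical placeholders, not recommendations
    isolation_model: str = "pool"
    chunk_size: int = 800
    chunk_overlap: int = 100
    embedding_dimension: int = 1536
    extraction_settings: Dict[str, Any] = field(default_factory=dict)

    def __post_init__(self) -> None:
        # Validate on construction so a bad override fails at onboarding time
        if self.isolation_model not in ISOLATION_MODELS:
            raise ValueError(f"unknown isolation model: {self.isolation_model!r}")
        if self.chunk_size <= 0:
            raise ValueError("chunk_size must be positive")
        if not 0 <= self.chunk_overlap < self.chunk_size:
            raise ValueError("chunk_overlap must satisfy 0 <= overlap < chunk_size")


if __name__ == "__main__":
    defaults = TenantConfig()
    # replace() copies the defaults, applies overrides, and re-runs validation
    enterprise = replace(defaults, isolation_model="silo", chunk_size=400, chunk_overlap=50)
    print(enterprise)
```

Starting from shared defaults and applying only the overrides a tenant asks for keeps most tenants on a common, well-tested configuration.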

What’s the cost model for multi-tenant RAG?

Track embeddings created, storage used, queries made, and tokens consumed per tenant. Price based on usage or tiers.
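
As a worked example of usage-based pricing, the sketch below turns the four tracked quantities into a monthly charge. The Usage fields mirror the ones returned by the usage tracker earlier in this article; the unit prices are invented for the arithmetic and are not real vendor rates.

```python
from dataclasses import dataclass
from decimal import Decimal


@dataclass(frozen=True)
class Usage:
    embeddings_created: int
    queries: int
    tokens_used: int
    storage_bytes: int


# Hypothetical unit prices in USD, for illustration only
PRICE_PER_1K_EMBEDDINGS = Decimal("0.10")
PRICE_PER_1K_QUERIES = Decimal("0.50")
PRICE_PER_1M_TOKENS = Decimal("2.00")
PRICE_PER_GB_MONTH = Decimal("0.25")

BYTES_PER_GB = 1024 ** 3


def monthly_charge(usage: Usage) -> Decimal:
    """Sum the four usage components and round to cents."""
    total = (
        Decimal(usage.embeddings_created) / 1000 * PRICE_PER_1K_EMBEDDINGS
        + Decimal(usage.queries) / 1000 * PRICE_PER_1K_QUERIES
        + Decimal(usage.tokens_used) / 1_000_000 * PRICE_PER_1M_TOKENS
        + Decimal(usage.storage_bytes) / BYTES_PER_GB * PRICE_PER_GB_MONTH
    )
    return total.quantize(Decimal("0.01"))


if __name__ == "__main__":
    usage = Usage(
        embeddings_created=10_000,
        queries=2_000,
        tokens_used=5_000_000,
        storage_bytes=2 * BYTES_PER_GB,
    )
    # 1.00 (embeddings) + 1.00 (queries) + 10.00 (tokens) + 0.50 (storage)
    print(monthly_charge(usage))  # 12.50
```

Decimal is used instead of float so that billing amounts do not pick up binary rounding errors.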

