
Project 2: Marketing / CDMO Internal Search Engine

Implementation Ideas for IT Head Discussion


1. Executive Summary

Build an on-premise, AI-powered search engine that enables the marketing and CDMO business development teams to instantly query unstructured internal documents (Excel, Word, PDF) stored across SharePoint and local drives. The system functions like an internal ChatGPT — users ask natural language questions and receive answers with full document traceability (source file citations and direct links to open the original document).

Key constraint: The system must be entirely air-gapped from the internet — no external API calls, no cloud LLMs, no data leaving Divi's infrastructure.


2. Current State Assessment

| Aspect | Current State |
| --- | --- |
| Data Locations | SharePoint, OneDrive, local drives |
| Formats | Excel, Word, PDF (rarely images) |
| Data Nature | Unstructured — meeting notes, financial reports, trade show records, customer research, FDA pipeline data |
| Users | Marketing & BD team (junior associates + senior managers) |
| Current Process | Manual document searching — can take 1–2 weeks for a single question |
| Access Model | No hierarchy needed — all marketing data is accessible to the whole team |
| External Data | Contains publicly sourced data (financial reports, FDA databases) already saved internally |
| Key Constraint | No internet connectivity — strictly internal documents only |

Types of Information Stored

  • Commercial customers (~50 in various project stages)
  • Target/wish list companies (~50 companies with responsible BD contacts mapped)
  • Molecule tracking — which molecules are being pursued, what stage, patent expiry data from FDA database
  • Meeting notes & summaries — from trade shows, customer visits, unit visits
  • Export/import data — tracking competitor supply of molecules to the same customers
  • Customer interaction history — who met whom, when, what was discussed
  • Financial reports — revenue data for public companies (stored from public sources)
  • Product pipeline information — from stored customer/industry reports

3. Technical Architecture: RAG (Retrieval-Augmented Generation)

The recommended approach is a RAG architecture — combining vector search over indexed documents with a locally-hosted LLM for natural language question answering.

┌──────────────────────────────────────────────────────────────┐
│                  Divi's Internal Network                     │
│                                                              │
│  ┌───────────────┐                                           │
│  │ SharePoint /  │──┐                                        │
│  │ OneDrive      │  │    ┌──────────────────────┐            │
│  └───────────────┘  ├───▶│  Document Processor  │            │
│  ┌───────────────┐  │    │ (Parsing + Chunking) │            │
│  │ Local Drives  │──┘    └───────────┬──────────┘            │
│  └───────────────┘                   │                       │
│                                      ▼                       │
│                           ┌─────────────────────┐            │
│                           │  Embedding Model    │            │
│                           │  (On-Premise)       │            │
│                           └──────────┬──────────┘            │
│                                      │                       │
│                           ┌──────────▼──────────┐            │
│                           │  Vector Database    │            │
│                           │ (ChromaDB / Milvus) │            │
│                           └──────────┬──────────┘            │
│                                      │                       │
│   User Query ──▶ ┌───────────────────▼─────────────────┐     │
│                  │       RAG Orchestrator              │     │
│                  │  1. Embed query                     │     │
│                  │  2. Search vector DB for relevant   │     │
│                  │     document chunks                 │     │
│                  │  3. Pass chunks + query to LLM      │     │
│                  │  4. Return answer + source citations│     │
│                  └───────────────────┬─────────────────┘     │
│                                      │                       │
│                           ┌──────────▼──────────┐            │
│                           │  Local LLM          │            │
│                           │ (Llama 3 / Mistral) │            │
│                           └──────────┬──────────┘            │
│                                      │                       │
│                           ┌──────────▼──────────┐            │
│                           │  Chat UI (Web App)  │            │
│                           │ Answer + Source Docs│            │
│                           └─────────────────────┘            │
│                                                              │
│   No external network access                                 │
└──────────────────────────────────────────────────────────────┘

4. Component Details

4.1 Document Processing Pipeline

Parsing by Format:

| Format | Library / Tool | Extraction Strategy |
| --- | --- | --- |
| Excel | openpyxl / pandas | Convert each sheet to structured text; preserve table headers and row relationships |
| Word | python-docx | Extract paragraphs, tables, headers; maintain document structure |
| PDF | pdfplumber / PyMuPDF | Extract text with layout preservation; handle tables via tabula-py |
| Images | Tesseract OCR (if needed) | OCR for any scanned documents or embedded images |
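As a sketch of the Excel strategy above: flattening each row into "Header: value" text keeps the header-to-cell relationship intact after chunking. The DataFrame here is toy data standing in for a real workbook; a production pipeline would load every sheet via pd.read_excel(path, sheet_name=None).

```python
import pandas as pd

def sheet_to_text(df: pd.DataFrame, sheet_name: str) -> str:
    """Flatten a worksheet into 'Header: value' lines so each row
    remains self-describing after chunking."""
    lines = [f"Sheet: {sheet_name}"]
    for idx, row in df.iterrows():
        cells = [f"{col}: {row[col]}" for col in df.columns if pd.notna(row[col])]
        lines.append(f"Row {idx + 2}: " + " | ".join(cells))  # +2: header is row 1
    return "\n".join(lines)

# Toy data standing in for a real workbook.
df = pd.DataFrame({
    "Customer": ["GSK", "Novartis"],
    "Molecule": ["Compound X", "Intermediate A"],
    "Stage": ["Trial shipment", "Pricing discussion"],
})
print(sheet_to_text(df, "Customer Tracker"))
```

Keeping the sheet name and row number in the text also makes the later citation step ("Sheet 3, Rows 15-45") cheap to produce.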

Chunking Strategy:

  • Split documents into overlapping chunks of ~500–1000 tokens
  • Preserve document metadata with each chunk: filename, file path, page/sheet number, section heading, last modified date
  • Use semantic chunking where possible (split on section boundaries rather than arbitrary token counts)
  • For Excel: treat each sheet or logical table as a separate chunk with full context
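A minimal sketch of the overlap rule above, using a word window as a dependency-free stand-in for true token counting (production code would count tokens with the embedding model's tokenizer):

```python
def chunk_words(text: str, chunk_size: int = 200, overlap: int = 40) -> list[str]:
    """Split text into overlapping word-window chunks so facts that straddle
    a boundary appear intact in at least one chunk."""
    words = text.split()
    if not words:
        return []
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + chunk_size]))
        if start + chunk_size >= len(words):
            break
    return chunks

text = " ".join(f"w{i}" for i in range(500))
chunks = chunk_words(text, chunk_size=200, overlap=40)
print(len(chunks))           # → 3
print(chunks[1].split()[0])  # → w160 (second chunk starts 160 words in)
```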

Metadata to Store per Chunk:

{
  "chunk_id": "uuid",
  "text": "extracted text content...",
  "source_file": "Customer_Interactions_Q4_2025.xlsx",
  "source_path": "//sharepoint/marketing/cdmo/reports/...",
  "file_type": "excel",
  "sheet_name": "GSK Meetings",
  "page_or_section": "Sheet 3, Rows 15-45",
  "last_modified": "2025-12-15T10:30:00",
  "business_segment": "CDMO",
  "tags": ["customer:GSK", "molecule:compound-x"]
}

4.2 Embedding Model (On-Premise)

| Option | Parameters | Performance | Hardware Needed |
| --- | --- | --- | --- |
| all-MiniLM-L6-v2 (Sentence-Transformers) | 22M | Good general-purpose, fast | CPU only |
| bge-large-en-v1.5 (BAAI) | 335M | Excellent retrieval accuracy | GPU recommended |
| nomic-embed-text | 137M | Strong, 8K context window | CPU or light GPU |
| e5-large-v2 (Microsoft) | 335M | Top-tier retrieval | GPU recommended |

Recommendation: Start with all-MiniLM-L6-v2 for the mock-up (runs on CPU), move to bge-large-en-v1.5 in production for better accuracy.

All models are open-source, downloadable, and run entirely offline.
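Retrieval quality ultimately rests on cosine similarity between the query vector and each chunk vector. The toy 4-dimensional vectors below stand in for the 384-dimensional vectors all-MiniLM-L6-v2 produces (in production, via sentence-transformers: `SentenceTransformer("all-MiniLM-L6-v2").encode(texts)`); the ranking logic is the same:

```python
import numpy as np

def cosine_sim(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine similarity: 1.0 = identical direction, 0.0 = unrelated."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Toy vectors standing in for real embeddings.
query = np.array([1.0, 0.2, 0.0, 0.1])
chunks = {
    "GSK meeting notes":   np.array([0.9, 0.3, 0.1, 0.0]),
    "Patent expiry sheet": np.array([0.0, 0.1, 1.0, 0.8]),
}
# Rank chunks by similarity to the query, best first.
ranked = sorted(chunks, key=lambda k: cosine_sim(query, chunks[k]), reverse=True)
print(ranked[0])  # → GSK meeting notes
```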

4.3 Vector Database (On-Premise)

| Option | Pros | Cons |
| --- | --- | --- |
| ChromaDB | Easy setup, Python-native, good for POC | Less scalable at very large volume |
| Milvus | Production-grade, highly scalable | More complex to deploy |
| Qdrant | Fast, Rust-based, good filtering | Newer, smaller community |
| pgvector (PostgreSQL extension) | Reuse PostgreSQL, SQL-based | Slightly less optimized for pure vector search |

Recommendation: ChromaDB for prototyping → Milvus or Qdrant for production.
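To make the storage contract concrete, here is a pure-Python stand-in that mirrors the add/query shape a vector DB such as ChromaDB exposes. It is illustrative only, not the ChromaDB API itself; the point is what the RAG pipeline expects from storage: add chunks with IDs, embeddings, and metadata, then query by vector and get back ranked matches with their metadata.

```python
import numpy as np

class MiniVectorStore:
    """Toy in-memory vector store illustrating the add/query contract."""

    def __init__(self):
        self.ids, self.vecs, self.metas = [], [], []

    def add(self, ids, embeddings, metadatas):
        self.ids += list(ids)
        self.vecs += [np.asarray(e, dtype=float) for e in embeddings]
        self.metas += list(metadatas)

    def query(self, query_embedding, n_results=5):
        q = np.asarray(query_embedding, dtype=float)
        scores = [float(np.dot(q, v) / (np.linalg.norm(q) * np.linalg.norm(v)))
                  for v in self.vecs]
        order = sorted(range(len(scores)), key=scores.__getitem__, reverse=True)
        return [{"id": self.ids[i], "metadata": self.metas[i], "score": scores[i]}
                for i in order[:n_results]]

store = MiniVectorStore()
store.add(ids=["c1", "c2"],
          embeddings=[[1.0, 0.0], [0.0, 1.0]],
          metadatas=[{"source_file": "Customer_Tracker.xlsx"},
                     {"source_file": "Visit_Report_Jan2025.pdf"}])
hits = store.query([0.9, 0.1], n_results=1)
print(hits[0]["metadata"]["source_file"])  # → Customer_Tracker.xlsx
```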

4.4 Local LLM (On-Premise, Air-Gapped)

| Model | Parameters | Quality | Hardware |
| --- | --- | --- | --- |
| Llama 3.1 8B | 8B | Good for Q&A, fast | 1× GPU (16GB VRAM) |
| Llama 3.1 70B | 70B | Excellent quality | 2–4× A100 GPUs, or quantized on 1× 48GB GPU |
| Mistral 7B | 7B | Strong reasoning, fast | 1× GPU (16GB VRAM) |
| Mixtral 8x7B | 46.7B (MoE) | Excellent, near GPT-3.5 | 2× 24GB GPUs |
| Phi-3 Medium | 14B | Strong for size | 1× 24GB GPU |

Serving Framework: Use Ollama, vLLM, or llama.cpp to host the LLM locally.

Recommendation:

  • Mock-up: Mistral 7B via Ollama (easy setup, runs on a single GPU)
  • Production: Llama 3.1 70B quantized (Q4) or Mixtral 8x7B for best answer quality
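A sketch of calling the locally served model through Ollama's REST endpoint (by default http://localhost:11434/api/generate, with "stream": false for a single JSON response). The prompt wording and helper names here are our assumptions, not part of Ollama:

```python
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434/api/generate"  # Ollama's default local endpoint

def build_payload(question: str, context: str, model: str = "mistral") -> dict:
    """Build a non-streaming generation request for Ollama's /api/generate."""
    prompt = (
        "Answer ONLY from the context below. Cite the source file for each fact.\n\n"
        f"Context:\n{context}\n\nQuestion: {question}"
    )
    return {"model": model, "prompt": prompt, "stream": False}

def ask_local_llm(question: str, context: str) -> str:
    """Send one request to the local Ollama server and return the answer text.
    Requires a running `ollama serve` with the model pulled (`ollama pull mistral`)."""
    data = json.dumps(build_payload(question, context)).encode()
    req = urllib.request.Request(OLLAMA_URL, data=data,
                                 headers={"Content-Type": "application/json"})
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read())["response"]

payload = build_payload("Have we met Novartis?", "[Customer_Tracker.xlsx] ...")
print(payload["prompt"][:40])
```

Because only the localhost socket is used, this pattern is fully air-gap compatible.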

4.5 RAG Orchestration

Use LangChain or LlamaIndex as the orchestration framework:

User Question
     │
     ▼
┌──────────────────┐
│ Query Processing │  ← Rephrase/expand query for better retrieval
└────────┬─────────┘
         │
         ▼
┌──────────────────┐
│ Vector Search    │  ← Retrieve top-K relevant chunks (K=5-10)
│ + Metadata Filter│  ← Optional: filter by segment, customer, date range
└────────┬─────────┘
         │
         ▼
┌──────────────────┐
│ Re-Ranking       │  ← Re-rank retrieved chunks for relevance (optional)
└────────┬─────────┘
         │
         ▼
┌──────────────────┐
│ LLM Generation   │  ← Generate answer from chunks + question
│ with Citations   │  ← Force model to cite source document for each fact
└────────┬─────────┘
         │
         ▼
┌──────────────────┐
│ Response + Links │  ← Answer text + clickable links to source files
└──────────────────┘

Citation Mechanism:

  • Each answer must include inline citations: "According to [Customer_Interactions_Q4_2025.xlsx, Sheet: GSK Meetings]..."
  • Citations are clickable — opening the source file directly (or linking to the SharePoint location)
  • The system prompt instructs the LLM to answer only from the provided context, never hallucinate
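A sketch of how the grounding instructions and citation tags can be assembled from the retrieved chunks before they reach the LLM; the exact wording is an assumption to be tuned during testing:

```python
def build_grounded_prompt(question: str, chunks: list[dict]) -> str:
    """Assemble the LLM prompt: instructions, citation-tagged context, question.
    Each chunk is labelled with its citation tag so the model can quote it."""
    header = (
        "You are an internal search assistant. Answer ONLY from the context "
        "below. After every fact, cite its source in square brackets. If the "
        "context does not contain the answer, say so."
    )
    parts = [header, "", "Context:"]
    for c in chunks:
        tag = f'[{c["source_file"]}, {c["page_or_section"]}]'
        parts.append(f"{tag}\n{c['text']}")
    parts += ["", f"Question: {question}"]
    return "\n".join(parts)

chunks = [{"source_file": "Export_Tracking_2025.xlsx",
           "page_or_section": "Sheet: Competitor Analysis",
           "text": "Company ABC shipped 500kg of Molecule X to GSK (Q3 2025)."}]
prompt = build_grounded_prompt("Is anyone else supplying Molecule X to GSK?", chunks)
print(prompt.splitlines()[3])  # → [Export_Tracking_2025.xlsx, Sheet: Competitor Analysis]
```

Because the tags fed into the prompt are exactly the metadata stored per chunk, the UI can later match bracketed citations in the answer back to real files.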


5. User Interface

5.1 Chat Interface

A web-based chat application (similar to ChatGPT) served on the internal network:

Core Features:

  • Natural language question input
  • Streaming answer display
  • Source document citations with clickable file links
  • Conversation history (for follow-up questions)
  • Ability to filter search scope (e.g., "search only CDMO documents" or "search only GSK-related files")

Technology:

  • Frontend: React or simple HTML/JS with a chat UI
  • Backend: Python FastAPI serving the RAG pipeline
  • Hosting: On-premise, internal network only

5.2 Example Interactions

Query 1: "Is anyone else supplying Molecule X to GSK?"

Based on the export data report [Export_Tracking_2025.xlsx, Sheet: Competitor Analysis], Company ABC shipped 500kg of Molecule X to GSK's facility in Ireland in Q3 2025. Additionally, [Market_Intelligence_CDMO.pdf, Page 12] notes that a second supplier from India began trial shipments in early 2025.

Sources:

  • 📄 Export_Tracking_2025.xlsx — Sheet: Competitor Analysis, Row 47
  • 📄 Market_Intelligence_CDMO.pdf — Page 12

Query 2: "Have we met Novartis before? What did we discuss?"

Yes. Divi's met with Novartis representatives on three occasions:

  1. CPhI Barcelona 2024 — Discussed potential supply of Intermediates A and B. Contact: John Smith, Head of Procurement. [Meeting_Notes_CPhI_2024.docx, Section: Novartis]
  2. Unit Visit, Jan 2025 — Novartis team visited the Unit 2 facility. Discussed manufacturing capacity and quality systems. [Visit_Report_Jan2025.pdf, Pages 3–5]
  3. Email Exchange, Mar 2025 — Follow-up on pricing for Intermediate A. No agreement reached. [Customer_Tracker.xlsx, Sheet: Novartis]

Sources:

  • 📄 Meeting_Notes_CPhI_2024.docx
  • 📄 Visit_Report_Jan2025.pdf
  • 📄 Customer_Tracker.xlsx

Query 3: "What molecules are on our wish list with patent expiry in the next 3 years?"

Based on [Patent_Expiry_Tracker.xlsx], the following molecules on the wish list have patent expiry before 2029:

| Molecule | Patent Expiry | Target Customer | BD Contact |
| --- | --- | --- | --- |
| Compound A | Jun 2027 | Teva | Ravi K. |
| Compound B | Dec 2027 | Mylan | Priya S. |
| Compound C | Mar 2028 | Sun Pharma | Anil M. |

Source: 📄 Patent_Expiry_Tracker.xlsx — Sheet: Wish List


6. Document Traceability Implementation

This was explicitly requested — every answer must trace back to its source.

How it works:

  1. At ingestion time: Every document chunk is tagged with full file metadata (path, name, page/section, last modified)
  2. At query time: The RAG pipeline returns the top-K chunks along with their metadata
  3. At display time: The UI renders source citations as clickable links that:
       • Open the file in SharePoint (if web-accessible internally)
       • Or show the file path for the user to navigate to manually
       • Highlight the specific section/page where the answer was found
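A sketch of the display-time step: turning stored chunk metadata into a clickable reference. The SharePoint base URL is a placeholder for whatever the internal site actually uses, and the `//sharepoint/` path prefix mirrors the example metadata above:

```python
SHAREPOINT_BASE = "https://sharepoint.internal"  # hypothetical internal host

def citation_link(meta: dict) -> str:
    """Turn chunk metadata into a clickable reference: a SharePoint URL when
    the file lives there, otherwise the raw file path for manual navigation."""
    path = meta["source_path"]
    label = f'{meta["source_file"]} ({meta["page_or_section"]})'
    if path.startswith("//sharepoint/"):
        return f'{label} -> {SHAREPOINT_BASE}{path[len("//sharepoint"):]}'
    return f'{label} -> {path}'

meta = {"source_file": "Visit_Report_Jan2025.pdf",
        "source_path": "//sharepoint/marketing/cdmo/Visit_Report_Jan2025.pdf",
        "page_or_section": "Pages 3-5"}
print(citation_link(meta))
```

A UNC path (\\server\share\...) would fall through to the second branch and be shown as-is for the user to open manually.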

Verification workflow:

User asks question → AI answers with citations → User clicks source link → 
Opens original document → User verifies → Confidence in answer


7. Security Architecture

┌──────────────────────────────────────────────────────────┐
│                Divi's Internal Network                   │
│                                                          │
│  ┌─────────────────────────────────────────────────────┐ │
│  │                Application Server                   │ │
│  │                                                     │ │
│  │  ┌──────────┐ ┌───────────┐ ┌─────────────────────┐ │ │
│  │  │ Chat UI  │ │ FastAPI   │ │ LangChain/LlamaIndex│ │ │
│  │  │ (React)  │ │ Backend   │ │ RAG Pipeline        │ │ │
│  │  └──────────┘ └───────────┘ └─────────────────────┘ │ │
│  └─────────────────────────┬───────────────────────────┘ │
│                            │                             │
│  ┌─────────────────────────▼───────────────────────────┐ │
│  │                 GPU Server                          │ │
│  │  ┌──────────────────┐  ┌──────────────────────────┐ │ │
│  │  │ Embedding Model  │  │ LLM (Llama 3 / Mistral)  │ │ │
│  │  │ (all-MiniLM)     │  │ via Ollama / vLLM        │ │ │
│  │  └──────────────────┘  └──────────────────────────┘ │ │
│  └─────────────────────────────────────────────────────┘ │
│                                                          │
│  ┌─────────────────────────────────────────────────────┐ │
│  │  Vector DB (ChromaDB/Milvus) + PostgreSQL metadata  │ │
│  └─────────────────────────────────────────────────────┘ │
│                                                          │
│  ✗ NO outbound internet access                           │
│  ✗ NO cloud API calls                                    │
│  ✓ All models downloaded once and run offline            │
│  ✓ All data stays within Divi's network                  │
│  ✓ Access via internal network / VPN only                │
└──────────────────────────────────────────────────────────┘

Air-Gap Compliance:

  • All AI models (embedding + LLM) are open-source and downloaded once during setup
  • No runtime internet dependency — everything runs locally
  • No telemetry, no external logging, no cloud sync
  • Model weights stored on the same on-premise server
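Offline behaviour can be enforced in code as well as at the network layer. The environment switches below (which must be set before the AI libraries are imported) are the documented ones for the Hugging Face stack and ChromaDB at the time of writing; verify the exact names against the versions actually deployed:

```python
import os

# Set BEFORE importing sentence-transformers / transformers / chromadb.
os.environ["HF_HUB_OFFLINE"] = "1"            # Hugging Face Hub: no network calls
os.environ["TRANSFORMERS_OFFLINE"] = "1"      # transformers: use local files only
os.environ["ANONYMIZED_TELEMETRY"] = "False"  # ChromaDB: disable usage telemetry

print(os.environ["HF_HUB_OFFLINE"])  # → 1
```

These are belt-and-braces measures: the server firewall blocking all outbound traffic remains the primary control.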


8. Development Phases

Phase 1: Mock-Up with Synthetic Data (Weeks 1–3)

| Task | Details |
| --- | --- |
| Generate synthetic documents | Create fake Excel, Word, PDF files mimicking real document structures (fake company names, molecules, meeting notes) |
| Set up RAG pipeline | ChromaDB + all-MiniLM + Mistral 7B via Ollama |
| Build basic chat UI | Simple web interface with question input and cited answers |
| Demo to client | Show 5–10 example queries with source citations |

Deliverable: Working chatbot answering questions from synthetic documents with source traceability.

Phase 2: Document Integration & Tuning (Weeks 4–7)

| Task | Details |
| --- | --- |
| Connect to SharePoint / actual document sources | Ingest all marketing and CDMO documents |
| Optimize chunking strategy | Tune chunk sizes and overlap for best retrieval accuracy |
| Fine-tune retrieval | Test and optimize embedding model choice, re-ranking |
| Expand metadata tagging | Auto-tag documents by segment, customer, molecule where possible |
| Test with real queries | Work with BD team to validate answers against known facts |

Deliverable: System connected to real documents with validated accuracy.

Phase 3: Production Hardening (Weeks 8–10)

| Task | Details |
| --- | --- |
| Deploy production LLM | Upgrade to Llama 3.1 70B or Mixtral for better answer quality |
| Implement scheduled re-indexing | Auto-detect new/modified documents in SharePoint and re-index |
| Add conversation memory | Support follow-up questions in the same chat session |
| Add search filters | Allow scoping by segment, customer, date range |
| User testing | Train marketing team, collect feedback, iterate |
| Documentation | Admin guide, user guide |

Deliverable: Production-ready internal search engine deployed on-premise.
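The scheduled re-indexing task can start as a simple modification-time scan: anything newer than the last indexing run gets re-chunked and re-embedded. A minimal sketch (the directory and file names are throwaway demo data):

```python
import os
import tempfile
import time
from pathlib import Path

def files_needing_reindex(root: str, last_index_time: float) -> list[str]:
    """Return documents modified since the previous indexing run; a scheduled
    job would re-chunk and re-embed only these."""
    exts = {".xlsx", ".docx", ".pdf"}
    return [str(p) for p in Path(root).rglob("*")
            if p.suffix.lower() in exts and p.stat().st_mtime > last_index_time]

# Demo on a throwaway directory.
with tempfile.TemporaryDirectory() as root:
    old = Path(root, "old_report.pdf"); old.write_bytes(b"x")
    os.utime(old, (0, 0))                       # pretend it was indexed long ago
    new = Path(root, "new_notes.docx"); new.write_bytes(b"y")
    changed = files_needing_reindex(root, last_index_time=time.time() - 3600)
    print([Path(p).name for p in changed])      # → ['new_notes.docx']
```

Against a mounted SharePoint file share this scan can run nightly; deletions and renames would additionally need a comparison against the IDs already in the vector index.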


9. Key Questions for IT Head

  1. GPU Hardware: Is there an existing GPU server on-premise, or does one need to be procured? Minimum requirement: 1× NVIDIA GPU with 24GB VRAM (e.g., RTX 4090, A5000, or A100).
  2. SharePoint API Access: Can we get read-only API access to the marketing team's SharePoint document libraries? Or should we use a mounted file share?
  3. Document Volume: Approximately how many total documents are we indexing? (Rough estimate — hundreds, thousands?)
  4. Network Isolation: Can we set up a dedicated internal server/VM that has NO outbound internet access at all (full air-gap)?
  5. User Authentication: Should the chat interface use AD/LDAP login, or is it open to anyone on the internal network?
  6. File Protocol: When users click a source citation, should it open via a SharePoint URL, a UNC file path (\\server\share\...), or a mapped drive letter?
  7. Existing IT Infrastructure: Is Docker / Kubernetes available on-premise? Or should we plan for bare-metal deployment?
  8. Budget for Hardware: If no GPU server exists, is there budget for procuring one? (Approximate cost: $5K–$15K for a workstation-class GPU server, or $30K+ for an enterprise A100 setup)

10. Estimated Resource Requirements

| Resource | Estimate |
| --- | --- |
| AI/ML Engineer | 1 (RAG pipeline, LLM setup, embedding tuning) |
| Backend Developer | 1 (API, document ingestion, SharePoint integration) |
| GPU Server | 1× with minimum 24GB VRAM (48GB+ preferred for larger LLMs) |
| Application Server | 1 VM (8+ cores, 32GB RAM) for API + vector DB |
| Storage | 500GB–1TB for documents + vector index + model weights |
| Timeline | ~10 weeks to full production |

11. Why RAG over Traditional Search

| Feature | Traditional Search (Elasticsearch) | RAG (Vector + LLM) |
| --- | --- | --- |
| Query type | Keywords only | Natural language questions |
| Answer format | List of matching documents | Direct answer + sources |
| Understanding | Exact string matching | Semantic understanding |
| Cross-document synthesis | No | Yes — combines info from multiple files |
| Setup complexity | Lower | Higher (needs GPU for LLM) |
| Handles unstructured data | Partial | Excellent |

RAG is the right choice given the client's requirements for ChatGPT-like natural language interaction and cross-document question answering.