Project 2: Marketing / CDMO Internal Search Engine¶
Implementation Ideas for IT Head Discussion¶
1. Executive Summary¶
Build an on-premise, AI-powered search engine that enables the marketing and CDMO business development teams to instantly query unstructured internal documents (Excel, Word, PDF) stored across SharePoint and local drives. The system functions like an internal ChatGPT — users ask natural language questions and receive answers with full document traceability (source file citations and direct links to open the original document).
Key constraint: The system must be entirely air-gapped from the internet — no external API calls, no cloud LLMs, no data leaving Divi's infrastructure.
2. Current State Assessment¶
| Aspect | Current State |
|---|---|
| Data Locations | SharePoint, OneDrive, local drives |
| Formats | Excel, Word, PDF (rarely images) |
| Data Nature | Unstructured — meeting notes, financial reports, trade show records, customer research, FDA pipeline data |
| Users | Marketing & BD team (junior associates + senior managers) |
| Current Process | Manual document searching — can take 1–2 weeks for a single question |
| Access Model | No hierarchy needed — all marketing data is accessible to the whole team |
| External Data | Contains publicly sourced data (financial reports, FDA databases) already saved internally |
| Key Constraint | No internet connectivity — strictly internal documents only |
Types of Information Stored¶
- Commercial customers (~50 in various project stages)
- Target/wish list companies (~50 companies with responsible BD contacts mapped)
- Molecule tracking — which molecules are being pursued, what stage, patent expiry data from FDA database
- Meeting notes & summaries — from trade shows, customer visits, unit visits
- Export/import data — tracking competitor supply of molecules to the same customers
- Customer interaction history — who met whom, when, what was discussed
- Financial reports — revenue data for public companies (stored from public sources)
- Product pipeline information — from stored customer/industry reports
3. Technical Architecture: RAG (Retrieval-Augmented Generation)¶
The recommended approach is a RAG architecture — combining vector search over indexed documents with a locally-hosted LLM for natural language question answering.
┌──────────────────────────────────────────────────────────────┐
│ Divi's Internal Network │
│ │
│ ┌──────────────┐ │
│ │ SharePoint / │──┐ │
│ │ OneDrive │ │ ┌─────────────────────┐ │
│ └──────────────┘ ├───▶│ Document Processor │ │
│ ┌──────────────┐ │ │ (Parsing + Chunking)│ │
│ │ Local Drives │──┘ └──────────┬──────────┘ │
│ └──────────────┘ │ │
│ ▼ │
│ ┌─────────────────────┐ │
│ │ Embedding Model │ │
│ │ (On-Premise) │ │
│ └──────────┬──────────┘ │
│ │ │
│ ┌──────────▼──────────┐ │
│ │ Vector Database │ │
│ │ (ChromaDB / Milvus) │ │
│ └──────────┬──────────┘ │
│ │ │
│ User Query ──▶ ┌─────────────────▼──────────────────┐ │
│ │ RAG Orchestrator │ │
│ │ 1. Embed query │ │
│ │ 2. Search vector DB for relevant │ │
│ │ document chunks │ │
│ │ 3. Pass chunks + query to LLM │ │
│ │ 4. Return answer + source citations│ │
│ └─────────────────┬──────────────────┘ │
│ │ │
│ ┌──────────▼──────────┐ │
│ │ Local LLM │ │
│ │ (Llama 3 / Mistral) │ │
│ └──────────┬──────────┘ │
│ │ │
│ ┌──────────▼──────────┐ │
│ │ Chat UI (Web App) │ │
│ │ Answer + Source Docs │ │
│ └─────────────────────┘ │
│ │
│ No external network access │
└──────────────────────────────────────────────────────────────┘
4. Component Details¶
4.1 Document Processing Pipeline¶
Parsing by Format:
| Format | Library / Tool | Extraction Strategy |
|---|---|---|
| Excel | openpyxl / pandas | Convert each sheet to structured text; preserve table headers and row relationships |
| Word | python-docx | Extract paragraphs, tables, headers; maintain document structure |
| PDF | pdfplumber / PyMuPDF | Extract text with layout preservation; handle tables via tabula-py |
| Images | Tesseract OCR (if needed) | OCR for any scanned documents or embedded images |
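The Excel strategy above ("preserve table headers and row relationships") can be sketched in pure Python. A real pipeline would read the workbook via openpyxl or pandas; this shows only the flattening step, and the function name and output format are illustrative:

```python
def sheet_to_text(sheet_name: str, rows: list[list[str]]) -> str:
    """Flatten a spreadsheet table into text that keeps each cell paired
    with its column header, so rows stay meaningful after chunking."""
    header, *body = rows
    lines = [f"Sheet: {sheet_name}"]
    for i, row in enumerate(body, start=2):  # row 1 is the header row
        pairs = "; ".join(f"{h}: {c}" for h, c in zip(header, row) if c)
        lines.append(f"Row {i}: {pairs}")
    return "\n".join(lines)

print(sheet_to_text("Pipeline", [
    ["Customer", "Molecule", "Stage"],
    ["GSK", "Compound X", "Validation"],
]))
```

Pairing every cell with its header is what lets a later vector search match a question like "what stage is GSK at?" to a single row.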
Chunking Strategy:
- Split documents into overlapping chunks of ~500–1000 tokens
- Preserve document metadata with each chunk: filename, file path, page/sheet number, section heading, last modified date
- Use semantic chunking where possible (split on section boundaries rather than arbitrary token counts)
- For Excel: treat each sheet or logical table as a separate chunk with full context
Metadata to Store per Chunk:
{
"chunk_id": "uuid",
"text": "extracted text content...",
"source_file": "Customer_Interactions_Q4_2025.xlsx",
"source_path": "//sharepoint/marketing/cdmo/reports/...",
"file_type": "excel",
"sheet_name": "GSK Meetings",
"page_or_section": "Sheet 3, Rows 15-45",
"last_modified": "2025-12-15T10:30:00",
"business_segment": "CDMO",
"tags": ["customer:GSK", "molecule:compound-x"]
}
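A minimal character-based chunker showing how each chunk carries its source metadata forward (a sketch: production code would count tokens rather than characters and prefer semantic boundaries, as described above):

```python
import uuid

def chunk_text(text: str, source_file: str, source_path: str,
               chunk_size: int = 800, overlap: int = 150) -> list[dict]:
    """Split extracted text into overlapping chunks, each tagged with the
    source metadata needed for citations later in the pipeline."""
    chunks, start = [], 0
    while start < len(text):
        end = min(start + chunk_size, len(text))
        chunks.append({
            "chunk_id": str(uuid.uuid4()),
            "text": text[start:end],
            "source_file": source_file,
            "source_path": source_path,
        })
        if end == len(text):
            break
        start = end - overlap  # overlap preserves context across boundaries
    return chunks
```

The overlap means a sentence cut at a chunk boundary still appears whole in the neighbouring chunk, which noticeably improves retrieval recall.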
4.2 Embedding Model (On-Premise)¶
| Option | Parameters | Performance | Hardware Needed |
|---|---|---|---|
| all-MiniLM-L6-v2 (Sentence-Transformers) | 22M | Good general-purpose, fast | CPU only |
| bge-large-en-v1.5 (BAAI) | 335M | Excellent retrieval accuracy | GPU recommended |
| nomic-embed-text | 137M | Strong, 8K context window | CPU or light GPU |
| e5-large-v2 (Microsoft) | 335M | Top-tier retrieval | GPU recommended |
Recommendation: Start with all-MiniLM-L6-v2 for the mock-up (runs on CPU), move to bge-large-en-v1.5 in production for better accuracy.
All models are open-source, downloadable, and run entirely offline.
4.3 Vector Database (On-Premise)¶
| Option | Pros | Cons |
|---|---|---|
| ChromaDB | Easy setup, Python-native, good for POC | Less scalable at very large volume |
| Milvus | Production-grade, highly scalable | More complex to deploy |
| Qdrant | Fast, Rust-based, good filtering | Newer, smaller community |
| pgvector (PostgreSQL extension) | Reuse PostgreSQL, SQL-based | Slightly less optimized for pure vector search |
Recommendation: ChromaDB for prototyping → Milvus or Qdrant for production.
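Whichever database is chosen, the core operation is the same: nearest-neighbour lookup by cosine similarity over the stored embeddings. A plain-Python sketch of that operation, for illustration only (ChromaDB/Milvus add persistence, indexing structures, and metadata filtering on top):

```python
import math

def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity between two embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm

def top_k(query_vec: list[float], index: list[tuple], k: int = 5) -> list:
    """index: list of (embedding, metadata) pairs. Return the metadata of
    the k chunks whose embeddings are most similar to the query vector."""
    ranked = sorted(index, key=lambda entry: cosine(query_vec, entry[0]),
                    reverse=True)
    return [meta for _, meta in ranked[:k]]
```

This brute-force scan is O(n) per query; the dedicated vector databases exist precisely to replace it with approximate nearest-neighbour indexes that stay fast at millions of chunks.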
4.4 Local LLM (On-Premise, Air-Gapped)¶
| Model | Parameters | Quality | Hardware |
|---|---|---|---|
| Llama 3.1 8B | 8B | Good for Q&A, fast | 1× GPU (16GB VRAM) |
| Llama 3.1 70B | 70B | Excellent quality | 2–4× A100 GPUs or quantized on 1× 48GB GPU |
| Mistral 7B | 7B | Strong reasoning, fast | 1× GPU (16GB VRAM) |
| Mixtral 8x7B | 46.7B (MoE) | Excellent, near GPT-3.5 | 2× 24GB GPUs |
| Phi-3 Medium | 14B | Strong for size | 1× 24GB GPU |
Serving Framework: Use Ollama, vLLM, or llama.cpp to host the LLM locally.
Recommendation:
- Mock-up: Mistral 7B via Ollama (easy setup, runs on a single GPU)
- Production: Llama 3.1 70B quantized (Q4) or Mixtral 8x7B for best answer quality
4.5 RAG Orchestration¶
Use LangChain or LlamaIndex as the orchestration framework:
User Question
│
▼
┌──────────────────┐
│ Query Processing │ ← Rephrase/expand query for better retrieval
└────────┬─────────┘
│
▼
┌──────────────────┐
│ Vector Search │ ← Retrieve top-K relevant chunks (K=5-10)
│ + Metadata Filter│ ← Optional: filter by segment, customer, date range
└────────┬─────────┘
│
▼
┌──────────────────┐
│ Re-Ranking │ ← Re-rank retrieved chunks for relevance (optional)
└────────┬─────────┘
│
▼
┌──────────────────┐
│ LLM Generation │ ← Generate answer from chunks + question
│ with Citations │ ← Force model to cite source document for each fact
└────────┬─────────┘
│
▼
┌──────────────────┐
│ Response + Links │ ← Answer text + clickable links to source files
└──────────────────┘
Citation Mechanism:
- Each answer must include inline citations: "According to [Customer_Interactions_Q4_2025.xlsx, Sheet: GSK Meetings]..."
- Citations are clickable — opening the source file directly (or linking to the SharePoint location)
- System prompt instructs the LLM to only answer from provided context, never hallucinate
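The "only answer from provided context" rule is enforced at prompt-assembly time. A sketch of how the orchestrator might wrap retrieved chunks in source tags the model can copy into its citations (the prompt wording and chunk fields are illustrative, matching the metadata schema in section 4.1):

```python
SYSTEM_PROMPT = (
    "Answer ONLY from the context below. After every fact, cite its "
    "source tag exactly as given, e.g. [Customer_Tracker.xlsx, Sheet: "
    "Novartis]. If the context does not contain the answer, say so."
)

def build_prompt(question: str, chunks: list[dict]) -> str:
    """Wrap each retrieved chunk in a [file, section] tag so the LLM can
    cite sources verbatim, then append the user's question."""
    context = "\n\n".join(
        f"[{c['source_file']}, {c['page_or_section']}]\n{c['text']}"
        for c in chunks
    )
    return f"{SYSTEM_PROMPT}\n\nContext:\n{context}\n\nQuestion: {question}"
```

Because the tags are injected by the pipeline rather than generated by the model, a citation in the answer can always be resolved back to a real chunk (and rejected if it cannot).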
5. User Interface¶
5.1 Chat Interface¶
A web-based chat application (similar to ChatGPT) served on the internal network:
Core Features:
- Natural language question input
- Streaming answer display
- Source document citations with clickable file links
- Conversation history (for follow-up questions)
- Ability to filter search scope (e.g., "search only CDMO documents" or "search only GSK-related files")
Technology:
- Frontend: React or simple HTML/JS with a chat UI
- Backend: Python FastAPI serving the RAG pipeline
- Hosting: On-premise, internal network only
5.2 Example Interactions¶
Query 1: "Is anyone else supplying Molecule X to GSK?"
Based on the export data report [Export_Tracking_2025.xlsx, Sheet: Competitor Analysis], Company ABC shipped 500kg of Molecule X to GSK's facility in Ireland in Q3 2025. Additionally, [Market_Intelligence_CDMO.pdf, Page 12] notes that a second supplier from India began trial shipments in early 2025.
Sources:
- 📄 Export_Tracking_2025.xlsx — Sheet: Competitor Analysis, Row 47
- 📄 Market_Intelligence_CDMO.pdf — Page 12
Query 2: "Have we met Novartis before? What did we discuss?"
Yes. Divi's met with Novartis representatives on three occasions:
1. CPhI Barcelona 2024 — Discussed potential supply of Intermediates A and B. Contact: John Smith, Head of Procurement. [Meeting_Notes_CPhI_2024.docx, Section: Novartis]
2. Unit Visit, Jan 2025 — Novartis team visited Unit 2 facility. Discussed manufacturing capacity and quality systems. [Visit_Report_Jan2025.pdf, Pages 3-5]
3. Email Exchange, Mar 2025 — Follow-up on pricing for Intermediate A. No agreement reached. [Customer_Tracker.xlsx, Sheet: Novartis]
Sources:
- 📄 Meeting_Notes_CPhI_2024.docx
- 📄 Visit_Report_Jan2025.pdf
- 📄 Customer_Tracker.xlsx
Query 3: "What molecules are on our wish list with patent expiry in the next 3 years?"
Based on [Patent_Expiry_Tracker.xlsx], the following molecules on the wish list have patent expiry before 2029:
| Molecule | Patent Expiry | Target Customer | BD Contact |
|---|---|---|---|
| Compound A | Jun 2027 | Teva | Ravi K. |
| Compound B | Dec 2027 | Mylan | Priya S. |
| Compound C | Mar 2028 | Sun Pharma | Anil M. |

Source: 📄 Patent_Expiry_Tracker.xlsx — Sheet: Wish List
6. Document Traceability Implementation¶
This was explicitly requested — every answer must trace back to its source.
How it works:
- At ingestion time: Every document chunk is tagged with full file metadata (path, name, page/section, last modified)
- At query time: The RAG pipeline returns the top-K chunks along with their metadata
- At display time: The UI renders source citations as clickable links that:
- Open the file in SharePoint (if web-accessible internally)
- Or show the file path for the user to navigate to manually
- Highlight the specific section/page where the answer was found
Verification workflow:
User asks question → AI answers with citations → User clicks source link →
Opens original document → User verifies → Confidence in answer
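The "clickable link" step can be as simple as mapping the stored `source_path` to either a SharePoint URL or a `file://` URL. A sketch, where the `sharepoint_base` value and the path layout are assumptions to be confirmed with IT (see section 9):

```python
from urllib.parse import quote

def citation_link(meta: dict, sharepoint_base: str = "") -> str:
    """Build a clickable link from chunk metadata: prefer a SharePoint
    URL when a base URL is configured (assumed internal site root),
    otherwise fall back to a file:// URL over the UNC path."""
    path = meta["source_path"].replace("\\", "/")
    if sharepoint_base and "/sharepoint/" in path:
        rel = path.split("/sharepoint", 1)[1]
        return sharepoint_base.rstrip("/") + quote(rel)
    return "file://" + quote(path.lstrip("/"))
```

Whether browsers on the internal network are allowed to open `file://` links (or whether a mapped-drive path must be shown instead) is exactly the "File Protocol" question raised for the IT head below.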
7. Security Architecture¶
┌──────────────────────────────────────────────────────────┐
│ Divi's Internal Network │
│ │
│ ┌─────────────────────────────────────────────────────┐ │
│ │ Application Server │ │
│ │ │ │
│ │ ┌──────────┐ ┌───────────┐ ┌─────────────────────┐ │ │
│ │ │ Chat UI │ │ FastAPI │ │ LangChain/LlamaIndex│ │ │
│ │ │ (React) │ │ Backend │ │ RAG Pipeline │ │ │
│ │ └──────────┘ └───────────┘ └─────────────────────┘ │ │
│ └─────────────────────────┬───────────────────────────┘ │
│ │ │
│ ┌─────────────────────────▼───────────────────────────┐ │
│ │ GPU Server │ │
│ │ ┌──────────────────┐ ┌──────────────────────────┐ │ │
│ │ │ Embedding Model │ │ LLM (Llama 3 / Mistral) │ │ │
│ │ │ (all-MiniLM) │ │ via Ollama / vLLM │ │ │
│ │ └──────────────────┘ └──────────────────────────┘ │ │
│ └─────────────────────────────────────────────────────┘ │
│ │
│ ┌─────────────────────────────────────────────────────┐ │
│ │ Vector DB (ChromaDB/Milvus) + PostgreSQL metadata │ │
│ └─────────────────────────────────────────────────────┘ │
│ │
│ ✗ NO outbound internet access │
│ ✗ NO cloud API calls │
│ ✓ All models downloaded once and run offline │
│ ✓ All data stays within Divi's network │
│ ✓ Access via internal network / VPN only │
└──────────────────────────────────────────────────────────┘
Air-Gap Compliance:
- All AI models (embedding + LLM) are open-source and downloaded once during setup
- No runtime internet dependency — everything runs locally
- No telemetry, no external logging, no cloud sync
- Model weights stored on the same on-premise server
8. Development Phases¶
Phase 1: Mock-Up with Synthetic Data (Weeks 1–3)¶
| Task | Details |
|---|---|
| Generate synthetic documents | Create fake Excel, Word, PDF files mimicking real document structures (fake company names, molecules, meeting notes) |
| Set up RAG pipeline | ChromaDB + all-MiniLM + Mistral 7B via Ollama |
| Build basic chat UI | Simple web interface with question input and cited answers |
| Demo to client | Show 5–10 example queries with source citations |
Deliverable: Working chatbot answering questions from synthetic documents with source traceability.
Phase 2: Document Integration & Tuning (Weeks 4–7)¶
| Task | Details |
|---|---|
| Connect to SharePoint / actual document sources | Ingest all marketing and CDMO documents |
| Optimize chunking strategy | Tune chunk sizes and overlap for best retrieval accuracy |
| Fine-tune retrieval | Test and optimize embedding model choice, re-ranking |
| Expand metadata tagging | Auto-tag documents by segment, customer, molecule where possible |
| Test with real queries | Work with BD team to validate answers against known facts |
Deliverable: System connected to real documents with validated accuracy.
Phase 3: Production Hardening (Weeks 8–10)¶
| Task | Details |
|---|---|
| Deploy production LLM | Upgrade to Llama 3.1 70B or Mixtral for better answer quality |
| Implement scheduled re-indexing | Auto-detect new/modified documents in SharePoint and re-index |
| Add conversation memory | Support follow-up questions in the same chat session |
| Add search filters | Allow scoping by segment, customer, date range |
| User testing | Train marketing team, collect feedback, iterate |
| Documentation | Admin guide, user guide |
Deliverable: Production-ready internal search engine deployed on-premise.
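The scheduled re-indexing task in Phase 3 can start as a simple modification-time diff against the last snapshot. A minimal sketch for file-share sources (the JSON state-file format is an assumption; a SharePoint library would use its change/delta API instead of walking a mounted path):

```python
import json
import os

def changed_files(root: str, state_path: str) -> list[str]:
    """Return files under root that are new or modified since the last
    run, by comparing mtimes against a JSON snapshot. The caller then
    re-parses and re-embeds only those files."""
    try:
        with open(state_path) as f:
            last = json.load(f)
    except FileNotFoundError:
        last = {}  # first run: everything counts as changed
    current, changed = {}, []
    for dirpath, _, names in os.walk(root):
        for name in names:
            path = os.path.join(dirpath, name)
            mtime = os.path.getmtime(path)
            current[path] = mtime
            if last.get(path) != mtime:
                changed.append(path)
    with open(state_path, "w") as f:
        json.dump(current, f)
    return changed
```

Run from a nightly cron/scheduled task, this keeps the vector index incremental: unchanged documents are never re-embedded, so re-indexing cost scales with churn rather than corpus size.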
9. Key Questions for IT Head¶
- GPU Hardware: Is there an existing GPU server on-premise, or does one need to be procured? Minimum requirement: 1× NVIDIA GPU with 24GB VRAM (e.g., RTX 4090, A5000, or A100).
- SharePoint API Access: Can we get read-only API access to the marketing team's SharePoint document libraries? Or should we use a mounted file share?
- Document Volume: Approximately how many total documents are we indexing? (Rough estimate — hundreds, thousands?)
- Network Isolation: Can we set up a dedicated internal server/VM that has NO outbound internet access at all (full air-gap)?
- User Authentication: Should the chat interface use AD/LDAP login, or is it open to anyone on the internal network?
- File Protocol: When users click a source citation, should it open via a SharePoint URL, a UNC file path (\\server\share\...), or a mapped drive letter?
- Existing IT Infrastructure: Is Docker / Kubernetes available on-premise? Or should we plan for bare-metal deployment?
- Budget for Hardware: If no GPU server exists, is there budget for procuring one? (Approximate cost: $5K–$15K for a workstation-class GPU server, or $30K+ for an enterprise A100 setup)
10. Estimated Resource Requirements¶
| Resource | Estimate |
|---|---|
| AI/ML Engineer | 1 (RAG pipeline, LLM setup, embedding tuning) |
| Backend Developer | 1 (API, document ingestion, SharePoint integration) |
| GPU Server | 1× with minimum 24GB VRAM (48GB+ preferred for larger LLM) |
| Application Server | 1 VM (8+ cores, 32GB RAM) for API + vector DB |
| Storage | 500GB–1TB for documents + vector index + model weights |
| Timeline | ~10 weeks to full production |
11. Comparison: RAG vs. Traditional Search¶
| Feature | Traditional Search (Elasticsearch) | RAG (Vector + LLM) |
|---|---|---|
| Query type | Keywords only | Natural language questions |
| Answer format | List of matching documents | Direct answer + sources |
| Understanding | Exact string matching | Semantic understanding |
| Cross-document synthesis | No | Yes — combines info from multiple files |
| Setup complexity | Lower | Higher (needs GPU for LLM) |
| Handles unstructured data | Partial | Excellent |
RAG is the right choice given the client's requirements for ChatGPT-like natural language interaction and cross-document question answering.