Project 2: Marketing / CDMO Internal Search Engine¶
Implementation Ideas for IT Head Discussion¶
1. Executive Summary¶
Build an on-premise, AI-powered search engine that enables the marketing and CDMO business development teams to instantly query unstructured internal documents (Excel, Word, PDF) stored across SharePoint and local drives. The system functions like an internal ChatGPT — users ask natural language questions and receive answers with full document traceability (source file citations and direct links to open the original document).
Key constraint: The system must be entirely air-gapped from the internet — no external API calls, no cloud LLMs, no data leaving Divi's infrastructure.
2. Current State Assessment¶
| Aspect | Current State |
|---|---|
| Data Locations | SharePoint, OneDrive, local drives |
| Formats | Excel, Word, PDF (rarely images) |
| Data Nature | Unstructured — meeting notes, financial reports, trade show records, customer research, FDA pipeline data |
| Users | Marketing & BD team (junior associates + senior managers) |
| Current Process | Manual document searching — can take 1–2 weeks for a single question |
| Access Model | No hierarchy needed — all marketing data is accessible to the whole team |
| External Data | Contains publicly sourced data (financial reports, FDA databases) already saved internally |
| Key Constraint | No internet connectivity — strictly internal documents only |
Types of Information Stored¶
- Commercial customers (~50 in various project stages)
- Target/wish list companies (~50 companies with responsible BD contacts mapped)
- Molecule tracking — which molecules are being pursued, what stage, patent expiry data from FDA database
- Meeting notes & summaries — from trade shows, customer visits, unit visits
- Export/import data — tracking competitor supply of molecules to the same customers
- Customer interaction history — who met whom, when, what was discussed
- Financial reports — revenue data for public companies (stored from public sources)
- Product pipeline information — from stored customer/industry reports
3. Technical Architecture: RAG (Retrieval-Augmented Generation)¶
The recommended approach is a RAG architecture — combining vector search over indexed documents with a locally-hosted LLM for natural language question answering.
┌──────────────────────────────────────────────────────────────┐
│ Divi's Internal Network │
│ │
│ ┌──────────────┐ │
│ │ SharePoint / │──┐ │
│ │ OneDrive │ │ ┌─────────────────────┐ │
│ └──────────────┘ ├───▶│ Document Processor │ │
│ ┌──────────────┐ │ │ (Parsing + Chunking)│ │
│ │ Local Drives │──┘ └──────────┬──────────┘ │
│ └──────────────┘ │ │
│ ▼ │
│ ┌─────────────────────┐ │
│ │ Embedding Model │ │
│ │ (On-Premise) │ │
│ └──────────┬──────────┘ │
│ │ │
│ ┌──────────▼──────────┐ │
│ │ Vector Database │ │
│ │ (ChromaDB / Milvus) │ │
│ └──────────┬──────────┘ │
│ │ │
│ User Query ──▶ ┌─────────────────▼──────────────────┐ │
│ │ RAG Orchestrator │ │
│ │ 1. Embed query │ │
│ │ 2. Search vector DB for relevant │ │
│ │ document chunks │ │
│ │ 3. Pass chunks + query to LLM │ │
│ │ 4. Return answer + source citations│ │
│ └─────────────────┬──────────────────┘ │
│ │ │
│ ┌──────────▼──────────┐ │
│ │ Local LLM │ │
│ │ (Llama 3 / Mistral) │ │
│ └──────────┬──────────┘ │
│ │ │
│ ┌──────────▼──────────┐ │
│ │ Chat UI (Web App) │ │
│ │ Answer + Source Docs │ │
│ └─────────────────────┘ │
│ │
│ No external network access │
└──────────────────────────────────────────────────────────────┘
4. Component Details¶
4.1 Document Processing Pipeline¶
Parsing by Format:
| Format | Library / Tool | Extraction Strategy |
|---|---|---|
| Excel | openpyxl / pandas | Convert each sheet to structured text; preserve table headers and row relationships |
| Word | python-docx | Extract paragraphs, tables, headers; maintain document structure |
| PDF | pdfplumber / PyMuPDF | Extract text with layout preservation; handle tables via tabula-py |
| Images | Tesseract OCR (if needed) | OCR for any scanned documents or embedded images |
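The Excel strategy above ("preserve table headers and row relationships") can be sketched in pure Python. A real pipeline would read the workbook via openpyxl or pandas; this shows only the flattening step, and the function name and output format are illustrative:

```python
def sheet_to_text(sheet_name: str, rows: list[list[str]]) -> str:
    """Flatten a spreadsheet table into text that keeps each cell paired
    with its column header, so rows stay meaningful after chunking."""
    header, *body = rows
    lines = [f"Sheet: {sheet_name}"]
    for i, row in enumerate(body, start=2):  # row 1 is the header row
        pairs = "; ".join(f"{h}: {c}" for h, c in zip(header, row) if c)
        lines.append(f"Row {i}: {pairs}")
    return "\n".join(lines)

print(sheet_to_text("Pipeline", [
    ["Customer", "Molecule", "Stage"],
    ["GSK", "Compound X", "Validation"],
]))
```

Pairing every cell with its header is what lets a later vector search match a question like "what stage is GSK at?" to a single row.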
Chunking Strategy:
- Split documents into overlapping chunks of ~500–1000 tokens
- Preserve document metadata with each chunk: filename, file path, page/sheet number, section heading, last modified date
- Use semantic chunking where possible (split on section boundaries rather than arbitrary token counts)
- For Excel: treat each sheet or logical table as a separate chunk with full context
Metadata to Store per Chunk:
{
"chunk_id": "uuid",
"text": "extracted text content...",
"source_file": "Customer_Interactions_Q4_2025.xlsx",
"source_path": "//sharepoint/marketing/cdmo/reports/...",
"file_type": "excel",
"sheet_name": "GSK Meetings",
"page_or_section": "Sheet 3, Rows 15-45",
"last_modified": "2025-12-15T10:30:00",
"business_segment": "CDMO",
"tags": ["customer:GSK", "molecule:compound-x"]
}
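A minimal character-based chunker showing how each chunk carries its source metadata forward (a sketch: production code would count tokens rather than characters and prefer semantic boundaries, as described above):

```python
import uuid

def chunk_text(text: str, source_file: str, source_path: str,
               chunk_size: int = 800, overlap: int = 150) -> list[dict]:
    """Split extracted text into overlapping chunks, each tagged with the
    source metadata needed for citations later in the pipeline."""
    chunks, start = [], 0
    while start < len(text):
        end = min(start + chunk_size, len(text))
        chunks.append({
            "chunk_id": str(uuid.uuid4()),
            "text": text[start:end],
            "source_file": source_file,
            "source_path": source_path,
        })
        if end == len(text):
            break
        start = end - overlap  # overlap preserves context across boundaries
    return chunks
```

The overlap means a sentence cut at a chunk boundary still appears whole in the neighbouring chunk, which noticeably improves retrieval recall.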
4.2 Embedding Model (On-Premise)¶
| Option | Parameters | Performance | Hardware Needed |
|---|---|---|---|
| all-MiniLM-L6-v2 (Sentence-Transformers) | 22M | Good general-purpose, fast | CPU only |
| bge-large-en-v1.5 (BAAI) | 335M | Excellent retrieval accuracy | GPU recommended |
| nomic-embed-text | 137M | Strong, 8K context window | CPU or light GPU |
| e5-large-v2 (Microsoft) | 335M | Top-tier retrieval | GPU recommended |
Recommendation: Start with all-MiniLM-L6-v2 for the mock-up (runs on CPU), move to bge-large-en-v1.5 in production for better accuracy.
All models are open-source, downloadable, and run entirely offline.
4.3 Vector Database (On-Premise)¶
| Option | Pros | Cons |
|---|---|---|
| ChromaDB | Easy setup, Python-native, good for POC | Less scalable at very large volume |
| Milvus | Production-grade, highly scalable | More complex to deploy |
| Qdrant | Fast, Rust-based, good filtering | Newer, smaller community |
| pgvector (PostgreSQL extension) | Reuse PostgreSQL, SQL-based | Slightly less optimized for pure vector search |
Recommendation: ChromaDB for prototyping → Milvus or Qdrant for production.
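Whichever database is chosen, the core operation is the same: nearest-neighbour lookup by cosine similarity over the stored embeddings. A plain-Python sketch of that operation, for illustration only (ChromaDB/Milvus add persistence, indexing structures, and metadata filtering on top):

```python
import math

def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity between two embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm

def top_k(query_vec: list[float], index: list[tuple], k: int = 5) -> list:
    """index: list of (embedding, metadata) pairs. Return the metadata of
    the k chunks whose embeddings are most similar to the query vector."""
    ranked = sorted(index, key=lambda entry: cosine(query_vec, entry[0]),
                    reverse=True)
    return [meta for _, meta in ranked[:k]]
```

This brute-force scan is O(n) per query; the dedicated vector databases exist precisely to replace it with approximate nearest-neighbour indexes that stay fast at millions of chunks.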
4.4 Local LLM (On-Premise, Air-Gapped)¶
| Model | Parameters | Quality | Hardware |
|---|---|---|---|
| Llama 3.1 8B | 8B | Good for Q&A, fast | 1× GPU (16GB VRAM) |
| Llama 3.1 70B | 70B | Excellent quality | 2–4× A100 GPUs or quantized on 1× 48GB GPU |
| Mistral 7B | 7B | Strong reasoning, fast | 1× GPU (16GB VRAM) |
| Mixtral 8x7B | 46.7B (MoE) | Excellent, near GPT-3.5 | 2× 24GB GPUs |
| Phi-3 Medium | 14B | Strong for size | 1× 24GB GPU |
Serving Framework: Use Ollama, vLLM, or llama.cpp to host the LLM locally.
Recommendation:
- Mock-up: Mistral 7B via Ollama (easy setup, runs on a single GPU)
- Production: Llama 3.1 70B quantized (Q4) or Mixtral 8x7B for best answer quality
4.5 RAG Orchestration¶
Use LangChain or LlamaIndex as the orchestration framework:
User Question
│
▼
┌──────────────────┐
│ Query Processing │ ← Rephrase/expand query for better retrieval
└────────┬─────────┘
│
▼
┌──────────────────┐
│ Vector Search │ ← Retrieve top-K relevant chunks (K=5-10)
│ + Metadata Filter│ ← Optional: filter by segment, customer, date range
└────────┬─────────┘
│
▼
┌──────────────────┐
│ Re-Ranking │ ← Re-rank retrieved chunks for relevance (optional)
└────────┬─────────┘
│
▼
┌──────────────────┐
│ LLM Generation │ ← Generate answer from chunks + question
│ with Citations │ ← Force model to cite source document for each fact
└────────┬─────────┘
│
▼
┌──────────────────┐
│ Response + Links │ ← Answer text + clickable links to source files
└──────────────────┘
Citation Mechanism:
- Each answer must include inline citations: "According to [Customer_Interactions_Q4_2025.xlsx, Sheet: GSK Meetings]..."
- Citations are clickable — opening the source file directly (or linking to the SharePoint location)
- System prompt instructs the LLM to only answer from provided context, never hallucinate
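The "only answer from provided context" rule is enforced at prompt-assembly time. A sketch of how the orchestrator might wrap retrieved chunks in source tags the model can copy into its citations (the prompt wording and chunk fields are illustrative, matching the metadata schema in section 4.1):

```python
SYSTEM_PROMPT = (
    "Answer ONLY from the context below. After every fact, cite its "
    "source tag exactly as given, e.g. [Customer_Tracker.xlsx, Sheet: "
    "Novartis]. If the context does not contain the answer, say so."
)

def build_prompt(question: str, chunks: list[dict]) -> str:
    """Wrap each retrieved chunk in a [file, section] tag so the LLM can
    cite sources verbatim, then append the user's question."""
    context = "\n\n".join(
        f"[{c['source_file']}, {c['page_or_section']}]\n{c['text']}"
        for c in chunks
    )
    return f"{SYSTEM_PROMPT}\n\nContext:\n{context}\n\nQuestion: {question}"
```

Because the tags are injected by the pipeline rather than generated by the model, a citation in the answer can always be resolved back to a real chunk (and rejected if it cannot).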
5. User Interface¶
5.1 Chat Interface¶
A web-based chat application (similar to ChatGPT) served on the internal network:
Core Features:
- Natural language question input
- Streaming answer display
- Source document citations with clickable file links
- Conversation history (for follow-up questions)
- Ability to filter search scope (e.g., "search only CDMO documents" or "search only GSK-related files")
Technology:
- Frontend: React or simple HTML/JS with a chat UI
- Backend: Python FastAPI serving the RAG pipeline
- Hosting: On-premise, internal network only
5.2 Example Interactions¶
Query 1: "Is anyone else supplying Molecule X to GSK?"
Based on the export data report [Export_Tracking_2025.xlsx, Sheet: Competitor Analysis], Company ABC shipped 500kg of Molecule X to GSK's facility in Ireland in Q3 2025. Additionally, [Market_Intelligence_CDMO.pdf, Page 12] notes that a second supplier from India began trial shipments in early 2025.
Sources:
- 📄 Export_Tracking_2025.xlsx — Sheet: Competitor Analysis, Row 47
- 📄 Market_Intelligence_CDMO.pdf — Page 12
Query 2: "Have we met Novartis before? What did we discuss?"
Yes. Divi's met with Novartis representatives on three occasions:
1. CPhI Barcelona 2024 — Discussed potential supply of Intermediates A and B. Contact: John Smith, Head of Procurement. [Meeting_Notes_CPhI_2024.docx, Section: Novartis]
2. Unit Visit, Jan 2025 — Novartis team visited Unit 2 facility. Discussed manufacturing capacity and quality systems. [Visit_Report_Jan2025.pdf, Pages 3-5]
3. Email Exchange, Mar 2025 — Follow-up on pricing for Intermediate A. No agreement reached. [Customer_Tracker.xlsx, Sheet: Novartis]
Sources:
- 📄 Meeting_Notes_CPhI_2024.docx
- 📄 Visit_Report_Jan2025.pdf
- 📄 Customer_Tracker.xlsx
Query 3: "What molecules are on our wish list with patent expiry in the next 3 years?"
Based on [Patent_Expiry_Tracker.xlsx], the following molecules on the wish list have patent expiry before 2029:
| Molecule | Patent Expiry | Target Customer | BD Contact |
|---|---|---|---|
| Compound A | Jun 2027 | Teva | Ravi K. |
| Compound B | Dec 2027 | Mylan | Priya S. |
| Compound C | Mar 2028 | Sun Pharma | Anil M. |

Source: 📄 Patent_Expiry_Tracker.xlsx — Sheet: Wish List
6. Document Traceability Implementation¶
This was explicitly requested — every answer must trace back to its source.
How it works:
- At ingestion time: Every document chunk is tagged with full file metadata (path, name, page/section, last modified)
- At query time: The RAG pipeline returns the top-K chunks along with their metadata
- At display time: The UI renders source citations as clickable links that:
- Open the file in SharePoint (if web-accessible internally)
- Or show the file path for the user to navigate to manually
- Highlight the specific section/page where the answer was found
Verification workflow:
User asks question → AI answers with citations → User clicks source link →
Opens original document → User verifies → Confidence in answer
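The "clickable link" step can be as simple as mapping the stored `source_path` to either a SharePoint URL or a `file://` URL. A sketch, where the `sharepoint_base` value and the path layout are assumptions to be confirmed with IT (see section 9):

```python
from urllib.parse import quote

def citation_link(meta: dict, sharepoint_base: str = "") -> str:
    """Build a clickable link from chunk metadata: prefer a SharePoint
    URL when a base URL is configured (assumed internal site root),
    otherwise fall back to a file:// URL over the UNC path."""
    path = meta["source_path"].replace("\\", "/")
    if sharepoint_base and "/sharepoint/" in path:
        rel = path.split("/sharepoint", 1)[1]
        return sharepoint_base.rstrip("/") + quote(rel)
    return "file://" + quote(path.lstrip("/"))
```

Whether browsers on the internal network are allowed to open `file://` links (or whether a mapped-drive path must be shown instead) is exactly the "File Protocol" question raised for the IT head below.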
7. Security Architecture¶
┌──────────────────────────────────────────────────────────┐
│ Divi's Internal Network │
│ │
│ ┌─────────────────────────────────────────────────────┐ │
│ │ Application Server │ │
│ │ │ │
│ │ ┌──────────┐ ┌───────────┐ ┌─────────────────────┐ │ │
│ │ │ Chat UI │ │ FastAPI │ │ LangChain/LlamaIndex│ │ │
│ │ │ (React) │ │ Backend │ │ RAG Pipeline │ │ │
│ │ └──────────┘ └───────────┘ └─────────────────────┘ │ │
│ └─────────────────────────┬───────────────────────────┘ │
│ │ │
│ ┌─────────────────────────▼───────────────────────────┐ │
│ │ GPU Server │ │
│ │ ┌──────────────────┐ ┌──────────────────────────┐ │ │
│ │ │ Embedding Model │ │ LLM (Llama 3 / Mistral) │ │ │
│ │ │ (all-MiniLM) │ │ via Ollama / vLLM │ │ │
│ │ └──────────────────┘ └──────────────────────────┘ │ │
│ └─────────────────────────────────────────────────────┘ │
│ │
│ ┌─────────────────────────────────────────────────────┐ │
│ │ Vector DB (ChromaDB/Milvus) + PostgreSQL metadata │ │
│ └─────────────────────────────────────────────────────┘ │
│ │
│ ✗ NO outbound internet access │
│ ✗ NO cloud API calls │
│ ✓ All models downloaded once and run offline │
│ ✓ All data stays within Divi's network │
│ ✓ Access via internal network / VPN only │
└──────────────────────────────────────────────────────────┘
Air-Gap Compliance:
- All AI models (embedding + LLM) are open-source and downloaded once during setup
- No runtime internet dependency — everything runs locally
- No telemetry, no external logging, no cloud sync
- Model weights stored on the same on-premise server
8. Development Phases¶
Phase 1: Mock-Up with Synthetic Data (Weeks 1–3)¶
| Task | Details |
|---|---|
| Generate synthetic documents | Create fake Excel, Word, PDF files mimicking real document structures (fake company names, molecules, meeting notes) |
| Set up RAG pipeline | ChromaDB + all-MiniLM + Mistral 7B via Ollama |
| Build basic chat UI | Simple web interface with question input and cited answers |
| Demo to client | Show 5–10 example queries with source citations |
Deliverable: Working chatbot answering questions from synthetic documents with source traceability.
Phase 2: Document Integration & Tuning (Weeks 4–7)¶
| Task | Details |
|---|---|
| Connect to SharePoint / actual document sources | Ingest all marketing and CDMO documents |
| Optimize chunking strategy | Tune chunk sizes and overlap for best retrieval accuracy |
| Fine-tune retrieval | Test and optimize embedding model choice, re-ranking |
| Expand metadata tagging | Auto-tag documents by segment, customer, molecule where possible |
| Test with real queries | Work with BD team to validate answers against known facts |
Deliverable: System connected to real documents with validated accuracy.
Phase 3: Production Hardening (Weeks 8–10)¶
| Task | Details |
|---|---|
| Deploy production LLM | Upgrade to Llama 3.1 70B or Mixtral for better answer quality |
| Implement scheduled re-indexing | Auto-detect new/modified documents in SharePoint and re-index |
| Add conversation memory | Support follow-up questions in the same chat session |
| Add search filters | Allow scoping by segment, customer, date range |
| User testing | Train marketing team, collect feedback, iterate |
| Documentation | Admin guide, user guide |
Deliverable: Production-ready internal search engine deployed on-premise.
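The scheduled re-indexing task in Phase 3 can start as a simple modification-time diff against the last snapshot. A minimal sketch for file-share sources (the JSON state-file format is an assumption; a SharePoint library would use its change/delta API instead of walking a mounted path):

```python
import json
import os

def changed_files(root: str, state_path: str) -> list[str]:
    """Return files under root that are new or modified since the last
    run, by comparing mtimes against a JSON snapshot. The caller then
    re-parses and re-embeds only those files."""
    try:
        with open(state_path) as f:
            last = json.load(f)
    except FileNotFoundError:
        last = {}  # first run: everything counts as changed
    current, changed = {}, []
    for dirpath, _, names in os.walk(root):
        for name in names:
            path = os.path.join(dirpath, name)
            mtime = os.path.getmtime(path)
            current[path] = mtime
            if last.get(path) != mtime:
                changed.append(path)
    with open(state_path, "w") as f:
        json.dump(current, f)
    return changed
```

Run from a nightly cron/scheduled task, this keeps the vector index incremental: unchanged documents are never re-embedded, so re-indexing cost scales with churn rather than corpus size.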
9. Key Questions for IT Head¶
- GPU Hardware: Is there an existing GPU server on-premise, or does one need to be procured? Minimum requirement: 1× NVIDIA GPU with 24GB VRAM (e.g., RTX 4090, A5000, or A100).
- SharePoint API Access: Can we get read-only API access to the marketing team's SharePoint document libraries? Or should we use a mounted file share?
- Document Volume: Approximately how many total documents are we indexing? (Rough estimate — hundreds, thousands?)
- Network Isolation: Can we set up a dedicated internal server/VM that has NO outbound internet access at all (full air-gap)?
- User Authentication: Should the chat interface use AD/LDAP login, or is it open to anyone on the internal network?
- File Protocol: When users click a source citation, should it open via a SharePoint URL, a UNC file path (\\server\share\...), or a mapped drive letter?
- Existing IT Infrastructure: Is Docker / Kubernetes available on-premise? Or should we plan for bare-metal deployment?
- Budget for Hardware: If no GPU server exists, is there budget for procuring one? (Approximate cost: $5K–$15K for a workstation-class GPU server, or $30K+ for an enterprise A100 setup)
10. Estimated Resource Requirements¶
| Resource | Estimate |
|---|---|
| AI/ML Engineer | 1 (RAG pipeline, LLM setup, embedding tuning) |
| Backend Developer | 1 (API, document ingestion, SharePoint integration) |
| GPU Server | 1× with minimum 24GB VRAM (48GB+ preferred for larger LLM) |
| Application Server | 1 VM (8+ cores, 32GB RAM) for API + vector DB |
| Storage | 500GB–1TB for documents + vector index + model weights |
| Timeline | ~10 weeks to full production |
11. Comparison: RAG vs. Traditional Search¶
| Feature | Traditional Search (Elasticsearch) | RAG (Vector + LLM) |
|---|---|---|
| Query type | Keywords only | Natural language questions |
| Answer format | List of matching documents | Direct answer + sources |
| Understanding | Exact string matching | Semantic understanding |
| Cross-document synthesis | No | Yes — combines info from multiple files |
| Setup complexity | Lower | Higher (needs GPU for LLM) |
| Handles unstructured data | Partial | Excellent |
RAG is the right choice given the client's requirements for ChatGPT-like natural language interaction and cross-document question answering.