Project 1: Internal Data / Dashboard Platform¶
Implementation Ideas for IT Head Discussion¶
1. Executive Summary¶
Build an on-premise analytics dashboard platform that consolidates ~30+ management reports (Excel, Word, PDF) across three business segments — Generics, CDMO, and Nutraceuticals — into interactive, role-based dashboards for CEO, VP, and Manager-level users. The system replaces the current "CRM in Excel" approach with a unified data layer and visual reporting interface.
2. Current State Assessment¶
| Aspect | Current State |
|---|---|
| Data Sources | SharePoint, OneDrive, local drives |
| Formats | Excel (primary), Word, PDF, occasionally images |
| Report Count | ~10 Generics + ~6 CDMO + ~15 Nutraceuticals ≈ 31 reports |
| Access Method | Manual review of individual files |
| CRM | Effectively an Excel-based CRM |
| BI Tools | Power BI recently launched (separate from this project) |
| Key Constraint | Data is extremely confidential — must remain 100% on-premise |
3. Data Architecture¶
3.1 Data Ingestion Layer¶
Approach: Build an automated ETL pipeline that reads from existing file locations without disrupting current workflows.
- Excel Ingestion
  - Use a Python-based ingestion service (`openpyxl`/`pandas`) to parse structured Excel reports
  - Handle multi-sheet workbooks, merged cells, and varying column layouts
  - Schedule periodic re-ingestion (e.g., daily, or on file-change detection via file-system watchers)
- Word / PDF Parsing
  - Use `python-docx` for Word documents and `pdfplumber`/`tabula-py` for PDF tables
  - Extract both structured tables and unstructured text sections
  - Apply template matching to identify known report formats
- SharePoint / OneDrive Connector
  - Use the Microsoft Graph API (Azure AD authentication) to pull files from SharePoint Online document libraries; note that Graph calls go out to Microsoft's cloud, so an on-premise SharePoint Server farm would instead use its own REST API or a mounted drive
  - Alternatively, mount SharePoint as a network drive and ingest via file-system polling
  - Maintain a file manifest to track versions and detect changes
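To make the "varying column layouts" point concrete, here is a minimal normalization sketch: a header-synonym map folds differently labeled columns into one canonical schema. The header names and mappings below are invented examples, not the actual report columns.

```python
# Sketch: normalize rows parsed from an Excel sheet whose column headers
# vary between reports. Header names here are illustrative assumptions.
HEADER_MAP = {
    "product": "product_name", "product name": "product_name",
    "qty": "quantity", "quantity": "quantity", "dispatch qty": "quantity",
    "region": "region", "territory": "region",
}

def normalize_sheet(rows):
    """rows[0] is the header row; remaining rows are data rows."""
    header = [HEADER_MAP.get(str(h).strip().lower()) for h in rows[0]]
    records = []
    for row in rows[1:]:
        # Keep only columns we recognize; drop unmapped ones.
        rec = {col: val for col, val in zip(header, row) if col is not None}
        if rec:
            records.append(rec)
    return records

rows = [["Product Name", "Territory", "Qty"], ["Metformin", "EU", 120]]
normalize_sheet(rows)
# -> [{"product_name": "Metformin", "region": "EU", "quantity": 120}]
```

In practice the `rows` input would come from `openpyxl`'s `sheet.iter_rows(values_only=True)` or `pandas.read_excel`, and the synonym map would be built per report template.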
3.2 Data Normalization & Storage¶
Central Data Store Options (On-Premise):
| Option | Pros | Cons |
|---|---|---|
| PostgreSQL | Mature, free, SQL-based, great for structured data | Requires setup and maintenance |
| Microsoft SQL Server | Already in enterprise stack, good Power BI integration | Licensing costs |
| SQLite + Parquet | Zero-config, lightweight for prototyping | Not ideal for concurrent multi-user access |
Recommended: PostgreSQL on an on-premise server or VM within Divi's infrastructure.
Schema Design Concepts:
── dim_products (product_id, product_name, segment, category)
── dim_customers (customer_id, name, region, type)
── dim_regions (region_id, region_name, geography)
── dim_time (date_key, week, month, quarter, year)
── fact_sales (product_id, customer_id, region_id, date_key, quantity, price, status)
── fact_forecasts (product_id, region_id, date_key, forecast_qty, outlook_type)
── fact_opportunities (product_id, customer_id, region_id, stage, probability, est_close_date, notes)
── fact_dispatch (product_id, customer_id, region_id, date_key, dispatch_qty, dispatch_status)
This star-schema design supports the exact query types mentioned: "For Product A, Customer B — which regions did they buy from?"
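To show how that question maps onto the schema, here is a self-contained SQLite sketch with synthetic data. Production would target PostgreSQL, but the join logic is identical; all table and column names follow the schema sketch above, and the data values are made up.

```python
import sqlite3

# Minimal star-schema demo (subset of columns) with synthetic data.
con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE dim_products  (product_id INTEGER PRIMARY KEY, product_name TEXT, segment TEXT);
CREATE TABLE dim_customers (customer_id INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE dim_regions   (region_id INTEGER PRIMARY KEY, region_name TEXT);
CREATE TABLE fact_sales    (product_id INT, customer_id INT, region_id INT, quantity INT);
""")
con.execute("INSERT INTO dim_products VALUES (1, 'Product A', 'Generics')")
con.execute("INSERT INTO dim_customers VALUES (1, 'Customer B')")
con.executemany("INSERT INTO dim_regions VALUES (?, ?)",
                [(1, "Europe"), (2, "North America")])
con.executemany("INSERT INTO fact_sales VALUES (?, ?, ?, ?)",
                [(1, 1, 1, 500), (1, 1, 2, 250)])

# "For Product A, Customer B -- which regions did they buy from?"
rows = con.execute("""
    SELECT r.region_name, SUM(f.quantity)
    FROM fact_sales f
    JOIN dim_products  p ON p.product_id  = f.product_id
    JOIN dim_customers c ON c.customer_id = f.customer_id
    JOIN dim_regions   r ON r.region_id   = f.region_id
    WHERE p.product_name = 'Product A' AND c.name = 'Customer B'
    GROUP BY r.region_name
    ORDER BY r.region_name
""").fetchall()
# rows -> [('Europe', 500), ('North America', 250)]
```

Every management question in the scope document should reduce to a join of this shape, which is the main argument for normalizing the reports into a star schema rather than querying files directly.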
3.3 Data Refresh Strategy¶
- Batch refresh: Nightly scheduled job scans source folders for updated files, re-ingests changed reports
- Incremental detection: Use file modification timestamps and checksums to only process changed files
- Audit log: Track every ingestion run — what was read, what changed, any parsing errors
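The incremental-detection step can be stdlib-only; a sketch of the checksum manifest (the manifest layout and function names here are our assumptions, not an existing library API):

```python
import hashlib
import os

def file_fingerprint(path, chunk=65536):
    """Checksum + mtime fingerprint used to decide whether to re-ingest."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        while block := f.read(chunk):
            h.update(block)
    return {"sha256": h.hexdigest(), "mtime": os.path.getmtime(path)}

def changed_files(paths, manifest):
    """Return paths whose content differs from the stored manifest,
    updating the manifest in place for the next run."""
    updates = []
    for p in paths:
        fp = file_fingerprint(p)
        if manifest.get(p, {}).get("sha256") != fp["sha256"]:
            updates.append(p)
            manifest[p] = fp
    return updates
```

A nightly job would load the manifest (e.g., from a JSON file or a database table), call `changed_files` over the scanned source folders, and re-ingest only the returned paths, logging each run to the audit table.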
4. Dashboard Application¶
4.1 Technology Stack¶
| Component | Recommended Technology | Rationale |
|---|---|---|
| Backend API | Python (FastAPI or Flask) | Quick development, excellent data library ecosystem |
| Dashboard Framework | Apache Superset or Metabase (self-hosted) | Open-source, on-premise, rich visualization |
| Alternative | Streamlit / Dash (Plotly) | Faster prototyping, Python-native |
| Database | PostgreSQL | Structured data, SQL queries, role-based access |
| Auth | Active Directory / LDAP integration | Leverage existing company identity system |
| Hosting | On-premise VM or Docker containers | Meets security requirement |
Recommended for Phase 1 (Mock-up): Streamlit or Plotly Dash — fastest path to a working demo with synthetic data.
Recommended for Production: Apache Superset — mature, supports role-based access, self-service exploration, and can be fully self-hosted.
4.2 Dashboard Hierarchy¶
┌─────────────────────────────────────────────┐
│ CEO / Management │
│ ─ Revenue overview across all 3 segments │
│ ─ YoY comparison (this year vs last year) │
│ ─ Sales forecast vs actuals │
│ ─ Opportunity pipeline summary │
│ ─ Top customers / top products │
└─────────────────────┬───────────────────────┘
│
┌─────────────┼─────────────┐
▼ ▼ ▼
┌──────────────┐┌──────────────┐┌──────────────┐
│ VP View ││ VP View ││ VP View │
│ Generics ││ CDMO ││ Nutraceutical│
│ ─ Segment ││ ─ Segment ││ ─ Segment │
│ revenue ││ revenue ││ revenue │
│ ─ Regional ││ ─ Customer ││ ─ Regional │
│ breakdown ││ projects ││ breakdown │
│ ─ Dispatch ││ ─ Pipeline ││ ─ Dispatch │
│ status ││ status ││ status │
└──────┬───────┘└──────┬───────┘└──────┬───────┘
▼ ▼ ▼
┌──────────────┐┌──────────────┐┌──────────────┐
│ Manager ││ Manager ││ Manager │
│ ─ Product- ││ ─ Molecule- ││ ─ Product- │
│ level data ││ level data ││ level data │
│ ─ Per-region ││ ─ Per-client ││ ─ Per-region │
│ drill-down ││ projects ││ drill-down │
│ ─ Weekly ││ ─ Feasibility││ ─ Weekly │
│ forecast ││ tracking ││ forecast │
└──────────────┘└──────────────┘└──────────────┘
4.3 Key Dashboard Views¶
**Sales & Revenue Dashboard**
- Total revenue by segment, region, customer
- Monthly/quarterly/annual comparisons
- YoY growth rates
- Top 10 products and customers

**Forecast & Outlook Dashboard**
- Weekly/monthly dispatch forecast
- Forecast vs actuals (variance analysis)
- Next-year projections
- Sales pipeline by stage

**Opportunity Tracker**
- Opportunities by stage (prospect → qualified → negotiation → closed)
- Estimated revenue from pipeline
- Opportunities unlikely to convert (with reasons)
- Time-to-conversion by region

**Dispatch & Logistics Dashboard**
- Products dispatched by week/month
- Dispatch status (pending, shipped, delivered)
- Regional dispatch volumes
- Fulfillment rate
4.4 Role-Based Access Control (RBAC)¶
| Role | Segments Visible | Data Granularity | Actions |
|---|---|---|---|
| CEO | All | Aggregated | View only |
| VP | Assigned segment(s) | Regional + customer | View, filter, export |
| Manager | Assigned segment | Full product-level | View, filter, export |
- Integrate with Active Directory for SSO
- Map AD groups to dashboard roles
- All access logged for audit compliance
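The AD-group-to-role mapping might be sketched as follows; group names like `DASH-VP-GENERICS` are hypothetical placeholders to be agreed with the IT head, and in Superset this mapping would live in the LDAP/role configuration rather than application code:

```python
# Assumed AD security groups -> dashboard entitlements (names are placeholders).
AD_GROUP_ROLES = {
    "DASH-CEO":          {"role": "CEO",     "segments": {"Generics", "CDMO", "Nutraceuticals"}},
    "DASH-VP-GENERICS":  {"role": "VP",      "segments": {"Generics"}},
    "DASH-MGR-GENERICS": {"role": "Manager", "segments": {"Generics"}},
}

def visible_segments(ad_groups):
    """Union of segments a user may see across all of their AD groups;
    unrecognized groups grant nothing."""
    segments = set()
    for group in ad_groups:
        segments |= AD_GROUP_ROLES.get(group, {}).get("segments", set())
    return segments

visible_segments(["DASH-VP-GENERICS", "Some-Unrelated-Group"])
# -> {"Generics"}
```

The same lookup table drives both row-level filtering (which segments' rows a query may return) and which dashboard views appear in the user's navigation.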
5. Security Architecture¶
Given the client's emphasis on data confidentiality:
┌──────────────────────────────────────────┐
│ Divi's Internal Network │
│ │
│ ┌─────────────┐ ┌────────────────┐ │
│ │ SharePoint / │───▶│ ETL Service │ │
│ │ OneDrive │ │ (Python) │ │
│ └─────────────┘ └───────┬────────┘ │
│ │ │
│ ┌───────▼────────┐ │
│ │ PostgreSQL │ │
│ │ (On-Premise) │ │
│ └───────┬────────┘ │
│ │ │
│ ┌──────────────────────────▼──────────┐ │
│ │ Dashboard App (Superset/Streamlit) │ │
│ │ ─ Served on internal network only │ │
│ │ ─ AD/LDAP authentication │ │
│ │ ─ TLS encryption in transit │ │
│ └─────────────────────────────────────┘ │
│ │
│ Access via: VPN or internal network │
│ No external endpoints exposed │
└──────────────────────────────────────────┘
Key Security Measures:
- All components run on-premise within Divi's firewall
- No external API calls or cloud dependencies (if SharePoint Online is a source, the Graph API traffic is the one exception and must be explicitly approved)
- Database encrypted at rest (PostgreSQL has no built-in TDE, so use disk- or filesystem-level encryption such as LUKS, plus `pgcrypto` for sensitive columns if needed)
- All traffic encrypted with TLS (internal CA certificates)
- Audit logging on all data access
- VPN access for authorized remote users only
6. Development Phases¶
Phase 1: Mock-Up with Synthetic Data (Weeks 1–3)¶
| Task | Details |
|---|---|
| Generate synthetic datasets | Mimic structure of the ~31 reports across 3 segments with fake product/customer/region names |
| Build data model | Star schema in SQLite or PostgreSQL |
| Create 3–4 sample dashboards | CEO overview, VP generics, Manager generics, Forecast view |
| Demo to client | Show query capabilities and visualizations |
Deliverable: Working prototype with synthetic data demonstrating all core interactions.
Phase 2: Data Integration (Weeks 4–7)¶
| Task | Details |
|---|---|
| Review 4-page scope document | Align data model with exact report structures and management questions |
| Build production ETL pipeline | Connect to actual SharePoint/OneDrive/local sources |
| Map all ~31 reports | Parse and normalize each report into the star schema |
| Implement RBAC | Integrate with Active Directory |
| Set up on-premise infrastructure | Deploy PostgreSQL + Dashboard app on Divi's servers |
Deliverable: Production-ready data pipeline reading from actual report sources.
Phase 3: Dashboard Refinement & Rollout (Weeks 8–10)¶
| Task | Details |
|---|---|
| Build all dashboard views | CEO, VP (×3 segments), Manager views |
| Implement drill-down capabilities | Click from segment → region → customer → product |
| Add scheduled refresh | Nightly data refresh jobs |
| User acceptance testing | Test with actual stakeholders at each level |
| Training and documentation | User guide and admin guide |
Deliverable: Fully operational dashboard platform with all role-based views.
7. Key Questions for IT Head¶
- Infrastructure: What on-premise servers / VMs are available for hosting the database and dashboard application? Is Docker/containerization supported?
- Active Directory: Can we integrate with AD/LDAP for single sign-on and role mapping?
- SharePoint Access: Can we get API access (Microsoft Graph) to SharePoint document libraries, or should we use mounted network drives?
- Network: Is there an internal DNS / reverse proxy setup for hosting the dashboard on a clean internal URL?
- Power BI Overlap: Since Power BI was recently launched — should this complement or replace certain Power BI reports? What's the intended relationship?
- Data Refresh Frequency: Is nightly refresh sufficient, or is near-real-time needed for any reports?
- VPN: What VPN infrastructure is in place for secure remote development access?
- Backup & DR: What are the backup and disaster recovery requirements for the new data store?
8. Estimated Resource Requirements¶
| Resource | Estimate |
|---|---|
| Data Engineer | 1 (ETL pipeline, data modeling) |
| Full-Stack / Dashboard Developer | 1 (Superset/Streamlit setup, RBAC) |
| On-Premise Server | 1 VM (8+ cores, 32GB RAM, 500GB SSD) |
| PostgreSQL Instance | Self-hosted on same or separate VM |
| Timeline | ~10 weeks to full production |