Skip to content

Project 1: Internal Data / Dashboard Platform

Implementation Ideas for IT Head Discussion


1. Executive Summary

Build an on-premise analytics dashboard platform that consolidates ~30+ management reports (Excel, Word, PDF) across three business segments — Generics, CDMO, and Nutraceuticals — into interactive, role-based dashboards for CEO, VP, and Manager-level users. The system replaces the current "CRM in Excel" approach with a unified data layer and visual reporting interface.


2. Current State Assessment

Aspect Current State
Data Sources SharePoint, OneDrive, local drives
Formats Excel (primary), Word, PDF, rarely images
Report Count ~10 Generics + ~6 CDMO + ~15 Nutraceuticals ≈ 31 reports
Access Method Manual review of individual files
CRM Effectively an Excel-based CRM
BI Tools Power BI recently launched (separate from this project)
Key Constraint Data is extremely confidential — must remain 100% on-premise

3. Data Architecture

3.1 Data Ingestion Layer

Approach: Build an automated ETL pipeline that reads from existing file locations without disrupting current workflows.

  • Excel Ingestion
  • Use a Python-based ingestion service (openpyxl / pandas) to parse structured Excel reports
  • Handle multi-sheet workbooks, merged cells, and varying column layouts
  • Schedule periodic re-ingestion (e.g., daily or on file-change detection via file-system watchers)

  • Word / PDF Parsing

  • Use python-docx for Word documents and pdfplumber / tabula-py for PDF tables
  • Extract both structured tables and unstructured text sections
  • Apply template matching to identify known report formats

  • SharePoint / OneDrive Connector

  • Use Microsoft Graph API (on-premise Azure AD auth) to pull files from SharePoint document libraries
  • Alternatively, mount SharePoint as a network drive and ingest via file-system polling
  • Maintain a file manifest to track versions and detect changes

3.2 Data Normalization & Storage

Central Data Store Options (On-Premise):

Option Pros Cons
PostgreSQL Mature, free, SQL-based, great for structured data Requires setup and maintenance
Microsoft SQL Server Already in enterprise stack, good Power BI integration Licensing costs
SQLite + Parquet Zero-config, lightweight for prototyping Not ideal for concurrent multi-user access

Recommended: PostgreSQL on an on-premise server or VM within Divi's infrastructure.

Schema Design Concepts:

── dim_products (product_id, product_name, segment, category)
── dim_customers (customer_id, name, region, type)
── dim_regions (region_id, region_name, geography)
── dim_time (date_key, week, month, quarter, year)
── fact_sales (product_id, customer_id, region_id, date_key, quantity, price, status)
── fact_forecasts (product_id, region_id, date_key, forecast_qty, outlook_type)
── fact_opportunities (product_id, customer_id, region_id, stage, probability, est_close_date, notes)
── fact_dispatch (product_id, customer_id, region_id, date_key, dispatch_qty, dispatch_status)

This star-schema design supports the exact query types mentioned: "For Product A, Customer B — which regions did they buy from?"

3.3 Data Refresh Strategy

  • Batch refresh: Nightly scheduled job scans source folders for updated files, re-ingests changed reports
  • Incremental detection: Use file modification timestamps and checksums to only process changed files
  • Audit log: Track every ingestion run — what was read, what changed, any parsing errors

4. Dashboard Application

4.1 Technology Stack

Component Recommended Technology Rationale
Backend API Python (FastAPI or Flask) Quick development, excellent data library ecosystem
Dashboard Framework Apache Superset or Metabase (self-hosted) Open-source, on-premise, rich visualization
Alternative Streamlit / Dash (Plotly) Faster prototyping, Python-native
Database PostgreSQL Structured data, SQL queries, role-based access
Auth Active Directory / LDAP integration Leverage existing company identity system
Hosting On-premise VM or Docker containers Meets security requirement

Recommended for Phase 1 (Mock-up): Streamlit or Plotly Dash — fastest path to a working demo with synthetic data.

Recommended for Production: Apache Superset — mature, supports role-based access, self-service exploration, and can be fully self-hosted.

4.2 Dashboard Hierarchy

┌─────────────────────────────────────────────┐
│              CEO / Management               │
│  ─ Revenue overview across all 3 segments   │
│  ─ YoY comparison (this year vs last year)  │
│  ─ Sales forecast vs actuals                │
│  ─ Opportunity pipeline summary             │
│  ─ Top customers / top products             │
└─────────────────────┬───────────────────────┘
                      │
        ┌─────────────┼─────────────┐
        ▼             ▼             ▼
┌──────────────┐┌──────────────┐┌──────────────┐
│   VP View    ││   VP View    ││   VP View    │
│  Generics    ││    CDMO      ││ Nutraceutical│
│ ─ Segment    ││ ─ Segment    ││ ─ Segment    │
│   revenue    ││   revenue    ││   revenue    │
│ ─ Regional   ││ ─ Customer   ││ ─ Regional   │
│   breakdown  ││   projects   ││   breakdown  │
│ ─ Dispatch   ││ ─ Pipeline   ││ ─ Dispatch   │
│   status     ││   status     ││   status     │
└──────┬───────┘└──────┬───────┘└──────┬───────┘
       ▼               ▼               ▼
┌──────────────┐┌──────────────┐┌──────────────┐
│   Manager    ││   Manager    ││   Manager    │
│ ─ Product-   ││ ─ Molecule-  ││ ─ Product-   │
│   level data ││   level data ││   level data │
│ ─ Per-region ││ ─ Per-client ││ ─ Per-region │
│   drill-down ││   projects   ││   drill-down │
│ ─ Weekly     ││ ─ Feasibility││ ─ Weekly     │
│   forecast   ││   tracking   ││   forecast   │
└──────────────┘└──────────────┘└──────────────┘

4.3 Key Dashboard Views

Sales & Revenue Dashboard - Total revenue by segment, region, customer - Monthly/quarterly/annual comparisons - YoY growth rates - Top 10 products and customers

Forecast & Outlook Dashboard - Weekly/monthly dispatch forecast - Forecast vs actuals (variance analysis) - Next year projections - Sales pipeline by stage

Opportunity Tracker - Opportunities by stage (prospect → qualified → negotiation → closed) - Estimated revenue from pipeline - Opportunities unlikely to convert (with reasons) - Time-to-conversion by region

Dispatch & Logistics Dashboard - Products dispatched by week/month - Dispatch status (pending, shipped, delivered) - Regional dispatch volumes - Fulfillment rate

4.4 Role-Based Access Control (RBAC)

Role          │ Segments Visible    │ Data Granularity      │ Actions
──────────────┼─────────────────────┼───────────────────────┼──────────────────
CEO           │ All                 │ Aggregated            │ View only
VP            │ Assigned segment(s) │ Regional + customer   │ View, filter, export
Manager       │ Assigned segment    │ Full product-level    │ View, filter, export
  • Integrate with Active Directory for SSO
  • Map AD groups to dashboard roles
  • All access logged for audit compliance

5. Security Architecture

Given the client's emphasis on data confidentiality:

┌──────────────────────────────────────────┐
│          Divi's Internal Network         │
│                                          │
│  ┌─────────────┐    ┌────────────────┐   │
│  │ SharePoint / │───▶│  ETL Service   │   │
│  │ OneDrive     │    │  (Python)      │   │
│  └─────────────┘    └───────┬────────┘   │
│                             │             │
│                     ┌───────▼────────┐   │
│                     │  PostgreSQL    │   │
│                     │  (On-Premise)  │   │
│                     └───────┬────────┘   │
│                             │             │
│  ┌──────────────────────────▼──────────┐ │
│  │  Dashboard App (Superset/Streamlit) │ │
│  │  ─ Served on internal network only  │ │
│  │  ─ AD/LDAP authentication           │ │
│  │  ─ TLS encryption in transit        │ │
│  └─────────────────────────────────────┘ │
│                                          │
│  Access via: VPN or internal network     │
│  No external endpoints exposed           │
└──────────────────────────────────────────┘

Key Security Measures: - All components run on-premise within Divi's firewall - No external API calls or cloud dependencies - Database encrypted at rest (PostgreSQL TDE or disk-level encryption) - All traffic encrypted with TLS (internal CA certificates) - Audit logging on all data access - VPN access for authorized remote users only


6. Development Phases

Phase 1: Mock-Up with Synthetic Data (Weeks 1–3)

Task Details
Generate synthetic datasets Mimic structure of the ~31 reports across 3 segments with fake product/customer/region names
Build data model Star schema in SQLite or PostgreSQL
Create 3–4 sample dashboards CEO overview, VP generics, Manager generics, Forecast view
Demo to client Show query capabilities and visualizations

Deliverable: Working prototype with synthetic data demonstrating all core interactions.

Phase 2: Data Integration (Weeks 4–7)

Task Details
Review 4-page scope document Align data model with exact report structures and management questions
Build production ETL pipeline Connect to actual SharePoint/OneDrive/local sources
Map all ~31 reports Parse and normalize each report into the star schema
Implement RBAC Integrate with Active Directory
Set up on-premise infrastructure Deploy PostgreSQL + Dashboard app on Divi's servers

Deliverable: Production-ready data pipeline reading from actual report sources.

Phase 3: Dashboard Refinement & Rollout (Weeks 8–10)

Task Details
Build all dashboard views CEO, VP (×3 segments), Manager views
Implement drill-down capabilities Click from segment → region → customer → product
Add scheduled refresh Nightly data refresh jobs
User acceptance testing Test with actual stakeholders at each level
Training and documentation User guide and admin guide

Deliverable: Fully operational dashboard platform with all role-based views.


7. Key Questions for IT Head

  1. Infrastructure: What on-premise servers / VMs are available for hosting the database and dashboard application? Is Docker/containerization supported?
  2. Active Directory: Can we integrate with AD/LDAP for single sign-on and role mapping?
  3. SharePoint Access: Can we get API access (Microsoft Graph) to SharePoint document libraries, or should we use mounted network drives?
  4. Network: Is there an internal DNS / reverse proxy setup for hosting the dashboard on a clean internal URL?
  5. Power BI Overlap: Since Power BI was recently launched — should this complement or replace certain Power BI reports? What's the intended relationship?
  6. Data Refresh Frequency: Is nightly refresh sufficient, or is near-real-time needed for any reports?
  7. VPN: What VPN infrastructure is in place for secure remote development access?
  8. Backup & DR: What are the backup and disaster recovery requirements for the new data store?

8. Estimated Resource Requirements

Resource Estimate
Data Engineer 1 (ETL pipeline, data modeling)
Full-Stack / Dashboard Developer 1 (Superset/Streamlit setup, RBAC)
On-Premise Server 1 VM (8+ cores, 32GB RAM, 500GB SSD)
PostgreSQL Instance Self-hosted on same or separate VM
Timeline ~10 weeks to full production