Project 1: Internal Data / Dashboard Platform¶
Implementation Ideas for IT Head Discussion¶
1. Executive Summary¶
Build an on-premise analytics dashboard platform that consolidates ~30+ management reports (Excel, Word, PDF) across three business segments — Generics, CDMO, and Nutraceuticals — into interactive, role-based dashboards for CEO, VP, and Manager-level users. The system replaces the current "CRM in Excel" approach with a unified data layer and visual reporting interface.
2. Current State Assessment¶
| Aspect | Current State |
|---|---|
| Data Sources | SharePoint, OneDrive, local drives |
| Formats | Excel (primary), Word, PDF, occasionally images |
| Report Count | ~10 Generics + ~6 CDMO + ~15 Nutraceuticals ≈ 31 reports |
| Access Method | Manual review of individual files |
| CRM | Effectively an Excel-based CRM |
| BI Tools | Power BI recently launched (separate from this project) |
| Key Constraint | Data is extremely confidential — must remain 100% on-premise |
3. Data Architecture¶
3.1 Data Ingestion Layer¶
Approach: Build an automated ETL pipeline that reads from existing file locations without disrupting current workflows.
- Excel Ingestion
  - Use a Python-based ingestion service (`openpyxl`/`pandas`) to parse structured Excel reports
  - Handle multi-sheet workbooks, merged cells, and varying column layouts
  - Schedule periodic re-ingestion (e.g., daily, or on file-change detection via file-system watchers)
- Word / PDF Parsing
  - Use `python-docx` for Word documents and `pdfplumber`/`tabula-py` for PDF tables
  - Extract both structured tables and unstructured text sections
  - Apply template matching to identify known report formats
- SharePoint / OneDrive Connector
  - Use the Microsoft Graph API (Azure AD authentication) to pull files from SharePoint Online document libraries; note that Graph calls go out to Microsoft's cloud, so an on-premise SharePoint Server farm would instead use its own REST API or a mounted drive
  - Alternatively, mount SharePoint as a network drive and ingest via file-system polling
  - Maintain a file manifest to track versions and detect changes
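To make the "varying column layouts" point concrete, here is a minimal normalization sketch: a header-synonym map folds differently labeled columns into one canonical schema. The header names and mappings below are invented examples, not the actual report columns.

```python
# Sketch: normalize rows parsed from an Excel sheet whose column headers
# vary between reports. Header names here are illustrative assumptions.
HEADER_MAP = {
    "product": "product_name", "product name": "product_name",
    "qty": "quantity", "quantity": "quantity", "dispatch qty": "quantity",
    "region": "region", "territory": "region",
}

def normalize_sheet(rows):
    """rows[0] is the header row; remaining rows are data rows."""
    header = [HEADER_MAP.get(str(h).strip().lower()) for h in rows[0]]
    records = []
    for row in rows[1:]:
        # Keep only columns we recognize; drop unmapped ones.
        rec = {col: val for col, val in zip(header, row) if col is not None}
        if rec:
            records.append(rec)
    return records

rows = [["Product Name", "Territory", "Qty"], ["Metformin", "EU", 120]]
normalize_sheet(rows)
# -> [{"product_name": "Metformin", "region": "EU", "quantity": 120}]
```

In practice the `rows` input would come from `openpyxl`'s `sheet.iter_rows(values_only=True)` or `pandas.read_excel`, and the synonym map would be built per report template.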
3.2 Data Normalization & Storage¶
Central Data Store Options (On-Premise):
| Option | Pros | Cons |
|---|---|---|
| PostgreSQL | Mature, free, SQL-based, great for structured data | Requires setup and maintenance |
| Microsoft SQL Server | Already in enterprise stack, good Power BI integration | Licensing costs |
| SQLite + Parquet | Zero-config, lightweight for prototyping | Not ideal for concurrent multi-user access |
Recommended: PostgreSQL on an on-premise server or VM within Divi's infrastructure.
Schema Design Concepts:
── dim_products (product_id, product_name, segment, category)
── dim_customers (customer_id, name, region, type)
── dim_regions (region_id, region_name, geography)
── dim_time (date_key, week, month, quarter, year)
── fact_sales (product_id, customer_id, region_id, date_key, quantity, price, status)
── fact_forecasts (product_id, region_id, date_key, forecast_qty, outlook_type)
── fact_opportunities (product_id, customer_id, region_id, stage, probability, est_close_date, notes)
── fact_dispatch (product_id, customer_id, region_id, date_key, dispatch_qty, dispatch_status)
This star-schema design supports the exact query types mentioned: "For Product A, Customer B — which regions did they buy from?"
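To show how that question maps onto the schema, here is a self-contained SQLite sketch with synthetic data. Production would target PostgreSQL, but the join logic is identical; all table and column names follow the schema sketch above, and the data values are made up.

```python
import sqlite3

# Minimal star-schema demo (subset of columns) with synthetic data.
con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE dim_products  (product_id INTEGER PRIMARY KEY, product_name TEXT, segment TEXT);
CREATE TABLE dim_customers (customer_id INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE dim_regions   (region_id INTEGER PRIMARY KEY, region_name TEXT);
CREATE TABLE fact_sales    (product_id INT, customer_id INT, region_id INT, quantity INT);
""")
con.execute("INSERT INTO dim_products VALUES (1, 'Product A', 'Generics')")
con.execute("INSERT INTO dim_customers VALUES (1, 'Customer B')")
con.executemany("INSERT INTO dim_regions VALUES (?, ?)",
                [(1, "Europe"), (2, "North America")])
con.executemany("INSERT INTO fact_sales VALUES (?, ?, ?, ?)",
                [(1, 1, 1, 500), (1, 1, 2, 250)])

# "For Product A, Customer B -- which regions did they buy from?"
rows = con.execute("""
    SELECT r.region_name, SUM(f.quantity)
    FROM fact_sales f
    JOIN dim_products  p ON p.product_id  = f.product_id
    JOIN dim_customers c ON c.customer_id = f.customer_id
    JOIN dim_regions   r ON r.region_id   = f.region_id
    WHERE p.product_name = 'Product A' AND c.name = 'Customer B'
    GROUP BY r.region_name
    ORDER BY r.region_name
""").fetchall()
# rows -> [('Europe', 500), ('North America', 250)]
```

Every management question in the scope document should reduce to a join of this shape, which is the main argument for normalizing the reports into a star schema rather than querying files directly.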
3.3 Data Refresh Strategy¶
- Batch refresh: Nightly scheduled job scans source folders for updated files, re-ingests changed reports
- Incremental detection: Use file modification timestamps and checksums to only process changed files
- Audit log: Track every ingestion run — what was read, what changed, any parsing errors
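The incremental-detection step can be stdlib-only; a sketch of the checksum manifest (the manifest layout and function names here are our assumptions, not an existing library API):

```python
import hashlib
import os

def file_fingerprint(path, chunk=65536):
    """Checksum + mtime fingerprint used to decide whether to re-ingest."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        while block := f.read(chunk):
            h.update(block)
    return {"sha256": h.hexdigest(), "mtime": os.path.getmtime(path)}

def changed_files(paths, manifest):
    """Return paths whose content differs from the stored manifest,
    updating the manifest in place for the next run."""
    updates = []
    for p in paths:
        fp = file_fingerprint(p)
        if manifest.get(p, {}).get("sha256") != fp["sha256"]:
            updates.append(p)
            manifest[p] = fp
    return updates
```

A nightly job would load the manifest (e.g., from a JSON file or a database table), call `changed_files` over the scanned source folders, and re-ingest only the returned paths, logging each run to the audit table.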
4. Dashboard Application¶
4.1 Technology Stack¶
| Component | Recommended Technology | Rationale |
|---|---|---|
| Backend API | Python (FastAPI or Flask) | Quick development, excellent data library ecosystem |
| Dashboard Framework | Apache Superset or Metabase (self-hosted) | Open-source, on-premise, rich visualization |
| Alternative | Streamlit / Dash (Plotly) | Faster prototyping, Python-native |
| Database | PostgreSQL | Structured data, SQL queries, role-based access |
| Auth | Active Directory / LDAP integration | Leverage existing company identity system |
| Hosting | On-premise VM or Docker containers | Meets security requirement |
Recommended for Phase 1 (Mock-up): Streamlit or Plotly Dash — fastest path to a working demo with synthetic data.
Recommended for Production: Apache Superset — mature, supports role-based access, self-service exploration, and can be fully self-hosted.
4.2 Dashboard Hierarchy¶
┌─────────────────────────────────────────────┐
│ CEO / Management │
│ ─ Revenue overview across all 3 segments │
│ ─ YoY comparison (this year vs last year) │
│ ─ Sales forecast vs actuals │
│ ─ Opportunity pipeline summary │
│ ─ Top customers / top products │
└─────────────────────┬───────────────────────┘
│
┌─────────────┼─────────────┐
▼ ▼ ▼
┌──────────────┐┌──────────────┐┌──────────────┐
│ VP View ││ VP View ││ VP View │
│ Generics ││ CDMO ││ Nutraceutical│
│ ─ Segment ││ ─ Segment ││ ─ Segment │
│ revenue ││ revenue ││ revenue │
│ ─ Regional ││ ─ Customer ││ ─ Regional │
│ breakdown ││ projects ││ breakdown │
│ ─ Dispatch ││ ─ Pipeline ││ ─ Dispatch │
│ status ││ status ││ status │
└──────┬───────┘└──────┬───────┘└──────┬───────┘
▼ ▼ ▼
┌──────────────┐┌──────────────┐┌──────────────┐
│ Manager ││ Manager ││ Manager │
│ ─ Product- ││ ─ Molecule- ││ ─ Product- │
│ level data ││ level data ││ level data │
│ ─ Per-region ││ ─ Per-client ││ ─ Per-region │
│ drill-down ││ projects ││ drill-down │
│ ─ Weekly ││ ─ Feasibility││ ─ Weekly │
│ forecast ││ tracking ││ forecast │
└──────────────┘└──────────────┘└──────────────┘
4.3 Key Dashboard Views¶
**Sales & Revenue Dashboard**
- Total revenue by segment, region, customer
- Monthly/quarterly/annual comparisons
- YoY growth rates
- Top 10 products and customers

**Forecast & Outlook Dashboard**
- Weekly/monthly dispatch forecast
- Forecast vs actuals (variance analysis)
- Next-year projections
- Sales pipeline by stage

**Opportunity Tracker**
- Opportunities by stage (prospect → qualified → negotiation → closed)
- Estimated revenue from pipeline
- Opportunities unlikely to convert (with reasons)
- Time-to-conversion by region

**Dispatch & Logistics Dashboard**
- Products dispatched by week/month
- Dispatch status (pending, shipped, delivered)
- Regional dispatch volumes
- Fulfillment rate
4.4 Role-Based Access Control (RBAC)¶
| Role | Segments Visible | Data Granularity | Actions |
|---|---|---|---|
| CEO | All | Aggregated | View only |
| VP | Assigned segment(s) | Regional + customer | View, filter, export |
| Manager | Assigned segment | Full product-level | View, filter, export |
- Integrate with Active Directory for SSO
- Map AD groups to dashboard roles
- All access logged for audit compliance
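The AD-group-to-role mapping might be sketched as follows; group names like `DASH-VP-GENERICS` are hypothetical placeholders to be agreed with the IT head, and in Superset this mapping would live in the LDAP/role configuration rather than application code:

```python
# Assumed AD security groups -> dashboard entitlements (names are placeholders).
AD_GROUP_ROLES = {
    "DASH-CEO":          {"role": "CEO",     "segments": {"Generics", "CDMO", "Nutraceuticals"}},
    "DASH-VP-GENERICS":  {"role": "VP",      "segments": {"Generics"}},
    "DASH-MGR-GENERICS": {"role": "Manager", "segments": {"Generics"}},
}

def visible_segments(ad_groups):
    """Union of segments a user may see across all of their AD groups;
    unrecognized groups grant nothing."""
    segments = set()
    for group in ad_groups:
        segments |= AD_GROUP_ROLES.get(group, {}).get("segments", set())
    return segments

visible_segments(["DASH-VP-GENERICS", "Some-Unrelated-Group"])
# -> {"Generics"}
```

The same lookup table drives both row-level filtering (which segments' rows a query may return) and which dashboard views appear in the user's navigation.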
5. Security Architecture¶
Given the client's emphasis on data confidentiality:
┌──────────────────────────────────────────┐
│ Divi's Internal Network │
│ │
│ ┌─────────────┐ ┌────────────────┐ │
│ │ SharePoint / │───▶│ ETL Service │ │
│ │ OneDrive │ │ (Python) │ │
│ └─────────────┘ └───────┬────────┘ │
│ │ │
│ ┌───────▼────────┐ │
│ │ PostgreSQL │ │
│ │ (On-Premise) │ │
│ └───────┬────────┘ │
│ │ │
│ ┌──────────────────────────▼──────────┐ │
│ │ Dashboard App (Superset/Streamlit) │ │
│ │ ─ Served on internal network only │ │
│ │ ─ AD/LDAP authentication │ │
│ │ ─ TLS encryption in transit │ │
│ └─────────────────────────────────────┘ │
│ │
│ Access via: VPN or internal network │
│ No external endpoints exposed │
└──────────────────────────────────────────┘
Key Security Measures:
- All components run on-premise within Divi's firewall
- No external API calls or cloud dependencies (if SharePoint Online is a source, the Graph API traffic is the one exception and must be explicitly approved)
- Database encrypted at rest (PostgreSQL has no built-in TDE, so use disk- or filesystem-level encryption such as LUKS, plus `pgcrypto` for sensitive columns if needed)
- All traffic encrypted with TLS (internal CA certificates)
- Audit logging on all data access
- VPN access for authorized remote users only
6. Development Phases¶
Phase 1: Mock-Up with Synthetic Data (Weeks 1–3)¶
| Task | Details |
|---|---|
| Generate synthetic datasets | Mimic structure of the ~31 reports across 3 segments with fake product/customer/region names |
| Build data model | Star schema in SQLite or PostgreSQL |
| Create 3–4 sample dashboards | CEO overview, VP generics, Manager generics, Forecast view |
| Demo to client | Show query capabilities and visualizations |
Deliverable: Working prototype with synthetic data demonstrating all core interactions.
Phase 2: Data Integration (Weeks 4–7)¶
| Task | Details |
|---|---|
| Review 4-page scope document | Align data model with exact report structures and management questions |
| Build production ETL pipeline | Connect to actual SharePoint/OneDrive/local sources |
| Map all ~31 reports | Parse and normalize each report into the star schema |
| Implement RBAC | Integrate with Active Directory |
| Set up on-premise infrastructure | Deploy PostgreSQL + Dashboard app on Divi's servers |
Deliverable: Production-ready data pipeline reading from actual report sources.
Phase 3: Dashboard Refinement & Rollout (Weeks 8–10)¶
| Task | Details |
|---|---|
| Build all dashboard views | CEO, VP (×3 segments), Manager views |
| Implement drill-down capabilities | Click from segment → region → customer → product |
| Add scheduled refresh | Nightly data refresh jobs |
| User acceptance testing | Test with actual stakeholders at each level |
| Training and documentation | User guide and admin guide |
Deliverable: Fully operational dashboard platform with all role-based views.
7. Key Questions for IT Head¶
- Infrastructure: What on-premise servers / VMs are available for hosting the database and dashboard application? Is Docker/containerization supported?
- Active Directory: Can we integrate with AD/LDAP for single sign-on and role mapping?
- SharePoint Access: Can we get API access (Microsoft Graph) to SharePoint document libraries, or should we use mounted network drives?
- Network: Is there an internal DNS / reverse proxy setup for hosting the dashboard on a clean internal URL?
- Power BI Overlap: Since Power BI was recently launched — should this complement or replace certain Power BI reports? What's the intended relationship?
- Data Refresh Frequency: Is nightly refresh sufficient, or is near-real-time needed for any reports?
- VPN: What VPN infrastructure is in place for secure remote development access?
- Backup & DR: What are the backup and disaster recovery requirements for the new data store?
8. Estimated Resource Requirements¶
| Resource | Estimate |
|---|---|
| Data Engineer | 1 (ETL pipeline, data modeling) |
| Full-Stack / Dashboard Developer | 1 (Superset/Streamlit setup, RBAC) |
| On-Premise Server | 1 VM (8+ cores, 32GB RAM, 500GB SSD) |
| PostgreSQL Instance | Self-hosted on same or separate VM |
| Timeline | ~10 weeks to full production |