Howard University - Data Lake Solution Accelerator for Low Latency Data Processing
Business Problem
The requirement was to develop a data platform to collect, organize, and process data; provide insight into business and operational aspects; and enable development of value-added, data-driven product features and dashboards for customers.
Raw data was stored with no oversight of its contents.
The platform needs defined mechanisms to catalog and secure data. Without these elements, data cannot be found or trusted, resulting in a "data swamp".
Solution
Real-time streaming of data from source systems (batch load scripts are also in place).
Connectors developed for Oracle DB (PeopleSoft, Banner) and Workday.
Uses the Oracle DB Streams feature to identify changes from redo logs and stream them to Kafka.
Data access is controlled with views set up in Apache Hive, which is connected to the data lake.
ETL jobs run in a loop, identify changed files (via Hive), and update the Report Mart. Sample ETL scripts and reports were developed for HR diversity data. PostgreSQL acts as the Report Mart.
Data changes from source systems are reflected in the reports within two minutes.
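To make the streaming path above concrete, the sketch below shows the shape a redo-log change event might take when published to Kafka. This is an illustrative assumption, not the actual connector schema: the field names, the table name PS_EMPLOYEES, and the topic name oracle.changes are all hypothetical.

```python
import json
from datetime import datetime, timezone

def to_change_event(table: str, operation: str, row: dict) -> str:
    """Serialize a captured redo-log change as a JSON Kafka message body.

    Field names here are illustrative, not the real connector's schema.
    """
    event = {
        "source": "oracle",        # originating system (PeopleSoft/Banner DB)
        "table": table,            # e.g. "PS_EMPLOYEES" (hypothetical name)
        "op": operation,           # "INSERT", "UPDATE", or "DELETE"
        "row": row,                # column name -> new value
        "captured_at": datetime.now(timezone.utc).isoformat(),
    }
    return json.dumps(event)

# Publishing side (requires a running broker and a Kafka client library),
# shown only as a comment so the sketch stays self-contained:
# producer.send("oracle.changes",
#               to_change_event("PS_EMPLOYEES", "UPDATE", row).encode("utf-8"))
```

Downstream consumers can then apply each event to the lake in arrival order, which is what keeps the end-to-end latency low.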
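The looping ETL described above can be sketched as a watermark-based poll: each cycle picks up rows changed since the last cycle and upserts them into the Report Mart. In this minimal sketch, sqlite3 stands in for PostgreSQL and a local table stands in for the Hive-backed change list; the table and column names (hr_diversity, emplid, version) are hypothetical.

```python
import sqlite3

def run_etl_cycle(lake: sqlite3.Connection, mart: sqlite3.Connection,
                  watermark: int) -> int:
    """Copy rows changed after `watermark` into the mart; return new watermark."""
    changed = lake.execute(
        "SELECT emplid, dept, version FROM hr_diversity WHERE version > ?",
        (watermark,),
    ).fetchall()
    for emplid, dept, version in changed:
        # Upsert so repeated cycles stay idempotent for already-seen rows.
        mart.execute(
            "INSERT INTO hr_diversity_report (emplid, dept) VALUES (?, ?) "
            "ON CONFLICT(emplid) DO UPDATE SET dept = excluded.dept",
            (emplid, dept),
        )
        watermark = max(watermark, version)
    mart.commit()
    return watermark

# Demo setup with in-memory databases standing in for the lake and the mart.
lake = sqlite3.connect(":memory:")
lake.execute("CREATE TABLE hr_diversity (emplid TEXT, dept TEXT, version INTEGER)")
mart = sqlite3.connect(":memory:")
mart.execute("CREATE TABLE hr_diversity_report (emplid TEXT PRIMARY KEY, dept TEXT)")

lake.execute("INSERT INTO hr_diversity VALUES ('1001', 'HR', 1)")
wm = run_etl_cycle(lake, mart, watermark=0)   # first cycle picks up 1001
lake.execute("INSERT INTO hr_diversity VALUES ('1002', 'IT', 2)")
wm = run_etl_cycle(lake, mart, wm)            # second cycle picks up only 1002
```

The watermark is what bounds the latency: a cycle only scans rows newer than the last completed cycle, so running the loop every minute or so keeps report freshness within the two-minute target.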
Outcome
Easier and quicker to populate, as no transformation is involved
Allows importing any amount of data arriving in real time
Allows organizations to generate different types of insights, including reporting on historical data
Ability to store all types of structured and unstructured data
Elimination of data silos
Democratized access via a single, unified view of data