Howard University - Data Lake Solution Accelerator for Low Latency Data Processing
Business Problem
The requirement was to develop a data platform to collect, organize, and process data; provide insight into business and operational aspects; and enable development of value-added, data-driven product features and dashboards for customers.
Raw data was stored with no oversight of its contents.
The platform needs defined mechanisms to catalog and secure data. Without these elements, data cannot be found or trusted, resulting in a "data swamp".
Solution
Real-time streaming of data from source systems (batch load scripts are also in place).
Connectors developed for Oracle DB (PeopleSoft, Banner) and Workday.
Uses the Oracle DB Streams feature to identify changes from redo logs and stream them to Kafka.
Data access is controlled with views set up in Apache Hive, which is connected to the data lake.
ETL jobs run in a loop, identify changed files (via Hive), and update the Report Mart. Sample ETL scripts and reports were developed for HR diversity data. PostgreSQL acts as the Report Mart.
Data changes from source systems are reflected in the reports within two minutes.
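To make the streaming path above concrete, the sketch below shows the shape a redo-log change event might take when published to Kafka. This is an illustrative assumption, not the actual connector schema: the field names, the table name PS_EMPLOYEES, and the topic name oracle.changes are all hypothetical.

```python
import json
from datetime import datetime, timezone

def to_change_event(table: str, operation: str, row: dict) -> str:
    """Serialize a captured redo-log change as a JSON Kafka message body.

    Field names here are illustrative, not the real connector's schema.
    """
    event = {
        "source": "oracle",        # originating system (PeopleSoft/Banner DB)
        "table": table,            # e.g. "PS_EMPLOYEES" (hypothetical name)
        "op": operation,           # "INSERT", "UPDATE", or "DELETE"
        "row": row,                # column name -> new value
        "captured_at": datetime.now(timezone.utc).isoformat(),
    }
    return json.dumps(event)

# Publishing side (requires a running broker and a Kafka client library),
# shown only as a comment so the sketch stays self-contained:
# producer.send("oracle.changes",
#               to_change_event("PS_EMPLOYEES", "UPDATE", row).encode("utf-8"))
```

Downstream consumers can then apply each event to the lake in arrival order, which is what keeps the end-to-end latency low.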
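The looping ETL described above can be sketched as a watermark-based poll: each cycle picks up rows changed since the last cycle and upserts them into the Report Mart. In this minimal sketch, sqlite3 stands in for PostgreSQL and a local table stands in for the Hive-backed change list; the table and column names (hr_diversity, emplid, version) are hypothetical.

```python
import sqlite3

def run_etl_cycle(lake: sqlite3.Connection, mart: sqlite3.Connection,
                  watermark: int) -> int:
    """Copy rows changed after `watermark` into the mart; return new watermark."""
    changed = lake.execute(
        "SELECT emplid, dept, version FROM hr_diversity WHERE version > ?",
        (watermark,),
    ).fetchall()
    for emplid, dept, version in changed:
        # Upsert so repeated cycles stay idempotent for already-seen rows.
        mart.execute(
            "INSERT INTO hr_diversity_report (emplid, dept) VALUES (?, ?) "
            "ON CONFLICT(emplid) DO UPDATE SET dept = excluded.dept",
            (emplid, dept),
        )
        watermark = max(watermark, version)
    mart.commit()
    return watermark

# Demo setup with in-memory databases standing in for the lake and the mart.
lake = sqlite3.connect(":memory:")
lake.execute("CREATE TABLE hr_diversity (emplid TEXT, dept TEXT, version INTEGER)")
mart = sqlite3.connect(":memory:")
mart.execute("CREATE TABLE hr_diversity_report (emplid TEXT PRIMARY KEY, dept TEXT)")

lake.execute("INSERT INTO hr_diversity VALUES ('1001', 'HR', 1)")
wm = run_etl_cycle(lake, mart, watermark=0)   # first cycle picks up 1001
lake.execute("INSERT INTO hr_diversity VALUES ('1002', 'IT', 2)")
wm = run_etl_cycle(lake, mart, wm)            # second cycle picks up only 1002
```

The watermark is what bounds the latency: a cycle only scans rows newer than the last completed cycle, so running the loop every minute or so keeps report freshness within the two-minute target.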
Outcome
Easier and quicker to populate, as no transformation is involved
Allows importing any amount of data arriving in real time
Allows organizations to generate different types of insights, including reporting on historical data
Ability to store all types of structured and unstructured data
Elimination of data silos
Democratized access via a single, unified view of data