Spark Implementation on Microsoft Synapse
OVERVIEW
- People Tech works with multiple groups across Microsoft and brings this experience to client projects, including best practices and real-world guidance sourced from our Microsoft partnership.
- A pioneer partner with Microsoft for more than two decades, we are driven by the goal of transforming enterprises into smart businesses with our Microsoft innovations.
- The Microsoft CCM team provides metrics on the usage of cloud products. Microsoft wanted to reduce the delay before data became available to customers. The existing system processed data in 4-hour batches, and data was not available to customers for 24 hours.
- After integration, data had to pass through several layers before being made available to customers.
- The scope of this project included developing a solution in Azure Synapse that handled concurrent customer requests and reduced the lag time in data availability.
SOLUTION PROVIDED BY PEOPLE TECH GROUP
People Tech proposed the following solution:
- Separate pipelines for serving data and for concurrently refreshing the view when new batches arrive.
- Each new batch of data is loaded into memory and indexed in the background.
- A new view is created once data preprocessing finishes. Requests already in progress are allowed to complete, and then the new view is context-switched in.
- Hyperspace indexing to reduce query time on the data.
- Hyperspace is an open-source indexing subsystem for Apache Spark developed by Microsoft.
- Reduced query response time by half.
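
The serve-while-refreshing pattern described above can be illustrated with a minimal sketch. This is plain Python rather than the Synapse or Spark code used in the project, and every name in it (View, ViewServer, build_view, the customer_id key) is a hypothetical chosen for illustration. It assumes a single background loader: the new batch is indexed off to the side, requests that already hold the old view finish against it, and the new view is switched in with one reference swap.

```python
import threading
from dataclasses import dataclass
from typing import Dict, List, Tuple


@dataclass(frozen=True)
class View:
    """Immutable snapshot of one batch plus a lookup index (hypothetical shape)."""
    version: int
    rows: Tuple[dict, ...]
    index: Dict[str, Tuple[int, ...]]  # customer_id -> positions in rows


def build_view(version: int, batch: List[dict]) -> View:
    """Preprocess a batch and index it by customer_id (an assumed key)."""
    positions: Dict[str, List[int]] = {}
    for pos, row in enumerate(batch):
        positions.setdefault(row["customer_id"], []).append(pos)
    index = {key: tuple(value) for key, value in positions.items()}
    return View(version, tuple(batch), index)


class ViewServer:
    """Serves queries from the current view while a loader prepares the next."""

    def __init__(self, initial: View) -> None:
        self._current = initial
        self._lock = threading.Lock()

    def acquire(self) -> View:
        # A request takes a reference once and keeps it until it finishes.
        with self._lock:
            return self._current

    def query(self, customer_id: str) -> List[dict]:
        view = self.acquire()
        return [view.rows[i] for i in view.index.get(customer_id, ())]

    def refresh(self, batch: List[dict]) -> None:
        # Build and index the new view without holding the lock, so queries
        # are never blocked by loading. Assumes only one loader runs at a time.
        new_view = build_view(self.acquire().version + 1, batch)
        with self._lock:
            self._current = new_view  # the context switch: one reference swap


server = ViewServer(build_view(1, [{"customer_id": "a", "usage": 10}]))
in_flight = server.acquire()  # a request that started before the refresh

loader = threading.Thread(
    target=server.refresh,
    args=([{"customer_id": "a", "usage": 25}],),
)
loader.start()
loader.join()

print(in_flight.version, server.acquire().version)
print(server.query("a"))
```

Because each view is immutable, the in-flight request above still sees version 1 after the swap, while new queries see version 2. In a Spark setting the same idea would apply to cached and indexed DataFrames rather than Python tuples, but that mapping is an assumption and is not shown here.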
BENEFITS
- ETL performance improved by at least 50%, and by up to 80% for some tables.
- Presence of indexes is important to achieve optimal performance.
- Reduced complexity, lowered costs, and simplified development.
- Increased performance allowed for in-depth automated validations which would otherwise be too costly and time-consuming to run.
- Performance of ad hoc querying increased by 70% or more.
- Database backups are taken automatically, 3 times per day.
- Enabled data loading in the background without interrupting service to customers.
- Seamless context-switching to updated data when pre-processing and indexing are completed.
- Handled large amounts of incoming data.