Streaming Data Changes to a Data Lake with Debezium and Delta Lake Pipeline

Category: IT Technology · Published: 6 years ago


WORK-IN-PROGRESS

[Figure: delta-architecture]

Medium post: Streaming data changes to a Data Lake with Debezium and Delta Lake pipeline
https://medium.com/@yinondn/streaming-data-changes-to-a-data-lake-with-debezium-and-delta-lake-pipeline-299821053dc3

This is an example end-to-end project that demonstrates the combined Debezium and Delta Lake pipeline.

See the Medium post for more details.

High Level Strategy Overview

Debezium reads the database logs, produces JSON messages that describe the changes, and streams them to Kafka.
Kafka streams the messages and stores them in an S3 folder. We call this the Bronze table because it stores the raw messages.
Using Spark with Delta Lake, we transform the messages into INSERT, UPDATE and DELETE operations and run them on the target data lake table. This table holds the latest state of all source databases. We call it the Silver table.
Next, we can run further aggregations on the Silver table for analytics. We call the result the Gold table.
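The Bronze, Silver and Gold flow above can be illustrated with a minimal pure-Python sketch. It is not the project's Spark code: the "tables" are plain in-memory structures, and the voter columns (`id`, `name`, `party`) are hypothetical. It assumes each message has already been reduced to the Debezium change envelope, whose `op` field is `c` (create), `u` (update), `d` (delete) or `r` (snapshot read), with the row images in `before` and `after`.

```python
from collections import Counter


def apply_change(silver, event, key="id"):
    """Apply one Debezium-style change event to an in-memory Silver table.

    `silver` maps primary-key values to the latest row image.
    The `key` column name is an assumption made for this example.
    """
    op = event["op"]
    if op in ("c", "r", "u"):
        # Create, snapshot read, and update all carry the new row in "after".
        row = event["after"]
        silver[row[key]] = row
    elif op == "d":
        # Delete carries the removed row in "before".
        silver.pop(event["before"][key], None)
    else:
        raise ValueError(f"unsupported operation: {op!r}")
    return silver


# Bronze: raw change events in arrival order (hypothetical voter rows).
bronze = [
    {"op": "c", "before": None, "after": {"id": 1, "name": "Ann", "party": "A"}},
    {"op": "c", "before": None, "after": {"id": 2, "name": "Bob", "party": "B"}},
    {"op": "u", "before": {"id": 1, "name": "Ann", "party": "A"},
     "after": {"id": 1, "name": "Ann", "party": "B"}},
    {"op": "d", "before": {"id": 2, "name": "Bob", "party": "B"}, "after": None},
]

# Silver: latest state per key after replaying every event.
silver = {}
for event in bronze:
    apply_change(silver, event)

# Gold: an aggregation over the Silver table, here voters per party.
gold = Counter(row["party"] for row in silver.values())
```

Replaying the events in order matters: applying the update for voter 1 before its create would leave a stale row in the Silver table.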

Components

compose: Docker Compose configuration that deploys containers with the Debezium stack (Kafka, ZooKeeper and Kafka Connect), reads changes from the source databases and streams them to S3
voter-processing: Notebook with PySpark code that transforms Debezium messages into INSERT, UPDATE and DELETE operations
fake_it: For the end-to-end example, a simulator that feeds live input into the database of a voter registry application
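As a rough idea of what the voter-processing step could look like with the Delta Lake Python API, here is a sketch of a MERGE that applies a batch of changes to the Silver table. It is not taken from the notebook. The function names, the `op` column, and the assumption that `changes_df` is already flattened to one row per key are all illustrative.

```python
def merge_condition(key_columns, target="t", source="s"):
    """Build the MERGE join predicate, e.g. 't.id = s.id'."""
    return " AND ".join(f"{target}.{c} = {source}.{c}" for c in key_columns)


def upsert_changes(spark, changes_df, silver_path, key_columns, value_columns):
    """Apply one batch of change rows to the Silver Delta table.

    Assumptions (hypothetical, not from the notebook):
    - changes_df holds key_columns + value_columns plus an 'op' column
      with the Debezium operation code ('c', 'u', 'r' or 'd');
    - changes_df contains at most one row per key.
    """
    # Imported lazily so this module loads even without Spark installed.
    from delta.tables import DeltaTable

    all_columns = list(key_columns) + list(value_columns)
    assignments = {c: f"s.{c}" for c in all_columns}

    silver = DeltaTable.forPath(spark, silver_path)
    (
        silver.alias("t")
        .merge(changes_df.alias("s"), merge_condition(key_columns))
        # A delete event removes the matching row.
        .whenMatchedDelete(condition="s.op = 'd'")
        # Any other event on an existing key overwrites the row.
        .whenMatchedUpdate(condition="s.op != 'd'", set=assignments)
        # A non-delete event on an unknown key inserts a new row.
        .whenNotMatchedInsert(condition="s.op != 'd'", values=assignments)
        .execute()
    )
```

MERGE fails when several source rows match the same target row, so a real job would first reduce each batch to the latest change per key.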

Instructions

Start Docker Compose

export DEBEZIUM_VERSION=1.0
cd compose
docker-compose up -d

Configure the Debezium connector

curl -i -X POST -H "Accept:application/json" -H "Content-Type:application/json" http://localhost:8084/connectors/ -d @debezium/config.json
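For scripted setups, the same registration call can be made from Python with only the standard library. This is a sketch mirroring the curl command above. The function names are made up for this example, and the config is passed in as a dictionary loaded from `debezium/config.json`, whose contents are not shown here.

```python
import json
import urllib.request


def build_connector_request(config, base_url="http://localhost:8084"):
    """Build the POST request that registers a connector with Kafka Connect."""
    body = json.dumps(config).encode("utf-8")
    return urllib.request.Request(
        url=f"{base_url}/connectors/",
        data=body,
        method="POST",
        headers={
            "Accept": "application/json",
            "Content-Type": "application/json",
        },
    )


def register_connector(config, base_url="http://localhost:8084"):
    """Send the request and return (HTTP status, decoded JSON response)."""
    request = build_connector_request(config, base_url)
    with urllib.request.urlopen(request) as response:
        return response.status, json.load(response)
```

A typical call would load the file with `json.load(open("debezium/config.json"))` and pass the result to `register_connector`, which needs the Docker Compose stack to be running.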

Run the Spark notebook

Import the notebook file at \voter-processing\voter-processing.html into a Databricks Community account and follow the instructions inside the notebook.

https://community.cloud.databricks.com/

TODO - To complete the end-to-end example flow

Convert voter-processing from a notebook to a PySpark application
Add the PySpark application to the Docker Compose configuration
Change the configuration so that Kafka writes to the local file system instead of S3
Change the Spark application so that it reads Kafka's output instead of generating its own mock data

What's Next?

Make it a configurable generic tool that can be assembled on top of any supported database
