WORK-IN-PROGRESS
delta-architecture
Streaming data changes to a Data Lake with Debezium and Delta Lake pipeline https://medium.com/@yinondn/streaming-data-changes-to-a-data-lake-with-debezium-and-delta-lake-pipeline-299821053dc3
This is an example end-to-end project that demonstrates the Debezium-Delta Lake combo pipeline
See the Medium post for more details
High Level Strategy Overview
- Debezium reads the database logs, produces JSON messages that describe the changes, and streams them to Kafka
- Kafka streams the messages and stores them in an S3 folder. We call it the Bronze table because it stores the raw messages
- Using Spark with Delta Lake, we transform the messages into INSERT, UPDATE and DELETE operations and run them on the target data lake table. This table holds the latest state of all source databases. We call it the Silver table
- Next, we can perform further aggregations on the Silver table for analytics. We call the result the Gold table
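The Bronze, Silver and Gold steps above can be sketched in plain Python. This is only an illustration of the idea, not the notebook's code: a dict stands in for the Delta Lake table, and the voter fields (`id`, `name`, `party`) and the `apply_change` helper are hypothetical. It relies on the standard Debezium change event fields `op`, `before` and `after`.

```python
import json
from collections import Counter


def apply_change(silver, message, key="id"):
    """Apply one Debezium change event to an in-memory 'Silver' table.

    `silver` is a dict keyed by primary key; it stands in for the Delta table.
    `key` is a hypothetical primary-key column name.
    """
    if message is None:
        return silver  # Kafka tombstone (null value) that follows a delete
    event = json.loads(message)
    # With schemas enabled, the JSON converter wraps the event in "payload".
    event = event.get("payload", event)
    op = event["op"]
    if op in ("c", "r", "u"):
        # create, snapshot read, update: upsert the new row image
        row = event["after"]
        silver[row[key]] = row
    elif op == "d":
        # delete: remove the row identified by the old row image
        silver.pop(event["before"][key], None)
    else:
        raise ValueError(f"unexpected op: {op!r}")
    return silver


# Bronze: raw messages in arrival order (made-up voter rows).
bronze = [
    json.dumps({"op": "c", "before": None,
                "after": {"id": 1, "name": "Ann", "party": "A"}}),
    json.dumps({"op": "c", "before": None,
                "after": {"id": 2, "name": "Bob", "party": "B"}}),
    json.dumps({"op": "u",
                "before": {"id": 1, "name": "Ann", "party": "A"},
                "after": {"id": 1, "name": "Ann", "party": "B"}}),
    json.dumps({"op": "d",
                "before": {"id": 2, "name": "Bob", "party": "B"},
                "after": None}),
    None,  # tombstone
]

# Silver: latest state of the source table.
silver = {}
for msg in bronze:
    apply_change(silver, msg)

# Gold: an aggregation over the Silver table (voters per party).
gold = Counter(row["party"] for row in silver.values())
```

In the real pipeline the upsert and delete branches would be expressed against the Delta table with Spark rather than a Python dict; the branching on `op` is the part this sketch is meant to show.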
Components
- compose: Docker-Compose configuration that deploys containers with the Debezium stack (Kafka, ZooKeeper and Kafka Connect), reads changes from the source databases and streams them to S3
- voter-processing: Notebook with PySpark code that transforms Debezium messages into INSERT, UPDATE and DELETE operations
- fake_it: A simulator of a voters' book application's database with live input, used for the end-to-end example
Instructions
Start up docker compose
- export DEBEZIUM_VERSION=1.0
- cd compose
- docker-compose up -d
Configure the Debezium connector
curl -i -X POST -H "Accept:application/json" -H "Content-Type:application/json" http://localhost:8084/connectors/ -d @debezium/config.json
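For scripted setups, the same registration call can be made from Python's standard library. This is a minimal sketch, assuming the Kafka Connect REST endpoint is reachable at the address used in the curl command; `build_connector_request` and the contents of `example_config` are hypothetical placeholders, not the project's actual `debezium/config.json`.

```python
import json
import urllib.request


def build_connector_request(config, base_url="http://localhost:8084"):
    """Build (but do not send) the POST request that registers a connector."""
    body = json.dumps(config).encode("utf-8")
    return urllib.request.Request(
        f"{base_url}/connectors/",
        data=body,
        headers={
            "Accept": "application/json",
            "Content-Type": "application/json",
        },
        method="POST",
    )


# Placeholder config for illustration only; in practice load the real file:
#   with open("debezium/config.json") as f:
#       config = json.load(f)
example_config = {"name": "example-connector", "config": {}}

request = build_connector_request(example_config)

# Sending is left commented out so the sketch runs without a live stack:
# with urllib.request.urlopen(request) as response:
#     print(response.status, response.read().decode("utf-8"))
```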
Run spark notebook
Import the notebook file at \voter-processing\voter-processing.html into a Databricks Community account and follow the instructions inside the notebook
https://community.cloud.databricks.com/
TODO - To complete the end-to-end example flow
- Convert voter-processing from a notebook to a PySpark application
- Add the PySpark application to the Docker-Compose configuration
- Change the configuration so that Kafka writes to the local file system instead of S3
- Change the Spark application so that it reads Kafka's output instead of generating its own mock data
What's Next?
Make it a configurable generic tool that can be assembled on top of any supported database