Streaming Data Changes to a Data Lake with Debezium and Delta Lake Pipeline

栏目: IT技术 · 发布时间: 4年前

内容简介:WORK-IN-PROGRESSStreaming data changes to a Data Lake with Debezium and Delta Lake pipelineThis is an example end-to-end project that demonstrates the Debezium-Delta Lake combo pipeline

WORK-IN-PROGRESS

delta-architecture

Streaming data changes to a Data Lake with Debezium and Delta Lake pipeline https://medium.com/@yinondn/streaming-data-changes-to-a-data-lake-with-debezium-and-delta-lake-pipeline-299821053dc3

This is an example end-to-end project that demonstrates the Debezium-Delta Lake combo pipeline

See medium post for more details

High Level Strategy Overview

  • Debezium reads database logs, produces json messages that describe the changes and streams them to Kafka
  • Kafka streams the messages and stores them in a S3 folder. We call it Bronze table as it stores raw messages
  • Using Spark with Delta Lake we transform the messages to INSERT, UPDATE and DELETE operations, and run them on the target data lake table. This is the table that holds the latest state of all source databases. We call it Silver table
  • Next we can perform further aggregations on the Silver table for analytics. We call it Gold table

Components

  • compose: Docker-Compose configuration that deploys containers with Debezium stack (Kafka, Zookeepr and Kafka-Connect), reads changes from the source databases and streams them to S3
  • voter-processing: Notebook with PySpark code that transforms Debezium messages to INSERT, UPDATE and DELETE operations
  • fake_it: For an end-to-end example, a simulator of a voters book application's database with live input

Instructions

Start up docker compose

  • export DEBEZIUM_VERSION=1.0
  • cd compose
  • docker-compose up -d

Config Debezium connector

curl -i -X POST -H "Accept:application/json" -H "Content-Type:application/json" http://localhost:8084/connectors/ -d @debezium/config.json

Run spark notebook

Import the notebook file in \voter-processing\voter-processing.html to a Databricks Community account and follow the instructions inside the notebook

https://community.cloud.databricks.com/

TODO - To complete the end-to-end example flow

  • Change the voter-processing from notebook to PySpark application
  • Add the PySpark application to the Docker-Compose
  • Change the configurations so that Kafka writes to local file system instead of S3
  • Change the Spark application so that it read Kafka's output instead of generating it's own mock data

What's Next?

Make it a configurable generic tool that can be assembled on top of any supported database


以上就是本文的全部内容,希望对大家的学习有所帮助,也希望大家多多支持 码农网

查看所有标签

猜你喜欢:

本站部分资源来源于网络,本站转载出于传递更多信息之目的,版权归原作者或者来源机构所有,如转载稿涉及版权问题,请联系我们

Java从入门到精通

Java从入门到精通

李钟尉、马文强、陈丹丹 / 清华大学出版社 / 2008-9 / 59.80元

《Java从入门到精通》(软件开发视频大讲堂)从初学者角度出发,通过通俗易懂的语言、丰富多彩的实例,详细介绍了使用Java语言进行程序开发应该掌握的各方面技术。全书共分28章,包括:初识Java,熟悉Eclipse开发工具,Java语言基础,流程控制,字符串,数组,类和对象,包装类,数字处理类,接口、继承与多态,类的高级特性,异常处理,Swing程序设计,集合类,I/O输入输出,反射,枚举类型与泛......一起来看看 《Java从入门到精通》 这本书的介绍吧!

JSON 在线解析
JSON 在线解析

在线 JSON 格式化工具

XML、JSON 在线转换
XML、JSON 在线转换

在线XML、JSON转换工具

HEX CMYK 转换工具
HEX CMYK 转换工具

HEX CMYK 互转工具