Building a Simple ETL Pipeline with Python and Google Cloud Platform

栏目: IT技术 · 发布时间: 4年前

内容简介:At the end of this article, you will be able to extract a file from an FTP server and load it into a data-warehouse using Python in Google Cloud Functions.In this article, we will look do the following:Extract, Transform, Load (ETL) is the general procedur

Extracting data from an FTP server using Google Cloud Functions

Apr 22 ·5min read

Building a Simple ETL Pipeline with Python and Google Cloud Platform

GCP

At the end of this article, you will be able to extract a file from an FTP server and load it into a data-warehouse using Python in Google Cloud Functions.

In this article, we will look do the following:

  • Set up a Cloud Function
  • Extract data
  • Transform data
  • Load data
  • Automate our pipeline

Firstly, what is ETL?

Extract, Transform, Load (ETL) is the general procedure of copying data from one or more sources into a destination system which represents the data differently from the source or in a different context than the source.

There are a lot of ETL tools out there and sometimes they can be overwhelming at times, especially when you simply want to copy a file from point A to B. So today, I am going to show you how to extract a CSV file from an FTP server (Extract), modify it (Transform) and automatically load it into a BigQuery table (Load) using python 3.6 and Google Cloud Functions.

Google Cloud Functions: Cloud Functions (CF) is Google Cloud’s Serverless platform set to execute scripts responding to specified events, such as a HTTP request or a database update. An alternative to CF is AWS Lambda or Azure Functions .

Scenario

Let me give you some background info before we begin. A 3rd-party service provider had an FTP server which housed several CSV files, each containing the daily transaction data of a logistic company. We needed that data in our data warehouse so we could share it internally with stakeholders and to monitor performance.

The infrastructure team at this logistics company exports one CSV file daily from their database, and uploads it to an FTP server.

Our job was to copy those files daily from the server, clean it and then load it into our data warehouse so we could connect it to other data sources and run analytics on them.

This is what our ETL pipeline diagram will look like in the end:

Building a Simple ETL Pipeline with Python and Google Cloud Platform

Created with Lucidchart

Setting up your Cloud Function

  • Open the Functions Overview page in the Cloud Console:
    Go to the Cloud Functions Overview page
    Make sure that the project for which you enabled Cloud Functions is selected.
  • Click Create function.
  • Name your function.
  • In the Trigger field, select HTTP Trigger.
  • In the Source code field, select Inline editor. In this exercise, you will use the code we are going to work on together so you can delete the default code in the editor.
  • Use the Runtime dropdown to select a runtime.

Make sure your runtime is set to “Python 3.7” and under “Advanced options” change the region to one closest to you. As at the writing of this post, CF isn’t available in every Google data-centre region, so check here to see where Cloud Functions is enabled.

When you complete these steps, your display should look like this:

Building a Simple ETL Pipeline with Python and Google Cloud Platform

screenshot from GCP console

Our custom code

A Cloud Function has two files; a main.py and a requirements.txt file. The latter host all the file dependencies we need for our script to work, so click on the requirements.txt tab and make sure you have them included in the editor like so:

A quick summary of all dependencies:

  • google-cloud-bigquery: this library allows us to access and interact with BigQuery
  • python-csv: this library is used to manipulate CSV files with Python
  • requests: is a HTTP library used to send HTTP requests, which we will need to access the FTP URL.
  • wget: used to download files from the internet
  • pytest-shutil: this is used for SSH access

Extract

Now in the main.py tab, you can start including the code below. Looking at line 1 to 4. We created a function called “ftp_function”, which we will refer to in the future when accessing the Cloud Function with a HTTP request. We then login to the FTP server with the necessary credentials, and navigate to the appropriate directory where the file is stored on the server.

Please note, the FTP server I was working on, had multiple CSVs representing transaction data for different days. So to get the most recent file in the directory, I used the code starting at line 7 to line 9.

The rest of the code grabs the file and downloads it.

Transform

To “transform” the data, we are going to make a simple change to the CSV file we just downloaded. We will simply change every occurrence of “FBA” to “AMAZON” in the CSV file. Here is the code below.

Note here too, Google Cloud Function has a /tmp directory where you can temporarily store files. Files in this directory are stored in the instance’s RAM, therefore writing to /tmp takes up system memory. Once the instance stops running, all temporary files in the directory are deleted.

Load

Now, make sure you create your Bigquery table . And then we simply use this code below to load the transformed CSV file into the Bigquery table you have created. Since we are enabling “ auto-detection ” , the Bigquery table doesn’t have to have a schema when creating it as it will be inferred based on the data in the CSV file.

When you finish writing the script you can deploy the Cloud Function by clicking “Create". When the function has deployed successfully, you should see a green check mark.

Building a Simple ETL Pipeline with Python and Google Cloud Platform

source

Automate

Now all that is left to do is to create a CRON job (Cloud Scheduler) that will automate the call to the CF at some predefined interval. The good news is, creating cron jobs on Google Cloud Platform is very straightforward and really the simplest part. You can follow the instructions here .

source

Now that our data is sitting snug in the data warehouse, we can connect it to any visualisation tool and perform analysis on it.


以上就是本文的全部内容,希望本文的内容对大家的学习或者工作能带来一定的帮助,也希望大家多多支持 码农网

查看所有标签

猜你喜欢:

本站部分资源来源于网络,本站转载出于传递更多信息之目的,版权归原作者或者来源机构所有,如转载稿涉及版权问题,请联系我们

Spark大数据分析技术与实战

Spark大数据分析技术与实战

董轶群、曹正凤、赵仁乾、王安 / 电子工业出版社 / 2017-7 / 59.00

Spark作为下一代大数据处理引擎,经过短短几年的飞跃式发展,正在以燎原之势席卷业界,现已成为大数据产业中的一股中坚力量。 《Spark大数据分析技术与实战》着重讲解了Spark内核、Spark GraphX、Spark SQL、Spark Streaming和Spark MLlib的核心概念与理论框架,并提供了相应的示例与解析。 《Spark大数据分析技术与实战》共分为8章,其中前4......一起来看看 《Spark大数据分析技术与实战》 这本书的介绍吧!

随机密码生成器
随机密码生成器

多种字符组合密码

Base64 编码/解码
Base64 编码/解码

Base64 编码/解码

HEX CMYK 转换工具
HEX CMYK 转换工具

HEX CMYK 互转工具