Building a Simple ETL Pipeline with Python and Google Cloud Platform


Extracting data from an FTP server using Google Cloud Functions

Apr 22 · 5 min read


At the end of this article, you will be able to extract a file from an FTP server and load it into a data-warehouse using Python in Google Cloud Functions.

In this article, we will do the following:

  • Set up a Cloud Function
  • Extract data
  • Transform data
  • Load data
  • Automate our pipeline

Firstly, what is ETL?

Extract, Transform, Load (ETL) is the general procedure of copying data from one or more sources into a destination system which represents the data differently from the source or in a different context than the source.

There are a lot of ETL tools out there, and they can be overwhelming at times, especially when you simply want to copy a file from point A to B. So today, I am going to show you how to extract a CSV file from an FTP server (Extract), modify it (Transform) and automatically load it into a BigQuery table (Load) using Python 3.7 and Google Cloud Functions.

Google Cloud Functions: Cloud Functions (CF) is Google Cloud’s serverless platform for executing scripts in response to specified events, such as an HTTP request or a database update. Alternatives to CF include AWS Lambda and Azure Functions.

Scenario

Let me give you some background info before we begin. A 3rd-party service provider had an FTP server which housed several CSV files, each containing the daily transaction data of a logistics company. We needed that data in our data warehouse so we could share it internally with stakeholders and monitor performance.

The infrastructure team at this logistics company exports one CSV file daily from their database, and uploads it to an FTP server.

Our job was to copy those files from the server daily, clean them and then load them into our data warehouse so we could connect them to other data sources and run analytics on them.

This is what our ETL pipeline diagram will look like in the end:

[Diagram: ETL pipeline overview, created with Lucidchart]

Setting up your Cloud Function

  • Open the Functions Overview page in the Cloud Console:
    Go to the Cloud Functions Overview page
    Make sure that the project for which you enabled Cloud Functions is selected.
  • Click Create function.
  • Name your function.
  • In the Trigger field, select HTTP Trigger.
  • In the Source code field, select Inline editor. In this exercise, we will write the code together, so you can delete the default code in the editor.
  • Use the Runtime dropdown to select a runtime.

Make sure your runtime is set to “Python 3.7” and under “Advanced options” change the region to the one closest to you. As of the writing of this post, CF isn’t available in every Google data-centre region, so check here to see where Cloud Functions is enabled.

When you complete these steps, your display should look like this:

[Screenshot from the GCP console]

Our custom code

A Cloud Function has two files: a main.py file and a requirements.txt file. The latter hosts all the dependencies our script needs to work, so click on the requirements.txt tab and make sure you have them included in the editor like so:
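
The original requirements snippet is not embedded in this copy of the article; a minimal, unpinned requirements.txt covering the packages summarised below might look like this (pin versions as appropriate for your runtime):

    google-cloud-bigquery
    python-csv
    requests
    wget
    pytest-shutil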

A quick summary of all dependencies:

  • google-cloud-bigquery: this library allows us to access and interact with BigQuery
  • python-csv: this library is used to manipulate CSV files with Python
  • requests: an HTTP library used to send HTTP requests, which we will need to access the FTP URL.
  • wget: used to download files from the internet
  • pytest-shutil: this is used for SSH access

Extract

Now in the main.py tab, you can start including the code below. Looking at lines 1 to 4, we create a function called “ftp_function”, which we will refer to later when calling the Cloud Function with an HTTP request. We then log in to the FTP server with the necessary credentials and navigate to the directory on the server where the file is stored.

Please note, the FTP server I was working with had multiple CSVs representing transaction data for different days. So to get the most recent file in the directory, I used the code from line 7 to line 9.

The rest of the code grabs the file and downloads it.
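
The author’s original gist is not embedded in this copy of the article (the line numbers above refer to it), so here is a minimal sketch of what the extract step could look like using Python’s built-in ftplib; the host, credentials, directory and the “newest file sorts last by name” assumption are all placeholders.

    import os
    from ftplib import FTP

    def ftp_function(request):
        """HTTP-triggered Cloud Function entry point (sketch with placeholder values)."""
        # Placeholder FTP host, credentials and directory
        ftp = FTP("ftp.example.com")
        ftp.login("username", "password")
        ftp.cwd("/daily_exports")

        # Get the most recent file, assuming filenames sort chronologically
        # (e.g. transactions_2020-04-22.csv)
        filenames = ftp.nlst()
        latest = sorted(filenames)[-1]

        # Download it into the function's writable /tmp directory
        local_path = os.path.join("/tmp", latest)
        with open(local_path, "wb") as f:
            ftp.retrbinary("RETR " + latest, f.write)
        ftp.quit()

        return "Downloaded {}".format(local_path)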

Transform

To “transform” the data, we are going to make a simple change to the CSV file we just downloaded. We will simply change every occurrence of “FBA” to “AMAZON” in the CSV file. Here is the code below.
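
The embedded snippet is missing here as well; a minimal sketch of that transform step, assuming the extract step saved the file somewhere under /tmp (the example path is a placeholder), could be:

    def transform_csv(local_path):
        """Replace every occurrence of 'FBA' with 'AMAZON' in the downloaded CSV."""
        with open(local_path, "r") as f:
            contents = f.read()

        contents = contents.replace("FBA", "AMAZON")

        with open(local_path, "w") as f:
            f.write(contents)

    # Example: transform_csv("/tmp/transactions_2020-04-22.csv")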

Note, too, that Google Cloud Functions has a /tmp directory where you can temporarily store files. Files in this directory are stored in the instance’s RAM, so writing to /tmp takes up system memory. Once the instance stops running, all temporary files in the directory are deleted.

Load

Now, make sure you create your BigQuery table. Then we simply use the code below to load the transformed CSV file into the BigQuery table you have created. Since we are enabling “auto-detection”, the BigQuery table doesn’t need a schema when you create it, as the schema will be inferred from the data in the CSV file.
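
The original load snippet is not embedded here either; a sketch of a CSV load with schema auto-detection via the google-cloud-bigquery client might look like the following (the project, dataset and table names are placeholders):

    from google.cloud import bigquery

    def load_to_bigquery(local_path):
        """Load the transformed CSV into a BigQuery table (placeholder table id)."""
        client = bigquery.Client()
        table_id = "my-project.my_dataset.daily_transactions"

        job_config = bigquery.LoadJobConfig(
            source_format=bigquery.SourceFormat.CSV,
            skip_leading_rows=1,          # skip the CSV header row
            autodetect=True,              # let BigQuery infer the schema
            write_disposition=bigquery.WriteDisposition.WRITE_APPEND,
        )

        with open(local_path, "rb") as f:
            job = client.load_table_from_file(f, table_id, job_config=job_config)
        job.result()  # wait for the load job to finish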

When you finish writing the script, you can deploy the Cloud Function by clicking “Create”. When the function has deployed successfully, you should see a green check mark.


Automate

Now all that is left to do is to create a cron job (with Cloud Scheduler) that will call the CF automatically at a predefined interval. The good news is that creating cron jobs on Google Cloud Platform is very straightforward and really the simplest part. You can follow the instructions here.
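
For example, with the gcloud CLI you could create such a job roughly like this (the job name, schedule, region and project are placeholders, and the URL format assumes a 1st-gen HTTP-triggered function):

    gcloud scheduler jobs create http daily-ftp-etl \
        --schedule="0 6 * * *" \
        --uri="https://REGION-PROJECT_ID.cloudfunctions.net/ftp_function" \
        --http-method=GET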


Now that our data is sitting snug in the data warehouse, we can connect it to any visualisation tool and perform analysis on it.

