Building a Simple ETL Pipeline with Python and Google Cloud Platform

栏目: IT技术 · 发布时间: 4年前

内容简介：At the end of this article, you will be able to extract a file from an FTP server and load it into a data-warehouse using Python in Google Cloud Functions.In this article, we will look do the following:Extract, Transform, Load (ETL) is the general procedur

Extracting data from an FTP server using Google Cloud Functions

Tobi Sam

Apr 22 ·5min read

Building a Simple ETL Pipeline with Python and Google Cloud Platform — GCP

At the end of this article, you will be able to extract a file from an FTP server and load it into a data-warehouse using Python in Google Cloud Functions.

In this article, we will look do the following:

Set up a Cloud Function
Extract data
Transform data
Load data
Automate our pipeline

Firstly, what is ETL?

Extract, Transform, Load (ETL) is the general procedure of copying data from one or more sources into a destination system which represents the data differently from the source or in a different context than the source.

There are a lot of ETL tools out there and sometimes they can be overwhelming at times, especially when you simply want to copy a file from point A to B. So today, I am going to show you how to extract a CSV file from an FTP server (Extract), modify it (Transform) and automatically load it into a BigQuery table (Load) using python 3.6 and Google Cloud Functions.

Google Cloud Functions: Cloud Functions (CF) is Google Cloud’s Serverless platform set to execute scripts responding to specified events, such as a HTTP request or a database update. An alternative to CF is AWS Lambda or Azure Functions .

Scenario

Let me give you some background info before we begin. A 3rd-party service provider had an FTP server which housed several CSV files, each containing the daily transaction data of a logistic company. We needed that data in our data warehouse so we could share it internally with stakeholders and to monitor performance.

The infrastructure team at this logistics company exports one CSV file daily from their database, and uploads it to an FTP server.

Our job was to copy those files daily from the server, clean it and then load it into our data warehouse so we could connect it to other data sources and run analytics on them.

This is what our ETL pipeline diagram will look like in the end:

Setting up your Cloud Function

Open the Functions Overview page in the Cloud Console:
Go to the Cloud Functions Overview page
Make sure that the project for which you enabled Cloud Functions is selected.
Click Create function.
Name your function.
In the Trigger field, select HTTP Trigger.
In the Source code field, select Inline editor. In this exercise, you will use the code we are going to work on together so you can delete the default code in the editor.
Use the Runtime dropdown to select a runtime.

Make sure your runtime is set to “Python 3.7” and under “Advanced options” change the region to one closest to you. As at the writing of this post, CF isn’t available in every Google data-centre region, so check here to see where Cloud Functions is enabled.

When you complete these steps, your display should look like this:

Our custom code

A Cloud Function has two files; a main.py and a requirements.txt file. The latter host all the file dependencies we need for our script to work, so click on the requirements.txt tab and make sure you have them included in the editor like so:

A quick summary of all dependencies:

google-cloud-bigquery: this library allows us to access and interact with BigQuery
python-csv: this library is used to manipulate CSV files with Python
requests: is a HTTP library used to send HTTP requests, which we will need to access the FTP URL.
wget: used to download files from the internet
pytest-shutil: this is used for SSH access

Extract

Now in the main.py tab, you can start including the code below. Looking at line 1 to 4. We created a function called “ftp_function”, which we will refer to in the future when accessing the Cloud Function with a HTTP request. We then login to the FTP server with the necessary credentials, and navigate to the appropriate directory where the file is stored on the server.

Please note, the FTP server I was working on, had multiple CSVs representing transaction data for different days. So to get the most recent file in the directory, I used the code starting at line 7 to line 9.

The rest of the code grabs the file and downloads it.

Transform

To “transform” the data, we are going to make a simple change to the CSV file we just downloaded. We will simply change every occurrence of “FBA” to “AMAZON” in the CSV file. Here is the code below.

Note here too, Google Cloud Function has a /tmp directory where you can temporarily store files. Files in this directory are stored in the instance’s RAM, therefore writing to /tmp takes up system memory. Once the instance stops running, all temporary files in the directory are deleted.

Load

Now, make sure you create your Bigquery table . And then we simply use this code below to load the transformed CSV file into the Bigquery table you have created. Since we are enabling “ auto-detection ” , the Bigquery table doesn’t have to have a schema when creating it as it will be inferred based on the data in the CSV file.

When you finish writing the script you can deploy the Cloud Function by clicking “Create". When the function has deployed successfully, you should see a green check mark.

Automate

Now all that is left to do is to create a CRON job (Cloud Scheduler) that will automate the call to the CF at some predefined interval. The good news is, creating cron jobs on Google Cloud Platform is very straightforward and really the simplest part. You can follow the instructions here .

source

Now that our data is sitting snug in the data warehouse, we can connect it to any visualisation tool and perform analysis on it.

以上就是本文的全部内容，希望本文的内容对大家的学习或者工作能带来一定的帮助，也希望大家多多支持码农网