内容简介:At the end of this article, you will be able to extract a file from an FTP server and load it into a data-warehouse using Python in Google Cloud Functions.In this article, we will look do the following:Extract, Transform, Load (ETL) is the general procedur
Extracting data from an FTP server using Google Cloud Functions
At the end of this article, you will be able to extract a file from an FTP server and load it into a data-warehouse using Python in Google Cloud Functions.
In this article, we will look do the following:
- Set up a Cloud Function
- Extract data
- Transform data
- Load data
- Automate our pipeline
Firstly, what is ETL?
Extract, Transform, Load (ETL) is the general procedure of copying data from one or more sources into a destination system which represents the data differently from the source or in a different context than the source.
There are a lot of ETL tools out there and sometimes they can be overwhelming at times, especially when you simply want to copy a file from point A to B. So today, I am going to show you how to extract a CSV file from an FTP server (Extract), modify it (Transform) and automatically load it into a BigQuery table (Load) using python 3.6 and Google Cloud Functions.
Google Cloud Functions: Cloud Functions (CF) is Google Cloud’s Serverless platform set to execute scripts responding to specified events, such as a HTTP request or a database update. An alternative to CF is AWS Lambda or Azure Functions .
Scenario
Let me give you some background info before we begin. A 3rd-party service provider had an FTP server which housed several CSV files, each containing the daily transaction data of a logistic company. We needed that data in our data warehouse so we could share it internally with stakeholders and to monitor performance.
The infrastructure team at this logistics company exports one CSV file daily from their database, and uploads it to an FTP server.
Our job was to copy those files daily from the server, clean it and then load it into our data warehouse so we could connect it to other data sources and run analytics on them.
This is what our ETL pipeline diagram will look like in the end:
Setting up your Cloud Function
- Open the Functions Overview page in the Cloud Console:
Go to the Cloud Functions Overview page
Make sure that the project for which you enabled Cloud Functions is selected. - Click Create function.
- Name your function.
- In the Trigger field, select HTTP Trigger.
- In the Source code field, select Inline editor. In this exercise, you will use the code we are going to work on together so you can delete the default code in the editor.
- Use the Runtime dropdown to select a runtime.
Make sure your runtime is set to “Python 3.7” and under “Advanced options” change the region to one closest to you. As at the writing of this post, CF isn’t available in every Google data-centre region, so check here to see where Cloud Functions is enabled.
When you complete these steps, your display should look like this:
Our custom code
A Cloud Function has two files; a main.py and a requirements.txt file. The latter host all the file dependencies we need for our script to work, so click on the requirements.txt tab and make sure you have them included in the editor like so:
A quick summary of all dependencies:
- google-cloud-bigquery: this library allows us to access and interact with BigQuery
- python-csv: this library is used to manipulate CSV files with Python
- requests: is a HTTP library used to send HTTP requests, which we will need to access the FTP URL.
- wget: used to download files from the internet
- pytest-shutil: this is used for SSH access
Extract
Now in the main.py tab, you can start including the code below. Looking at line 1 to 4. We created a function called “ftp_function”, which we will refer to in the future when accessing the Cloud Function with a HTTP request. We then login to the FTP server with the necessary credentials, and navigate to the appropriate directory where the file is stored on the server.
Please note, the FTP server I was working on, had multiple CSVs representing transaction data for different days. So to get the most recent file in the directory, I used the code starting at line 7 to line 9.
The rest of the code grabs the file and downloads it.
Transform
To “transform” the data, we are going to make a simple change to the CSV file we just downloaded. We will simply change every occurrence of “FBA” to “AMAZON” in the CSV file. Here is the code below.
Note here too, Google Cloud Function has a /tmp
directory where you can temporarily store files. Files in this directory are stored in the instance’s RAM, therefore writing to /tmp
takes up system memory. Once the instance stops running, all temporary files in the directory are deleted.
Load
Now, make sure you create your Bigquery table . And then we simply use this code below to load the transformed CSV file into the Bigquery table you have created. Since we are enabling “ auto-detection ” , the Bigquery table doesn’t have to have a schema when creating it as it will be inferred based on the data in the CSV file.
When you finish writing the script you can deploy the Cloud Function by clicking “Create". When the function has deployed successfully, you should see a green check mark.
Automate
Now all that is left to do is to create a CRON job (Cloud Scheduler) that will automate the call to the CF at some predefined interval. The good news is, creating cron jobs on Google Cloud Platform is very straightforward and really the simplest part. You can follow the instructions here .
Now that our data is sitting snug in the data warehouse, we can connect it to any visualisation tool and perform analysis on it.
以上就是本文的全部内容,希望本文的内容对大家的学习或者工作能带来一定的帮助,也希望大家多多支持 码农网
猜你喜欢:本站部分资源来源于网络,本站转载出于传递更多信息之目的,版权归原作者或者来源机构所有,如转载稿涉及版权问题,请联系我们。
Spark大数据分析技术与实战
董轶群、曹正凤、赵仁乾、王安 / 电子工业出版社 / 2017-7 / 59.00
Spark作为下一代大数据处理引擎,经过短短几年的飞跃式发展,正在以燎原之势席卷业界,现已成为大数据产业中的一股中坚力量。 《Spark大数据分析技术与实战》着重讲解了Spark内核、Spark GraphX、Spark SQL、Spark Streaming和Spark MLlib的核心概念与理论框架,并提供了相应的示例与解析。 《Spark大数据分析技术与实战》共分为8章,其中前4......一起来看看 《Spark大数据分析技术与实战》 这本书的介绍吧!