Contents

1. What is Sqoop?
2. Downloading the package and configuring environment variables
  2.1. Download Sqoop 1.4.7
  2.2. Set environment variables
  2.3. Prepare the required environment
3. Installing Sqoop 1.4.7
  3.1. Modify the Sqoop configuration files
    3.1.1. Modify sqoop-env.sh
    3.1.2. Modify configure-sqoop
  3.2. Check the Sqoop version
4. Running and testing Sqoop imports and exports
  4.1. Exporting data from Hive to MySQL with Sqoop
  4.2. Importing data from MySQL into Hive with Sqoop
  4.3. Importing data from MySQL into HBase with Sqoop
1. What is Sqoop?

Sqoop is a tool for transferring data between Hadoop and relational databases or mainframes. You can use Sqoop to import data from a relational database management system (RDBMS) such as MySQL or Oracle into the Hadoop Distributed File System (HDFS), transform the data with Hadoop MapReduce, and then export it back to the RDBMS.

Sqoop automates most of this process, relying on the database to describe the schema of the data being imported. Sqoop uses MapReduce to import and export the data, which provides parallel operation as well as fault tolerance.
2. Downloading the package and configuring environment variables

2.1. Download Sqoop 1.4.7

Download Sqoop with the following commands, extract it, and move it into the /home/work/_app/ directory:

[root@c0 _src]# pwd
/home/work/_src
[root@c0 _src]# wget http://mirrors.shu.edu.cn/apache/sqoop/1.4.7/sqoop-1.4.7.bin__hadoop-2.6.0.tar.gz
[root@c0 _src]# tar -xzvf sqoop-1.4.7.bin__hadoop-2.6.0.tar.gz
[root@c0 _src]# mv sqoop-1.4.7.bin__hadoop-2.6.0 /home/work/_app/
2.2. Set environment variables

Set the Sqoop environment variables on every machine by running the following commands:

echo "" >> /etc/bashrc
echo "# Sqoop 1.4.7" >> /etc/bashrc
echo "export SQOOP_HOME=/home/work/_app/sqoop-1.4.7.bin__hadoop-2.6.0" >> /etc/bashrc
echo "" >> /etc/bashrc
echo "# Path" >> /etc/bashrc
echo "export PATH=\$PATH:\$SQOOP_HOME/bin" >> /etc/bashrc
source /etc/bashrc
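Note that the commands above append to /etc/bashrc unconditionally, so running them a second time leaves duplicate entries. A guarded variant could skip the append when the Sqoop block is already present; here is a sketch that, to stay self-contained, writes to a demo file rather than the real /etc/bashrc:

```shell
# Demo: append the Sqoop variables only if they are not already present,
# so re-running the setup does not duplicate entries.
# BASHRC is a stand-in for the real /etc/bashrc.
BASHRC=/tmp/bashrc.demo
rm -f "$BASHRC"; touch "$BASHRC"   # start from a clean demo file

add_sqoop_env() {
  if ! grep -q 'SQOOP_HOME' "$BASHRC"; then
    {
      echo ""
      echo "# Sqoop 1.4.7"
      echo "export SQOOP_HOME=/home/work/_app/sqoop-1.4.7.bin__hadoop-2.6.0"
      echo "export PATH=\$PATH:\$SQOOP_HOME/bin"
    } >> "$BASHRC"
  fi
}

add_sqoop_env
add_sqoop_env   # second call is a no-op: the guard sees SQOOP_HOME already there
```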
2.3. Prepare the required environment

Installing and running Sqoop requires working Hive, MySQL, and Hadoop environments. For details, see the article: Hadoop 3.1.2(HA)+Zookeeper3.4.13+Hbase1.4.9(HA)+Hive2.3.4+Spark2.4.0(HA)高可用集群搭建
3. Installing Sqoop 1.4.7

3.1. Modify the Sqoop configuration files

3.1.1. Modify sqoop-env.sh

Create and save an empty /home/work/_app/sqoop-1.4.7.bin__hadoop-2.6.0/conf/sqoop-env.sh file. It can stay empty because the environment variables were already configured in the "Set environment variables" section above, and the Hive and Hadoop variables were configured in the article <Hadoop 3.1.2(HA)+Zookeeper3.4.13+Hbase1.4.9(HA)+Hive2.3.4+Spark2.4.0(HA)高可用集群搭建>:

[root@c0 ~]# echo "" > /home/work/_app/sqoop-1.4.7.bin__hadoop-2.6.0/conf/sqoop-env.sh
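An empty sqoop-env.sh works here only because the variables are set globally. As a hypothetical alternative, sqoop-env.sh could pin the dependency homes explicitly; in this sketch the Hadoop and Hive paths match the layout used in this article, while the HBase path is an assumption:

```shell
# Hypothetical explicit sqoop-env.sh; the Hadoop and Hive paths follow this
# article's layout, HBASE_HOME is an assumed location.
export HADOOP_COMMON_HOME=/home/work/_app/hadoop-3.1.2
export HADOOP_MAPRED_HOME=/home/work/_app/hadoop-3.1.2
export HIVE_HOME=/home/work/_app/hive-2.3.4
export HBASE_HOME=/home/work/_app/hbase-1.4.9   # assumed path
```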
Copy the hive-hcatalog-core-2.3.4.jar, mysql-connector-java-5.1.47-bin.jar, hive-common-2.3.4.jar, and libthrift-0.9.3.jar files from the /home/work/_app/hive-2.3.4/lib/ directory into the /home/work/_app/sqoop-1.4.7.bin__hadoop-2.6.0/lib directory:

[root@c0 ~]# cp /home/work/_app/hive-2.3.4/lib/hive-hcatalog-core-2.3.4.jar /home/work/_app/hive-2.3.4/lib/mysql-connector-java-5.1.47-bin.jar /home/work/_app/hive-2.3.4/lib/libthrift-0.9.3.jar /home/work/_app/hive-2.3.4/lib/hive-common-2.3.4.jar /home/work/_app/sqoop-1.4.7.bin__hadoop-2.6.0/lib/
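If one of those four jars is absent (for example after a Hive upgrade changes version numbers), the single cp command fails with little indication of which file is the problem. A per-jar loop makes that visible; this sketch uses demo directories standing in for the real paths so it can run anywhere:

```shell
# Demo: copy the four Hive jars one by one, reporting any that are missing.
# SRC and DST stand in for /home/work/_app/hive-2.3.4/lib and
# /home/work/_app/sqoop-1.4.7.bin__hadoop-2.6.0/lib.
SRC=/tmp/demo-hive-lib
DST=/tmp/demo-sqoop-lib
mkdir -p "$SRC" "$DST"
# create empty stand-ins for the real jars so the demo is self-contained
touch "$SRC/hive-hcatalog-core-2.3.4.jar" "$SRC/mysql-connector-java-5.1.47-bin.jar" \
      "$SRC/hive-common-2.3.4.jar" "$SRC/libthrift-0.9.3.jar"

for jar in hive-hcatalog-core-2.3.4.jar mysql-connector-java-5.1.47-bin.jar \
           hive-common-2.3.4.jar libthrift-0.9.3.jar; do
  if [ -f "$SRC/$jar" ]; then
    cp "$SRC/$jar" "$DST/"
  else
    echo "missing: $SRC/$jar" >&2
  fi
done
```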
3.1.2. Modify configure-sqoop

Edit the /home/work/_app/sqoop-1.4.7.bin__hadoop-2.6.0/bin/configure-sqoop file and save it with the following content (note that the HCatalog and Accumulo sections are commented out):
[root@c0 _src]# cat /home/work/_app/sqoop-1.4.7.bin__hadoop-2.6.0/bin/configure-sqoop
#!/bin/bash
#
# Copyright 2011 The Apache Software Foundation
#
# Licensed to the Apache Software Foundation (ASF) under one
# or more contributor license agreements.  See the NOTICE file
# distributed with this work for additional information
# regarding copyright ownership.  The ASF licenses this file
# to you under the Apache License, Version 2.0 (the
# "License"); you may not use this file except in compliance
# with the License.  You may obtain a copy of the License at
#
#     http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

# This is sourced in by bin/sqoop to set environment variables prior to
# invoking Hadoop.

bin="$1"

if [ -z "${bin}" ]; then
  bin=`dirname $0`
  bin=`cd ${bin} && pwd`
fi

if [ -z "$SQOOP_HOME" ]; then
  export SQOOP_HOME=${bin}/..
fi

SQOOP_CONF_DIR=${SQOOP_CONF_DIR:-${SQOOP_HOME}/conf}

if [ -f "${SQOOP_CONF_DIR}/sqoop-env.sh" ]; then
  . "${SQOOP_CONF_DIR}/sqoop-env.sh"
fi

# Find paths to our dependency systems. If they are unset, use CDH defaults.

if [ -z "${HADOOP_COMMON_HOME}" ]; then
  if [ -n "${HADOOP_HOME}" ]; then
    HADOOP_COMMON_HOME=${HADOOP_HOME}
  else
    if [ -d "/usr/lib/hadoop" ]; then
      HADOOP_COMMON_HOME=/usr/lib/hadoop
    else
      HADOOP_COMMON_HOME=${SQOOP_HOME}/../hadoop
    fi
  fi
fi

if [ -z "${HADOOP_MAPRED_HOME}" ]; then
  HADOOP_MAPRED_HOME=/usr/lib/hadoop-mapreduce
  if [ ! -d "${HADOOP_MAPRED_HOME}" ]; then
    if [ -n "${HADOOP_HOME}" ]; then
      HADOOP_MAPRED_HOME=${HADOOP_HOME}
    else
      HADOOP_MAPRED_HOME=${SQOOP_HOME}/../hadoop-mapreduce
    fi
  fi
fi

# We are setting HADOOP_HOME to HADOOP_COMMON_HOME if it is not set
# so that hcat script works correctly on BigTop
if [ -z "${HADOOP_HOME}" ]; then
  if [ -n "${HADOOP_COMMON_HOME}" ]; then
    HADOOP_HOME=${HADOOP_COMMON_HOME}
    export HADOOP_HOME
  fi
fi

if [ -z "${HBASE_HOME}" ]; then
  if [ -d "/usr/lib/hbase" ]; then
    HBASE_HOME=/usr/lib/hbase
  else
    HBASE_HOME=${SQOOP_HOME}/../hbase
  fi
fi

#if [ -z "${HCAT_HOME}" ]; then
#  if [ -d "/usr/lib/hive-hcatalog" ]; then
#    HCAT_HOME=/usr/lib/hive-hcatalog
#  elif [ -d "/usr/lib/hcatalog" ]; then
#    HCAT_HOME=/usr/lib/hcatalog
#  else
#    HCAT_HOME=${SQOOP_HOME}/../hive-hcatalog
#    if [ ! -d ${HCAT_HOME} ]; then
#      HCAT_HOME=${SQOOP_HOME}/../hcatalog
#    fi
#  fi
#fi

#if [ -z "${ACCUMULO_HOME}" ]; then
#  if [ -d "/usr/lib/accumulo" ]; then
#    ACCUMULO_HOME=/usr/lib/accumulo
#  else
#    ACCUMULO_HOME=${SQOOP_HOME}/../accumulo
#  fi
#fi

if [ -z "${ZOOKEEPER_HOME}" ]; then
  if [ -d "/usr/lib/zookeeper" ]; then
    ZOOKEEPER_HOME=/usr/lib/zookeeper
  else
    ZOOKEEPER_HOME=${SQOOP_HOME}/../zookeeper
  fi
fi

if [ -z "${HIVE_HOME}" ]; then
  if [ -d "/usr/lib/hive" ]; then
    export HIVE_HOME=/usr/lib/hive
  elif [ -d ${SQOOP_HOME}/../hive ]; then
    export HIVE_HOME=${SQOOP_HOME}/../hive
  fi
fi

# Check: If we can't find our dependencies, give up here.
if [ ! -d "${HADOOP_COMMON_HOME}" ]; then
  echo "Error: $HADOOP_COMMON_HOME does not exist!"
  echo 'Please set $HADOOP_COMMON_HOME to the root of your Hadoop installation.'
  exit 1
fi
if [ ! -d "${HADOOP_MAPRED_HOME}" ]; then
  echo "Error: $HADOOP_MAPRED_HOME does not exist!"
  echo 'Please set $HADOOP_MAPRED_HOME to the root of your Hadoop MapReduce installation.'
  exit 1
fi

## Moved to be a runtime check in sqoop.
if [ ! -d "${HBASE_HOME}" ]; then
  echo "Warning: $HBASE_HOME does not exist! HBase imports will fail."
  echo 'Please set $HBASE_HOME to the root of your HBase installation.'
fi

## Moved to be a runtime check in sqoop.
#if [ ! -d "${HCAT_HOME}" ]; then
#  echo "Warning: $HCAT_HOME does not exist! HCatalog jobs will fail."
#  echo 'Please set $HCAT_HOME to the root of your HCatalog installation.'
#fi

#if [ ! -d "${ACCUMULO_HOME}" ]; then
#  echo "Warning: $ACCUMULO_HOME does not exist! Accumulo imports will fail."
#  echo 'Please set $ACCUMULO_HOME to the root of your Accumulo installation.'
#fi
if [ ! -d "${ZOOKEEPER_HOME}" ]; then
  echo "Warning: $ZOOKEEPER_HOME does not exist! Accumulo imports will fail."
  echo 'Please set $ZOOKEEPER_HOME to the root of your Zookeeper installation.'
fi

# Where to find the main Sqoop jar
SQOOP_JAR_DIR=$SQOOP_HOME

# If there's a "build" subdir, override with this, so we use
# the newly-compiled copy.
if [ -d "$SQOOP_JAR_DIR/build" ]; then
  SQOOP_JAR_DIR="${SQOOP_JAR_DIR}/build"
fi

function add_to_classpath() {
  dir=$1
  for f in $dir/*.jar; do
    SQOOP_CLASSPATH=${SQOOP_CLASSPATH}:$f;
  done

  export SQOOP_CLASSPATH
}

# Add sqoop dependencies to classpath.
SQOOP_CLASSPATH=""
if [ -d "$SQOOP_HOME/lib" ]; then
  add_to_classpath $SQOOP_HOME/lib
fi

# Add HBase to dependency list
if [ -e "$HBASE_HOME/bin/hbase" ]; then
  TMP_SQOOP_CLASSPATH=${SQOOP_CLASSPATH}:`$HBASE_HOME/bin/hbase classpath`
  SQOOP_CLASSPATH=${TMP_SQOOP_CLASSPATH}
fi

# Add HCatalog to dependency list
if [ -e "${HCAT_HOME}/bin/hcat" ]; then
  TMP_SQOOP_CLASSPATH=${SQOOP_CLASSPATH}:`${HCAT_HOME}/bin/hcat -classpath`
  if [ -z "${HIVE_CONF_DIR}" ]; then
    TMP_SQOOP_CLASSPATH=${TMP_SQOOP_CLASSPATH}:${HIVE_CONF_DIR}
  fi
  SQOOP_CLASSPATH=${TMP_SQOOP_CLASSPATH}
fi

# Add Accumulo to dependency list
if [ -e "$ACCUMULO_HOME/bin/accumulo" ]; then
  for jn in `$ACCUMULO_HOME/bin/accumulo classpath | grep file:.*accumulo.*jar | cut -d':' -f2`; do
    SQOOP_CLASSPATH=$SQOOP_CLASSPATH:$jn
  done
  for jn in `$ACCUMULO_HOME/bin/accumulo classpath | grep file:.*zookeeper.*jar | cut -d':' -f2`; do
    SQOOP_CLASSPATH=$SQOOP_CLASSPATH:$jn
  done
fi

ZOOCFGDIR=${ZOOCFGDIR:-/etc/zookeeper}
if [ -d "${ZOOCFGDIR}" ]; then
  SQOOP_CLASSPATH=$ZOOCFGDIR:$SQOOP_CLASSPATH
fi

SQOOP_CLASSPATH=${SQOOP_CONF_DIR}:${SQOOP_CLASSPATH}

# If there's a build subdir, use Ivy-retrieved dependencies too.
if [ -d "$SQOOP_HOME/build/ivy/lib/sqoop" ]; then
  for f in $SQOOP_HOME/build/ivy/lib/sqoop/*/*.jar; do
    SQOOP_CLASSPATH=${SQOOP_CLASSPATH}:$f;
  done
fi
add_to_classpath ${SQOOP_JAR_DIR}

HADOOP_CLASSPATH="${SQOOP_CLASSPATH}:${HADOOP_CLASSPATH}"
if [ ! -z "$SQOOP_USER_CLASSPATH" ]; then
  # User has elements to prepend to the classpath, forcibly overriding
  # Sqoop's own lib directories.
  export HADOOP_CLASSPATH="${SQOOP_USER_CLASSPATH}:${HADOOP_CLASSPATH}"
fi

export SQOOP_CLASSPATH
export SQOOP_CONF_DIR
export SQOOP_JAR_DIR
export HADOOP_CLASSPATH
export HADOOP_COMMON_HOME
export HADOOP_MAPRED_HOME
export HBASE_HOME
export HCAT_HOME
export HIVE_CONF_DIR
export ACCUMULO_HOME
export ZOOKEEPER_HOME
3.2. Check the Sqoop version

[root@c0 _src]# sqoop version
2019-03-11 22:30:16,837 INFO sqoop.Sqoop: Running Sqoop version: 1.4.7
Sqoop 1.4.7
git commit id 2328971411f57f0cb683dfb79d19d4d19d185dd8
Compiled by maugli on Thu Dec 21 15:59:58 STD 2017
4. Running and testing Sqoop imports and exports

4.1. Exporting data from Hive to MySQL with Sqoop

Create the testmshk database in MySQL, grant privileges on it to the root user, and create the hive2mysql_mshk table:
[root@c0 _src]# mysql -uroot -p123456
mysql: [Warning] Using a password on the command line interface can be insecure.
Welcome to the MySQL monitor.  Commands end with ; or \g.
Your MySQL connection id is 21
Server version: 5.7.25 MySQL Community Server (GPL)

Copyright (c) 2000, 2019, Oracle and/or its affiliates. All rights reserved.

Oracle is a registered trademark of Oracle Corporation and/or its
affiliates. Other names may be trademarks of their respective owners.

Type 'help;' or '\h' for help. Type '\c' to clear the current input statement.

mysql> CREATE DATABASE testmshk DEFAULT CHARACTER SET utf8 COLLATE utf8_general_ci;
Query OK, 1 row affected (0.00 sec)

mysql> show databases;
+--------------------+
| Database           |
+--------------------+
| information_schema |
| hive               |
| mysql              |
| performance_schema |
| sys                |
| testmshk           |
+--------------------+
6 rows in set (0.02 sec)

mysql> grant select,insert,update,delete,create on testmshk.* to root;
Query OK, 0 rows affected (0.01 sec)

mysql> flush privileges;
Query OK, 0 rows affected (0.01 sec)

mysql> use testmshk;
Database changed
mysql> create table hive2mysql_mshk(id int,namea varchar(50),nameb varchar(50));
Query OK, 0 rows affected (0.02 sec)

mysql> quit;
Bye
Query the table in MySQL through Sqoop; at this point the table is empty:
[root@c0 ~]# sqoop eval --connect jdbc:mysql://c0:3306/testmshk?useSSL=false --username root --password 123456 --query "select * from hive2mysql_mshk"
2019-03-11 23:44:06,894 INFO sqoop.Sqoop: Running Sqoop version: 1.4.7
2019-03-11 23:44:06,945 WARN tool.BaseSqoopTool: Setting your password on the command-line is insecure. Consider using -P instead.
2019-03-11 23:44:07,100 INFO manager.MySQLManager: Preparing to use a MySQL streaming resultset.
-------------------------------------------------------------
| id | namea | nameb |
-------------------------------------------------------------
-------------------------------------------------------------
In the article <Hadoop 3.1.2(HA)+Zookeeper3.4.13+Hbase1.4.9(HA)+Hive2.3.4+Spark2.4.0(HA)高可用集群搭建>, we created the test data file /hive/warehouse/testtable/testdata001.dat while testing Hive. Now export that data to MySQL:
[root@c0 ~]# sqoop export --connect jdbc:mysql://c0:3306/testmshk?useSSL=false --username root --password 123456 --table hive2mysql_mshk --export-dir /hive/warehouse/testtable/testdata001.dat --input-fields-terminated-by ','
2019-03-11 23:47:10,400 INFO sqoop.Sqoop: Running Sqoop version: 1.4.7
2019-03-11 23:47:10,437 WARN tool.BaseSqoopTool: Setting your password on the command-line is insecure. Consider using -P instead.
2019-03-11 23:47:10,571 INFO manager.MySQLManager: Preparing to use a MySQL streaming resultset.
2019-03-11 23:47:10,574 INFO tool.CodeGenTool: Beginning code generation
2019-03-11 23:47:10,914 INFO manager.SqlManager: Executing SQL statement: SELECT t.* FROM `hive2mysql_mshk` AS t LIMIT 1
2019-03-11 23:47:10,943 INFO manager.SqlManager: Executing SQL statement: SELECT t.* FROM `hive2mysql_mshk` AS t LIMIT 1
2019-03-11 23:47:10,952 INFO orm.CompilationManager: HADOOP_MAPRED_HOME is /home/work/_app/hadoop-3.1.2
Note: /tmp/sqoop-root/compile/9ea7f54fe87f35ed071ed75c293f25d8/hive2mysql_mshk.java uses or overrides a deprecated API.
Note: Recompile with -Xlint:deprecation for details.
2019-03-11 23:47:12,652 INFO orm.CompilationManager: Writing jar file: /tmp/sqoop-root/compile/9ea7f54fe87f35ed071ed75c293f25d8/hive2mysql_mshk.jar
2019-03-11 23:47:12,669 INFO mapreduce.ExportJobBase: Beginning export of hive2mysql_mshk
2019-03-11 23:47:12,669 INFO Configuration.deprecation: mapred.job.tracker is deprecated. Instead, use mapreduce.jobtracker.address
2019-03-11 23:47:12,804 INFO Configuration.deprecation: mapred.jar is deprecated. Instead, use mapreduce.job.jar
2019-03-11 23:47:14,106 INFO Configuration.deprecation: mapred.reduce.tasks.speculative.execution is deprecated. Instead, use mapreduce.reduce.speculative
2019-03-11 23:47:14,112 INFO Configuration.deprecation: mapred.map.tasks.speculative.execution is deprecated. Instead, use mapreduce.map.speculative
2019-03-11 23:47:14,112 INFO Configuration.deprecation: mapred.map.tasks is deprecated. Instead, use mapreduce.job.maps
2019-03-11 23:47:14,479 INFO client.ConfiguredRMFailoverProxyProvider: Failing over to rm2
2019-03-11 23:47:14,808 INFO mapreduce.JobResourceUploader: Disabling Erasure Coding for path: /tmp/hadoop-yarn/staging/root/.staging/job_1552315366846_0003
2019-03-11 23:47:16,429 INFO input.FileInputFormat: Total input files to process : 1
2019-03-11 23:47:16,432 INFO input.FileInputFormat: Total input files to process : 1
2019-03-11 23:47:16,513 INFO mapreduce.JobSubmitter: number of splits:4
2019-03-11 23:47:16,577 INFO Configuration.deprecation: mapred.map.tasks.speculative.execution is deprecated. Instead, use mapreduce.map.speculative
2019-03-11 23:47:16,684 INFO mapreduce.JobSubmitter: Submitting tokens for job: job_1552315366846_0003
2019-03-11 23:47:16,686 INFO mapreduce.JobSubmitter: Executing with tokens: []
2019-03-11 23:47:16,924 INFO conf.Configuration: resource-types.xml not found
2019-03-11 23:47:16,924 INFO resource.ResourceUtils: Unable to find 'resource-types.xml'.
2019-03-11 23:47:17,213 INFO impl.YarnClientImpl: Submitted application application_1552315366846_0003
2019-03-11 23:47:17,261 INFO mapreduce.Job: The url to track the job: http://c1:8088/proxy/application_1552315366846_0003/
2019-03-11 23:47:17,262 INFO mapreduce.Job: Running job: job_1552315366846_0003
2019-03-11 23:47:23,359 INFO mapreduce.Job: Job job_1552315366846_0003 running in uber mode : false
2019-03-11 23:47:23,360 INFO mapreduce.Job:  map 0% reduce 0%
2019-03-11 23:47:31,454 INFO mapreduce.Job:  map 75% reduce 0%
2019-03-11 23:47:32,462 INFO mapreduce.Job:  map 100% reduce 0%
2019-03-11 23:47:32,473 INFO mapreduce.Job: Job job_1552315366846_0003 completed successfully
2019-03-11 23:47:32,619 INFO mapreduce.Job: Counters: 32
    File System Counters
        FILE: Number of bytes read=0
        FILE: Number of bytes written=913424
        FILE: Number of read operations=0
        FILE: Number of large read operations=0
        FILE: Number of write operations=0
        HDFS: Number of bytes read=782
        HDFS: Number of bytes written=0
        HDFS: Number of read operations=19
        HDFS: Number of large read operations=0
        HDFS: Number of write operations=0
    Job Counters
        Launched map tasks=4
        Data-local map tasks=4
        Total time spent by all maps in occupied slots (ms)=23446
        Total time spent by all reduces in occupied slots (ms)=0
        Total time spent by all map tasks (ms)=23446
        Total vcore-milliseconds taken by all map tasks=23446
        Total megabyte-milliseconds taken by all map tasks=24008704
    Map-Reduce Framework
        Map input records=2
        Map output records=2
        Input split bytes=636
        Spilled Records=0
        Failed Shuffles=0
        Merged Map outputs=0
        GC time elapsed (ms)=582
        CPU time spent (ms)=3960
        Physical memory (bytes) snapshot=830259200
        Virtual memory (bytes) snapshot=11165683712
        Total committed heap usage (bytes)=454557696
        Peak Map Physical memory (bytes)=208502784
        Peak Map Virtual memory (bytes)=2793611264
    File Input Format Counters
        Bytes Read=0
    File Output Format Counters
        Bytes Written=0
2019-03-11 23:47:32,626 INFO mapreduce.ExportJobBase: Transferred 782 bytes in 18.5015 seconds (42.2668 bytes/sec)
2019-03-11 23:47:32,629 INFO mapreduce.ExportJobBase: Exported 2 records.
--export-dir specifies the location of the Hive table's data file in HDFS.
--input-fields-terminated-by specifies the field delimiter used to parse the input.
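To see what --input-fields-terminated-by ',' actually parses, note that the file behind this export is plain comma-delimited text. A quick sketch that recreates the two rows used in this article in a local file and cuts them on ',' the same way:

```shell
# Recreate the contents of the Hive test data file locally and split each
# line into the id/namea/nameb columns on ','.
printf '10086,my.mshk.top,you.mshk.top\n12306,mname,yname\n' > /tmp/testdata001.dat
awk -F',' '{ print "id=" $1, "namea=" $2, "nameb=" $3 }' /tmp/testdata001.dat
# id=10086 namea=my.mshk.top nameb=you.mshk.top
# id=12306 namea=mname nameb=yname
```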
Query MySQL through Sqoop again; the data has been imported successfully:
[root@c0 ~]# sqoop eval --connect jdbc:mysql://c0:3306/testmshk?useSSL=false --username root --password 123456 --query "select * from hive2mysql_mshk"
2019-03-11 23:48:56,848 INFO sqoop.Sqoop: Running Sqoop version: 1.4.7
2019-03-11 23:48:56,884 WARN tool.BaseSqoopTool: Setting your password on the command-line is insecure. Consider using -P instead.
2019-03-11 23:48:57,024 INFO manager.MySQLManager: Preparing to use a MySQL streaming resultset.
-------------------------------------------------------------
| id | namea | nameb |
-------------------------------------------------------------
| 10086 | my.mshk.top | you.mshk.top |
| 12306 | mname | yname |
-------------------------------------------------------------
In MySQL, the hive2mysql_mshk table we created now contains 2 rows of data.
4.2. Importing data from MySQL into Hive with Sqoop

The hive2mysql_mshk table we just created has no primary key; we simply appended some records from Hive into MySQL.

By default, Sqoop identifies the table's primary key column (if there is one) and uses it as the split column. The low and high values of the split column are retrieved from the database, and the map tasks operate on evenly sized slices of the total range. If the actual primary key values are not uniformly distributed across their range, the tasks can end up unbalanced. In that case you should explicitly choose a different column with the --split-by argument, for example --split-by id.
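The arithmetic behind this splitting is visible in the job log later in this article (split size 555 for ids 10086 to 12306 across 4 map tasks). A rough sketch approximating what Sqoop's integer splitter computes:

```shell
# Roughly what Sqoop's integer splitter does with --split-by id:
# divide the [MIN(id), MAX(id)] range evenly among the map tasks.
MIN=10086; MAX=12306; SPLITS=4
SIZE=$(( (MAX - MIN) / SPLITS ))
echo "split size: $SIZE"    # 555, matching the job log in section 4.2
LO=$MIN
i=1
while [ "$i" -le "$SPLITS" ]; do
  HI=$(( LO + SIZE ))
  if [ "$i" -eq "$SPLITS" ]; then
    echo "map $i: id >= $LO AND id <= $MAX"
  else
    echo "map $i: id >= $LO AND id < $HI"
  fi
  LO=$HI
  i=$(( i + 1 ))
done
```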
The Sqoop command that imports the MySQL data into Hive takes a few more parameters:
sqoop import --connect jdbc:mysql://c0:3306/testmshk?useSSL=false --username root --password 123456 --split-by id --table hive2mysql_mshk --target-dir /hive/warehouse/mysql2hive_mshk --fields-terminated-by "," --hive-import --hive-table testmshk.mysql2hive_mshk
--split-by <column-name>: the column used to split the work among map tasks
--table: the MySQL table to import from
--target-dir <dir>: the HDFS directory in which to store the data
--hive-import: import the table into Hive
--hive-overwrite: overwrite any existing data in the Hive table
--hive-table <table-name>: the table name to use when importing into Hive
--fields-terminated-by <char>: the field delimiter
The Sqoop operation that follows runs as a MapReduce job:
[root@c0 _src]# sqoop import --connect jdbc:mysql://c0:3306/testmshk?useSSL=false --username root --password 123456 --split-by id --table hive2mysql_mshk --target-dir /hive/warehouse/mysql2hive_mshk --fields-terminated-by "," --hive-import --hive-table testmshk.mysql2hive_mshk --hive-overwrite
2019-03-12 20:21:05,060 INFO sqoop.Sqoop: Running Sqoop version: 1.4.7
2019-03-12 20:21:05,137 WARN tool.BaseSqoopTool: Setting your password on the command-line is insecure. Consider using -P instead.
2019-03-12 20:21:05,337 INFO manager.MySQLManager: Preparing to use a MySQL streaming resultset.
2019-03-12 20:21:05,348 INFO tool.CodeGenTool: Beginning code generation
2019-03-12 20:21:05,785 INFO manager.SqlManager: Executing SQL statement: SELECT t.* FROM `hive2mysql_mshk` AS t LIMIT 1
2019-03-12 20:21:05,821 INFO manager.SqlManager: Executing SQL statement: SELECT t.* FROM `hive2mysql_mshk` AS t LIMIT 1
2019-03-12 20:21:05,831 INFO orm.CompilationManager: HADOOP_MAPRED_HOME is /home/work/_app/hadoop-3.1.2
Note: /tmp/sqoop-root/compile/202a5bda3950a7ccc6782f3cfcc3a99d/hive2mysql_mshk.java uses or overrides a deprecated API.
Note: Recompile with -Xlint:deprecation for details.
2019-03-12 20:21:08,747 INFO orm.CompilationManager: Writing jar file: /tmp/sqoop-root/compile/202a5bda3950a7ccc6782f3cfcc3a99d/hive2mysql_mshk.jar
2019-03-12 20:21:08,761 WARN manager.MySQLManager: It looks like you are importing from mysql.
2019-03-12 20:21:08,761 WARN manager.MySQLManager: This transfer can be faster! Use the --direct
2019-03-12 20:21:08,761 WARN manager.MySQLManager: option to exercise a MySQL-specific fast path.
2019-03-12 20:21:08,762 INFO manager.MySQLManager: Setting zero DATETIME behavior to convertToNull (mysql)
2019-03-12 20:21:08,764 INFO mapreduce.ImportJobBase: Beginning import of hive2mysql_mshk
2019-03-12 20:21:08,765 INFO Configuration.deprecation: mapred.job.tracker is deprecated. Instead, use mapreduce.jobtracker.address
2019-03-12 20:21:08,928 INFO Configuration.deprecation: mapred.jar is deprecated. Instead, use mapreduce.job.jar
2019-03-12 20:21:09,656 INFO Configuration.deprecation: mapred.map.tasks is deprecated. Instead, use mapreduce.job.maps
2019-03-12 20:21:10,332 INFO client.ConfiguredRMFailoverProxyProvider: Failing over to rm2
2019-03-12 20:21:10,688 INFO mapreduce.JobResourceUploader: Disabling Erasure Coding for path: /tmp/hadoop-yarn/staging/root/.staging/job_1552315366846_0011
2019-03-12 20:21:12,618 INFO db.DBInputFormat: Using read commited transaction isolation
2019-03-12 20:21:12,619 INFO db.DataDrivenDBInputFormat: BoundingValsQuery: SELECT MIN(`id`), MAX(`id`) FROM `hive2mysql_mshk`
2019-03-12 20:21:12,622 INFO db.IntegerSplitter: Split size: 555; Num splits: 4 from: 10086 to: 12306
2019-03-12 20:21:12,696 INFO mapreduce.JobSubmitter: number of splits:4
2019-03-12 20:21:13,137 INFO mapreduce.JobSubmitter: Submitting tokens for job: job_1552315366846_0011
2019-03-12 20:21:13,140 INFO mapreduce.JobSubmitter: Executing with tokens: []
2019-03-12 20:21:13,443 INFO conf.Configuration: resource-types.xml not found
2019-03-12 20:21:13,443 INFO resource.ResourceUtils: Unable to find 'resource-types.xml'.
2019-03-12 20:21:13,533 INFO impl.YarnClientImpl: Submitted application application_1552315366846_0011
2019-03-12 20:21:13,593 INFO mapreduce.Job: The url to track the job: http://c1:8088/proxy/application_1552315366846_0011/
2019-03-12 20:21:13,594 INFO mapreduce.Job: Running job: job_1552315366846_0011
2019-03-12 20:21:20,705 INFO mapreduce.Job: Job job_1552315366846_0011 running in uber mode : false
2019-03-12 20:21:20,727 INFO mapreduce.Job:  map 0% reduce 0%
2019-03-12 20:21:29,927 INFO mapreduce.Job:  map 50% reduce 0%
2019-03-12 20:21:29,930 INFO mapreduce.Job: Task Id : attempt_1552315366846_0011_m_000000_0, Status : FAILED
[2019-03-12 20:21:28.236]Container [pid=19941,containerID=container_e15_1552315366846_0011_01_000002] is running 539445760B beyond the 'VIRTUAL' memory limit. Current usage: 199.9 MB of 1 GB physical memory used; 2.6 GB of 2.1 GB virtual memory used. Killing container.
Dump of the process-tree for container_e15_1552315366846_0011_01_000002 :
    |- PID PPID PGRPID SESSID CMD_NAME USER_MODE_TIME(MILLIS) SYSTEM_TIME(MILLIS) VMEM_USAGE(BYTES) RSSMEM_USAGE(PAGES) FULL_CMD_LINE
    |- 20026 19941 19941 19941 (java) 950 81 2678403072 50861 /opt/jdk1.8.0_201/bin/java -Djava.net.preferIPv4Stack=true -Dhadoop.metrics.log.level=WARN -Xmx820m -Djava.io.tmpdir=/home/work/_data/hadoop-3.1.2/nm-local-dir/usercache/root/appcache/application_1552315366846_0011/container_e15_1552315366846_0011_01_000002/tmp -Dlog4j.configuration=container-log4j.properties -Dyarn.app.container.log.dir=/home/work/_logs/hadoop-3.1.2/userlogs/application_1552315366846_0011/container_e15_1552315366846_0011_01_000002 -Dyarn.app.container.log.filesize=0 -Dhadoop.root.logger=INFO,CLA -Dhadoop.root.logfile=syslog org.apache.hadoop.mapred.YarnChild 192.168.11.32 41274 attempt_1552315366846_0011_m_000000_0 16492674416642
    |- 19941 19939 19941 19941 (bash) 1 2 115900416 307 /bin/bash -c /opt/jdk1.8.0_201/bin/java -Djava.net.preferIPv4Stack=true -Dhadoop.metrics.log.level=WARN -Xmx820m -Djava.io.tmpdir=/home/work/_data/hadoop-3.1.2/nm-local-dir/usercache/root/appcache/application_1552315366846_0011/container_e15_1552315366846_0011_01_000002/tmp -Dlog4j.configuration=container-log4j.properties -Dyarn.app.container.log.dir=/home/work/_logs/hadoop-3.1.2/userlogs/application_1552315366846_0011/container_e15_1552315366846_0011_01_000002 -Dyarn.app.container.log.filesize=0 -Dhadoop.root.logger=INFO,CLA -Dhadoop.root.logfile=syslog org.apache.hadoop.mapred.YarnChild 192.168.11.32 41274 attempt_1552315366846_0011_m_000000_0 16492674416642 1>/home/work/_logs/hadoop-3.1.2/userlogs/application_1552315366846_0011/container_e15_1552315366846_0011_01_000002/stdout 2>/home/work/_logs/hadoop-3.1.2/userlogs/application_1552315366846_0011/container_e15_1552315366846_0011_01_000002/stderr
[2019-03-12 20:21:28.324]Container killed on request. Exit code is 143
[2019-03-12 20:21:28.335]Container exited with a non-zero exit code 143.

2019-03-12 20:21:30,978 INFO mapreduce.Job:  map 75% reduce 0%
2019-03-12 20:21:37,021 INFO mapreduce.Job:  map 100% reduce 0%
2019-03-12 20:21:37,032 INFO mapreduce.Job: Job job_1552315366846_0011 completed successfully
2019-03-12 20:21:37,145 INFO mapreduce.Job: Counters: 33
    File System Counters
        FILE: Number of bytes read=0
        FILE: Number of bytes written=915840
        FILE: Number of read operations=0
        FILE: Number of large read operations=0
        FILE: Number of write operations=0
        HDFS: Number of bytes read=425
        HDFS: Number of bytes written=49
        HDFS: Number of read operations=24
        HDFS: Number of large read operations=0
        HDFS: Number of write operations=8
    Job Counters
        Failed map tasks=1
        Launched map tasks=5
        Other local map tasks=5
        Total time spent by all maps in occupied slots (ms)=31981
        Total time spent by all reduces in occupied slots (ms)=0
        Total time spent by all map tasks (ms)=31981
        Total vcore-milliseconds taken by all map tasks=31981
        Total megabyte-milliseconds taken by all map tasks=32748544
    Map-Reduce Framework
        Map input records=2
        Map output records=2
        Input split bytes=425
        Spilled Records=0
        Failed Shuffles=0
        Merged Map outputs=0
        GC time elapsed (ms)=318
        CPU time spent (ms)=6520
        Physical memory (bytes) snapshot=815542272
        Virtual memory (bytes) snapshot=11174408192
        Total committed heap usage (bytes)=437780480
        Peak Map Physical memory (bytes)=206934016
        Peak Map Virtual memory (bytes)=2795565056
    File Input Format Counters
        Bytes Read=0
    File Output Format Counters
        Bytes Written=49
2019-03-12 20:21:37,154 INFO mapreduce.ImportJobBase: Transferred 49 bytes in 27.4776 seconds (1.7833 bytes/sec)
2019-03-12 20:21:37,159 INFO mapreduce.ImportJobBase: Retrieved 2 records.
2019-03-12 20:21:37,159 INFO mapreduce.ImportJobBase: Publishing Hive/Hcat import job data to Listeners for table hive2mysql_mshk
2019-03-12 20:21:37,188 INFO manager.SqlManager: Executing SQL statement: SELECT t.* FROM `hive2mysql_mshk` AS t LIMIT 1
2019-03-12 20:21:37,208 INFO hive.HiveImport: Loading uploaded data into Hive
2019-03-12 20:21:37,220 INFO conf.HiveConf: Found configuration file file:/home/work/_app/hive-2.3.4/conf/hive-site.xml
2019-03-12 20:21:49,491 INFO hive.HiveImport:
2019-03-12 20:21:49,492 INFO hive.HiveImport: Logging initialized using configuration in jar:file:/home/work/_app/hive-2.3.4/lib/hive-common-2.3.4.jar!/hive-log4j2.properties Async: true
2019-03-12 20:21:56,558 INFO hive.HiveImport: OK
2019-03-12 20:21:56,561 INFO hive.HiveImport: Time taken: 5.954 seconds
2019-03-12 20:21:57,005 INFO hive.HiveImport: Loading data to table testmshk.mysql2hive_mshk
2019-03-12 20:21:58,181 INFO hive.HiveImport: OK
2019-03-12 20:21:58,181 INFO hive.HiveImport: Time taken: 1.619 seconds
2019-03-12 20:21:58,681 INFO hive.HiveImport: Hive import complete.
Finally, verify the output in Hive:

[root@c0 ~]# hive
Logging initialized using configuration in jar:file:/home/work/_app/hive-2.3.4/lib/hive-common-2.3.4.jar!/hive-log4j2.properties Async: true
Hive-on-MR is deprecated in Hive 2 and may not be available in the future versions. Consider using a different execution engine (i.e. spark, tez) or using Hive 1.X releases.
hive> show databases;
OK
default
testmshk
Time taken: 3.512 seconds, Fetched: 2 row(s)
hive> use testmshk;
OK
Time taken: 0.032 seconds
hive> show tables;
OK
mysql2hive_mshk
testtable
Time taken: 0.083 seconds, Fetched: 2 row(s)
hive> select * from mysql2hive_mshk;
OK
10086	my.mshk.top	you.mshk.top
12306	mname	yname
Time taken: 1.634 seconds, Fetched: 2 row(s)
hive> quit;
The imported data is also visible in HDFS, under the /hive/warehouse/mysql2hive_mshk target directory.
4.3. Importing data from MySQL into HBase with Sqoop

Next, import the data from the MySQL table hive2mysql_mshk into HBase, creating the mysql2hase_mshk table in HBase at the same time:
[root@c0 ~]# sqoop import --connect jdbc:mysql://c0:3306/testmshk?useSSL=false --username root --password 123456 --split-by id --table hive2mysql_mshk --hbase-table mysql2hase_mshk --hbase-create-table --hbase-row-key id --column-family id
2019-03-13 12:04:33,647 INFO sqoop.Sqoop: Running Sqoop version: 1.4.7
2019-03-13 12:04:33,694 WARN tool.BaseSqoopTool: Setting your password on the command-line is insecure. Consider using -P instead.
2019-03-13 12:04:33,841 INFO manager.MySQLManager: Preparing to use a MySQL streaming resultset.
2019-03-13 12:04:33,841 INFO tool.CodeGenTool: Beginning code generation
2019-03-13 12:04:34,162 INFO manager.SqlManager: Executing SQL statement: SELECT t.* FROM `hive2mysql_mshk` AS t LIMIT 1
2019-03-13 12:04:34,197 INFO manager.SqlManager: Executing SQL statement: SELECT t.* FROM `hive2mysql_mshk` AS t LIMIT 1
...
2019-03-13 12:05:13,782 INFO mapreduce.Job:  map 75% reduce 0%
2019-03-13 12:05:15,813 INFO mapreduce.Job:  map 100% reduce 0%
2019-03-13 12:05:16,827 INFO mapreduce.Job: Job job_1552397454797_0002 completed successfully
2019-03-13 12:05:16,942 INFO mapreduce.Job: Counters: 33
    File System Counters
        FILE: Number of bytes read=0
        FILE: Number of bytes written=1041632
        FILE: Number of read operations=0
        FILE: Number of large read operations=0
        FILE: Number of write operations=0
        HDFS: Number of bytes read=425
        HDFS: Number of bytes written=0
        HDFS: Number of read operations=4
        HDFS: Number of large read operations=0
        HDFS: Number of write operations=0
    Job Counters
        Failed map tasks=5
        Launched map tasks=9
        Other local map tasks=9
        Total time spent by all maps in occupied slots (ms)=68882
        Total time spent by all reduces in occupied slots (ms)=0
        Total time spent by all map tasks (ms)=68882
        Total vcore-milliseconds taken by all map tasks=68882
        Total megabyte-milliseconds taken by all map tasks=70535168
    Map-Reduce Framework
        Map input records=2
        Map output records=2
        Input split bytes=425
        Spilled Records=0
        Failed Shuffles=0
        Merged Map outputs=0
        GC time elapsed (ms)=801
        CPU time spent (ms)=15480
        Physical memory (bytes) snapshot=1097326592
        Virtual memory (bytes) snapshot=11271196672
        Total committed heap usage (bytes)=629669888
        Peak Map Physical memory (bytes)=295751680
        Peak Map Virtual memory (bytes)=2828283904
    File Input Format Counters
        Bytes Read=0
    File Output Format Counters
        Bytes Written=0
2019-03-13 12:05:16,949 INFO mapreduce.ImportJobBase: Transferred 0 bytes in 38.224 seconds (0 bytes/sec)
2019-03-13 12:05:16,954 INFO mapreduce.ImportJobBase: Retrieved 2 records.
Now connect to HBase with the shell and inspect the data we just imported. The mysql2hase_mshk table exists, and its data can be read:
[root@c0 ~]# hbase shell
HBase Shell
Use "help" to get list of supported commands.
Use "exit" to quit this interactive shell.
Version 1.4.9, rd625b212e46d01cb17db9ac2e9e927fdb201afa1, Wed Dec 5 11:54:10 PST 2018

hbase(main):001:0> list
TABLE
mysql2hase_mshk
1 row(s) in 0.1870 seconds

=> ["mysql2hase_mshk"]
hbase(main):002:0> scan 'mysql2hase_mshk'
ROW          COLUMN+CELL
 10086       column=id:namea, timestamp=1552449912494, value=my.mshk.top
 10086       column=id:nameb, timestamp=1552449912494, value=you.mshk.top
 12306       column=id:namea, timestamp=1552449906986, value=mname
 12306       column=id:nameb, timestamp=1552449906986, value=yname
2 row(s) in 0.1330 seconds

hbase(main):003:0> get 'mysql2hase_mshk','10086'
COLUMN       CELL
 id:namea    timestamp=1552449912494, value=my.mshk.top
 id:nameb    timestamp=1552449912494, value=you.mshk.top
1 row(s) in 0.0230 seconds

hbase(main):004:0>
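The scan output above reflects how --hbase-row-key id and --column-family id shaped the table: the id column became the row key, and each remaining column became a cell in the id family. A toy rendering of that mapping from the original rows, in plain shell with no HBase required:

```shell
# Map each MySQL row (id,namea,nameb) to HBase-style cells:
# row key = id, cells = id:namea and id:nameb.
printf '10086,my.mshk.top,you.mshk.top\n12306,mname,yname\n' |
awk -F',' '{ print $1, "column=id:namea, value=" $2
             print $1, "column=id:nameb, value=" $3 }'
# 10086 column=id:namea, value=my.mshk.top
# 10086 column=id:nameb, value=you.mshk.top
# 12306 column=id:namea, value=mname
# 12306 column=id:nameb, value=yname
```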
For importing and exporting data between HBase and Hive directly, see sections 9.2.4 and 9.2.5 of the article: Hadoop 3.1.2(HA)+Zookeeper3.4.13+Hbase1.4.9(HA)+Hive2.3.4+Spark2.4.0(HA)高可用集群搭建.
I hope you found this useful. Thank you for your support and for reading my blog.

Author: 迦壹
Original post: 通过 Sqoop1.4.7 将 Mysql5.7、Hive2.3.4、Hbase1.4.9 之间的数据导入导出
Reprint notice: This article may be reproduced, provided the original source, author information, and copyright notice are clearly indicated with a hyperlink. Thank you!