sqoop 从mysql直接导入到hive表

栏目: 数据库 · 发布时间: 5年前

内容简介：mysql的数据库数据过大，做数据分析，需要从mysql转向hadoop。从mysql转数据到hive中，本想用parquet格式，但是一直都没有成功，提示Hive import and create hive table is not compatible with importing into ParquetFile format.

mysql的数据库数据过大，做数据分析，需要从 mysql 转向hadoop。

1，遇到的问题

从mysql转数据到hive中，本想用parquet格式，但是一直都没有成功，提示

Hive import and create hive table is not compatible with importing into ParquetFile format.

sqoop不管是mysql直接到hive。还是把mysql导出成parquet文件，然后在把parquet文件，在导入到hive的外部表，都没有成功

存为avro格式也是一样。

2，安装sqoop

下载：http://mirrors.shu.edu.cn/apache/sqoop/1.4.7/

# tar zxvf sqoop-1.4.7.bin__hadoop-2.6.0.tar.gz
# cp -r sqoop-1.4.7.bin__hadoop-2.6.0 /bigdata/sqoop

3，配置sqoop

3.1，配置用户环境变量

# cd ~
# vim .bashrc
export SQOOP_HOME=/bigdata/sqoop
export PATH=$ZOOKEEPER_HOME/bin:$SPARK_HOME/bin:$HIVE_HOME/bin:/bigdata/hadoop/bin:$SQOOP_HOME/bin:$PATH

# source .bashrc

3.2，配置sqoop-env.sh

# vim /bigdata/sqoop/sqoop-env.sh 

#Set path to where bin/hadoop is available
export HADOOP_COMMON_HOME=/bigdata/hadoop

#Set path to where hadoop-*-core.jar is available
export HADOOP_MAPRED_HOME=/bigdata/hadoop

#set the path to where bin/hbase is available
#export HBASE_HOME=

#Set the path to where bin/hive is available
export HIVE_HOME=/bigdata/hive
export HIVE_CONF_DIR=/bigdata/hive/conf   //要加上，不然会提示hiveconf找不到

#Set the path for where zookeper config dir is
export ZOOCFGDIR=/bigdata/zookeeper/conf

3.3，导入数据

# sqoop import \
--connect jdbc:mysql://10.0.0.237:3306/bigdata \
--username root \
--password ******* \
--table track_app \
-m 1 \
--warehouse-dir /user/hive/warehouse/tanktest.db \
--hive-database tanktest \
--create-hive-table \
--hive-import \
--hive-table track_app

这样就可以导入了，不过导入hive后，在hdfs上面存储的文件格式是文本形势。

hive> describe formatted track_app;
OK
# col_name data_type comment 

id int
log_date int
log_time int
user_id int
ticket string 

# Detailed Table Information
Database: tanktest
Owner: root
CreateTime: Fri Feb 15 18:08:55 CST 2019
LastAccessTime: UNKNOWN
Retention: 0
Location: hdfs://bigserver1:9000/user/hive/warehouse/tanktest.db/track_app
Table Type: MANAGED_TABLE
Table Parameters:
 comment Imported by sqoop on 2019/02/15 18:08:42
 numFiles 1
 numRows 0
 rawDataSize 0
 totalSize 208254
 transient_lastDdlTime 1550225337 

# Storage Information
SerDe Library: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
InputFormat: org.apache.hadoop.mapred.TextInputFormat //文本格式，也可以在hdfs上面，打开文件查看内容
OutputFormat: org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat
Compressed: No
Num Buckets: -1
Bucket Columns: []
Sort Columns: []
Storage Desc Params:
 field.delim \u0001
 line.delim \n
 serialization.format \u0001
Time taken: 0.749 seconds, Fetched: 56 row(s)

注意：

导致hive表后，通过sql（hive，spark-sql）的方式，看一下，能不能查询到数据。如果查不到数据，说明没有导入成功。

4，sqoop参数

Import和export参数解释

Common arguments:

--connect <jdbc-uri> ：连接RDBMS的jdbc连接字符串，例如：–connect jdbc:mysql:// MYSQL_SERVER:PORT/DBNAME。

--connection-manager <class-name> ：

--hadoop-home <hdir> ：

--username <username> ：连接RDBMS所使用的用户名。

--password <password> ：连接RDBMS所使用的密码，明文。

--password-file <password-file> ：使用文件存储密码。

-p ：交互式连接RDBMS的密码。

Import control arguments:

--append ：追加数据到HDFS已经存在的文件中。

--as-sequencefile ：import序列化的文件。

--as-textfile ：import文本文件，默认。

--columns <col,col,col…> ：指定列import，逗号分隔，比如：–columns “id,name”。

--delete-target-dir ：删除存在的import目标目录。

--direct ：直连模式，速度更快（HBase不支持）

--split-by ：分割导入任务所使用的字段，需要明确指定，推荐使用主键。

--inline-lob-limit < n > ：设置内联的BLOB对象的大小。

--fetch-size <n> ：一次从数据库读取n个实例，即n条数据。

-e,--query <statement> ：构建表达式<statement>执行。

--target-dir <d> ：指定HDFS目标存储目录。

--warehouse-dir <d> ：可以指定为-warehouse-dir/user/hive/warehouse/即导入数据的存放路径，如果该路径不存在，会首先创建。

--table <table-name> ：将要导入到hive的表。

--where <where clause> ：指定where从句，如果有双引号，注意转义 \$CONDITIONS，不能用or，子查询，join。

-z,--compress ：开启压缩。

--null-string <null-string> ：string列为空指定为此值。

--null-non-string <null-string> ：非string列为空指定为此值，-null这两个参数are optional, 如果不设置，会指定为”null”。

--autoreset-to-one-mapper ：如果没有主键和split-by用one mapper import （split-by和此选项不共存）。

-m,--num-mappers <n> ：建立n个并发执行import，默认4个线程。

Incremental import arguments:

--check-column <column> ：Source column to check for incremental change

--incremental <import-type> ：Define an incremental import of type ‘append’ or ‘lastmodified’

--last-value <value> ：Last imported value in the incremental check column

Hive arguments:

--create-hive-table ：自动推断表字段类型直接建表，hive-overwrite功能可以替代掉了，但Hive里此表不能存在，不然操作会报错。

--hive-database <database-name> ：指定要把HDFS数据导入到哪个Hive库。

--hive-table <table-name> ：设置到Hive当中的表名。

--hive-delims-replacement <arg> ：导入到hive时用自定义的字符替换掉\n, \r, and \01。

--hive-drop-import-delims ：导入到hive时删除字段中\n, \r，\t and \01等符号；避免字段中有空格导致导入数据被截断。

--hive-home <dir> ：指定Hive的存储目录。

--hive-import ：将HDFS数据导入到Hive中，会自动创建Hive表，使用hive的默认分隔符。

--hive-overwrite ：对Hive表进行覆盖操作（需配合--hive-import使用，如果Hive里没有表会先创建之），不然就是追加数据。

--hive-partition-key <partition-key> ：hive分区的key。

--hive-partition-value <partition-value> ：hive分区的值。

--map-column-hive <arg> ：类型匹配，SQL类型对应到hive类型。

HBase arguments:

--column-family < family > ：把内容导入到hbase当中，默认是用主键作为split列。

--hbase-create-table ：创建Hbase表。

--hbase-row-key < col > ：指定字段作为row key ，如果输入表包含复合主键，用逗号分隔。

--hbase-table < table-name > ：指定hbase表。

以上就是本文的全部内容，希望本文的内容对大家的学习或者工作能带来一定的帮助，也希望大家多多支持码农网

查看所有标签

猜你喜欢:

本站部分资源来源于网络，本站转载出于传递更多信息之目的，版权归原作者或者来源机构所有，如转载稿涉及版权问题，请联系我们。

码农书籍

SQL必知必会

福达 (Ben Forta) / 钟鸣、刘晓霞 / 人民邮电出版社 / 2013-5-1 / 29.00元

SQL语法简洁，使用方式灵活，功能强大，已经成为当今程序员不可或缺的技能。本书是深受世界各地读者欢迎的SQL经典畅销书，内容丰富，文字简洁明快，针对Oracle、SQL Server、MySQL、DB2、PostgreSQL、SQLite等各种主流数据库提供了大量简明的实例。与其他同类图书不同，它没有过多阐述数据库基础理论，而是专门针对一线软件开发人员，直接从SQL SELECT开始，讲述......一起来看看《SQL必知必会》这本书的介绍吧!

码农工具