hive仓库创建json格式表

2018/10/17 swaiter

在现有的数据资产管理系统中,系统支持text、orc等格式的表,其中text格式的表列可以由多种分割符进行分割,由于需要,仓库中需要存储json相关的数据,因此,系统需要支持json格式表创建,遇到的问题总结如下:

实现

首先就是引入JsonSerDe包的问题。 首先,系统自带的有相关的jar包。
另外,也有第三方开源的jar包。第三方 第三方做了相关的优化,下面有有介绍。

###hive官方文档 hive官方文档有相关json serde介绍

Stored as plain text file in JSON format.

The JsonSerDe for JSON files is available in Hive 0.12 and later.

In some distributions, a reference to hive-hcatalog-core.jar is required.
ADD JAR /usr/lib/hive-hcatalog/lib/hive-hcatalog-core.jar;

CREATE TABLE my_table(a string, b bigint, ...)
ROW FORMAT SERDE 'org.apache.hive.hcatalog.data.JsonSerDe'
STORED AS TEXTFILE;

The JsonSerDe was moved to Hive from HCatalog and before it was in hive-contrib project. It was added to the Hive distribution by HIVE-4895.
An Amazon SerDe is available at s3://elasticmapreduce/samples/hive-ads/libs/jsonserde.jar for releases prior to 0.12.0.

The JsonSerDe for JSON files is available in Hive 0.12 and later.
Starting in Hive 3.0.0, JsonSerDe is added to Hive Serde as "org.apache.hadoop.hive.serde2.JsonSerDe" (HIVE-19211).
CREATE TABLE my_table(a string, b bigint, ...)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.JsonSerDe'
STORED AS TEXTFILE;

Or STORED AS JSONFILE is supported starting in Hive 4.0.0 (HIVE-19899), so you can create table as follows:

CREATE TABLE my_table(a string, b bigint, ...) STORED AS JSONFILE;

创建

在hive客户端执行以下

create EXTERNAL table employees_json1 (firstName string, lastname string,        employeenumber int )
 ROW FORMAT SERDE 'org.apache.hive.hcatalog.data.JsonSerDe'
;

可能会遇到以下问题: 1、创建时Error:”java.lang.ClassNotFoundException Class org.apache.hive.hcatalog.data.JsonSerDe not found” 原因是缺少相关的包。 导入json数据到 test_data.json文件中

{ "firstName" : "asMike", "lastName" : "asdfpesky", "employeeNumber" : 40192 }
{ "firstName" : "asMike", "lastName" : "asdfesky", "employeeNumber" : 123192 }
{ "firstName" : "Mikasde", "lastName" : "Chepesky", "employeeNumber" : 192 }

然后上传数据到hdfs上

hadoop fs -put -f test_data.json /user/datacenter/test_data.json

插入数据到表中

load data inpath '/user/datacenter/test_data.json' overwrite into table employees_json1;

查询数据

select * from employees_json1
hive> select * from employees_json1;
OK
asMike	asdfpesky	40192
asMike	asdfesky	123192
Mikasde	Chepesky	192

如果需要用到映射关系可以使用第三方开源jar,上面有所提及。 如:json文件中使用的是firstName,表里面的字段名称是name,则可以通过以下方式:

create EXTERNAL table employees_json2 (name string, lastname string,        employeenumber int )
 ROW FORMAT SERDE 'org.openx.data.jsonserde.JsonSerDe'
 WITH SERDEPROPERTIES (
  "mapping.name"="firstName"
);

同样,查询出来的结果是:

hive> select * from employees_json1;
OK
asMike	asdfpesky	40192
asMike	asdfesky	123192
Mikasde	Chepesky	192

第三方的jar需要导入,有关导入jar遇到的坑,下次再写。

Search

    Table of Contents