许吉友 - 运维

Hive 分区

为了提高性能,我们可以在Hive中实现数据分区。在这种情况下,我们将根据日期字段创建一个带有分区列的表。仅查询所需的分区

创建外部表:

CREATE EXTERNAL TABLE history_raw (
    user_id STRING,
    datetime TIMESTAMP,
    ip STRING,
    browser STRING,
    os STRING)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
STORED AS TEXTFILE
LOCATION '/user/admin/input';

查看表结构:

desc formatted history_raw;

/tmp/history.txt 内容如下:

eo90133cf9ql,2015-10-01 00:03:20,123.456.77.88,firefox,linux,20151001
eo90133cf9ql,2015-10-01 01:08:56,123.456.77.88,firefox,linux,20151001
eo90133cf9ql,2015-10-02 02:30:45,123.456.77.88,firefox,linux,20151002
sh1243ihn93n,2015-10-02 11:21:50,121.956.23.88,safari,mac,20151002
eo90133cf9ql,2015-10-10 15:02:11,133.555.23.88,firefox,android,20151010
aa9871kjn3l1,2015-10-10 18:20:43,155.215.23.88,chrome,windows,20151010
eo90133cf9ql,2015-10-11 12:18:09,123.456.23.88,firefox,android,20151011
eo90133cf9ql,2015-10-12 12:34:34,123.456.23.88,firefox,android,20151012
hh2342o2nkj4,2015-10-15 15:02:11,133.555.23.88,safari,ios,20151015
sh1243ihn93n,2015-10-15 21:21:21,121.956.23.88,safari,mac,20151015

传到 HDFS:

$ hdfs dfs -put /tmp/history.txt input/

这时在 Hive 中查询发现有数据了:

select * from history_raw;
+----------------------+------------------------+-----------------+----------------------+-----------------+
| history_raw.user_id  |  history_raw.datetime  | history_raw.ip  | history_raw.browser  | history_raw.os  |
+----------------------+------------------------+-----------------+----------------------+-----------------+
| eo90133cf9ql         | 2015-10-01 00:03:20.0  | 123.456.77.88   | firefox              | linux           |
| eo90133cf9ql         | 2015-10-01 01:08:56.0  | 123.456.77.88   | firefox              | linux           |
| eo90133cf9ql         | 2015-10-02 02:30:45.0  | 123.456.77.88   | firefox              | linux           |
| sh1243ihn93n         | 2015-10-02 11:21:50.0  | 121.956.23.88   | safari               | mac             |
| eo90133cf9ql         | 2015-10-10 15:02:11.0  | 133.555.23.88   | firefox              | android         |
| aa9871kjn3l1         | 2015-10-10 18:20:43.0  | 155.215.23.88   | chrome               | windows         |
| eo90133cf9ql         | 2015-10-11 12:18:09.0  | 123.456.23.88   | firefox              | android         |
| eo90133cf9ql         | 2015-10-12 12:34:34.0  | 123.456.23.88   | firefox              | android         |
| hh2342o2nkj4         | 2015-10-15 15:02:11.0  | 133.555.23.88   | safari               | ios             |
| sh1243ihn93n         | 2015-10-15 21:21:21.0  | 121.956.23.88   | safari               | mac             |
+----------------------+------------------------+-----------------+----------------------+-----------------+

创建一个带分区的表:

CREATE TABLE history (
    user_id STRING,
    datetime TIMESTAMP,
    ip STRING,
    browser STRING,
    os STRING)
PARTITIONED BY (day STRING)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ',';

查看 history 的表结构 :

DESC history;
+--------------------------+------------+----------+
|         col_name         | data_type  | comment  |
+--------------------------+------------+----------+
| user_id                  | string     |          |
| datetime                 | timestamp  |          |
| ip                       | string     |          |
| browser                  | string     |          |
| os                       | string     |          |
| day                      | string     |          |
|                          | NULL       | NULL     |
| # Partition Information  | NULL       | NULL     |
| # col_name               | data_type  | comment  |
| day                      | string     |          |
+--------------------------+------------+----------+

创建分区:

ALTER TABLE history ADD PARTITION (day='20151015');

查看分区:

SHOW PARTITIONS history;
+---------------+
|   partition   |
+---------------+
| day=20151015  |
+---------------+

插入数据:

INSERT OVERWRITE TABLE history PARTITION (day='20151015')
SELECT * FROM history_raw
  WHERE substr(datetime, 0, 10) = '2015-10-15';

查看 HDFS 中的数据组织方式:

$ hdfs dfs -ls /warehouse/tablespace/managed/hive/history
drwxrwx---+  - hive hadoop  /warehouse/tablespace/managed/hive/history/day=20151015

创建一个表:

CREATE TABLE history_buckets (
    user_id STRING,
    datetime TIMESTAMP,
    ip STRING,
    browser STRING,
    os STRING)
CLUSTERED BY (user_id) INTO 10 BUCKETS
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ',';

这个表会根据 user_id 分成 10 个目录!

插入数据:

INSERT OVERWRITE TABLE history_buckets
SELECT * FROM history_raw;

查看 HDFS 中的目录,发现有10个桶:

$ hdfs dfs -ls /warehouse/tablespace/managed/hive/history_buckets/base_0000001
-rw-rw----+  3 hive hadoop  /warehouse/tablespace/managed/hive/history_buckets/base_0000001/bucket_00000
-rw-rw----+  3 hive hadoop  /warehouse/tablespace/managed/hive/history_buckets/base_0000001/bucket_00001
-rw-rw----+  3 hive hadoop  /warehouse/tablespace/managed/hive/history_buckets/base_0000001/bucket_00002
-rw-rw----+  3 hive hadoop  /warehouse/tablespace/managed/hive/history_buckets/base_0000001/bucket_00003
-rw-rw----+  3 hive hadoop  /warehouse/tablespace/managed/hive/history_buckets/base_0000001/bucket_00004
-rw-rw----+  3 hive hadoop  /warehouse/tablespace/managed/hive/history_buckets/base_0000001/bucket_00005
-rw-rw----+  3 hive hadoop  /warehouse/tablespace/managed/hive/history_buckets/base_0000001/bucket_00006
-rw-rw----+  3 hive hadoop  /warehouse/tablespace/managed/hive/history_buckets/base_0000001/bucket_00007
-rw-rw----+  3 hive hadoop  /warehouse/tablespace/managed/hive/history_buckets/base_0000001/bucket_00008
-rw-rw----+  3 hive hadoop  /warehouse/tablespace/managed/hive/history_buckets/base_0000001/bucket_00009