Can Hive tell you how large the data is?

This topic was created 3555 days ago; the information in it may have since evolved or changed.

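Here's the situation: I've taken over someone else's project. The Hive table holds roughly 1.7 billion rows.

I know Hive keeps its data on HDFS, under the /usr/hive/warehouse directory.

But because the existing data is partitioned, and HDFS stores data redundantly anyway, it ends up looking like this: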

/usr/hive/warehouse/dbname/tablename/hour=01/part00000001copy
/usr/hive/warehouse/dbname/tablename/hour=01/part00000001
/usr/hive/warehouse/dbname/tablename/hour=02/part00000001copy

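Something roughly like the above.

On top of that, the directory entries on HDFS do not show a size.

I need to do a data assessment and efficiency analysis for the project, so how can I find out how large these 1.7 billion rows actually are?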

4 replies 2016-04-28 00:50:53 +08:00

cxzl25

April 27, 2016

You can use du; its help text:
hdfs dfs -help du

-du [-s] [-h] <path> ...: Show the amount of space, in bytes, used by the files that
match the specified file pattern. The following flags are optional:
-s Rather than showing the size of each individual file that
matches the pattern, shows the total (summary) size.
-h Formats the sizes of files in a human-readable fashion
rather than a number of bytes.

Note that, even without the -s option, this only shows size summaries
one level deep into a directory.
The output is in the form
size name(full path)
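
As a rough sketch of how the `size name(full path)` output above could be totalled per partition: the helper below parses text captured from `hdfs dfs -du` on the table directory, run without `-h` so the sizes stay plain byte counts. The function name and the sample sizes are made up for illustration. It takes the first field as the size and the last field as the path, on the assumption that some Hadoop versions print an extra column between them. The first column is the file length itself, so HDFS replication should not inflate it.

```python
def parse_du_output(du_output: str) -> dict:
    """Map each path in `hdfs dfs -du` output to its size in bytes.

    Assumes sizes are plain byte counts (du was run without -h).
    """
    sizes = {}
    for line in du_output.splitlines():
        fields = line.split()
        # Skip blank lines and anything that does not start with a byte count.
        if len(fields) < 2 or not fields[0].isdigit():
            continue
        # First field: size in bytes; last field: full path.
        sizes[fields[-1]] = int(fields[0])
    return sizes


# Hypothetical captured output; the sizes are invented for illustration.
sample = """\
1048576  /usr/hive/warehouse/dbname/tablename/hour=01
2097152  /usr/hive/warehouse/dbname/tablename/hour=02
"""

per_partition = parse_du_output(sample)
total_bytes = sum(per_partition.values())
print(total_bytes)  # 3145728
```

If only the grand total matters, `-s` on the table directory gives it directly and no parsing is needed.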

Also, when a Hive job finishes, the log usually shows something like:
stats: [numFiles=1, numRows=62xx9xx, totalSize=83xx236xx, rawDataSize=81x49xxx]
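
A minimal sketch of pulling those numbers out of a log line and turning them into a bytes-per-row figure. The regular expression and function name are illustrative, and the sample values are invented (the line above has its digits masked). Scaling the per-row figure to the whole table assumes rows and compression are similar across partitions, which may not hold.

```python
import re

# Matches the bracketed key=value list after "stats:".
STATS_RE = re.compile(r"stats:\s*\[(?P<body>[^\]]*)\]")


def parse_stats(log_line: str) -> dict:
    """Extract integer key=value pairs from a Hive 'stats: [...]' log line."""
    match = STATS_RE.search(log_line)
    if match is None:
        return {}
    stats = {}
    for pair in match.group("body").split(","):
        key, _, value = pair.strip().partition("=")
        if value.isdigit():  # ignore masked or malformed values
            stats[key] = int(value)
    return stats


# Hypothetical log line; the values are invented for illustration.
line = "stats: [numFiles=1, numRows=1000, totalSize=50000, rawDataSize=48000]"
stats = parse_stats(line)

bytes_per_row = stats["totalSize"] / stats["numRows"]
print(bytes_per_row)  # 50.0

# Rough extrapolation to the 1.7 billion rows mentioned in the question.
estimated_total_bytes = bytes_per_row * 1_700_000_000
```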

There is also the following syntax, which can compute size statistics for partitions that have already been produced:

ANALYZE TABLE tablename [PARTITION(partcol1[=val1], partcol2[=val2], ...)] COMPUTE STATISTICS [noscan];

Reference: https://cwiki.apache.org/confluence/display/Hive/StatsDev

firstway

April 27, 2016 via Android

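The reply above is right.
With du you can see the total size of the data files, but it cannot tell you the number of records. The files are usually compressed (leaving them uncompressed wastes too much space).
I'm not sure what you mean by size, or what the goal is. Are you trying to predict processing speed? That depends not only on the data volume but also on how finely the data can be split and on the processing capacity of the cluster.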

anonymoustian

April 27, 2016

@firstway Here's the thing: I need to give the client a report on system performance.

firstway

April 28, 2016 via Android

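Hive ultimately runs as MapReduce jobs, so you can copy a few hours, a few days, and a few weeks of data and run your job on each. If your data is fairly evenly distributed, processing time should grow close to linearly with data volume.
If it is close to linear, the overall processing time is easy to estimate.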
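
One way to sketch that extrapolation is to fit a straight line through a few (data volume, runtime) samples and read off the full-table estimate. The measurements below are invented purely for illustration, and the estimate is only as good as the linearity assumption: real runs should be checked for how well they actually fall on a line.

```python
def fit_line(xs, ys):
    """Ordinary least-squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Hypothetical sample runs: rows processed (in millions) and job runtime (seconds).
rows_millions = [10, 50, 100]
runtime_seconds = [130, 530, 1030]

slope, intercept = fit_line(rows_millions, runtime_seconds)

# Extrapolate to 1.7 billion rows, i.e. 1700 million.
estimate = slope * 1700 + intercept
print(round(estimate))  # 17030
```

The intercept stands in for fixed job overhead such as startup and scheduling, which is why timing several sample sizes is more informative than timing just one.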