分组聚合实现与优化


一个通用问题

如何查找所有频道正在播出的节目? 如何查找所有分类下最热门的内容? 如何查找所有运动员的最好成绩?
这些问题都可以抽象为:如何取分组的最值?


SQL

数据库:MySQL

基于SQL的几种实现

任务:为每件物品找到标价最高的经销商

关联子查询

最好理解, 一般情况下,应该避免子查询,效率非常低。

SELECT article, dealer, price
FROM   shop s1
WHERE  price=(SELECT MAX(s2.price)
              FROM shop s2
              WHERE s1.article = s2.article);

非关联子查询

SELECT s1.article, dealer, s1.price
FROM shop s1
JOIN (
  SELECT article, MAX(price) AS price
  FROM shop
  GROUP BY article) AS s2
  ON s1.article = s2.article AND s1.price = s2.price;

左关联

SELECT s1.article, s1.dealer, s1.price
FROM shop s1
LEFT JOIN shop s2 ON s1.article = s2.article AND s1.price < s2.price
WHERE s2.article IS NULL;

名词解释

实例优化

查找所有频道正在播出的节目

表结构:

CREATE TABLE `pt_l_program` (
  `id` bigint(20) unsigned NOT NULL AUTO_INCREMENT COMMENT 'ID',
  `lid` int(11) unsigned NOT NULL COMMENT '频道ID',
  `starttime` timestamp DEFAULT '0' COMMENT '节目开始时间',
  `program` varchar(255) DEFAULT '' COMMENT '节目单',
  PRIMARY KEY (`id`),
  UNIQUE KEY `unique` (`lid`,`starttime`),
) ENGINE=MyISAM AUTO_INCREMENT=1 DEFAULT CHARSET=utf8 COMMENT='节目单';

SQL:

select SQL_SMALL_RESULT SQL_NO_CACHE  p1.id, p1.program, p1.starttime, p1.lid  from pt_l_program p1 inner join 
(select  max(starttime) as starttime,lid from pt_l_program where starttime<=UNIX_TIMESTAMP()  group by lid order by null) p2 
on p1.lid=p2.lid and p1.starttime=p2.starttime;

索引:

CREATE  INDEX Group_Search ON pt_l_program (lid,starttime,id,program);

Explain:
分组时使用松散索引扫描,聚合数据时使用覆盖索引,完全避免了磁盘IO。如果需要考虑插入性能,则可以去掉Group_Search

问题扩展

分组取前n条值/分组取第n条值

问题有点难度,但SQL还是可以实现:

set @num :=0, @lid := '';
select id, program,from_unixtime( starttime),channel_id  from (
	select id, program,starttime,
      @num := if(@lid = lid, @num + 1, 1) as row_number,
      @lid := lid as channel_id
  from pt_l_program
	where   starttime<=unix_timestamp()
  order by lid, starttime desc 
) as p where p.row_number <= 2;

利用Having过滤:

set @num :=0, @lid := '', @now := 0;
desc select id, program,from_unixtime(starttime),
      @num := if(@lid = lid and @now<starttime, @num + 1, 1) as row_number,
      @lid := lid as channel_id
from pt_l_program
where starttime<=unix_timestamp() group by lid, starttime desc having row_number<=2;

NoSQL

数据库:MongoDB

名词解释

聚合管道:管道是MongoDB2.2版本引入新的功能 ,它是数据聚合的一个新框架,其概念类似于数据处理的管道。

导入数据

mongoimport --type csv --file ~/program.csv --db pt --collection program --fields id,lid,starttime,program

聚合

db.program.aggregate(
[
{$match:{}},
{ $project : { id : 1 , lid : 1, program:1, starttime:{$multiply:["$starttime", 1000]} } },
{$sort:{"lid":-1, "starttime":1}},
{$group : { _id :"$lid", program:{$first:"$program"}, starttime:{$first:"$starttime"  }}}
],
{explain:true}
)

参考

How to select the first/least/max row per group in SQL
详解MySQL分组查询Group By实现原理
MySQL优化GROUP BY-松散索引扫描与紧凑索引扫描
MySQL如何优化GROUP BY
松散的索引扫描(Loose index scan)
[MySQL] 索引与性能(1)- 索引类型
MyISAM 索引结构了解 – MyISAM Index Structure
MySQL索引背后的数据结构及算法原理

Yan Peipan 11 December 2014