如何标记"传递组"与SQL?标记、QUOT、SQL

2023-09-11 22:38:26 作者:听、那心碎的声音

我有一个ID对一个表,在一个传递关系的 T 的,即,如果A的 T 的b和B的 T 的C,则A的 T 的C。示例:

I have a table with ID pairs that are in a transitive relation t, that is, if "A t B" AND "B t C" then "A t C". Sample:

  table T1
  ID1 | ID2 
  1   | 2
  1   | 5
  4   | 7
  7   | 8
  9   | 1

因此​​,有两组,

So there are two groups,

G1 :{1,2,5,9},因为1的 T 的2,1的 T 5和9的 T 的1 G2 :{4,7,8},因为4的 T 的7和7的 T 的8 g1: {1,2,5,9} because "1 t 2", "1 t 5" and "9 t 1" g2: {4,7,8} because "4 t 7" and "7 t 8"

和我需要制作,由纯和标准SQL,一个新的表或视图:

and I need to produce, by "pure and standard SQL", a new table or view:

  table T2
  ID1 | ID2 | LABEL 
  1   | 2   | 1
  1   | 5   | 1
  4   | 7   | 2
  7   | 8   | 2
  9   | 1   | 1

PS -1:我们可以通过

PS-1: we can list the "transitive groups" by

  SELECT DISTINCT label, id   
  FROM (SELECT id1 as id, * FROM T2) UNION (SELECT id2 as id, * FROM T2)
  ORDER BY 1,2;

PS -2:我使用PostgreSQL 9.1,但如果有与标准SQL的一个解决方案,我preFER

PS-2: I am using PostgreSQL 9.1, but if there are a solution with "standard SQL" I prefer.

推荐答案

现在,在2013年新的需求,我需要的工作,10000 itens : 使用的@ GordonLinoff优雅的解决方案(上图), 1000 itens需要1秒,与2000年需要一天的时间... 不会有很好的表现。性能问题也是这里记得,

Now, a new demand at 2013, I need to work with 10000 itens: using the @GordonLinoff's elegant solution (above), with 1000 itens need 1 second, with 2000 need 1 day... Not have a good performance. The problem of performance was also remembered here,

(这是最好的解决办法,这么快!) 请参见原件和教育方法的说明。这里表 T1 是用来一样的问题 - 文本,第二个(临时)表研究处理并显示结果,

(this is the best solution, so fast!) See original and didactical description. Here the table T1 is the same as the question-text, and a second (temporary) table R is used to process and to show the results,

 CREATE TABLE R (
   id integer NOT NULL, -- PRIMARY KEY,
   label integer NOT NULL DEFAULT 0
 );
 CREATE FUNCTION t1r_labeler() RETURNS void AS $funcBody$
   DECLARE
      label1 integer;
      label2 integer;
      newlabel integer;
      t t1%rowtype;
   BEGIN
       DELETE FROM R;
       INSERT INTO R(id) 
           SELECT DISTINCT unnest(array[id1,id2]) 
           FROM T1 ORDER BY 1;
      newlabel:=0;
      FOR t IN SELECT * FROM t1
      LOOP   -- --  BASIC LABELING:  -- --
         SELECT label INTO label1 FROM R WHERE id=t.id1;
         SELECT label INTO label2 FROM R WHERE id=t.id2;
         IF label1=0 AND label2=0 THEN 
              newlabel:=newlabel+1;
              UPDATE R set label=newlabel WHERE ID in (t.id1,t.id2);
         ELSIF label1=0 AND label2!=0 THEN 
              UPDATE R set label=label2 WHERE ID=t.id1;
         ELSIF label1!=0 AND label2=0 THEN 
              UPDATE R set label=label1 WHERE ID=t.id2;
         ELSIF label1!=label2 THEN -- time consuming
              UPDATE tmp.R set label=label1 WHERE label = label2;
         END IF;
      END LOOP;
    END;
 $funcBody$ LANGUAGE plpgsql VOLATILE; 

preparing和运行,

Preparing and running,

 -- same CREATE TABLE T1 (id1 integer, id2 integer);
 DELETE FROM T1; 
 INSERT INTO T1(id1,id2)  -- populate the standard input
 VALUES (1, 2), (1, 5), (4, 7), (7, 8), (9, 1);
   -- or  SELECT id1, id2 FROM table_with_1000000_items;

 SELECT t1r_labeler();       -- run
 SELECT * FROM R ORDER BY 2; -- show

与最坏的情况下处理

最后一个条件,当 LABEL1!= LABEL2 , 是最耗时的操作,并且必须避免或者可以在高连接的情况下,是最严重的那些分开。

The last condition, when label1!=label2, is the most time-consuming operation, and must be avoided or can be separated in cases of high connectivity, that are the worst ones.

要报告某种警惕的,你可以指望的时候,该程序正在运行的最后一个条件的比例,和/ COR可以separete最后一次更新。

To report some kind of alert you can count the proportion of times that the procedure is running the last condition, and/cor can separete the last update.

如果你分开,你可以分析和处理他们好一点 因此,消除了最后 ELSIF 和第一循环结束后加入你的支票,这第二个循环:

If you separate, you can analyse and deal a little better with them So, eliminating the last ELSIF and adding after the first loop your checks and this second loop:

      -- ... first loop and checks here ...
      FOR t IN SELECT * FROM tmp.t1 
      LOOP   -- --  MERGING LABELS:  -- --
         SELECT label INTO label1 FROM R WHERE id=t.id1;
         SELECT label INTO label2 FROM R WHERE id=t.id2;
         IF label1!=0 AND label2!=0 AND label1!=label2 THEN
             UPDATE R set label=label1 WHERE label=label2;
         END IF;
      END LOOP;
      -- ...

最糟糕的情况的例子:一组具有超过1000(连接)节点与10%的标组(核心),并只有在连接芯数路径的平均长度10000节点。

Example of worst case: a group with more than 1000 (connected) nodes into 10000 nodes with average length of "10 per labeled-group" (cores) and only few paths connecting cores.

这其他的解决办法就是慢(是的蛮力算法的),但可以当你需要直接处理数组UTIL,而不是需要这么快的解决方案(而不是有最坏的情况 )。

This other solution is slower (is a brute-force algorithm), but can be util when you need direct processing with arrays, and not need so fast solution (and not have "worst cases").

由于 @ peter.petrov和@RBarryYoung建议使用的更充足的数据结构 ......我又回到了我的数组,因为更充足的数据结构。毕竟,有良好的加速(用@ GordonLinoff算法comparating)与下面的解决方案(!)

As @peter.petrov and @RBarryYoung suggest to use a more adequate data-structure... I was back to my arrays as "more adequate data-structure". After all, there are good speed-up (comparating with @GordonLinoff's algorithm) with the solution below (!).

第一步是将问题文本的表 T1 转化为暂时的, transgroup1 ,在这里我们可以计算出新的进程,

The first step is to translate the table t1 of the question-text to a temporary one, transgroup1, where we can compute the new process,

 -- DROP table transgroup1;
 CREATE TABLE transgroup1 (
   id serial NOT NULL PRIMARY KEY,
   items integer[], -- two or more items in the transitive relationship
   dels integer[] DEFAULT array[]::integer[]
 );
 INSERT INTO transgroup1(items)
   SELECT array[id1, id2] FROM t1; -- now suppose t1 a 10000 items table;

他们,这两个功能就可以解决这个问题,

them, with these two functions we can solve the problem,

 CREATE FUNCTION array_uunion(anyarray,anyarray) RETURNS anyarray AS $$
     -- ensures distinct items of a concatemation
    SELECT ARRAY(SELECT  unnest($1) UNION SELECT  unnest($2))
 $$ LANGUAGE sql immutable;


  CREATE FUNCTION transgroup1_loop() RETURNS void AS
  $BODY$
    DECLARE
       cp_dels integer[];
       i integer;
       max_i integer;
    BEGIN
       i:=1;
       max_i:=10; -- or 100 or more, but need some control to be secure
       LOOP
           UPDATE transgroup1
           SET items = array_uunion(transgroup1.items,t2.items),
               dels =  transgroup1.dels || t2.id
           FROM transgroup1 AS t1, transgroup1 AS t2
           WHERE transgroup1.id=t1.id AND t1.id>t2.id AND t1.items && t2.items;

           cp_dels := array(
               SELECT DISTINCT unnest(dels) FROM transgroup1
           ); -- ensures all itens to del
           EXIT WHEN i>max_i OR array_length(cp_dels,1)=0;

           DELETE FROM transgroup1 WHERE id IN (SELECT unnest(cp_dels));
           UPDATE transgroup1 SET dels=array[]::integer[];
           i:=i+1;
       END LOOP;
       UPDATE transgroup1         -- only to beautify
       SET items = ARRAY(SELECT unnest(items) ORDER BY 1 desc);
     END;
  $BODY$ LANGUAGE plpgsql VOLATILE;

当然,运行和看到的结果,你可以使用

of course, to run and see results, you can use

 SELECT transgroup1_loop(); -- not 1 day but some hours!     
 SELECT *, dense_rank() over (ORDER BY id) AS group from transgroup1;

造成

 id |   items   | ssg_label | dels | group 
----+-----------+-----------+------+-------
  4 | {8,7,4}   | 1         | {}   |     1
  5 | {9,5,2,1} | 1         | {}   |     2