SQL and the (newly discovered) OVER operator

I have a tool that takes a range of ids and processes each document with an id within that range. I wanted to run this tool in parallel with each instance processing 100,000 documents. The problem is that document ids are not contiguous. There are gaps. So a simple criteria of document id + offset is not sufficient. I needed an SQL statement that detailed contiguous range of 100,000 document ids.

As with most ostensibly complicated SQL the solution is in creating a virtual table (ie inner select) from an existing table. My virtual table needed to be sorted on document id and a sequence number given for each row. To do that I needed to use this in the column selection of the virtual table.

ROW_NUMBER() OVER( ORDER BY document_id ) - 1 row_number

I had never heard of the OVER operator. It applies a function to a row of an ordered relation. So the ROW_NUMBER() function gives the first row a value 1, the second 2, etc. Since I would be grouping using division I need the row numbers to start from 0 and so the subtraction of 1.

Now that I have row numbers and rows ordered by document id I needed to find within groups of 100,000 rows the minimum and maximum document ids. I used a simple group by row_number / 100,000 to get the row groupings. Then used MIN(document_id) to get the group's minimum document id. Same for maximum document id. The final SQL is as follows and it runs fast enough for the 50M records I needed to use it with

SELECT
  FLOOR( y.row_number / 100000 ),
  MIN( y.document_id ),
  MAX( y.document_id ),
  COUNT( * )
FROM
  (
    SELECT
      x.*,
      ROW_NUMBER() OVER( ORDER BY document_id ) - 1 row_number
    FROM
      documents x
  ) y
GROUP BY
  FLOOR( y.row_number / 100000 )
ORDER BY
  MIN( y.document_id )

Now that I have the data I can use a bash script to read the data and produce a series of commands that are feed to xargs for parallel execution.

No comments: