I have a tool that takes a range of ids and processes each document whose id falls within that range. I wanted to run this tool in parallel, with each instance processing 100,000 documents. The problem is that document ids are not contiguous. There are gaps. So a simple criterion of document id + offset is not sufficient. I needed an SQL statement that listed contiguous ranges of 100,000 document ids.
As with most ostensibly complicated SQL, the solution lies in creating a virtual table (i.e. an inner select) from an existing table. My virtual table needed to be sorted on document id, with a sequence number assigned to each row. To do that I used this in the column selection of the virtual table:
ROW_NUMBER() OVER( ORDER BY document_id ) - 1 row_number
I had never heard of the OVER clause. It turns a function into a window function, applied to each row of an ordered relation. The ROW_NUMBER() function gives the first row a value of 1, the second 2, and so on. Since I would be grouping using integer division, I needed the row numbers to start from 0, hence the subtraction of 1.
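To see the numbering in isolation, here is a toy sketch run against SQLite (which supports window functions from version 3.25); the single-column documents table is an assumed stand-in for the real schema:

```python
import sqlite3

# In-memory table with deliberately non-contiguous document ids.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE documents (document_id INTEGER PRIMARY KEY)")
conn.executemany("INSERT INTO documents VALUES (?)",
                 [(3,), (7,), (20,), (21,), (95,)])

# ROW_NUMBER() OVER (ORDER BY ...) numbers rows 1, 2, 3, ...;
# subtracting 1 gives the zero-based sequence needed for grouping.
rows = conn.execute("""
    SELECT document_id,
           ROW_NUMBER() OVER (ORDER BY document_id) - 1 AS row_number
    FROM documents
""").fetchall()

print(rows)  # [(3, 0), (7, 1), (20, 2), (21, 3), (95, 4)]
```

The gaps in the ids do not matter: the sequence numbers are dense regardless.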
Now that I had row numbers and rows ordered by document id, I needed to find the minimum and maximum document ids within each group of 100,000 rows. A simple GROUP BY row_number / 100000 gives the row groupings. Then MIN(document_id) gives the group's minimum document id, and MAX(document_id) the maximum. The final SQL is as follows, and it runs fast enough for the 50M records I needed it for:
SELECT
    FLOOR( y.row_number / 100000 ),
    MIN( y.document_id ),
    MAX( y.document_id ),
    COUNT( * )
FROM (
    SELECT x.*, ROW_NUMBER() OVER( ORDER BY document_id ) - 1 row_number
    FROM documents x
) y
GROUP BY FLOOR( y.row_number / 100000 )
ORDER BY MIN( y.document_id )
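The same query at toy scale, with an assumed one-column schema and a batch size of 2 instead of 100,000, so the group boundaries are easy to check by eye. FLOOR is dropped here because SQLite's integer division already truncates:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE documents (document_id INTEGER PRIMARY KEY)")
conn.executemany("INSERT INTO documents VALUES (?)",
                 [(3,), (7,), (20,), (21,), (95,)])

# Inner select assigns zero-based row numbers; the outer query groups
# them in batches of 2 and reports each batch's id range and size.
ranges = conn.execute("""
    SELECT y.row_number / 2 AS grp,
           MIN( y.document_id ), MAX( y.document_id ), COUNT( * )
    FROM (
        SELECT x.*, ROW_NUMBER() OVER( ORDER BY document_id ) - 1 row_number
        FROM documents x
    ) y
    GROUP BY grp
    ORDER BY MIN( y.document_id )
""").fetchall()

print(ranges)  # [(0, 3, 7, 2), (1, 20, 21, 2), (2, 95, 95, 1)]
```

Each output row is one batch: its group number, the first and last document id in the batch, and how many documents it holds. Note the last batch can be smaller than the batch size.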
Now that I have the data, I can use a bash script to read it and produce a series of commands that are fed to xargs for parallel execution.
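A minimal sketch of that last step, assuming each line of a ranges file holds the min and max id of one batch; process_docs is a hypothetical name for the tool, stubbed here with echo so the sketch runs anywhere:

```shell
# One "min max" pair per line, as exported from the query above.
printf '%s\n' '3 7' '20 21' '95 95' > ranges.txt

# xargs -n 2 passes one pair per invocation; -P 4 runs up to four
# instances in parallel. Drop the echo to run the real tool.
xargs -n 2 -P 4 echo process_docs < ranges.txt
```

With -P greater than 1 the output order is not deterministic, which is fine here since each batch is independent.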