How to Get a Random Row from a Large Table

Method 1, Bad: ORDER BY NEWID()

Easy to write, but it performs like hot, hot garbage because it scans the entire clustered index, calculating NEWID() on every row:

The plan with the scan

That took 6 seconds on my machine, going parallel across multiple threads, using tens of seconds of CPU for all that computing and sorting. (And the Users table isn’t even 1GB.)

Method 2, Better but Strange: TABLESAMPLE

This came out in 2005, and has a ton of gotchas. It’s kinda picking a random page, and then returning a bunch of rows from that page. The first row is kinda random, but the rest aren’t.

The plan looks like it’s doing a table scan, but it’s only doing 7 logical reads:

The plan with the fake scan

But here’s the results – you can see that it jumps to a random 8K page and then starts reading out rows in order. They’re not really random rows.

Random like mafia lottery numbers

You can use the ROWS sample size instead, but it has some rather odd results. For example, in the Stack Overflow Users table, when I said TABLESAMPLE (50 ROWS), I actually got 75 rows back. That’s because SQL Server converts your row size over to a percentage instead.

Method 3, Best but Requires Code: Random Primary Key

Get the top ID field in the table, generate a random number, and look for that ID. Here, we’re sorting by the ID because we wanna find the top record that actually exists (whereas a random number might have been deleted.) Pretty fast, but is only good for a single random row. If you wanted 10 rows, you’d have to call code like this 10 times (or generate 10 random numbers and use an IN clause.)

The execution plan shows a clustered index scan, but it’s only grabbing one row – we’re only talking 6 logical reads for everything you see here, and it finishes near instantaneously:

The plan that can

There’s one gotcha: if the Id has negative numbers, it won’t work as expected. (For example, say you start your identity field at -1 and step -1, heading ever downwards, like my morals.)

Method 4, OFFSET-FETCH (2012+)

Daniel Hutmacher added this one in the comments:

And said, “But it only performs properly with a clustered index. I’m guessing that’s because it’ll scan for (@rows) rows in a heap instead of doing an index seek.”

Bonus Track #1: Watch Us Discussing This

Ever wonder what it’s like to be in our company’s chat room? This 10-minute Slack discussion will give you a pretty good idea:

Spoiler alert: there was not. I just took screenshots.

Bonus Track #2: Mitch Wheat Digs Deeper

Want an in-depth analysis of the randomness of several different techniques? Mitch Wheat dives really deep, complete with graphs!

Previous Post
[Video] Office Hours 2018/02/28 (With Transcriptions)
Next Post
Troubleshooting Parameter Sniffing Issues the Right Way: Part 1

16 Comments. Leave new

  • Daniel Hutmacher
    March 5, 2018 8:41 am

    Method 4: using OFFSET-FETCH:

    DECLARE @row bigint=(
    SELECT RAND(CHECKSUM(NEWID()))*SUM([rows]) FROM sys.partitions
    WHERE index_id IN (0, 1) AND [object_id]=OBJECT_ID(‘dbo.thetable’));

    SELECT *
    FROM dbo.thetable
    ORDER BY (SELECT NULL)
    OFFSET @row ROWS FETCH NEXT 1 ROWS ONLY;

    But it only performs properly with a clustered index. I’m guessing that’s because it’ll scan for (@rows) rows in a heap instead of doing an index seek.

    Reply
  • It seems that chat transcript was heavily censored 🙂

    Reply
  • Looks like someone came across Method 3 a little bit ago. I like it! Thanks for the post.

    https://stackoverflow.com/questions/9463938/checksumnewid-executes-multiple-times-per-row

    Reply
  • This should allow for gaps and always return a row regardless of min and max id values: –

    DECLARE @maxid bigint, @minid bigint, @randid bigint, @rand float = rand()
    SELECT @minid = MIN(Id), @maxid = MAX(Id) FROM dbo.Users
    SELECT @randid = MIN(Id) FROM dbo.Users WHERE Id >= (@minid + ( (@maxid – @minid) * @rand) )
    SELECT * FROM dbo.Users WHERE Id = @randid

    Reply
  • I always wonder why MS doesn’t look at these things and make them built in. I’m certain it wouldn’t take much effort for a built in rand_row(#) function.

    Reply
  • It all depends on how random you want your results to be. With Option 2, assuming you randomly select a row from the returned page(s), and you can get the percent correct, you’ll be oversampling larger rows (and rows in pages with more free space). With Option 3, if you have any gaps in your key, you’ll oversample the rows after a gap (think identity columns where SQL Server decides to bump the identity value by 1000).

    Reply
  • I blogged a similar topic a while back and did some basic stats testing: https://mitchwheat.com/2011/08/07/t-sql-generating-random-numbers-random-sampling-and-random-goodness/

    It includes your Method 3 which I saw in a MSDN article (not sure where though)

    Reply
  • Seems to do the trick :
    DECLARE @MxRws BIGINT, @row BIGINT;
    AndOneMoreTime: /*Only if the row does not exist*/
    SELECT @MxRws = IDENT_CURRENT ( ‘[TableHere]’ ); –Assume using identity on PK
    SELECT @row = RAND () * @MxRws;
    /*Check that the row exists and has not been deleted*/
    IF (SELECT 1 FROM [TableHere] WHERE Id = @row) !=1 GOTO AndOneMoreTime: ELSE SELECT * FROM [TableHere] WHERE Id = @row;

    Reply
  • What about taking advantage of the pk (or index) histogram statistics?

    Something like:

    drop table if exists #stats

    create table #stats(
    id int identity(1,1) primary key,
    RANGE_HI_KEY int,
    RANGE_ROWS int,
    EQ_ROWS numeric(10,2),
    DISTINST_RANGE_ROWS int,
    AVG_RANGE_ROWS int
    )
    insert into #stats
    exec(‘dbcc show_statistics(”Users”,”IX_ID”) WITH HISTOGRAM’)

    declare @rows int
    select @rows=count(*) from #stats

    declare @id int =cast(round((@rows-1)* Rand()+1,0) as integer)
    declare @low int
    declare @high int

    select top 1 @low=a.range_hi_key , @high=lead(range_hi_key,1) over (order by id) from #stats a where a.id between @id and @id+1
    order by id

    select top 1 * from Users where ID >= @low+cast(round(((@high-@low)-1)* Rand()+1,0) as integer)

    Reply
  • Sean Redmond
    March 22, 2018 3:59 am

    I have a quick’n’dirty query that I use to get semi-random values from a table. I use time. It depends, of course, on whether one needs a result every execution, how many gaps in the table there are, how representative it has to be and how large the table is.
    declare @gosh int = datepart( hour, getdate() ) * datepart( minute, getdate() ) * datepart( s, getdate() ) * datepart( ms, getdate() );
    select *
    from schemaname.BigTable
    where BigTablePK = @gosh
    ;
    This allows me to address a row with 41 million rows. It is not especially repesentative but it does give me new rows on each execution and it is quite fast.

    Reply
  • Maybe this is obvious, but I think method 3 is not completely random. What if there are only 2 rows with the id of 1 and 100? “100” will be selected with 99% probability. Can it be improved?

    Reply
  • Dismiss it, but ORDER BY NEW_ID() appears to be the only way to get a truly random sample set (more than one record) in your list of methods.

    Reply

Leave a Reply

Your email address will not be published. Required fields are marked *

Fill out this field
Fill out this field
Please enter a valid email address.

Menu
{"cart_token":"","hash":"","cart_data":""}