How to Get a Random Row from a Large Table

Method 1, Bad: ORDER BY NEWID()

Easy to write, but it performs like hot, hot garbage because it scans the entire clustered index, calculating NEWID() on every row:
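The pattern is the classic one-liner (shown here as a sketch against the Stack Overflow dbo.Users table that the demos in this post use):

SELECT TOP 1 *
FROM dbo.Users
ORDER BY NEWID();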

The plan with the scan

That took 6 seconds on my machine, going parallel across multiple threads, using tens of seconds of CPU for all that computing and sorting. (And the Users table isn’t even 1GB.)

Method 2, Better but Strange: TABLESAMPLE

This came out in SQL Server 2005, and it has a ton of gotchas. It’s essentially picking a random page, and then returning a bunch of rows from that page. The first row is kinda random, but the rest aren’t.
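The syntax looks something like this (the 0.01 percent sample size here is an arbitrary pick for illustration, not necessarily what produced the plan below):

SELECT *
FROM dbo.Users TABLESAMPLE (0.01 PERCENT);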

The plan looks like it’s doing a table scan, but it’s only doing 7 logical reads:

The plan with the fake scan

But here are the results. You can see that it jumps to a random 8K page and then starts reading out rows in order. They’re not really random rows.

Random like mafia lottery numbers

You can use the ROWS sample size instead, but it has some rather odd results. For example, in the Stack Overflow Users table, when I said TABLESAMPLE (50 ROWS), I actually got 75 rows back. That’s because SQL Server converts your requested row count to a percentage instead.
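For reference, the ROWS form of that query would look something like this:

SELECT *
FROM dbo.Users TABLESAMPLE (50 ROWS);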

Method 3, Best but Requires Code: Random Primary Key

Get the top ID field in the table, generate a random number up to that ID, and look for it. Here, we’re sorting by the ID because we wanna find the top record that actually exists (whereas the row with that exact random number might have been deleted.) Pretty fast, but it’s only good for a single random row. If you wanted 10 rows, you’d have to call code like this 10 times (or generate 10 random numbers and use an IN clause.)
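Here’s a minimal sketch of that pattern (it assumes dbo.Users has a positive, ascending clustered primary key on Id, and it isn’t necessarily the exact demo query behind the plan below):

/* Grab the highest Id, then pick a random number between 1 and that Id. */
DECLARE @maxid BIGINT = (SELECT MAX(Id) FROM dbo.Users);
DECLARE @randid BIGINT = CEILING(RAND(CHECKSUM(NEWID())) * @maxid);

/* Return the first row at or after that Id, in case that exact Id was deleted. */
SELECT TOP 1 *
FROM dbo.Users
WHERE Id >= @randid
ORDER BY Id;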

The execution plan shows a clustered index scan, but it’s only grabbing one row – we’re only talking 6 logical reads for everything you see here, and it finishes near instantaneously:

The plan that can

There’s one gotcha: if the Id has negative numbers, it won’t work as expected. (For example, say you start your identity field at -1 and step -1, heading ever downwards, like my morals.)

Method 4, OFFSET-FETCH (2012+)

Daniel Hutmacher added this one in the comments:
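(Reproduced from his comment below, with dbo.thetable standing in for your own table:)

DECLARE @row bigint = (
SELECT RAND(CHECKSUM(NEWID())) * SUM([rows]) FROM sys.partitions
WHERE index_id IN (0, 1) AND [object_id] = OBJECT_ID('dbo.thetable'));

SELECT *
FROM dbo.thetable
ORDER BY (SELECT NULL)
OFFSET @row ROWS FETCH NEXT 1 ROWS ONLY;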

And said, “But it only performs properly with a clustered index. I’m guessing that’s because it’ll scan for (@rows) rows in a heap instead of doing an index seek.”

Bonus Track #1: Watch Us Discussing This

Ever wonder what it’s like to be in our company’s chat room? This 10-minute Slack discussion will give you a pretty good idea:

Spoiler alert: there was not. I just took screenshots.

Bonus Track #2: Mitch Wheat Digs Deeper

Want an in-depth analysis of the randomness of several different techniques? Mitch Wheat dives really deep, complete with graphs!


23 Comments

  • Daniel Hutmacher
    March 5, 2018 8:41 am

    Method 4: using OFFSET-FETCH:

    DECLARE @row bigint=(
    SELECT RAND(CHECKSUM(NEWID()))*SUM([rows]) FROM sys.partitions
    WHERE index_id IN (0, 1) AND [object_id]=OBJECT_ID('dbo.thetable'));

    SELECT *
    FROM dbo.thetable
    ORDER BY (SELECT NULL)
    OFFSET @row ROWS FETCH NEXT 1 ROWS ONLY;

    But it only performs properly with a clustered index. I’m guessing that’s because it’ll scan for (@rows) rows in a heap instead of doing an index seek.

    Reply
  • It seems that chat transcript was heavily censored 🙂

    Reply
  • Looks like someone came across Method 3 a little bit ago. I like it! Thanks for the post.

    https://stackoverflow.com/questions/9463938/checksumnewid-executes-multiple-times-per-row

    Reply
  • This should allow for gaps and always return a row regardless of min and max id values:

    DECLARE @maxid bigint, @minid bigint, @randid bigint, @rand float = rand()
    SELECT @minid = MIN(Id), @maxid = MAX(Id) FROM dbo.Users
    SELECT @randid = MIN(Id) FROM dbo.Users WHERE Id >= (@minid + ( (@maxid - @minid) * @rand) )
    SELECT * FROM dbo.Users WHERE Id = @randid

    Reply
  • I always wonder why MS doesn’t look at these things and make them built in. I’m certain it wouldn’t take much effort for a built-in rand_row(#) function.

    Reply
  • It all depends on how random you want your results to be. With Option 2, assuming you randomly select a row from the returned page(s), and you can get the percent correct, you’ll be oversampling larger rows (and rows in pages with more free space). With Option 3, if you have any gaps in your key, you’ll oversample the rows after a gap (think identity columns where SQL Server decides to bump the identity value by 1000).

    Reply
  • I blogged a similar topic a while back and did some basic stats testing: https://mitchwheat.com/2011/08/07/t-sql-generating-random-numbers-random-sampling-and-random-goodness/

    It includes your Method 3, which I saw in an MSDN article (not sure where, though)

    Reply
  • Seems to do the trick:
    DECLARE @MxRws BIGINT, @row BIGINT;
    AndOneMoreTime: /*Only if the row does not exist*/
    SELECT @MxRws = IDENT_CURRENT ( '[TableHere]' ); --Assume using identity on PK
    SELECT @row = RAND () * @MxRws;
    /*Check that the row exists and has not been deleted*/
    IF NOT EXISTS (SELECT 1 FROM [TableHere] WHERE Id = @row) GOTO AndOneMoreTime ELSE SELECT * FROM [TableHere] WHERE Id = @row;

    Reply
  • What about taking advantage of the pk (or index) histogram statistics?

    Something like:

    drop table if exists #stats

    create table #stats(
    id int identity(1,1) primary key,
    RANGE_HI_KEY int,
    RANGE_ROWS int,
    EQ_ROWS numeric(10,2),
    DISTINCT_RANGE_ROWS int,
    AVG_RANGE_ROWS int
    )
    insert into #stats
    exec('dbcc show_statistics(''Users'',''IX_ID'') WITH HISTOGRAM')

    declare @rows int
    select @rows=count(*) from #stats

    declare @id int =cast(round((@rows-1)* Rand()+1,0) as integer)
    declare @low int
    declare @high int

    select top 1 @low=a.range_hi_key , @high=lead(range_hi_key,1) over (order by id) from #stats a where a.id between @id and @id+1
    order by id

    select top 1 * from Users where ID >= @low+cast(round(((@high-@low)-1)* Rand()+1,0) as integer)

    Reply
  • Sean Redmond
    March 22, 2018 3:59 am

    I have a quick’n’dirty query that I use to get semi-random values from a table. I use time. It depends, of course, on whether one needs a result every execution, how many gaps in the table there are, how representative it has to be and how large the table is.
    declare @gosh int = datepart( hour, getdate() ) * datepart( minute, getdate() ) * datepart( s, getdate() ) * datepart( ms, getdate() );
    select *
    from schemaname.BigTable
    where BigTablePK = @gosh
    ;
    This allows me to address a row in a table with 41 million rows. It is not especially representative, but it does give me new rows on each execution and it is quite fast.

    Reply
  • Maybe this is obvious, but I think method 3 is not completely random. What if there are only 2 rows with the id of 1 and 100? “100” will be selected with 99% probability. Can it be improved?

    Reply
  • Dismiss it, but ORDER BY NEWID() appears to be the only way to get a truly random sample set (more than one record) in your list of methods.

    Reply
  • Am I the only person who thinks the 3rd or 4th method is bad if the PK has a large number of big gaps?

    With big gaps in the PK, the rows right after a gap are far more likely to be returned than the other rows.

    e.g.

    You have a table with 1,000,000 rows, then you delete the first 700,000 rows,

    so the probability of receiving the first remaining row (after the deletion)
    is ~70%, while the other 30% "probability" is spread across all the remaining rows.

    Am I correct?

    Reply
    • neverrr neverrr
      September 23, 2020 6:47 am

      Yeah, method 3 is not random at all. Just imagine a table with UserId = {1, 2, 99999, 100000}. There, 99999 will be picked close to always.

      To pick a decently random record you need at least a COUNT() of the table, and that requires a table scan. It’s the scan that takes time, not the calculation of NewId().

      Reply
  • neverrr neverrr
    September 23, 2020 8:18 am

    Giving each row a random value is the only right way to do random selection unless you already have them numbered without gaps.

    If you know you have millions of rows you can filter out maybe 999/1000 of the users and then sort the rest:

    SELECT TOP (1)
    U.*
    FROM dbo.Users U
    CROSS APPLY(SELECT CHECKSUM(NEWID())) R(Rnd)
    WHERE
    R.Rnd%1000=0
    ORDER BY
    NEWID()

    This code should be a lot faster than if you drop the CROSS APPLY and WHERE.

    Also, don’t use ABS(CHECKSUM(NEWID())) in production.

    Reply
  • This post and comments have helped me thus far. My business case is a tad different. I need a random sample of claims that represent 2.5% of the population. So, if my population contains 3,517 distinct records, I need to return, consistently, 88 distinct records. Here’s what I’ve done thus far:

    DECLARE @FirstofLastMonth date = DATEADD(DD,1,EOMONTH(Getdate(),-2)),
    @EndofLastMonth date = EOMONTH(Getdate(), -1)

    DECLARE @Sample table (ID uniqueidentifier, ClaimNO varchar(50), Company_ID varchar(10), FinancialResp varchar(30), ProviderType varchar(50),
    DateOfService date, ClaimType varchar(20), TotalBilled numeric(10,2), TotalPaid numeric(10,2) )
    Insert into @Sample (ID, ClaimNo,Company_ID, FinancialResp, ProviderType, DateOfService, ClaimType, TotalBilled, TotalPaid)
    Select distinct NEWID() as ID, cd.ClaimNo, cd.Company_ID, cd.FinancialResp,
    CASE
    when cd.FinancialResp in ('KEYMA', 'SHP', 'SFKD') and lob.LOB_Network = '1' then 'Contracted'
    when cd.FinancialResp in ('KEYMA', 'SHP', 'SFKD') and lob.LOB_Network = '0' then 'Non-Contracted'
    when cd.FinancialResp = 'KDMA' and pu.UDF_Value = 'Y' then 'Contracted'
    when cd.FinancialResp = 'KDMA' and pu.UDF_Value <> 'Y' then 'Non-Contracted'
    else 'Missing'
    end as ProviderType,
    Format( Min(cd.FromDateSvc), 'MM/dd/yyyy') as DateOfService,
    CASE
    when substring(cd.ClaimNo, 9,1) = '8' then 'Institutional'
    when substring(cd.ClaimNo, 9,1) = '9' then 'Professional'
    end as Claim_Type,
    sum(cd.Billed) as TotalBilled,
    SUM(cd.Net) as TotalPaid
    from Claim_Details cd
    inner join Claim_Masters_V cm on cd.ClaimNo = cm.ClaimNo and cd.Company_ID = cm.Company_ID
    inner join Prov_VendInfo pv on cm.Prov_KeyID = pv.Prov_KeyID and cm.Company_ID = pv.Company_ID
    inner join Vend_Masters vm on pv.Vendor = vm.VendorID
    and pv.Company_ID = vm.Environment_ID
    inner join Prov_Master_V pm on cm.Prov_KeyID = pm.Prov_KeyID
    and cm.Company_ID = pm.Company_ID
    and pv.Default_YN = 1
    inner join BM_Prov_HP_LineOfBuss lob on cm.Prov_KeyID = lob.Prov_KeyID
    and cm.Company_ID = lob.Company_ID
    and cm.HPCode = lob.HPCode
    and pv.Vendor = lob.Vendor
    and lob.LOB_Start = CAST(CHECKSUM(NEWID(),ID) & 0x7fffffff AS float) / CAST (0x7fffffff AS int)

    When I run the above query repeatedly, the results vary in number from around 100 to the upper 80’s and lower 90’s. How do I refine the above to ensure that just 2.5% of the qualifying records are returned in a random sample?

    Reply
