How to Get a Random Row from a Large Table

Last Updated March 7, 2018

Method 1, Bad: ORDER BY NEWID()

Easy to write, but it performs like hot, hot garbage because it scans the entire clustered index, calculating NEWID() on every row:

That took 6 seconds on my machine, going parallel across multiple threads, using tens of seconds of CPU for all that computing and sorting. (And the Users table isn’t even 1GB.)

Method 2, Better but Strange: TABLESAMPLE

This came out in 2005, and has a ton of gotchas. It’s kinda picking a random page, and then returning a bunch of rows from that page. The first row is kinda random, but the rest aren’t.

SELECT * FROM StackOverflow.dbo.Users TABLESAMPLE (.01 PERCENT);

1	SELECT * FROM StackOverflow.dbo.Users TABLESAMPLE (.01 PERCENT);

The plan looks like it’s doing a table scan, but it’s only doing 7 logical reads:

But here’s the results – you can see that it jumps to a random 8K page and then starts reading out rows in order. They’re not really random rows.

You can use the ROWS sample size instead, but it has some rather odd results. For example, in the Stack Overflow Users table, when I said TABLESAMPLE (50 ROWS), I actually got 75 rows back. That’s because SQL Server converts your row size over to a percentage instead.

Method 3, Best but Requires Code: Random Primary Key

Get the top ID field in the table, generate a random number, and look for that ID. Here, we’re sorting by the ID because we wanna find the top record that actually exists (whereas a random number might have been deleted.) Pretty fast, but is only good for a single random row. If you wanted 10 rows, you’d have to call code like this 10 times (or generate 10 random numbers and use an IN clause.)

/* Get a random number smaller than the table's top ID */
DECLARE @rand BIGINT;
DECLARE @maxid INT = (SELECT MAX(Id) FROM dbo.Users);
SELECT @rand = ABS((CHECKSUM(NEWID()))) % @maxid;

/* Get the first row around that ID */
SELECT TOP 1 *
FROM dbo.Users AS u
WHERE u.Id >= @rand;

/* Get a random number smaller than the table's top ID */

DECLARE @rand BIGINT;

DECLARE @maxid INT = (SELECT MAX(Id) FROM dbo.Users);

SELECT @rand = ABS((CHECKSUM(NEWID()))) % @maxid;

/* Get the first row around that ID */

SELECT TOP 1 *

FROM dbo.Users AS u

WHERE u.Id >= @rand;

The execution plan shows a clustered index scan, but it’s only grabbing one row – we’re only talking 6 logical reads for everything you see here, and it finishes near instantaneously:

There’s one gotcha: if the Id has negative numbers, it won’t work as expected. (For example, say you start your identity field at -1 and step -1, heading ever downwards, like my morals.)

Method 4, OFFSET-FETCH (2012+)

Daniel Hutmacher added this one in the comments:

DECLARE @row bigint=(
SELECT RAND(CHECKSUM(NEWID()))*SUM([rows]) FROM sys.partitions
WHERE index_id IN (0, 1) AND [object_id]=OBJECT_ID(‘dbo.thetable’));

SELECT *
FROM dbo.thetable
ORDER BY (SELECT NULL)
OFFSET @row ROWS FETCH NEXT 1 ROWS ONLY;

DECLARE @row bigint=(

SELECT RAND(CHECKSUM(NEWID()))*SUM([rows]) FROM sys.partitions

WHERE index_id IN (0, 1) AND [object_id]=OBJECT_ID(‘dbo.thetable’));

SELECT *

FROM dbo.thetable

ORDER BY (SELECT NULL)

OFFSET @row ROWS FETCH NEXT 1 ROWS ONLY;

And said, “But it only performs properly with a clustered index. I’m guessing that’s because it’ll scan for (@rows) rows in a heap instead of doing an index seek.”

Bonus Track #1: Watch Us Discussing This

Ever wonder what it’s like to be in our company’s chat room? This 10-minute Slack discussion will give you a pretty good idea:

Spoiler alert: there was not. I just took screenshots.

Bonus Track #2: Mitch Wheat Digs Deeper

Want an in-depth analysis of the randomness of several different techniques? Mitch Wheat dives really deep, complete with graphs!

[Video] Office Hours 2018/02/28 (With Transcriptions)

Troubleshooting Parameter Sniffing Issues the Right Way: Part 1

23 Comments. Leave new

Daniel Hutmacher
March 5, 2018 8:41 am

Method 4: using OFFSET-FETCH:

DECLARE @row bigint=(
SELECT RAND(CHECKSUM(NEWID()))*SUM([rows]) FROM sys.partitions
WHERE index_id IN (0, 1) AND [object_id]=OBJECT_ID(‘dbo.thetable’));

SELECT *
FROM dbo.thetable
ORDER BY (SELECT NULL)
OFFSET @row ROWS FETCH NEXT 1 ROWS ONLY;

But it only performs properly with a clustered index. I’m guessing that’s because it’ll scan for (@rows) rows in a heap instead of doing an index seek.

Reply
- Brent Ozar
  March 5, 2018 8:45 am
  
  Awesome! Added that to the post. Thanks sir!
  
  Reply
Koen Verbeeck
March 5, 2018 8:54 am

It seems that chat transcript was heavily censored 🙂

Reply
- Brent Ozar
  March 5, 2018 8:54 am
  
  Yeah, the company chat room is, ironically, not safe for work for our readers.
  
  Reply
Sean R
March 5, 2018 12:12 pm

Looks like someone came across Method 3 a little bit ago. I like it! Thanks for the post.

https://stackoverflow.com/questions/9463938/checksumnewid-executes-multiple-times-per-row

Reply
Marc Kent
March 5, 2018 1:02 pm

This should allow for gaps and always return a row regardless of min and max id values: –

DECLARE @maxid bigint, @minid bigint, @randid bigint, @rand float = rand()
SELECT @minid = MIN(Id), @maxid = MAX(Id) FROM dbo.Users
SELECT @randid = MIN(Id) FROM dbo.Users WHERE Id >= (@minid + ( (@maxid – @minid) * @rand) )
SELECT * FROM dbo.Users WHERE Id = @randid

Reply
trav
March 5, 2018 1:47 pm

I always wonder why MS doesn’t look at these things and make them built in. I’m certain it wouldn’t take much effort for a built in rand_row(#) function.

Reply
Wyatt
March 5, 2018 3:48 pm

It all depends on how random you want your results to be. With Option 2, assuming you randomly select a row from the returned page(s), and you can get the percent correct, you’ll be oversampling larger rows (and rows in pages with more free space). With Option 3, if you have any gaps in your key, you’ll oversample the rows after a gap (think identity columns where SQL Server decides to bump the identity value by 1000).

Reply
Mitch Wheat
March 5, 2018 9:03 pm

I blogged a similar topic a while back and did some basic stats testing: https://mitchwheat.com/2011/08/07/t-sql-generating-random-numbers-random-sampling-and-random-goodness/

It includes your Method 3 which I saw in a MSDN article (not sure where though)

Reply
- Brent Ozar
  March 7, 2018 6:03 am
  
  Mitch – wow, you put a ton of work into that post! Nice job. I’ve linked to that in the body of the post because it’s so interesting.
  
  Reply
Elliot
March 7, 2018 5:46 am

Seems to do the trick :
DECLARE @MxRws BIGINT, @row BIGINT;
AndOneMoreTime: /*Only if the row does not exist*/
SELECT @MxRws = IDENT_CURRENT ( ‘[TableHere]’ ); –Assume using identity on PK
SELECT @row = RAND () * @MxRws;
/*Check that the row exists and has not been deleted*/
IF (SELECT 1 FROM [TableHere] WHERE Id = @row) !=1 GOTO AndOneMoreTime: ELSE SELECT * FROM [TableHere] WHERE Id = @row;

Reply
- Brent Ozar
  March 7, 2018 6:03 am
  
  Elliot – ooo, I love that, but as I think about it, both this and method #2 fail when you have negative identity Ids. Doh!
  
  Reply
Diego
March 22, 2018 3:23 am

What about taking advantage of the pk (or index) histogram statistics?

Something like:

drop table if exists #stats

create table #stats(
id int identity(1,1) primary key,
RANGE_HI_KEY int,
RANGE_ROWS int,
EQ_ROWS numeric(10,2),
DISTINST_RANGE_ROWS int,
AVG_RANGE_ROWS int
)
insert into #stats
exec(‘dbcc show_statistics(”Users”,”IX_ID”) WITH HISTOGRAM’)

declare @rows int
select @rows=count(*) from #stats

declare @id int =cast(round((@rows-1)* Rand()+1,0) as integer)
declare @low int
declare @high int

select top 1 @low=a.range_hi_key , @high=lead(range_hi_key,1) over (order by id) from #stats a where a.id between @id and @id+1
order by id

select top 1 * from Users where ID >= @low+cast(round(((@high-@low)-1)* Rand()+1,0) as integer)

Reply
Sean Redmond
March 22, 2018 3:59 am

I have a quick’n’dirty query that I use to get semi-random values from a table. I use time. It depends, of course, on whether one needs a result every execution, how many gaps in the table there are, how representative it has to be and how large the table is.
declare @gosh int = datepart( hour, getdate() ) * datepart( minute, getdate() ) * datepart( s, getdate() ) * datepart( ms, getdate() );
select *
from schemaname.BigTable
where BigTablePK = @gosh
;
This allows me to address a row with 41 million rows. It is not especially repesentative but it does give me new rows on each execution and it is quite fast.

Reply
Hanahan Kim
September 20, 2018 12:21 am

Maybe this is obvious, but I think method 3 is not completely random. What if there are only 2 rows with the id of 1 and 100? “100” will be selected with 99% probability. Can it be improved?

Reply
Bruce Lindman
September 20, 2018 12:24 pm

Dismiss it, but ORDER BY NEW_ID() appears to be the only way to get a truly random sample set (more than one record) in your list of methods.

Reply
KondzioSSJ4
September 14, 2020 7:09 am

Am I the only person who thinks that 3rd or 4th case is bad if PK key has a large number of big gaps?

because in case of big gaps in PK, it means that the probability of receiving is very large compared to another rows..

e.g.

You have a table for 1’000’000 rows, then you delete 700’000 of first rows

so the probability to receive that 1st rows after from table (after deletion)
is ~70%, where 30% “probability” is spread across another row

Am I correct?

Reply
- neverrr neverrr
  September 23, 2020 6:47 am
  
  Yeah, method 3 is not random at all, just imagine a table with UserId = {1, 2, 99999 ,100000}. There 99999 will b picked close to always.
  
  To pick a decently random record you need at least a COUNT() of the table, and that requires a table scan. Its the scan that takes time, not the calculation of NewId().
  
  Reply
  - neverrr neverrr
    September 23, 2020 8:27 am
    
    Well, i guess it’s the sorting
    
    Reply
neverrr neverrr
September 23, 2020 8:18 am

Giving each row a random value is the only right way to do random selection unless you already have them numbered without gaps.

If you know you have millions of rows you can filter out maybe 999/1000 of the users and then sort the rest:

SELECT TOP (1)
U.*
FROM dbo.Users U
CROSS APPLY(SELECT CHECKSUM(NEWID())) R(Rnd)
WHERE
R.Rnd%1000=0
ORDER BY
NEWID()

This code should be a lot faster than if you drop the CROSS APPLY and WHERE.

Also, don’t use ABS(CHECKSUM(NEWID())) in production.

Reply
- Rahul
  March 22, 2021 10:43 am
  
  This works perfect for me.!! thanks “neverrr neverrr”
  
  Reply
Sid
January 20, 2022 9:51 pm

This post and comments have helped me thus far. My business case is a tad different. I need a random sample of claims that represent 2.5% of the population. So, if my population contains 3,517 distinct records, I need to return, consistently, 88 distinct records. Here’s what I’ve done thus far:

DECLARE @FirstofLastMonth date = DATEADD(DD,1,EOMONTH(Getdate(),-2)),
@EndofLastMonth date = EOMONTH(Getdate(), -1)

DECLARE @Sample table (ID uniqueidentifier, ClaimNO varchar(50), Company_ID varchar(10), FinancialResp varchar(30), ProviderType varchar(50),
DateOfService date, ClaimType varchar(20), TotalBilled numeric(10,2), TotalPaid numeric(10,2) )
Insert into @Sample (ID, ClaimNo,Company_ID, FinancialResp, ProviderType, DateOfService, ClaimType, TotalBilled, TotalPaid)
Select distinct NEWID() as ID, cd.ClaimNo, cd.Company_ID, cd.FinancialResp,
CASE
when cd.FinancialResp in (‘KEYMA’, ‘SHP’, ‘SFKD’) and lob.LOB_Network = ‘1’ then ‘Contracted’
when cd.FinancialResp in (‘KEYMA’, ‘SHP’, ‘SFKD’) and lob.LOB_Network = ‘0’ then ‘Non-Contracted’
when cd.FinancialResp = ‘KDMA’ and pu.UDF_Value = ‘Y’ then ‘Contracted’
when cd.FinancialResp = ‘KDMA’ and pu.UDF_Value ‘Y’ then ‘Non-Contracted’
else ‘Missing’
end as ProviderType,
Format( Min(cd.FromDateSvc), ‘MM/dd/yyyy’) as DateOfService,
CASE
when substring(cd.ClaimNo, 9,1) = ‘8’ then ‘Institutional’
when substring(cd.ClaimNO, 9,1) = ‘9’ then ‘Professional’
end as Claim_Type,
sum(cd.Billed) as TotalBilled,
SUM(cd.Net) as TotalPaid
from Claim_Details cd
inner join Claim_Masters_V cm on cd.ClaimNo = cm.ClaimNo and cd.Company_ID = cm.Company_ID
inner join Prov_VendInfo pv on cm.Prov_KeyID = pv.Prov_KeyID and cm.Company_ID = pv.Company_ID
inner join Vend_Masters vm on pv.Vendor = vm.VendorID
and pv.Company_ID = vm.Environment_ID
inner join Prov_Master_V pm on cm.Prov_KeyID = pm.Prov_KeyID
and cm.Company_ID = pm.Company_ID
and pv.Default_YN = 1
inner join BM_Prov_HP_LineOfBuss lob on cm.Prov_KeyID = lob.Prov_KeyID
and cm.Company_ID = lob.Company_ID
and cm.HPCode = lob.HPCode
and pv.Vendor = lob.Vendor
and lob.LOB_Start = CAST(CHECKSUM(NEWID(),ID) & 0x7fffffff AS float) / CAST (0x7fffffff AS int)

When I run the above query repeatedly, the results vary in number from around 100 to the upper 80’s and lower 90’s. How do I refine the above to ensure that just 2.5% of the qualifying records are returned in a random sample?

Reply
- Brent Ozar
  January 20, 2022 9:52 pm
  
  For personalized help, click Consulting at the top of the screen.
  
  Reply