Product has been added to your cart.

Getting the most recent record

Last Updated 9 years ago

Sounds simple, right? Just grab the max. But what if you want to use a single T-SQL operation to fetch the most recent record and some of its attributes?

StorminSpank asked:

@BrentO SQL Question. 4 columns, ID, Date, Info1, Info2. ID has multiple entries. Want Latest Date for Distinct ID, but also info1 and 2.

Let’s say our table schema is:

CREATE TABLE [dbo].[TestTable](
  [id] [int] NOT NULL,
  [create_date] [date] NOT NULL,
  [info1] [varchar](50) NOT NULL,
  [info2] [varchar](50) NOT NULL,
 CONSTRAINT [PK_TestTable] PRIMARY KEY CLUSTERED
(
  [id] ASC, [create_date] ASC
) ON [PRIMARY]
) ON [PRIMARY]
GO

1

2

3

4

5

6

7

8

9

10

11

CREATE TABLE [dbo].[TestTable](

[id] [int] NOT NULL,

[create_date] [date] NOT NULL,

[info1] [varchar](50) NOT NULL,

[info2] [varchar](50) NOT NULL,

CONSTRAINT [PK_TestTable] PRIMARY KEY CLUSTERED

(

[id] ASC, [create_date] ASC

) ON [PRIMARY]

) ON [PRIMARY]

GO

And let's populate it with some data:

1	And let's populate it with some data:

INSERT INTO dbo.TestTable (id, create_date, info1, info2)
VALUES (1, '1/1/09', 'Blue', 'Green')
INSERT INTO dbo.TestTable (id, create_date, info1, info2)
VALUES (1, '1/2/09', 'Red', 'Yellow')
INSERT INTO dbo.TestTable (id, create_date, info1, info2)
VALUES (1, '1/3/09', 'Orange', 'Purple')
INSERT INTO dbo.TestTable (id, create_date, info1, info2)
VALUES (2, '1/1/09', 'Yellow', 'Blue')
INSERT INTO dbo.TestTable (id, create_date, info1, info2)
VALUES (2, '1/5/09', 'Blue', 'Orange')
INSERT INTO dbo.TestTable (id, create_date, info1, info2)
VALUES (3, '1/2/09', 'Green', 'Purple')
INSERT INTO dbo.TestTable (id, create_date, info1, info2)
VALUES (3, '1/8/09', 'Red', 'Blue')

1

2

3

4

5

6

7

8

9

10

11

12

13

14

INSERT INTO dbo.TestTable (id, create_date, info1, info2)

VALUES (1, '1/1/09', 'Blue', 'Green')

INSERT INTO dbo.TestTable (id, create_date, info1, info2)

VALUES (1, '1/2/09', 'Red', 'Yellow')

INSERT INTO dbo.TestTable (id, create_date, info1, info2)

VALUES (1, '1/3/09', 'Orange', 'Purple')

INSERT INTO dbo.TestTable (id, create_date, info1, info2)

VALUES (2, '1/1/09', 'Yellow', 'Blue')

INSERT INTO dbo.TestTable (id, create_date, info1, info2)

VALUES (2, '1/5/09', 'Blue', 'Orange')

INSERT INTO dbo.TestTable (id, create_date, info1, info2)

VALUES (3, '1/2/09', 'Green', 'Purple')

INSERT INTO dbo.TestTable (id, create_date, info1, info2)

VALUES (3, '1/8/09', 'Red', 'Blue')

Now let’s get out the most recent record for each ID:

SELECT tt.*
  FROM dbo.TestTable tt
    LEFT OUTER JOIN dbo.TestTable ttNewer
    ON tt.id = ttNewer.id AND tt.create_date &lt; ttNewer.create_date
  WHERE ttNewer.id IS NULL

1

2

3

4

5

SELECT tt.*

FROM dbo.TestTable tt

LEFT OUTER JOIN dbo.TestTable ttNewer

ON tt.id = ttNewer.id AND tt.create_date < ttNewer.create_date

WHERE ttNewer.id IS NULL

That left outer join is looking for any newer TestTable records, but the where clause makes sure there aren’t any. If you switch it around to “IS NOT NULL”, you’d get the exact opposite – you’d get all of the older records.

And remember, ladies and gentlemen, do as I say and not as I do – never SELECT *.

Certifications are the icing on the cake

Virtualization Survey

25 Comments. Leave new

Rhys
April 14, 2009 4:14 pm

A common one in technical tests. I’ve used CTE’s for this before. This way is probably better.

Reply
storminspank
April 14, 2009 4:55 pm

Fantastic! Thanks for the great advice!!!

Reply
Jay Allen
April 14, 2009 9:29 pm

Same here Rhys, I would use a CTE, or before that table variables to collect the “current” rows, then join again on the table to get the other columns (…and for those out there old enough to remember SQL Server 6.5 or before, ugly subselects) Not only is this less code, but also fewer disk i/o’s. Run the example and take a look at the disk activity using Performance Monitor. Yes Brent, I had to give it a test! 🙂

Reply
- Brent Ozar
  April 15, 2009 6:41 am
  
  Thanks, Jay – that’s really interesting. Can you post your example with less code that uses fewer disk i/o’s? I’d love to do a followup on it to show a better way to do it.
  
  Reply
David Stein
April 15, 2009 9:59 am

I don’t think CTE’s are available in SQL 2000, so this is similar to the way I’ve handled this. I can’t remember where I found the code snippet.

Reply
Jay Allen
April 15, 2009 11:07 am

Brent,

I was actually referring to your example as the one with “less code”. Here are examples of the various way to perform the task. Of each of them, it seems to me your example is the best both in terms of simplicity and performance. I do a lot of bi-temporal database modeling and development so this is a very common task for me to perform. Thanks again and keep the tips and tricks coming!

Jay

–First I setup a counter log on the SQL Server with
–several counters for both physical and logical disks, as well as memory.
–Disk Reads/sec, Avg. Disk sec/Read, etc.
–These counters were set to read only one physical disk, which had only one partition (thus one logical)
–For the test, this disk contained both the data and log file for the database.
–System database files were on another disk and no other files were on this disk.
–Since I set the counter log top use an SQL database, the following query
–was used to read the values of the counters after they were written…
SELECT cd1.ObjectName, cd1.CounterName,
cd0.CounterDateTime, cd0.CounterValue,
cd0.FirstValueA, cd0.FirstValueB,
cd0.SecondValueA, cd0.SecondValueB,
cd0.MultiCount
FROM dbo.CounterData cd0
INNER JOIN dbo.CounterDetails cd1
ON cd0.CounterId = cd1.CounterId
WHERE cd0.CounterValue > 0
ORDER BY cd0.CounterDateTime DESC
–The different examples of getting the most recent record are shown below.
–The two tables created in this example are very similar to the structure
–where I ran the test, except the dbo.Customer table had just over 20K rows
–and the dbo.CustomerStatus table contained just over 2,100,000 rows.

–Create our sample tables…
IF EXISTS (SELECT * FROM sys.objects
WHERE object_id = OBJECT_ID(N'[dbo].[Customer]’)
AND type in (N’U’))
DROP TABLE [dbo].[Customer]
GO
IF EXISTS (SELECT * FROM sys.objects
WHERE object_id = OBJECT_ID(N'[dbo].[CustomerStatus]’)
AND type in (N’U’))
DROP TABLE [dbo].[CustomerStatus]
GO
–The base dimension of customer…
CREATE TABLE dbo.Customer (
CustomerId int IDENTITY(1,1), Name varchar(64),
CONSTRAINT [PK_Cusomter] PRIMARY KEY CLUSTERED
(CustomerId ASC),
CONSTRAINT [AK_Customer] UNIQUE NONCLUSTERED
(Name ASC))
ON [PRIMARY]
GO
–A temporal table showing the status of the customer over time…
CREATE TABLE dbo.CustomerStatus (
CustomerId int , TransactionTime datetime, Active bit,
CONSTRAINT [PK_CusomterStatus] PRIMARY KEY CLUSTERED
(CustomerId ASC, TransactionTime ASC))
ON [PRIMARY]
GO
–Seed our sample tables by creating 2 customers, both of which are active…
DECLARE @CustomerId int
DECLARE @Now datetime
INSERT INTO dbo.Customer
(Name)
VALUES
(‘Widget Manufacturing’)
SET @CustomerId = SCOPE_IDENTITY()
SET @Now = GETDATE()
INSERT INTO dbo.CustomerStatus
(CustomerId, TransactionTime, Active)
VALUES
(@CustomerId, @Now, 1)
INSERT INTO dbo.Customer
(Name)
VALUES
(‘Widget Distribution Inc.’)
SET @CustomerId = SCOPE_IDENTITY()
SET @Now = GETDATE()
INSERT INTO dbo.CustomerStatus
(CustomerId, TransactionTime, Active)
VALUES
(@CustomerId, @Now, 1)
GO
–Perform a logical update of one of the customers
–by inserting a new record where the Active bit is 0,
–indicating the customer is no longer active…
DECLARE @CustomerId int
DECLARE @Now datetime
SELECT @CustomerId = CustomerId FROM dbo.Customer WHERE [Name] = ‘Widget Manufacturing’
SET @Now = GETDATE()
INSERT INTO dbo.CustomerStatus
(CustomerId, TransactionTime, Active)
VALUES
(@CustomerId, @Now, 0)
GO
–View the complete history of all customers…
SELECT c.CustomerId, c.Name, cs.TransactionTime, cs.Active
FROM dbo.Customer c
INNER JOIN dbo.CustomerStatus cs
ON c.CustomerId = cs.CustomerId
ORDER BY c.CustomerId
GO
–View the “current” status of all customers, using Brent’s
–example of “getting the most recent record”…
–…and this will show that “Widget Manufacturing” is now inactive…
SELECT cs0.CustomerId, c.Name, cs0.TransactionTime, cs0.Active
FROM dbo.Customer c
INNER JOIN dbo.CustomerStatus cs0
ON c.CustomerId = cs0.CustomerId
LEFT OUTER JOIN dbo.CustomerStatus cs1
ON cs0.CustomerId = cs1.CustomerId
AND cs0.TransactionTime < cs1.TransactionTime
WHERE cs1.CustomerId IS NULL
ORDER BY cs0.CustomerId;
GO
–Now use a CTE to perform the same task…
WITH Customers(CustomerId, TransactionTime) AS
(
SELECT CustomerId, MAX(TransactionTime)
FROM dbo.CustomerStatus
GROUP BY CustomerId
)
SELECT c.CustomerId, c.Name, cs0.TransactionTime, cs0.Active
FROM dbo.Customer c
INNER JOIN dbo.CustomerStatus cs0
ON c.CustomerId = cs0.CustomerId
INNER JOIN Customers cs1
ON cs0.CustomerId = cs1.CustomerId
AND cs0.TransactionTime = cs1.TransactionTime
ORDER BY cs0.CustomerId
–Now use a table variable to perform the same task…
GO
DECLARE @Customers TABLE
(CustomerId int, TransactionTime datetime)
INSERT INTO @Customers
(CustomerId, TransactionTime)
SELECT CustomerId, MAX(TransactionTime)
FROM dbo.CustomerStatus
GROUP BY CustomerId
SELECT c.CustomerId, c.Name, cs0.TransactionTime, cs0.Active
FROM dbo.CustomerStatus cs0
INNER JOIN @Customers cs1
ON cs0.CustomerId = cs1.CustomerId
AND cs0.TransactionTime = cs1.TransactionTime
INNER JOIN dbo.Customer c
ON cs0.CustomerId = c.CustomerId
GO
–And finally a subselect…
SELECT c.CustomerId, c.CustomerId, cs0.TransactionTime, cs0.Active
FROM dbo.CustomerStatus cs0
INNER JOIN dbo.Customer c
ON cs0.CustomerId = c.CustomerId
WHERE cs0.TransactionTime = (SELECT MAX(TransactionTime)
FROM dbo.CustomerStatus cs1
WHERE cs1.CustomerId = cs0.CustomerId)
–Which one showed less disk and memory activity?

Reply
Raul Santos Neto
April 15, 2009 12:52 pm

Hey Brent,

What about the following approach? What do you think about performance?

SELECT T1.id, T1.create_date, T1.info1, T1.info2
FROM (
SELECT RANK() OVER (PARTITION BY id ORDER BY create_date) AS RowNumber
, id, create_date, info1, info2
FROM dbo.TestTable
) T1
WHERE RowNumber = 1

Cheers,
Raul Santos Neto

Reply
Brent Ozar
April 15, 2009 1:03 pm

Jay – ah, cool, I was gonna say – I’ve never seen a shorter way to do it than my way, heh.

Raul – go ahead & find out! Try it in your environment and compare the query plans.

Reply
Chris
April 15, 2009 1:28 pm

The self join on that table seems a bit expensive (O=N!). With a very large table this could be very slow.

It seems to me that barring exactly simultaneous entries and with an index on said date field this approach would work a bit faster:

SELECT tt. …
FROM dbo.TestTable tt
ORDER BY tt.create_date DESC
LIMIT 1;

This would involve a single page load from the index to get the pointer to the record with the maximum date value then a single page load for the record (O=2).

In the case of multiple entries on the maximum create_date, your query would return multiple records, as well.

IF you want all records on the last date why not just do this:

SELECT tt. …
FROM dbo.TestTable tt
ORDER BY tt.create_date DESC
WHERE create_date = (SELECT MAX(t.create_date) FROM tbl.TestTable t);

This would require a single Page read from the index to get the max value and a single partial index scan for the compare and the page loads for each record hit. I would guesstimate the cost at O=N (technically N/2+C).

Of course if create_date isn’t indexed then all bets are off because each of these would require a table scan.

Reply
Brent Ozar
April 15, 2009 1:39 pm

Hi, Chris. You’ll need to check the specs again. Each ID may have a different create_date. If they all had the same date, then your approach would make sense, but that wasn’t the request. Good idea though!

Reply
Wayne Smith
April 15, 2009 3:00 pm

I have found that using table functions has been the fastet and most versital in my code and makes querying the teporal tables just like querying any other table. Below is an example is a table I have in production and the function that I use. The only parmater is the date which determines the max date of data making it easy to look at the data as of that date.

CREATE TABLE [Agency].[AgentCarrierAppointment](
[AgentId] [int] NOT NULL,
[CarrierId] [smallint] NOT NULL,
[StateId] [tinyint] NOT NULL,
[TransactionDate] [datetime] NOT NULL,
[AppointmentId] [int] NOT NULL,
[TransactionuserId] [int] NOT NULL,
[Active] [bit] NOT NULL,
CONSTRAINT [PK_AgentCarrierAppointment] PRIMARY KEY CLUSTERED
(
[AgentId] ASC,
[CarrierId] ASC,
[StateId] ASC,
[TransactionDate] ASC
)WITH (PAD_INDEX = OFF, STATISTICS_NORECOMPUTE = OFF, IGNORE_DUP_KEY = OFF, ALLOW_ROW_LOCKS = ON, ALLOW_PAGE_LOCKS = ON, FILLFACTOR = 90) ON [PRIMARY]
) ON [PRIMARY]
GO

CREATE FUNCTION [Agency].[AgentCarrierAppointmentTable]
(
@MaxTransactionDate datetime = NULL
)
RETURNS TABLE
WITH SCHEMABINDING
AS
RETURN
(
Select aca.AgentId, aca.CarrierId, aca.StateId, aca.TransactionDate, aca.AppointmentId, aca.TransactionuserId, aca.Active
From Agency.AgentCarrierAppointment aca
Inner Join (Select AgentId, CarrierId, StateId, MAX(TransactionDate) [TransactionDate]
From Agency.AgentCarrierAppointment
Where TransactionDate <= isNULL(@MaxTransactionDate, TransactionDate)
Group By AgentId, CarrierId, StateId) acac
on aca.AgentId = acac.AgentId
And aca.CarrierId = acac.CarrierId
And aca.StateId = acac.StateId
And aca.TransactionDate = acac.TransactionDate
)
GO

Here is some examples that I have compared first using the table function second the LEFT JOIN method.
Select COUNT(1) From Agency.AgentCarrierAppointmentTable(Default)
Select COUNT(1)
From Agency.AgentCarrierAppointment aca
Left Join Agency.AgentCarrierAppointment acac
On aca.AgentId = acac.AgentId
And aca.CarrierId = acac.CarrierId
And aca.StateId = acac.StateId
And aca.TransactionDate < acac.TransactionDate
Where acac.AgentId is NULL
In these examples the count on my data sets is 100004 on both, and did a compare of all data returned matched so all are logically the same. Now the execution plans show that the table function is 28% of the batch and LEFT Join is 72% when running actual time test the table function returns about 7 times faster. (143ms to 1000ms) Now if you return small datasets like adding a where on the AgentId both return sub millisecond. If you return large data sets like doing a select * the results are much closer but the table function is still faster. (1396ms to 1576ms about 12% faster) Now the one interesting thing was that when adding a where on the Active bit the execution plan was nearly identical on % of work 49% to 51% but the results stayed the same for actual execution.

Reply
Brian
April 21, 2009 2:27 pm

how about this?

SELECT tt.*
FROM dbo.TestTable tt
INNER JOIN (SELECT t2.id, MAX(t2.create_date) AS create_date_newer FROM dbo.TestTable t2 GROUP BY t2.id) AS ttNewer
ON tt.id = ttNewer.id AND tt.create_date = ttNewer.create_date_newer

SQL Server parse and compile time:
CPU time = 0 ms, elapsed time = 3 ms.

(3 row(s) affected)
Table ‘TestTable’. Scan count 1, logical reads 2, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.

SQL Server Execution Times:
CPU time = 0 ms, elapsed time = 1 ms.

Reply
Gerg
August 12, 2009 3:13 pm

Hm, on Postgres you could just do:

SELECT DISTINCT ON (id) *
FROM tt
ORDER BY id, create_date DESC

It’s functionally identical to the OLAP window-function based solution (which Postgres supports too).

Also, you could rewrite Brian’s solution as:

select *
from tt
where (id,create_date) in (
select id,max(create_date)
from tt
group by id)

Assuming a good SQL optimizer the plan should be identical since the query is equivalent but it’s a lot easier to read.

You could do the same thing with the original proposal:

select *
from tt
where not exists (
select 1
from tt as ttnewer
where ttnewer.id = tt
and ttnewer.create_date > tt.create_date
)

which I claim is easier to read too.

Incidentally “don’t use select *” is, imho, cargo cult programming and disagree vociferously. I don’t know how well SQL Server deals with it but on other tools all the old problems are gone and there are massive code modularity benefits to using it.

Reply
- Brent Ozar
  August 12, 2009 3:15 pm
  
  Hi, Gerg. Sorry, but no, SQL Server does not respond well to SELECT *’s when they’re not needed. I’ve demoed this using execution plans several times at user groups. For more info on it, pick up Grant Fritchey’s excellent book “SQL Server Query Performance Tuning Distilled,” which explains this concept in detail.
  
  Reply
  - Gerg
    August 12, 2009 3:19 pm
    
    That sucks. Have you reported bugs?
    
    You shouldn’t have to hard code ever field in every bit of your code making it hard to change schema without adjusting every layer. Don’t put up with bad tools, get them fixed!
    
    Reply
    - Brent Ozar
      August 12, 2009 8:20 pm
      
      Gerg – no, that’s not a bug. If you only need 3 fields out of 20, you should only ask for those 3. Otherwise you’re pulling unnecessary IO overhead by doing table scans when you might be able to take advantage of a clustering index instead, and you’re pulling unnecessary network overhead by pushing those 20 fields (x every record) instead of just 3.
      
      Seriously, do yourself a favor and pick up Grant’s book. It’ll be an eye-opener for large-scale environments. I understand that doing SELECT * is easier for programmers, but if you need to scale to high volume environments, you need to watch every operation for performance tuning.
      
      Reply
Gerg
August 12, 2009 3:15 pm

(reposting due to munging the blog software did on the first attempt)

Hm, on Postgres you could just do:

.SELECT DISTINCT ON (id) *
. FROM tt
. ORDER BY id, create_date DESC

It’s functionally identical to the OLAP window-function based solution (which Postgres supports too).

Also, you could rewrite Brian’s solution as:

.select *
. from tt
. where (id,create_date) in (
. select id,max(create_date)
. from tt
. group by id)

Assuming a good SQL optimizer the plan should be identical since the query is equivalent but it’s a lot easier to read.

You could do the same thing with the original proposal:

.select *
. from tt
. where not exists (
. select 1
. from tt as ttnewer
. where ttnewer.id = tt
. and ttnewer.create_date > tt.create_date
. )

which I claim is easier to read too.

Incidentally “don’t use select *” is, imho, cargo cult programming and disagree vociferously. I don’t know how well SQL Server deals with it but on other tools all the old problems are gone and there are massive code modularity benefits to using it.

Reply
Gerg
August 12, 2009 3:16 pm

Sorry, that didn’t help with the munging anwyways

Reply
Gerg
August 12, 2009 3:18 pm

That sucks. Have you reported bugs?

You shouldn’t have to hard code ever field in every bit of your code making it hard to change schema without adjusting every layer. Don’t put up with bad tools, get them fixed!

Reply
Nancy D
August 19, 2009 11:29 am

Brent –
Your original solution was so simple and it worked! I spent a huge chunk of yesterday trying to get it right and just couldn’t. Found this site with google and am now bookmarking it. Thank you!!

Reply
- Brent Ozar
  August 19, 2009 11:30 am
  
  You’re welcome! Glad I could help!
  
  Reply
Shane Voelker
April 1, 2014 12:41 pm

This is a great idea. It is also useful when trying to compare deltas for specific date periods. You can join on records where the data is the same and then filter out those records.

Reply
Mxolisi Phiri
April 10, 2015 4:36 am

Great stuff, thanks for sharing your code…

Reply
Steven Swab
February 10, 2016 11:33 am

I just learned how much of a hacker I am with regard to TSQL.
I normally use derived tables for this type of thing, but this code is so much better.
Just goes to show that knowing how things work, and taking the time to think things through (rather than instantly coding) can pay pretty good dividends.

Thanks Brent!

Reply

Leave a Reply Cancel reply