Getting the most recent record

SQL Server
25 Comments

Sounds simple, right?  Just grab the max.  But what if you want to use a single T-SQL operation to fetch the most recent record and some of its attributes?

StorminSpank asked:

@BrentO SQL Question. 4 columns, ID, Date, Info1, Info2. ID has multiple entries. Want Latest Date for Distinct ID, but also info1 and 2.

Let’s say our table schema is:

Now let’s get out the most recent record for each ID:

That left outer join is looking for any newer TestTable records, but the where clause makes sure there aren’t any. If you switch it around to “IS NOT NULL”, you’d get the exact opposite – you’d get all of the older records.

And remember, ladies and gentlemen, do as I say and not as I do – never SELECT *.

Previous Post
Certifications are the icing on the cake
Next Post
Virtualization Survey

25 Comments. Leave new

  • A common one in technical tests. I’ve used CTE’s for this before. This way is probably better.

    Reply
  • Fantastic! Thanks for the great advice!!!

    Reply
  • Same here Rhys, I would use a CTE, or before that table variables to collect the “current” rows, then join again on the table to get the other columns (…and for those out there old enough to remember SQL Server 6.5 or before, ugly subselects) Not only is this less code, but also fewer disk i/o’s. Run the example and take a look at the disk activity using Performance Monitor. Yes Brent, I had to give it a test! 🙂

    Reply
    • Thanks, Jay – that’s really interesting. Can you post your example with less code that uses fewer disk i/o’s? I’d love to do a followup on it to show a better way to do it.

      Reply
  • I don’t think CTE’s are available in SQL 2000, so this is similar to the way I’ve handled this. I can’t remember where I found the code snippet.

    Reply
  • Brent,

    I was actually referring to your example as the one with “less code”. Here are examples of the various way to perform the task. Of each of them, it seems to me your example is the best both in terms of simplicity and performance. I do a lot of bi-temporal database modeling and development so this is a very common task for me to perform. Thanks again and keep the tips and tricks coming!

    Jay

    –First I setup a counter log on the SQL Server with
    –several counters for both physical and logical disks, as well as memory.
    –Disk Reads/sec, Avg. Disk sec/Read, etc.
    –These counters were set to read only one physical disk, which had only one partition (thus one logical)
    –For the test, this disk contained both the data and log file for the database.
    –System database files were on another disk and no other files were on this disk.
    –Since I set the counter log top use an SQL database, the following query
    –was used to read the values of the counters after they were written…
    SELECT cd1.ObjectName, cd1.CounterName,
    cd0.CounterDateTime, cd0.CounterValue,
    cd0.FirstValueA, cd0.FirstValueB,
    cd0.SecondValueA, cd0.SecondValueB,
    cd0.MultiCount
    FROM dbo.CounterData cd0
    INNER JOIN dbo.CounterDetails cd1
    ON cd0.CounterId = cd1.CounterId
    WHERE cd0.CounterValue > 0
    ORDER BY cd0.CounterDateTime DESC
    –The different examples of getting the most recent record are shown below.
    –The two tables created in this example are very similar to the structure
    –where I ran the test, except the dbo.Customer table had just over 20K rows
    –and the dbo.CustomerStatus table contained just over 2,100,000 rows.

    –Create our sample tables…
    IF EXISTS (SELECT * FROM sys.objects
    WHERE object_id = OBJECT_ID(N'[dbo].[Customer]’)
    AND type in (N’U’))
    DROP TABLE [dbo].[Customer]
    GO
    IF EXISTS (SELECT * FROM sys.objects
    WHERE object_id = OBJECT_ID(N'[dbo].[CustomerStatus]’)
    AND type in (N’U’))
    DROP TABLE [dbo].[CustomerStatus]
    GO
    –The base dimension of customer…
    CREATE TABLE dbo.Customer (
    CustomerId int IDENTITY(1,1), Name varchar(64),
    CONSTRAINT [PK_Cusomter] PRIMARY KEY CLUSTERED
    (CustomerId ASC),
    CONSTRAINT [AK_Customer] UNIQUE NONCLUSTERED
    (Name ASC))
    ON [PRIMARY]
    GO
    –A temporal table showing the status of the customer over time…
    CREATE TABLE dbo.CustomerStatus (
    CustomerId int , TransactionTime datetime, Active bit,
    CONSTRAINT [PK_CusomterStatus] PRIMARY KEY CLUSTERED
    (CustomerId ASC, TransactionTime ASC))
    ON [PRIMARY]
    GO
    –Seed our sample tables by creating 2 customers, both of which are active…
    DECLARE @CustomerId int
    DECLARE @Now datetime
    INSERT INTO dbo.Customer
    (Name)
    VALUES
    (‘Widget Manufacturing’)
    SET @CustomerId = SCOPE_IDENTITY()
    SET @Now = GETDATE()
    INSERT INTO dbo.CustomerStatus
    (CustomerId, TransactionTime, Active)
    VALUES
    (@CustomerId, @Now, 1)
    INSERT INTO dbo.Customer
    (Name)
    VALUES
    (‘Widget Distribution Inc.’)
    SET @CustomerId = SCOPE_IDENTITY()
    SET @Now = GETDATE()
    INSERT INTO dbo.CustomerStatus
    (CustomerId, TransactionTime, Active)
    VALUES
    (@CustomerId, @Now, 1)
    GO
    –Perform a logical update of one of the customers
    –by inserting a new record where the Active bit is 0,
    –indicating the customer is no longer active…
    DECLARE @CustomerId int
    DECLARE @Now datetime
    SELECT @CustomerId = CustomerId FROM dbo.Customer WHERE [Name] = ‘Widget Manufacturing’
    SET @Now = GETDATE()
    INSERT INTO dbo.CustomerStatus
    (CustomerId, TransactionTime, Active)
    VALUES
    (@CustomerId, @Now, 0)
    GO
    –View the complete history of all customers…
    SELECT c.CustomerId, c.Name, cs.TransactionTime, cs.Active
    FROM dbo.Customer c
    INNER JOIN dbo.CustomerStatus cs
    ON c.CustomerId = cs.CustomerId
    ORDER BY c.CustomerId
    GO
    –View the “current” status of all customers, using Brent’s
    –example of “getting the most recent record”…
    –…and this will show that “Widget Manufacturing” is now inactive…
    SELECT cs0.CustomerId, c.Name, cs0.TransactionTime, cs0.Active
    FROM dbo.Customer c
    INNER JOIN dbo.CustomerStatus cs0
    ON c.CustomerId = cs0.CustomerId
    LEFT OUTER JOIN dbo.CustomerStatus cs1
    ON cs0.CustomerId = cs1.CustomerId
    AND cs0.TransactionTime < cs1.TransactionTime
    WHERE cs1.CustomerId IS NULL
    ORDER BY cs0.CustomerId;
    GO
    –Now use a CTE to perform the same task…
    WITH Customers(CustomerId, TransactionTime) AS
    (
    SELECT CustomerId, MAX(TransactionTime)
    FROM dbo.CustomerStatus
    GROUP BY CustomerId
    )
    SELECT c.CustomerId, c.Name, cs0.TransactionTime, cs0.Active
    FROM dbo.Customer c
    INNER JOIN dbo.CustomerStatus cs0
    ON c.CustomerId = cs0.CustomerId
    INNER JOIN Customers cs1
    ON cs0.CustomerId = cs1.CustomerId
    AND cs0.TransactionTime = cs1.TransactionTime
    ORDER BY cs0.CustomerId
    –Now use a table variable to perform the same task…
    GO
    DECLARE @Customers TABLE
    (CustomerId int, TransactionTime datetime)
    INSERT INTO @Customers
    (CustomerId, TransactionTime)
    SELECT CustomerId, MAX(TransactionTime)
    FROM dbo.CustomerStatus
    GROUP BY CustomerId
    SELECT c.CustomerId, c.Name, cs0.TransactionTime, cs0.Active
    FROM dbo.CustomerStatus cs0
    INNER JOIN @Customers cs1
    ON cs0.CustomerId = cs1.CustomerId
    AND cs0.TransactionTime = cs1.TransactionTime
    INNER JOIN dbo.Customer c
    ON cs0.CustomerId = c.CustomerId
    GO
    –And finally a subselect…
    SELECT c.CustomerId, c.CustomerId, cs0.TransactionTime, cs0.Active
    FROM dbo.CustomerStatus cs0
    INNER JOIN dbo.Customer c
    ON cs0.CustomerId = c.CustomerId
    WHERE cs0.TransactionTime = (SELECT MAX(TransactionTime)
    FROM dbo.CustomerStatus cs1
    WHERE cs1.CustomerId = cs0.CustomerId)
    –Which one showed less disk and memory activity?

    Reply
  • Hey Brent,

    What about the following approach? What do you think about performance?

    SELECT T1.id, T1.create_date, T1.info1, T1.info2
    FROM (
    SELECT RANK() OVER (PARTITION BY id ORDER BY create_date) AS RowNumber
    , id, create_date, info1, info2
    FROM dbo.TestTable
    ) T1
    WHERE RowNumber = 1

    Cheers,
    Raul Santos Neto

    Reply
  • Jay – ah, cool, I was gonna say – I’ve never seen a shorter way to do it than my way, heh.

    Raul – go ahead & find out! Try it in your environment and compare the query plans.

    Reply
  • The self join on that table seems a bit expensive (O=N!). With a very large table this could be very slow.

    It seems to me that barring exactly simultaneous entries and with an index on said date field this approach would work a bit faster:

    SELECT tt. …
    FROM dbo.TestTable tt
    ORDER BY tt.create_date DESC
    LIMIT 1;

    This would involve a single page load from the index to get the pointer to the record with the maximum date value then a single page load for the record (O=2).

    In the case of multiple entries on the maximum create_date, your query would return multiple records, as well.

    IF you want all records on the last date why not just do this:

    SELECT tt. …
    FROM dbo.TestTable tt
    ORDER BY tt.create_date DESC
    WHERE create_date = (SELECT MAX(t.create_date) FROM tbl.TestTable t);

    This would require a single Page read from the index to get the max value and a single partial index scan for the compare and the page loads for each record hit. I would guesstimate the cost at O=N (technically N/2+C).

    Of course if create_date isn’t indexed then all bets are off because each of these would require a table scan.

    Reply
  • Hi, Chris. You’ll need to check the specs again. Each ID may have a different create_date. If they all had the same date, then your approach would make sense, but that wasn’t the request. Good idea though!

    Reply
  • I have found that using table functions has been the fastet and most versital in my code and makes querying the teporal tables just like querying any other table. Below is an example is a table I have in production and the function that I use. The only parmater is the date which determines the max date of data making it easy to look at the data as of that date.

    CREATE TABLE [Agency].[AgentCarrierAppointment](
    [AgentId] [int] NOT NULL,
    [CarrierId] [smallint] NOT NULL,
    [StateId] [tinyint] NOT NULL,
    [TransactionDate] [datetime] NOT NULL,
    [AppointmentId] [int] NOT NULL,
    [TransactionuserId] [int] NOT NULL,
    [Active] [bit] NOT NULL,
    CONSTRAINT [PK_AgentCarrierAppointment] PRIMARY KEY CLUSTERED
    (
    [AgentId] ASC,
    [CarrierId] ASC,
    [StateId] ASC,
    [TransactionDate] ASC
    )WITH (PAD_INDEX = OFF, STATISTICS_NORECOMPUTE = OFF, IGNORE_DUP_KEY = OFF, ALLOW_ROW_LOCKS = ON, ALLOW_PAGE_LOCKS = ON, FILLFACTOR = 90) ON [PRIMARY]
    ) ON [PRIMARY]
    GO

    CREATE FUNCTION [Agency].[AgentCarrierAppointmentTable]
    (
    @MaxTransactionDate datetime = NULL
    )
    RETURNS TABLE
    WITH SCHEMABINDING
    AS
    RETURN
    (
    Select aca.AgentId, aca.CarrierId, aca.StateId, aca.TransactionDate, aca.AppointmentId, aca.TransactionuserId, aca.Active
    From Agency.AgentCarrierAppointment aca
    Inner Join (Select AgentId, CarrierId, StateId, MAX(TransactionDate) [TransactionDate]
    From Agency.AgentCarrierAppointment
    Where TransactionDate <= isNULL(@MaxTransactionDate, TransactionDate)
    Group By AgentId, CarrierId, StateId) acac
    on aca.AgentId = acac.AgentId
    And aca.CarrierId = acac.CarrierId
    And aca.StateId = acac.StateId
    And aca.TransactionDate = acac.TransactionDate
    )
    GO

    Here is some examples that I have compared first using the table function second the LEFT JOIN method.
    Select COUNT(1) From Agency.AgentCarrierAppointmentTable(Default)
    Select COUNT(1)
    From Agency.AgentCarrierAppointment aca
    Left Join Agency.AgentCarrierAppointment acac
    On aca.AgentId = acac.AgentId
    And aca.CarrierId = acac.CarrierId
    And aca.StateId = acac.StateId
    And aca.TransactionDate < acac.TransactionDate
    Where acac.AgentId is NULL
    In these examples the count on my data sets is 100004 on both, and did a compare of all data returned matched so all are logically the same. Now the execution plans show that the table function is 28% of the batch and LEFT Join is 72% when running actual time test the table function returns about 7 times faster. (143ms to 1000ms) Now if you return small datasets like adding a where on the AgentId both return sub millisecond. If you return large data sets like doing a select * the results are much closer but the table function is still faster. (1396ms to 1576ms about 12% faster) Now the one interesting thing was that when adding a where on the Active bit the execution plan was nearly identical on % of work 49% to 51% but the results stayed the same for actual execution.

    Reply
  • how about this?

    SELECT tt.*
    FROM dbo.TestTable tt
    INNER JOIN (SELECT t2.id, MAX(t2.create_date) AS create_date_newer FROM dbo.TestTable t2 GROUP BY t2.id) AS ttNewer
    ON tt.id = ttNewer.id AND tt.create_date = ttNewer.create_date_newer

    SQL Server parse and compile time:
    CPU time = 0 ms, elapsed time = 3 ms.

    (3 row(s) affected)
    Table ‘TestTable’. Scan count 1, logical reads 2, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.

    SQL Server Execution Times:
    CPU time = 0 ms, elapsed time = 1 ms.

    Reply
  • Hm, on Postgres you could just do:

    SELECT DISTINCT ON (id) *
    FROM tt
    ORDER BY id, create_date DESC

    It’s functionally identical to the OLAP window-function based solution (which Postgres supports too).

    Also, you could rewrite Brian’s solution as:

    select *
    from tt
    where (id,create_date) in (
    select id,max(create_date)
    from tt
    group by id)

    Assuming a good SQL optimizer the plan should be identical since the query is equivalent but it’s a lot easier to read.

    You could do the same thing with the original proposal:

    select *
    from tt
    where not exists (
    select 1
    from tt as ttnewer
    where ttnewer.id = tt
    and ttnewer.create_date > tt.create_date
    )

    which I claim is easier to read too.

    Incidentally “don’t use select *” is, imho, cargo cult programming and disagree vociferously. I don’t know how well SQL Server deals with it but on other tools all the old problems are gone and there are massive code modularity benefits to using it.

    Reply
    • Hi, Gerg. Sorry, but no, SQL Server does not respond well to SELECT *’s when they’re not needed. I’ve demoed this using execution plans several times at user groups. For more info on it, pick up Grant Fritchey’s excellent book “SQL Server Query Performance Tuning Distilled,” which explains this concept in detail.

      Reply
      • That sucks. Have you reported bugs?

        You shouldn’t have to hard code ever field in every bit of your code making it hard to change schema without adjusting every layer. Don’t put up with bad tools, get them fixed!

        Reply
        • Gerg – no, that’s not a bug. If you only need 3 fields out of 20, you should only ask for those 3. Otherwise you’re pulling unnecessary IO overhead by doing table scans when you might be able to take advantage of a clustering index instead, and you’re pulling unnecessary network overhead by pushing those 20 fields (x every record) instead of just 3.

          Seriously, do yourself a favor and pick up Grant’s book. It’ll be an eye-opener for large-scale environments. I understand that doing SELECT * is easier for programmers, but if you need to scale to high volume environments, you need to watch every operation for performance tuning.

          Reply
  • (reposting due to munging the blog software did on the first attempt)

    Hm, on Postgres you could just do:

    .SELECT DISTINCT ON (id) *
    . FROM tt
    . ORDER BY id, create_date DESC

    It’s functionally identical to the OLAP window-function based solution (which Postgres supports too).

    Also, you could rewrite Brian’s solution as:

    .select *
    . from tt
    . where (id,create_date) in (
    . select id,max(create_date)
    . from tt
    . group by id)

    Assuming a good SQL optimizer the plan should be identical since the query is equivalent but it’s a lot easier to read.

    You could do the same thing with the original proposal:

    .select *
    . from tt
    . where not exists (
    . select 1
    . from tt as ttnewer
    . where ttnewer.id = tt
    . and ttnewer.create_date > tt.create_date
    . )

    which I claim is easier to read too.

    Incidentally “don’t use select *” is, imho, cargo cult programming and disagree vociferously. I don’t know how well SQL Server deals with it but on other tools all the old problems are gone and there are massive code modularity benefits to using it.

    Reply
  • Sorry, that didn’t help with the munging anwyways

    Reply
  • That sucks. Have you reported bugs?

    You shouldn’t have to hard code ever field in every bit of your code making it hard to change schema without adjusting every layer. Don’t put up with bad tools, get them fixed!

    Reply
  • Brent –
    Your original solution was so simple and it worked! I spent a huge chunk of yesterday trying to get it right and just couldn’t. Found this site with google and am now bookmarking it. Thank you!!

    Reply
  • Shane Voelker
    April 1, 2014 12:41 pm

    This is a great idea. It is also useful when trying to compare deltas for specific date periods. You can join on records where the data is the same and then filter out those records.

    Reply
  • Mxolisi Phiri
    April 10, 2015 4:36 am

    Great stuff, thanks for sharing your code…

    Reply
  • I just learned how much of a hacker I am with regard to TSQL.
    I normally use derived tables for this type of thing, but this code is so much better.
    Just goes to show that knowing how things work, and taking the time to think things through (rather than instantly coding) can pay pretty good dividends.

    Thanks Brent!

    Reply

Leave a Reply

Your email address will not be published. Required fields are marked *

Fill out this field
Fill out this field
Please enter a valid email address.