Can this Sql statement be refactored to NOT use RANK/PARTITION?

I have the following sql statement, which works perfectly fine. I was hoping to see how this could be refactored so it doesn’t require the use of RANK/PARTITION … if possible.

SELECT LogEntryId, FileId, CreatedOn, EventTypeId
FROM (SELECT a.LogEntryId, a.FileId, a.CreatedOn,  a.EventTypeId, 
        RANK() OVER (PARTITION BY ClientName ORDER BY a.CreatedOn DESC) AS MostRecentEventRank
    FROM LogEntries a
    WHERE (a.EventTypeId = 2 or a.EventTypeId = 4)) SubQuery
WHERE MostRecentEventRank = 1

What is it trying to do?

  • Why does SQL Server use a non-clustered index over the clustered PK in a “select *” operation?
  • How does SQLParameter prevent SQL Injection?
  • Strange behaviour with Fulltext search in SQL Server
  • Group by Time Interval
  • Query to find Exact count
    1. Grab all the records in the table, grouped by client name and then ordered by most recently created.
    2. Filter this by only event types #2 (a connection) or #4 (a disconnection).
    3. Now, for each client name, retrieve the most recent record.

    this in effect is grabbing the most recent event (for a connection or disconnection), for each unique user in the table.

    I do like RANK/PARTITION, but i was hoping to see if it’s possible to do without using it.

    4 Solutions collect form web for “Can this Sql statement be refactored to NOT use RANK/PARTITION?”

    Yet another variation: select the clients, then use CROSS APPLY (.. TOP (1) … ORDER BY …) to get the relevant entry.

    SELECT c.ClientName,r.LogEntryId, r.FileId, r.CreatedOn,  r.EventTypeId
    FROM (
     FROM LogEntries
     WHERE EventTypeId IN (2,4)) as c
       SELECT TOP (1) a.LogEntryId, a.FileId, a.CreatedOn,  a.EventTypeId
       FROM LogEntries as a
       WHERE a.ClientName = c.ClientName
       AND a.EventTypeId IN (2,4)
       ORDER BY a.CreatedOn DESC) as r;


    To talk about performance on a T-SQL query without knowing the schema is non-sense. This query is perfectly optimal on a properly designed schema for its needs. Since the access is by ClientName and CreatedOn, then even a simplistic schema would need to take this into consideration:

    CREATE TABLE LogEntries (
       LogEntryId int identity(1,1),
       FileID int,
       CreatedOn datetime,
       EventTypeID int,
       ClientName varchar(30)
    create clustered index cdxLogEntries on LogEntries (
        ClientName, CreatedOn DESC);

    And lets load the table with some 2.4M rows:

    declare @i int;
    set @i = 0;
    while @i < 1000
        insert into LogEntries (FileId, CreatedOn, EventTypeId, ClientName)
        select cast(rand()*100 as int),
            dateadd(minute, -rand()*10000, getdate()),
            cast(rand() * 5 as int),
            'Client' + cast(@i as varchar(10))
            from master..spt_values;
        set @i = @i+1;

    What time and IO do we get with set statistics io on; set statistics time on; on a warmed cache?

    (410 row(s) affected)
    Table 'LogEntries'. Scan count 411, logical reads 14354, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.
    SQL Server Execution Times:
       CPU time = 1219 ms,  elapsed time = 1932 ms.

    1.9 sec to get the data from 2.4M entries on my laptop (which is 4 years old and has 1Gb RAM). And there is still plenty of room for improvement int he schema design. Separate ClientName into a normalized table with a trusted a foreign key from LogEntries into it would reduce the time significantly. Proper filtered indexes on EntryTypeId IN (2,4) also would contribute. We did not even start the exploration of parallelism posibilities.

    This is SQL, performance is obtained on the drawing board of your schema, not in the text editor of your query.

    Single table scan, no windowing function, single group by, no problems with duplicate dates, equal performance with windowing functions, or even outperforms them with really large queries. (Update: I don’t know how it performs compared to the TOP 1 WITH TIES / CROSS APPLY method. Since it uses a scan, it might be slower in some situations.)

       LogEntryID = Convert(int, Substring(Packed, 9, 4)),
       FileID = Convert(int, Substring(Packed, 13, 4)),
       CreatedOn = Convert(datetime, Substring(Packed, 1, 8)),
       EventTypeID = Convert(int, Substring(Packed, 17, 4))
             Packed = Max(
                Convert(binary(8), CreatedOn)
                + Convert(binary(4), LogEntryID)
                + Convert(binary(4), FileID)
                + Convert(binary(4), EventTypeID)
          FROM LogEntries
          WHERE EventTypeID IN (2,4)
          GROUP BY ClientName
       ) X

    If anyone would like to see this in action, here’s some creation script:

    USE tempdb
    CREATE TABLE LogEntries (
       LogEntryID int not null identity(1,1),
       FileID int,
       CreatedOn datetime,
       EventTypeID int,
       ClientName varchar(30)
    INSERT LogEntries VALUES (1, GetDate()-20, 2, 'bob')
    INSERT LogEntries VALUES (1, GetDate()-19, 3, 'bob')
    INSERT LogEntries VALUES (1, GetDate()-18, 4, 'bob')
    INSERT LogEntries VALUES (1, GetDate()-17, 3, 'bob')
    INSERT LogEntries VALUES (1, GetDate()-19.5, 2, 'anna')
    INSERT LogEntries VALUES (1, GetDate()-18.5, 3, 'anna')
    INSERT LogEntries VALUES (1, GetDate()-17.5, 4, 'anna')
    INSERT LogEntries VALUES (1, GetDate()-16.5, 3, 'anna')

    Please note that this method is taking advantage of the internal byte representation of the given data types having the same ordering as the type’s values. Packed data types like float or decimal will NOT work: those would require conversion to something suitable first, such as int, bigint, or character.

    Also, the new Date and Time data types in SQL 2008 have different representations that won’t pack correctly to use with this method. I haven’t examined the Time data type yet, but for the Date data type:

    DECLARE @d date
    SET @d ='99990101'
    SELECT Convert(binary(3), @d) -- 0x6EB837

    The actual value is 0x37B86E, so it’s storing them in reverse byte order (the “zero” date is 0001-01-01).

    You could use an exclusive left join:

    select     cur.*
    from       LogEntries cur
    left join  LogEntries next
    on         next.ClientName = cur.ClientName
               and next.EventTypeId in (2,4)
               and next.CreatedOn > cur.CreatedOn               
    where      next.ClientName is null
               and cur.EventTypeId in (2,4)

    This joins the table on itself, searching for later rows in the on condition. In the where clause, you specify that no later row may exist. In that way you filter out all but the latest row per client.

    Here you go. Might be faster… not sure.
    Also, this assumes that ClientName+CreatedOn is unique.

    ;WITH MostRecent AS
       SELECT ClientName, Max(CreatedOn) AS CreatedOn
       FROM LogEntries
       WHERE EventTypeID IN (2,4)
       GROUP BY ClientName
    SELECT LogEntryId, FileId, CreatedOn, EventTypeId
    FROM LogEntries L
    INNER JOIN MostRecent R ON L.ClientName = R.ClientName AND L.CreatedOn = R.CreatedON

    Note, I did not test might have typos.

    MS SQL Server is a Microsoft SQL Database product, include sql server standard, sql server management studio, sql server express and so on.