Introduction

Table-valued functions have been around since SQL Server 2000. Basically, a table-valued function is a function that returns a table and can therefore be used as a table in a query. At first this may sound like nothing new, since views have provided a similar mechanism for much longer. That is partly true: a function has the same benefits as a view when it comes to data protection enforced via privileges or to simplifying a query. However, a table-valued function has at least a few advantages:

  • Parameterization: the function can receive parameters, so the logic inside the function can be adjusted in ways that traditional predicate pushing cannot achieve.
  • Programmability: a view can contain a certain amount of logic (calculations, CASE structures, and so on), but it is still quite column-bound, so more complex logic is hard or impossible to express.
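To illustrate the first point, here is a minimal sketch contrasting the two approaches. The Sales table and all names in it are made up for illustration only; a view exposes a fixed query, while an inline table-valued function receives the filter as a parameter:

```sql
-- Hypothetical Sales table, for illustration only.
-- A view is a fixed query; any filtering happens outside it.
CREATE VIEW AllSales AS
   SELECT s.Id, s.SoldOn, s.Amount
   FROM   Sales s;
GO

-- A table-valued function can take the cutoff date as a parameter,
-- making it part of the function's own logic.
CREATE FUNCTION SalesSince(@cutoff date)
RETURNS TABLE
AS
RETURN
   SELECT s.Id, s.SoldOn, s.Amount
   FROM   Sales s
   WHERE  s.SoldOn >= @cutoff;
GO

SELECT * FROM SalesSince('2011-01-01');
```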

Creating a simple Table-Valued Function with (some kind of) logic

First, let's create a small table to store some data:

CREATE TABLE TrackingItem (
   Id       int  NOT NULL IDENTITY(1,1),
   Issued   date NOT NULL,
   Category int  NOT NULL
);
CREATE INDEX X_TrackingItem_Issued ON TrackingItem (Issued);

And then add a few rows of test data:

INSERT INTO TrackingItem (Issued, Category) VALUES ( DATEADD( day, 0, GETDATE()), 1);
INSERT INTO TrackingItem (Issued, Category) VALUES ( DATEADD( day, 1, GETDATE()), 2);
INSERT INTO TrackingItem (Issued, Category) VALUES ( DATEADD( day, 4, GETDATE()), 1);
INSERT INTO TrackingItem (Issued, Category) VALUES ( DATEADD( day, 4, GETDATE()), 2);

Now suppose we need a result set that would:

  • Include all the columns from TrackingItem
  • Include an extra Modified (date) column
  • Contain no even numbers in Category
  • Indicate the modification date in the Modified column whenever Category is changed
  • Return only the tracking items having an Id greater than or equal to the parameter passed.

The table-valued function could look like this:

CREATE FUNCTION TrackingItemsModified(@minId int)
RETURNS @trackingItems TABLE (
   Id       int      NOT NULL,
   Issued   date     NOT NULL,
   Category int      NOT NULL,
   Modified datetime NULL
) 
AS
BEGIN
   INSERT INTO @trackingItems (Id, Issued, Category)
   SELECT ti.Id, ti.Issued, ti.Category 
   FROM   TrackingItem ti
   WHERE  ti.Id >= @minId; 
   
   UPDATE @trackingItems
   SET Category = Category + 1,
       Modified = GETDATE()
   WHERE Category%2 = 0;
  
   RETURN;
END;

The function defines a table variable called @trackingItems, whose contents will be the return value of the function. First the function inserts all the desired rows from TrackingItem into the table variable. After that, the contents of the table variable are modified according to the specification and then returned.
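For comparison, the same requirements can also be met with a single set-based statement in an inline table-valued function. This is only a sketch, assuming the TrackingItem table above; the name TrackingItemsModifiedInline is made up:

```sql
CREATE FUNCTION TrackingItemsModifiedInline(@minId int)
RETURNS TABLE
AS
RETURN
   SELECT ti.Id,
          ti.Issued,
          -- Bump even categories to the next odd number...
          CASE WHEN ti.Category % 2 = 0 THEN ti.Category + 1
               ELSE ti.Category END AS Category,
          -- ...and stamp only those rows as modified (NULL otherwise).
          CASE WHEN ti.Category % 2 = 0 THEN GETDATE() END AS Modified
   FROM   TrackingItem ti
   WHERE  ti.Id >= @minId;
```

Being a single statement, this version gets expanded into the calling query, so the optimizer can use the statistics of TrackingItem directly; this difference becomes important in the performance section below.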

Using the function

The next step is to use the function. If we want to select all the rows having an Id of 2 or more, the query would look like this:

SELECT * FROM TrackingItemsModified(2);

And the results:

Id  Issued      Category  Modified
--  ----------  --------  -----------------------
2   2011-03-11  3         2011-03-10 23:46:53.523
3   2011-03-14  1         NULL
4   2011-03-14  3         2011-03-10 23:46:53.523

As the result is a table, it can be used like one. For example, if we want to query all the original tracking items that don't exist in this set, the query could be:

SELECT *
FROM  TrackingItem ti
WHERE ti.Id NOT IN (SELECT tim.Id
                    FROM   TrackingItemsModified(2) tim)

And the results would be:

Id  Issued      Category
--  ----------  --------
1   2011-03-10  1

Generating data

So table-valued functions can be used to return modified data from one or more tables in the database. But since they are programmable, they can also generate data. One quite common problem is querying all dates in a specified period based on data in a table that doesn't have entries for every date. In our test data there is a row for today and tomorrow, but the next few dates are missing. So if we want the number of tracking items for each day over the next seven days, the query wouldn't be so simple. One typical solution is to create a table that contains all the necessary dates and use that in the query. A table-valued function can be used as an alternative: if we pass the date range to a function, we can create the necessary data on the fly with a simple loop.

CREATE FUNCTION DatesBetween(@startDate date, @endDate date)
RETURNS @dates TABLE (
   DateValue date NOT NULL
) 
AS
BEGIN
   WHILE (@startDate <= @endDate) BEGIN
      INSERT INTO @dates VALUES (@startDate);
      SET @startDate = DATEADD(day, 1, @startDate);
   END;
   
   RETURN;
END;

And the query for the tracking item amounts would be:

SELECT d.DateValue,
       (SELECT COUNT(*)
        FROM   TrackingItem ti
        WHERE  d.DateValue = ti.Issued) AS Items
FROM DatesBetween(DATEADD(day, 1, GETDATE()), DATEADD(day, 7, GETDATE())) d
ORDER BY d.DateValue;

So the results would be something like:

DateValue   Items
----------  ------
2011-03-12  1
2011-03-13  0
2011-03-14  0
2011-03-15  2
2011-03-16  0
2011-03-17  0
2011-03-18  0

What about performance?

Since this is a procedural approach, the performance won't be as good as that of a proper set-based approach. However, since functions provide more flexibility from a programming point of view, let's have a look at larger amounts of data.
First we'll make sure the statistics are up to date, and then see what happens if we run the same query for a period of 50 years:

UPDATE STATISTICS TrackingItem WITH FULLSCAN;
 
SELECT d.DateValue,
       (SELECT COUNT(*)
        FROM   TrackingItem ti
        WHERE  d.DateValue = ti.Issued) AS Items
FROM DatesBetween(DATEADD(YEAR, -25, GETDATE()), DATEADD(YEAR, 25, GETDATE())) d
ORDER BY d.DateValue;

The query plan looks like:

Plan1.jpg

Not very good: we're scanning the tracking items in a loop, once per returned day. The execution statistics confirm this:

Table 'TrackingItem'. Scan count 18264, logical reads 36528, 
      physical reads 0, read-ahead reads 0, lob logical reads 0, 
      lob physical reads 0, lob read-ahead reads 0.
Table '#24927208'. Scan count 1, logical reads 30, 
      physical reads 0, read-ahead reads 0, lob logical reads 0, 
      lob physical reads 0, lob read-ahead reads 0.
 
SQL Server Execution Times:
CPU time = 4087 ms, elapsed time = 4232 ms.

Now let's add a primary key index to the returned table (which is a good option in certain situations). First we re-create the function:

DROP FUNCTION DatesBetween;
 
CREATE FUNCTION DatesBetween(@startDate date, @endDate date)
RETURNS @dates TABLE (
   DateValue date NOT NULL PRIMARY KEY CLUSTERED
) 
AS
BEGIN
   WHILE (@startDate <= @endDate) BEGIN
      INSERT INTO @dates VALUES (@startDate);
      SET @startDate = DATEADD(day, 1, @startDate);
   END;
 
   RETURN;
END;

So what happens with the same query?

Plan2.jpg

The plan has changed, since the new index is taken into account, but the overall approach is still the same. Statistics:

Table 'TrackingItem'. Scan count 18264, logical reads 36528, 
      physical reads 0, read-ahead reads 0, lob logical reads 0, 
      lob physical reads 0, lob read-ahead reads 0.
Table '#33D4B598'. Scan count 1, logical reads 30, 
      physical reads 0, read-ahead reads 0, lob logical reads 0, 
      lob physical reads 0, lob read-ahead reads 0.
 
SQL Server Execution Times:
CPU time = 4165 ms, elapsed time = 4304 ms.

Let's separate the date generation and see how it performs on its own:

SELECT * 
FROM DatesBetween(DATEADD(YEAR, -25, GETDATE()), DATEADD(YEAR, 25, GETDATE())) d

Plan:

Plan3.jpg


And the statistics:

Table '#33D4B598'. Scan count 1, logical reads 30, 
      physical reads 0, read-ahead reads 0, lob logical reads 0, 
      lob physical reads 0, lob read-ahead reads 0.
 
SQL Server Execution Times:
CPU time = 4103 ms, elapsed time = 4268 ms.

So what's happening? A few things:

  • The statistics we see do not cover the whole operation, only fetching the results from the function (and, in the previous examples, joining the data). There's no way a single scan reading 30 pages would take 4 seconds; the time goes into generating the data and inserting it into the temporary table.
  • A bad plan is chosen simply because the optimizer cannot know how many rows are going to be returned. There is no statistical data to use, since we're not querying any table (this effect actually occurs with all functions that contain multiple statements). So the optimizer assumes that the result set is going to be small.

A look at the plan statistics for the table-valued function access confirms this:

stat1.jpg

Unfortunately, there's no way to give the optimizer an estimate of the row count during the optimization phase. This would be a great feature to have in the future.
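One possible way around the estimation problem is to avoid the multi-statement form altogether: an inline (single-statement) function built with a recursive CTE is expanded into the calling query, so the optimizer can reason about it directly. This is only a sketch; the name DatesBetweenInline is made up:

```sql
CREATE FUNCTION DatesBetweenInline(@startDate date, @endDate date)
RETURNS TABLE
AS
RETURN
   WITH Dates (DateValue) AS (
      -- Anchor: the first date of the range.
      SELECT @startDate
      UNION ALL
      -- Recursive step: advance one day until the end date is reached.
      SELECT DATEADD(day, 1, DateValue)
      FROM   Dates
      WHERE  DateValue < @endDate
   )
   SELECT DateValue
   FROM   Dates;
```

Note that the default recursion limit is 100 levels, so for longer ranges the outermost query must raise it, e.g. `SELECT * FROM DatesBetweenInline('2011-01-01', '2011-12-31') OPTION (MAXRECURSION 0);` — the MAXRECURSION hint cannot be placed inside the function itself.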

Let's look at it the other way round: use a small number of dates against a big table. First, let's add some rows to the TrackingItem table: 100,000 rows spread randomly over the next 50 years:

SET NOCOUNT ON;
DECLARE @counter int = 0;
WHILE (@counter < 100000) BEGIN
   INSERT INTO TrackingItem (Issued, Category)
   VALUES ( DATEADD( day, RAND() * 365 * 50, GETDATE()), 1);

   SET @counter = @counter + 1;
END;

And now let's update the statistics again and query for a few days far in the future:

UPDATE STATISTICS TrackingItem WITH FULLSCAN;
 
SELECT d.DateValue,
       (SELECT COUNT(*)
        FROM   TrackingItem ti
        WHERE  d.DateValue = ti.Issued) AS Items
FROM DatesBetween(DATEADD(MONTH, 101, GETDATE()), DATEADD(MONTH, 102, GETDATE())) d
ORDER BY d.DateValue;

Results (partial):

DateValue   Items
----------  -----
2019-08-11  7
2019-08-12  6
2019-08-13  2
2019-08-14  3
2019-08-15  5
2019-08-16  1
2019-08-17  5
2019-08-18  7
...

The plan:

Plan4.jpg


And the statistics:

Table 'TrackingItem'. Scan count 32, logical reads 97, 
      physical reads 0, read-ahead reads 0, lob logical reads 0, 
      lob physical reads 0, lob read-ahead reads 0.
Table '#33D4B598'. Scan count 1, logical reads 2, 
      physical reads 0, read-ahead reads 0, lob logical reads 0, 
      lob physical reads 0, lob read-ahead reads 0.
 
SQL Server Execution Times:
CPU time = 16 ms, elapsed time = 104 ms.

So the chosen plan is good as long as a relatively small number of rows comes out of the function.

Conclusions

A table-valued function is a good alternative to a view or an extra table when parameterization is needed, complex logic is involved, and the amount of data returned by the function is quite small. For example, if the date generator should be able to generate dates at different granularities (days, working days, weekends, etc.), the traditional approach would likely require several tables or tagged rows.
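A sketch of how such a granularity parameter might look; the function name and the @granularity codes are made up for illustration, and the weekday numbers assume the default DATEFIRST setting (7), where 1 is Sunday and 7 is Saturday:

```sql
CREATE FUNCTION DatesBetweenFiltered(@startDate date, @endDate date, @granularity int)
RETURNS @dates TABLE (
   DateValue date NOT NULL PRIMARY KEY CLUSTERED
)
AS
BEGIN
   WHILE (@startDate <= @endDate) BEGIN
      -- Granularity codes (illustrative): 0 = all days,
      -- 1 = working days (Mon-Fri), 2 = weekends (Sat-Sun).
      IF (@granularity = 0
          OR (@granularity = 1 AND DATEPART(weekday, @startDate) BETWEEN 2 AND 6)
          OR (@granularity = 2 AND DATEPART(weekday, @startDate) IN (1, 7)))
         INSERT INTO @dates VALUES (@startDate);

      SET @startDate = DATEADD(day, 1, @startDate);
   END;

   RETURN;
END;
```

With a traditional date table, each of these variations would need its own table or a flag column; here they are just one parameter value away.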

History

  • March 11, 2011: Created