Friday, May 11, 2012

One Way To Insert Many Rows Very Fast From Your .NET Application

This is my attempt to explain something that I have wanted to figure out for a long time.  I have read many articles on this subject, but they always seem to fall short for various reasons - generally by including far more detail or complexity than is necessary to answer the question or get someone going.  Thanks to the work of others who have written similar articles on this subject in the past (particularly MVP Michael Coles), I eventually figured this out.  This is my attempt to present the material in an end-to-end yet bare-bones and hopefully easily understandable format for any .NET or SQL developer who needs to do this.

Note: I've written a follow-up to this article based on the excellent feedback: Another Way To Insert Many Rows Very Fast From Your .NET Application

The issue:

Recently, I was asked to troubleshoot the startup performance of a VB.NET application with a SQL 2008 back-end.  The startup routine was taking up to a minute or so to complete in locations with a fast DB connection, and up to 15 minutes in locations with a slow DB connection.

By stepping through the application and inserting Debug.Print() statements liberally, I narrowed down the "slow" part of the routine to a loop that was inserting product IDs into a table.  Basically, when the app was launched, it would figure out which products were relevant to the active user and then insert those IDs into a table that was used with INNER JOINs in all the rest of the screens and reports.

There was a table in the SQL database that looked a bit like this (pretend there is only ever one user for this example):

CREATE TABLE ProductsAccess(
    ProductID INT NOT NULL,
    PRIMARY KEY (ProductID));

 

The VB.NET code that initialized the product list for the user in the DB was something like this (using typical ORM stuff):

 

'Delete all records in the table
ProductsAccess.DeleteAll()

'now populate all products that the user can choose in a loop
For i = 0 To objProductList.GetUpperBound(0)
    ProductsAccess.Insert(objProductList.Item(i).ProductID)
Next

 

Behind the scenes, this meant at least 1 call to the database per row to be inserted, plus the first call to clear out any existing rows.  Since it was typical for our users to have several thousand products available to them, this could take a very long time, particularly with remote users on high-latency connections.

 

It wasn't the amount of data that was the problem - it was the number of calls.  Instead of saying "Insert product 1", "Insert product 2", "Insert product 3", etc., I needed to say "Insert products: 1, 2, 3", etc.

 

What I did:

Here is the simple version of how I changed the insert routine to cut the number of database calls to insert "n" rows from "n+1" to "always 2":

 

I created this stored procedure:

CREATE PROC ProductsAccessXMLInsert(@ValidXMLInput XML)
AS BEGIN
    INSERT INTO ProductsAccess (ProductID)
    SELECT Col.value('@ID','INT')
    FROM @ValidXMLInput.nodes('//Products/Product') Tab(Col)
END

 

Then I modified my .NET code to look something like this (note that I left the DeleteAll method untouched):

 

'Delete all records in the table
ProductsAccess.DeleteAll()

'Build a valid XML string (you should use System.Xml in real code - this is just a demo)
Dim strBuffer As New Text.StringBuilder
strBuffer.Append("<Products>")
For i = 0 To objProductList.GetUpperBound(0)
    strBuffer.Append(" <Product ID=""" & objProductList.Item(i).ProductID.ToString & """ />")
Next
strBuffer.Append("</Products>")

'now populate all products that the user can choose by passing a single XML document
ProductsAccess.InsertXML(strBuffer.ToString)

 

How it works:

The XML generated in the loop would look something like this:

<Products>
  <Product ID="1" />
  <Product ID="2" />
  <Product ID="3" />
</Products>

 

The new InsertXML() method called the new ProductsAccessXMLInsert stored procedure, passing all of the generated XML in only a single call to the DB.

 

Public Sub InsertXML(TheXML As String)
    Using objConn As New SqlConnection(DBConnectionString())
        objConn.Open()
        Using objCmd As New SqlCommand("dbo.ProductsAccessXMLInsert")
            objCmd.CommandType = CommandType.StoredProcedure
            objCmd.Connection = objConn
            objCmd.Parameters.Add("@ValidXMLInput", SqlDbType.Xml)
            objCmd.Parameters(0).Value = TheXML
            objCmd.ExecuteNonQuery()
        End Using
        objConn.Close()
    End Using
End Sub

 

Inside the ProductsAccessXMLInsert stored procedure, SQL Server does a SELECT from the input XML parameter's built-in nodes() table-valued function using the XPath query I passed ("//Products/Product").  The XML "columns" are then accessible using the Col alias (this comes from the Tab(Col) alias after the XPath query).  In the SELECT clause, I referenced the ID attribute of the XML nodes and instructed SQL Server to cast it as an integer (these are the two arguments passed to the Col.value() function in the SELECT clause).  XML attribute names must be prefixed with an @ symbol.  (For element names you can leave the @ off.)  Then I'm simply doing an INSERT into the ProductsAccess table from the results of my SELECT.
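
To make the attribute-versus-element distinction concrete, here is a small illustrative snippet (not part of the stored procedure above; it assumes the same @ValidXMLInput parameter):

-- The XML in this post stores the ID as an attribute, so value() uses the @ prefix:
SELECT Col.value('@ID', 'INT')
FROM @ValidXMLInput.nodes('//Products/Product') Tab(Col)

-- If the ID were element text instead (<Product><ID>1</ID></Product>),
-- you would drop the @ and use a singleton path:
-- SELECT Col.value('(ID)[1]', 'INT')
-- FROM @ValidXMLInput.nodes('//Products/Product') Tab(Col)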

 

Performance Difference:

I did four sets of tests using insert sets of 10 rows, 50 rows, 100 rows, and 504 rows (504 being the number of products in the AdventureWorks Production.Product table, which was the test data source).  For each row count, I ran three series of tests doing the one-by-one inserts first, and then three series doing the XML inserts first.  All runs were measured using the .NET Stopwatch class.
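
For the curious, each timed run was captured along these lines (a minimal sketch; the actual test harness is not shown in this post):

Dim sw As Stopwatch = Stopwatch.StartNew() 'System.Diagnostics.Stopwatch
ProductsAccess.InsertXML(strBuffer.ToString) 'or the row-by-row loop for the baseline runs
sw.Stop()
Debug.Print("Insert took " & sw.ElapsedMilliseconds.ToString & " ms")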

 

So how much better was this single call using XML versus the standard row-by-row INSERT calls?  Here are the results:

 

(lower milliseconds to complete is better)

 

[Chart: milliseconds to complete for the looped INSERT method vs. the single XML insert at 10, 50, 100, and 504 rows]

 

Conclusion:

The XML insert method for our scenario of needing to insert many rows into our SQL database in one shot averaged more than 97% faster than looped inserts when processing 504 rows of data, and was still more than 81% faster with a mere 10 rows of data to insert.  This was averaged over 6 runs of 50 tries of each method and by my eye, the results were fairly consistent.

 

I hope you found this a useful introduction on one way to efficiently handle multi-insert scenarios with SQL Server using an XML parameter.  I would love to hear any comments or questions you might have.

 

Notes and Disclaimers:

  1. This is merely one way to insert many rows very fast from your .NET application.
  2. The term "many" may mean different things for you or your application.
  3. To keep the performance testing fair, I first refactored the existing insert method to use a stored procedure that took the product ID as a parameter, so it wasn't using wonky ORM-generated code:
    CREATE PROC DoSingleInsert(@ProductID INT)
    AS BEGIN
        INSERT INTO ProductsAccess (ProductID) VALUES (@ProductID);
    END
    Also, the inserts were processed in a tight loop using a single SqlCommand object with a strongly-typed parameter, just updating the parameter value and calling ExecuteNonQuery() for each row (a sketch of this baseline loop appears after these notes).  This is about as efficient as ADO.NET typically gets, though I would love to hear if there is a way to drive a SqlCommand object harder.

  4. Testing was done with the application and SQL Server running off two ports of the same gigabit switch to minimize the factor of network latency.

  5. To keep this post as simple as possible, I am pretending that there is only ever one user for this application - my real table also has a UserID field, among others.  Additionally, in my real solution I do a MERGE against my XML parameter, which allowed me to also eliminate the DELETE call (a rough sketch appears after these notes) - perhaps this can be a future blog post.

  6. Dear fellow developer: please don't do what I did above and use a StringBuilder to generate XML.  The moment you have a funny character such as <, >, or &, you will be in a world of pain.  You should use the features of the System.Xml namespace to generate XML (see the sketch after these notes); however, that was more complicated than I wanted to get into for this blog post.

  7. The slow part of the original code in this application was basically the last four lines of code in a massive subroutine.  I had seen dozens of things earlier that I thought might be the culprit, but actually were OK.  In doing this work, I was reminded of the lesson to not do premature optimization of code; I could have spent an hour or two making all the "kind-of fast" code "super fast" and the end result would have been completely unnoticeable due to the four lines of boat-anchor DB access code which were the main problem.
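
For reference, here is roughly what the row-by-row baseline loop from note 3 looked like (a sketch only; it reuses the DBConnectionString and objProductList names from the examples above):

'Baseline for the timing tests: one SqlCommand, one strongly-typed parameter,
'and one ExecuteNonQuery() call per row.
Using objConn As New SqlConnection(DBConnectionString())
    objConn.Open()
    Using objCmd As New SqlCommand("dbo.DoSingleInsert", objConn)
        objCmd.CommandType = CommandType.StoredProcedure
        objCmd.Parameters.Add("@ProductID", SqlDbType.Int)
        For i = 0 To objProductList.GetUpperBound(0)
            objCmd.Parameters(0).Value = objProductList.Item(i).ProductID
            objCmd.ExecuteNonQuery()
        Next
    End Using
End Using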
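
And here is a rough sketch of the MERGE idea from note 5, simplified to the single-user table used in this post (the procedure name is made up, and my real version also filters on UserID, which is not shown here):

CREATE PROC ProductsAccessXMLMerge(@ValidXMLInput XML)
AS BEGIN
    -- Insert IDs that are in the XML but not in the table, and delete IDs that are
    -- in the table but not in the XML - no separate DELETE call needed.
    MERGE ProductsAccess AS target
    USING (
        SELECT Col.value('@ID','INT') AS ProductID
        FROM @ValidXMLInput.nodes('//Products/Product') Tab(Col)
    ) AS source
    ON (target.ProductID = source.ProductID)
    WHEN NOT MATCHED BY TARGET THEN
        INSERT (ProductID) VALUES (source.ProductID)
    WHEN NOT MATCHED BY SOURCE THEN
        DELETE;
END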
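
Finally, regarding note 6: one safer way to build the XML is with LINQ to XML (System.Xml.Linq), which escapes characters like <, >, and & automatically.  A minimal sketch:

'Requires an Imports/reference for System.Xml.Linq
Dim xmlDoc As New XElement("Products")
For i = 0 To objProductList.GetUpperBound(0)
    xmlDoc.Add(New XElement("Product", New XAttribute("ID", objProductList.Item(i).ProductID)))
Next
ProductsAccess.InsertXML(xmlDoc.ToString())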

See Also:

Creating a deep hierarchy with FOR XML

Using the MERGE Statement

17 comments:

  1. Nice article!

    I'd written an article on the very same subject a few days back that you might want to check out below:

    http://nilthakkar.blogspot.in/2012/05/how-to-pass-multiple-records-to-stored.html

    Keep Writing!

  2. SQL Server 2008 introduced Table Valued Parameters - this would be the recommended way to insert multiple records into SQL Server 2008 and above.

    I would choose this over XML to avoid the overhead of XML parsing in the server.

    See here:

    http://msdn.microsoft.com/en-us/library/bb675163.aspx

    1. Thanks for the comment Oded. I'm a huge fan of Table Valued Parameters, particularly for scenarios just like this where they represent generic lists of primitive IDs, but in my experience they have two practical drawbacks for use with external calls. The first is that using a type mandates creating the type in the database ahead of time, whereas the XML method is always available to a developer even without using a stored procedure or function. The second is that Table Valued Parameters don't enjoy first-class support in some popular ADO.NET wrappers such as LINQ to SQL, whereas parameterized stored procedure calls generally do.

      You bring up a good point, though. I'd like to see what the performance difference is between this XML-based implementation and the TVP method.

  3. Hi

    Thank you for the post. This would have been very handy last week, when I had to figure this out.

    For those who want to use DataTable.WriteXml or DataSet.WriteXml, keep in mind that you can specify the column mapping of the DataColumns with DataColumn.ColumnMapping = MappingType.Attribute to get XML that will work with the above SP.

  4. Hi

    Thank you for the post.

    This was the only possibility besides SqlBulkCopy until SQL Server 2008.

    User-defined table types came with the release of SQL Server 2008 and are the obvious choice for today's use.

  5. I have written a follow-up that compares the XML parameter method to a Table-Valued Parameter (TVP) method.

    http://nycdotnet.blogspot.com/2012/05/another-way-to-insert-many-rows-very.html

  6. This is a wonderful technique for improving performance in SQL Server 2005, but ONLY in SQL Server 2005. I am really being more informative than negative when I say this, but PLEASE DO NOT USE THIS TECHNIQUE in SQL Server 2008 or newer. The new Table-Valued Parameter functionality (TVP) is clearly the way to go for several reasons:

    1) no wasted time creating the XML in the app layer
    2) no wasted time or CPU in the DB shredding the XML
    3) keep strong datatypes in the transport
    4) ability to stream!!!

    I wrote an article a few years ago on SQL Server Central detailing the new TVP techniques (standard and streaming) which can be found here:

    Streaming Data Into SQL Server 2008 From an Application

    Again, I applaud the author for promoting set-based operations which are clearly faster than row-by-row, but this article / post really needs to start with a caution to readers that it ONLY applies to SQL Server 2005. I can see no real benefit to using XML over TVPs in SQL Server 2008 or newer.

    Take care,
    Solomon...

    1. Hi Solomon,

      Thanks for taking the time to read and comment on this blog post. I greatly enjoyed reading the article that you linked and I expect to take advantage of the streaming technique you have demonstrated in the future. I have written a follow-up to this post that shares an implementation of an insert stored procedure using a TVP (non-streaming), and provides a performance comparison between that and the XML parameter solution proposed above.

      I would speculate that for the count of rows you were handling in your article (totaling 133 MB of data), the TVP-based solution would offer significantly better performance versus an XML parameter based on the results of my testing. 133 MB is a fairly large amount of data, and closer to a bulk-insert scenario than a typical application “save” scenario which was the subject of my post. For the count of rows that I was describing in my article (up to 500), the performance difference appears to be a toss-up for scenarios under 100 rows between XML and TVP, and TVP only gets statistically faster beyond 100 rows (even at 500 rows the difference would not be noticeable to a user unless your server was near capacity).

      The results of my comparison of TVP and XML performance, along with the other “meta” advantages for XML that I describe in the follow-up blog post, lead me to disagree that XML-based solutions should only be used in SQL 2005 (and not with SQL 2008) for inserting many rows. I think it really depends, and those who want to choose the best option will need to consider which works better in their environment and which fits better with their application – so I’m basically agreeing with you that TVPs are good, but I disagree that they’re always better when considering all the angles.

      Thanks again for your very informative article and your comments!

  7. This functionality also works well with other languages such as VB6. I'm stuck with it for the foreseeable future.

  8. Did you consider the ADO.Net DataAdapter to do a batch update?

  9. I use SqlDataAdapter to update tables. Googling for ways to make it more efficient, I see that you can set a batch size so - as in your methods - it only needs 1 round trip to the server for many records.

    E.g.
    Dim Adapter As New SqlDataAdapter()
    ...
    Adapter.UpdateBatchSize = 20000

    I hoped this would get the sort of performance you're getting without rewriting lots of my code (I load a LOT of files this way!). However, when I run this, it doesn't seem to save any time; updating 16k records takes about 19 seconds for me, regardless of batch size.

    Has anyone else used a data adapter method with UpdateBatchSize?

  10. This comment has been removed by the author.

  11. Odd way to show your improvements. Your results are much better than you have stated. XML is anywhere from 5 to 34 times faster according to your timed results or 500% to 3400% improvement not 76% to 97%. Or you could say XML took only 3% to 24% as long.

  12. What appears to be absent is addressing the overall architectural shortfall, which shouldn't require much more code and, in fact, may even require less code than the revision provided.

    As I read the original issue, the very first thing I identify is: where do the product IDs come from in the first place?
    It would seem they come from a table within the same database as the table they're being inserted into.

    If this is the case, and it most likely seems so, then:
    A) only a single SQL call from .NET is needed, not two;
    B) .NET doesn't need to first call SQL to get the IDs, then
    C) "compile" the list to send back to SQL, and then
    D) call SQL once again to get the product IDs inserted

    A solution I proffer is:

    1) Create a SQL StoredProc that
    A) accepts the user's ID [and any other params for additional search criteria] to get the resulting/desired set of product IDs
    B) Deletes the current product IDs (based on UserID) from the "ProductsAccess" table
    C) Uses INSERT/SELECT to get & insert the current desired product IDs (use other criteria as desired) from the "Products" table to the "ProductsAccess" table
    D) Performs a final SELECT with INNER JOIN on the "Products" table <== 1_N ==> "ProductsAccess" to return a paged (e.g. small) set of initial products for display.

    Something like:
    prcProductAccessInitByUserCriteria(userID int, param1 int, param2 nvarchar(2))

    Consider having a separate proc
    prcProductAccessGetByUserCriteria(userID int, param1 int, param2 nvarchar(2))

    that does the paged select and can be called initially by the above "InitByUserCriteria" proc and then subsequently from .NET's paged DataGridView.

    1. Thank you for the comment Keith. I understand where you're coming from. However, you must be willing to suspend disbelief and accept somewhat contrived scenarios sometimes for blog posts. I have attempted to walk the fine line of providing just enough example code and context so that my example is useful for learning and easily reused in someone else's program, without being overcomplicated, inaccessible, or too specific.

  13. One thing that is conspicuously missing is wrapping the multiple individual inserts in a transaction. My understanding is that each of the individual INSERT statements executed within your loop runs as its own transaction. This means that for each insert, there is a transaction log entry that is committed to disk. While I don't know if it will perform better than your XML solution, as I haven't tested it myself, I would bet that it performs significantly better than the loop that doesn't execute within a transaction.

    If you were to wrap your statements in a transaction like this (excuse my VB.Net, I'm rusty), you would end up with significantly less disk IO due to only a single transaction commit being executed for the entire batch. Also, just as a side note, when using a "Using" block (if it's the same as the C# using statement), you don't need to explicitly close your connection:

    Using objConn = New SqlConnection()
        objConn.Open() 'the connection must be open before BeginTransaction
        Dim tn As SqlTransaction = objConn.BeginTransaction()
        Dim objCommand As New SqlCommand(..., objConn, tn)
        .... put your loop here
        tn.Commit()
    End Using

    If you're going to go through the trouble of creating an XML document to do inserts and have access to the file system, a BULK INSERT statement is the absolute fastest way to get data into SQL Server tables.
