Thursday, March 28, 2013

SSIS - Using Caching in Loops

Earlier this week I had an SSIS 2008R2 package that I wanted to try and cut down the runtime from over 5 total hours. This package does a lot of work in many separate Data Flows, but a bulk of the time in this package is spent in a For Each loop that iterates through "snapshots" stored a database. These "snapshots" are not actual database snapshots, just a day, month or quarter worth of data that get loaded into a database and uses a unique ID (date/sequence number in this case) on every single row. Normal processing of this database is to just include a WHERE clause in your SQL statements to filter by the appropriate "snapshot". The For Each loop that I had setup needs to do a bunch of logic for different tables within this database, but it needs to do this filter by "snapshot" to cut down on the number of rows being processed.

For regular OLE DB source queries you can specify these kind of parameters very easily in SSIS using a ? in your query and then telling SSIS what variable to plug into that ? at runtime. That is perfect for that use and I used that as much as possible in this setup, but my other issue is that it has many Lookups that need to be done within the For Each loop as well and because of the way the Lookup queries are setup you cannot specify the same kind of parameter/variable based queries. I know that there are some ways to "fake" this out by changing the caching mode from Full Cache to Partial Cache, but that solution requires a lot of crazy changes to make it work and at that point the package becomes very difficult to support.

After doing some more digging around I found out about using Cache Connection Managers and the Cache Transform to load this Cache Connection Manager (I had not previously used either of these components). Initially I thought that this could solve my issue, so I setup a new Data Flow as the first step inside the For Each loop and load up a few of these Cache Connection Managers using the same parameter based queries and the variable from my For Each loop to create the appropriate filtered datasets for use in my Lookups. Then the Lookups can be modified very easily to use these Cache Connection Managers (you do have to setup Index columns in the Cache Connection Manager for every column you are going to use for mapping in the Lookup).

All of this was very quick and easy to setup and with that I kicked off a test and it all worked perfectly for the first time through the For Each loop, then the package ended with the following error message:

[Cache_xxxxx] Error: The component "Cache_xxxxx" (22) cannot write to the cache because component "Cache_xxxxx" (22) has already written to it.

At this point I knew that I was not going to be able to use the Cache Connection Managers within a For Each loop and get unique result sets for each time through the loop, since Microsoft has setup these Cache Connection Managers a write once object. Very disappointing that Microsoft did not find a way to make it so that these Cache Connection Managers can be purged and reloaded. I have heard of some people doing something like this by writing the cache data out to files instead of keeping it in memory, and that solution sounds good, but I didn't want to go do that route since I don't have any access to the servers themselves to make sure it works and I was on a very strict deadline to get a solution working.

Unfortunately until Microsoft makes some updates to how these Cache Connection Managers work, I don't really see an easy way to use them in this context, which is too bad, because that first run through the loop was lightning fast. And yes, I did try to move the Data Flow that fills the Cache Connection Managers outside of the For Each loop and just have it load everything. The solution did not work either in my case because I needed to load over 24 million rows of a very wide table when it is the entire dataset, instead of the "snapshot" filtered version I wanted to do.

Now that I have some experience using the Cache Connection Managers and Cache Transform, I'm going to keep my eyes on updates that are made to both of these components to see if they are more useful in the future in loops.

Stay tuned for a future blog post where I will show all of the setup steps for using Cache Connection Managers and Cache Transform along with how it works with the Lookup transform in detail (with plenty of screenshots).