Why Use LinQ To SQL (Part 1 - Performance Considerations)?
[This is part 1 of an ongoing series. View Part 2 - Security]
WARNING & DISCLAIMER: This is a long post, split over a series, as it discusses some old, well-entrenched beliefs, practices, notions and 'best practices' that we as an industry have dragged along with us over the years. Some of these were always valid, and remain so, others were the result of particular tools or platforms or even just particular versions of these. In addition, this post assumes some basic familiarity with Linq To SQL in terms of what it aims to do and some of how it works. If you are not too familiar with the topic, or even if you know some parts but want a full overview, have a look at this MSDN overview of Linq To SQL (it's a pretty in-depth overview, but there do appear to be one or two out of date elements as it was written in the beta timeframe).
I have recently begun working on a new enterprise ASP.Net project and, as with most such projects, we have been faced with some essential decisions to make. Fortunately, we had some leeway with tools and so it was decided that the project would be done in Visual Studio 2008 and, where possible, would use .net 3.5, but, where not possible, to at least leverage multi-targeting in the new Visual Studio 2008 IDE. One of the decision we were faced with, though, was the choice of a data access framework or approach. The client wanted to standardize on Enterprise Library, but did not in mind originally intend on making use of any specific object persistence or OR/M frameworks and initial work had begun on rolling our own data access code (Sprocs, data classes in repositories to wrap these sproc methods, business classes, again with corresponding code for each sproc ("method"), etc. etc.). Early into this work an effort was made by a few of us to evaluate this approach and comment on it's particular implementation.
However, this lead to a questioning of the approach itself for a variety of reasons. I had strong opinions on this, having worked the spectrum from hand rolling to code generation to OR/Ms to typed datasets over the years and seeing the advantages and disadvantages of each approach. At worst, I was strongly in favour of making use of code generation in the form of at least the powerful free tools like MyGeneration or CodeSmith, but this lead to a bit of thought and evaluation of some of the frameworks available, from simple object frameworks like dOOdads and EasyObjects.net to more advanced OR/Ms like nHibernate and LLBLGen. In the end, we have settled on Linq to SQL and I wanted to put down some of the many hours of discussions, ideas, concerns and thoughts that came up during our process in the hope that it will be of assistance to those facing similar evaluations.
WARNING 2: Now, the topics contained in this and the subsequent posts are going to touch very closely to many people's hearts, so EXPECT TO DISAGREE WITH ME on some of these points. I do not claim to be an outright master of all things data access related, but as I mentioned I've played a pretty broad spectrum over the years with these tools and I drew out what I thought was important both at a general level but also specifically based on the client and project needs. Your situation may differ. It may even differ quite considerably. However, I think a lot of these points are shared, so hopefully we'll see some common ground. This post is not aiming to be a comprehensive evaluation of every framework, product and approach available. It is just the choices we made in a finite time with a fixed deadline and deliverable to achieve, but one that we thought would hopefully stand for a reasonable amount of time. Basically, a normal project with real constraints <g>.
Evaluation Criteria
We discussed a number of key factors, but early on we decided to prioritise these so that we could focus our efforts and assist us in evaluating an approach (thanks Matt for keeping us on track with this!). Our top priorities (there were others lower down), in order, looked something like this:
- Performance
- Security
- Development Time and Effort
- Maintenance Time and Effort
- Stability
Under each of these priorities, a number of factors became apparent and important, and I'll elaborate on these below.
1) Performance
The project is a reasonably high volume web site. In addition, the choices we were making now and the approaches we were following would be adopted by other teams in the organisation (which is an important factor under some of the other topics), some of whom have even more stringent data performance requirements than we do. As a result, whatever approach we used would need to be pretty performant (yes, this can be a real world, depending on who ... you ... ask - I particularly like the last one!).
I did some of my own analysis, but nothing quite as stringent as I suspected Microsoft would be doing themselves. Thankfully, I was able in the end to turn to an interesting post series by Rico Mariani, who has been focusing specifically on the performance on Linq to SQL. His full article series can be found here but it provided a detailed overview and some concrete statistics that we could make use of and is definitely a good read. In summary, he found the following: Linq to SQL, on raw reads in what he feels is a worst case scenario for Linq, results in a 7% performance decrease over raw, right-at-the-mettle SQLDataReader. I'm not sure I agree completely with why he thinks this 7% is higher than it would be, but still we're talking about single-digit drops in performance, so nothing drastic. Not great news though. On the other hand, this is against raw SQL Data Reader, even using ordinals over column names. We were not alone in considering Enterprise Library and despite its advantages, its a pretty heavy framework, so we were definitely going to close the gap a bit there. In addition, a large proportion of our data work is going to be cached, possibly even at multiple points, so the read performance issue becomes even less of a concern.
Incidentally, Linq To SQL queries can actually be compiled as well, but there is apparently only a gain relative to the first time the non-compiled query is executed, during which time it gets compiled anyway.
By default, Linq to SQL will generate dynamic SQL. What makes it an OR/M (of sorts) is in part this ability to take a defined model and automatically Map the Objects to Relational concepts for you, without you needing to write the code. For instance, the following Linq to SQL code:
using (NorthwindDataContext dc = new NorthwindDataContext())
{
dc.Log = Console.Out;
var categories = from c in dc.Categories
where c.CategoryName.StartsWith("B")
select c;
foreach (var c in categories)
{
Console.WriteLine(c.CategoryName);
}
}
will generate the following SQL statement automatically
SELECT [t0].[CategoryID], [t0].[CategoryName], [t0].[Description], [t0].[Picture]
FROM [dbo].[Categories] AS [t0]
WHERE [t0].[CategoryName] LIKE @p0
Now, Linq To SQL can make use of stored procs. In fact, it is quite flexible with this. You can for instance specify just Selects, just some Updates, some combination of Deletes and Inserts, and so forth. In isolated cases this might be of much value. By and large though, I feel you would be losing a lot of value of Linq's dynamic queries if you did this across the board for all database access.
However, the client is very stored procedure-heavy, and so team members maintained the long-held sprocs performance improvement belief. Frans Bouma (the creator of LLBLGen) discusses this in a blog post from way back 2003, where he quotes the SQL Server 7 and 2000 Books Online:
SQL Server 2000 and SQL Server version 7.0 incorporate a number of changes to statement processing that extend many of the performance benefits of stored procedures to all SQL statements. SQL Server 2000 and SQL Server 7.0 do not save a partially compiled plan for stored procedures when they are created. A stored procedure is compiled at execution time, like any other Transact-SQL statement. SQL Server 2000 and SQL Server 7.0 retain execution plans for all SQL statements in the procedure cache, not just stored procedure execution plans.
There are some other possible issues relating to the use of stored procs with Linq to SQL, including security and business logic, but those will come up later, never fear ;->. One point that was raised on the use of sprocs is that you can be explicit about the use of NOLOCK on your queries, which you can't quite do in Linq to SQL. I'm not 100% sure on this one, but apparently you can achieve the same effect using the SetTransactionIsolation method on the SQL connection, which you can access via the Linq to SQL DataContext.
Another pro-sprocs performance argument was that sprocs can be optimised by the DBAs without needing interaction from the dev team in a pinch (like on late-night standby). This belief is akin to
"Your DBA team can work completely independently of your programming team and do pretty much whatever they want and your application doesn't care!" (from a blog post by Eric wise in 2006).
Jeremy Miller, a big agile guy and the creator of one of StructureMap, a popular DI framework, argues strongly against this quite correctly that this is an almost absurd statement. Your stored procs should almost never be changed without some sort of testing of the application code, which means you need developers involved anyway. In addition, just as you have DBAs on standby, you need devs on standby too for exactly these reasons.
Another argument against the consideration of better performance through stored procs is dependent on their usage. There are at least two ideas one can look at on this. Having truckloads of them around, the client also wanted to 'leverage and reuse their existing investment', but Dennis van der Stelt comments on the use of projections to get data back from a stored Linq to SQL, and shows how Linq To SQL can outperform a stored proc at times simply because it can explicitly always only bring back the columns needed (see point number 3 in his article). Here's another view on the matter though. One argument I heard from DBAs is that they can modify and tune sprocs when they see them during a QA cycle. However, consider the following SQL code that might appear in 2 separate sprocs:
INSERT INTO [Categories]
([CategoryName], [Description] ,[Picture])
VALUES
(@CategoryName, @Description, @Picture)
And the 2nd sproc:
INSERT INTO [Products]
([ProductName],[SupplierID],...,[Discontinued])
VALUES
(@ProductName,@SupplierID,...,@Discontinued,)
these are perfectly valid looking sprocs if this is all they contain. However, what if a new category is added at the same time that 15 products are added to it. This would result in 16 separate database calls being required, but is not necessarily something the DBAs would pick up just QA'ing the sproc code. In Linq to SQL, this would all be batched into 1 single database call by default. In fact, in his article series linked to above, Rico Mariani states that despite the possible 7% drop in read performance, on the remaining database operations, because of this batching, Linq To SQL can achieve up to 400% gains in performance because of this.
To be clear, there are ways around this using other means. With sprocs, it is possible to create complex sprocs that take in a whole bunch of parameters and update multiple tables. However, the gain in performance becomes a tradeoff in maintenance complexity down the line, something I am very strongly against. I have the same argument about using the XML Bulk Upload features in SQL Server, another possible approach here.
So, that knocked sprocs off. The next point we faced was looking at the concept of Lazy Loading. Some of the people in the project group had heard correctly that Linq did this magic stuff, whereby whatever properties on an object, either primitives like strings and ints or even child collections (i.e. linked tables) would be loaded on demand from the database when you use them. This makes coding easier, but at the expense of potentially serious performance implications. For instance, consider the following code:
using (NorthwindDataContext dc = new NorthwindDataContext())
{
// first sql call
var categories = from c in dc.Categories
select c;
decimal categoryTotal;
foreach (var category in categories)
{
categoryTotal = 0;
// another sql call per category(!)
foreach (var product in category.Products)
{
if (product.UnitPrice.HasValue)
categoryTotal += product.UnitPrice.Value;
}
Console.WriteLine(categoryTotal);
}
}
In this instance, even though we're explicitly requesting the Categories, when we call for the Products later, it will result in a call to the database for every single category. So, if you have 12 categories, it will result in 13 separate calls, one to retrieve the categories, then one again per category to retrieve its products. Now, as with the stored procedure discussion, what we are encountering here is not an issue with Linq to SQL specifically, but rather with the concept of lazy loading. Oren Eini ("Ayende Rahien"), the creator of nHibernate, has some comments on this, where he in turn comments on Frans Bouma's opinion. They argue the opposite case and their tools reflect this views. If you read Dinesh Kulkarni's (one of the key guys on Linq To SQL) opinion you'll see that he is in the same camp as Frans that Lazy Load is potentially more trouble than good, but Linq to SQL is just one framework that offers this. As Oren says "It is something that can cause some really bad performance if you are misusing it, but so can any tool in the tool box. (You had better believe that a hammer can do some serious damage)." but he continues that at least "You certainly want to be aware of what is going". In Linq To SQL you can turn this Lazy Loading off in one of two ways. First of all, to just simply turn it off, you can set
dc.DeferredLoadingEnabled = false;
However, this will simply disable linked object loading. If instead you want to explicitly return child objects, you can do this using the DataLoadOptions, as follows:
using (NorthwindDataContext dc = new NorthwindDataContext())
{
System.Data.Linq.DataLoadOptions dl = new System.Data.Linq.DataLoadOptions();
dl.LoadWith<Category>(c => c.Products);
dc.LoadOptions = dl;
var categories = ...;
}
What this translates to is: For any queries I do against this DataContext, where these queries return a Category, also return any child Products. This can be multi-level so that Customers can be set to load all Orders, and Orders can in turn load all linked OrderDetails (DataLoadOptions has some other possible uses, but I'll hopefully be blogging about those soon). To sum up, as with any framework that supports Lazy Loading, be aware of what it will do by default, and be aware of how to turn it on and off. One of the features of Linq To SQL, in my opinion, is that it lets you control where and when you want this option, compared to some frameworks which are all-on, or all-off for the entire model, all the time. Incidentally, if you want re-use of the DataLoadOptions settings for your DataContext, consider putting the code into a public method on a partial class of the DataContext.
[Update: For a more detailed discussion on DataLoadOptions, check out my post on LoadWith and this one on AssociateWith]
Well, it seems this post is getting a little long, so I'm going to wrap up here and continue with the remaining topics soon. Stay tuned and please give feedback,
- H