Tuesday, September 18, 2012

Entity Framework: Expressing the Missing LINQ

I have worked on quite a few projects that use the Entity Framework. It is a powerful ORM and does quite a lot out of the box. I've worked with code-first, database-first, and every other flavor in between. If I were to name the one reason I believe developers enjoy working with LINQ the most, it would be the support for LINQ via LINQ to Entities.

Download Source Code for this post

What's not to like? You can query data in a very easy, straightforward manner that is consistent. Unfortunately, the Entity Framework can wreak havoc on an otherwise stable web application if it is not handled with care. There are a number of "gotchas" you will run into (updateable materialized views, anyone?) ranging from improper use of the context used to access the database to "features" of LINQ that can become defects in production. In this post I will focus on two very subtle LINQ problems I see people run into quite often.

It Will Be Deferred

The first is probably the easiest to understand and the fastest to catch in testing. Consider a very contrived model called Thing:

public class Thing
{
    public int Id { get; set; }
    public string Value { get; set; }
}

The code-first definition for a database of things is simple enough:

public class SampleContext : DbContext 
{
    public DbSet<Thing> Things { get; set; }

    protected override void OnModelCreating(DbModelBuilder modelBuilder)
    {
        modelBuilder.Conventions.Remove<PluralizingTableNameConvention>();
    }
}

I could use an initializer to then seed the data but I got really lazy and created a controller instead that does this (not recommended for production code but will work perfectly well to illustrate the points in this blog post):

public class ThingsController : ApiController
{
    public ThingsController()
    {
        using (var context = new SampleContext())
        {
            if (context.Things.Any()) return;

            for (var x = 0; x < 1000; x++)
            {
                var thing = new Thing
                                {
                                    Value = Guid.NewGuid().ToString()
                                };
                context.Things.Add(thing);
            }
            context.SaveChanges();
        }

    }
}

As you can see, I'm ensuring we have at least 1000 things to deal with. You'll notice this is an ApiController that makes it very easy to expose a REST service on top of the collection. In fact, I'm going to do just that - first with a method that returns the full list:

public IEnumerable<Thing> GetThings()
{
    using (var context = new SampleContext())
    {
        return (from thing in context.Things select thing);
    }
}

The Get convention will automatically map this to /api/Things and then I can get a nice list of them by navigating to the local URL, correct? Not quite. This is the first and most common mistake made: forgetting about deferred execution. The query is passed back to the controller, which faithfully tries to serialize it by enumerating the list ... only by that time, you have left the using block for the context and therefore the connection is closed. This will fail every time until you find a way to force execution before disposing the context ... the easiest way is by converting it to a list like this:

public IEnumerable<Thing> GetThings()
{
    using (var context = new SampleContext())
    {
        return (from thing in context.Things select thing).ToList();
    }
}

Casting to a list ensures it is enumerated (and thus executed) immediately to populate the list.

Expressive Functions

The second issue is far more subtle. It is very easy to get excited about using LINQ to access your queries, and it's not uncommon to have a strategy that involves passing in filters and order clauses to your repository. Consider a basic LINQ query that looks like this:

var query = from thing
                in context.Things
            orderby thing.Value                            
            select thing;

Now, what if you wanted to dynamically filter this based on various options? You might go down this path:

Func<Thing,bool> filter;

You can then assign the filter:

filter = thing => thing.Value.Contains("-e");

And attach it to the query, execute it, then return the result:

var modifiedQuery = query.Where(filter);
return modifiedQuery.ToList();

In fact, if you pull down the sample code from this link and use the following URL: http://localhost:XXXX/api/Things/?useFunction=true you will see the result (replace XXXX with your port number), something like this:

[{"Id":229,"Value":"031662bd-14be-4562-9b34-e13ab193b112"},{"Id":330,"Value":"04a35727-9b64-4b5d-99fd-e421fe7340d7"}...]

If you compare this to the full result set, you'll see the filter worked fine. Many developers will be satisfied at this point and move onto other things — even integration tests for the filter will likely pass. But is this doing what you want? If you are like me, you never trust the ORM. You always want to know how it is interpreting what you send so you will profile and trace and verify results. If you run a trace on the above code (in the sample project, it will write to the console in case you don't have a profiler handy) you'll find the query that is passed to SQL looks like this:

SELECT 
[Extent1].[Id] AS [Id], 
[Extent1].[Value] AS [Value]
FROM [dbo].[Thing] AS [Extent1]
ORDER BY [Extent1].[Value] ASC

At this point you probably see the problem. We're dealing with 1000 records in this example, but what would happen if we had 1000000? The problem is the query is loading all records from the database, then filtering the resulting list. No matter how clever your filter is, you are always pulling the entire table and then using LINQ to Objects to filter it in memory. Probably won't scale, you think?

The solution is very simple. What you created above was a function that is passed to the query. The LINQ to Entities provider doesn't know how to map a function to the database, so it handles the part it understands and then applies the function. (As a reader was kind enough to comment below, technically the filter never gets passed to the Entity Framework ... there is an extension method for expressions, but not functions, so the query is cast to an enumerable after the call to EF and then the filter is applied.) There is only one change you need to make for this to work:

Expression<Func<Thing,bool>> filter;
That's it! Change the definition. You can assign and execute it exactly the same way:
filter = thing => thing.Value.Contains("-e");
var modifiedQuery = query.Where(filter);
return modifiedQuery.ToList();

While the returned result is the same, the SQL is slightly different. Hit the URL http://localhost:XXXX/api/Things?useFunction=false instead, and what you'll find is:

SELECT 
[Extent1].[Id] AS [Id], 
[Extent1].[Value] AS [Value]
FROM [dbo].[Thing] AS [Extent1]
WHERE [Extent1].[Value] LIKE N'%-e%'
ORDER BY [Extent1].[Value] ASC

Now the filter is being passed to SQL, so it can do what SQL is good at: filtering and ordering records. It's a subtle but important difference. In both cases, you designated the filter using a lambda expression, but the first forced it into a function whereas the second method loaded it into an expression tree. The LINQ to SQL provider does know how to parse an expression tree and map it to SQL, so you get the desired result by using LINQ to Entities and SQL rather than LINQ to Objects as a fallback.

Summary

The bottom line is that ORM tools make it very easy to interact with data, and even easier to have side effects that can lead to performance issues and production defects. It is very important to understand the technology you are working with and dig into the nuances of how LINQ works under the covers and the relationship between delegates, lambda expressions, and expression trees. I've heard it come up in interviews that I sometimes ask "academic" or "textbook" questions, but more often than not those questions are related to real world scenarios and knowing the answer is the difference between 20 and 1000000 records in a single database call.

Download Source Code for this post

Jeremy Likness