What's in Your Collection? Part 3 of 3: Custom Collections

This is the last installment in a three part series about using collections in C#.

The entire series can be accessed here:

We've now covered the interfaces and some concrete instances of collections provided by the .NET Framework. Now you are interested in moving things to the next level. What if the provided collections simply don't meet your business requirements? What are some ways you can use the collections concept to build your own classes to solve business problems?

Yield to Iterators

The first important thing to understand when you begin building your custom collection is the concept of iterators in .NET and the yield statement. I'm surprised that many people use the language without truly understanding this statement, what it exists and how it can be used.

You might have encountered yield in your journeys. If you've built custom AJAX client controls, you probably implemented IScriptControl. One method asks for IEnumerable<ScriptReference>. The implementation is usually presented as:

...
ScriptReference sr = new ScriptReference("~/MyUserControl.js");
yield return sr;
...

You could alternatively have created a List or any other collection of ScriptReference and returned that. What does yield really do for us?

To better understand, I've created a short little console application. You can create a new console project and simply paste this code to build and run.

using System;
using System.Collections;
using System.Collections.Generic;

namespace Yield
{
    internal class Program
    {
        private delegate bool DoSomething();

        private sealed class Doer
        {
            private readonly DoSomething _doSomething;
            private readonly string _msg;

            public Doer(DoSomething doSomething, string message)
            {
                _doSomething = doSomething;
                _msg = message;
                Console.WriteLine(string.Format("{0}: Ctor()", _msg));
            }

            public bool Do()
            {
                Console.WriteLine(string.Format("{0}: Do()", _msg));
                return _doSomething();
            }
        }

        private sealed class DoerCollection : IEnumerable<Doer>
        {
            public IEnumerator<Doer> GetEnumerator()
            {
                yield return new Doer(() => true, "1");
                yield return new Doer(() => false, "2");
                yield return new Doer(() => true, "3");
                yield break;
            }

            IEnumerator IEnumerable.GetEnumerator()
            {
                return GetEnumerator();
            }
        }

        private static void _DoIt(IEnumerable<Doer> doerCollection)
        {
            foreach (Doer doer in doerCollection)
            {
                if (!doer.Do())
                {
                    break;
                }
                Console.WriteLine(".");                
            }
            Console.WriteLine("..");                
        }

        private static void Main(string[] args)
        {
            _DoIt(new DoerCollection());
            _DoIt(new List<Doer>
                      {
                          new Doer(() => true, "4"),
                          new Doer(() => false, "5"),
                          new Doer(() => true, "6")
                      });

            Console.ReadLine();
        }
    }
}

So let's walk through the code.

First, I define a delegate called DoSomething that simply states "I want a method that takes no parameters and returns a boolean." This is a contrived example, of course, but in the "real world" you may have a pipeline or chain of responsibility that performs actions and then returns a status indicating that the process should continue or there is another node to consider, etc. I encapsulated the delegate in the class Doer. The constructor takes a "message" and an implementation of the delegate. The only reason I pass in the message is to track which object is doing what. What's important here is to see when the classes are created compared to when the main method is called, which simply invokes the delegate.

Next, I created my custom collection, DoerCollection. This is a collection of "activities" to perform. Obviously I am simply returning true or false in the example, but again, in a real-world scenario this could be a file system processor that iterates through a directory and returns files until no more can be found, or calls a web service and returns the status ... you get the idea. Notice that I simply yield return different instances of Doer that I pass the delegate implementation and a unique message identifier. If you recall from the first article in this series, this class is a collection because it implements IEnumerable.

The DoIt method takes any collection typed to the Doer class, and loops through the classes calling their "do" method until false is returned. It also emits some output just to demonstrate how it is looping, etc.

Finally, we get to implementation. The whole point of this example is to demonstrate how the yield command operates. We perform the exact same function on two very similar collections. The first pass uses an instance of my custom collection. The second pass creates a list and passes that into the method. What do you expect the output to look like? Compile the program and run it, and if you guessed correctly, you have a strong grasp of IEnumerator and yield.

Both collections were wired to contain three instances. Both had an instance return true, then false, then true, so the expected result would be to make it through two items and then break out of our loop. This is exactly what happens, but it's the output that is interesting. It turns out that using the List forced me to create everything up front, even if I wasn't going to use it (and who knows when that garbage collector will come by). The custom class using yield however only created two instances. The third class was never created!

Yield is nothing more than syntactic sugar for a state engine.

But wait, we are simply spinning through a collection. What do I mean by state engine??

If you recall in part 1, the key to collections is the Enumerator. An enumerator is a state engine. The "current state" represents either nothing (empty collection or already iterated through the entire collection) or an instance within the collection. The only transition this state engine can make is to move to the next item or end up in an uninitialized state.

This is what the program outputs to the console:

Yield

Now we'll pull out ildasm to peek beneath the hood. I've highlighted the DoerCollection class.

Yield IL

You'll notice that the GetEnumerator implementation actually creates a nested class behind the scenes. That class is our state engine. In red you can see the key pieces of that engine: a state, a current Doer instance, and the reference to the parent class. Highlighted is the key method called to transition state, MoveNext.

What is really interesting is pulling open the MoveNext method. I've used RedGate's free Reflector tool to reverse engineer the code. This will take the generated IL and provide a C# representation, so we can see what the actual underlying algorithm for the enumerator is.

private bool MoveNext()
{
    switch (this.<>1__state)
    {
        case 0:
            this.<>1__state = -1;
            if (Program.DoerCollection.CS$<>9__CachedAnonymousMethodDelegatea == null)
            {
                Program.DoerCollection.CS$<>9__CachedAnonymousMethodDelegatea = new Program.DoSomething(Program.DoerCollection.<GetEnumerator>b__7);
            }
            this.<>2__current = new Program.Doer(Program.DoerCollection.CS$<>9__CachedAnonymousMethodDelegatea, "1");
            this.<>1__state = 1;
            return true;

        case 1:
            this.<>1__state = -1;
            if (Program.DoerCollection.CS$<>9__CachedAnonymousMethodDelegateb == null)
            {
                Program.DoerCollection.CS$<>9__CachedAnonymousMethodDelegateb = new Program.DoSomething(Program.DoerCollection.<GetEnumerator>b__8);
            }
            this.<>2__current = new Program.Doer(Program.DoerCollection.CS$<>9__CachedAnonymousMethodDelegateb, "2");
            this.<>1__state = 2;
            return true;

        case 2:
            this.<>1__state = -1;
            if (Program.DoerCollection.CS$<>9__CachedAnonymousMethodDelegatec == null)
            {
                Program.DoerCollection.CS$<>9__CachedAnonymousMethodDelegatec = new Program.DoSomething(Program.DoerCollection.<GetEnumerator>b__9);
            }
            this.<>2__current = new Program.Doer(Program.DoerCollection.CS$<>9__CachedAnonymousMethodDelegatec, "3");
            this.<>1__state = 3;
            return true;

        case 3:
            this.<>1__state = -1;
            break;
    }
    return false;
}

You can quickly see that what is generated is really a massive switch statement. Based on the current state, it updates the current reference and changes the state. Most importantly, however, is the fact that the results of the yield are executed "on demand." In other words, it is not creating a large list, filling it with instances, and then iterating. Instead, the classes are instantiated "on demand" and then referenced for re-use later in case the collection is iterated again.

The whole key to this process is that the enumerator hides the underlying implementation. The consuming code simply knows there is a collection to iterate through. How that collection is built is up to the enumerator, which leads to very interesting possibilities. In the case of the ASP.NET page, this means that controls can be called iteratively and yield their script references and descriptors. The "master" code is simply iterating through the collection and wiring up the script references.

Thinking of collections as different ways of grouping objects is certainly valuable and can pertain to many different business situations. Understanding that Enumerator is really a state machine, however, allows you to start thinking of collections as processes. They aren't necessarily pools of instances, but can be algorithms or other processes as well. The key is that the use of the enumerator hides the implementation so that the consuming code simply iterates through something without having to understand the underlying implementation of how something is provided.

C#er : IMage

Tuesday, September 22, 2009

What's in Your Collection? Part 3 of 3: Custom Collections

Yield to Iterators