Wednesday, August 12, 2009

What's in Your Collection? Part 1 of 3: Interfaces

The collection is a powerful construct that allows a developer to logically group related elements and navigate through them. In .NET, a collection is specifically anything that implements the ICollection interface.

The entire series can be accessed here:

(If you enjoy this article, please vote for it at The Code Project)

The purpose of this series is to expose the collections provided with .NET in the System.Collections namespace and learn when and why you would use different types of collections. We'll understand the provided interfaces first, work through the concrete implementations available in the base .NET distribution then examine custom collections.

What's in a Bag?

No doubt you've dealt with collections in the past. There are many business requirements that relate to taking a group of similar elements and being able to parse, sort, and modify the set. The first such grouping most people learn about is an array, or an indexed list of items. Most languages have some sort of support for an array:

...
string[] myArray = new[] { "One", "Two", "Three" }; 
...

You can think of collections as being items in a bag, but this is not precise enough for our definition. In a bag, you can blindly reach in and grab something. A collection in .NET on the other hand allows you to enumerate through the items, so they must be indexed in a deterministic way. In this respect, it's more like a clothes rack: the clothes are hanging in a precise order and you can turn the rack and go through one piece of clothing to another. A single item is there only once, and it is a finite set of objects: you can count exactly how many exist and even tell me that one item is next to another.

Now we can take a look at the .NET supplied interfaces and start to come up with more concise definitions for different types of collections.

The System.Collections Interfaces

System.Collections Interfaces

It all Starts with Enumerator

As you can see, it really all "starts" with the interface IEnumerator. Something that implements this interface can move through the set of items. You can Reset() to the beginning, take a look at the Current() object, and get to the next one with MoveNext().

The popular for loop that exists in similar form in multiple languages really encapsulates the entire interface. A typical example looks like this:

For Loop as IEnumerator

As you can see, the concept is similar ... get me to the beginning, get me to the next, and always have a reference to the current.

Are you Enumerable?

If the answer to this question is "yes" then it means you implement IEnumerator. Classes that implement IEnumerable are willing to give away their IEnumerator via the GetEnumerator() method.

Notice that the dictionary has it's own IDictionaryEnumerator. This extends IEnumerator by adding the entry, key, and value. This brings us to another set of definitions that are important to understand:

Arrays and Lists are indexed collections. You can use an offset or integer value to indicate the "position in line." Dictionaries are Maps. Huh? A "map" basically maps one value to another. A dictionary is not so much concerned with what position someone has in line, but what are they holding in their hand? Alex maps to a hat, and Jane maps to a purse.

If you are looking to simply put some items together and move through them, then you'll be looking at lists, arrays, and similar constructs. If you are looking to map one type of object to another, then you are dealing with a dictionary (the definition goes with the word). The dictionary enumerator exposes the mapping: an "entry" (the word and its definition), the "key" (the word), and the "value" (the definition).

More on dictionaries later.

Finally, my ICollection!

Looking at the ICollection interface, you can see that collections are not the same as lists or dictionaries. A collection is a basic building block. What's important to note is that a collection, unless extended, is not indexed. I cannot say up front, "Give me the 3rd element." Instead, I must iterate through the collection using the IEnumerator provided by GetEnumerator() to find out what is available. The collection interface says I must:

  • Be able to copy my collection to an Array (then I'll have my index!)
  • Be able to provide a count of objects, so it is a deterministic set with a known, finite set of items
  • Optionally allow synchronization.

The synchronization aspect is important. Imagine maintaining a single queue of items in a multi-threaded application. As a naive example, consider holding a list of IP addresses in a web process that spawns a new thread every time someone connects. Since the list is stored in only one place (presumably as a static class accessible via the singleton pattern), there must be a way to add an item to the list without colliding with the other threads that are competing for the same resource. If my implementation can be synchronized, I return true for IsSynchronized() and provide SyncRoot() for synchronization. This is an object that acts as a mutex for access. In order synchronize access, I follow this pattern:

public void AddItemToCollection(ICollection collection, object item)
{
   lock (collection.SyncRoot())
   {
       // add item here - oops, can't do that with just ICollection, I'll
       // need something more ...
   }
}

Making a IList, Checking it Twice...

Now we get to one of the more commonly used collections: the collection that implements IList. A list is a collection, so by way of collection it is also enumerable, and therefore provides an enumerator. In addition, it can be synchronized and has a count. Let's explore what else a list can do:

  • You can add an item to the list without caring or needing to know where it gets added
  • You can clear a list and start over from scratch
  • You can see if a list already contains an item (and it is implied that add won't actually add it if the item already exists)
  • You can figure out the cardinal index of an item in the list, if it exists (so I really can ask for that "3rd item")
  • You can insert an object into the list at a specified index (instead of adding it to the tail end)
  • You can remove individual items from the list, by looking at the item or considering the position (index) of the item
  • Lists can be read-only and/or fixed size (i.e. you cannot add beyond a certain limit set when the list is created)

IList provides a powerful interface for collections that have plenty of flexibility around adding and removing items.

Have that IDictionary Handy?

The dictionary is very similar to the list, but instead of giving us an enumerator, it provides a dictionary enumerator. Because a dictionary is a map, instead of just dealing with items, we actually deal with keys and values (remember the analogy of words and definitions). The key for a list is the index, i.e. to grab an item from the middle of the list, you map it based on its position. A dictionary, on the other hand, can have a key of any type and they way you reference an item is by the key, rather than the index.

This makes for an interesting observation: a list can be thought of as a dictionary where the key is the cardinal index of the value in the list.

A Question of Equality?

The remaining interfaces relate to functions that help keep the lists organized by knowing whether or not items are duplicates and how items compare to each other.

IComparer provides a scalar comparison between two objects. Regardless of what the object is, implementing IComparer means you can have the concept of something being less than, equal to, or greater than another object. This comes in handy for sorting objects and the Compare signature of the method allows the developer to define just how the objects literally compare to each other.

IEqualityComparer, on the other hand, determines simply if two objects are equal: true or false. How you determine equality is up to you to implement in the Equals method.

Both IEqualityComparer and IHashCodeProvider define a method for returning a 32-bit integer, GetHashCode. Obviously, these interfaces are for collections that are stored as hash tables. A hash is an integer generated from a hash function that tries to generate a unique number for an object. It maps potentially complex data to a smaller fingerprint or signature that should strive to be as unique as possible (however, hash functions do not guarantee uniqueness and when two pieces of data generate the same hash, the result is referred to as a hash collision).

While .NET provides a built-in hash function for strings, the common algorithm for this will add character values to each other and ignore overflows (values that exceed the upper limit for integers). To illustrate this, build this simple console application:

using System;

namespace HashTest
{
    class Program
    {       
        static void Main(string[] args)
        {
            foreach(string str in args)
            {
                Console.WriteLine(string.Format("{0}\t{1}", (uint)str.GetHashCode(), str));
            }
        }
    }
}

You can then run it from the command line and pass unique strings, like this:

Hash Function Example

As you can see, each word generated a unique integer, regardless of the size of the word, and the same words ("of") generated the same integer. In this case, we did not have a collision. Hash functions help ease the look up of complicated classes and structures. You should take care in generating your hash function and try not to overcomplicate it.

For example, a class with contact information that contains a first and last name can probably return a hash that is either the hash code of the concatenated string, or the sum of the hash codes for the first name and last name. However, if the phone number is stored, then the phone number itself may become the hash! If you are holding a lot of information in a class that is persisted and has a unique key, then the hash may be as simple as the value of the key itself, rather than something that looks at other properties on the class.

Random Thoughts

To tie this section together, I put together a small console application that demonstrates an implementation of these interfaces. You'll find this is a rather unique approach to the collection, but it honors the interfaces 100%. What the program does is create a special enumerator that generates random integers. This is wrapped in a collection. The collection can have a fixed "size." Any instance of this type of collection in the same AppDomain will provide the same sequence of random numbers. This is because a random seed is generated, then stored, and used as the seed for the sequence.

The key here is to note the behavior of IEnumerator and realize that just because we can deal with it like a collection doesn't mean we have to have a fixed list of integers in memory - in this case, we exploit the random number algorithm and the fact that the same seed returns the same sequence of numbers. Also note how simply by implementing IEnumerable, the collection can suddenly be used in a foreach. Here is the code:

using System;
using System.Collections;

namespace Enumerable
{
    internal class Program
    {
        private sealed class RandomEnumerator : IEnumerator
        {
            private int _index;
            private readonly int _size;
            private static readonly int _seed;
            private int _currentValue;
            private Random _random;

            static RandomEnumerator()
            {
                Random seedGenerator = new Random();
                _seed = seedGenerator.Next();                
            }

            public RandomEnumerator(int size)
            {
                _size = size;
                _random = new Random(_seed);
            }

            public bool MoveNext()
            {
                _currentValue = _random.Next();
                _index++;

                bool retVal = _index <= _size;

                if (!retVal)
                {
                    Reset();
                }

                return retVal;
            }

            public void Reset()
            {
                _random = new Random(_seed);
                _index = 0;                
            }

            public object Current
            {
                get { return _currentValue; }
            }
        }

        public class RandomCollection : ICollection
        {
            private readonly IEnumerator _random;

            private readonly int _size;

            public RandomCollection() : this(5)
            {
            }

            public RandomCollection(int size)
            {
                _size = size;
                _random = new RandomEnumerator(size);
            }

            public IEnumerator GetEnumerator()
            {
                return _random;
            }

            public void CopyTo(Array array, int index)
            {
                GetEnumerator().Reset();
                while (GetEnumerator().MoveNext())
                {
                    array.SetValue(GetEnumerator().Current, index++);
                }
            }

            public int Count
            {
                get { return _size; }
            }

            public object SyncRoot
            {
                get { return null; }
            }

            public bool IsSynchronized
            {
                get { return false; }
            }
        }

        private static void Main()
        {
            RandomCollection col = new RandomCollection(7);

            foreach (int item in col)
            {
                Console.WriteLine(item);
            }

            Console.WriteLine();

            int[] copy = new int[col.Count];
            col.CopyTo(copy, 0);

            for (int x = 0; x < copy.Length; x++)
            {
                Console.WriteLine(copy[x]);
            }
        }
    }
}

Once you compile this program, you can disassemble it using ildasm.exe to see what is generated "behind the scenes." The first block of code after instantiating the collection is the foreach loop. Look at how it appears in IL:

IL_0011:  br.s       IL_0029
    IL_0013:  ldloc.s    CS$5$0000
    IL_0015:  callvirt   instance object [mscorlib]System.Collections.IEnumerator::get_Current()
    IL_001a:  unbox.any  [mscorlib]System.Int32
    IL_001f:  stloc.1
    IL_0020:  nop
    IL_0021:  ldloc.1
    IL_0022:  call       void [mscorlib]System.Console::WriteLine(int32)
    IL_0027:  nop
    IL_0028:  nop
    IL_0029:  ldloc.s    CS$5$0000
    IL_002b:  callvirt   instance bool [mscorlib]System.Collections.IEnumerator::MoveNext()
    IL_0030:  stloc.s    CS$4$0001
    IL_0032:  ldloc.s    CS$4$0001
    IL_0034:  brtrue.s   IL_0013
    IL_0036:  leave.s    IL_0055

You don't have to be an expert in IL to see the use of the IEnumerable interface. The foreach literally invokes the interface by referencing Current and then looping through a call to MoveNext() until it returns false.

One thing you'll notice is the unbox.any operation. This is because we are using the non-generic version of the interfaces. This forces the runtime to box and unbox any value types in the collection. Fortunately, the namespace System.Collections.Generic gives us IEnumerator<T> and ICollection<T> that allows us to strongly type our collection. This would allow the actual types to be referenced and remove the need to box and unbox when iterating the collection.

We've taken a look at the basic interfaces provided by System.Collections. In the next article, I will cover the various concrete implementations of collections that are available so we can discover when, where, and why we'd want to use different types of collections in our own projects.

(If you enjoyed this article, please vote for it at The Code Project)

Jeremy Likness