Monday, September 7, 2009

What's in Your Collection? Part 2 of 3: Concrete

The collection is a powerful construct that allows a developer to logically group related elements and navigate through them. In this article, we'll explore some concrete implementations of collections that are part of the base .NET framework.

The entire series can be accessed here:

(If you enjoy this article, please vote for it at the Code Project by clicking here, then scroll to the bottom to vote!)

There are two basic name spaces that provide rich collection functionality "out-of-the-box" and are useful in a number of ways. There are System.Collections for non-generic collections and System.Collections.Generic. For more specialized collections, we'll also look at System.Collections.ObjectModel. (Extra Credit: we won't cover it here, but after reading this article you may want to investigate System.Collections.Specialized).

An Out of the Box Interview Answer

A very common interview question is to explain the difference between ArrayList and List. If you got that one correct, you probably mentioned something about boxing, or taking a value type and converting it to an object so it essentially becomes part of the heap instead of the local stack. This operation is expensive. Because ArrayList is not generically typed, it must box and unbox value types. For this reason, any type of collection that deals with value types (and for that matter, structs) should focus on the List<T> implementation. Just how expensive is the boxing operation? Try this little console program and see for yourself:

using System;
using System.Collections;
using System.Collections.Generic;

namespace Arrays
{
    internal class Program
    {
        private static void Main()
        {
            const int ITERATIONS = 9999999;

            DateTime startBuild = DateTime.UtcNow;

            ArrayList integers = new ArrayList();

            for (int x = 0; x < ITERATIONS; x++)
            {
                integers.Add(x);
            }

            DateTime endBuild = DateTime.UtcNow;

            for (int x = 0; x < ITERATIONS; x++)
            {
                int y = (int) integers[x];
            }

            DateTime endParse = DateTime.UtcNow;

            TimeSpan buildArray = endBuild - startBuild;
            TimeSpan parseArray = endParse - endBuild;

            startBuild = DateTime.UtcNow;

            List<int> integerList = new List<int>();

            for (int x = 0; x < ITERATIONS; x++)
            {
                integerList.Add(x);
            }

            endBuild = DateTime.UtcNow;

            for (int x = 0; x < ITERATIONS; x++)
            {
                int y = integerList[x];
            }

            endParse = DateTime.UtcNow;

            TimeSpan buildList = endBuild - startBuild;
            TimeSpan parseList = endParse - endBuild;

            double build = (double) buildArray.Ticks/(double) buildList.Ticks;
            double parse = (double) parseArray.Ticks/(double) parseList.Ticks;
            double total = (double) (buildArray.Ticks + parseArray.Ticks)/(double) (buildList.Ticks + parseList.Ticks);

            Console.WriteLine(string.Format("Build Array: {0} List: {1} {2}", buildArray, buildList, build));
            Console.WriteLine(string.Format("Parse Array: {0} List: {1} {2}", parseArray, parseList, parse));
            Console.WriteLine(string.Format("Total Array: {0} List: {1} {2}", buildArray + parseArray, buildList + parseList, total));

            Console.ReadLine();
        }
    }
}

It basically spins through a list of integers, storing them in both an ArrayList and a List. On my machine, the ArrayList takes over 7 times longer to load, and 1.2 times longer to retrieve values, than the strongly typed List implementation. That is something important to keep in mind when considering collections.

I'm Just Not Your Type

The first collections we'll look at are not generically typed. That doesn't mean they aren't typed ... some in fact are designed for explicit types, but they don't support generics. We already covered the ArrayList, which I believe is there for backwards compatibility to the versions that didn't support generics, as I cannot imagine a situation when I would use that over a List.

These classes derive from CollectionBase and DictionaryBase which are abstract classes that implement ICollection and, in the dictionary, IDictionary.

BitArray

Use this class when manipulating bits. It exposes the bits as an array of bool, so you can do something fun like:

...
if (myArray[x])
{
blah blah
}
...

The underlying storage is done at a bit level for compact storage. What's nice is that you can initialize the collection with a byte array and perform bitwise operations (logical NOT, AND, OR, XOR) between two arrays (great for masks, etc).

Hashtable

The Hashtable serves and important function. It makes large collections of objects easier to parse and search based on the implementation of the hashcode algorithm. One important decision to make is whether you will use a Hashtable or a Dictionary. What's the difference?

The dictionary maps a key to a value. Each key is unique. Different keys might have the same value, but if you are searching for a specific key, you will get exactly one entry. What's more important to note with the Dictionary type is that it is defined with a generic type. Therefore, there is no boxing or unboxing and it will, in general, perform faster and better than a hash table when you are using value types.

The hash table requires that its object implement the hashcode algorithm (or that an algorithm is injected into the constructor). The idea is that objects will have a "mostly" unique key per the hashcode algorithm. However, hash tables will allow multiple objects to exist for the same hash code because the algorithm does not guarantee uniqueness. Hash tables are most often used when there is not a well-defined key to map to the value. The hash code function is used to resolve a "region" of objects, then that subset of objects can be further scanned to complete the algorithm. Using a hashcode when you have a well-defined key is also more expensive because it only stores objects, not generic types, so boxing and unboxing will occur if the targets are value types.

Queue: First in Line, First to Eat!

The Queue is often compared to a physical line. In the lunch line, the first person in the line is also the first person to leave the line (usually). The queue functions this way. To put someone in line, you call the Enqueue method. To get the person at the front of the line (the next one to "leave") you call the Dequeue method.

For an idea of how the Queue collection could be used, consider this practical example: syslog. Syslog is a standard way for network equipment to broadcast status. By default, syslog messages are sent to a host via the UDP protocol on port 514. UDP, unlike TCP/IP, is a disconnected protocol (it doesn't wait for nor require a response, and does not support routing of large packets that must be broken into chunks and reassembled). While you can configure what hardware sends for syslog, some equipment can be incredibly verbose and send out dozens of status updates every second.

Imagine writing a syslog server that retrieves these values from a listening UDP port. The thread listening to the port must be incredibly fast or it will block the port and miss important messages. In order to keep the listen port open, you could implement a synchronized queue. The listener would simply Enqueue the incoming message, then go back and listen to the next message. A background thread (or even several threads running simultaneously) could then call Dequeue to perform processing on those messages.

Most of the time you'll want to use the generically typed equivalent for the Queue to avoid boxing and unboxing.

SortedList

The sorted list is a hybrid between the List and the Dictionary. The keys in the list are sorted, so after adding values to the list, you can enumerate the keys and get the value in the sort order of the key. This might be useful to enumerate countries based on the country or perhaps files based on a key that includes their directory, etc.

Just Put it on the Stack

The stack is a very popular pattern for collections. It is a last-in first-out (LIFO) collection, compared to the queue which is FIFI (first-in, first-out). Stacks are important for composite operations that require a history of state. Calculators work by pushing operands and operators on the stack, computing the values, then popping those values to integrate into the next operation. Stacks are also important in recursive functions — if you wanted to recurse without using a method call, you'd loop instead and place your values on the stack, then pop them off until the stack is empty.

Many years ago in the days of VB6 I helped build a complex web application that had many multi-page transactions. To enable the user to navigate these transactions, we used a custom stack. Each navigation involved pushing the parameters and page directives onto the stack, then the target pages would pop these values and use them. A multi-page transaction would only pop the final values when the transaction was complete. This allowed us to rollback transactions, as well as nest transactions (for example, if you were in the middle of transaction A, then navigated to "B" and hit cancel, you'd pop back into A instead of some generic menu).

Again, you will more often than not use the generically-typed version of the Stack to get the job done.

Generics are Less Expensive

Many of the collections we discussed have generically-typed equivalents that eliminate the need for boxing and un-boxing. When it comes to value types, generically typed classes are almost always less expensive and provide better performance. In addition to generically typed versions of the collections we've already discussed, System.Collections.Generic provides some unique collections only available as strongly-typed implementations.

Dictionary Lookup

By far one of the more commonly used collections, the dictionary has a strongly typed key that maps to a strongly typed value. This is the classic use for mapping one item to another, whether it's an image name to the bytes of the actual image or a security key to a security context object.

Ready, HashSet, Go!

The HashSet class does what the name implies: manages sets. Sets are different than typical lists in a few ways. Sets are loose collections of objects: order is not important. Each object must be unique. If you do not require the classes you are collecting be in a particular order, hash sets exhibit very good performance benefits over indexed and ordered lists. The hash set also provides set operations such as union and intersection. According to Kim Hamilton's article introducing the Hash set, the preferred name for this would have been simply set (you can see the article to learn why the hash part was added).

LinkedList

The linked list is a powerful linked list implementation that provides nodes that link both forward (to the next node) and backwards (to the previous node). The list maintains an internal count. Inserting, deleting, and counting nodes are all O(1) operations. An O(1) operation is an operation that takes the same amount of time regardless of the size of the data it is being performed against ... this means that the list performans just as well when adding or removing nodes as a small or a large list.

Type<T>

The remaining items in this namespace are counterparts to the collections and implementations of the interfaces we've discussed. The caveat is that they are all strongly typed which means better performance in almost all cases involving value types and often for reference types as well. This really leads us to the last collection to be discussed (remember, I left the specialized namespace for homework). This also takes us into a new namespace!

Just an Observation

The System.Collections.ObjectModel namespace is for object-based operations that belong in reusable libraries. It relates to classes which have methods that return or consume collections. Perhaps the most often used collection here is the ObservableCollection.

The ObservableCollection provides a collection that implements INotifyCollectionChanged, which is similar to INotifiyPropertyChanged but at the collection level. In short, whenever an object is added to or removed from the collection, or items within the collection are refreshed, the collection will raise the CollectionChanged event. This is important when there is a dependency on the collection that should be notified whenever the underlying collection changes.

Of course, the most common implementation of this is for databound user interface elements. Objects like lists and grids need to refresh when the underlying lists change. Technologies like Windows Presentation Foundation (WPF) and Silverlight rely on observable collections to optimize the UI and only refresh the elements when there is a requirement, such as the list changing. In fact, these frameworks automatically hook into the events when databound to refresh, so whenever you are dealing with lists that change, you should consider using the observable collection instead of one of the other collection types for binding.

Conclusion

That is a lot of information to cover but hopefully provided insights into the various types of collections and some uses for them. In the next and final installment, we'll consider custom collections and how to tap into IEnumerable for more advanced functionality.

Jeremy Likness