Duck Typed Concepts for Ruby (was Re: A use case for an ordered hash)



I read the post regarding a taxonomy of hash-like data structures with interest. It is something I have been thinking about a bit lately (as well as the more general subject of all collections). It may be appropriate to back up a step and consider the idea I mentioned in another thread regarding the naming of concepts. Doing that will allow more powerful duck typing in the design.

Ruby already defines Enumerable and Comparable. By depending upon #each Enumerable the requirements / concept that must be met by classes that include it. To keep things simple we can call the concpet Enumerable as well. Here are some other potential concepts that may fit nicely in Ruby (please focus more on the definition / duck typing than on the names):

Integral
supports to_int
MutableEnumerable
depends on each, add, remove
Hashable
depends up hash
Associative
depends on fetch, store, possibly remove, each key, and each value
Unique
depends a single value per key, duplicate keys could be either rejected or overwritten
Multi Valued
depends upon values being Enumerable
Sequencable
depends on each, insert, remove, and possibly fetch and store
Sorted / Ordered
see below

This allows us to define mixins which depend upon these operations or properties. For example, a MutableEnumerable mixin can be defined that implements set operations, compact, clear, remove_if, etc. An Sequencable mixin can be defined that implements all sorts of operations such as append, concat, splice, sort, etc. The dependency of a Sorted / Ordered mixin would be a bit different. It would depend upon a property of the a class mixing it in rather than operations. For example it could define merge and binary search operations.

A Pair concept depending upon first and second may also be desirable. It allows things such as implementing each for Associative that takes a single parameter.

It is clear that there are some relationships between the various concepts. For example, some classes implementing Associative behavior are likely to depend upon keys that implement Hashable behavior and Sequencable classes are also Associative if we assume Integral keys. Also, in some cases a user may know about the state of a particular object that would allow them to leverage a mixin that otherwise wouldn't apply to a class. For example, extending an instance of Array with Sorted if the array is known to be sorted.

Now returning to the thread at hand we can see that the difference between the associative array and hash hierarchies is that the hash hierarchy depends upon Hashable keys (please forgive me if this is stating the obvious). Of course they would also have different implementations of primitive methods. However, many higher level operations that depend upon those primitives would be the same and could indeed be implemented in mixins that expect these primitive operations. The concrete implementations would also be welcome to implement optimizations and additional parameter to higher level operations that are not possible without knowledge of the implementation of the data structure, but this is a bonus and is not strictly necessary. By clearly defining small groups of *naturally* cohesive primitive operations and naming the group (the concepts) we can gain the ability to create concrete implementations with a variety of properties that all support the rich higher level operations that can be supported by their primitives for free (almost for free - the mixins do each need to be implemented once, but will then be available to all concrete implementations support the operations the depend upon). We are now exploiting the power of combinations rather than letting it run away from us.

The properties I see being important are:
Associativity
Integral (ex. Array), Hashable (ex. Hash), any object (ex. associative array)
Ordering
insertion order, key order (by <=>), value order (by <=>), any sort function
Quantity
any value, unique valued, multi-valued
Sequencability
Mutability
probably less important in Ruby, but immutable variants may be desirable for various reasons

The taxonomy in terms of the concepts (please keep in mind that there are probably better names for the concepts than the names I am using):

a. associative array
i. unordered
Associative, MutableEnumerable, Sequencable?
ii. insertion order-ordered
Associative, MutableEnumerable, Sorted
iii. sort function-ordered
1. manual (adhoc)
Associative, MutableEnumerable, Sequencable
2. automatic (upon insertion)
Associative, MutableEnumerable, Sorted

b. hash
i. unordered
Associative, MutableEnumerable, Hashable elements
ii. insertion order-ordered
Associative, MutableEnumerable, Sorted, Hashable elements
iii. sort function-ordered
1. manual (adhoc)
Associative, MutableEnumerable, Sequencable, Hashable elements
2. automatic (upon insertion)
Associative, MutableEnumerable, Sorted, Hashable elements


As we can see, sequencability was not explicitly mentioned in the previous post, although it was implied by the ad-hoc sort. In order to be ad-hoc sortable by an arbitrary sort function the ability to re- order elements must be supported. Once we have this we actually have much more general capabilities of arbitrarily sequencing elements whether the sequence is sorted or not. We can also see that automatic ordering by key, value, and automatic ordering of arrays may also be useful. We can generate further types of collection structures based on a unique value per key or enumerable values in addition to the default of an arbitrary quantity of values per key.

For instance are a.i. and a.ii. really the same thing by virtue of there not being a hashing function?

I would consider a.i to be most likely insertion ordered by default (when created with literal syntax for example), but should proabably be Sequencable and thus the ordering may not always be maintained. Also, insert and other such operations may be supported thus allowing items to be inserted without being "insertion sorted" (in terms of the actual chronological insertion order).

Does a.i. make any sense and can be done away with from the get-go?

It is definitely an independent type of data structure with its own use cases.

And are a.ii. and b.ii, and a.iii. and b.iii. really the same thing also? Not if the hash still retained the hash as you would it expect it to.

No. Using hashable elements is different and is likely to increase the performance of fetch operations.

And is there a term which could reasonably cover all of a. and b.? I might have used the term 'hash' to refer to anything which is a key-value pair, but this is apparently confusing or inaccurate.

Both a and b are associative.

I tried set, but that's wrong. I tried enumerable, but that's wrong too.

A set essentially an associative collection where the elements are objects. Conceptually a hash table, an array, and an associative array are associative based on key or index pairs.

And so are these three sorts of data structures (array, associative array, and hash) sufficiently different to warrant none subclassing from the other? If so, then it would also probably warrant a syntax addition and a core implementation (as discussed). And they are presumably sufficiently similar that it might make sense to subclass all of them from the same base class; list for instance, and then mixin Enumerable there instead. Although one may as well just put the methods in directly to List then however?

I hope I have been able to shed some light on the similarities and differences of them. The implementation of primitive operations are different leading to different potential operations and performance characteristics while many of the higher level operations are the same and can be shared via mixins.

There have been concerns that a sort function-ordered or an insertion-ordered, or just ordered hash will cause slowdowns and memory bloat, so I was wondering if it was possible to adjust according to need. So, thank you to Charles, as I am reminded of what I did in compsci all that time ago!

The ability to select the features you require for a given application would be very desirable indeed. Here's a syntax idea - let's borrow the idea of regex options where we can have /a regex/ix and use option characters to specify what concepts should be supported. {}s could be sequencable, {}i could be insertion ordered, {}k would be key ordered {}v would be value ordered, etc. The same could then be implemented for associative arrays [key => value]v for a value ordered associative array. We could even define ordering for regular arrays if desired [a, b, c]o would be ordered by <=> on the value of a, b, and c and maintain ordering when items are added. Some of these options may be of debatable value, but I think the notation is elegant, concise, and consistent with our current notation.

This post has discussed duck typed concepts in the context of collections, but the usefulness of duck typed concepts are certainly not limited to collections. Their essence is to identify and name a set *naturally* cohesive group of requirements on a type. These requirements can be operations, operations of an associated type (ex. Enumerable requiring elements to support <=> for sort operations to work),

Matthew

.



Relevant Pages

  • Re: Suggestions for double-hashing scheme
    ... >> The items that are being moved are the items in the hash table itself, ... >> which are of fixed size (they are in an array, ... > typedef struct { ... One "uchar" aka 'unsigned char' is plenty to hold a probe ...
    (comp.programming)
  • [SUMMARY] Index and Query (#54)
    ... This was a fun quiz for me because I really don't know anything about indexing ... We see in initializethat the index is just a Hash. ... an Array of symbolic document names where the word can be found ). ...
    (comp.lang.ruby)
  • Re: Comment on how to uniquely track your objects in C# / hash table / get hash code
    ... Array, correct? ... This is largely irrelevant to the issue of performance, since hash ... where both insertions and lookups happen frequently, ... about fast lookups for balanced red/black binary trees. ...
    (microsoft.public.dotnet.languages.csharp)
  • Re: problem with hash & sort array
    ... Then I put the hash into an array with the sort ... The twist is I have to identify the duplicate ... a real data structure, and in the next place I make the exact opposite ...
    (comp.lang.perl.misc)
  • Re: Returning a Dictionary from a webservice
    ... There is no such thing as "associative array" in XML Schema. ... If a client needs to access this data in an associative manner, ...
    (microsoft.public.dotnet.framework.webservices)