RiK0 Tech Temple: January 2012

Thursday, January 26, 2012

Handle this! (views, const, state in Clojure, Java, C++ and Python)

Introduction

The original vision Alan Key had on object oriented programming was about separate entities communicating through message passing. A logical consequence is that the global programming state is the sum of the individual states of these entities (called objects). State of such objects is naturally hidden from the outside and state modifications occur only as a consequence of the exchanged messages.

I would like to mention that in this model the "privacy" of internal variables is not exactly simply a matter of a keyword, but a consequence of a programming philosophy. This is not the kind of limitation you get in Java classes or C++, where the field is there, you just cannot access it. It somewhat more similar to calling a black-box with a state that is its own business; there are no fieldsand if there are, they are just an implementation detail. Or even more so, private variables are not accessed in the same sense that the physical address of an object in Java is not part of the programming model.

Such objects do not necessarily have their own thread of execution (in the sense that they are concurrently in control). However, if they had, the logical model would not be overly different. But back to the objects…

I somewhat believe that objects are an overloaded metaphor. In fact, there are at least two types of objects. And while the object oriented message only metaphor well applies for domain specific objects, I somewhat feel that it is not appropriate for some data structures. Sometimes, it is a nice property that "similar structures" have a common interface so that, for example, switching from an array to a linked-list is a painless transition, because it eases experimentation with different trade-offs regarding computational efficiency (although such problems are better solved with pen and paper).

However, in other situations, accessing the internals of some complex structure is plainly the "right thing to do". It is a walking-horror from the object oriented point of view, but it plainly makes sense for computational reasons. I often have to deal with graphs with billions of nodes, and more often than not I feel that usual OO laws are too restrictive.

Graph example

A clear example here is the design of networkx.Graph: I have nothing against the design, by the way. I believe they do the right thing. Here the idea is that they have implemented their Graph internals in some way (does not matter how, right now). However, you may want to get a list of all the nodes in the graph. Now, how to do this? The first issue, is that the nodes may not be memorized in a way which is easier to return. This is actually the case: nodes are a dictionary keys, under the hood. So essentially there is no easy way to return them without calling some dict member which returns a newstructure holding the nodes.

State and "Static" OOP

OOP is all about state change. Perhaps just local state change, if done correctly. And hopefully the state's effect do not propagate too far from where the state is hidden. About C++, I found no other very mainstream OO language that makes it clear what you shall change and what you shall not.

The C++ pragmatics is really precise on consting whatever you can const. And to solve issues where it is not practical to have a logically const object which actually mutates something inside, you can use the mutable modifier to support the idea that the object realstate did not change while some irrelevant parts of it indeed changed. Examples are forms of caching, counting stuff, logging to a logger we hold a reference to.

Another important aspect of C++ is that it quite distinguishes between a const pointer (a pointer that cannot change) and a pointer to a const object (the pointed object cannot change). As always all this leads to additional complexity. However, declaring stuff const is good: first it is a rather strong safety guarantee, second it really leads to optimizations otherwise impossible. Still, it is tragically inadequate wrt. plainly immutable objects.

Moreover, although many other languages do not have pointer arithmetics, they do have references. In Java it is possible, for example, to mark such a reference final, which essentially means it will always refer to a given object. However, there is no way to state that the actual object could not be mutated by accesses through that specific reference.

In Java, the only way to achieve that goal is not providing methods that mutate the state. In fact this approach makes sense. Somewhat you make the language simpler without really losing much. And C++ newbies really do not get the whole constness thing very well.

Essentially, in Java you do not have the possibility to have a mutable object that some clients cannot mutate. There are options, however. For example, in Figure 1) we have two interfaces, one mutable and one immutable. We have the mutable interface extend the immutable one and the appropriate base classes.

Immutability at class level can be obtained both (a) with a true immutable implementation implementing the immutable interface and a mutable implementation implementing the mutable one; or (b) with just a mutable implementation: clients that should not mutate the object will use the immutable interface. This is quite similar to the const in C++ in the sense that a const_cast is usually possible (and in this case we could just cast to the mutable interface). Such things somewhat break the whole immutability thing, but sometimes have their uses.

And what is the big deal with immutability? Basically, in this context immutable stuff can be shared with no fear. And copying huge datasets is too inefficient to be considered.

Dynamic OOP

The essential problem here is that the OO language we have discussed so far are built around the idea that your co-workers will screw the project if they can do stuff. So the objective is not letting them do it. Constness shall be enforcedby the language (you had the opinion that I was happy about C++ const, did you?) because otherwise someone will foobar the project.

On the other hand in languages such as Python you may well do everything to every object and consequently the const-enforcement does not fare very well. A bit more could be done (formally) in Ruby. Still, even then you could always hack the objects to let you do whatever you want. And believe me, you could do that also in C++ and Java, provided you have sufficient control of the environment where the program is going to be run. It is just way harder.

In fact, I believe approaches where good policies about code isolation can be also (easily) implemented in Python. Good API design is of paramount importance. A C++ wise advise (from Meyers) was "Avoid returning Handles to object internals" (Item 28, Effective C++, 3rd ed, Scott Meyers, Addison-Wesley).

Essentially the idea is never to let your object guts exposed and never ever let someone mess with it. This is not about trust. This is about such handles are just a sure way to break your object constraints (why I'm talking like a static programmer anyway?). The point is that such handles change state independently by the core object and this is probably going to be bad, because the corruption of the state will be revealed in a place and time extremely distant from when and where it actually happen.

So, we have to carefully design our APIs, even (shall I say, especially?) if we are dynamic programmers. For example, we can return views on our object internals. Since our languages are very dynamic, such views can be easily constructed: they just have to quack like the original objects. When it makes sense, it is probably just better place the functionality in the "large" object and to delegate to the attribute (delegation is so trivial to implement in dynamic languages!). Notice that strongly interface based languages such as Java could make this approach even more natural, provided that formally specified interfaces make sense for the specific case.

Sometimes it makes also sense to return object which can mutate and where their mutation influences the state of the object from which they come from. However, in this situations such objects shall be built in a way that they do not break the behavior of the object from which they were gotten. Essentially here we are just obeying principle like SRP (single responsibility principle) and design things to work together. In fact, they are not handles to the object internals at all. We are not exposing the implementation of the object: we are just exposing an interface to a part of the object state (perhaps even state that cannot be changed through the main object interface).

What are the problems with this approach? As long as things are not modified, copying is fine. A view is a good thing, because it may be as efficient as possible for reading, while being completely safe. The problem essentially arises when we want to mutate the objects state: internals handles are bad, so we have to:

1. carefully craft the object interface to allow modifications efficiently and that make sense to the problem at hand, without making it excessively general (because it clashes with efficiency) or excessively big (because it clashes with almost every good property OOP tries to give to programs)

2. Perhaps create special objects that are able to perform controlled modifications on the original object. This may give lot of generality, in a sense, but also complicates the class hierarchy significantly.

Graph example

Back to our example… we may have many solutions. Suppose that this "get the list of nodes" operation is frequent enough. It may make sense to memorize such list separately from the dictionary. If node removal and addition is not too frequent, the additional memory may well be worth it (well, perhaps not, if we really have lots of nodes). Even if such operations are frequent, we double the cost of the addition and make deletion O(N)… but if instead of a list we use a set, we have both operation O(1) simply with an increased multiplicative constant. Of course a language could offer a dict implementation which essentially offers an efficient view over the set of keys, so that separate memorization is not needed.

We could use a mutable datatype to hold the list, but then we should make a copy before returning it (this what actually happens with networkx). Not making a copy has the same problems of returning an internal handle. If we make a copy, then we could return something immutable or mutable. Essentially returning something immutable has not a lot of sense, as modifications would not affect the graph andmodifications to the graph are not reflected in the node-list. The simplest thing to do is plainly return a list of nodes.

The true solution would be that dictionary supported a "true" view object which is able to modify the original dictionary. And actually Python 2.7 and Python 3 have it. At this point we could just return such thing and have both efficiency and functionality… were it not for a simple issue: a networkx graph has more than one internal structure holding the nodes. Thus a higher level view would need to be created which could work across the different point were the same information is memorized inside a Graph. And we are back to the "complicating the class hierarchy thing".

Immutable by default

The thing is that actually having to specify things to be const, is a bit a pain. And perhaps it is just me... but consider the Java solutions (this apply to things which roughly work like Java): we are talking about having two class and two interfaces (or just one class and two interfaces) for lots of objects. In my opinion, this is not practical. And if we want to create "well-behaved handles" things become even more complex.

In fact, this is probably why it is not done (most of the times). Probably it should be sufficient to limit such strategies for things where it really matters. Think about the collections framework.

On the other hand... think about a world where most things are just immutable. I think it is just a safer mind model of programming. It is not about limiting your colleagues (or yourself) on not doing things which are licit in the model and that we want to restrict.

If we thinkimmutable, things are just easier. But then we are definitely moving towards the functional side of things. I'm not claiming that functional languages have onlyimmutable stuff. Even though many functional languages (Clojure, Haskell) have mostly immutable stuff. However, reasoning in terms of flows of functions and immutable objects is just easier than thinking about immutable immutable objects. At least, it should be, if we were trained to think functionally from the beginning.

Here we are used to deal with const objects. Sometimes we needto change the state. Two typical scenarios spring to mind. We aren't doing "Object Oriented Programming": we are just writing an algorithm and the algorithm was conceived for imperative languages. Sometimes there is no clear conversion into the functional world. Not an efficient one, at least. In this case we may want to use some special mutable object (arrays?) to perform our computation efficiently. And this may even generally work.

In the second case, it is simply not practical to structure the state of the world as some function parameters. In fact most of the times the global state is to big to be wisely represented as a huge set of parameters. In this case we probably want to express the computation as a set of transformations (functions, basically) that shall be executed one after the other on the world. Here I am mostly thinking about Haskell's monads. Though, even different from a syntactical and semantical point of view, we are not far from the realm of refs/agents.

The issue of efficiency, however, remains. We should still keep in mind that well buried under layers of object orientations there may be lots of hidden costs. Interfaces often get in the way of really efficient implementations, because costs are not part of the interface. The collections framework is beautiful… but sort is still implemented copying everything to an array and sorting the array.

Welcome under the sign of the Lambda

Not only it is better to have const object by default, that is to say object mutability shall be an opt-in rather than an opt-out. In fact, a part from the famous koanabout objects and closures, we have to avoid returning handles to our objects guts… but I do not see often closures that open up the enclosed state to the world.

The point is that avoiding all the copy costs may be simply thething to do when we have to deal with huge datasets. Restrict mutability where needed (e.g., implementing the algorithms) but mostly use mutable input and outputs from functions. Moreover, functional code is generally flatter, which can also, in the long run, improve efficiency.

Eventually, with languages such as clojure, even the perceived drawbacks of lists can be avoided using vectors, which support efficiently a different sets of primitives. Lazyness is also extremely helpful: actions that are not performed do not cost.

Wednesday, January 25, 2012

Social Network Analysis

This is a presentation I gave some days ago.
It is rather maths heavy and the point of view is more maths than computer science.

Social Network Analysis

View more presentations from rik0.

Thursday, January 5, 2012

Java and Clojure - How data structures influence language usage

One of the essential points is that we are using object for at least two different things: i) data-types and ii) domain objects. Data-types are usually general entities that can make sense in every program and are not particularly related to a specific application. Examples of these objects are numbers and strings, but also vectors, lists, maps. On the other hand domain objects are domain specific and depend on the specific context. Both a payroll system and a web-server have probably some use for a vector class, however web-server probably does not have an employee class.

The logic level of the objects of the two different kinds is different and code using both should not be written. Essentially, the idea is that low-level code should be encapsulated in methods or functions that work at an higher level, thus making the whole source more details because low level stuff does not clutter the flow of information. Also notice that most standard library code uses low-level abstractions, because it shall be rather domain independent. Thus better data structures mean better code.

So what are these data-types? Languages differ greatly in their choice of datatypes. C has only integers and arrays, for example (ok, it is not OO, but does not matter right now). C++ inherit this. Strings are a library function, so are more well behaved collections. The interesting thing is that the different parts of the STL do not depend much on each other. So for example streams do not play really well with strings (e.g., file names cannot be strings, but in fact streams do not play well with about everything else) and std::string does not have a method such as

vector<string> string::split(string sep);

or, better:

some_range<String> string::split(string sep);

In Java, we have Strings, but collections were retro-fitted. So for example String#split returns an array of strings. In my opinion, one of the essential problems of many "static" programming languages such as C++ and Java is right here: they have the wrong primitives. Writing low level code, is very cumbersome and unnecessarily imperative (C++ STL somewhat makes an exception as it has a somewhat functional flavor the rest of the language lack).

I think that having entities that fit a similar role but have very different usage places a heavy conceptual burden. In Java arrays are really different from collections: they have a ~~nice~~ ugly syntax (but at least they have one) and cannot be used in the same ways collections can. Have different methods, names, facilities. On the other hand, collections lack a literal syntax, which is something that makes them feel "first-class". C++ has basically the very same problem; however, the STL does a great job in normalizing access patterns between STL collections and regular arrays.

Back to the main subject, I feel that Java actually lacks algorithmic code that eases manipulation of collections. In fact, the only "user-friendly" feature is the "for-each" statement. Other than that, code is very imperative and comparable with C code manipulating the same structures. Plus, Maps are really ugly because a first class tuple data-type is missing.

Clojure, on the other hand, has plenty of functions to manipulate low level stuff. This is rather relevant as there is a lot of code that works at this level. Being able to write it quicky is very important. Moreover, using the correct abstractions, avoid bugs. Consider for example the map functions? In a col like

(map f coll)

there can be only one single point of failure: what f does on the single elements. There is no explicit looping, no offset problems, nothing. Consider for example the number of things that can go wrong imperatively writing a loop that calls a function on pairs of consecutive elements in a collection. Compare it with its functional equivalent:

(let [coll (range 10)] (map + coll (rest coll)))

Here nothing can go really wrong. It is easier to write and easier to understand (provided you know a bit of clojure). It is worth noting that real high level object oriented languages have almost this kind of abstraction. But hey, they also have FOF and LOL (first order functions and let ovel lambda -- actually, they do not have let, but the name doesn't matter, does it?). Besides, that kind of code may also be easier to parallelize (in the sense that the compiler could do it).

Then comes class specific code. Object are great, fine. However, sometimes they are just used to group data together. In a sense, it makes sense. In Java there are no tuples, maps are ugly as hell and have the stupid requirement of type uniformity. Which means that either we abuse Object or we can't use them at all.

As a consequence, many specific classes are created. They have no true purpose, but carrying around pieces of data together. And that is just the task for maps, tuples, vectors. Maps, tuples and vectors have a uniform interface, "do the right thing" regardless of the types (in a decently typed language -- new definition… both static, e.g., Haskell, and dynamic, e.g., Clojure, Ruby, Python, languages can qualify! ). And there are plenty of functions to manipulate such stuff.

I think that one of the clearest examples is the Python tuple. A tuple is not magic. Is not clever. It is simple. Still, while in Python functions return only single values, tuples in fact allow to simulate multiple return values. And that is something which is overly useful in many different situations. In Clojure vectors can be used with a similar meaning (in fact, Clojure vectors are quite similar to Python tuples). And the destructuring syntax is amazing (once again, both in Python and in Clojure).

for k, v in some_map:
    do_something_with(k, v)

(for [[k v] some-map] (do-something-with k v))
(doseq [[k v] some-map] (do-something-with k v))

Amazingly enough, the set of orthogonal language features used in both languages is roughly the same. Consider that with using a Set[Map.Entry[K,V]] Map#getEntrySet method... here we have this new Entry class. And perhaps somewhere else we have a Pair class which basically does the same thing, but it is a separate type, with different methods and different usage patterns. In dynamic languages at least, as long as the signature of the methods is compatible, something could be saved... but in statically typed object oriented languages, good luck with that.

I think these are some of the reasons I find especially pleasing writing algorithmic code in Python or Clojure and rather despise the very same thing in Java.