Erlangish

I Have Moved

2011-08-09T08:56:00.001-07:00

For various reasons (the biggest being markdown support and the fact that the blog has less of an erlang focus) I have moved my blog to blog.ericbmerritt.com. You will see new content there in the future.

Build Process Integration

2009-01-25T19:06:00.000-08:00

Introduction

This post isn't going to be Erlang or Language oriented at all. One of my other hobbies revolves around the build process and build process tools. Over the last few years I have been spending a lot of time thinking about improving them, making the build process more transparent etc. I have a way to do that, I believe. Unfortunately, it would take the cooperation of build too implementors to get it off the ground. That, or a new implementation of the existing tools. In any case, let me describe to you what I am talking about.

Many of you may be familiar with a product called Trac. This is an open source project management tool. Actually it calls itself a 'enhanced wiki and issue tracking system for software development projects' but it includes facilities for project management, source control artifact integration, authorization and authentication etc. Its a very interesting and reasonably complete tool. However, there is one single feature that makes this product especially interesting. That feature is the ability to link between the various kinds of information that trac keeps track up. Usually this linking is done with very simple, easy to remember micro-formats. For example, you can easily link a ticket to a changeset with the notation of 'changeset:' in the ticket. Trac understands this format and will produce a click-able link when the ticket is rendered anywhere. This type of linking works between any of the artifacts stored in Trac. The other features are really just there to support that single killer feature. Unfortunately, you only get to use this feature while you are in Trac.

Thats a problem. Because this linking only works with information owned by a Trac system, Trac is forced into an 'own the world' mentality. That is if the designers want to give you the ability to link between a wiki page and an issue the product must provide functionality that implements a wiki and an issue tracking system. If it wants to give you the ability to link to artifacts in a build system it must provide a build system. This is true of any artifact that Trac, or systems, like Trac want to provide. This forces the implementors to spread their time and efforts over a range of products instead of sticking a single product and getting it right. It doesn't allow the implementors of any one product to focus on that one product. It also means that a consumer of this product doesn't have the ability to swap out parts of the product for something he may like better. This is why Trac implements a wiki, a project management system, and a source control display system. It must do this even though quite good systems already address this space in the open source world. It also means that if the developers of Trac or similar systems want to add linking to a new type of artifact they must implement the artifact in their system.

A Better Approach

There is a better approach although it will take a bit of effort to see it realized. Fundamentally this better approach is to let each system handle focus on its purpose and provide some system agnostic way to tie these artifacts together. That is, that some other system should exist that understands how artifacts are related to one another. This system would then allow users to create 'links' or relationships and query those relationships at will.

We can do this by utilizing REST based services (other ways probably work as well) and using the REST semantic. By vending an artifact at a specific unchanging url we cane provide a universally unique identifier that we can use for linking. For example, lets say we were building up a Trac like system. We put our issue artifacts in an issue tracking service at a specific url, say 'http://issues.com/api/issue/'. We also put our projects at a specific url, say
'http://projects.com/api/project/'. Both of these apis vend data according to some predetermined format. With these in place it becomes very easy to link these two artifacts together. This implicitly indicates the existence of a couple of things. First and foremost that consumable formats exist for issues and project. Secondly that there exists some means of resolving these relationships. Basically, that you somewhere to put the fact that these issues are related. I think that a separate type of system should be set up that manages these relationships. For nowe I will call that a relationship management service. So with these facts established, if we wanted to associate issue 13 with project Foo we would just create a relationship between http://issues.com/api/issue/13 and http://projects.com/api/project/foo in our relationship management service. With this service in place each individual system wouldn't need to understand linking at all nor store link information. Only clients that wanted to consume the information would need to understand linking.

There are a few and advantages here and some disadvantages. The biggest advantage is that systems don't need to understand how linking occurs with other systems. We can drop any system we want into the mix and get reasonable linking semantics. The second big advantage is we can manipulate the links in one location, traversing the entire graph of links with jumping around to each service that stores information. The disadvantage is that we have yet another service that we must manage.

How To Do It

Of course, actually accomplishing the task of building these relationships is not so simple in the pragmatic world. Several prerequisites have to be met before we can get started.

Systems must vend their data in some generally, consumable, system agnostic way.

Systems must actually vend the artifacts that other systems may find interesting.

There must be a uniform, unchanging way to resolve artifacts so that they can be linked to.

Each of these prerequisites are more complex then they may first appear. For example, the first prerequisite says that systems must vend their artifacts in some system agnostic way. However, in reality you want them to vend it in as simple a way as possible. You also want them to vend their data with either some type of schema or self describing data. Finally you want all of your systems to vend data in a similar way to reduce the complexity of linking and consumption. Similar issues exist with the second prerequisite. Artifacts that should be exposed may be sub artifacts of previously exposed artifacts. For example, in a project management system, projects are probably an exposed artifact but so are milestones which are lower in the hierarchy then projects. Exposing these in a consistent consumable way will require some forethought. Finally and most importantly, artifacts must be resolvable for linking to have any meaning at all. That means that each artifact must have a consistent and unchanging URI that can be used to identify and consume that artifact. Designing such an api will take some significant forethought on the part of the system implementors.

What we are essentially talking about is evolving a set of tools from a broad array of isolated stacks to a set of components that can plug in to a generic, flexible, component architecture. 'Plug in to' in this case just means forming consumable relationships between artifacts vended by each component. This is very similar to the way the familiar World Wide Web is designed. In our case, we are linking artifacts instead of documents and storing the links outside of the artifacts, but the design philosophy is very similar.

To that end, a set of principles that inform the creation and management of components and the relationships between the artifacts that are vended by those components needs to be designed. This must not and should not be some heavy handed attempt to mandate interoperability. It should be a set of general guidelines about what types of services must must be vended by a system. The relationship management should occur within a dedicated system as well.

This approach would free product developers to concentrate on a specific feature or set of features. It would then allow individuals or entities to consume artifacts as they will and forming relationships between arbitrary artifacts. This is an important meme that I think should be propagated along with the REST approach to web services.

Forth: The Other White Meat

2008-07-23T12:40:00.000-07:00

Lisp has been constantly touted as a language that any self respecting coder must learn. The proponent's of Lisp have every right to make the claims that they do. From the prospect of stretching your mind Lisp actually is as good as they say it is. Lisp is a principle member of a small, elite group of languages, the use of which really causes a fundamental change in the way people think about programming and system design. However, it is not the only member of this group. The group includes at least one other language, Forth. Forth is just as mind bendy, introspective and universal as Lisp but it doesn't tend to get the same amount of press as its older brethren. Well, I intend to change that. At least for the small corner of the world that reads this column.

I know that as soon as I mentioned the F word (Forth that is ;) a general groan went up from those of you that have some experience with it. Don't worry, groaners, I feel your pain. However, take a couple of aspirin sit back and listen to me for a bit and you might gain a different prospective on the language. For those of you not familiar with the language, the reason for the groaning will become obvious as our story progresses.

I had given some thought to going about and gathering up all of the 'quotes' that people use to promulgate Lisp and then justifying them for forth. After thinking about it, I decided to take a slightly different tack. I went this other route mostly because Bill Clementson has already done most of the gathering work for me and also because I think that you're smart enough to draw correlations without me beating you over the head with them. So take a couple of minutes, pop over Bill's site and read the quotes. Don't worry, I'll wait.

What Is Forth

Back so soon? good. Forth is a language that takes a wildly different tack then any other language out there. It was initially developed by Charles Moore at the US National Radio Astronomy Observatory in the early 1970s. Back in the early sixties, while working for the Smithsonian Astrophysical Observatory Mr. Moore found a need for a little bit more dynamisim then he had had in the past. To that end he put together a little interpreter intended to control a punch reader. By using this interpreter he could compose different equations for tracking satellites without recompiling the entire system. This was no small feat for the time. Mr. Moore, like any good hacker, took his interpreter with him when he left that job. He carried it around for the next five or ten years constantly tweeking it. By 1968 it was finished enough to build up a little game called SpaceWar as well as pretty nifty Chess system. This version of the language was the first to be called 'Forth'.

This early evolution and constant tweaking produced a fairly interesting language. A forth program is simply a series of tokens, called words, and two stacks. The stacks represent the main data stack and the return stack. For right now we will ignore the return stack and talk about the data stack. To give you comparitive examples, lets look at an operation in Lisp and Forth. The Lisp version is perfectly recognizable to just about anyone. It just adds two numbers together.


> (+ 1 2) 
 3

In Forth, it goes as follows.


> 1 2 +
 3

A more complex example


> (/ (+ 27 18) (* 3 3))
 5

In Forth, it goes as follows.


> 27 18 + 3 3 * /
 5

If you have been exposed to some of the old TI calculators or even Postscript you may be able to tell whats going on here. Each number as it appears gets pushed onto the explicit data stack. The '+' words (and words they are) take exactly two numbers off the stack and replace them with the value resulting from there addition. So what we get on terms is.


> 1 2 + ( results in 3 on the stack)

This explicit stack is the way all data is handled in Forth, everything. You may remember that I said everything is a token. Thats absolutly correct. Forth reads each token and looks it up in a special dictionary that is used to store references to words. We create entries in that dictionary as follows.


 : square
    DUP * ;

The colon is an immediate word (read macro) that reads ahead one word in the token stream and uses that word to create a new entry in the system dictionary.

It then reads ahead until it finds a semi-colon ';' which it uses to indicate the close of the word. It then compiles the intervening tokens into references to words. It is as simple as that. 'Hold on, Hold on!' you say. 'You said everything was tokens in a forth system, aren't these special'. Nope, I didn't lead you astray. The colon and semi-colon aren't special at all and can be overwritten and changed at any point you like. The are examples of Forths powerful macro facility in which you can create syntax for a language that essentially has none. This is a powerful concept that Lisp shares. As Paul Graham says you can use this to build up your language to the system instead of the other way around. Of course, there is a downside. Any time you give the programmer extreme flexibility someone is going to abuse it. For example take a look at this example from a Forth library in the wild.


: rework-% ( add - ) { url }  base @ >r hex 
   0 url $@len 0 ?DO 
       url $@ drop I + c@ dup '% = IF 
           drop 0. url $@ I 1+ /string 
           2 min dup >r >number r> swap - >r 2drop 
       ELSE  0 >r  THEN  over url $@ drop + c!  1+ 
   r> 1+ +LOOP  url $!len 
   r> base ! ;

Granted I have taken away the context, but you begin to understand the impetus behind those groans you heard a short while ago. Forth is one of the most flexible languages available, but the very flexibility that makes it interesting also makes it dangerous. It takes diligence on the part of the programmer to write really clean and maintainable code. However, if the coder does take that care he can write imminently readable and maintainable code. Take a look at the following examples that appeared in Starting Forth by Leo Brodie. You can get a good feel for whats going on even without knowing the context.


 : WASHER  WASH SPIN RINSE SPIN ;
 : RINSE  FAUCETS OPEN  TILL-FULL  FAUCETS CLOSE ;

So doing Forth well or not is entirely in the hands of the coder, that's why the Forth experience varies so much. That's why Forth is a Language for Smart People.

Forth Basics

I have the very good luck of owning a copy of the book Thinking Forth. This is one of those books that you should add to your library even if you never intend to write a single line of Forth code. It has a huge amount of great insight onto coding and system design in general, though you probably don't want to use it as your first introduction to Forth. In any case, I am going to be borrowing some topics and examples from that book to see us on our way.


: breakfast
    hurried? if cereal else eggs then clean ;

The above example illustrates the fundamental building block of every Forth system, the word. This specific example is whats called a colon definition. Its how you define a word in Forth. Basically the colon ':' in an immediate word (think macros operating on the word stream) that takes the very next word and creates an entry in the dictionary based on that word. It then reads all the words up to the semi-colon ';' that tells it to stop. For each word it encounters it looks up the position in the dictionary and puts that location in the code stream. Of course, in this case we have the if immediate word that takes control for a little while before handing it back. So what specifically might be going on here. In this case, hurried? probably looks at some parameter in the system to decide if haste is in order and puts a boolean on the stack. 'If' is an immediate word that compiles to a jump based on the 'else' and 'then' words. You can probably figure out what it does from here.

This, admittedly simplistic, example does illustrate a few of the special characteristics of Forth. Most notably, the stack, words, and use of immediate words.

Factoring

Long before any one had ever heard of Agile Methodologies or Extreme Programing the idea of factoring was already hard at work in the Forth community. The idea of constantly looking at code, breaking up functions and generally simplifying the systems in a foundational concept in Forth. In fact it's taken to something of an extreme. Words are a huge part of Forth systems and the factoring process. For example, consider the following word which finds the sum of squares of two integers:


 : sum-of-squares   dup * swap dup * + ;

The Stack inputs to the word at run-time are two integers. The Stack output is a single integer. By the process of factoring, the example would be re-written in Forth using a new definition called 'squared' to allow sharing the common work of duplicating and multiplying a number. The first version was overly complex and illustrated the notorious line noise aspect of Forth. Fortunately, by factoring the system we can make a much more readable and understandable system.


 : squared  dup *  ;
 : sum-of-squares  squared swap squared + ;

Good Forth programmers strive to write programs containing very short (often one-line), well-named word definitions and reused factored code segments. The ability to pick just the right name for a word is a prized talent. Factoring is so important that it is common for a Forth program to have more subroutine calls than stack operations. Writing a Forth program is equivalent to extending the language to include all functions needed to implement an application. Therefore, programming in Forth may be thought of as creating a Domain Specific Language. As Lispers well know, this paradigm, when coupled with a very quick edit/compile/test cycle, seems to significantly increase productivity.

Immediate Words (The Macros of Forth)
In comparison with more main stream languages, Forth's compiler is completely backwards. Most traditional compilers are huge programs designed to translate any foreseeable, legal combination of available operators into machine language. In Forth, however, most of the work of compilation is done by a single definition, only a few lines long. As I have said before, special structures like conditionals and loops are not compiled by the compiler but by the words being compiled (IF, DO, etc.) You may recognize this sort of thing from Lisp and it's macro system. Defining new, specialized compilers is as easy as defining any other word, as you will soon see. As you know, When you've got an extensible compiler, you've got a very powerful language!

Forths I Have Known

There are a couple of new Forths out there that are breaking with the Forth tradition somewhat and innovating in this area. One that I especially like is Factor. It's a very minimal but consistent for with some modern ideas like garbage collection, CLOS like object system, strait forward higher order functions, etc. This forth kind of bridges the gap between Lisp and Forth. A good traditional forth is Gforth. Its reasonably fast, reasonably stable and implements the full ANSI spec. You really can't go wrong with gforth if you want a good forth that will stand the test of time. If you just want an easy, interesting forth that will run just about anywhere I suggest you take a look at Ficl. Its ANSI compliant Forth written with in C as a subroutine threaded interpreter. It is stupid simple and about as robust as you can get. This is a really nice forth to start you tinkering with. Now if you want a Forth that is really stretching the bounds of what is possible and desirable take a look at Chuck Moore's latest Forth, Color Forth. It's the only language that I know of where color (yes, color) actually has semantic meaning. That the custom 25 core chip that color forth is designed to run show the Mr. Moore is really pushing the boundaries. It's worth while taking a bit of time exploring ColorForth just because of it's divergence from mainstream.

So Why the Comparisons to Lisp?

I think that many more people have had some exposure to Lisp then Forth. Because, Lisp and forth share so much 'meta' philosophy, I thought it would be beneficial to draw comparisons between Lisp and Forth. You will have to be the judge of whether this was a successful approach or not.

An Introduction to Erlang

2008-06-05T12:39:00.000-07:00

As some of you may have guessed, I am a fan of Erlang. I think that it's a very interesting language with a tremendous amount of promise for the type of server side applications that I usually end up working on. I have talked a lot about various things here on Erlangish, so I thought it would finally be appropriate to spend a bit of time talking about the topic of the blog. For the most part I will be delving into the, somewhat obscure, history of Erlang. I will also spend a bit of time providing some instructions on how to get started with the language.

So What Is Erlang and OTP?

Erlang is a distributed, concurrent, soft real time functional programming language and runtime environment developed by Ericsson, the Telecoms Infrastructure supplier. It has built-in support for concurrency, distribution and fault tolerance. Since its open source release in 1999, Erlang has been adopted by many leading telecom and IT related companies. It is now successfully being used in other sectors including banking, finance, ecommerce and computer telephony.

OTP is a large collection of libraries for Erlang to do everything from compiling ASN.1 to providing an application embeddable http server. It also consists of implementations of several patterns that have proven useful in massively concurrent development over the years. Most production Erlang applications are actually Erlang/OTP applications. OTP is also open source and distributed with Erlang.

Although Erlang is a general purpose language, it has tried to fill a certain niche. This niche mostly consists of distributed, reliable, soft real-time concurrent systems. These types of applications are telecommunication systems, Servers and Database applications which require soft real-time behavior. Erlang excels at these types of systems because these are the types of systems that it was originally designed around. It contains a number of features that make it particularly useful in this arena. For example; it provides a simple and powerful model for error containment and fault tolerance; concurrency and message passing are a fundamental to the language, applications written in Erlang are often composed of hundreds or thousands of lightweight processes; Context switching between Erlang processes is typically several orders of magnitude cheaper than switching between threads in other languages; it's distribution mechanisms are transparent, programs need not be aware that they are distributed and The runtime system allows code in a running system to be updated without interrupting the program.

Given that there are things that Erlang is good at there are bound to be a few things that it's not so good at. The most common class of 'less suitable' problems is characterized by iterative performance being a prime requirement with constant-factors having a large effect on performance. Typical examples are image processing, signal processing and sorting large volumes of data.

Origins of Erlang

I am firmly convinced that Erlang's history is a key ingredient to its success. I am not aware of any other language whose early development was so straightforward and pragmatic. For most of its life Erlang was developed inside Ericsson, originally for internal use only. Later on it was made available to the external world and eventually open sourced. The timeline of its development goes something like this.

1982–1985 Ericsson identified some issues that existed with telecom languages at the time. To address these difficulties they started experiments with the programming of telecom applications using more then twenty different languages. These early experimenters came up with a few features that a useful system needed to supply. They realized that the target language must be a very high level symbolic language in order to achieve productivity gains. This new requirement vastly subseted the language space and resulted in a very short list of languages. The languages included Lisp, Prolog, Parlog, etc.

1985–1986 Further experiments within this subseted list where performed. New results were generated from this round of experiments as well. They found that the theoretically ideal language also needed to contain primitives for concurrency and error recovery, and the execution model needed to exclude back-tracking. This ruled out two more of the contending languages, Lisp and Prolog. This ideal language also needed to have a granularity of concurrency such that there would be a one to one relationship between one asynchronous telephony process and one process in the language. This ruled out Parlog. At the end of this round of experiments they where left with out any languages in the list they had started with. Being the pragmatic folks that they were, they decided to develop their own language with the desirable features of Lisp, Prolog and Parlog, but with superior concurrency and error recovery built into the language.

1987 The first experiments began with this nascent language which became Erlang.
1988 The ACS/Dunder project was started at Ericsson. This was a prototype implementation of PABX by developers external to the core Erlang developers.
1989 The ACS/Dunder project became a full fledged project with the reconstruction of about a tenth of the complete, production PABX called the MD-110 system. The preliminary results where very promising. In this early phase developer productivity was already ten times greater then during the development of the original system in the PLEX language. This reimplementation also pushed forward experiments directed at increasing the speed of Erlang.
1991 At this point the experiments directed at speeding up Erlang bore fruit. A fast implementation of Erlang was released to growing user community. Erlang was also presented at an international telecom conference that year.
1992 During this year the user base started growing significantly. Erlang was ported to VxWorks, PC, Macintosh and other platforms. Three new, complete applications were written in Erlang where presented at another conference. The first two production projects using Erlang are started.
1993 Distribution primitives where added to the language, which made it possible to run a homogeneous Erlang system on heterogeneous hardware. A corporate decision was made within Ericsson to begin selling and supporting Erlang externally. A new organizational structure was built up to maintain and support Erlang implementations and Erlang Tools. This resulted in the creation of Erlang Systems AB.
1996 OTP was formed into a separate product group within Erlang Systems AB. This represented the maturing of the OTP platform within Erlang. After nearly ten years of development the (non-Erlang) AXE/N project was closed down and pronounced a failure. This left a large whole in Eriksson's product line and development of a new replacement product was started to fill it, it was decided that this replacement product would be written in Erlang.
1998 After two years the AXE/N replacement project, now called the AXD301 was delivered. Around this same time the CEO of Ericsson Radio became the CEO of Ericsson as a whole. This person had banned Erlang at Ericsson Radio and though he never banned Erlang at Ericsson proper it became career suicide to propose Erlang new Erlang projects. This problem effectively killed opportunities for Erlang Systems AB to sell the language and support. The primary question potential customers asked was 'Who wants to use a language developed by Ericsson when Ericsson won't use the language itself?'. This just goes to show that corporate bureaucracy will be corporate bureaucracy autocracy no matter where you are. In any case, this turned into a bit of a blessing in disguise for the rest of us.
1998-99 In order to drive further development a decision was made within Erlang Systems AB to release Erlang as an Open Source project. This didn't imply an abandonment of Erlang by Ericsson or Erlang Systems AB. Erlang continued and continues to be supported by these two organizations. This decision was made wholly with the idea of spreading Erlang and removing the, somewhat negative, idea that it was a proprietary language. It has remained open source and supported to this day.

As you can see the Erlang didn't start out as Erlang at all. It started out as just a series of requirements backed up by experiments. A large number of experiments where done to find the language that matched those requirements. When no existing language was found Ericsson decided to create their own. Considering Ericsson's resources and the costs associated with development of their products

I think this was a very pragmatic decision. However, that conclusion is open to interpretation. In any case, after the initial development there was a constant back and forth dialog between the users and developers of the language as the language moved through its formative process. I think this fact alone is one of the reasons that Erlang is as good as it is today. Later on in its development as Ericsson grew less resourceful Erlang started to have political problems within the company. Even though Ericsson had several successful and profitable products in Erlang and other languages the de-facto ban occurred. Fortunately by this time Erlang could and did stand on its own. The ban actually turned out to be fortunate for the rest of us because it led, pretty directly, to Erlang's eventual Open Sourcing.

Joe Armstrong, one of the original Erlang Developers and a productive member of the community put together a number of tutorials that are very useful. Its worth going through these and playing with the code.

There are a couple of good editors to use with Erlang. The gold standard is the Erlang Emacs mode distributed as part of the Erlang distro. A very updated version is now available from the Erlware folks. You can get it here If you go this route I suggest you also get Distel, an add-on written by Luke Gorrie. It's available from the the good folks at google code. There are instructions included with both of these to get you up and going. For those of you more inclined to the IDE world you may want to take a look at Erlide. This is a set of Eclipse plugins that add support for Erlang to Eclipse. Its still pretty beta, but it's very usable.

So How Do I Get Started?

Learning Erlang is a fairly quick process. For an experienced developer it shouldn't take more then a few days before they can write nontrivial programs, about a week or two to feel really comfortable and a month or so before feeling ready to take on something big by themselves. It helps a lot to have someone who knows how to use Erlang around for some hand-holding.

Start off by going through the quick start part of the FAQ. Then go through the Erlang Course. You can skip the history part if you would like. I have gone over it in more detail here. Once you have done the course play around with some of the examples. Then go read the long version of the getting started docs. This should put you on the road to being able to write some Erlang code. If you are one to worry about coding conventions then you may want to take a look at http://www.erlang.se/doc/programming_rules.shtml. This has quite a number of useful and well thought out programming rules. One of the things that makes Erlang really interesting is the OTP System. If you really want to get to know something about Erlang then it make sense to spend a bit of time learning OTP and http://www.erlang.org/doc/design_principles/part_frame.html is a very good place to start.

An Overview of Concurrency

2008-05-21T12:40:00.002-07:00

Introduction

Over the last few weeks I have had several conversations with people about concurrency, more specifically the ways in which shared information is handled in concurrent languages. I have gotten the impression that there isn't really a good understanding of whats out there in the world of concurrency. That being the case it would be a good idea to just give a quick overview of some of the mechanisms that are gaining mind share in the world of concurrency.

Aside from some engineers that are currently so deep in their rut that they can't see sunlight its been accepted that the current mainstream approach to concurrency just wont work. The idea of using mutexes and locks to try to take advantage of the up and coming massively multi-core chips is really just laughable. We can't ignore this topic. As software engineers we don't really have a choice about supporting large numbers of CPUs, thats the direction that hardware is going its up to us to figure out how to make it work in software. Fortunately a bunch of really smart folks have been thinking about this problem for a really long time. A few of the things they have been working on are slowly making their way into the gestalt of the software world.

We are going to talk about three things. They are Software Transactional Memory (STM), Dataflow Programing specifically Futures and Promises and Message Passing Concurrency. There are others, but these currently have the most mind share and the best implementations. I have limited space and limited time so I am going to stick to these three topics. You may have noticed that up to this point I have only talked about concurrency. More specifically the communication between processes in concurrent languages. Thats intentional, I am not going to talk about parallelism at all. Thats a different subject and only faintly related to concurrency, but often conflated with it. So if your looking for that you are looking in the wrong place. On another note, I am going to use processes and threads pretty much interchangeably. I know that in certain languages the two terms have very different meanings, however, in many other languages they mean the same or one of the terms doesn't apply. What I am getting at is that if you open the scope of the discussion to the languages at large the meanings become ambiguous. When I use the term process or thread I am talking about a single concurrent activity that may or may not communicate with other concurrent activities in same way. Thats about the best I can do for you.

Traditional Shared Memory Communication

Shared Memory Communication is the GOTO of our time. Like GOTO of years past its the current mainstream concurrent communication technique and has been for a long, long time. Just like GOTO, there are so many gotchas and ways to shoot yourself in the head that its scary. It so bad that this approach to concurrency has tainted an entire generation of engineers with an deeply abiding fear of concurrency in general. This great fear has crippled the ability of our industry to adapt to the new reality of multi-core systems. The sooner shared memory dies the horrible death it deserves then the better for us all. Having said that, I must now admit that, just like GOTOs, shared memory has a small niche where it probably can't be replaced. If you work in that niche then you already know you need shared memory and if you don't you don't. Just a hint, implementing business logic in services is not that niche. OK, I am all done with my rant and I feel much better, now on to the show.

Shared memory typically refers to a large block of random access memory that can be accessed by several different concurrently running processes. This block of memory is protected by some type of guard that makes sure that the block of memory isn't being accessed by more then one process at any particular time. These guards usually take the form of Locks, Mutexes, Semaphores, etc. There are a bunch of problems with this shared memory approach. There is complexity in managing the serial access to the block of memory. There is complexity managing lock contention for a heavily used resources. There is a very real possibility of creating deadlocks in your code in a way that isn't easily visible to you as a developer. There is just all kinds of nastiness here. This type of concurrency is found in all the current mainstream programming and scripting languages, C, C++, Java, Perl, Python, etc. For whatever reason its ubiquitous and we have all been exposed to it that doesn't mean we have to accept it as the status quo.

Software Transactional Memory (STM)

The first non-traditional concurrency mechanism we are going to look at is Software Transactional Memory, or STM for short. The most popular embodiment of STM is currently available in the GHC implementation of Haskell. As for the description I will let wikipedia handle it.

Software transactional memory (STM) is a concurrency control mechanism analogous to database transactions for controlling access to shared memory in concurrent computing. It functions as an alternative to lock-based synchronization, and is typically implemented in a lock-free way. A transaction in this context is a piece of code that executes a series of reads and writes to shared memory. These reads and writes logically occur at a single instant in time; intermediate states are not visible to other (successful) transactions.

STM has a few benefits that aren't immediately obvious. First and foremost STM is optimistic. Every thread does what it needs to do without knowing or caring if another thread is working with the same data. At the end of a manipulation if everything is good and nothing has changed then the changes are committed. If problems or conflicts occur the change can be rolled back and retried. The cool part of this is that there isn't any waiting for resources. Threads can write to different parts of a structure without sharing any lock. The bad part about this is that you have to retry failed commits. There is also some, not insignificant, overhead involved with the transaction subsystem itself that causes a performance hit. Additionally in certain situations there may be a memory hit, ie if n processes are modifying the same amount of memory you would need O(n) of memory to support the transaction. This is a million times better then the mainstream shared memory approach and if its the only alternative available to you you should definitely use it. I still consider it shared memory at its core. Thats an argument that I have had many, many times.

Dataflow - Futures and Promises

Another approach to concurrency is the use of Futures and Promises. Its most visible implementation is in Mozart-Oz. Once again I will let wikipedia the description for me.

In computer science, futures and promises are closely related constructs used for synchronization in some concurrent programming languages. They both refer to an object that acts as a proxy for a result that is initially not known, usually because the computation of its value has not yet completed.

Lets lay down the difference between Futures and Promises before we get started. A future is a contract that a specific thread will, at some point in the future, provide a value to fill that contract. A promise is, well a promise, that at some point some thread will provide the promised value. This is very much a dataflow style of programming and is mostly found in those languages that support that style, like Mozart-Oz and Alice ML.

Futures and Promises are conceptually pretty simple. They make passing around data in concurrent systems pretty intuitive. They also serve as a good foundation on which to build up more complex structures like channels. Those languages that support Futures and Promises usually support advanced ideas like unification and in that context Futures and Promises work really well. However, although Futures and Promises remove the most egregious possibilities for dead-locking it is still possible in some cases.

In the end both of these approaches involve shared memory. They both do a reasonably good job at mitigating the insidious problems of using shared memory, but they just mitigate those problems, they don't eliminate them. The next mechanism takes a completely different approach to the problem. For that reason it does manage to eliminate most of the problems involved with shared memory concurrency. Of course, there is always a trade off and in this case the trade off is in additional memory usage and copying costs. I am getting ahead of myself let me begin at the beginning and then proceed to the end in a linear fashion.

Message Passing Concurrency

The third form of concurrency is built around message passing. Once again I will let wikipedia describe the system as it tends to be better at it then I am.

Concurrent components communicate by exchanging messages (exemplified by Erlang and Occam). The exchange of messages may be carried out asynchronously (sometimes referred to as "send and pray", although it is standard practice to resend messages that are not acknowledged as received), or may use a rendezvous style in which the sender blocks until the message is received. Message-passing concurrency tends to be far easier to reason about than shared-memory concurrency, and is typically considered a more robust, although slower, form of concurrent programming. A wide variety of mathematical theories for understanding and analyzing message-passing systems are available, including the Actor model, and various process calculi.

Message passing concurrency is about processes communicating by sending messages to one another. Semantically these messages are completely separate entities unrelated to whatever data they where built from. This means that when you are writing code that uses this form of concurrency you don't need to worry about shared state at all, you just need to worry about how the messages will flow through your system. Of course, you don't get this for free. In many cases, message passing concurrency is built by doing a deep copy of the message before sending and then sending the copy instead of the actual message. The problem here is that that copy can be quite expensive for sufficiently large structures. This additional memory usage may have negative implications for you system if you are in any way memory constrained or are sending a lot of large messages. In practice, this means that you must be aware of and manage the size and complexity of the messages that you are sending and receiving. Much like Futures and Promises the most egregious 'shoot yourself in the head' possibilities of deadlocking are removed its still possible to do. You must be aware of that in your design and implementation.

Conclusion

In the end any one of these approaches is so much better then the shared memory approach that it almost doesn't matter which one you choose for your next project. However, they each have very different philosophical approaches to concurrency that greatly affect how you go about designing systems. You should explore each one so that you are able to make a logical decision about which one to use for your next project. That said, opinions are like umm, I can't really complete that, but you get my drift. My opinion on the subject is the message passing concurrency is by far the best of the three, where best is defined by most conceptually simple and scalable. In the end the industry will decide which direction is right for that and head in that direction. We are still to early in the multi-core age to get any good impression of which will win out.

Event Based Stream Parsing for JavaScript Object Notation (JSON)

2008-03-21T23:39:00.000-07:00

Overview

I have been playing around with the idea of a stream based, sax like, parsing API for JSON. In my mind it has a few very direct benefits. I suspect that it would simplify implementing a parser for an already simple syntax. It would also allow for parsing arbitrarily large documents. In my case I need to return information to the user as quickly as possible, I absolutely cannot wait for the entire document to be parsed. This would solve that problem. All that said I seriously doubt that I am the first one to think of this, though I can't find any references to anything similar out there. If any of you have any pointers let me know in the comments and I will reference them. I plan to implement this in erlang and at least one other language. I will post links as soon as that is done. Without further ado.

Abstract

The JavaScript Object Notation (JSON) is defined in RFC 4627. This document describes method of parsing JSON using an event driven api. It is expect that details of implementation will change from language to language. However, this document should present a uniform approach to stream parsing of JSON

Introduction

JavaScript Object Notation (JSON) is describe as a text format for serialization of structured data. More commonly it is used as a transport protocol for various network based applications. This document describes an event based parsing mechanism for JSON data that can be implemented in a simple and strait forward manner. Reading and understanding RFC 4627 is a prerequisite to reading and understanding this document.

Description

The stream oriented parser produces a series of events that describe the structure currently being parsed. These events are then consumed by the calling application in some manner. The actual mechanism of consumption will vary from language type to language type though a few different types of APIs are discussed in the appendix of this document.

JSON is composed of two types of data structures primitive and complex types. The events for the two types of structures remain the same, changing only in the description of the structure being described.

The events around primitive types are relatively simple and should be implemented as simply as possible. Individual API definers will choose how they want to implement it using the examples in the appendix as a guide.

Primitive types in JSON are strings, numbers, booleans and null. These are described within the event itself


      string = STRING_DATA(value)
      number = NUMBER_DATA(value)
      boolean = BOOLEAN_DATA(value)
      null = NULL_DATA(value)

Complex types in JSON are objects and arrays. These are represented by a series of events that describe the object. Because they are complex types their representation is much more complex then that of primitive types. However, it allows the object to be consumed as it occurred in the stream.


       object =  OBJECT_BEGIN
                     KEY(string_value)
                     VALUE_BEGIN
                      ... # recursive type description
                     VALUE_END
                      ... # arbitrary number of additional key/value pairs
                  OBJECT_END

       array =  ARRAY_BEGIN
                     VALUE_BEGIN
                       ... # recursive type description
                     VALUE_END
                ARRAY_END

Callback API Description for Erlang

This would be implemented as a behavior that defines these callbacks. Client code that wishes to receive these callbacks would implement these methods. This should allow a high degree of flexibility for the client.


        %% should guard data will be one of the primitive types
        %% null will be represented by the atom null
        data(Value, State) -> State2

        object_begin(State) -> State2

        object_end(State) -> State2

        key(Value, State) -> State2

        value_begin(State) ->  State2

        value_end(State) -> State2

        array_begin(State) -> State2

        array_end(State) -> State2

Erlware Progress Update

2007-12-19T16:09:00.001-08:00

I have had very little time to post of late. The
other maintainers and I have been coding feverishly to get
the erlware system to version 1.0 by the end of January. This has
taken up a huge amount of my time, leaving me very little to
time to blog. It been so long now that I feel bad. So
I decided that I needed to spend a bit of time talking how
development in our new model is going.

All I can say is wow! This Open Development Model has
proven to be a huge boon to the project. Not only has it
encouraged user participation by a large margin, it has also
encouraged quite a bit more design process. Rather
spontaneously the maintainers have started to submit design
specs when they are preparing to do major changes. These design
specs are usually one pagers that give a high level overview of
the problem and describe the nature of the fix the developer is
implementing. It lets the community and the other maintainers
know and comment on whats upcoming. It also makes the developer
think more closely about the changes he is planning to make to
the system. All in all it doesn't take a large amount of time
and its helps increase the quality of our offering. Since each
design spec goes to the mailing list it fits right in with the
low level patch model. I like it.

Another major development is that we have moved our website to
the Open Development Model. Originally, we used dokuwiki for
our website. Dokuwiki is an great product, but its a wiki with
the attendant problems that a wiki has. In our case, we had kept
wiki access pretty limited. Only the core maintainers actually
had access to it. This made it hard for the community to fix
issues and expand docs. What we really wanted was to combine the
easy editing of the wiki with our, somewhat more careful, patch
based approach to changes. We spent some time thinking about the
problem and decided that it would be best if our site was
updatable, via git, just like the rest of our projects. However,
we really didn't want to hand edit html and do all the manual
work involved. We liked the readable usable wiki syntax. We
needed some mix of wiki with static source files. We went to
google with low expectations. We where pleasently surprised by
the number of good offerings in various languages. Eventually we
settled on webgen. This is a nice
little ruby framework for static site generation. It supports
various markups, including our choice Markdown. What
we ended up with is an open site with very nice wikish syntax
that is easy to extend and change via a model that our
contributors are familiar with.

Once a change is in our canonical git repo getting the changes
onto the sight is completly automated. A cronjob on our server
pulls the site from the repository and runs the generator. Then
runs rsync to push the changes into the area that they are
served from. This is a very recent change so we don't actually
know if it will accomplish our purpose of more contributions to
the site. One unexpected side benefit is that our site is now
really fast. All the work is happening at generation time so
the server just needs to serve static html files.

Erlware and the Git Development Model

2007-10-25T23:43:00.000-07:00

Introduction

Recently the core commiters for Erlware, mostly Martin and I, have decided to migrate Erlware's source control system from the previous Google hosted subversion repository to self-hosted git. Linus' Tech Talk on Git convinced us that this was a good idea. This isn't just a source control change, but a methodology change. It changes the way code gets from it's source (the Erlware community) into Erlware itself. The main things we really want to get out of this migration is

Greater visibility into the code base for participants
Increased code quality
Faster, cleaner incremental development

It's yet to be seen whether we will achieve our third goal, but we have already achieved the first two, which is encouraging.

We have had a few false starts. That is just to be expected. When we first started, we took the easy default and just replaced the central subversion repository with a git repository. We then treated that git repository as if it were just another subversion repo. That makes for a nasty commit history and it doesn't leverage git's new development model very well at all. So I started searching for information about how to do a large project in git in a truly distributed fashion. Let me tell you, there just isn't that much information out there about organising a project around git. Its just not there! Some people are working on that, I hear. In any case, I finally pulled aside a friend (hey Scott!), who was familiar with git, and asked them how this whole distributed development thing is supposed to work. Much to my amazement he actually knew! He could even explain it to a dim bulb like me!

What follows describes how we are implementing git in Erlware. It is based on what Scott told me, lurking on the git mailing list and widely varied sources out there on the interweb.

The Development Model

The model we are using is actually dirt-simple. Each person has one or more personal git repositories. This is where they do their hacking and keep track of their commit history. There is one canonical repository. However, this canonical repository never has commits pushed to it directly. To get some change into the canonical repository (and into the project) you have to send a patch representing your change to the erlware-dev@googlegroups.com mailing list. This means that every single change is pushed out and viewed by everyone on the mailing list. If we foster the community well, people will give good feedback and code reviews. Hopefully, we can build a community that is as engaged as the community that surrounds git itself.

In any case, once the patch arrives on the list, gets commented upon, changed as required, etc, it will be applied to the canonical repository and become part of Erlware. Think about this for a second, every change to Erlware goes through the Erlware dev list as a patch. Every member of the community has a chance to comment, critique and discuss it. The direction of the project is obvious to anyone who has access to the mailing list, which is anyone that wants access. It makes it extraordinarily easy for anyone to contribute to the project without ever having to code. They can just subscribe to the list and provide their knowledge and insight on the code passing through to the implementers of the patch. Of course, the commiters have the final say into what actually makes it into the canonical repository. However, the community has a huge amount of leeway in making sure that the code is correct and of the highest quality.

You may think that having to submit a patch would surely slow down development on the project. However, you would be wrong. Each developer has his own personal repository with which he can do anything he wants. We encourage developers, and require commiters, to make their repository publicly available. So that developers and anyone working with that developer can have easy access to each other's code. Their development velocity can be anything they are comfortable with. When they finally have something that they think is ready for commit to the canonical repository, they can create patches and submit them. Of course, they will need to spend some time actually refactoring their changes into a nice set of small interrelated patches. This, though, is time well spent and will give them one final chance to refactor their code.

Overall, this should be a huge boost to our productivity as a project and our transparency as project leaders.

The Nuts and Bolts

Actually getting this entire thing set up isn't a trivial project. You need to have some public place to put your git repo, http://git.erlware.org/git or git://git.erlware.org/git in our case). You need to set up that repo with, at the very least, an http server fronting it. You should probably also set up the git-daemon. It makes cloning a git repository very, very fast. Much faster then cloning a repository over http. git-daemon doesn't really have any idea of permissions or users though. So you should set this up for read access only. You do that by making sure that the user git-daemon is running as doesn't have authority to write to any of the files in the repo.

Setting Up a Git Repo

These instructions work whether you are setting up your own repository or the canonical repository.

1) Create a directory for your repos. You can just use mkdir for this, though if you are going to have multiple submitters you probably want to set the sticky group bit on the directory.


$> mkdir repo_dir
$> chown me:group_that_every_commiter_is_in repo_dir
$> chmod g+s repo_dir
$> cd repo_dir
$> mkdir project_git_dir
$> cd project_git_dir
$> git --bare init

This is going to act as a public server, so we want to enable a bit of index generation. To do that we simply make the post_update hook executable. Git will do the right thing with that.


$> chmod a+x hooks/post_update

If you look in post update you will see the command 'git-update-server-info'. This command allows git to update the indices that it needs when cloning over a 'dumb' protocol like http.

Point your http server at the repo and you are done! If you want to do some fancy stuff like send an email on every commit then you need to get then copy the post_receive_email script from the contrib/hooks directory of the git source tar ball to hooks/post_receive in your git repo. Then make that file executable. You need to fiddle with the config a bit, but the instructions for how to do that are in the file itself so there isn't any need to repeat it here.


$> cp /contrib/hooks/post_receive_email ./hooks/post-receive
$> chmod a+x ./hooks/post-receive

Working as a Member of The Project

Working with a git project means that there are three commands that you are going to be using quite a lot. These are git-format-patch, git-send-email, and git-am. Getting good with these commands will let you interact well with Erlware and any other git based project.

git-format-patch takes some command line options to detail which commits you want to turn into patches and then writes the appropriate patches out a set of files, one patch per commit. Its dirt simple, though you will need to learn how to structure your commits in a reasonable way. This isn't hard, but it is a bit involved so I will save that process for another blog. In any case, you will learn how to do it pretty quick once you start supplying patches.

git-send-email lets you take the patches created by git-format-patch and send them to a specified email address. The process is very well documented in the man page linked above so there really isn't any need to go into much detail. One note though, if you are a gmail user and don't already have sendmail or procmail setup to use gmail, I suggest you use msmtp. There are detail instructions on how to do it on the GitTips page of the GitWiki. If you are as blunt as I am, following the instructions there will save you a huge amount of time.

Finally, git-am allows you to take patches from the mailing list and apply them to your local git repository. It understands both mbox format and raw patches generated by git-format-patch. This makes it pretty damn useful. You can set up a pull from your mail server to a local mbox and then just run git-am on it. You can also just pull the patches manually and run git-am on those. The whole process is really well thought out and very, very simple.

The only real problem that I have had so far is that git-send-email doesn't let you prepend any information to the subject line of the sent email. They all end up with [PATCH] . This would be fine for most projects but we run several projects under Erlware and it would be really nice if we could do something so that the subject looked something like [PROJECT][PATCH] . Anything that would let us indicate the project would be great. When and if I get some time I intend to remedy this and send a patch back to the git community.

Conclusion

That's it. No doubt I have missed a huge number of things. Hopefully, the commenters will be nitpicky and point them out so I can fix the problems. I am really excited about this new model and expect it to do some really awesome things for our project. If it doesn't do anything but spread knowledge about the codebase to all our commiters I will be overjoyed.

New faxien

2007-10-06T18:37:00.000-07:00

So its been a little bit since I posted. We have been working hard and heavy on several changes to the erlware repo formats. One of the big time sinks in this project has been writing a bootstrapper/otp release launcher for faxien. Well let me break out the two before I go into details.

The Repository
A few months ago we made some changes to the repository to reduce the complexity and reduce the total number of release versions we had to keep around. Originally we organized the OTP apps in our repository around the erts version number, major, minor and patch. This worked but it meant we had a lot of copies of the same apps laying around. We had hoped that we would be able to organize it around the erts major, minor version instead. This would have saved a huge amount of space. so we made some changes in support of that. Well after we made those changes we found out that the patch version number wasn't as unimportant as we thought. In fact, among other things it seems that the OTP guys feel free to modify the wire protocol in patch versions. They also change the magic version numbers in the c lib so that they wont communicate with with an erts with a patch version different then what they where compiled for. What this means for us is that we had to go back to supporting major, minor and patch versions. We also changed the way release packages are stored in the repo. Not much, but enough to require some code changes. All this took more time then we would have liked.

The Faxien Bootstrapper
The other problem we had is around matching erts versions with releases that faxien pulls down. In the past we used whatever version or erts/erlang was available on the local box. This caused problems if faxien was built for a version of erts that was greater then the one present on the box. Of course, this would be a problem for any and every bit of erlang code that faxien pulls down as well. So we had to come up with a way to pull down an erts to run on as well as the code to run. This meant that we had to come up with some way to bootstrap the system. After much thought and quite a few experiments we decided to write a minimal bootstrapper in ocaml. What this means is that folks can download a small binary that will pull down the required erts version, faxien and all its dependencies. The bootstrapper will then launch faxien to complete the install process.

With this approach the user doesn't need to have erlang on their system at all. They just pull down 'faxien' and it pulls down everything thats required. Thats pretty cool. It also only pulls whats needed so instead of going out and getting 20 megs of erlang distribution they will get what they need and just what they need.

We put in a few other nifty features to make command line OTP releases easier and to make cross platform OTP launch scripts easer to write. However, I will talk about those in a different post.

Sinan 0.8.4 Alpha is Out

2007-07-10T15:41:00.000-07:00

The latest version of Sinan is out. This version adds support for spawning a new Erlang node with shell for the current project. This makes debugging and exploratory programming much, much easier. It also spawns the node with an sname 'sinan_shell' so that the new shell will interact well with distel. This is a change that a lot of people have asked for and I am pleased to be able to integrate it.

I am still having issues with the analyze task that I am working through. However, I hope to have it fixed and working shortly.

Distributed Bug Tracking - Again

2007-06-25T14:26:00.000-07:00

This is a bit of a clarification and expansion on a previous topic of this blog. To recap what I was talking about was a distributed issue tracking system making use of an underlying distributed version control system for its versioning, but augmented by command line tools that support necessary issue tracking features; like searching, merging, etc. Distributed issue tracking is a very new thing and has a few hurtles to overcome. I am going to talk about what these hurtles are and offer some ideas on how to overcome them.

In the comments of the last blog post on this topic Alex and I had a reasonably long conversation around merging. He ended up posting his thoughts here. Alex has some interesting and useful ideas, though we differ in some specifics.

Merging and Discovery

In the last article I used the term 'merging' in a ambiguous way. I used it to refer to both merging two issues into one another and finding the duplicate issues. For the rest of this article I am going to refer to merging as merging two issues and discovery as finding issue duplications. This should reduce the ambiguity a bit.

Merging Multiple Changes To The Same Issue

Merging is actually a pretty strait forward concept. I think you can treat issues the same way you treat a source file when a merge conflict occurs. By automatically merging what you can and allowing the user to resolve conflicts manually you get reasonable merge behavior with a high probability of a correct result. There is some overhead for the user but, as with source changes, it shouldn't be onerous.

Merging Two Issues Into A Single Issue

This problem is slightly more complex but its really just an extension of the last topic we talked about. In this case we just apply that merge algorithm to two disparate issues instead of two versions of a single issue. There may be some ambiguity around which issue becomes the canonical issue and how to merge history for these two files, however, these issues are mostly solved in distributed version control systems and those solutions would work just fine in this instance as well.

Discovery

Discovery is by far the most complex issue here and its a problem that occurs in any issue tracking system. Unfortunately, in a distributed issue tracking system the problem has the potential to be much much worse then in an issue tracking system with a central repository. This is due to the fact that each and every user has his own canonical version of the issue repository. For example: User Y sees a bug in the system and enters Issue X to describe it and User Z sees the exact same bug at a similar time and enters Issue W to describe it. Because User X and Z both have canonical versions of the issue repository and they have yet to sync their repositories there is no way for either user to detect that a issue has already been created for that bug. So when they replicate suddenly there are multiple issues in both repositories.

In more normal issue tracking systems this can be mitigated to some extent by encouraging your users to search for existing bugs first and having people familiar with the issue repository reviewing new issues as they are entered. However, this approach wont work with a distributed issue tracking system because each user has a private canonical set until he syncs with some other user. I believe that this problem will be one of the fundamental problems that will plague new distributed issue tracking system for some time.

There are ways to mitigate this. There are very good document similarity algorithms out there and applying them to this problem wouldn't be too difficult. Unfortunately, the text associated with issues tends to be very short and this doesn't give these similarity algorithms much room to work. There are ways we can mitigate these problems though.

First we can reduce the total document corpus by using attributes of the issue to subset the issues for similarity searching. For example, we might only search for similarities within issues that have a specific component tag. Actually, generalizing this statement we can just use emantic properties of the issue to subset the issues that we need to process for the similarity search.

Second we need to give the user a fast and easy way to run through the output of a similarity search and approve/disapprove the merge. This should be something that allows the user to view the issues side by side and hit a single button or key combination to approve or disprove the merge, then the next set pops up. This cycle would allow the user to quickly move through all possible matches. If this worked well we may be able to loosen the similarity constraints a bit to allow for more matches.

Hopefully, a combination of approaches will make the discovery issue more tractable.

Referencing

The third and last major problem (there are undoubtedly others that I can't think of right now) is simple referencing. There needs to be a way to reference an issue regardless of the repository its on or where it was created. The easiest way to do that would be to make use of simple UUIDs for issue identifiers. They are a bit unwieldy but their inherent uniqueness makes them usable for our purposes. We can reduce the level of pain in using UUIDs manually by allowing the user to specify the unique part of a UUID in the tools that support this distributed issue tracking system. I think monotone and, maybe git, allow something similar for change set identifiers.

The Shape Of Your Mind

2007-05-29T17:52:00.000-07:00

In several cultures through out history, like the Karen Paduang of Southeast Asia, the ancient Han of China and the ancient native tribes of the Paracas region of Peru, various body modification practices where quite common. In fact, among the members of these cultures unusual body shapes were (and in some cases still are) considered very beautiful. Parents went to great lengths to achieve these shapes using mechanical devices to mold an infant's growth into a particular shape. For example, among the Karen Paduang this procedure resulted in women with very, very long necks. In the case of the Han it was women with very tiny, actually unusable feet. For the natives of the Paracas region of Peru it was large cone shaped skulls. Although grotesque by Western standards, the individual subjected to these procedures was considered significantly more beautiful then a person shaped along more natural lines. However, if you removed these people from their culture and time to cast them into many modern societies they would be considered very, very strange at the least and grotesque at worst.

Fascinating, you say, but what does this have to do with computer science and more specifically languages? In many cases, the communities that form around languages are very similar to the insular communities that created these practices. To take this analogy one step further, in many ways programming languages act quite a lot like the devices used to shape the skulls of infants in Paracas, the feet of Han women or the necks of Karen Paduang women. In the case of these languages, instead of shaping the skull they tend to shape the way we think about problems, the way we form ideas and the way those ideas are applied to a particular problem. For example, if you have only every written code in early versions of Fortran (Fortran 77 and before) then you probably don't even know recursion exists. Also, if you have only ever coded in Haskell you probably would know very little about imperative style loops.

If I might take a bit my own history as an example, very early in my forays into coding I came across a problem requiring a search through a directory structure. I quickly came up with an imperative-loop-based solution that did the job. Unfortunately, the code was ugly and (if I may borrow a term from the refactoring community) it didn't quite smell right. I didn't know what the right solution was but I knew I didn't have it. At that time, the public internet was a new fangled thing but I had already found it to be useful for gathering information. So I searched for solutions that others had found to similar problems. Eventually, I came across a piece of code that used recursion to walk the tree instead of imperative looping. It took me a little while to get my mind around this new concept, having never been exposed to non-imperative code. However, once I did, I realized that the recursive solution was a much cleaner and more natural solution for this problem. Recursion isn't always a more natural solution then imperative iteration, but in this case it was. That made me think about this type of problem in a completely different light. It added another tool to my toolbox. If I had been exposed to functional languages before this point, it would have saved me a great deal of time and trouble.

This is a minor example, but it does help prove a point: what a language allows and does not allow affects how you think about problems. This is an extremely important realization because it means that, with certain caveats, the more programming languages you know the more insight you may have into the solution to a particular coding problem. So, if you accept the fact that languages tend to mold your way of thinking about problems then you can easily see how languages can be compared to these body modification devices we spoke of earlier. If Programing Languages can be compared to these devices then we, the users of programming languages, can be compared to the subjects that undergo modification.

Granted, the comparison is not exact. As engineers, we don't start out with a nice round head. We have to work to achieve it using the same tools that provided the initial distortion. To elaborate, we all start out learning a single language and that language affects the shapes our 'mind' so to speak. This initial language pushes our mind out in one direction, perhaps upward. So after we learn a single language most of us are walking around with a big cone shaped mind.

Unfortunately, many of us never go on to learn any other languages. We keep our big cone shaped mind for the entirety of our career. That may not be a bad thing. If you are solidly embedded in a particular 'language culture' then cone-shaped minds are probably considered quite beautiful. In fact, you may be considered some type of elder because of the cone-iness of your mind. We tend to call these people Gurus and they are deserving of some respect.

These cone-shaped minds are probably not considered all that beautiful out side of their specific 'language culture'. C gurus aren't going to be very useful in Scheme community and Scheme gurus aren't going to be very useful to the C community. That's bad because both languages have ideas and features that are generally useful to understand. Those Cone minds that keep to a single language throughout their entire career are never realize their full potential. That's unfortunate because a programmer with a well-shaped mind is generally more efficient and better able to find the most elegant solution to a problem. In fact, he ceases to be a programmer and becomes an Engineer. If he is diligent and studies hard he may even become a Good Engineer.

Now, about this time you are probably thinking to yourself, 'I don't really like the idea of walking around with a big cone-shaped mind.' If thats the case, great! Fortunately, unlike the natives of Paracas, you can do something about they way your algorithmic 'mind' is shaped. How do you go about reshaping your mind? Well, it's not a simple process, you basically need to force your mind into a new shape using the devices that warped your mind in the first place. You must learn more and distinctly different languages. Each language forces your mind to grow in a different direction. Learn enough languages and your mind will have a nice round shape.

So how many languages do you have to learn and what languages are the best? There is no fixed number. I usually suggest five as a minimum number and recommend ten or fifteen. That may sound like a lot but after the first couple languages picking up new ones starts becoming much easier. In that regard its a little like picking up a new spoken language. For example, if you know Spanish then Portuguese isn't all that hard. If you know Spanish and Portuguese, then Italian is pretty simple. If you know Spanish, Portuguese and Italian, then picking up French is a snap. This goes on and on. In the case of programing languages there are a small number of additional rules you need to apply to get the most out of this process, but the overall process is the same. The additional rules are listed below.

1. Each language must come from a different family oflanguages. See this history of languages information.

2. All three major paradigms (Procedural, Object Oriented, and Functional) must be covered.

3. At least two minor paradigms (Concurrent and Logic/Declarative) must be covered.

4. Both static typing and dynamic typing need to be represented.

If you have never heard of functional programing and don't have a clue what procedural means don't worry. I will help you out a bit by providing a list of languages broken down by family and paradigm at the end of this little missive. As you learn new languages you will soon have a good idea of the different families of languages and the paradigms they represent. Before you know it, you will know quite a few different languages and you'll be able to think about problems from many different angles and your mind will have a nice round shape. You will be able think clearly about a programing problem in any number of ways instead of the small number of ways your previous cone-shaped mind allowed.

Dave Thomas of Pragmatic Programmers (not Wendy's) fame came up with the idea of learning a language each year. They started a group to accomplish this back in 2002. Unfortunately it seems to have started and stopped all in that same year.

The Languages

A language breakdown by family is available on the 'History of Programming Languages' information.

As for the breakdown by type, I am not going to try to do this for every language available. So I am just going to give you a list of ten of fifteen programming languages broken down according to the rules I provided previously. This should give you enough of a group to pick five that interest you.

Descriptions are arranged as follows ([paradigms], Typing, Family). If family doesn't exist assume the language is in its own family.

Erlang ([Functional, Concurrent, Distributed], Dynamic Typing)
Forth or Postscript (Dynamic Typing)
Mercury ([Logic, Declarative], Dynamic Typing)
Prolog ([Logic, Declarative], Dynamic Typing)
Mozart-Oz ([Functional, Procedural, Object Oriented, Logic, Distributed, Concurrent], Dynamic Typing)
Lisp ([Functional, Procedural, Object Oriented, Logic], Dynamic
Typing, Lisp)
Scheme ([Functional, Object Oriented, Logic], Dynamic Typing, Lisp)
Ada ([Procedural, Object Oriented, Concurrent], Static Typing, Pascal) (Another resource)
Python ([Procedural, Object Oriented, Functional], Dynamic Typing)
Haskell ([Functional, Lazy], Static Typing)
Lua ([Procedural, Object Oriented], Dynamic Typing)
Ruby ([Object Oriented], Dynamic Typing, Smalltalk)
Smalltalk ([Object Oriented], Dynamic Typing, Smalltalk)
SML ([Functional], Static Typing, SML)
Ocaml ([Functional], Static Typing, SML)
Clean ([Functional], Static Typing)
D ([Procedural, Object Oriented], Static Typing, Algol)

I didn't include the languages that are common (C,C++,Perl,Java) because there is a good chance you already know them. They also don't count for the purposes of this exercise (C,C++, and Java are all part of the Algol family and would only count once anyway). Feel free to choose other languages that you may be aware of and find interesting. This list is only a 'Getting Started' list.

I strongly suggest that you learn a Lisp dialect and a Forth. These two languages are very good at shaping your mind and the two languages specifically tend to force the shape of your mind in opposite directions. Its a somewhat painful process but well worth the quick results. At the very least make sure that one of these languages is included on your list.

Footnotes

Thanks to a comment by Vince I have found that I am not the only one thinking along these lines, not that I actually thought I was. In the linguistics community there seems to be a hypothesis call the Sapir–Whorf hypothesis
that describes something similar. Also Kenneth Iverson gave his Turing Award lecture around the same topic. It was called "Notation as a tool of thought". Unfortunately, I can't seem to find a good link for it right now.

Off Line Development

2007-05-18T10:55:00.000-07:00

After numerous requests I have finally implemented off-line development. You no longer have to be connected to build your projects. In the past the dependency analysis task ran every build and if it thought that there was a chance that the dependencies had changed it connected to the repository to check the dependencies. This no longer happens. There is now a check_depends task that checks to make sure that dependencies have been run at some time in the past. It then checks if the dependencies need to be updated. If the dependencies do need to be updated it asks the user if he wants to update the dependencies (by connecting to the server). If the answer is affirmative then the update occurs if not then it continues with the existing dependencies. The user may run dependencies at any time by running the depends task directly. This approach gives users much, much more control over when and how dependencies are resolved. It also allows the user to control when and how the build system connects to the repository. I hope I have done this in such a way as to not add any additional burden to the user.

Distributed Bug Tracking

2007-05-02T21:24:00.000-07:00

I came across a project called DisTract last week. DisTract is basically a distributed bug tracking system. Specifically, its a file based bug tracking system who's directory structure sits inside of a repository managed by a distributed version control system. It uses this distributed version control system to manage distribution. This system was preceded by another, somewhat bit rotted, system called Bugs Everywhere.

I really like the idea of distributed bug tracking. It fits in really well with distributed version control and over all distributed development. Being able to create new branches for your bug system along with it source. Merge those branches back into mainline all the while keeping history etc. Thats a very, very powerful thing. However, this approach has a fundamental, maybe intractable, problem when it comes to merging. In a version control system merging is relatively simple. Files identifiers (basically the path in the workspace) are fixed there is no ambiguity if the same file is changed on two different boxes. Its just a matter of merging the contents of that file. Distributed bug tracking still has the problem of merging changed bugs. However, it has an additional problem. That is the fact that there is now why to relate on bug created by one person to a different bug created by another person. This is a problem that every bug tracking system has to some extent. In a distributed bug tracking system its even worse because the bugs created by another person wont be visible until they get a push or do a pull from that other persons repository. There may be a way to solve the problem using some of the modern document similarity algorithms. However, considering the small amount of text usually supplied with a bug report this is probably unworkable. I don't have a solution to this yet but I may play around with some ideas and use some information form some of the larger public repositories to run some tests.

Erlang and the Web

2007-04-23T17:27:00.000-07:00

I have been spending a lot of time thinking about leveraging Erlang's concurrency features and OTP in web applications. Right now I don't believe that any of the available frameworks do that very well. Erlyweb tries to be 'rails' for Erlang without really leveraging the features that make Erlang great. Yaws is more a web server then a web app server. It tries to make some amends here by providing things like appmods and yapps but they feel bolted on and they don't really leverage OTP at all.

This has been an open issue in my mind for some time. I created the tercio project as a starting point to solve this problem back in November or December of last year. It languished for awhile, partly because I had yet to figure out a elegant solution. Well, I think that I finally have. Its the logical conclusion of current web development trends. I am surprised that no one has thought of it yet. The idea is two fold.

1) Let client side handle the client side
2) Let the server side handle the server side

The general idea is that you will let the client side handle all client side rendering. The only server side participation in this is serving up files. In turn, the server will handle all server side (business) logic. The two should really have very little knowledge of one another. Hmm, seems a bit too simplistic doesn't it? I thought so, before I realized that the client side in this continuum already has a perfectly good language on which to base things. That languages is
javascript. I can hear the groans of dismay already. I emitted those very same groans back in the late '90s through the mid '00s and I wouldn't have even considered this two years ago. However, the landscape has changed alot in two short years. Ajax has gained prominence, libraries like prototype have been created. Its just a whole different world. We are already in good shape for the server side with OTP.

So what are the mechanics of making this happen. First we need to get away from generating html on the server side. To do that we need to make it easy to generate it on the client side. Manually creating dom objects really isn't the right way to go. I think we can do this with a library called JavascriptTemplates. This provides a reasonable templating language on top of javascript. Since each snippet of template will be small this should provide a reasonable efficient way to go. The second problem is how do we remove the client side knowledge from the server? I think tercio already does a good job here. It provides a javascript<->erlang bridge. With this javascript can send and receive messages to processes that have registered interest in client side message no the backend. Its webserver agnostic, so by providing a small shim you can make it work with any webserver you want. I am building a small startup on top of this framework. I am quite sure there will be a lot of issues to work out. However, I think the fundamental principles are solid and should provide for the right web development experience in Erlang.

Tercio isn't yet ready for prime time or even late night yet. It should be soon though so keep your eyes posted here.

Build Flavors

2007-04-08T12:14:00.000-07:00

I had quite a few requests to support different types of builds within the same project. Usually the request centered around being able to do 'development' and 'release' builds. In these two cases, development would enable debugging information and unit tests while release would strip them out. Providing static development and release build flavors wouldn't really solve the underlaying problem, which is the need to parameterize the build process. I ended up solving this by adding a 'flavors' option to the build config and having the tasks take arguments. Together these two features should allow a pretty wide range of build customizations.

Following is the flavors entry in the default build config. Its pretty self explanatory, 'default_flavor' indicates which flavor should be used when the user doesn't specify a flavor. 'flavors' is assigned to the build flavor definition. Within each flavor you assign an argument to each build task.


default_flavor: development,

flavors : {

   development : {
      build : "+debug_info -W1"
   },

   release : {
     build : "-DNOTEST=1 -W1"
   }


},

Right now only the build and test tasks take arguments. However, over time the other tasks that need arguments will take them as well. On a side note, I have created a module that trys to parse out erlc arguments.

Busy Times

2007-03-31T16:58:00.000-07:00

I have had a bit of a perfect storm this week. I decided to upgrade my Ubuntu to feisty. Although the upgrade went very nicely it took a couple of days to get the configuration the way I like it. It always does. I tend to tweak my setup to no end. Well after all that the OTP folks went and released the new version of Erlang so it took a few days to get the repository updated. At last, though, things have settled down and I am able to get back into sinan development. So I should be releasing a couple of new versions of the alpha over the next few days.

On a side note, Ubuntu feisty just rocks. It has, by far, the best wireless support that I have seen any any distribution. That alone makes it worth the upgrade. If you haven't already done it I suggest that you do the upgrade.

Full OTP Migration

2007-03-18T23:29:00.000-07:00

I just finished migrating the sinan to a full OTP application. It already followed all of the OTP principles it just wasn't setup to run as an application. I ran into a few issues that encouraged me to do the conversion. It wasn't actually difficult, as I said , I already followed OTP for the most part. It was mostly a matter of turning the tasks into gen_servers and figuring out a way for them to work together in a meaningful way. For now there isn't much difference for the user, but it sets things up so that I can rapidly iterate on the current outstanding issues.

The hardest part of all of this was getting the error_logger logging set up right. A lot more loggers then I suspected are involved from the get go. Kernel sets up a very primitive error_logger to start. It also sets up a slightly better tty logger. Then sasl sets up its own set of loggers. Figuring out where this was coming from, getting rid of the loggers and adding the custom logger was more difficult then I actually expected. In the end I got it done via a combination of configs for kernel and sasl and actually removing the primitive logger via the gen_event api.

In any case, I should be able to start knocking my open issues. Thanks to the beta testers for providing the feedback.

More on Configuration

2007-03-15T01:06:00.000-07:00

The first, lightly tested, version of fconf is out. At the moment its undocumented, but that should change pretty quickly. Its a otp app that supports multiple simultaneous configurations and merging configs together. It supports everything that I talked about in my last post. Sinan has already been ported over to fconf from its built in configuration system.

Configurations

2007-03-12T01:39:00.000-07:00

OTP applications have their own configuration mechanism in the app config structure. However, this doesn't always suit every applications needs. Currently, I have two distinct config mechanisms right now, one for sinan and one for tercio. They share a lot of similarities and I suspect other applications share similar needs. To that end I have split the config system out into its own project, fconf with the intention of using it in both projects. There isn't much out there yet but I should have the config subsystem pulled out of sinan and refactored in the next day or so. There are a few features that I want to support.

Reloadable config files (maybe auto reloaded)

Config file syntax agnostic; use the syntax that is right for your user community

Good, well defined override semantics

Of course, it all needs to be robust and scalable as well. The existing sinan config subsystem supplies most of this. It needs to be converted over to gen_server, create a supervision tree, and convert it to an OTP application.

First Version Out to Beta Testers!

2007-03-08T21:25:00.000-08:00

I just sent the first really usable version of sinan out to the beta testers who registered their interest. I expect there to plenty of issues. However, the fact that the product is out and being used is great. I can't wait to see the feedback, both positive and negative, that should come in. I will try to relay some of the relevant bits here on this blog.

Dependencies Again

2007-03-03T15:03:00.000-08:00

Just when you think you are finished something pops up and bites you. As I was getting ready to do the release to my beta customers I noticed that some of my unit tests in the dependency code no longer passed. I had made some changes the week before to alter conflict reporting and I guess I for got to run the tests. Well I jumped in and started debugging. An hour or so latter whop-a-mole with bugs I realized that the core algorithm was wrong. Leaky edge cases are almost always a symptom of an underlying fault in the logic. In almost all of these cases about the only thing you can do is through out the existing implementation and start fresh.

Well, in this case, I wrestled with a creating a new solution with little or no luck. Eventually, I went to lunch with some friends to talk out the problem. We, or more specifically Scott Parish, realized that at its core this is a backtracing problem very similar to that Prolog was designed to handle. Fortunately, he had just spent the last month living and breathing a similar backtracing problem and volunteered to code up a solution. He took the core of his solution from chapters 11 and 12 of PAIP. It seems that Erlang, due to its immutability, makes for a very good platform for these types of issues. In any case, Later on that night he sent me the solution that turned out to be a special cut down, special purpose prolog interpreter. Even then it was a complete, fast solution that occupied no more then a hundred lines of code. I am working on getting it integrated back into the build system as I write this. Its amazing how simple and concise an elegant solution can be.

Open Beta for Sinan

2007-02-28T13:50:00.000-08:00

So we have moved right along and we are very close to release. Before we do a general release we decided to do something of a small beta. We are trying to get together a group of solid people who know Erlang that are interested in using the system. They shouldn't mind going to a bit of extra trouble to provide us with useful information. They also need to be able to put up with potential issues that might arise from using the new build system. If you want to participate either post a comment to this blog letting me know or send an email to me or Martin Logan. You can find both of our email addresses in the erlang-questions archives. Once you do that we will provide you with some nice tarballs and pointers to the documentation.

Windows Support

2007-02-20T18:35:00.000-08:00

So far support for Windows has been, at most, an after thought for me. I own only one rather old windows box that I use for the occasional gaming. For that reason, windows just isn't a big priority. However, after a discussion with Martin Logan we came to the conclusion that sinan will need to support windows from the initial release. Fortunately, I have used the filename and filelib modules through out the implementation. That should remove any path name issues between the two systems. Unfortunately, I use symlinks pretty heavily through out the build. I have also used two very unix specific os commands as part of development, uname and tar. At first I thought that replacing these commands would pose a big problem. That proves not to be the case as the stdlib and kernel applications provide for my needs, though that functionality is pretty well hidden.

To figure out what platform and architecture I am on I use uname. I need this to pull down the correct version of binary dependencies from the repository. I didn't see any real solution to replace this command. I finally ran across a reference to erlang:system_info(system_architecture). This should solve my needs pretty well. Unfortunately, it only works in R9 and above. So, as you would expect, its a trade off. If I use this I can make the system work in windows but not in pre R9 systems. I think the right choice is to make it work in windows. Hopefully a pre R9 solution will present itself.

I also thought tar would present more of a problem as tar isn't a windows command. There is an erl_tar module in stdlib but it claims to support only Sun's version of tar. However, on examination of the docs for both erl_tar and gnutar this seems not to be the case. The format followed by erl_tar is IEEE Std 1003.1. This is an extended version of tar called ustar and its a POSIX standard. It seems that modern versions of gnutar support IEEE Std 1003.1 as well, as you can see here (check the references at the bottom of the page). So the erlang docs seem either in accurate or out of date, fortunately for me.

Symlinks are a bit of a harder problem. The need for symlinks is greatly reduced now that I don't have to build up tar-able structures. However, I still use them to build up a deployable version of the OTP apps with sources in the _build directory. I think that the only solution to this problem is to copy the sources and related files into the binary structures instead of symlinking them in. It will add a bit of overhead but I think thats worth it.

I do have one final problem, this one just occurred to me. I have written a nice little shell script to kick off a build. This shell script uses a bit of magic (borrowed from firefox) to follow any symlinks back to its deployed area where it gathers the paths for sinan. It then kicks off erl with all the proper paths to the sinan code. Not being a windows person I have no idea how to handle this on windows boxes. Hopefully, some erlang oriented windows guy will step forward and help me out with this.

Sinan Documentation

2007-02-14T23:51:00.000-08:00

The documentation for Sinan goes apace. I have finished all of the user level documentation and have started the developer documentation.

Why worry about developer documentation? Sinan is a very pluggable system. Most of the current functionality that ships with the system is composed of tasks that are plugged into the engine. Third party tasks will use the exact same mechanism and be first class parts of the system right along with them. I think this is pretty darn important and I want to support it right out of the box. Thats why I am spending a bit of extra time and getting the developer documentation out with the user docs.