Tuesday, August 9, 2011

I Have Moved

For various reasons (the biggest being markdown support and the fact that the blog has less of an Erlang focus) I have moved my blog to blog.ericbmerritt.com. You will see new content there in the future.

Sunday, January 25, 2009

Build Process Integration

Introduction

This post isn't going to be Erlang or language oriented at all. One of my other hobbies revolves around the build process and build process tools. Over the last few years I have spent a lot of time thinking about improving them and making the build process more transparent. I believe I have a way to do that. Unfortunately, it would take the cooperation of build tool implementors to get it off the ground, or else a new implementation of the existing tools. In any case, let me describe what I am talking about.

Many of you may be familiar with a product called Trac. This is an open source project management tool. Actually, it calls itself an 'enhanced wiki and issue tracking system for software development projects', but it includes facilities for project management, source control artifact integration, authorization and authentication, etc. It's a very interesting and reasonably complete tool. However, there is one single feature that makes this product especially interesting: the ability to link between the various kinds of information that Trac keeps track of. Usually this linking is done with very simple, easy to remember micro-formats. For example, you can easily link a ticket to a changeset with the 'changeset:' notation in the ticket. Trac understands this format and will produce a clickable link when the ticket is rendered anywhere. This type of linking works between any of the artifacts stored in Trac. The other features are really just there to support that single killer feature. Unfortunately, you only get to use this feature while you are in Trac.

That's a problem. Because this linking only works with information owned by a Trac system, Trac is forced into an 'own the world' mentality. That is, if the designers want to give you the ability to link between a wiki page and an issue, the product must provide functionality that implements both a wiki and an issue tracking system. If it wants to give you the ability to link to artifacts in a build system, it must provide a build system. This is true of any artifact that Trac, or systems like Trac, want to provide. This forces the implementors to spread their time and effort over a range of products instead of sticking to a single product and getting it right. It also means that a consumer of the product doesn't have the ability to swap out parts of it for something he may like better. This is why Trac implements a wiki, a project management system, and a source control display system, even though quite good systems already address each of these spaces in the open source world. It also means that if the developers of Trac or similar systems want to add linking to a new type of artifact, they must implement that artifact in their system.

A Better Approach

There is a better approach, although it will take a bit of effort to see it realized. Fundamentally, this better approach is to let each system focus on its purpose and provide some system agnostic way to tie artifacts together. That is, some other system should exist that understands how artifacts are related to one another. This system would then allow users to create 'links', or relationships, and query those relationships at will.

We can do this by utilizing REST based services (other approaches would probably work as well) and the REST semantics. By vending an artifact at a specific, unchanging URL we can provide a universally unique identifier that we can use for linking. For example, let's say we were building up a Trac-like system. We put our issue artifacts in an issue tracking service at a specific URL, say 'http://issues.com/api/issue/'. We also put our projects at a specific URL, say 'http://projects.com/api/project/'. Both of these APIs vend data according to some predetermined format. With these in place it becomes very easy to link the two kinds of artifacts together. This implies the existence of a couple of things. First and foremost, that consumable formats exist for issues and projects. Secondly, that there exists some means of resolving these relationships; basically, that you have somewhere to put the fact that these artifacts are related. I think a separate type of system should be set up to manage these relationships. For now I will call that a relationship management service. With these facts established, if we wanted to associate issue 13 with project Foo we would just create a relationship between http://issues.com/api/issue/13 and http://projects.com/api/project/foo in our relationship management service. With this service in place, each individual system wouldn't need to understand linking at all, nor store link information. Only clients that wanted to consume the information would need to understand linking.
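A minimal sketch of what such a relationship management service might look like, here as an in-memory Python class (the class name, the 'relates-to' kind, and the API shape are all illustrative assumptions; a real service would persist links and vend them over REST):

```python
# A minimal in-memory sketch of the proposed relationship management
# service: it stores links between artifact URIs and answers queries
# about them, without knowing anything about the artifacts themselves.

class RelationshipStore:
    def __init__(self):
        self.links = set()  # set of (source_uri, kind, target_uri)

    def link(self, source, target, kind="relates-to"):
        # Record a relationship between two artifact URIs.
        self.links.add((source, kind, target))

    def related(self, uri):
        # Return every artifact linked to or from the given URI.
        return ({t for s, k, t in self.links if s == uri}
                | {s for s, k, t in self.links if t == uri})

store = RelationshipStore()
store.link("http://issues.com/api/issue/13",
           "http://projects.com/api/project/foo")
print(store.related("http://issues.com/api/issue/13"))
```

Note that neither the issue tracker nor the project system participates at all; the relationship lives entirely in this third service, and the link can be traversed in either direction.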

There are a few advantages here and some disadvantages. The biggest advantage is that systems don't need to understand how linking occurs with other systems. We can drop any system we want into the mix and get reasonable linking semantics. The second big advantage is that we can manipulate the links in one location, traversing the entire graph of links without jumping around to each service that stores information. The disadvantage is that we have yet another service to manage.

How To Do It

Of course, actually accomplishing the task of building these relationships is not so simple in the pragmatic world. Several prerequisites have to be met before we can get started.

  • Systems must vend their data in some general, consumable, system agnostic way.

  • Systems must actually vend the artifacts that other systems may find interesting.

  • There must be a uniform, unchanging way to resolve artifacts so that they can be linked to.


Each of these prerequisites is more complex than it may first appear. For example, the first prerequisite says that systems must vend their artifacts in some system agnostic way. However, in reality you want them to vend their data in as simple a way as possible, with either some type of schema or self describing data. You also want all of your systems to vend data in a similar way to reduce the complexity of linking and consumption. Similar issues exist with the second prerequisite. Artifacts that should be exposed may be sub-artifacts of previously exposed artifacts. For example, in a project management system, projects are probably an exposed artifact, but so are milestones, which are lower in the hierarchy than projects. Exposing these in a consistent, consumable way will require some forethought. Finally and most importantly, artifacts must be resolvable for linking to have any meaning at all. That means that each artifact must have a consistent and unchanging URI that can be used to identify and consume it. Designing such an API will take significant forethought on the part of the system implementors.
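To make the first prerequisite concrete, here is one hypothetical shape a vended artifact might take: a self-describing JSON document carrying its own type and stable URI. The field names here are assumptions for illustration, not a proposed standard:

```python
import json

# A hypothetical self-describing artifact as an issue tracker might vend
# it. The 'type' and 'uri' fields make the document consumable without
# out-of-band knowledge; everything else is artifact-specific payload.
issue = {
    "type": "issue",
    "uri": "http://issues.com/api/issue/13",
    "summary": "Build breaks on case-insensitive filesystems",
    "status": "open",
}

document = json.dumps(issue)   # what the service would send over HTTP
parsed = json.loads(document)  # what a consuming client would do
print(parsed["type"], parsed["uri"])
```

A consumer that understands only the 'type' and 'uri' conventions can already resolve and link the artifact, which is exactly the property the linking service depends on.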

What we are essentially talking about is evolving a set of tools from a broad array of isolated stacks to a set of components that can plug in to a generic, flexible, component architecture. 'Plug in to' in this case just means forming consumable relationships between artifacts vended by each component. This is very similar to the way the familiar World Wide Web is designed. In our case, we are linking artifacts instead of documents and storing the links outside of the artifacts, but the design philosophy is very similar.

To that end, a set of principles needs to be designed to inform the creation and management of components and of the relationships between the artifacts those components vend. This must not be some heavy handed attempt to mandate interoperability. It should be a set of general guidelines about what types of services must be vended by a system. The relationship management should occur within a dedicated system as well.

This approach would free product developers to concentrate on a specific feature or set of features. It would then allow individuals or entities to consume artifacts as they will and form relationships between arbitrary artifacts. This is an important meme that I think should be propagated along with the REST approach to web services.

Wednesday, July 23, 2008

Forth: The Other White Meat

Lisp is constantly touted as a language that any self respecting coder must learn. The proponents of Lisp have every right to make the claims that they do. From the perspective of stretching your mind, Lisp actually is as good as they say it is. Lisp is a principal member of a small, elite group of languages whose use really causes a fundamental change in the way people think about programming and system design. However, it is not the only member of this group. The group includes at least one other language, Forth. Forth is just as mind bending, introspective and universal as Lisp, but it doesn't tend to get the same amount of press as its older brethren. Well, I intend to change that, at least for the small corner of the world that reads this column.

I know that as soon as I mentioned the F word (Forth, that is ;) a general groan went up from those of you that have some experience with it. Don't worry, groaners, I feel your pain. However, take a couple of aspirin, sit back, and listen to me for a bit, and you might gain a different perspective on the language. For those of you not familiar with the language, the reason for the groaning will become obvious as our story progresses.

I had given some thought to gathering up all of the 'quotes' that people use to promulgate Lisp and then justifying them for Forth. After thinking about it, I decided to take a slightly different tack, mostly because Bill Clementson has already done most of the gathering work for me and also because I think that you're smart enough to draw the correlations without me beating you over the head with them. So take a couple of minutes, pop over to Bill's site and read the quotes. Don't worry, I'll wait.

What Is Forth

Back so soon? Good. Forth is a language that takes a wildly different tack than any other language out there. It was initially developed by Charles Moore at the US National Radio Astronomy Observatory in the early 1970s. Back in the early sixties, while working for the Smithsonian Astrophysical Observatory, Mr. Moore found a need for a little more dynamism than he had had in the past. To that end he put together a little interpreter intended to control a punch reader. By using this interpreter he could compose different equations for tracking satellites without recompiling the entire system. This was no small feat for the time. Mr. Moore, like any good hacker, took his interpreter with him when he left that job. He carried it around for the next five or ten years, constantly tweaking it. By 1968 it was finished enough to build up a little game called SpaceWar as well as a pretty nifty chess system. This version of the language was the first to be called 'Forth'.

This early evolution and constant tweaking produced a fairly interesting language. A Forth program is simply a series of tokens, called words, and two stacks. The stacks are the main data stack and the return stack. For right now we will ignore the return stack and talk about the data stack. To give you comparative examples, let's look at an operation in Lisp and Forth. The Lisp version is perfectly recognizable to just about anyone. It just adds two numbers together.


> (+ 1 2)
3


In Forth, it goes as follows.


> 1 2 +
3


A more complex example


> (/ (+ 27 18) (* 3 3))
5


In Forth, it goes as follows.


> 27 18 + 3 3 * /
5


If you have been exposed to some of the old RPN calculators, or even PostScript, you may be able to tell what's going on here. Each number, as it appears, gets pushed onto the explicit data stack. The '+' word (and a word it is) takes exactly two numbers off the stack and replaces them with the value resulting from their addition. So what we get, in stack terms, is.


> 1 2 + ( results in 3 on the stack)


This explicit stack is the way all data is handled in Forth, everything. You may remember that I said everything is a token. That's absolutely correct. Forth reads each token and looks it up in a special dictionary that is used to store references to words. We create entries in that dictionary as follows.


: square
DUP * ;


The colon is an immediate word (read macro) that reads ahead one word in the token stream and uses that word to create a new entry in the system dictionary.

It then reads ahead until it finds a semi-colon ';', which it uses to indicate the close of the word, and compiles the intervening tokens into references to words. It is as simple as that. 'Hold on, hold on!' you say. 'You said everything was tokens in a Forth system; aren't these special?' Nope, I didn't lead you astray. The colon and semi-colon aren't special at all and can be overwritten and changed at any point you like. They are examples of Forth's powerful macro facility, with which you can create syntax for a language that essentially has none. This is a powerful concept that Lisp shares. As Paul Graham says, you can use this to build the language up to your problem instead of the other way around. Of course, there is a downside. Any time you give the programmer extreme flexibility, someone is going to abuse it. For example, take a look at this example from a Forth library in the wild.


: rework-% ( add - ) { url } base @ >r hex
0 url $@len 0 ?DO
url $@ drop I + c@ dup '% = IF
drop 0. url $@ I 1+ /string
2 min dup >r >number r> swap - >r 2drop
ELSE 0 >r THEN over url $@ drop + c! 1+
r> 1+ +LOOP url $!len
r> base ! ;


Granted, I have taken away the context, but you begin to understand the impetus behind those groans you heard a short while ago. Forth is one of the most flexible languages available, but the very flexibility that makes it interesting also makes it dangerous. It takes diligence on the part of the programmer to write really clean and maintainable code. However, if the coder does take that care, he can write eminently readable and maintainable code. Take a look at the following examples, which appeared in Starting Forth by Leo Brodie. You can get a good feel for what's going on even without knowing the context.


: WASHER WASH SPIN RINSE SPIN ;
: RINSE FAUCETS OPEN TILL-FULL FAUCETS CLOSE ;


So doing Forth well or badly is entirely in the hands of the coder; that's why the Forth experience varies so much, and that's why Forth is a Language for Smart People.

Forth Basics

I have the very good luck of owning a copy of the book Thinking Forth. This is one of those books that you should add to your library even if you never intend to write a single line of Forth code. It has a huge amount of great insight into coding and system design in general, though you probably don't want to use it as your first introduction to Forth. In any case, I am going to be borrowing some topics and examples from that book to see us on our way.


: breakfast
hurried? if cereal else eggs then clean ;


The above example illustrates the fundamental building block of every Forth system, the word. This specific example is what's called a colon definition. It's how you define a word in Forth. Basically, the colon ':' is an immediate word (think macros operating on the word stream) that takes the very next word and creates an entry in the dictionary based on that word. It then reads all the words up to the semi-colon ';' that tells it to stop. For each word it encounters, it looks up the position in the dictionary and puts that location in the code stream. Of course, in this case we have the 'if' immediate word that takes control for a little while before handing it back. So what specifically might be going on here? In this case, hurried? probably looks at some parameter in the system to decide whether haste is in order and puts a boolean on the stack. 'if' is an immediate word that compiles to a jump based on the 'else' and 'then' words. You can probably figure out what it does from here.

This, admittedly simplistic, example does illustrate a few of the special characteristics of Forth. Most notably, the stack, words, and use of immediate words.
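The dictionary-and-token machinery described above can be sketched in a few lines of Python. This is a toy, not a real Forth: real Forths compile to threaded code and distinguish immediate words, but the read-token, look-up-in-dictionary, push-numbers loop and the read-ahead behavior of ':' are the same in spirit.

```python
# A toy Forth-style evaluator: a data stack plus a dictionary mapping
# words to actions. Numbers push themselves; ':' reads ahead for a name
# and collects the body up to ';' into a new dictionary entry.

stack = []
words = {
    "+":   lambda: stack.append(stack.pop() + stack.pop()),
    "*":   lambda: stack.append(stack.pop() * stack.pop()),
    "DUP": lambda: stack.append(stack[-1]),
}

def interpret(tokens):
    tokens = iter(tokens)
    for tok in tokens:
        if tok == ":":                 # colon definition begins
            name = next(tokens)        # read ahead for the new word's name
            body = []
            for t in tokens:           # collect tokens until ';'
                if t == ";":
                    break
                body.append(t)
            words[name] = lambda b=body: interpret(b)
        elif tok in words:
            words[tok]()               # execute a known word
        else:
            stack.append(int(tok))     # a number pushes itself

interpret(": square DUP * ;".split())  # define a word, stack untouched
interpret("1 2 +".split())             # stack is now [3]
interpret("4 square".split())          # stack is now [3, 16]
print(stack)
```

Notice that after the definition runs, 'square' is an ordinary dictionary entry indistinguishable from the built-ins, which is the point the post makes about ':' and ';' not being special.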

Factoring

Long before anyone had ever heard of Agile Methodologies or Extreme Programming, the idea of factoring was already hard at work in the Forth community. The idea of constantly looking at code, breaking up functions, and generally simplifying the system is a foundational concept in Forth. In fact it's taken to something of an extreme. Words are a huge part of Forth systems and of the factoring process. For example, consider the following word, which finds the sum of squares of two integers:


: sum-of-squares dup * swap dup * + ;


The stack inputs to the word at run time are two integers; the stack output is a single integer. Through factoring, the example would be rewritten using a new definition called 'squared' that shares the common work of duplicating and multiplying a number. The first version was overly complex and illustrated the notorious line-noise aspect of Forth. Fortunately, by factoring we can produce a much more readable and understandable system.


: squared dup * ;
: sum-of-squares squared swap squared + ;


Good Forth programmers strive to write programs containing very short (often one-line), well-named word definitions and reused factored code segments. The ability to pick just the right name for a word is a prized talent. Factoring is so important that it is common for a Forth program to have more subroutine calls than stack operations. Writing a Forth program is equivalent to extending the language to include all functions needed to implement an application. Therefore, programming in Forth may be thought of as creating a Domain Specific Language. As Lispers well know, this paradigm, when coupled with a very quick edit/compile/test cycle, seems to significantly increase productivity.

Immediate Words (The Macros of Forth)

In comparison with more mainstream languages, Forth's compiler is completely backwards. Most traditional compilers are huge programs designed to translate any foreseeable, legal combination of available operators into machine language. In Forth, however, most of the work of compilation is done by a single definition, only a few lines long. As I have said before, special structures like conditionals and loops are compiled not by the compiler but by the words being compiled (IF, DO, etc.). You may recognize this sort of thing from Lisp and its macro system. Defining new, specialized compilers is as easy as defining any other word, as you will soon see. When you've got an extensible compiler, you've got a very powerful language!

Forths I Have Known

There are a couple of new Forths out there that are breaking with Forth tradition somewhat and innovating in this area. One that I especially like is Factor. It's a very minimal but consistent Forth with some modern ideas like garbage collection, a CLOS-like object system, straightforward higher order functions, etc. This Forth kind of bridges the gap between Lisp and Forth. A good traditional Forth is Gforth. It's reasonably fast, reasonably stable and implements the full ANSI spec. You really can't go wrong with Gforth if you want a good Forth that will stand the test of time. If you just want an easy, interesting Forth that will run just about anywhere, I suggest you take a look at Ficl. It's an ANSI compliant Forth written in C as a subroutine threaded interpreter. It is stupid simple and about as robust as you can get. This is a really nice Forth to start your tinkering with. Now, if you want a Forth that is really stretching the bounds of what is possible and desirable, take a look at Chuck Moore's latest Forth, ColorForth. It's the only language I know of where color (yes, color) actually has semantic meaning. The custom 25 core chip that ColorForth is designed to run on shows that Mr. Moore is really pushing the boundaries. It's worthwhile taking a bit of time to explore ColorForth just because of its divergence from the mainstream.

So Why the Comparisons to Lisp?

I think that many more people have had some exposure to Lisp than to Forth. Because Lisp and Forth share so much 'meta' philosophy, I thought it would be beneficial to draw comparisons between the two. You will have to be the judge of whether this was a successful approach or not.

Thursday, June 5, 2008

An Introduction to Erlang

As some of you may have guessed, I am a fan of Erlang. I think that it's a very interesting language with a tremendous amount of promise for the type of server side applications that I usually end up working on. I have talked a lot about various things here on Erlangish, so I thought it would finally be appropriate to spend a bit of time talking about the topic of the blog. For the most part I will be delving into the, somewhat obscure, history of Erlang. I will also spend a bit of time providing some instructions on how to get started with the language.

So What Is Erlang and OTP?

Erlang is a distributed, concurrent, soft real time functional programming language and runtime environment developed by Ericsson, the Telecoms Infrastructure supplier. It has built-in support for concurrency, distribution and fault tolerance. Since its open source release in 1999, Erlang has been adopted by many leading telecom and IT related companies. It is now successfully being used in other sectors including banking, finance, ecommerce and computer telephony.

OTP is a large collection of libraries for Erlang that do everything from compiling ASN.1 to providing an embeddable HTTP server. It also consists of implementations of several patterns that have proven useful in massively concurrent development over the years. Most production Erlang applications are actually Erlang/OTP applications. OTP is also open source and distributed with Erlang.

Although Erlang is a general purpose language, it has tried to fill a certain niche, consisting mostly of distributed, reliable, soft real-time concurrent systems: telecommunication systems, servers and database applications which require soft real-time behavior. Erlang excels at these systems because they are the types of systems it was originally designed around. It contains a number of features that make it particularly useful in this arena. For example:

  • It provides a simple and powerful model for error containment and fault tolerance.

  • Concurrency and message passing are fundamental to the language; applications written in Erlang are often composed of hundreds or thousands of lightweight processes.

  • Context switching between Erlang processes is typically several orders of magnitude cheaper than switching between threads in other languages.

  • Its distribution mechanisms are transparent; programs need not be aware that they are distributed.

  • The runtime system allows code in a running system to be updated without interrupting the program.
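As a rough analogy for the process-and-message model (sketched here in Python, not Erlang, and without Erlang's cheap process switching or fault tolerance), each "process" owns a mailbox and communicates only by sending messages:

```python
import queue
import threading

# A rough analogy for Erlang-style message passing: each "process" owns
# a mailbox and communicates only through messages, never through shared
# mutable state. Python threads are far heavier than Erlang processes;
# this shows the shape of the model only.

def echo_process(mailbox):
    while True:
        reply_to, msg = mailbox.get()   # block until a message arrives
        if msg == "stop":
            break
        reply_to.put(("echo", msg))     # reply with another message

mailbox = queue.Queue()
reply_box = queue.Queue()
threading.Thread(target=echo_process, args=(mailbox,)).start()

mailbox.put((reply_box, "hello"))       # send a message to the process
reply = reply_box.get()                 # wait for its reply
mailbox.put((None, "stop"))             # ask the process to exit
print(reply)
```

Because all communication flows through mailboxes, nothing here needs a lock, which is a small taste of why the model scales to thousands of processes in Erlang.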

Given that there are things that Erlang is good at there are bound to be a few things that it's not so good at. The most common class of 'less suitable' problems is characterized by iterative performance being a prime requirement with constant-factors having a large effect on performance. Typical examples are image processing, signal processing and sorting large volumes of data.

Origins of Erlang

I am firmly convinced that Erlang's history is a key ingredient to its success. I am not aware of any other language whose early development was so straightforward and pragmatic. For most of its life Erlang was developed inside Ericsson, originally for internal use only. Later on it was made available to the external world and eventually open sourced. The timeline of its development goes something like this.

1982–1985 Ericsson identified some issues that existed with telecom languages at the time. To address these difficulties they started experiments with the programming of telecom applications using more than twenty different languages. These early experiments yielded a few features that a useful system needed to supply. They realized that the target language must be a very high level symbolic language in order to achieve productivity gains. This requirement drastically narrowed the language space and resulted in a very short list of languages, which included Lisp, Prolog, Parlog, etc.

1985–1986 Further experiments with this narrowed list were performed, and new results were generated from this round as well. They found that the theoretically ideal language also needed to contain primitives for concurrency and error recovery, and that the execution model needed to exclude back-tracking. This ruled out two more of the contending languages, Lisp and Prolog. The ideal language also needed a granularity of concurrency such that there would be a one to one relationship between one asynchronous telephony process and one process in the language. This ruled out Parlog. At the end of this round of experiments they were left without any languages from the list they had started with. Being the pragmatic folks that they were, they decided to develop their own language with the desirable features of Lisp, Prolog and Parlog, but with superior concurrency and error recovery built into the language.

  • 1987 The first experiments began with the nascent language that became Erlang.
  • 1988 The ACS/Dunder project was started at Ericsson. This was a prototype implementation of a PABX by developers external to the core Erlang developers.
  • 1989 The ACS/Dunder project became a full fledged project with the reconstruction of about a tenth of the complete, production PABX called the MD-110 system. The preliminary results were very promising. In this early phase developer productivity was already ten times greater than during the development of the original system in the PLEX language. This reimplementation also pushed forward experiments directed at increasing the speed of Erlang.
  • 1991 At this point the experiments directed at speeding up Erlang bore fruit. A fast implementation of Erlang was released to a growing user community. Erlang was also presented at an international telecom conference that year.
  • 1992 During this year the user base started growing significantly. Erlang was ported to VxWorks, PC, Macintosh and other platforms. Three new, complete applications written in Erlang were presented at another conference. The first two production projects using Erlang were started.
  • 1993 Distribution primitives were added to the language, which made it possible to run a homogeneous Erlang system on heterogeneous hardware. A corporate decision was made within Ericsson to begin selling and supporting Erlang externally. A new organizational structure was built up to maintain and support Erlang implementations and Erlang tools. This resulted in the creation of Erlang Systems AB.
  • 1996 OTP was formed into a separate product group within Erlang Systems AB. This represented the maturing of the OTP platform within Erlang. After nearly ten years of development, the (non-Erlang) AXE/N project was closed down and pronounced a failure. This left a large hole in Ericsson's product line, and development of a new replacement product was started to fill it. It was decided that this replacement product would be written in Erlang.
  • 1998 After two years the AXE/N replacement project, now called the AXD301, was delivered. Around this same time the CEO of Ericsson Radio became the CEO of Ericsson as a whole. This person had banned Erlang at Ericsson Radio, and though he never banned Erlang at Ericsson proper, it became career suicide to propose new Erlang projects. This effectively killed opportunities for Erlang Systems AB to sell the language and support. The primary question potential customers asked was 'Who wants to use a language developed by Ericsson when Ericsson won't use the language itself?'. This just goes to show that corporate bureaucracy will be corporate bureaucracy no matter where you are. In any case, it turned into a bit of a blessing in disguise for the rest of us.
  • 1998–1999 In order to drive further development, a decision was made within Erlang Systems AB to release Erlang as an open source project. This didn't imply an abandonment of Erlang by Ericsson or Erlang Systems AB; Erlang continued, and continues, to be supported by these two organizations. The decision was made wholly with the idea of spreading Erlang and removing the somewhat negative perception that it was a proprietary language. It has remained open source and supported to this day.

As you can see, Erlang didn't start out as Erlang at all. It started out as just a series of requirements backed up by experiments. A large number of experiments were done to find a language that matched those requirements, and when no existing language was found, Ericsson decided to create their own. Considering Ericsson's resources and the costs associated with the development of their products, I think this was a very pragmatic decision. However, that conclusion is open to interpretation. In any case, after the initial development there was a constant back and forth dialog between the users and developers of the language as it moved through its formative process. I think this fact alone is one of the reasons that Erlang is as good as it is today. Later on in its development, as Ericsson grew less resourceful, Erlang started to have political problems within the company. Even though Ericsson had several successful and profitable products in Erlang and other languages, the de-facto ban occurred. Fortunately, by this time Erlang could and did stand on its own. The ban actually turned out to be fortunate for the rest of us because it led, pretty directly, to Erlang's eventual open sourcing.

Joe Armstrong, one of the original Erlang developers and a productive member of the community, put together a number of tutorials that are very useful. It's worth going through these and playing with the code.


There are a couple of good editors to use with Erlang. The gold standard is the Erlang Emacs mode distributed as part of the Erlang distro. A very updated version is now available from the Erlware folks. You can get it here. If you go this route I suggest you also get Distel, an add-on written by Luke Gorrie. It's available from the good folks at Google Code. There are instructions included with both of these to get you up and going. For those of you more inclined to the IDE world, you may want to take a look at Erlide. This is a set of Eclipse plugins that add support for Erlang to Eclipse. It's still pretty beta, but it's very usable.

So How Do I Get Started?

Learning Erlang is a fairly quick process. For an experienced developer it shouldn't take more than a few days before they can write nontrivial programs, about a week or two to feel really comfortable, and a month or so before feeling ready to take on something big by themselves. It helps a lot to have someone who knows how to use Erlang around for some hand-holding.

Start off by going through the quick start part of the FAQ. Then go through the Erlang Course. You can skip the history part if you would like; I have gone over it in more detail here. Once you have done the course, play around with some of the examples. Then go read the long version of the getting started docs. This should put you on the road to being able to write some Erlang code. If you are one to worry about coding conventions then you may want to take a look at http://www.erlang.se/doc/programming_rules.shtml. This has quite a number of useful and well thought out programming rules. One of the things that makes Erlang really interesting is the OTP system. If you really want to get to know Erlang then it makes sense to spend a bit of time learning OTP, and http://www.erlang.org/doc/design_principles/part_frame.html is a very good place to start.

Wednesday, May 21, 2008

An Overview of Concurrency

Introduction

Over the last few weeks I have had several conversations with people about concurrency, more specifically about the ways in which shared information is handled in concurrent languages. I have gotten the impression that there isn't really a good understanding of what's out there in the world of concurrency. That being the case, it seemed like a good idea to give a quick overview of some of the mechanisms that are gaining mind share.

Aside from some engineers so deep in their rut that they can't see sunlight, it's now generally accepted that the current mainstream approach to concurrency just won't work. The idea of using mutexes and locks to take advantage of the up-and-coming massively multi-core chips is really just laughable. We can't ignore this topic. As software engineers we don't really have a choice about supporting large numbers of CPUs; that's the direction hardware is going, and it's up to us to figure out how to make it work in software. Fortunately, a bunch of really smart folks have been thinking about this problem for a really long time, and a few of the things they have been working on are slowly making their way into the gestalt of the software world.

We are going to talk about three things: Software Transactional Memory (STM), Dataflow Programming (specifically Futures and Promises), and Message Passing Concurrency. There are others, but these currently have the most mind share and the best implementations, and I have limited space and limited time, so I am going to stick to these three topics. You may have noticed that up to this point I have only talked about concurrency, more specifically the communication between processes in concurrent languages. That's intentional: I am not going to talk about parallelism at all. That's a different subject, only faintly related to concurrency but often conflated with it, so if you're looking for that you are looking in the wrong place. On another note, I am going to use "processes" and "threads" pretty much interchangeably. I know that in certain languages the two terms have very different meanings; however, in many other languages they mean the same thing, or one of the terms doesn't apply. What I am getting at is that if you open the scope of the discussion to languages at large, the meanings become ambiguous. When I use the term process or thread I am talking about a single concurrent activity that may or may not communicate with other concurrent activities in some way. That's about the best I can do for you.

Traditional Shared Memory Communication


Shared Memory Communication is the GOTO of our time. Like the GOTO of years past, it's the current mainstream concurrent communication technique and has been for a long, long time. Just like GOTO, there are so many gotchas and ways to shoot yourself in the head that it's scary. It's so bad that this approach to concurrency has tainted an entire generation of engineers with a deeply abiding fear of concurrency in general. This great fear has crippled the ability of our industry to adapt to the new reality of multi-core systems. The sooner shared memory dies the horrible death it deserves, the better for us all. Having said that, I must now admit that, just like GOTOs, shared memory has a small niche where it probably can't be replaced. If you work in that niche then you already know you need shared memory, and if you don't, you don't. Just a hint: implementing business logic in services is not that niche. OK, I am all done with my rant and I feel much better; now on to the show.

Shared memory typically refers to a block of random access memory that can be accessed by several different concurrently running processes. This block of memory is protected by some type of guard that makes sure it isn't being accessed by more than one process at any particular time. These guards usually take the form of locks, mutexes, semaphores, etc. There are a bunch of problems with this approach. There is complexity in managing serial access to the block of memory. There is complexity in managing lock contention for heavily used resources. There is a very real possibility of creating deadlocks in your code in a way that isn't easily visible to you as a developer. There is just all kinds of nastiness here. This type of concurrency is found in all the current mainstream programming and scripting languages: C, C++, Java, Perl, Python, etc. For whatever reason it's ubiquitous and we have all been exposed to it, but that doesn't mean we have to accept it as the status quo.
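To make the gotchas concrete, here is a minimal sketch in Python of the discipline this style demands (the names are mine, invented for illustration): every access to the shared value must go through the guard, and nothing but programmer vigilance enforces that.

```python
import threading

# A shared block of memory (here just an integer) and its guard.
counter = 0
counter_lock = threading.Lock()

def worker(iterations):
    global counter
    for _ in range(iterations):
        # Every single access must take the lock; forget it once,
        # anywhere in the program, and you have a silent data race.
        with counter_lock:
            counter += 1

threads = [threading.Thread(target=worker, args=(10_000,)) for _ in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()

print(counter)  # 40000, but only because every access honored the guard
```

Deadlock enters the picture as soon as there are two locks: if one thread takes lock A then B while another takes B then A, both can block forever, and nothing in the source code makes that ordering mistake visible.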

Software Transactional Memory (STM)

The first non-traditional concurrency mechanism we are going to look at is Software Transactional Memory, or STM for short. The most popular embodiment of STM is currently available in the GHC implementation of Haskell. As for the description, I will let Wikipedia handle it.

Software transactional memory (STM) is a concurrency control mechanism analogous to database transactions for controlling access to shared memory in concurrent computing. It functions as an alternative to lock-based synchronization, and is typically implemented in a lock-free way. A transaction in this context is a piece of code that executes a series of reads and writes to shared memory. These reads and writes logically occur at a single instant in time; intermediate states are not visible to other (successful) transactions.


STM has a few benefits that aren't immediately obvious. First and foremost, STM is optimistic. Every thread does what it needs to do without knowing or caring if another thread is working with the same data. At the end of a manipulation, if everything is good and nothing has changed, the changes are committed. If problems or conflicts occur, the change can be rolled back and retried. The cool part of this is that there isn't any waiting for resources: threads can write to different parts of a structure without sharing any lock. The bad part is that you have to retry failed commits. There is also some not insignificant overhead in the transaction subsystem itself that causes a performance hit. Additionally, in certain situations there may be a memory hit: if n processes are modifying the same amount of memory, you may need O(n) additional memory to support the transactions. Still, this is a million times better than the mainstream shared memory approach, and if it's the only alternative available to you, you should definitely use it. I still consider it shared memory at its core, though; that's an argument I have had many, many times.
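To see what "optimistic" means mechanically, here is a toy STM sketched in Python: transactions buffer their writes, record the versions of everything they read, and commit only if nothing they read has changed underneath them. This is an illustration of the idea only, not how GHC implements it; every name below is invented.

```python
import threading

class TVar:
    """A transactional variable: a value plus a version counter."""
    def __init__(self, value):
        self.value = value
        self.version = 0

_commit_lock = threading.Lock()  # serializes commits only, not the work

def atomically(transaction):
    """Run transaction(read, write) optimistically, retrying on conflict."""
    while True:
        read_log = {}    # TVar -> version seen when first read
        write_log = {}   # TVar -> buffered new value

        def read(tvar):
            if tvar in write_log:            # read our own pending write
                return write_log[tvar]
            read_log.setdefault(tvar, tvar.version)
            return tvar.value

        def write(tvar, value):
            read_log.setdefault(tvar, tvar.version)
            write_log[tvar] = value

        result = transaction(read, write)

        with _commit_lock:
            # Validate: did anyone commit into our read set meanwhile?
            if all(tvar.version == seen for tvar, seen in read_log.items()):
                for tvar, value in write_log.items():
                    tvar.value = value
                    tvar.version += 1
                return result
        # Conflict: roll back (just drop the logs) and retry.

# Usage: an atomic transfer between two accounts.
a, b = TVar(100), TVar(0)

def transfer(read, write):
    write(a, read(a) - 30)
    write(b, read(b) + 30)

atomically(transfer)
print(a.value, b.value)  # 70 30
```

Note that the transfer either commits as a whole or retries as a whole; no other transaction can ever observe the 30 missing from both accounts.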

Dataflow - Futures and Promises

Another approach to concurrency is the use of Futures and Promises. Its most visible implementation is in Mozart-Oz. Once again, I will let Wikipedia handle the description for me.

In computer science, futures and promises are closely related constructs used for synchronization in some concurrent programming languages. They both refer to an object that acts as a proxy for a result that is initially not known, usually because the computation of its value has not yet completed.
Let's lay down the difference between Futures and Promises before we get started. A future is a contract that a specific thread will, at some point in the future, provide a value to fill that contract. A promise is, well, a promise that at some point some thread will provide the promised value. This is very much a dataflow style of programming and is mostly found in languages that support that style, like Mozart-Oz and Alice ML.

Futures and Promises are conceptually pretty simple. They make passing around data in concurrent systems pretty intuitive, and they serve as a good foundation on which to build more complex structures like channels. The languages that support Futures and Promises usually support advanced ideas like unification, and in that context Futures and Promises work really well. However, although Futures and Promises remove the most egregious possibilities for deadlocking, it is still possible in some cases.
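The two ends are easy to see in a hand-rolled sketch (Python, invented names, far simpler than what Mozart-Oz or Alice ML provide): the promise is the writable end, the future is the readable end, and a reader that arrives early simply suspends until the value shows up, dataflow style.

```python
import threading

class Promise:
    """The writable end: some thread promises to supply a value once."""
    def __init__(self):
        self._event = threading.Event()
        self._value = None

    def fulfill(self, value):
        self._value = value
        self._event.set()        # wake up anyone blocked on the future

    def future(self):
        return Future(self)

class Future:
    """The readable end: a placeholder for a value not yet computed."""
    def __init__(self, promise):
        self._promise = promise

    def get(self):
        self._promise._event.wait()  # dataflow-style: reader suspends
        return self._promise._value

p = Promise()
f = p.future()

# Some other thread eventually fulfills the promise...
producer = threading.Thread(target=lambda: p.fulfill(42))
producer.start()

# ...while the consumer just blocks until the value arrives.
print(f.get())  # 42
producer.join()
```

The languages built around this idea go further: a future can be passed around, stored in data structures, and read by many consumers, all of whom synchronize implicitly on its availability.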

In the end, both of these approaches involve shared memory. They both do a reasonably good job of mitigating the insidious problems of using shared memory, but they just mitigate those problems; they don't eliminate them. The next mechanism takes a completely different approach, and for that reason it does manage to eliminate most of the problems involved with shared memory concurrency. Of course, there is always a trade-off, and in this case the trade-off is additional memory usage and copying costs. But I am getting ahead of myself; let me begin at the beginning and then proceed to the end in a linear fashion.

Message Passing Concurrency


The third form of concurrency is built around message passing. Once again, I will let Wikipedia describe the system, as it tends to be better at it than I am.
Concurrent components communicate by exchanging messages (exemplified by Erlang and Occam). The exchange of messages may be carried out asynchronously (sometimes referred to as "send and pray", although it is standard practice to resend messages that are not acknowledged as received), or may use a rendezvous style in which the sender blocks until the message is received. Message-passing concurrency tends to be far easier to reason about than shared-memory concurrency, and is typically considered a more robust, although slower, form of concurrent programming. A wide variety of mathematical theories for understanding and analyzing message-passing systems are available, including the Actor model, and various process calculi.
Message passing concurrency is about processes communicating by sending messages to one another. Semantically these messages are completely separate entities, unrelated to whatever data they were built from. This means that when you are writing code that uses this form of concurrency you don't need to worry about shared state at all; you just need to worry about how the messages will flow through your system. Of course, you don't get this for free. In many cases, message passing concurrency is built by doing a deep copy of the message before sending and then sending the copy instead of the original. The problem is that that copy can be quite expensive for sufficiently large structures. This additional memory usage may have negative implications for your system if you are in any way memory constrained or are sending a lot of large messages. In practice, this means you must be aware of and manage the size and complexity of the messages you send and receive. Much like with Futures and Promises, although the most egregious 'shoot yourself in the head' possibilities of deadlocking are removed, deadlock is still possible, and you must be aware of that in your design and implementation.
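Here is the shape of it sketched in Python, using a queue per process as a stand-in for an Erlang-style mailbox. Note the explicit deep copy: Erlang's runtime copies messages for you, but in a shared-memory language you must do it yourself or the "message" is still shared state.

```python
import copy
import queue
import threading

def worker(inbox, outbox):
    # The worker owns its state outright; its only contact with the
    # outside world is the messages in its mailbox.
    while True:
        msg = inbox.get()
        if msg == "stop":
            break
        outbox.put(msg * 2)  # reply with the list repeated twice

inbox, outbox = queue.Queue(), queue.Queue()
threading.Thread(target=worker, args=(inbox, outbox), daemon=True).start()

payload = [1, 2, 3]
# Copy on send, so mutating `payload` later cannot race the worker.
inbox.put(copy.deepcopy(payload))

reply = outbox.get()
print(reply)  # [1, 2, 3, 1, 2, 3]
inbox.put("stop")
```

The copying is exactly the memory and CPU cost mentioned above, and it is the price of never having two activities touch the same mutable data.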

Conclusion

In the end, any one of these approaches is so much better than the shared memory approach that it almost doesn't matter which one you choose for your next project. However, they each take very different philosophical approaches to concurrency that greatly affect how you go about designing systems. You should explore each one so that you are able to make a reasoned decision about which to use. That said, opinions are like, umm, I can't really complete that, but you get my drift. My opinion on the subject is that message passing concurrency is by far the best of the three, where best is defined as most conceptually simple and scalable. In the end the industry will decide which direction is right for it and head in that direction. We are still too early in the multi-core age to get any good impression of which will win out.

Friday, March 21, 2008

Event Based Stream Parsing for JavaScript Object Notation (JSON)

Overview

I have been playing around with the idea of a stream-based, SAX-like parsing API for JSON. In my mind it has a few very direct benefits. I suspect that it would simplify implementing a parser for an already simple syntax. It would also allow for parsing arbitrarily large documents. In my case I need to return information to the user as quickly as possible; I absolutely cannot wait for the entire document to be parsed. This would solve that problem. All that said, I seriously doubt that I am the first one to think of this, though I can't find any references to anything similar out there. If any of you have any pointers, let me know in the comments and I will reference them. I plan to implement this in Erlang and at least one other language. I will post links as soon as that is done. Without further ado.

Abstract

The JavaScript Object Notation (JSON) is defined in RFC 4627. This document describes a method of parsing JSON using an event driven API. It is expected that implementation details will change from language to language. However, this document should present a uniform approach to stream parsing of JSON.


Introduction

JavaScript Object Notation (JSON) is described as a text format for the serialization of structured data. More commonly it is used as a transport protocol for various network based applications. This document describes an event based parsing mechanism for JSON data that can be implemented in a simple and straightforward manner. Reading and understanding RFC 4627 is a prerequisite to reading and understanding this document.


Description

The stream oriented parser produces a series of events that describe the structure currently being parsed. These events are then consumed by the calling application in some manner. The actual mechanism of consumption will vary from language to language, though a few different styles of API are discussed in the appendix of this document.

JSON is composed of two kinds of data structures: primitive and complex types. The events for the two kinds remain the same, changing only in the description of the structure being described.

The events around primitive types are relatively simple and should be implemented as simply as possible. Individual API designers will choose how they want to implement them, using the examples in the appendix as a guide.


Primitive types in JSON are strings, numbers, booleans and null. These are described within the event itself:


string = STRING_DATA(value)
number = NUMBER_DATA(value)
boolean = BOOLEAN_DATA(value)
null = NULL_DATA(value)


Complex types in JSON are objects and arrays. These are represented by a series of events that describe the object. Because they are complex types, their representation is more involved than that of primitive types. However, it allows the object to be consumed as it occurs in the stream.


object = OBJECT_BEGIN
KEY(string_value)
VALUE_BEGIN
... # recursive type description
VALUE_END
... # arbitrary number of additional key/value pairs
OBJECT_END

array = ARRAY_BEGIN
VALUE_BEGIN
... # recursive type description
VALUE_END
ARRAY_END
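As a sanity check on the grammar above, here is a small sketch in Python that emits exactly that event stream. For brevity it walks an already-parsed value; a real stream parser would emit the same events incrementally while reading raw text, which is the whole point. The event names follow the grammar above; the function name is mine.

```python
def events(value):
    """Yield the event stream described above for a JSON-like value."""
    if isinstance(value, dict):                 # object
        yield ("OBJECT_BEGIN",)
        for key, val in value.items():
            yield ("KEY", key)
            yield ("VALUE_BEGIN",)
            yield from events(val)              # recursive type description
            yield ("VALUE_END",)
        yield ("OBJECT_END",)
    elif isinstance(value, list):               # array
        yield ("ARRAY_BEGIN",)
        for item in value:
            yield ("VALUE_BEGIN",)
            yield from events(item)
            yield ("VALUE_END",)
        yield ("ARRAY_END",)
    elif isinstance(value, str):
        yield ("STRING_DATA", value)
    elif isinstance(value, bool):               # before the number case
        yield ("BOOLEAN_DATA", value)
    elif value is None:
        yield ("NULL_DATA", None)
    else:
        yield ("NUMBER_DATA", value)

stream = list(events({"id": 1, "tags": ["a"]}))
print(stream)
# OBJECT_BEGIN, KEY id, VALUE_BEGIN, NUMBER_DATA 1, VALUE_END,
# KEY tags, VALUE_BEGIN, ARRAY_BEGIN, ... , ARRAY_END, VALUE_END, OBJECT_END
```

Because the events are produced lazily, a consumer can start acting on the first key of an arbitrarily large document before the rest has even arrived.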



Callback API Description for Erlang

This would be implemented as a behavior that defines these callbacks. Client code that wishes to receive these callbacks would implement these functions. This should allow a high degree of flexibility for the client.

%% Value will be one of the primitive types (implementations should guard on it);
%% null will be represented by the atom null
data(Value, State) -> State2

object_begin(State) -> State2

object_end(State) -> State2

key(Value, State) -> State2

value_begin(State) -> State2

value_end(State) -> State2

array_begin(State) -> State2

array_end(State) -> State2
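To show how little a client needs to do, here is a sketch of one such consumer, written in Python rather than Erlang for brevity (it translates directly): it threads a stack through the events, the way the callbacks above thread State, and folds the stream back into a native value. The names and structure are mine, purely illustrative.

```python
def build(event_stream):
    """Fold an event stream back into a native value by threading
    state (a stack of containers) through each event."""
    stack = [[]]   # containers under construction; the root collects the result
    keys = []      # pending object keys awaiting their values
    for event in event_stream:
        tag, *args = event
        if tag in ("OBJECT_BEGIN", "ARRAY_BEGIN"):
            stack.append({} if tag == "OBJECT_BEGIN" else [])
        elif tag == "KEY":
            keys.append(args[0])
        elif tag in ("STRING_DATA", "NUMBER_DATA", "BOOLEAN_DATA", "NULL_DATA"):
            _insert(stack, keys, args[0])
        elif tag in ("OBJECT_END", "ARRAY_END"):
            done = stack.pop()
            _insert(stack, keys, done)
        # VALUE_BEGIN / VALUE_END carry no data; a validating consumer
        # could use them to check nesting instead of ignoring them.
    return stack[0][0]

def _insert(stack, keys, value):
    top = stack[-1]
    if isinstance(top, dict):
        top[keys.pop()] = value   # value completes the pending key
    else:
        top.append(value)

stream = [("OBJECT_BEGIN",), ("KEY", "ok"), ("VALUE_BEGIN",),
          ("BOOLEAN_DATA", True), ("VALUE_END",), ("OBJECT_END",)]
print(build(stream))  # {'ok': True}
```

A real client would of course usually not rebuild the whole document; the point of the API is that it can stop, answer, or discard as soon as it has seen the events it cares about.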

Wednesday, December 19, 2007

Erlware Progress Update

I have had very little time to post of late. The
other maintainers and I have been coding feverishly to get
the Erlware system to version 1.0 by the end of January. This has
taken up a huge amount of my time, leaving me very little
time to blog. It's been so long now that I feel bad. So
I decided that I needed to spend a bit of time talking about how
development in our new model is going.

All I can say is wow! This Open Development Model has
proven to be a huge boon to the project. Not only has it
encouraged user participation by a large margin, it has also
encouraged quite a bit more design process. Rather
spontaneously, the maintainers have started to submit design
specs when they are preparing to make major changes. These design
specs are usually one-pagers that give a high level overview of
the problem and describe the nature of the fix the developer is
implementing. They let the community and the other maintainers
know about, and comment on, what's upcoming. They also make the
developer think more carefully about the changes he is planning
to make to the system. All in all it doesn't take a large amount
of time and it helps increase the quality of our offering. Since
each design spec goes to the mailing list, it fits right in with
the low level patch model. I like it.

Another major development is that we have moved our website to
the Open Development Model. Originally, we used dokuwiki for
our website. Dokuwiki is a great product, but it's a wiki, with
the attendant problems that a wiki has. In our case, we had kept
wiki access pretty limited; only the core maintainers actually
had access to it. This made it hard for the community to fix
issues and expand docs. What we really wanted was to combine the
easy editing of a wiki with our somewhat more careful patch
based approach to changes. We spent some time thinking about the
problem and decided that it would be best if our site was
updatable, via git, just like the rest of our projects. However,
we really didn't want to hand edit HTML and do all the manual
work involved. We liked the readable, usable wiki syntax. We
needed some mix of a wiki with static source files. We went to
Google with low expectations and were pleasantly surprised by
the number of good offerings in various languages. Eventually we
settled on webgen. This is a nice
little Ruby framework for static site generation. It supports
various markups, including our choice, Markdown. What
we ended up with is an open site with very nice wiki-ish syntax
that is easy to extend and change via a model that our
contributors are familiar with.

Once a change is in our canonical git repo, getting the changes
onto the site is completely automated. A cron job on our server
pulls the site from the repository, runs the generator, and then
runs rsync to push the changes into the area they are
served from. This is a very recent change, so we don't actually
know yet whether it will accomplish our purpose of more
contributions to the site. One unexpected side benefit is that
our site is now really fast: all the work happens at generation
time, so the server just needs to serve static HTML files.