Before going in the technical discussion, it may be a good idea to precise what do we means by multiple core, essentially on which kind of machines we expect jMax to be run, and which ones are the “reference plateform” for our development goals.
I would consider three class of machines:
- low end: single socket PC, with Quad core processor (Core i7), 8 hardware threads.
- tpical: double socket PC workstation, with two quad processors (Xen i7), 8 cores, 16 hardware threads.
- high end: a four socket server, with 4 quad or hex or octo core Xen processors, with between 16 and 32 cores, and between 32 and 6 hardware threads.
Just to put things in perspective, today what i defined as low end machine costs less than 1KEuro, the typical machine costs around 2.5KEuro, and a high end machine would cost between 5 and 10 KEuro.
Machines of the “typical” class are widely used in musical studios today. High end machines are reserved to research center and the like.
I think that the new fts architecture should support the low end and the typical machine classes, and possibly being able to target high end machines later. By support, i means that FTS should be able to run big DSP patches by loading all the available processors close to 100%.
The thing that should be noted is that even a relatively low end machine can easily have 8 core and 16 hardware threads; so, the degree of parallism needed in fts to actually exploit all this resources is huge, and require imho some kind of creative and radical solution.
My initial approach to the problem was in the line of the ISPW heritage: in the ISPW, the patch was manually partitioned between up to six instances of FTS connect thru a kind of real time in-memory pipes; the load balancing and global latency is completely handled by hand, as the effect on the patch semantic (no latency compensation, for example).
So, a modern translation would be to implements the same run time structure, at least for the dsp but to automatize the graph partitioning at the dsp compiler level.
There are a number of problems in this solution: the first is the raw complexity: the graph partitioning is a complex optimisation problem interacting with multiple cost functions like latency and CPU load and memory/cache locality; this kind of algorithm exists, so it can be done but it is not easy and not very fast to write.
The second problem is that we do not necessarily have actual data on dsp object performance to use in the optimisation; for example, what is the relative performance of an osc~ and of an fft~ object ? This data can be obtained, but it is very difficult to keep it up to date, with the evolution of the objects and core code, and even of the evolution of the compiler and processors. So, the partitioning should be somehow adaptive and dynamic.
A final point is that the ISPW fts was based on the basis of a non SMP architecture where each processor have its own memory. This assumption was actually already wrong for the ISPW, but in modern machines is radically wrong.
I started to think about a different approach, where the different threads have a stronger coupling, and where the implementation can be done in a reasonable time; of course this approach, as you will see, have its own problems.
I will explain this solution starting with a simplified view, and then i'll add the complexity needed to cope with the various problems, little by little.
The FTS dsp computational model is essentially a pure data flow model. Conceptually, each object wait for all its inputs to be available and then is fired and produce its outputs. Input objects are fired by external events. The current dsp compiler essentially translate the data flow dsp graph to a sequential program, runned by a virtual DSP machine that we call FTL.
What if, instead of translating out the data flow structure, we use it to drive concurrency ? After all, the data flow computation engines are well known for their ability to handle concurrency.[note that data flows oriented engines are widely used at the micro-architecture level in modern processors].
Imagine a single thread data flow engine for fts: each dsp operation have a set of input and output buffers that can be empty or filled. A connection between two dsp objects is represented by a dedicated buffer (that is an output buffer for the first object and an input buffer for the second object).
A dsp operation is executable when all its input are filled and all outputs are empty.
When an operation is executed, it empties its input buffer, and it fills its output buffers. This change in the state of the buffer may make some operations executable. The executable status of input and output operation depends also on external events.
Operations that are ready to be executed are kept in a single scheduler queue, and a single thread execute the loop peek-from-the-ready-queue, execute, update-the-ready-queue.
This model is not multi thread, but it brings already a number of interesting solutions to functional problems: for example, since the computation cycle is not defined by the tick, it is very easy to have different part of the graph running with different ticks sizes and frequencies, or with different data types (for example 1024 points ffts on a part of the graph, and 64 samples per tick for sounds); this come essentially for free.
Another thing we can have is an elegant solution to conditional execution of patches (the horrible ISPW switch~ object): the operations i described have an AND semantic in input and output. Imagine an operation with an OR semantic in output: only one output buffer will be filled (depending on a control parameter), together with an operation that have an OR semantic in input (executed when one of the input is available). This two operations together allows the implementation of alternative parallel patches (for example) with conditional execution adding little to no complexity added to the dsp engine.
The solution can be improved a little, for example to share the output and input buffers between connections, using counters to know which buffer is empty or filled.
At this point, suppose that multiple threads looks for executable operations in the ready fifo. Since the synchronisation between operations is entirely done on the availability of data in the buffers, the threads do not need any other synchronisation, and the patch is executed in parallel.
Of course, the implementation of the counters tied to buffers must be atomic and highly efficient, as the fifo implementation.
This execution model have an interesting property: a data flow engine can attains the maximum parallelism possible for a given graph; but suppose that this parallelism is limited: imagine a butterfly pattern, like for example two columns of +~ that cross output at each level. Imagine to arrange many objects in this way, say thousands, with input objects at the top and output at the bottom. This patch, given its data dependency, allows for a parallelism of two. But the engine we described will automatically pipeline the patch: when the inputs are ready, it will automatically start to compute the next tick even if the computation of the current one is not finished. Essentially the engine automatically trade latency for performance, and keep all the needed threads busy.
The critical point of this model is the scheduler: a single fifo scheduler will simply not works: the synchronisation cost will be so high that it will not scale to the degree of parallelism we are looking for.
But there are a number of easy solutions: the easiest (short of implementing the equivalent of the Linux kernel scheduler) is to have a queue for thread, using some kind of heuristic to choose the right thread at the moment an operation become executable.
An other critical problems are the buffer: having a buffer for connection would probably means, at least for big patches, that the buffers cannot be in the cache hierarchy, and this would slow down the operations a lot. Buffers must be reused, but it cannot be done statically, because as we saw above different part of the patch can be executed concurrently in a un-predictable way. Buffers must be then handled dynamically. This would allows implementing a kind of thread affinity for buffers, to try to limit the “movement” of a buffer between different threads (i.e. processors and caches).
Finally, the remaining problem is that for most of the simple DSP objects, the overhead of the buffer allocation and synchronisation and of the rescheduling may be greater than the actually object payload.
I think we can afford a relatively large overhead, if we loose 30% of the available computational power in overhead, but we can actually use 8 cores, that would be fine.
But the risk is to get a larger overhead, because of the small granularity of DSP operations.
There is a solution, that bring to an elegant implementation strategy for the new engine: let's aggregate objects in small chunk of sequential execution, applying the data flow model not to the single objects, but to these chunks.
This would divide the overhead by the size of the chunks; partitioning the graph in these chunks is not very hard; the objects in the same chunk must share the scheduling requirement (same tick frenquency, essentially), and the chunks should be designed to limit the number of inputs/outputs.
These chunks should end up in a series of small FTL programs.
So, can we extends FTL so that the whole DSP code is still an FTL program, even if executed following a data flow model ? Yes: we 'just' need to add to FTL the dynamic handling of the buffers, and some new FTL primitives instruction to execute the data flow rescheduling.
These means that the current DSP objects require little or no modification, because the synchronisation is handled by FTL instructions generated by the chunk and not by the object itself.
The implementation of this solution can reuse the FTL engine by extending it, and adding the dynamic and multi-thread part, but it can be implemented step by step (for example, starting from a single thread implementation), without a big bang total restructuring.
Of course, the DSP compiler (translating from the graph to the FTL) will be substituted. I think of a multiple steps compiler:
1) FTS objects to an actual DSP graph (currently, an object may correspond to multiple DSP operations).
2) Graph optimizer: dead code elimination, constant propagation, operation aggregation at the graph level.
3) Chuck analysis and FTL code generation (probably in the same step).
4) FTL code assembly.
So, there are a few questions left, like: does all this actually works :→ ? There problems like memory locality and so on that may stop this solution from getting the degree of parallelism we are looking for, but i think it is worth trying.
Maurizio