Parallel MP3 Encoding
Gory Technical Details


    Be sure to see the main BladeEnc home page.

    This page gives the gory technical details about how I parallelized BladeEnc using MPI. It discusses many of the issues involved, and is intended to serve as a launching point for discussion among the BladeEnc developers about the Right Way to parallelize BladeEnc (see the "Open questions" section at the bottom of the page).

    All of this info is here on one page rather than broken up into cross-linked pages that might be easier to navigate and more eye-pleasing. Deal. I'm a programmer, not a web page designer.


    Rationale

    So what is parallelization?

    It's the idea of splitting one problem across lots of CPUs and having them all work simultaneously toward the solution. For example, consider some number-crunching code that takes 10 hours to run on a single workstation. Now say that you can easily split this problem to run on 10 workstations simultaneously. Assuming that you did everything right, the overall run time should be reduced to about an hour, right? That's parallelism.

    Why do we want to process MP3s in parallel?

    The short answer is: speed. Intuitively speaking, if you split the work of MP3 encoding across lots of CPUs (whether in one machine or in lots of machines), you can speed up the overall process [ideally] in an almost linear fashion.


    Overview

    So how does this work? Or, what actually happens when you run this parallel BladeEnc on (for example) 4 machines?

    Essentially, the parallelization was not too difficult. The MP3 encoding was already broken down into performing the mathematical operations over small portions of the file. It was essentially something like:

      while (!eof(input_file)) {
        read(next_portion_of_input_file);
        encode(that_portion);
        write(results_to_output_file);
      }
    

    The difficult part of parallelization is typically figuring out how to split the problem up into independent (or nearly independent) parts. The structure shown above - assuming that the encode() function is independent of previous calls to encode() - is trivial to parallelize.

    For example, if we want to run on four machines, we can split the input file into four parts, give 1/4 of the file to each machine, and let each machine loop over encode() for its portion of the file. Then take the output from each machine, put it in the right order, and write it out to a single output file. Done.

    With such a scheme, the more work that you throw at it, the more efficient it becomes. Hence, trying this scheme with small MP3 files will probably not result in any noticeable speedup (in fact, it may be slower than running in serial, because of the added overhead of working in parallel). It is necessary to give the parallel engine enough work to offset the overhead added by the parallel framework. Generally, this is not very much overhead (read on to find out why), but parallelism is not free.

    Of course, it's not quite that simple, but that's the general idea that I used.


    Parallel terminology

    At least one common term warrants definition for those who are not in the parallel computing community:

    • rank: This is an MPI-specific term, and refers to a single message passing entity (usually a POSIX process). A running MPI program has one or more ranks. Ranks may or may not reside on the same physical machine.
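
    For the curious, here is a minimal sketch (not taken from the BladeEnc code itself) of how an MPI program discovers its own rank and the total number of ranks:

      #include <stdio.h>
      #include <mpi.h>

      int main(int argc, char *argv[])
      {
        int rank, size;

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);    /* which rank am I?      */
        MPI_Comm_size(MPI_COMM_WORLD, &size);    /* how many ranks total? */
        printf("I am rank %d of %d\n", rank, size);
        MPI_Finalize();
        return 0;
      }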


    Overlap

    One of the problems that occurred is that MP3 is a differential encoding algorithm. For all of you non-math types out there, this means that calls to encode() do depend on the results of previous calls to encode(). Hence, splitting the file into four parts and giving one part to each rank is not quite sufficient. Grossly speaking, if nothing else, the first call to encode() on each rank (other than the rank that got the first 1/4 of the file) will generate incorrect results, since they don't have the results from the last section of the previous quarter of the file. That is, visualize the splitting of the file as such:

    SERIAL PROCESSING OF THE INPUT:
      |---------  ---------  ---------  ---------|
                         rank 0
    
    PARALLEL PROCESSING OF THE INPUT:
      |---------||---------||---------||---------|
        rank 0     rank 1     rank 2     rank 3
    

    (spaces were added in the serial diagram just to make it line up nicely with the parallel diagram; the only thing that matters is the number of dashes)

    Imagine that each "-" in the picture represents a call to encode(). If the file were processed in serial (i.e., on one rank), the first call to encode() in the section that is marked for rank 1 would come after the last call to encode() in the section that is marked for rank 0.

    However, when the different sections of the file are processed on different ranks, the first call to encode() in the section that is marked for rank 1 will not have any basis - there will be no differential (or, more specifically, the differential will be from zero) from the previous sample because from rank 1's point of view, there was no previous sample.

    Conclusion: simply splitting the file into multiple chunks and encoding them separately doesn't work. Indeed, doing that produces very obvious "clicks" in the output audio, since you're effectively resetting the differential at the boundary point for each rank.

    Solution: send some "overlap" to each rank (except to rank 0). That is, use a distribution something like:

                 rank 1             rank 3
              |- ---------|      |- --------|
      |---------|       |- --------|
        rank 0             rank 2
    

    So each rank gets the last chunk full of samples from the rank before it. Hence, it can build up a differential so that the first "real" frame that it does (i.e., the second frame) will be based upon the previous call to encode(), and we don't get an annoying "click" in the output because the differential between the different ranks (i.e., different parts of the input file) is actually correct.

    Sending such redundant chunks and the time required to process them does add a bit of overhead, but it is negligible when compared to the amount of real work that the rank computes.
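
    In loop-style pseudocode (CHUNK_SIZE and OVERLAP_CHUNKS are made-up names for illustration; the real code is organized differently), the overlapped distribution looks something like this:

      start = my_rank * chunks_per_rank * CHUNK_SIZE;
      length = chunks_per_rank * CHUNK_SIZE;
      if (my_rank > 0) {
        /* back up into the previous rank's portion */
        start -= OVERLAP_CHUNKS * CHUNK_SIZE;
        length += OVERLAP_CHUNKS * CHUNK_SIZE;
      }
      /* encode the overlap to rebuild the differential, but only keep
         the output generated from this rank's "real" portion */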


    Master / slave approach

    Using the "splitting" approach (with overlap) described above, we somehow have to ship the separate parts of the file out to all the ranks who will actually be doing the computation. There are several common ways to do this:

    1. Run N copies of BladeEnc (one on each rank), let each BladeEnc know who it is (i.e., its rank number). Each rank can then figure out which portion of the file it is supposed to process, do its work, and then quit. The following [very rough] pseudocode shows what I mean:
        size = size_of_file / N;
        fseek(my_rank_number * size);
        read_bytes = 0;
        while (read_bytes < size) {
          read_bytes += read(next_portion_of_input_file);
          encode(that_portion);
          write(results_to_temporary_output_file);
        }
      

      This method assumes that every rank has access to the input file - that every rank shares a common filesystem. This may not always be the case. Even when it is, having N machines read the same file generates a lot of redundant traffic: depending on the filesystem's block size, read-ahead, and caching behavior (e.g., with NFS), the total number of bytes sent across the network can approach (size of file) * (number of ranks), which is equivalent to sending the file across the network N times (and this doesn't even consider the filesystem protocol's own overhead). This is quite inefficient, and isn't friendly to environments that don't have common filesystems.

      Note: The writing of the result MP3 data to a temporary file is a cop-out specifically for purposes of this pseudocode (and is used throughout this section). Writing the output out to an .mp3 file in parallel is somewhat complicated, and is specifically addressed in the "Output File Ordering" section.

    2. Have rank 0 be the only rank that actually reads the input file (i.e., the file only has to be accessible to rank 0), split it into N parts, and send a part to each other rank. Once a rank receives a portion of the file, it starts computing. Rank 0 starts computing when it finishes sending to the other ranks.
        if (my_rank_number == 0)
          for (i = 1; i < N; i++)
            send_portion_of_file(i);
        else
          receive_portion_of_file();
      
        for (i = 0; i < size_of_my_portion; i++) {
          encode(my_sub_portion);
          write(results_to_temporary_output_file);
        }
      

      This is efficient in terms of the number of bytes sent across the network because it is equivalent to sending the input file across the network only once (vs. N times from the previous approach). However, rank 0 will be delayed in starting its own computations, which does not seem nice. Additionally, if the ranks are not homogeneous, the entire job will have to wait for the slowest rank.

    3. Use the same scheme as above, but add one more rank. Instead of N ranks, have (N + 1) ranks (even though there are only N CPUs). Rank 0 becomes a coordinator, or "master" - it performs no MP3 encoding itself, but serves as the distributor of work. So ranks 1 through N do the work, and are known as "slaves".
        if (my_rank_number == 0) {
          // Master
          for (i = 1; i <= N; i++)
            send_portion_of_file(i, i - 1);
        } else {
          // Slave
          receive_portion_of_file();
      
          for (i = 0; i < size_of_my_portion; i++) {
            encode(ith_sub_portion);
            write(results_to_temporary_output_file);
          }
        }
      

      This solves the "rank 0 is delayed" problem, but does introduce a little overhead in that the master will likely share a CPU with a slave and steal a few CPU cycles from it. However, the work differential between the master and the slave that it is sharing a CPU with is enormous, so the slave will still dominate the CPU cycles. But this does not solve the "slowest rank determines the overall speed" problem.

    4. Use the same "master / slave" scheme as above, but divide the input file into M parts, where M >> N. The master doles out portions of the input file to slaves on a first-come, first-serve basis. As slaves finish their section, they poll the master for another. This cycle repeats until the master runs out of input data to distribute.
        if (my_rank_number == 0) {
          // Master
          for (i = 1; i <= N; i++)
            send_portion_of_file(i, i - 1);
          i--;
          while (more_input_file_to_send) {
            who = receive_slave_done_message();
            send_portion_of_file(who, i++);
          }
          send_all_done_messages();
        }
        else {
          // Slave
          while (true) {
            receive_portion_of_file();
            if (master_is_done)
              break;
      
            for (i = 0; i < size_of_my_portion; i++) {
              encode(ith_sub_portion);
              write(results_to_temporary_output_file);
            }
          }
        }
      

      This approach allows naturally faster ranks to request more work than slower ranks. A disadvantage is that the master requires more CPU cycles, and therefore steals a few more from the slave that it coexists with. However, the master should still be doing orders of magnitude less computation than the slave, so we should still be ok. Of course, there is no requirement that the master coexist on the same CPU as a slave - the master could certainly have a CPU all to itself, but this would seem like a waste of a CPU since the master is not very compute-intensive.

    The last approach is the one that I took with this version of parallel BladeEnc.
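
    To make the first-come, first-served loop a bit more concrete, here is a rough sketch of the master side using real MPI calls (the tags, buffer names, and helper functions are invented for illustration; the actual code is organized differently):

      MPI_Status status;
      int who;

      while (more_input_file_to_send()) {
        /* wait for *any* slave to return its encoded MP3 data */
        MPI_Recv(result_buf, MAX_RESULT, MPI_BYTE, MPI_ANY_SOURCE,
                 TAG_RESULT, MPI_COMM_WORLD, &status);
        who = status.MPI_SOURCE;            /* which slave just finished */

        save_output(who, result_buf);       /* queue for in-order writing */
        MPI_Send(next_portion(), portion_size, MPI_BYTE, who,
                 TAG_WORK, MPI_COMM_WORLD);
      }
      /* ...then tell every slave that there is no more work */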


    Load balancing

    The master / slave approach is by no means perfect. But it does solve a lot of problems, and creates a convenient (and fairly simple) programming abstraction. But how do we choose M (the number of portions to divide the file into)? We must consider the overhead that it introduces. Let me digress to a short discussion about networking...


    Every time you send a message, there is some additional control information sent with it (e.g., the packet header), such as who the message is for, the length of the message, etc. Messages longer than a certain size will get broken up into a series of smaller messages. For example, on an Ethernet network, messages longer than about 1,500 bytes (the MTU) will get broken up into as many 1,500 byte sub-messages as are necessary to send the entire message. Each sub-message is called a packet. So when you send a message, it is broken down into one or more packets, sent to the destination, and then reassembled back into the original, single message.

    But the overhead for each packet is a constant number of bytes. So in the case of TCP, whether my message is one byte long or 1,500 bytes long, the overhead is the same - H bytes.

    So which is more efficient, sending 1,500 one-byte messages, or sending one 1,500 byte message? Consider: sending 1,500 one-byte messages actually sends 1,500 * (1 + H) bytes, whereas sending one 1,500 byte message sends 1 * (1,500 + H) bytes.
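
    To put rough numbers on it (assuming H is about 40 bytes, the size of a typical TCP/IP header): the 1,500 one-byte messages put roughly 1,500 * 41 = 61,500 bytes on the wire, while the single 1,500 byte message puts only about 1,540 bytes on the wire - a 40x difference for the same payload.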

    O = overhead byte
    D = data byte
    
    1,500 one byte messages: |OOOOOOOOD|
                             |OOOOOOOOD|
                             |OOOOOOOOD|
                                ...etc...
                             |OOOOOOOOD|
    
    One 1,500 byte message:  |OOOOOOOODDDDDDD...etc...DDDDDD|
    

    Clearly, sending one big message is more efficient (and we're not even talking about operating system delay or other sources of latency here), because the number of D's is always the same, but the number of O's is not. Generalize this a bit more: sending a small number of large messages is more efficient than sending a large number of small messages.

    Note that this is essentially the same for shared memory communication (or any message passing kinds of communication). Hence, it doesn't just apply to TCP networks.

    With that in mind, back to the master / slave discussion.


    So we need to choose an M that nicely divides the file so that we can get good performance out of both fast and slow slaves, but doesn't introduce too much network overhead. How the heck do we do that?

    I chose to do an end-run around this question - don't try to calculate M. Instead, we initially choose the size of the portion that we send to each rank. This number should be fairly low, and, unless you conduct some benchmarks, is somewhat arbitrary because you don't know the speed of the slaves beforehand (currently, I send 32 chunks to each rank in the master's first iteration).

    As each slave reports that it is done, the master doubles the size of the portion that it gave to that slave last time, reads in that amount from the input file, and sends it to the slave. In this way, we keep sending larger portions of data to the slaves as they request more. The faster slaves who keep asking for more portions will be given larger and larger portions of work. That is, the size of our messages increases so that our networking efficiency increases (there's a cutoff point, of course, but you get the idea).

    Note that there is a max size that the portions will grow to - the current implementation grows to a max of 256 chunks. This is a compromise between a not-too-short amount of time on a fast slave and a not-too-long amount of time on a slow slave.
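
    The growth policy itself is tiny; a sketch (the constant and function names here are made up, not the ones in the code) looks something like:

      #define MIN_CHUNKS  32   /* size of the first portion sent to each slave */
      #define MAX_CHUNKS 256   /* portions never grow beyond this              */

      int next_portion_size(int last_size)
      {
        int size = last_size * 2;     /* double what this slave got last time */
        return (size > MAX_CHUNKS) ? MAX_CHUNKS : size;
      }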


    This approach also solves the "too many ranks" problem. Consider what was discussed above - if you try to parallelize a problem that is too small, your run time will likely be longer than if you had run in serial because of the overhead that parallelism introduces. The same problem occurs even if you have a fairly large piece of work to perform, but try to use too many ranks. That is, each rank's piece of work is fairly small, thereby reducing the overall efficiency. Since each rank will perform its work quickly, the ranks will all be in contention to return their output to the master (which is an inherently serial process).

    More to the point, the overall efficiency of the parallel program is dependent on the efficiency of its parts. If each rank has lousy efficiency (say because they only have a small amount of work to do), the overall efficiency will be lousy. So instead of dividing the total amount of work by the number of processors, we split it into fixed sizes and dole out the work until there is no more to perform. This guarantees that there is some minimum size of work that will be sent to each slave, and provides a guaranteed lower bound on efficiency. Chosen properly, this lower bound can be as close to optimal (i.e., the upper bound) as possible.


    Output file ordering

    So we haven't discussed how the output MP3 data is gathered by the master and written out to the output file yet - we just did a bit of hand-waving in the previous sections saying "don't worry about it; the output happens somehow." In reality, here's what happens...

    As each slave finishes the work that the master gave to it, it sends the resulting MP3 output data back to the master. The master receives and stores the MP3 data in memory and, if there is any input left in the input file, gives the slave a new chunk of work to do.

    However, consider the following:

    • The size of the MP3 output data may be different for each input set.
    • Slaves may not operate at the same speed.
    • Even if slaves are the same speed, workstations on a LAN can only transmit to the master one at a time (regardless of the underlying network media). So even if multiple ranks finish their work "simultaneously", only one of them will be able to immediately send the results to the master - the rest will automatically be queued up due to normal networking/message passing protocols.
    • The issue is essentially the same for SMPs as well; the master only talks to one slave at a time.

    Why is this important? Consider what happens when rank 5 returns its MP3 output data before rank 3 returns its MP3 output data. The master cannot write rank 5's data to the output file, because it has not written the previous ranks' output data (at the very least, it has not written rank 3's data), and since it doesn't know the length of rank 3's output data, it cannot speculatively write rank 5's data "in the right place" ahead of time.

    INPUT FILE:
      |------||------||------||------||------|...
       rank 1  rank 2  rank 3  rank 4  rank 5 
    
    OUTPUT FILE:
      |---||-||----||--||---|...
       rank 1  rank 3    rank 5
            rank 2   rank 4
    

    More to the point, the output file must be written in the same order that the input file was read. That is, we cannot write rank 5's MP3 output data until the MP3 output data has been written from ranks 1, 2, 3, and 4.

    Specifically, the master has to keep track of which portions of the input file it has given to which slave. As slaves return their output data, the master has to check which portion of the input file that slave was encoding and whether all preceding portions of the output file have been written yet. If they have, the master can write out the new output data. If the preceding sections have not all been written yet, the new output data must be queued up for later writing.
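
    A sketch of that bookkeeping (all names here are made up for illustration; the real code handles this differently):

      #include <stdio.h>
      #include <stdlib.h>

      #define MAX_PORTIONS 4096

      static unsigned char *pending[MAX_PORTIONS];      /* returned MP3 data...  */
      static size_t pending_len[MAX_PORTIONS];          /* ...and its length     */
      static int next_to_write = 0;

      void slave_returned(FILE *outfile, int portion,
                          unsigned char *data, size_t len)
      {
        pending[portion] = data;              /* remember this result */
        pending_len[portion] = len;

        /* flush everything that is now contiguous from the front */
        while (next_to_write < MAX_PORTIONS && pending[next_to_write] != NULL) {
          fwrite(pending[next_to_write], 1, pending_len[next_to_write], outfile);
          free(pending[next_to_write]);
          pending[next_to_write] = NULL;
          next_to_write++;
        }
      }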


    Using SMPs / multithreading

    So does this scheme work with SMPs?

    Absolutely. One of the nice things about MPI is that it intentionally doesn't distinguish between whether ranks are on the same physical machine or not. When you use MPI to send a message, you just rely on the MPI implementation to do the fastest thing to send the message to the destination rank (regardless of whether the source and destination ranks are on the same machine or not).

    So when running BladeEnc in parallel, simply start up one rank per CPU (plus 1 for the master). So if you have 4 uniprocessors, start 5 ranks. If you have 4 dual processors, start 9 ranks. If you have 2 uniprocessors and 2 dual processors, start 7 ranks, but be sure to distribute them properly so that there is one slave on each of the uniprocessors and two on each of the dual processor machines.
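
    As a concrete (and hypothetical - the host names are made up, and you should check the LAM documentation for the exact boot schema and scheduling syntax) illustration of the 2-uniprocessor / 2-dual-processor case:

      % cat lamhosts
      uni1.example.com
      uni2.example.com
      dual1.example.com cpu=2
      dual2.example.com cpu=2
      % lamboot lamhosts
      % mpirun -np 7 bladeenc foo.wav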


    Tord and I have briefly discussed this scheme as well as a different potential multi-threading scheme for BladeEnc. That is, someone else has proposed a multi-threading scheme for BladeEnc that is more of a pipelined model, where each thread does a different stage of the computation in the MP3 encoding. Input and output data is therefore simply passed between threads in the pipeline.

    After giving this some thought (and again giving the disclaimer that I don't know much about the MP3 algorithm), I'm not sure whether this model would be a performance win or not. Let me explain.

    The model that I have given above works for SMPs as well. It could certainly be improved such that one instance of BladeEnc runs per node, not per CPU, and each rank is responsible for having "sub-worker" threads. That is, the rank spawns K threads, where K is the number of CPUs on the machine that the rank physically resides on. Each of the K threads loops over encode() on its portion of the input that that rank received. This is just a logical extension of the previous subsetting.

    This would arguably be a better approach than having one rank per CPU, because it could potentially avoid some network interface congestion/contention between ranks on the same physical machine. I did not take this approach, however, because there are some global variables that are used in the encoding code (making it non-thread safe), and I didn't want to muck around with that [yet].

    Using the pipeline thread approach has a few drawbacks (IMHO):

    • Each stage in the pipeline is likely to have different amounts of work, unless the stages are very carefully instrumented. If each stage has a different amount of work, it will result in what could effectively be termed pipeline stalls - shorter stages will be forced to wait for work from previous longer stages, and thereby leave their CPUs idle part of the time.

    • There are (will be) a fixed number of stages in the encoding algorithm - let's call it A. On a given machine, you have a fixed number of available CPUs - K. When (A == K), life is good. But when (A > K), you'll have more threads than CPUs, which will cause thrashing. If (A < K), you'll have fewer threads than CPUs, meaning that you'll be wasting (K - A) CPUs since you won't be using them.

      The point that I'm making is that the pipeline approach may not scale nicely to any number of CPUs where (K != A). Yes, there probably are solutions (particularly for the [A < K] case) to make it scale nicely, but it could get quite complicated and ugly quite fast.

    So it's not clear to me that this pipeline model is a good one. The nice thing about the master / slave model (either as it is now, or if we introduce slave sub-threads) is that it scales nicely with this kind of algorithm.

    However, it is quite possible that I missed something; Tord and I really only spoke very briefly about this. This is my $0.02.


    LAM/MPI shortcuts / conveniences

    One of the newer features of MPI is that an MPI application can launch other MPI applications (although not all MPI implementations have incorporated this feature yet). This feature is quite convenient in BladeEnc's case, because, when combined with another MPI feature that allows the querying of how many compute ranks are available, it allows the automatic parallelization of BladeEnc without the user knowing that anything magic has happened.

    For example, in LAM/MPI's case, once you lamboot to fire up the MPI run-time environment on some number of nodes, you can simply:

      % bladeenc foo.wav bar.wav baz.wav
    

    just like you have always done in the past. Additionally, your favorite GUI front end that uses BladeEnc for its back-end MP3 encoding will also work in parallel - no special command line flags or mpirun is necessary.


    Specifically, if parallel BladeEnc detects that there is only one rank when it starts, and BladeEnc's configure determined that the underlying MPI supports the spawning feature, BladeEnc will automatically start MPI_UNIVERSE_SIZE slaves. MPI_UNIVERSE_SIZE is typically the number of CPUs available, but it may be the number of machines available - it is dependent upon the MPI implementation (LAM 6.3.2 uses the number of nodes, but LAM 6.3.3 will use the number of CPUs).
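
    In rough MPI terms, the magic looks something like this (a sketch only, not the exact code - error checking and the real spawn arguments are omitted):

      int *universe_size, flag;
      MPI_Comm slave_comm;

      /* ask the MPI run-time how many slaves it thinks we can start */
      MPI_Comm_get_attr(MPI_COMM_WORLD, MPI_UNIVERSE_SIZE, &universe_size, &flag);

      if (flag && *universe_size >= 2)
        MPI_Comm_spawn("bladeenc", MPI_ARGV_NULL, *universe_size, MPI_INFO_NULL,
                       0, MPI_COMM_SELF, &slave_comm, MPI_ERRCODES_IGNORE);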

    However, if BladeEnc detects only one rank and the spawning capability is not available in the underlying MPI, you'll essentially be using the stock serial version of BladeEnc.

    The same is true if there are only two ranks. That is, it doesn't make much sense for there to be one master and one slave - the master would simply send the data to the slave, the slave would process it, and send it back. All we're doing is incurring network overhead; the process is not sped up at all. Hence, if there are only two ranks, the second rank will be ignored and the master will proceed in serial mode.
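
    In other words, the decision made at startup boils down to something like this (a sketch; serial_master() and parallel_master() are described below, and num_ranks is just an illustrative name):

      if (num_ranks <= 2)
        serial_master();      /* 1 rank, or master + only 1 slave: no win, encode serially */
      else
        parallel_master();    /* master + at least 2 slaves: encode in parallel */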


    Parallel portability

    One of the great things about programming with MPI is that it is inherently portable. The API is completely standardized, so if your program compiles with one MPI, it compiles with all MPIs.

    Each MPI implementation does have its own quirks, however - run-time behaviors can differ. Even though the MPI API is standardized, and the behavior of each function is specified, there are some intentionally vague points in the MPI standard that MPI implementors are free to interpret in various ways. This can cause some valid MPI codes to be "unportable", because they work fine with one MPI implementation, but "hang" when using a different MPI implementation.

    The MPI code in this parallel BladeEnc should be immune from such issues - it should work just fine under all implementations of MPI. The MPI code used is actually fairly simple and straightforward, and doesn't really use any of the ambiguous sections of MPI.

    Hence, the parallel code in this BladeEnc is highly portable to any parallel architecture.


    Microsoft Windows support

    Does this work on Microsoft Windows?

    Hypothetically, yes. MPI is a fairly ubiquitous environment - there are implementations of it for just about every OS/platform, including Microsoft Windows (there are at least 2 freeware MPIs that run under Windows). Since BladeEnc itself runs nicely under Windows, and since the MPI additions that I did are nothing out of the ordinary, there's no reason that you can't encode your MP3s on a cluster of Windows workstations with a Windows-based MPI.

    I do not personally use Windows very much, nor do I develop on it. Hence, I'm not familiar with any of the auto-configuration tools that Windows developers use. Specifically, the configure script that works nicely on POSIX systems to determine the capabilities of your MPI will not work under Windows. You'll likely have to set the #define statements yourself. Any advice here would be appreciated.
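
    For example, something along these lines (WANT_MPI appears in the source; the spawn-related define is a purely hypothetical name - check the actual headers for what your version expects):

      /* hand-edited configuration for a Windows build (illustrative names only) */
      #define WANT_MPI 1                  /* build the parallel code               */
      #define HAVE_MPI_COMM_SPAWN 0       /* e.g., MPICH/NT: no MPI-2 spawn yet    */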

    I do know that at least one of the Windows-based MPI implementations (MPICH/NT) does not have very much MPI-2 support yet, so the automatic spawning feature will not work. I am not sure of the MPI-2 status of the WMPI project.


    The code itself

    The majority of parallel code is in its own file, parallel.c (and parallel.h). A few hooks had to be inserted in main.c and some other files, but nothing big. grep for "#if WANT_MPI" in the .c files to see where parallel logic has been added.

    You should also read the file bladeenc/JMS-NOTES in the source code distribution for some notes on the parallel algorithm and whatnot.


    main.c

    Since I only added a few hooks in main.c, it's easiest to talk about them with just a list of bullets. All the added code in main.c is surrounded by "#if WANT_MPI", so it's easy to find.

    • First off, we have to include <mpi.h> to get all the MPI function prototypes, types, and extern constant declarations.

    • There are a few global variables instantiated:

      • num_desired_slaves: The number of slaves that the user set with the -slaves=N command line switch.
      • num_slaves: The number of slaves that are actually being used.
      • slave_comm: A private MPI communicator for the slaves to communicate in. (Sidenote: a "communicator" is essentially a private communications space inside of MPI that contains a set of ranks [i.e., processes] and a unique context that makes it private)

    • We have to do a few things if we don't want MPI, too. Recall that if the configure script doesn't find support for MPI, you'll get a plain vanilla version of BladeEnc. So the first "#if !WANT_MPI" section defines a few variables that have to be used in the main loop if we're not going to go parallel.

    • Right at the top of main(), we call parallel_startup(). Slave ranks entering this call will not return -- they will be dispatched to the "worker" code. The master will return.

    • Further down, after we have parsed the command line to check for the -slaves=N argument, we call parallel_spawn(). This will spawn slaves if the following conditions are met:

      • The underlying MPI supports the MPI-2 function MPI_Comm_spawn().
      • Parallel BladeEnc was started with just one rank.
      • The number of desired slaves is greater than or equal to 2.

    • A little further down, we print out a tagline specifying whether we're encoding in parallel or in serial.

    • The main loop for encoding a single song has been moved to parallel.c. There are two main functions: serial_master() (for encoding with just one CPU) and parallel_master(). Depending on how many slaves we have, call the appropriate routine (if we were compiled without MPI support, unconditionally call serial_master()).

    • There's a little more bookkeeping (specifically if we want "{" and "}" or not) if we compile with parallel support.

    • Finally, when BladeEnc is completely finished, we tell the slaves that we are finished (i.e., distribute a special "We're done!" message to each of the slaves), and then shut down MPI in an orderly fashion with calls to parallel_master_finish() and parallel_shutdown().

    • Additionally, in the quit() routine (which is called for abnormal termination, IIRC), we also have to call parallel_master_finish() and parallel_shutdown() so that all the slaves shut down when the master shuts down.

    • In the readGlobalSwitches() routine, a little code was added to check for the -slaves=N command line switch, and to set num_desired_slaves as appropriate (or print a warning if BladeEnc wasn't compiled with MPI support, or if the MPI doesn't support spawning). Similar logic was added in the printUsage() routine.


    parallel.c

    This code is entirely original; it's not just a few hooks into the original BladeEnc code (the tabbing style is obviously different, for one thing ;-). Hence, I'm not going to go into as much detail as I did for main.c -- there are tons of comments in the file itself for that. Instead, I'll just describe each function and the general algorithm that it implements.

    Just about the entire file is surrounded by "#if WANT_MPI", so even though parallel.c is compiled regardless of whether configure finds support for MPI, this file will essentially be empty after the preprocessor chunks through it if we are compiling a serial BladeEnc (with the exception that the serial_master() is always included).

    I do apologize for all the debugging code in this file (all the #if MASTER_OUT and #if SLAVE_OUT and whatnot). Trust me in that debugging in parallel is a real pain in the... watch your mouth! (I'm just talking about Shaft...)

    And that's it. It's that simple.

    Ok, it's not quite simple. But it's cool. :-)


    Areas for improvement

    • The load balancing algorithm could probably be better. Should probably do a little investigation here to make it smarter such that overall performance is maximized.

    • In the current implementation, the batch mode waits for each input file to be completely processed before moving to the next. That is, the parallelization is across input files, not across the entire batch of work that needs to be done. So if you have a slow slave, you may end up waiting at the end of each input file for that one slave.

      The parallelism should be moved up one level (i.e., to the batch level). So when one input file is exhausted, the master should immediately start reading from the next input file and distributing that work, regardless of whether the previous input file(s) is(are) finished.

      This will entail a bit more bookkeeping on the master's part which will be annoying, but not overly complicated. It's more a matter of getting a round tuit.

    • It should be noted that there is no need for redundant processing (once the parallelism is moved up to the batch level, that is). Specifically, not much will be gained by sending the same input samples to multiple slaves. The only place where this might be beneficial is at the very end of the computation, when there is no more input left to give.

      In this case, the master could easily give a set of samples to some slave that has already been given to another slave. One of the slaves with this set of samples will finish first and return the output to the master. The master can then write the first-returning output to the .mp3 file, and discard any redundant outputs that are returned later.

      But this is pointless in this situation, because we have to wait for all slaves to finish before exiting BladeEnc gracefully. Hence, we have to wait for the slowest slave anyway - why add [even more] complexity to the master by implementing redundant data? The only purpose for doing redundant data would be for a performance speedup, if we could finish the overall work faster and not have to wait for the slow slaves. But since we have to wait for the slow slaves anyway, nothing would be gained here.


    Open questions

    1. Is my approach of sending redundant chunks to every slave correct? I am actually not familiar with the MP3 algorithm, but a friend of mine who is has assured me that this will correct for the loss of differential, and the resulting output will be of the desired quality.

      However, this does make different output than the serial runs. So the question is not only "is this correct", it might also be, "does it matter"? I currently use 4 redundant chunks.

    2. Which multi-threading approach do we want to take? The pipeline approach, the slave sub-thread approach, or something else?

    3. What are good minimum and maximum packet sizes to use? I currently use 32 chunks as the minimum, and 256 chunks as the maximum. These defaults might be fine, but it may be good to have a compile-time and/or run-time way to change these defaults if you have exceptionally fast (or slow) slaves.

    4. How to integrate this nicely in Windows? That is, how do we test the capabilities of the local MPI implementation like we do in the POSIX configure script?

    It is important to keep in mind that this is first-generation stuff. The master/slave approach - while technically sound and effective - is possibly not the best approach. Any comments and suggestions are welcome (please send them to the BladeEnc developer's list).

