Many programming languages support multithreading and multiprocessing as ways of executing code in parallel. Both approaches split work into tasks that can be executed concurrently, which can lead to faster execution times for tasks that are not blocked by other operations. There are, however, advantages and disadvantages to each.
Multithreading and multiprocessing can improve performance when executing certain operations. There are many different implementations of each, so it is important to know the limitations of the implementation in use and to consider such things as:
- number of processors (or threads) that are available when the code is running
- the duration and number of tasks that are being executed
Multithreading is the ability to run concurrent tasks within the same process. This style of concurrent programming allows the threads to share state and to work from the same memory pools. The advantages of this form of concurrency are:
- Shared memory heaps and pools allow for reduced overhead of shared components – when compared to Multiprocessing
- The process can scale the number of threads while running – allowing for dynamic scaling at peak times
- Allows for asynchronous execution of tasks that communicate with remote services – allowing the process to continue with some other operation while it waits, for example, for a database to complete a query
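As a rough illustration of the last point, the sketch below hands a slow call to a worker thread and carries on with other work while it waits. The `slow_query` function, its return value and the `time.sleep` delay are hypothetical stand-ins for a real remote service; only `ThreadPoolExecutor` comes from the Python standard library.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def slow_query(customer_id):
    """Hypothetical stand-in for a query sent to a remote database."""
    time.sleep(0.05)  # simulates waiting on the network
    return {"customer_id": customer_id, "status": "active"}


def handle_request(customer_id):
    with ThreadPoolExecutor(max_workers=1) as pool:
        # The query runs on a worker thread...
        pending = pool.submit(slow_query, customer_id)
        # ...while this thread continues with some other operation.
        greeting = f"Preparing report for customer {customer_id}"
        # Block only at the point where the result is actually needed.
        record = pending.result()
    return greeting, record
```

Calling `handle_request(7)` returns both the locally prepared text and the record from the simulated query.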
While the advantages of this style of concurrency are clear, sharing memory and resources makes it harder to keep data consistent. For example, data from one thread can ‘leak’ into another thread. Most languages that support this style of operation guard against these errors (as best they can) with locks and synchronizers. A lock prevents other threads from accessing a resource while it is held by a thread.
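A minimal sketch of a lock protecting shared state, using Python's `threading` module. The `SafeCounter` class and `count_with_threads` helper are names invented for this example, not part of any library.

```python
import threading


class SafeCounter:
    """A counter shared between threads, protected by a lock."""

    def __init__(self):
        self._value = 0
        self._lock = threading.Lock()

    def increment(self):
        # Only one thread at a time can be inside this block.
        with self._lock:
            self._value += 1

    @property
    def value(self):
        with self._lock:
            return self._value


def count_with_threads(n_threads=4, n_increments=1000):
    counter = SafeCounter()

    def worker():
        for _ in range(n_increments):
            counter.increment()

    threads = [threading.Thread(target=worker) for _ in range(n_threads)]
    for thread in threads:
        thread.start()
    for thread in threads:
        thread.join()
    return counter.value
```

Because every update happens while the lock is held, the final value always equals `n_threads * n_increments`. The `with` statement also releases the lock automatically, even if the protected code raises an exception.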
Locks and synchronizers add complexity of their own, as you now have to know which threads hold which locks, and ensure that each lock is released when it is no longer needed. Mistakes in these areas can leave threads waiting a long time for a lock to be released, effectively removing the advantages of the multithreaded environment. In the worst case they lead to a ‘deadlock’, a situation where threads are waiting for each other to release locks and none of them can proceed. When a process gets into this state it is very difficult (if not impossible) for the system to recover, meaning a restart of the process will be required. It can also be very difficult to know that a deadlock has occurred, because the process continues to run and might still respond to some requests. For these situations it is important to have tools in place (such as FusionReactor) to alert you.
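One defensive pattern, sketched here in Python, is to acquire a lock with a timeout so that a stuck lock surfaces as an error instead of a silent wait. The `run_with_lock` helper and its one-second default are assumptions made for this example. It does not prevent deadlocks, it only makes them visible.

```python
import threading


def run_with_lock(lock, action, timeout_seconds=1.0):
    """Run action() while holding lock, giving up if the lock stays busy."""
    if not lock.acquire(timeout=timeout_seconds):
        # Another thread has held the lock for too long: report it
        # rather than waiting forever.
        raise TimeoutError("lock not acquired; possible deadlock or long hold")
    try:
        return action()
    finally:
        # Always release, even if action() raises.
        lock.release()
```

A caller can catch the `TimeoutError` and log or alert on it, which gives monitoring something concrete to report.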
The duration and throughput of concurrent operations should also be considered, as executor pools can quickly start queuing operations. In languages such as Java, a common approach to multithreaded execution is to use an executor pool: a collection of threads that execute tasks from a queue. This approach reduces the overhead of creating new threads, as the threads are reused. Problems arise when there are too few executors for the incoming tasks. Suppose, for example, that two tasks are submitted to the queue every second and the executor pool has 2 threads (so it can execute 2 tasks at the same time). If the tasks take an average of 1.5 seconds to complete, the pool finishes only about 1.33 tasks per second, so the queue slowly gets larger because there are not enough executors to complete the tasks before the next tasks are added.
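The arithmetic behind this example can be sketched in a few lines of Python. This is a simplified model that assumes steady arrivals and a fixed average task time. The function names are invented for the illustration.

```python
import math


def pool_capacity(workers, task_seconds):
    """Tasks per second the pool can complete when every worker is busy."""
    return workers / task_seconds


def backlog_after(arrivals_per_second, workers, task_seconds, elapsed_seconds):
    """Approximate number of queued tasks after elapsed_seconds."""
    growth = arrivals_per_second - pool_capacity(workers, task_seconds)
    return max(0.0, growth * elapsed_seconds)


def workers_needed(arrivals_per_second, task_seconds):
    """Smallest pool size that keeps pace with the arrival rate."""
    return math.ceil(arrivals_per_second * task_seconds)
```

With the figures above, `backlog_after(2, 2, 1.5, 60)` gives roughly 40 queued tasks after one minute, while `workers_needed(2, 1.5)` suggests a pool of 3 threads would keep pace under this model.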
Some programming languages offer what are known as green threads. These simulate a multithreaded environment without using more native threads. In Python, for example, these are called greenlets; they allow the programmer to control when the process can switch to another thread to process another task. As with native threads, there are advantages to this form of concurrency:
- No need to use locks and synchronizers – as the programmer controls when the process changes to another thread, the shared resources do not need locks to protect themselves from data corruption
- Less overhead from thread schedulers when context switching threads – although multiple threads carry less overhead than multiple processes, creating and scheduling threads still requires some resources
- Allows the programmer to have more control over the execution of the program – with native threads you can control when you start and stop the threads, but you have limited control on when they will actually be executed
Although there are advantages to this form of concurrency, caution is needed, as green threads cannot yield (allow other threads to execute) while a native blocking operation is in progress. This means that if you are reading from or writing to a network (waiting on a database, for example) you can easily get into a situation where the entire process blocks until the operation completes. To prevent such situations, more complex IO is needed to stop the process blocking on the operation. This increased complexity can quickly outweigh the advantages of using green threads.
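The effect of a blocking call on cooperatively scheduled tasks can be seen in the sketch below. It uses coroutines from Python's standard `asyncio` module as a stand-in for green threads, since both only switch at points the programmer chooses. It does not use the greenlet library itself, and the `task` and `run_pair` names are invented for the example.

```python
import asyncio
import time


async def task(name, log, blocking):
    log.append(f"{name} start")
    if blocking:
        time.sleep(0.05)           # native blocking call: nothing else can run
    else:
        await asyncio.sleep(0.05)  # cooperative wait: other tasks may run
    log.append(f"{name} end")


async def _run_pair(blocking):
    log = []
    await asyncio.gather(task("a", log, blocking), task("b", log, blocking))
    return log


def run_pair(blocking):
    """Run two tasks on one native thread and return the order of events."""
    return asyncio.run(_run_pair(blocking))
```

With `blocking=False` the two tasks interleave (both start before either ends). With `blocking=True` task `b` cannot start until task `a` has finished, because the blocking call never gives control back.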
Multiprocessing, also known as process forking, is another way of running multiple tasks at the same time. It differs from multithreading in that the whole process is duplicated, along with its memory and resource requirements. This method can be used to obtain the benefits of a multithreaded environment without having to deal with the concurrency issues that come with those environments. There are a few advantages of this approach:
- No need to use locks and synchronizers – as the whole process is duplicated the resources are not shared and therefore there is no need to use locks to prevent concurrent access
- No need for thread scheduling within the process – as there are no threads to context switch, no thread scheduler is needed to control them (the operating system still schedules the processes themselves)
In a multiprocessing environment each process has its own memory, which is not shared with the other processes. In modern environments multiprocessing is used in a few different ways. Languages such as Node.js support process forking to run multiple instances of the code on the same machine at the same time, simulating a thread pool. This duplicates the master process and launches it as a new process on the machine, running the code as if it had just started. This has the advantage of separating out all the memory usage, allowing for easier code development; however, it has a high overhead (when compared to multithreading), as the entire process is duplicated and executed.
Another way of using multiprocessing is to deploy multiple instances of the process behind a load balancer and forward the requests to the instances. This is a common approach for web applications, and when there is a need to scale up and down quickly. There are many ways to do this, including the usage of managed services such as AWS Elastic Beanstalk or Kubernetes clusters. This method of multiprocessing can be combined with multithreading to get the benefits of both methodologies.
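The memory isolation described here can be sketched with Python's `multiprocessing` module. The example assumes a POSIX system where the `fork` start method is available. The `counter` variable and the function names are invented for the illustration.

```python
import multiprocessing as mp

# Module-level state: each forked child receives its own copy.
counter = {"value": 0}


def increment_in_child(result_queue):
    counter["value"] += 1              # changes the child's copy only
    result_queue.put(counter["value"])


def run_isolated_children(n_children=3):
    ctx = mp.get_context("fork")       # assumption: fork is supported
    results = ctx.Queue()
    children = [
        ctx.Process(target=increment_in_child, args=(results,))
        for _ in range(n_children)
    ]
    for child in children:
        child.start()
    child_values = [results.get() for _ in children]
    for child in children:
        child.join()
    # The parent's counter is untouched by the children.
    return child_values, counter["value"]
```

Every child reports a value of 1 and the parent still sees 0, since no process can see another's changes. Results have to be passed back explicitly, here through a queue.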
- Multiprocessing allows for isolation of memory and resources – reducing the risk of bad requests affecting other processes
- Multiprocessing removes the overhead and complexity of the thread schedulers
- Multithreading allows for blocking operations to be waited on without blocking the whole application
- Multithreading allows for shared resources between threads reducing the overhead of the process as a whole
There are advantages to both multithreading and multiprocessing; both can be used to improve the performance and reliability of applications. There are many things to consider when choosing which is best for the application, such as:
- the number of threads or cores that are available – if there are 8 threads running but the hardware can only run 4 threads at a time, the CPU can become overcommitted, and the thread scheduler has to context switch so often that the extra threads bring no benefit.
- the duration and number of tasks – if tasks arrive faster than the available executors can complete them, the number of queued operations keeps increasing and the backlog may never be cleared.
- locking and synchronizers – in multithreaded systems the incorrect usage of locks and synchronizers can lead to a deadlock where the system is unable to recover without a restart
- concurrency errors – related to incorrect usage of locks, it is possible to corrupt data shared between threads, leading to situations where the wrong data is used. This can be very difficult to debug without the correct tools
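For the first consideration, one simple precaution for CPU-heavy work is to cap the pool size at the number of cores the machine reports. The sketch below assumes Python's `os.cpu_count()` is a reasonable proxy for available cores. The helper names and the default of 8 requested workers are invented for the example.

```python
import os
from concurrent.futures import ThreadPoolExecutor


def capped_worker_count(requested):
    """Never use more workers than reported cores, and never fewer than one."""
    available = os.cpu_count() or 1    # cpu_count() can return None
    return max(1, min(requested, available))


def run_tasks(tasks, requested_workers=8):
    """Run zero-argument callables on a pool sized to the machine."""
    workers = capped_worker_count(requested_workers)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # map() returns results in the same order as the input tasks.
        return list(pool.map(lambda task: task(), tasks))
```

Tasks that spend most of their time waiting on remote systems are a different case, and a pool larger than the core count can still be worthwhile for them.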
Ultimately there is no single best way: each approach has advantages and disadvantages, and which is best depends on the workload of the application. As a quick guide, multithreading is good when dealing with remote systems, as there is no need to block the entire process while it waits for native operations (such as IO). Multiprocessing is good when dealing with known quick operations that do not need external systems, or tasks that need to execute various operations in a specific order without sharing data with other executors.