Tools For Tuning Parallel Programs


Parallel computing offers the potential for substantial increase in performance over sequential computing. But, the much booted performance of parallel systems with all its anticipated thunder hasn't yet arrived. One of the most significant reasons for such a situation is the unavailability of proper tools and techniques to aid in developing software for these systems. As a productive research group, we 0 to take on the challenges provided by contemporary knowledge and technology to arrive at intelligent solutions and viable alternatives.

Program Model:

Parallel programs, particularly the MIMD programs, may be classified as either message passing or shared memory programs. Though shared memory programs are the preferred program models from a user's perspective (for their ease of usage), they are more difficult to be supported than explicit message passing systems. A major difficulty in supporting a shared memory model is data consistency, whereas in a message passing system, such constraints are obviated. We prefer the message passing paradigm as it offers a lot of flexibility in programming. However, the results can just as easily be applied to the shared memory model.

Generalized Critical Path:

The concept of a critical path is often used in operations research. A large problem is broken into smaller tasks and are scheduled to complete in the shortest period of time. Many of the tasks can be carried out in parallel, but there may be certain tasks that require results from other tasks. The completion time for the problem depends on a 0 of tasks that take the longest time to finish. This irreducible 0 of tasks is referred to as the 0 path. Evidently, any attempts to shorten the time to execute the problem should start with the identification of the critical path. Any reduction in the execution time of a task that lies on the critical path will lower the overall execution time of the problem.

Since the critical path for every run may vary, we propose to find out the Generalized Critical Path which is the most probable critical path for the parallel program. An important difference in our approach over previous approaches lies in the fact that no instrumentation is used while identifying the GCP. Hence, there is no intrusion to the program execution. This is made possible by the concept of Synthetic Perturbation Tuning.

Synthetic Perturbation Tuning:

SPT is a novel approach in which small delays (these are too small to cause any intrusion to the program execution) are introduced in program segments, and the execution time of the program is measured. An increase in the execution time indicates that the particular program segment is sensitive to delays, and any addition of delays at that place is disastrous to the overall execution time. Since, the delays are too small, the execution time is assumed to be linear, and so, reduction in the execution time of the particular program segment will lower the overall execution time. In effect, what that means is that the particulat program segment is on the critical path. In this way, we scan identity the critical path of the program.


Scheck is a tool developed at the NIST that automates the introduction of delays by a graphical user interface. However, at present, the tool can only introduce delays in only one thread of execution, and multiple threads aren't yet supported.

Results thus far:

A lot of work has gone into identifying the Generalized Critical Path. A tool called the PathTool has been developed that identifies the generalized critical path using SPT. But, the tool is too cumbersome to use as the delays for SPT analysis are manually inserted. We are very much on our way to automate PathTool and integrate it with scheck.

Current Research and Proposed Future Work:

The proposed research is to utilize the concept of Synthetic Perturbation to tune the performance of parallel programs. Synthetic Perturbation is an approach wherein extremely small delays are introduced into code segments and the response time of the code in study is closely monitored. The major advantage of SPT lies in its minimum intrusion of the actual application program. (Probe Effect is a familiar disadvantage that is tolerated in conventional performance monitoring systems and our approach attempts to overcome this situation.) This is because we are attempting to monitor only those events that are associated with the start and end of the response-interval measurement, and another small intrusion where the program is being delayed. In short, SPT determines which subset of code segments will be most beneficial to to the overall response time, if shortened.

There are two ways in which we intend to tune parallel applications using SPT. The first is a crude method to quickly show performance bottlenecks by using SPT in a single thread of execution. The second method introduces delays in all threads of execution, from which a complete picture of the performance can be understood.

The first method can be directly used by scheck. The second method is to modify scheck to introduce delays in more than one thread of execution (as currently scheck doesn't support multiple threads ).

The disadvantage of the first method is that significant bottlenecks may never be known, if that method alone is employed. But, it is our feeling that it may throw a good amount of light on the overall situation. We feel that the first method followed by the second method will be a good way to tune the parallel application. After all, tuning by SPT is a iterative method, and the advantage of the first method lies in not using extensive runs for clearing initial bottlenecks. The second method may be used for finer tuning of the application program.

Go Back