The Journal of Supercomputing, 18, 89–104, 2001 © 2001 Kluwer Academic Publishers. Manufactured in The Netherlands.

Source-to-Source Instrumentation for the Optimization of an Automatic Reading System∗

P. PEREIRA, L. HEUTTE, AND Y. LECOURTIER

[email protected] [email protected]

PSI-La3i, Université de Rouen, 76821 Mont-Saint-Aignan Cedex, France

Abstract. Recording the address traces that occur during a program execution is an important technique for computer performance analysis. This paper describes a software method for address tracing via the instrumentation of C based languages. All program transformations are performed at the language level. This approach, which differs from the usual methods, allows portable and flexible program instrumentation. The tool has been developed to facilitate the memory optimization of LIREChèques, an automatic bank check reading system. Two applications of the tool are clearly identified: (i) data cache use optimization, and (ii) dynamic memory use optimization.

Keywords: C based language instrumentation, data cache use optimization, heap memory optimization, on-line simulation, bank check processing

1. Introduction

The two approaches for collecting address traces are hardware methods [1, 8, 10, 14] and software instrumentation. The software approach, which consists in inserting instrumentation code, is used in many applications such as the study of cache behavior [9, 12, 17, 18, 32], prefetching [5, 6], TLB behavior [7] or file system performance analysis [13]. Insertion of address-tracing code is usually performed at the executable level. This approach, although efficient, depends on the hardware architecture. To avoid this problem, we propose an instrumentation of the source program with all necessary program transformations performed at the C language level [30]. This makes the tool independent of the hardware architecture and therefore very portable and flexible. The tool is intended to facilitate the optimization of memory use in LIREChèques, an automatic bank check reading system developed by MATRA Systèmes et Information, a French company, and integrated in several operational sites. This application, written in C, consists of a set of nearly 100 algorithmic steps linked sequentially. The input data is a list of bank checks. The numeral and literal amounts of each check are extracted and recognized by different techniques [16, 29]. The output of the process is a list of recognition results for each check. The LIREChèques system handles a large amount of data during execution and thus requires a lot of memory. Some data which are useful during several steps of a bank check processing (for example, the binary image of the check) are

∗Author to whom all correspondence should be addressed.

Figure 1. The LIREChèques program and the global data structures

stored in global data structures. These structures are reset after the end of each check processing. Figure 1 shows a simplified scheme of the LIREChèques system and some of the information collected in the global data structures along the bank check processing. Memory optimization is an important task in computer system improvement. In the case of LIREChèques, this task is crucial because of the large number and size of the handled data. Two aspects of memory optimization have therefore been addressed: first, the improvement of data cache use, because the cache, and more particularly the data cache, is crucial to obtain high performance; and second, the improvement of dynamic memory use, to prevent the saturation of the system which occurs with problematic checks. To realize these two tasks, we collect a trace of the data memory references that occur during a LIREChèques execution and exploit it with well-suited methods. In the case of cache simulation, the trace information is provided to a cache simulator which gives statistical results on the cache behavior. For the optimization of dynamic memory use, the data addresses of the global data structures are tracked during the execution, and information about the location of the memory references in the program is recorded. The paper is organized as follows. Section 2 describes the instrumentation of C based languages for tracking memory references of program data. Section 3 explains how this instrumentation can be used for the memory optimization of the LIREChèques system, and some experimental results are given as proof-of-concept for the data cache use and heap memory optimization problems. Finally, some conclusions are drawn in Section 4.

2. Instrumenting C based languages

C and C++ based languages are frequently used in industrial applications. Since these languages are very permissive, program instrumentation is harder to realize. Program instrumentation is achieved by the insertion of monitoring code in order to record the execution of program events or to collect and record data. Instrumentation code can be added at any stage of compilation. Usually, a program is instrumented at the executable level [2, 23, 24, 25]. This method gives efficient results but depends on the hardware architecture. Our approach departs from the existing ones in that we want to make our instrumentation tool really portable and flexible. Instrumenting a program at the language level then seems a good solution to keep independence from the hardware architecture. A source-to-source transformation which adds measurement code directly into a source program is usually used for measuring source-level characteristics [21, 22], program profiling [28], program running time analysis [27] or heap space analysis [33]. Such an instrumentation does not require a detailed trace of events that occur during a program execution, unlike tracking memory references of program data, which requires a complete trace of events. As our approach is based on source code instrumentation, the Sage++ library [3, 4], an object-oriented toolkit for building program transformation systems, is well suited for C based language transformations. The heart of the system is a set of functions that allow the tool builder complete freedom in restructuring the parse tree, and a mechanism (called unparsing) to generate new source code from the restructured internal form. Our global program instrumentation is presented in Figure 2. In order to obtain a real trace of data memory references without modifying the behavior of the traced system (from the algorithmic point of view), we propose to

Figure 2. Global program instrumentation

Figure 3. An example of a pre-instrumentation (“W” for a write access and “R” for a read access)

use a primitive called “access().” This function allows us to collect useful information about the memory references of program data. The data address and the access type (read or write) are the main data to record for generating a trace. Other information can also be recorded: the location (in time or in physical position) of the data access. We will see later that this information is important for the optimization of LIREChèques. Primitive insertion consists in exploring the parse tree in search of data accesses. When an object is detected (we call an OBJECT a datum which is accessed), the primitive is inserted into the parse tree before the statement enclosing the OBJECT. Figure 3 and Figure 4 show examples of source program transformations. Instrumentation of the for() control statement is more complex. This statement consists of three sub-statements called start, step and end. These three sub-statements are not evaluated the same number of times. Thus, a pre-instrumentation and a post-instrumentation (inside the loop, as for the while() statement) are not fitted to this type of statement. Our solution is first to convert the for() statement into an equivalent while() statement, and then to instrument the resulting statement in the classic way (Figure 5). Such an instrumentation, which requires a preliminary program conversion step, must take into account the possible forms of the sub-statements. For example, a “nil” value for the end sub-statement must be translated during the program conversion step into the value “1” (Figure 6). For similar reasons, the do-while() statement is also converted into a while() statement.

Figure 4. An example of a while() control statement instrumentation

Figure 5. An example of for() control statement instrumentation

Figure 6. An example of for() control statement program conversion in a particular case

To instrument bypassed expressions, the primitive is inserted by transforming the basic expression into an expression list holding the “access()” primitive and the basic expression (Figure 7; the bypassed expression is “u  v”). A bypassed expression (like any other expression) can be part of a control statement condition (for example, if(cond) where cond is an expression). In this case, the instrumentation by transformation of a basic expression into an expression list has priority (Figure 8). Other instrumentation problems, such as variable assignment at the declaration level, declarations inside the body of a control statement, the presence of a continue statement inside a for() control statement, or nested for() or do-while() control statements, have been encountered during the design of this tool. All these problems have been solved with the same approach: look at the definition of the expression, transform it if necessary, and instrument it with the “access()” primitive. The LIREChèques code (including libraries, and containing more than 400 source files and about 335,000 lines of code) has been totally instrumented with our tool. Since a source-to-source instrumentation is intrusive, the execution of the instrumented system is of course slowed down. However, the instrumented version of LIREChèques

Figure 7. An example of bypassed expression instrumentation (we consider that the operand evaluation is made from left to right)

Figure 8. An instrumentation of a bypassed expression inside a control statement condition

has given the same output results as those obtained from the initial version, which ensures a correct instrumentation of the application. We show in the next section that information about the memory behavior can be collected during the execution of the instrumented version of LIREChèques, and discuss how to use it efficiently for the memory optimization problem.

3. Application to memory optimization of LIREChèques

3.1. Memory test platform of LIREChèques

To facilitate the optimization of the data cache use and the improvement of the dynamic memory use, a memory test platform has been built (Figure 9). It assembles the instrumented version of the LIREChèques program and the tracing system, which gets information through the “access()” primitive and drives data to the address tracker, to the cache simulator, or to a program trace file. The collected information is the address of the datum memory reference, the access type (read or write) and the location information of the datum access, which corresponds to the name of the function where the reference to memory is made and the name of the file enclosing the function. The address tracker, which tracks addresses and collects information about them, is useful for the improvement of the dynamic memory use. The user can choose between different test modes:

(1) On-line cache simulation: data concerning the memory references are provided to the cache simulator during the program execution.
(2) Off-line cache simulation: a trace is recorded during the program execution and is exploited by the cache simulator after the program execution.
(3) Address tracking: some addresses are tracked during the program execution and tracking information is collected for these addresses. This test is useful for the optimization of dynamic memory use.

3.2. Optimization of the data cache use

Figure 9. The memory test platform of LIREChèques

Objectives. The cache behavior of our system is important to characterize because an advantageous use of the cache, and more particularly the data cache, can seriously improve the performance of LIREChèques. The optimization of the cache use can be separated into two tasks:

• Test LIREChèques on different types of cache to determine the best cache configuration for our system: the PowerPC microprocessors (600 series) have been retained as a well-suited hardware architecture for LIREChèques. It is therefore interesting to evaluate by simulation, quickly and at low cost, the cache behavior of these processors on the LIREChèques system.
• Identify cache performance bottlenecks and restructure the code with techniques that increase locality and fully exploit the performance potential of the cache.

Methods and tools. To achieve these two tasks, cache simulation is the suitable solution. The trace of the data memory references collected during the execution of the instrumented version of LIREChèques is exploited by a cache simulator. This type of simulation requires two main data, the referenced address and the access type (write or read), but also information about the location of the access, so as to locate the problematic program blocks. The trace can be exploited by two methods:

• On-line. The cache simulator exploits information during the execution and can therefore constantly provide statistical results about the data cache use.
• Off-line. The trace is stored in a file. At the end of the application execution, the cache simulator exploits the trace and provides statistical results about the data cache use.

The advantage of the off-line method is that the data can be used by other trace-consuming applications. A trace consumer typically reads traces 10 (or more) times slower than the rate at which a trace can be produced [24]. However, an off-line approach needs to save large trace files. We have opted for an on-line simulation while also keeping the possibility for the user to record the trace of the execution (see Section 3.1).


Several cache simulators have been developed over the last decade. The Cachesim simulator [31] is used for educational purposes and belongs to a simulation environment for a computer with a cache memory. Dinero [15, 17] as well as Tycho [19] are uniprocessor cache simulators written in C. These two simulators, which belong to a collection of tools called WARTS [19], have been distributed to several companies and universities. Tycho is able to evaluate several alternative uniprocessor caches simultaneously, but the design options that may be varied are restricted. Dinero evaluates only one uniprocessor cache but provides more performance metrics and allows more cache options to be varied. ACS [20], a uniprocessor cache simulator evaluating only one cache at a time, is the cache simulator we have decided to use first. It is written in C++ and supports several configurations. However, it is restricted to the LRU replacement algorithm. It can simulate unified and split caches as well as L1–L2 cache systems. The available options are the cache size, the number of bytes per line, the associativity, and the number of sectors. ACS can easily be modified and runs approximately 3 times faster than the Dinero cache simulator. It is therefore really fitted to our requirements. The provided statistics are the number of references, the number of cache misses and the miss ratio, which is related to the cache performance, even if the relationship may not be straightforward.

Experimental results. The experimental tests have been conducted in on-line mode because accurate information about the location of the references is too difficult to manage in off-line mode (it requires an enormous trace file). Five check images have been selected as a reduced test set since they are representative of the different problems that occur during check processing on operational sites.
Cache features of the PPC603 and PPC604e, the two microprocessors retained as hardware architecture for LIREChèques, are reported in Table 1. Figure 10 presents the global results of the on-line tests performed on the data caches of these two microprocessors. We can see through the miss ratio that the behavior of the PPC604e data cache is slightly better than that of the PPC603. Estimating an ideal miss ratio is not easy since it depends on the algorithmic structure of the program and the cache characteristics. In large caches (>16 KB), the miss ratio is usually very small. The global cache performances are therefore satisfying (the miss ratio varies from 0.17% to 0.51% for the PPC604e). LIREChèques, which assembles a large variety of algorithms, seems to have significant data locality. Indeed, the program references the same memory locations multiple times within a short period via the use of the global data structures (see Section 1). However, we notice that the miss ratio is not constant and depends strongly on the type of the check and its specific processing. Statistical results can also be analyzed for each program function using the location information collected during the execution. Table 2 and Table 3 present, for the PPC603 and the PPC604e, the proportion of functions with bad performance according to miss ratio and number of references.

Table 1. PPC603 and PPC604e cache features

Microprocessor                                      PPC603      PPC604e
Number of lines (instruction cache/data cache)      256/256     1024/1024
Number of bytes per line (instruction/data cache)   32/32       32/32
Associativity                                       2           4
Block replacement policy                            LRU         LRU
Instruction cache and data cache                    separated   separated

Figure 10. Cache performance for the PPC603 and PPC604e


Table 2. Results of on-line tests on the PPC603 (L1; number of lines: 256; bytes per line: 32; associativity: 2)

Image no. (total number of functions)     1 (697)   2 (779)   3 (677)   4 (399)   5 (776)
Miss ratio > 2%, references > 1,000       15%       16.8%     15%       7.5%      18%
Miss ratio > 2%, references > 10,000      7.9%      9.11%     6.6%      2.5%      9.5%
Miss ratio > 2%, references > 100,000     3.5%      3.6%      3%        0.75%     4.1%
Miss ratio > 6%, references > 1,000       5.3%      6.2%      6%        0.75%     6.5%
Miss ratio > 6%, references > 10,000      3.15%     3.5%      2.5%      0.25%     3.3%

Although good results have been achieved at the global level, many functions have a miss rate 10 times higher than the global miss ratio. For further improvements of our work, it will therefore be interesting to analyze these problematic program blocks and to transform them when it is beneficial. One can find, for instance in [11, 26], program transformations based on small source-code changes which can greatly improve cache performance. Tests of L2 caches on the LIREChèques architecture also remain for future work.

3.3. Optimization of the heap memory use

The LIREChèques system consists of a set of nearly 100 algorithmic steps linked sequentially. During a check processing, a large amount of data is collected. Some of these data are used in several steps, such as the check image for example: it is created at the scanning stage and is used until the literal and numeral amounts have been located and extracted (Figure 1). The useful information is thus stored in global data structures which are reset at the end of every check processing and re-initialized for the processing of the next check. Since each bank check has its own characteristics (which results, for a multi-bank check reading system, in the processing of variable backgrounds, variable sizes and text field locations), the amount of data stored in the global data structures can vary a lot from one bank check to another. Such an example is presented in Figure 11, where check 2 will require much more memory than check 1 (due to a more complex background and a larger set of words and characters to recognize).

Table 3. Results of on-line tests on the PPC604e (L1; number of lines: 1024; bytes per line: 32; associativity: 4)

Image no. (total number of functions)     1 (697)   2 (779)   3 (677)   4 (399)   5 (776)
Miss ratio > 2%, references > 1,000       8.9%      9.11%     8.1%      2.7%      9.7%
Miss ratio > 2%, references > 10,000      4.45%     4.3%      3%        0.75%     4.6%
Miss ratio > 2%, references > 100,000     1.86%     1.8%      1.2%      0.25%     2.2%
Miss ratio > 6%, references > 1,000       2.7%      3%        3.1%      0.25%     3%
Miss ratio > 6%, references > 10,000      1.3%      1.3%      8.8%      0%        1.1%

Figure 11. Examples of bank checks

Figure 12. Maximum heap memory occupation for each bank check

Figure 13. Maximum heap memory occupation for each step of the check processing

From the memory use point of view, multi-bank check processing thus has to cope with large variations in time and in heap memory occupation. As an example, tests performed on a representative set of 1000 bank checks clearly show a large variation (from 1.2 MBytes up to 5.7 MBytes) in the maximum heap memory occupation of each bank check processing (see Figure 12). Figure 13 also presents the maximum heap memory occupation for each step of the check processing (over the same 1000 checks). It shows that some algorithmic steps require much more memory than the others (i.e., the connected components extraction steps for the literal and numeral amounts, and the data fusion step). To cope with the problem of memory saturation, which interrupts the bank check processing, information recorded in the global data structures has to be cleared from heap memory as soon as possible, i.e. when it is no longer used. This requires some knowledge about the accesses to the information collected in the global data structures along the overall check processing. In other words, we have to know “where and when” each datum or each set of data of the global structures is accessed. To realize this, the proposed method consists in tracking, along the overall check processing, the addresses corresponding to the data stored in the global data structures, and in collecting useful information such as the first and last accesses to the data and the location of these accesses. Resetting data from the global data structures as soon

as possible (i.e. after their last access) results in a decrease of the maximum heap memory occupation during a bank check processing. The tracking mode is presented in Figure 14. The correspondence table contains the addresses of the data stored in the global data structures and allows the memory references to be matched. If a memory reference matches an element of the correspondence table, the tracking information about this element (location of the first and last accesses) is updated. Specific rules are used to reduce the size of the correspondence table. The matching between memory references and the correspondence table is therefore performed on a reduced set of addresses, which keeps the tracking mode efficient. The rules take into account some data structure properties: (i) a data structure is a juxtaposition of contiguous objects (called “members”) with possibly different types; (ii) an object is defined by a single address or a set of addresses. It must be

Figure 14. The tracking mode

Figure 15. Rules applied for the construction of the correspondence table

noted that there is no recursive data structure nor union structure in the global data structures of LIREChèques. Figure 15 shows an example of a data structure and the rules we can use to build a reduced correspondence table. The output of a LIREChèques execution in tracking mode is a set of statistical results about the tracking information, which describe the accesses to the global data structures along the overall check processing. An analysis of these results makes it possible to determine the data structure objects which can be cleared from heap memory as soon as possible. The most interesting objects from the optimization point of view are those of large size and those which are used only during a reduced number of algorithmic steps; these guarantee a beneficial freeing of heap memory. From these results, 8 object families have been retained for the heap memory optimization, such as the bank check image, the images of the literal and numeral amounts, the connected components, and so on [30]. The maximum heap memory occupation obtained after such an optimization is presented in Figure 16 and Figure 17. The tests, performed on the same representative set of 1000 bank checks, clearly show a large decrease in the maximum heap memory occupation in comparison with the initial version of LIREChèques (2 MBytes less than the initial version for the check with the maximum heap memory occupation; see Figures 12 and 13). These results are very encouraging and show the interest of our approach. This kind of optimization copes with the problem of memory saturation, which interrupts a bank check processing and decreases LIREChèques performance.


Figure 16. Maximum heap memory occupation for each bank check after heap memory optimization (to compare with Figure 12)

Figure 17. Maximum heap memory occupation for each step of the check processing after heap memory optimization (to compare with Figure 13)

4. Conclusion

In this paper, we have presented a software method for address tracing via the instrumentation of C based languages. This approach, which makes our tool portable and flexible, can be applied to any C-based program. As an example of its robustness, the tool has been applied to the memory optimization of LIREChèques, an automatic bank check reading system: this system has been totally instrumented, even though some instrumentation problems were encountered because C based languages are very permissive. Although the execution of the system is slowed down, the instrumented version of LIREChèques gives the same output results as the initial version. A memory test platform has been developed to achieve the memory optimization of the data cache use via a cache simulator, and the improvement of the dynamic memory occupation. This platform allows either on-line or off-line cache simulation for any C-based program. The on-line tests presented in Section 3 are the first results obtained as part of the memory optimization of LIREChèques. The data cache use optimization was the


first application of our approach. Caches of the PowerPC microprocessors (600 series) have been simulated because this type of processor has been retained for the hardware architecture of LIREChèques. Results show that the global cache performance is correct, and we notice that the behavior of the PPC604e data cache is slightly better than that of the PPC603. An analysis of the statistical results at the function level points out problematic program blocks. Some program transformations can therefore be beneficial for improving performance. It will also be interesting, for further improvements of our work, to test some L2 cache systems. The second application we have presented is the heap memory use optimization. Through address tracking, we have been able to collect information about the accesses to the global data structures of LIREChèques. The tests performed on a representative set of 1000 bank checks have shown that the maximum heap memory occupation can be reduced from 5.7 MBytes to 3.8 MBytes for the most problematic bank check. The applicability of our approach has been shown with the results of practical experiments on data cache simulation and dynamic memory analysis. Other potential applications include program analysis (program profiling, running time analysis, etc.) and memory hierarchy analysis (TLB behavior, prefetching, cache behavior prediction, etc.).

References

1. A. Agarwal, R. L. Sites, and M. Horowitz. ATUM: a new technique for capturing address traces using microcode. Proceedings of the 13th International Symposium on Computer Architecture, pp. 119–127, June 1986.
2. T. Ball and J. R. Larus. Optimally profiling and tracing programs. ACM Transactions on Programming Languages and Systems (TOPLAS), 16:1319–1360, 1994.
3. F. Bodin, P. Beckman, D. Gannon, S. Narayana, and S. Srinivas. Sage++: a class library for building Fortran 90 and C++ restructuring tools. Technical Report, CICA, Indiana University, 1992.
4. F. Bodin, P. Beckman, D. Gannon, J. Gotwals, S. Narayana, S. Srinivas, and B. Winnicka. Sage++: an object-oriented toolkit and class library for building Fortran and C++ restructuring tools. OON-SKI, 1994.
5. D. Callahan, K. Kennedy, and A. Porterfield. Software prefetching. Proceedings of the Fourth International Conference on Architectural Support for Programming Languages and Operating Systems, April 1991.
6. P. Cao, E. W. Felten, A. R. Karlin, and K. Li. Implementation and performance of integrated application-controlled file caching, prefetching and disk scheduling. ACM Transactions on Computer Systems, 14:311–343, 1996.
7. J. B. Chen. A simulation based study of TLB performance. Proceedings of the 19th Annual International Symposium on Computer Architecture, pp. 114–123, May 1992.
8. D. W. Clark and J. S. Emer. Performance of the VAX 11/780 translation buffer: simulation and measurement. ACM Transactions on Computer Systems, 3:270–301, 1985.
9. C. Ferdinand, F. Martin, and R. Wilhelm. Applying compiler techniques to cache behavior prediction. Proceedings of the ACM SIGPLAN 1997 Workshop on Languages, Compilers, and Tools for Real-Time Systems, pp. 37–46, 1997.
10. J. K. Flanagan, B. Nelson, J. Archibald, and K. Grimsrud. BACH: BYU address collection hardware; the collection of complete traces. Proceedings of the 6th International Conference on Modeling Techniques and Tools for Computer Performance Evaluation, September 1992.
11. D. Gannon, W. Jalby, and K. Gallivan. Strategies for cache and local memory management by global program transformation. Journal of Parallel and Distributed Computing, 5:587–616, 1988.
12. J. D. Gee, M. D. Hill, D. N. Pnevmatikatos, and A. J. Smith. Cache performance of the SPEC benchmark suite. Technical Report, University of Wisconsin-Madison, 1991.


13. J. Griffioen and R. Appleton. Performance measurements of automatic prefetching. Proceedings of the ISCA International Conference on Parallel and Distributed Computing Systems, September 1995.
14. K. Grimsrud, J. Archibald, M. Ripley, K. Flanagan, and B. Nelson. BACH: a hardware monitor for tracing microprocessor-based systems. Microprocessors and Microsystems, 17, October 1993.
15. J. L. Hennessy and D. A. Patterson. Computer Architecture: A Quantitative Approach. Morgan Kaufmann, San Mateo, Calif., 1990.
16. L. Heutte, P. Pereira, O. Bougeois, J. V. Moreau, B. Plessis, P. Courtellemont, and Y. Lecourtier. Multi-bank check recognition system: consideration on the numeral amount recognition module. Special Issue on Automatic Bankcheck Processing, International Journal of Pattern Recognition and Artificial Intelligence, 11:595–618, 1997.
17. M. D. Hill. Aspects of cache memory and instruction buffer performance. Ph.D. thesis, University of California at Berkeley, Computer Sciences Division, November 1987.
18. M. D. Hill. A case for direct-mapped caches. IEEE Computer, 21:25–40, 1988.
19. M. D. Hill, J. R. Larus, A. R. Lebeck, M. Talluri, and D. Wood. Wisconsin architectural research tool set. Computer Architecture News, 1993.
20. B. R. Hunt. ACS version 2.0. Parallel Architecture Research Laboratory, 1997.
21. A. Kishon, P. Hudak, and C. Consel. Monitoring semantics: a formal framework for specifying, implementing and reasoning about execution monitors. Proceedings of the SIGPLAN'91 Conference on Programming Language Design and Implementation, pp. 338–352, June 1991.
22. M. Kumar. Measuring parallelism in computation-intensive scientific/engineering applications. IEEE Transactions on Computers, 37:1088–1098, 1988.
23. J. R. Larus. Abstract execution: a technique for efficiently tracing programs. Software Practice and Experience, 20:1241–1258, 1990.
24. J. R. Larus. Efficient program tracing. IEEE Computer, 26:52–61, 1993.
25. J. R. Larus and E. Schnarr. EEL: machine-independent executable editing. Proceedings of the ACM SIGPLAN'95 Conference on Programming Language Design and Implementation (PLDI), ACM SIGPLAN Notices, vol. 30, pp. 291–300, June 1995.
26. A. R. Lebeck and D. A. Wood. Cache profiling and the SPEC benchmarks: a case study. IEEE Computer, 27:15–26, 1994.
27. Y. A. Liu and G. Gomez. Automatic accurate time-bound analysis for high-level languages. Proceedings of the ACM SIGPLAN'98 Workshop on Languages, Compilers, and Tools for Embedded Systems, Lecture Notes in Computer Science, vol. 1474, pp. 31–40, Springer-Verlag, New York, 1998.
28. B. Mohr, D. Brown, and A. Malony. TAU: a portable parallel program analysis environment for pC++. Proceedings of CONPAR'94, VAPP VI, University of Linz, Austria, LNCS 854, pp. 29–40, September 1994.
29. P. Pereira, L. Heutte, O. Bougeois, J. V. Moreau, B. Plessis, P. Courtellemont, and Y. Lecourtier. Numeral amount recognition on multi-bank checks. Proceedings of the 13th International Conference on Pattern Recognition (ICPR'96), Vienna, Austria, vol. 3, pp. 165–169, August 25–30, 1996.
30. P. Pereira. Optimization of a bank check reading system. Ph.D. thesis, University of Rouen, France, February 1999 (in French).
31. C. A. Prete. Cachesim: a graphical software environment to support the teaching of computer systems with cache memories. Proceedings of the 7th SEI Conference on Software Engineering Education, San Antonio, pp. 317–327, Springer-Verlag, New York, 1994.
32. A. J. Smith. Cache memories. ACM Computing Surveys, 14:473–530, 1982.
33. L. Unnikrishnan, S. D. Stoller, and Y. A. Liu. Automatic accurate stack space and heap space analysis for high-level languages. Technical Report, Computer Science Department, Indiana University, April 2000.