QuickSearch:   Number of matching entries: 0.

Search Settings

    AssignedDue DateReviewAuthorTitleYearJournal/ProceedingsDOI/URL
    hiren 2008.07.15 Arumi, P. & Amatriain, X. A Timeliness Approach to Synchronous Dataflows for Callback-based Multimedia Systems 2008    
    Abstract: Software-based reactive multimedia computation systems

    are pervasive today in desktops but also in mobile and ultra-

    portable devices. Most such systems offer a callback-based architecture to incorporate specific stream processing.

    The Synchronous Dataflow model (SDF) and its variants

    is appropriate for many stream processing problems such as

    the ones involving video and audio. SDF allows for static

    scheduling of multi-rate processing graphs therefore enabling

    optimal run-time efficiency. But the SDF abstraction does not adapt well to architectures such as the callback-driven ones as it lacks the notion of time. Executing non-trivial

    schedules of multi-rate dataflows in a callback architecture,

    though possible through buffering, causes jitter, excessive latency and run-time inefficiencies. In this paper we formally describe a new Callback Trig-

    gered SDF (CTSDF) model with a static scheduling al-

    gorithm that produces periodic schedules than can be

    split among several callback activations, solving the above-

    mentioned problems. The model has the same expressive-

    ness than SDF, in the sense that any graph computable by

    one model will also be computable by the other. Addition-

    ally, it enables parallelization and run time load balancing

    between callback activations.

    Review: The authors present a callback-triggered synchronous dataflow model of computation (CTSDF). Their motivation is that SDF programs when used in callback architectures cause the following issues: 1) introduce higher latency than required, 2) introduce jitter and 3) runtime overhead due to context switching of threads. Their approach resolves these issues by running the schedule within a callback routine and constructing a particular type of schedule. An SDF graph is augmented by identifying input and output nodes. The schedule is then separated into two parts: the prologue that fires input and output nodes such that the execution buffers in new tokens and buffers out the delays added on the arcs to output nodes, and the period. A callback activation runs the periodic schedule with the input and output nodes executed for the same number of times. The authors then go through a series of long and verbose proofs on conditions for CTSDF. Importantly they show that any CTSDF and SDF have equivalent computability.

    Overall I find that this article has many errors in its presentations. Examples in Figure 3 and 5 seem to have mistakes in them (or older versions of the figure). So although I did not thoroughly check the proofs, I would not be surprised if there were errors in that as well. The overall idea and approach seems to be correct in reducing the latency and jitter. I find that this approach is just providing a restriction for SDF schedules such that output is produced every callback invocation and sufficient tokens are added as delays for correct schedule execution. I also don't see what they mean by timelineness in their title.

      author = {Pau Arumi and Xavier Amatriain},
      title = {A Timeliness Approach to Synchronous Dataflows for Callback-based Multimedia Systems},
      year = {2008}
    Rhishikesh 2008.06.13 Avissar, O., Barua, R. & Stewart, D. An optimal memory allocation scheme for scratch-pad-based embedded systems 2002 ACM Transactions on Embedded Computing Systems (TECS)
    Vol. 1(1), pp. 6-26 
    Abstract: This article presents a technique for the efficient compiler management of software-exposed heterogeneous memory. In many lower-end embedded chips, often used in microcontrollers and DSP processors, heterogeneous memory units such as scratch-pad SRAM, internal DRAM, external DRAM, and ROM are visible directly to the software, without automatic management by a hardware caching mechanism. Instead, the memory units are mapped to different portions of the address space. Caches are avoided due to their cost and power consumption, and because they make it difficult to guarantee real-time performance. For this important class of embedded chips, the allocation of data to different memory units to maximize performance is the responsibility of the software.Current practice typically leaves it to the programmer to partition the data among different memory units. We present a compiler strategy that automatically partitions the data among the memory units. We show that this strategy is optimal, relative to the profile run, among all static partitions for global and stack data. For the first time, our allocation scheme for stacks distributes the stack among multiple memory units. For global and stack data, the scheme is provably equal to or better than any other compiler scheme or set of programmer annotations. Results from our benchmarks show a 44.2% reduction in runtime from using our distributed stack strategy vs. using a unified stack, and a further 11.8% reduction in runtime from using a linear optimization strategy for allocation vs. a simpler greedy strategy; both in the case of the SRAM size being 20% of the total data size. For some programs, less than 5% of data in SRAM achieves a similar speedup.
    Review: Addresses the problem of static allocation of global and stack data for heterogeneous memory system (i.e. multiple memories with different latencies, e.g. SRAM, DRAM, EEPROM, ROM, etc).

    0/1 ILP formulation.

    Optimal with respect to a profile run – uses access frequencies of variables measured from profile run.

    Distributed stack: otherwise, programmers and compilers assign the whole stack to one kind of memory.

    Disadvantages of caches: timing unpredictability, cost, power consumption.

    Formulation for global variables

    * Variables are treated atomically. This includes arrays. So an array will reside entirely in one memory unit.

    * Allocation is compile-time and static – it is decided at compile-time where each global variable will sit, and it's not changed throughout the execution.

    * Based on profile data – run the program a few times and collect frequencies of accesses to all variables. Thus, we will sort of minimize average time.

    * Memory units have constant latencies. And looking at the objective function, it seems the latency is constant irrespective of sizes of variables! Incorporating sizes wouldn't be difficult.

    * Only one memory access per cycle – modeling overlapping accesses requires max function, which is nonlinear.

    If there are U memory units, and G global variables, have 0/1 variables I j(v i), which is 1 if variable v i is allocated in memory j. Add constraint saying that a variable can exist in exactly one memory unit. Add size constraint for each memory unit using sizes of variables.

    Objective function is just total access time given the access frequencies of all variables.

    With only two memory units (e.g. a cache and main memory), the formulation is exactly a 0-1 knapsack problem. With multiple units, it's like having multiple knapsacks. QUESTION: do people talk about multiple knapsack problem? Didn't see such thing on wikipedia.

    QUESTION: what is more efficient method for solving knapsack problem – a specialized algorithm (e.g. dynamic programming), or ILP?

    Extension to stack variables

    Stack is distributed across all memory units. So different variables can be allocated to different memory units. Problem: function call overhead increases, as multiple stack pointers must be updated. Hence, alternative 1: all the variables of a function reside in one memory unit. Alternative 2: the original unrestricted case. Use hybrid approach: 1 for small functions, 2 for big functions.

    Introduce additional variables for each of the function variables. For alternative 1, whole stack frame of a function is treated as one variable. From earlier formulation, only the size constraint gets complicated: instead of one global size constraint per memory unit, there is one size constraint for every path from the root to leaves in the call graph.

    Because this is static allocation, it is context-insensitive: even if a function is called from multiple places, allocation of its variables is fixed.

      author = {Avissar, O. and Barua, R. and Stewart, D.},
      title = {An optimal memory allocation scheme for scratch-pad-based embedded systems},
      journal = {ACM Transactions on Embedded Computing Systems (TECS)},
      publisher = {ACM Press New York, NY, USA},
      year = {2002},
      volume = {1},
      number = {1},
      pages = {6--26},
      url = {http://portal.acm.org/ft_gateway.cfm?id=581891&type=pdf&coll=GUIDE&dl=GUIDE&CFID=69991052&CFTOKEN=83726364}
    Isaac 2008.07.09 Delvai, M., Huber, W., Puschner, P. & Steininger, A. Processor Support for Temporal Predictability — The SPEAR Design Example 2003 ecrts
    Vol. 00, pp. 169 
    Abstract: The demand for predictable timing behavior is characteristic for real-time applications. Experience has shown that this property cannot be achieved by software alone but rather requires support from the processor. This situation is analyzed and mapped to a design rationale for SPEAR (Scalable Processor for Embedded Applications in Real-time Environments), a processor that has been designed to meet the speci.c temporal demands of real-time systems. At the hardware level, SPEAR guarantees interrupt response with minimum temporal jitter and minimum delay. Furthermore, the processor provides an instruction set that only has constant-time instructions. At the software level, SPEAR supports the implementation of temporally predictable code according to the single-path programming paradigm. Altogether these features support writing of code with minimal jitter and provide the basis for exact temporal predictability. Experimental results show that SPEAR indeed exhibits the anticipated highly predictable timing behavior.
      author = {Martin Delvai and Wolfgang Huber and Peter Puschner and Andreas Steininger},
      title = {Processor Support for Temporal Predictability — The SPEAR Design Example},
      journal = {ecrts},
      publisher = {IEEE Computer Society},
      year = {2003},
      volume = {00},
      pages = {169},
      doi = {http://doi.ieeecomputersociety.org/10.1109/EMRTS.2003.1212740}
    Rhishikesh 2008.05.30 Deverge, J.-F., Deverge, J.-F. & Puaut, I. WCET-Directed Dynamic Scratchpad Memory Allocation of Data 2007 Proc. 19th Euromicro Conference on Real-Time Systems ECRTS '07, pp. 179-190  DOI  
    Abstract: Many embedded systems feature processors coupled with a small and fast scratchpad memory. To the difference with caches, allocation of data to scratchpad memory must be handled by software. The major gain is to enhance the predictability of memory accesses latencies. A compile-time dynamic allocation approach enables eviction and placement of data to the scratchpad memory at runtime. Previous dynamic scratchpad memory allocation approaches aimed to reduce average-case program execution time or the energy consumption due to memory accesses. For real-time systems, worst-case execution time is the main metric to optimize. In this paper, we propose a WCET-directed algorithm to dynamically allocate static data and stack data of a program to scratchpad memory. The granularity of placement of memory transfers (e.g. on function, basic block boundaries) is discussed from the perspective of its computation complexity and the quality of allocation.
    Review: Determining load-store targets

    Storage types: static, stack, heap, literals.

    Access types: scalar, regular, irregular and input-dependent accesses. Regular means regular array access. Scalar and regular mean no pointers. Irregular uses pointers but is not input dependent.

    Table 2 gives statistics of access and storage types for worst-case execution path of programs. QUESTION: why just worse-case path?

    Conclusion: no heap accesses, but all access types for static and stack data.

    Pointer analysis

    Pointer analysis of GCC 4.1 is used to calculate all possible target addresses of every load/store instruction.

    QUESTION: how effective is this target calculation? Does it usually come up with a small number of targets for a load/store instruction? The paper doesn't present results on this, but it will have impact on allocation, because constraint 8 says that all targets of a load/store must be either all present or all absent in the scratchpad.

    WCET analysis

    Uses Heptane to obtain WCET and corresponding path in a program. The path is represented by number of times each basic block is executed along the path.

    For the analysis, you need:

    What's exactly estimated contribution to WCET reduction for data v scratchpad-allocated on edge e_j?

    Why subtract load store times from objective function?

    All constraints except the one imposing size of scratchpad memory are binary. Can SAT techniques be used on such optimization problems?

    Why allocation, de-allocation on edges and not in nodes? (in node means at the boundary of node, for example, at the start or end of a basic block.)

    What are the fallouts of constraint 6? It says that scratchpad contents must match on all outgoing edges of a node.

    Constraint 8 says that all the targets of a load/store instruction must be present or absent together in the scratchpad. What is the effectiveness of pointer analysis? Does it always come up with small number of targets for a load/store? Section 3.2.2 somehow relates this with architecture being static- or dynamic-scheduled. Didn't understand that.

    Data placement

    ILP formulation doesn't consider fragmentation. So, not all allocations will fit in the scratchpad.

    Question about address assignment to data region – it starts with the data region having the highest impact in WCET reduction. Let's say that region is somewhere downstream in the flow graph. So it is allocated first and gets the address 0 in the scratchpad. If the second-highest data region does not overlap with the first and it precedes first. So do we put it at address 0 or elsewhere? It seems valid to put it at address 0 if it's going to be evicted before first data region starts.

    Does data region end with a store? Yes it does.

      author = {Deverge, J.-F. and Deverge, J.-F. and Puaut, I.},
      title = {WCET-Directed Dynamic Scratchpad Memory Allocation of Data},
      booktitle = {Proc. 19th Euromicro Conference on Real-Time Systems ECRTS '07},
      year = {2007},
      pages = {179--190},
      doi = {http://dx.doi.org/10.1109/ECRTS.2007.37}
    Rhishikesh 2008.06.06 Dominguez, A., Udayakumaran, S. & Barua, R. Heap data allocation to scratch-pad memory in embedded systems 2005 Journal of Embedded Computing
    Vol. 1(4), pp. 521-540 
    Abstract: This paper presents the first-ever compile-

    time method for allocating a portion of the heap data

    to scratch-pad memory. A scratch-pad is a fast directly

    addressed compiler-managed SRAM memory that replaces

    the hardware-managed cache. It is motivated by its better

    real-time guarantees vs cache and by its significantly lower

    overheads in access time, energy consumption, area and

    overall runtime. Existing compiler methods for allocating

    data to scratch-pad are able to place only global and stack

    data in scratch-pad memory; heap data is allocated entirely

    in DRAM, resulting in poor performance. Runtime methods

    based on software caching can place heap data in scratch-

    pad, but because of their high overheads from software

    address translation, they have not been successful, especially

    for heap data.

    In this paper we present a dynamic yet compiler-directed

    allocation method for heap data that for the first time, (i)

    is able to place a portion of the heap data in scratch-pad;

    (ii) has no software-caching tags; (iii) requires no run-time

    per-access extra address translation; and (iv) is able to

    move heap data back and forth between scratch-pad and

    DRAM to better track the program’s locality characteristics.

    With our method, global, stack and heap variables can

    share the same scratch-pad. When compared to placing all

    heap variables in DRAM and only global and stack data

    in scratch-pad, our results show that our method reduces

    the average runtime of our benchmarks by 34.6%, and the

    average power consumption by 39.9%, for the same size of

    scratch-pad fixed at 5% of total data size.

      author = {Dominguez, A. and Udayakumaran, S. and Barua, R.},
      title = {Heap data allocation to scratch-pad memory in embedded systems},
      journal = {Journal of Embedded Computing},
      publisher = {IOS Press},
      year = {2005},
      volume = {1},
      number = {4},
      pages = {521--540},
      url = {http://www.ece.umd.edu/~barua/dominguez-JEC-05.pdf}
    Hiren 2008.06.06 Halang, W. Simplicity Considered Fundamental to Design for Predictability 2004 (03471)Perspectives Workshop: Design of Systems with Predictable Behaviour  URL 
    Abstract: Complexity is the core problem of contemporary information technology, as the artificial complicatedness of its artefacts is exploding Intellectually easy and economically feasible predictability can be achieved by selecting simplicity as fundamental design principle. Predictability of system behaviour is identified as the central concept for the design of real-time and embedded systems, since it essentially implies the other requirements timeliness and dependability holding for them. Practically all dynamic and virtual features aiming to enhance the average performance of computing systems as well as the traditional categories and optimality criteria are found inadequate and are, thus, considered harmful. In mainstream research on scheduling the gap between academic research and reality has grown so wide that research results are doomed to irrelevance. Instead, useful scheduling research ought to employ utmost simplicity as optimality criterion, and strive to minimise software size and complexity. Computing should embrace other disciplines’ notions and technologies of time. Programming and verification methods for safety-related applications are identified on the basis of their simplicity and ergonomic aptitude. It is advocated to utilise the permanent advances in microelectronics to solve long untackled problems and to foster simplicity and predictability by hardware support.
      author = {Wolfgang Halang},
      title = {Simplicity Considered Fundamental to Design for Predictability},
      booktitle = {Perspectives Workshop: Design of Systems with Predictable Behaviour},
      publisher = {Internationales Begegnungs- und Forschungszentrum fuer Informatik (IBFI), Schloss Dagstuhl, Germany},
      year = {2004},
      number = {03471},
      note = {$<$http://drops.dagstuhl.de/opus/volltexte/2004/4$>$ [date of citation: 2004-01-01]},
      url = {http://drops.dagstuhl.de/opus/volltexte/2004/4/pdf/03471_wolfgang_halang.pdf}
    Shanna 2008.05.30 Henzinger, T., Horowitz, B. & Kirsch, C. Giotto: a time-triggered language for embedded programming 2003 Proceedings of the IEEE
    Vol. 91(1), pp. 84-99 
    Abstract: Giotto provides an abstract programmer's model for the implementation of embedded control systems with hard real-time constraints. A typical control application consists of periodic software tasks together with a mode-switching logic for enabling and disabling tasks. Giotto specifies time-triggered sensor readings, task invocations, actuator updates, and mode switches independent of any implementation platform. Giotto can be annotated with platform constraints such as task-to-host mappings, and task and communication schedules. The annotations are directives for the Giotto compiler, but they do not alter the functionality and timing of a Giotto program. By separating the platform-independent from the platform-dependent concerns, Giotto enables a great deal of flexibility in choosing control platforms as well as a great deal of automation in the validation and synthesis of control software. The time-triggered nature of Giotto achieves timing predictability, which makes Giotto particularly suitable for safety-critical applications.
    Review: Giotto: A Time-triggered Language for Embedded Programming by Thomas Henzinger, Benjamin Horowitz, and Christoph Kirsch dissusses the Giotto language which was designed to provide a level of abstraction to enable easier communication between control engineers who manipulate differential equations to model a plant behavior and the software engineer who implement the models. "Giotto abstracts away from the realization of the software architecture on a specific platform, and frees the software engineer from worrying about issues such as hardware performance and scheduling mechanism while communicating with the control


    The implementation of the Giotto program, which requires no interaction with the control engineer is decoupled from the first task and can for the most part be automated by increasingly powerful compilers. Giotto can be compiled on different and even heterogeneous platforms. Basic functionality is the task (i.e. a periodically executed piece of code), and several concurrent tasks make up a mode. Tasks can be added or removed by switching from one mode to another. Tasks communicate with each other via drivers (code that converts and transports values between ports). Drivers are essentially bounded code that is assumed to execute instantaneously on the system. Giotto integrates scheduled computation (tasks) and synchronous communication (drivers). A Giotto program does not specify where, how and when tasks are scheduled. A Giotto program with tasks t1 and t2 can be compiled on platforms with a single CPU(by time sharing two tasks) as well as on platforms with two cpus(by parallelism). It can also be compiled on platforms with a preemptive priority scheduling (such as most real-time operating systems) as well as truly time-triggered platforms(such as the time-triggered architecture). The Giotto compiler just needs to ensure the the semantics of the program i.e functionality and timing - is preserved. So the compiler needs to solve a possibly distributed scheduling problem. The distributed scheduling problem can be rather difficult, so to make the job of the compiler easier a Giotto program can be annotated with compiler directives in the form of platform constraints i.e. you map a particular task to a particular cpu, assign a particular priority to a particular task or schedule a particular communication, or schedule a particular communication event between tasks in a particular slot.

      author = {Henzinger, T.A. and Horowitz, B. and Kirsch, C.M.},
      title = {Giotto: a time-triggered language for embedded programming},
      journal = {Proceedings of the IEEE},
      year = {2003},
      volume = {91},
      number = {1},
      pages = {84--99},
      note = {Shanna},
      url = {http://mtc.epfl.ch/~tah/Publications/giotto.pdf}
    Hiren 2008.05.30 Kadlec, A. & Kirner, R. On the Difficulty of Building a Precise Timing Model for Real-Time Programming? 14. Kolloquium Programmiersprachen und Grundlagen der Programmierung  URL 
    Abstract: For real-time computing it is important to know the worstcase

    execution time (WCET) of all time-critical software operations in

    order to ensure timeliness of the system. The calculation of a precise

    upper bound of theWCET relies on the availability of an adequate timing

    model of the target hardware.

    Within this article we explore the different mechanisms of modern processors

    that lead to complex timing models. We explore the different types

    of memory elements within a processor that resemble the state of the processor.

    Further, we compare the compile-time knowledge and run-time

    knowledge and discuss the consequences of offline (compiler) and online

    (hardware) code optimization. The main consequence of this hardware

    exploration is that real-time computing needs co-design of compilation,

    timing analysis, and processor optimizations to improve temporal predictability

    of the system.

      author = {Kadlec, A. and Kirner, R.},
      title = {On the Difficulty of Building a Precise Timing Model for Real-Time Programming?},
      journal = {14. Kolloquium Programmiersprachen und Grundlagen der Programmierung},
      note = {Ben},
      url = {http://www.isp.uni-luebeck.de/kps07/files/papers/kadlec.pdf}
    Ben 2008.05.23 Kirner, R. & Puschner, P. Obstacles in Worst-Case Execution Time Analysis 2008 isorc
    Vol. 0, pp. 333-339 
    Abstract: The analysis of the worst-case execution time (WCET) requires detailed knowledge of the program behavior. In practice it is still not possible to obtain all needed information automatically. In this paper we present the current state of the art of WCET analysis and point to the main problems to be solved. The most eminent problem is the state problem, i.e., the precise determination of possible processor states at different program locations. The path problem refers to the fact that current tools are not able to calculate all (in)feasible paths automatically. We discuss how the main open problems manifest themselves in static and in measurement-based WCET analysis methods.
    Review: Obstacles in Worst-Case Execution Time Analysis" describes current

    problems for developing tight worst case execution time bounds. In

    general, determining WCET requires managing all possible paths that a

    program can execute as well as managing all the states that a program.

    For example, if the state of a cache affects how long a memory

    operation will take, then any WCET analysis that hopes to be tight

    must manage the possible states of the cache. This is exacerbated by

    the abundance of hardware components that have state that affects

    their timing, as well as the interaction between these components and

    the flow of control or data memory of a program. They contend that

    the problem of managing all these possible states is one of the

    largest problems in WCET analysis.

    They then present two possible approaches to hierarchically managing

    WCET analysis by partitioning hardware into components and then

    composing those analyses. Each of these two approaches makes

    different requirements of the hardware, and can only be used when that

    requirement is met. The first, called "Delta-Composiotion" , can only

    be applied when the effect of changing the state of one component on

    the whole program cannot be larger than the effect to the timing of

    that one component. The second, called "Max-Composiotion" , can only

    be applied when the relative utilization of a hardware component over

    a pair of states is the same as the relative utilization of a larger

    system over those same states. They then assert that on any systems

    where at least one of these requirements is met, WCET analysis is made

    more feasible.

      author = {Raimund Kirner and Peter Puschner},
      title = {Obstacles in Worst-Case Execution Time Analysis},
      journal = {isorc},
      publisher = {IEEE Computer Society},
      year = {2008},
      volume = {0},
      pages = {333-339},
      note = {Ben},
      doi = {http://doi.ieeecomputersociety.org/10.1109/ISORC.2008.65}
    Shanna 2008.06.13 Kirner, R. & Puschner, P. Time-Predictable Task Preemption for Real-Time Systems with Direct-Mapped Instruction Cache 2007 Object and Component-Oriented Real-Time Distributed Computing, 2007. ISORC '07. 10th IEEE International Symposium on, pp. 87-93  DOI  
    Abstract: Modern processors used in embedded systems are becoming increasingly powerful, having features like caches and pipelines to speedup execution. While execution speed of embedded software is generally increasing, it becomes more and more complex to verify the correct temporal behavior of software, running on this high-end embedded computer systems. To achieve time-predictability the authors introduced a very rigid software execution model with distribution being realized based on the time-triggered communication model. In this paper we analyze the time-predictability of a preempting task-activation, running on a hardware with direct-mapped instruction caches. As one result we analyze why a task-preemption driven by a clock interrupt is not suitable to guarantee time-predictability. As a second result, we present a time-predictable task-preemption driven by an instruction counter.
      author = {Kirner, R. and Puschner, P.},
      title = {Time-Predictable Task Preemption for Real-Time Systems with Direct-Mapped Instruction Cache},
      journal = {Object and Component-Oriented Real-Time Distributed Computing, 2007. ISORC '07. 10th IEEE International Symposium on},
      year = {2007},
      pages = {87-93},
      doi = {http://dx.doi.org/10.1109/ISORC.2007.56}
    Shanna 2008.05.30 Krishnan, N.V. Real-Time Systems Design in Ptolemy II: A Time-Triggered Approach 2004 (Memorandum No. UCB/ERL M04/22)  URL 
      author = {N. Vinay Krishnan},
      title = {Real-Time Systems Design in Ptolemy II: A Time-Triggered Approach},
      year = {2004},
      number = {Memorandum No. UCB/ERL M04/22},
      url = {www.eecs.berkeley.edu/Pubs/TechRpts/2004/4226.html}
    Isaac 2008.06.06 Kurian, L., Hulina, P.T. & Coraor, L.D. Memory latency effects in decoupled architectures with a single data memory module 1992 SIGARCH Comput. Archit. News
    Vol. 20(2), pp. 236-245 
    Abstract: Decoupled computer architectures partition the memory access and execute functions in a computer program and achieve high performance by exploiting the fine-grain parallelism between the two. These architectures make use of an access processor to perform the data fetch ahead of demand by the execute process and hence are often less sensitive to memory access delays than conventional architectures. Past performance studies of decoupled computers used memory systems that are interleave or pipelined. We undertake a simulation study of the latency effects in decoupled computers when connected to a single, conventional non-interleaved data memory module so that the effect of decoupling is isolated from the improvement caused by interleaving. We compare decoupled computer performance to single processors with caches, study the memory latency sensitivity of the decoupled systems, and also perform simulations to determine the significance of data caches in a decoupled computer architecture. The Lawrence Livermore Loops and two signal processing algorithms are used as the simulation benchmark
    Review: This paper tries to analyze the performance effects of decoupled architectures vs caches, and also observe the effect of adding caches to decoupled architectures. After reading the paper, it made me realize that the function of access processor being decoupled from the execute processor is almost like a more powerful cache with a different caching algorithm ,that's why the paper compares the two. The results show that the decoupled architecture and cache separately perform well for different applications, thus, back to my previous conclusion, we have to evaluate the applications of pret and whether decoupled architecture really benefits. Adding caches to decoupled architectures also only have minor effects for general applications. So far there is no paper on how predictable the timing is for decoupled architectures, it seems like for the access processor, it still runs into the same issue of uncertain memory access time, unless of course every access is set cycles to memory. We need to look at the communication between access processor and execute processor to see how predictable this thing can get.
      author = {Lizyamma Kurian and Paul T. Hulina and Lee D. Coraor},
      title = {Memory latency effects in decoupled architectures with a single data memory module},
      journal = {SIGARCH Comput. Archit. News},
      publisher = {ACM},
      year = {1992},
      volume = {20},
      number = {2},
      pages = {236--245},
      doi = {http://doi.acm.org/10.1145/146628.140380}
    Isaac 2008.05.23 Milidonis, A., Alachiotis, N., Porpodas, V., Michail, H., Kakarountas, A. & Goutis, C. A Decoupled Architecture of Processors with Scratch-Pad Memory Hierarchy 2007 Design, Automation & Test in Europe Conference & Exhibition, 2007. DATE'07, pp. 1-6  URL 
    Abstract: This paper present a decoupled architecture of processors with a memory hierarchy of only scratch-pad memories, and a main memory. The decoupled architecture also exploits the parallelism between address computation and processing the application data. The application code is split in two programs the first for computing the addresses of the data in the memory hierarchy and the second for processing the application data. The first program is executed by one of the decoupled processors called Access which uses compiler methods for placing data in the memory hierarchy. In parallel, the second program is executed by the other processor called Execute. The synchronization of the memory hierarchy and the Execute processor is achieved through simple handshake protocol. The Access processor requires strong communication with the memory hierarchy which strongly differentiates it from traditional uniprocessors. The architecture is compared in performance with the MIPS IV architecture of SimpleScalar and with the existing decoupled architectures showing its higher normalized performance. Experimental results show that the performance is increased up to 3.7 times. Compared with MIPS IV the proposed architecture achieves the above performance with insignificant overheads in terms of area
    Review: The main idea to take away from this is that it splits the memory access operations and the actual execution operations of a program into 2 processors. Basically, they have an extra processor that does what a cache controller does, except it's for scratch pad memories. The architecture itself consists of 2 processors, main memory, and different levels of scratch pad memory. The processor that controls the scratchpad allocations and memory movement is called the Access processor. The "normal" processor is called the Execute Processor. The Execute Processor does not have load store instructions, but instead, uses a communication buffer and interrupts to communicate with the Access processor which handles the DMA's from memory to scratchpad, and even directly loading it into the Execute Processor's registers.

    Couple things need more explanation. The protocol explained between the Execute Processor and the Access processor involves only small (16 byte) buffer. However, this is where they claim the performance gain comes from, and where the difference between this and a cache hierarchy. Basically,a cache will miss and this architecture supposedly gains by not missing and also prefetching. This comes from knowing which memory location to fetch, which this paper just refers to as a compiler output, but lack explanation. It was quite confusing to me how the Execute Processor relayed address information to the Access processor. I think more background reading is required from the related work section where it has some papers on compiler techniques, maybe we will get a better understanding.

      author = {Milidonis, A. and Alachiotis, N. and Porpodas, V. and Michail, H. and Kakarountas, AP and Goutis, CE},
      title = {A Decoupled Architecture of Processors with Scratch-Pad Memory Hierarchy},
      journal = {Design, Automation & Test in Europe Conference & Exhibition, 2007. DATE'07},
      year = {2007},
      pages = {1--6},
      note = {Isaac},
      url = {http://doi.ieeecomputersociety.org/10.1109/DATE.2007.364661}
    Ben 2008.07.18 Nguyen, N., Dominguez, A. & Barua, R. Memory allocation for embedded systems with a compile-time-unknown scratch-pad size 2005 CASES '05: Proceedings of the 2005 international conference on Compilers, architectures and synthesis for embedded systems, pp. 115-125  DOI URL 
    Abstract: This paper presents the first memory allocation scheme for embedded systems having scratch-pad memory whose size is unknown at compile time. A scratch-pad memory (SPM) is a fast compiler-managed SRAM that replaces the hardware-managed cache. Its uses are motivated by its better real-time guarantees as compared to cache and by its significantly lower overheads in energy consumption, area and access time.Existing data allocation schemes for SPM all require that the SPM size be known at compile-time. Unfortunately, the resulting executable is tied to that size of SPM and is not portable to processor implementations having a different SPM size. Such portability would be valuable in situations where programs for an embedded system are not burned into the system at the time of manufacture, but rather are downloaded onto it during deployment, either using a network or portable media such as memory sticks. Such post-deployment code updates are common in distributed networks and in personal hand-held devices. The presence of different SPM sizes in different devices is common because of the evolution in VLSI technology across years. The result is that SPM cannot be used in such situations with downloaded code.To overcome this limitation, this work presents a compiler method whose resulting executable is portable across SPMs of any size. The executable at run-time places frequently used objects in SPM; it considers code, global variables and stack variables for placement in SPM. The allocation is decided by modified loader software before the program is first run and once the SPM size can be discovered. The loader then modifies the program binary based on the decided allocation. To keep the overhead low, much of the pre-processing for the allocation is done at compile-time. Results show that our benchmarks average a 36% speed increase versus an all-DRAM allocation, while the optimal static allocation scheme, which knows the SPM size at compile-time and is thus an un-achievable upper-bound, is only slightly faster (41% faster than all-DRAM). Results also show that the overhead from our embedded loader averages about 1% in both code-size and run-time of our benchmarks.
      author = {Nghi Nguyen and Angel Dominguez and Rajeev Barua},
      title = {Memory allocation for embedded systems with a compile-time-unknown scratch-pad size},
      booktitle = {CASES '05: Proceedings of the 2005 international conference on Compilers, architectures and synthesis for embedded systems},
      publisher = {ACM},
      year = {2005},
      pages = {115--125},
      url = {http://portal.acm.org/ft_gateway.cfm?id=1086313&type=pdf&coll=GUIDE&dl=GUIDE&CFID=35707208&CFTOKEN=97425946},
      doi = {http://doi.acm.org/10.1145/1086297.1086313}
    Hiren 2008.10.08 Ouimet, M. & Lundqvist, K. Incorporating Time in the Modeling of Hardware and Software Systems: Concepts, Paradigms, and Paradoxes 2007 Modeling in Software Engineering, 2007. MISE '07: ICSE Workshop 2007. International Workshop on, pp. 5-5  DOI  
    Abstract: In this paper, we present some of the issues encountered when trying to apply model-driven approaches to the engineering of real-time systems. In real-time systems, quantitative values of time, as reflected through the duration of actions, are central to the system's correctness. We review basic time concepts and explain how time is handled in different modeling languages. We expose the inherent paradox of incorporating quantitative time-dependent behavior in high-level models. High-level models are typically built before the system is implemented, which makes quantitative time metrics difficult to predict since these metrics depend heavily on implementation details. We provide some possible answers to this paradox and explain how the Timed Abstract State Machine (TASM) language helps address some of these issues.
    Review: The authors suggest that modeling real-time applications requires quantative time and not just qualatative time. They define qualatative time as the ordering of events and the quantative time as the numerical duration between events. They indicate that timing requirements in real-time systems are also defined quantatively; thus, requiring a method of expressing it as such in modeling frameworks. They focus on two alternatives of describing time as duration of an action or a passage of time. For example, timed automata provide a method for specifying the passage of time. The authors go further to say that for real-time systems time as a duration of action is more appropriate and present their Timed Abstract State Machine approach.

    The authors indicate that there is a paradox of incorporating time in high-level models. That is, accurate time information is only available at low-level abstractions and specifying time at high-level models is an approximate at best. They propse an approach similar to PBD where the specification pushes constraints to the implementation and the implementation can tell whether it is possible or not and if not, the specification is adjusted.

      author = {Ouimet, M. and Lundqvist, K.},
      title = {Incorporating Time in the Modeling of Hardware and Software Systems: Concepts, Paradigms, and Paradoxes},
      journal = {Modeling in Software Engineering, 2007. MISE '07: ICSE Workshop 2007. International Workshop on},
      year = {2007},
      pages = {5-5},
      doi = {http://dx.doi.org/10.1109/MISE.2007.7}
    Ben 2008.06.06 Ozturk, O., Kandemir, M. & Kolcu, I. Shared Scratch-Pad Memory Space Management 2006 Quality Electronic Design, 2006. ISQED'06. 7th International Symposium on, pp. 576-584  URL 
    Abstract: Scratch-pad memories (SPMs) are important storage components in many embedded applications and used as an alternative or a complimentary storage to on-chip cache memories. One of the most critical issues in the context of SPMs is to select the data elements to place in them since the gap between SPM access latencies and off-chip memory access latencies keep increasing dramatically. Previous research considered this problem and attacked it using both static and dynamic schemes. Most of the prior efforts on data SPMs have mainly focused on single application scenarios, i.e., the SPM space available is assumed to be managed by a single application at any given time. While this assumption makes sense in certain domains, there also exist many cases where multiple applications need to share the same SPM space. This paper focuses on such a multi-application scenario and proposes a nonuniform SPM space partitioning and management across concurrently-executing applications. In our approach, the amount of data to be allocated to each application is decided based on the data reuse each application exhibits.
      author = {Ozturk, O. and Kandemir, M. and Kolcu, I.},
      title = {Shared Scratch-Pad Memory Space Management},
      journal = {Quality Electronic Design, 2006. ISQED'06. 7th International Symposium on},
      year = {2006},
      pages = {576--584},
      url = {http://ieeexplore.ieee.org/xpls/abs_all.jsp?arnumber=1613200}
    Hiren 2008.05.23 Pitter, C. & Schoeberl, M. Time Predictable CPU and DMA Shared Memory Access 2007 Field Programmable Logic and Applications, 2007. FPL 2007. International Conference on, pp. 317-322  URL 
    Abstract: In this paper, we propose a first step towards a time predictable computer architecture for single-chip multiprocessing (CMP). CMP is the actual trend in server and desktop systems. CMP is even considered for embedded realtime systems, where worst-case execution time (WCET) estimates are of primary importance. We attack the problem of WCET analysis for several processing units accessing a shared resource (the main memory) by support from the hardware. In this paper, we combine a time predictable Java processor and a direct memory access (DMA) unit with a regular access pattern (VGA controller). We analyze and evaluate different arbitration schemes with respect to schedulability analysis and WCET analysis. We also implement the various combinations in an FPGA. An FPGA is the ideal platform to verify the different concepts and evaluate the results by running applications with industrial background in real hardware.
    Review: The authors present two approaches for integrating a direct memory access controller into a real-time embedded processor architecture. Their architecture uses the Java Optimized Processor (JOP) and a VGA controller as a DMA unit. Both the DMA and procesor share the main memory. The approach used here is to treat the DMA unit as either a hard real-time task in rate monotonic scheduling or include the DMA memory acess time in the WCET analysis of the individual application tasks. They restrict the pattern at which it access the main memory as a regular pattern. This comes in two forms: blocked memory access and spread memory access scheme. Blocked is when large chunks of data are to be moved from the main memory. Tasks with blocked scheme have teh highest prioirty with fixed-priorty scheduling. This is ideal for streaming applications. Spread scheme denotes acases where the memory requests are spread out in timely fashion. Again the task must be of highest priority.

    The two approaches are implemented with the two access pattern schemes with an application. This application is evaluated to show that the second approach of including the DMA memory access in the WCET analysis of the individual application tasks provides tighter estimation and enables more processing resources for teh application.

      author = {Pitter, C. and Schoeberl, M.},
      title = {Time Predictable CPU and DMA Shared Memory Access},
      journal = {Field Programmable Logic and Applications, 2007. FPL 2007. International Conference on},
      year = {2007},
      pages = {317--322},
      note = {Hiren},
      url = {http://ti.tuwien.ac.at/rts/people/pitter/publications/jopvga_fpl2007.pdf}
    Sungjun 2008.06.06 Pospischil, G., Puschner, P., Vrchoticky, A. & Zainlinger, R. Developing real-time tasks with predictable timing Sep 1992 Software, IEEE
    Vol. 9(5), pp. 35-44 
    Abstract: A prototype development environment for the Mars maintainable real-time system is described. The Mars design environment supports the implementation of critical real-time tasks with time editing. It uses an experimental real-time language, Modula-R, which is based on Modula-2. The Mars real-time architecture, which is the target architecture of the Mars environment, is described. To illustrate the environment, the development of a simple real-time algorithm that implements insertion sort for an array is presented
    Review: This paper, written by Gustav Pospischil, Peter Puschner, Alexander Vrchoticky, and Ralph Zainlinger in Technical University of Vienna, explains a development environment for real-time systems, called Mars. It has several traits: cyclic tasks, dedicated programming language called Modula-R and timing analysis. Among them, in particular, the authors mainly introduce the timing analyzing feature, which shows timing information of a user application. It is composed of three components: timing tree, maximum time analyzer, and time editor.

    The timing tree is a simplified syntax tree designed to communicate the maximum time analyzer and the time editor. Also, the timing tree is constructed by timing-tree generation component. During assembly-code generation in compile time, the timing-tree generation traverses syntax tree and flow graph to create the timing tree for each procedure. The timing-tree nodes represent a program's sequential parts (basic blocks), branches, bounded loops, markers, and scopes. In the tree's leaves, the maximum timing information of the basic block (sequential part) is stored; from the leaves, timing information is generated by going up the scopes. If a marker, which is given by Modula-R language and limits loops in the marked range, is set, the marked range is defined as a scope and a timing-tree node.

    The maximum time analysis component has several formulas. For example, maximum execution time of sequential contructs is the sum of all the constructs; maximum execution time of branch construct is (maximum execution time of condition statement) + (maximum value between (maximum execution time of alternative_1) and (maximum execution time of alternative_2)). According to this information, the maximum time information of each timing-tree node is calculated.

    Finally, through the time editor, developers can see the timing information and edits it. To illustrate, for the code in Mars


    a < b1






    the numbers of the right-hand side designates execution times according to the code's conditions. If a is smaller than b, proc1(a) will be called; since the maximum time of proc1(a) is 36 and the execution time of comparison of a and b is 1, one of the maximum value in this conditional code is 37. When a user changes the execution time into 20, the maximum value is modified into 25. This method is recommended to find bottle neck of an algorithm.

    For PRET project, the timing analysis of Mars seems useful. In a PRET's application, every time we select a dead instruction, we can regard it as a node of the timing tree. If a task is an infinite loop and some scopes of the inside loop has to be cyclic, we can easily calculate how many cycles the scope will take by going up scope of the timing tree. Besides, if we use several dead instructions for each thread and each thread has to be synchronized, the timing information will be useful to figure out timing information affected by the synchronization. If a thread receives synchronization signals every x cycles from the other thread, it can be defined as a node of the timing tree. The timing effects by synchronizations can be simplified.

      author = {Pospischil, G. and Puschner, P. and Vrchoticky, A. and Zainlinger, R.},
      title = {Developing real-time tasks with predictable timing},
      journal = {Software, IEEE},
      year = {Sep 1992},
      volume = {9},
      number = {5},
      pages = {35-44},
      doi = {http://dx.doi.org/10.1109/52.156895}
    Hiren 2008.07.11 Puaut, I. WCET-centric software-controlled instruction caches for hard real-time systems 2006 Proceedings of the 18th Euromicro Conference on Real-Time Systems, pp. 217-226   
    Abstract: Cache memories have been extensively used to bridge the gap between high speed processors and relatively slower main memories. However, they are sources of predictability problems because of their dynamic and adaptive behavior, and thus need special attention to be used in hard real-time systems. A lot of progress has been achieved in the last ten years to statically predict worst-case execution times (WCETs) of tasks on architectures with caches. However, cache-aware WCET analysis techniques are not always applicable due to the lack of documentation of hardware manuals concerning the cache replacement policies. Moreover, they tend to be pessimistic with some cache replacement policies (e.g. random replacement policies) [6]. Lastly, caches are sources of timing anomalies in dynamically scheduled processors [13] (a cache miss may in some cases result in a shorter execution time than a hit). To reconciliate performance and predictability of caches, we propose in this paper algorithms for software control of instruction caches. The proposed algorithms statically divide the code of tasks into regions, for which the cache contents is statically selected. At run-time, at every transition between regions, the cache contents computed off-line is loaded into the cache and the cache replacement policy is disabled (the cache is locked). Experimental results provided in the paper show that with an appropriate selection of regions and cache contents, the worst-case performance of applications with locked instruction caches is competitive with the worst-case performance of unlocked caches.
    Review: The author presents algorithms for scheduling instructions onto a software-controlled cache. The two algorithms are called greedy and genetic. The goal is to optimize the worst-case execution path (WCEP) of a program. They do this by presenting a method for locating reload points in the program code and the contents that need to be moved into the instruction cache. Reload points are limited to beginning of loops and function entries in a control-flow graph. They create a cost function that compares a regular cache with a locked cache. This cost function determines what basic blocks benefit from being scheduled by software. The selection of cache contents finds the WCEP and on that path determines the most beneficial basic block (in terms of the cost function). This basic block is scheduled to be moved and locked in the cache. This however changes the WCEP, so another run of the WCEP is done. If the WCEP is different then the same procedure of finding the basic block that is most beneficial is done and scheduled onto the cache and locked. The procedure continues and stops once the worst case execution time of a current iteration is greater than one of a previous iteration.
      author = {Puaut, I.},
      title = {WCET-centric software-controlled instruction caches for hard real-time systems},
      journal = {Proceedings of the 18th Euromicro Conference on Real-Time Systems},
      publisher = {IEEE Computer Society Washington, DC, USA},
      year = {2006},
      pages = {217--226}
    Sungjun 2008.05.30 Puschner, P. & Burns, A. Writing temporally predictable code 2002 Object-Oriented Real-Time Dependable Systems, 2002.(WORDS 2002). Proceedings of the Seventh International Workshop on, pp. 85-91  URL 
    Abstract: The Worst-Case Execution-Time Analysis (WCET Analysis) of program code for modern processors is a highly complex tasks: First, it involves path analysis, to identify and describe the possible execution paths through the code. Second, it models the worst-case timing of possible paths on the target hardware, where the characterization of the timing of sophisticated hardware features (e.g., instruction pipelines, caches, parallel execution units) and their interferences are non-trivial. This paper presents a programming paradigm that takes the complexity from WCET analysis. Program code written according to this paradigm only has a single execution path. Writing single-path code makes path analysis and thus WCET analysis trivial. The WCET of the single path is obtained by executing the code (necessarily on that single path) and logging the duration of this execution. To demonstrate that the single-path approach provides a universal solution to the WCET-analysis problem, the paper shows how every, WCET-analyzable piece of code can be translated into single-path code
      author = {Puschner, P. and Burns, A.},
      title = {Writing temporally predictable code},
      journal = {Object-Oriented Real-Time Dependable Systems, 2002.(WORDS 2002). Proceedings of the Seventh International Workshop on},
      year = {2002},
      pages = {85--91},
      url = {http://www.vmars.tuwien.ac.at/php/pserver/extern/download.php?fileid=423}
    Rhishikesh Puschner, P. & Burns, A. A Review of Worst-Case Execution-Time Analysis 2000 Real-Time Systems
    Vol. 18(2), pp. 115-128 
    DOI URL 
    Review: WCET analysis is defined as finding upper bounds on execution times of pieces of code for a given application. They also clarify what is “execution time”, so that WCET analysis is not confused with response time analysis. They also emphasize that WCET bound for a piece of code is application dependent, i.e. it depends on the context.

    Goals of WCET analysis are to be safe and tight.

    WCET of a piece of code depends on:

    * Possible sequences of program actions

    * Time taken by each of the actions.

    The former is expressed more easily at the source code level, while the latter needs machine code. Thus, WCET analysis comprised of three problems:

    * Characterization of execution paths

    * Translation of path information from source code level to machine code level.

    * Hardware-level execution-time analysis

    Seems like, path characterization mostly relies on programmer annotations, with maybe some automatic support to aide the programmer. No reference cited for completely automatic path information extraction.

      author = {Puschner, Peter and Burns, Alan },
      title = {A Review of Worst-Case Execution-Time Analysis},
      journal = {Real-Time Systems},
      year = {2000},
      volume = {18},
      number = {2},
      pages = {115--128},
      url = {http://dx.doi.org/10.1023/A:1008119029962},
      doi = {http://dx.doi.org/10.1023/A:1008119029962}
    Shanna 2008.07.09 Puschner, P. & Kirner, R. From Time-Triggered to Time-Deterministic Real-Time Systems 2006 Proc. 5th IFIP Working Conference on Distributed and Parallel Embedded Systems   
      author = {Peter Puschner and Raimund Kirner},
      title = {From Time-Triggered to Time-Deterministic Real-Time Systems},
      booktitle = {Proc. 5th IFIP Working Conference on Distributed and Parallel Embedded Systems},
      year = {2006}
    Hiren 2008.07.13 Scheler, F. & Schröder-Preikschat, W. Time-triggered vs. event-triggered: A matter of configuration Proceedings of the GI/ITG Workshop on Non-Functional Properties of Embedded Systems, pp. 107-112  URL 
    Abstract: During the development of real-time systems one has either to plump

    for a time-triggered or an event-triggered architecture. Actually this decision

    deals with a non-functional property of a real-time system and should therefore be

    postponed as far as possible. Unfortunately, this property also exhibits functional

    qualities during the development of real-time systems making this postponement

    impossible and a subsequent transition very expensive. This paper sketches an

    approach to specify a real-time system independent of its architecture (timetriggered

    or event-triggered), thus facilitating to switch between a time-triggered

    and an event-triggered architecture easily.

    Review: The authors claim that whether a time-triggered or event-triggered approach is chosen for a real-time embedded design should not matter as long as all the real-time constraints of the system are met. However, the real-time architecture is a functional property of the system as well, which makes it difficult to migrate between the two approaches. For example, tasks in time-triggered follow the run-to-completion semantics whereas tasks in event-triggered may be preempted; hence, a different programming model. The authors propose a method for delaying the decision on time-triggered or event-triggered architectures to a later implementation stage. Their approach uses atomic basic blocks that consist of the control and data flow graphs and a mutual exclusion graph. They seperate the program code into and-semantic sections and or-semantic sections and briefly outline how to create schedules for the two triggered approaches.
      author = {Scheler, F. and Schröder-Preikschat, W.},
      title = {Time-triggered vs. event-triggered: A matter of configuration},
      journal = {Proceedings of the GI/ITG Workshop on Non-Functional Properties of Embedded Systems},
      pages = {107--112},
      url = {http://www4.informatik.uni-erlangen.de/~scheler/Publications/mmb2006.pdf}
    Ben 2008.07.09 Schoeberl, M. A time predictable instruction cache for a Java processor On the Move to Meaningful Internet Systems 2004: Workshop on Java Technologies for Real-Time and Embedded Systems (JTRES 2004
    Vol. 3292, pp. 371-382 
      author = {Schoeberl, M.},
      title = {A time predictable instruction cache for a Java processor},
      journal = {On the Move to Meaningful Internet Systems 2004: Workshop on Java Technologies for Real-Time and Embedded Systems (JTRES 2004},
      publisher = {Springer},
      volume = {3292},
      pages = {371--382}
    Isaac 2008.05.30 Talla, D. & John, L.K. MediaBreeze: a decoupled architecture for accelerating multimedia applications 2001 SIGARCH Comput. Archit. News
    Vol. 29(5), pp. 62-67 
    Abstract: Decoupled architectures are fine-grain processors that partition the memory access and execute functions in a computer program and exploit the parallelism between the two functions. Although some concepts from the traditional decoupled access execute paradigm made its way into commercial processors, they encountered resistance in general-purpose applications because these applications are not very structured and regular. However, multimedia applications have recently become dominant workload on desktops and workstations. Media applications are very structured and regular and lend themselves well to the decoupling concept. In this paper, we present an architecture that decouples the useful/true computations from the overhead/supporting instructions in media applications. The proposed scheme is incorporated into an out-of-order general-purpose processor enhanced with SIMD extensions. Explicit hardware support is provided to exploit instruction level parallelism in the overhead component. Performance evaluation shows that such hardware can significantly improve performance over conventional SIMD enhanced general-purpose processors. Results on nine multimedia benchmarks show that the proposed MediaBreeze architecture provides a 1.05x to 16.7x performance improvement over a 2-way out-of-order SIMD machine. On introducing slip-based data prefetching, a performance improvement up to 28x is observed.
    Review: A Decoupled Architecture for Accelerating Multimedia Applications.

    This paper utilizes the same idea as the previous decoupled architecture paper, with 2 processors. One being the Execute Processor, and one being the Access processor (memory access). THe paper doesn't go into a lot of details about the communication between them ,but it refers to a couple queues (such as load queues and store queues) to do the communication. Basically, they state that maybe for general purpose applications decoupled architecture doesn't work well because programs aren't structured in a way that can easily separate between memory access and computation, but for media applications, it is well structured. They compare against traditional SIMD machines and state that the amount of overhead for preparing SIMD instructions (packing, unpacking data etc) can be offset by the Decoupled architecture. For us, one thing we can take away is to evaluate whether a decoupled architecture makes sense for our applications. It seems that since we have interleave pipeline with multi threads, we could possible have the same effect with just one core? We should evaluate and learn more real time applications and see most of their memory access and execution code pattern.

      author = {Deependra Talla and Lizy K. John},
      title = {MediaBreeze: a decoupled architecture for accelerating multimedia applications},
      journal = {SIGARCH Comput. Archit. News},
      publisher = {ACM},
      year = {2001},
      volume = {29},
      number = {5},
      pages = {62--67},
      doi = {http://doi.acm.org/10.1145/563647.563659}
    Ben 2008.05.30 Wenzel, I., Kirner, R., Puschner, P. & Rieder, B. Principles of timing anomalies in superscalar processors 19-20 Sept. 2005 Quality Software, 2005. (QSIC 2005). Fifth International Conference on, pp. 295-303  DOI  
    Abstract: The counter-intuitive timing behavior of certain features in superscalar processors that cause severe problems for existing worst-case execution time analysis (WCET) methods is called timing anomalies. In this paper, we identify structural sources potentially causing timing anomalies in superscalar pipelines. We provide examples for cases where timing anomalies can arise in much simpler hardware architectures than commonly supposed (i.e., even in hardware containing only in-orderfunctional units). We elaborate the general principle behind timing anomalies and propose a general criterion (resource allocation criterion) that provides a necessary (but not sufficient) condition for the occurrence of timing anomalies in a processor. This principle allows to state the absence of timing anomalies for a specific combination of hardware and software and thus forms a solid theoretic foundation for the time-predictable execution of real-time software on complex processor hardware.
    Review: This paper describes the concept of timing anomalies, which can be

    detrimental to WCET analysis. They are defined as cases where

    changing the timing of a single instruction in a stream of

    instructions affects the timing of the overall stream by a larger

    amount than the original change (or in the opposite direction). For

    example, a case in which decreasing the latency of a single

    instruction can decrease the latency of the overall instruction stream

    by a larger amount is a timing anomaly. It would also be a timing

    anomaly if decreasing the latency of the one instruction caused an

    overall increase by any amount. There are straightforward and very

    similar examples when the latency of a single instruction is


    The authors present a previous analysis (T. Lundqvist and P.

    Stenström, 1999) showing that allowing out-of-order execution could

    cause timing anomalies, and present an example of this phenomenon that

    is very similar in spirit to the proof of Richard's Anomalies (Richard

    Graham, 1976). They then also show that disallowing out-of-order

    execution is not enough to prevent timing anomalies by presenting an

    example of a timing anomaly that can occur without out-of-order


    They then present an alternative criterion, which is necessary but not

    sufficient condition for the occurrence of timing anomalies. This

    criterion is the possibility of resource allocation decisions, which

    they define to be any choice that must be made in allocating hardware

    such that a change to the latency of some instruction can change the

    stream of instructions allocated to any functional unit. This

    criterion is satisfied in situations where more than one piece of

    hardware can perform the same function, because an increased latency

    for one instruction can cause a following instruction to be allocated

    to the other hardware unit.

      author = {Wenzel, I. and Kirner, R. and Puschner, P. and Rieder, B.},
      title = {Principles of timing anomalies in superscalar processors},
      journal = {Quality Software, 2005. (QSIC 2005). Fifth International Conference on},
      year = {19-20 Sept. 2005},
      pages = { 295-303},
      doi = {http://dx.doi.org/10.1109/QSIC.2005.49}
    Hiren 2008.05.19 Whitham, J. & Audsley, N. MCGREP--A Predictable Architecture for Embedded Real-Time Systems 2006 Proceedings of the 27th IEEE International Real-Time Systems Symposium, pp. 13-24  URL 
    Abstract: Real-time systems design involves many important choices, including that of the processor. The fastest processors achieve performance by utilizing architectural features that make them unpredictable, leading to difficulties proving offline that application process deadlines will be met, in the worst-case. Utilizing slower, more predictable processors, may not provide sufficient instruction throughput to execute all required application processes. This exposes a key trade-off in processor selection for real-time systems: predictability versus instruction throughput. This paper proposes MCGREP, a novel CPU architecture that combines predictability, high instruction throughput and flexibility. MCGREP is entirely microprogrammed, with multiple execution units. Basic operation involves implementation of a conventional set of CPU instructions in microcode - MCGREP then executes object code suitably compiled. Advanced operation allows the application to dynamically load new microcode, enabling new application specific instructions to increase overall performance. MCGREP is implemented upon reconfigurable logic (FPGA) - an increasingly important platform for the embedded RTS. Custom microcode configurations for new instructions are generated from C source. MCGREP is shown to have performance comparable to two popular FPGA softcore CPUs (OpenRISC and Microblaze, the latter a commercial product). Flexibility is demonstrated by implementing an existing instruction set (OpenRISC) in microcode, with application-specific instructions to improve overall performance. As a further demonstration, predictable two-level interrupt and synchronization mechanisms are programmed in microcode
    Review: MCGREP is a predictable processor architecture for embedded real-time systems. It is implemented as a softcore. MCGREP supports two operation modes: 1) traditional instruction and 2) microcode processing. The architecture consists of a two-stage pipeline with fetch/decode being one and execute being the second. The memory system is simple; the core is connected to the RAM via a bus shared with a boot ROM and I/O. There are no caches or scratchpads, but instead they use main memory as block RAMs on the FPGA. They have two functional units, which are used to parallelize the execution of microcode instructions. Microcode are generated with supporting tools to reduce hotspot segments of traditional instruction streams - parallelize the instruction execution. They compare the MCGREP with openRISC and Microblaze processors, all slowed down to 40Mhz and present instruction throughput, area and variability in execution times. They use the MiBench and MediaBench benchmarks.

    The collected results are misleading. The instruction throughput compares MCGREP with openRISC and Microblaze without caches, MMU, and FPU and shows that MCGREP has reasonably good performance. The variability in execution times reintroduce the caches in openRISC and Microblaze and show that MCGREP has smaller variation in execution times. Then the authors continue to describe how microcode can be used to handle interrupts, atomic operations, and priority inheritance.

      author = {Whitham, J. and Audsley, N.},
      title = {MCGREP--A Predictable Architecture for Embedded Real-Time Systems},
      journal = {Proceedings of the 27th IEEE International Real-Time Systems Symposium},
      publisher = {IEEE Computer Society Washington, DC, USA},
      year = {2006},
      pages = {13--24},
      note = {Hiren},
      url = {http://ieeexplore.ieee.org/iel5/4032320/4032321/04032332.pdf?tp=&isnumber=&arnumber=4032332}
    Rhishikesh 2008.05.26 Yau-Tsun Steven Li, S.M. & Yau-Tsun Steven Li, S.M. Performance Analysis of Embedded Software Using Implicit Path Enumeration 1995 Proc. 32nd Conference on Design Automation DAC '95, pp. 456-461  DOI  
    Abstract: Embedded computer systems are characterized by the presence of a processor running application specific software. A large number of these systems must satisfy real-time constraints. This paper examines the problem of determining the bound on the running time of a given program on a given processor. An important aspect of this problem is determining the extreme case program paths. The state of the art solution here relies on an explicit enumeration of program paths. This runs out of steam rather quickly since the number of feasible program paths is typically exponential in the size of the program. We present a solution for this problem, which considers all paths implicitly by using integer linear programming. This solution is implemented in the program cinderella which currently targets a popular embedded processor - the Intel i960. The preliminary results of using this tool are presented here.
      author = {Yau-Tsun Steven Li, Sharad Malik and Yau-Tsun Steven Li, Sharad Malik},
      title = {Performance Analysis of Embedded Software Using Implicit Path Enumeration},
      booktitle = {Proc. 32nd Conference on Design Automation DAC '95},
      year = {1995},
      pages = {456--461},
      doi = {http://dx.doi.org/10.1109/DAC.1995.249991}

    Created by JabRef on 09/10/2008.