Gør som tusindvis af andre bogelskere
Tilmeld dig nyhedsbrevet og få gode tilbud og inspiration til din næste læsning.
Ved tilmelding accepterer du vores persondatapolitik.Du kan altid afmelde dig igen.
Quantum computers can (in theory) solve certain problems far faster than a classical computer running any known classical algorithm. While existing technologies for building quantum computers are in their infancy, it is not too early to consider their scalability and reliability in the context of the design of large-scale quantum computers. To architect such systems, one must understand what it takes to design and model a balanced, fault-tolerant quantum computer architecture. The goal of this lecture is to provide architectural abstractions for the design of a quantum computer and to explore the systems-level challenges in achieving scalable, fault-tolerant quantum computation. In this lecture, we provide an engineering-oriented introduction to quantum computation with an overview of the theory behind key quantum algorithms. Next, we look at architectural case studies based upon experimental data and future projections for quantum computation implemented using trapped ions. While we focus here on architectures targeted for realization using trapped ions, the techniques for quantum computer architecture design, quantum fault-tolerance, and compilation described in this lecture are applicable to many other physical technologies that may be viable candidates for building a large-scale quantum computing system. We also discuss general issues involved with programming a quantum computer as well as a discussion of work on quantum architectures based on quantum teleportation. Finally, we consider some of the open issues remaining in the design of quantum computers. Table of Contents: Introduction / Basic Elements for Quantum Computation / Key Quantum Algorithms / Building Reliable and Scalable Quantum Architectures / Simulation of Quantum Computation / Architectural Elements / Case Study: The Quantum Logic Array Architecture / Programming the Quantum Architecture / Using the QLA for Quantum Simulation: The Transverse Ising Model / Teleportation-Based Quantum Architectures / Concluding Remarks
This book provides an introduction to the key concepts of Fully Homomorphic Encryption (FHE)-based computing, and discusses the challenges associated with architecting FHE-based computing systems. Readers will see that due to FHE¿s ability to compute on encrypted data, it is a promising solution to address privacy concerns arising from cloud-based services commonly used for a variety of applications including healthcare, financial, transportation, and weather forecasting. This book explains the fundamentals of the FHE operations and then presents an architectural analysis of the FHE-based computing. The authors also highlight challenges associated with accelerating FHE on various commodity platforms and argue that the FPGA platform provides a sweet spot in making privacy-preserving computing plausible.
Since the end of Dennard scaling in the early 2000s, improving the energy efficiency of computation has been the main concern of the research community and industry. The large energy efficiency gap between general-purpose processors and application-specific integrated circuits (ASICs) motivates the exploration of customizable architectures, where one can adapt the architecture to the workload. In this Synthesis lecture, we present an overview and introduction of the recent developments on energy-efficient customizable architectures, including customizable cores and accelerators, on-chip memory customization, and interconnect optimization. In addition to a discussion of the general techniques and classification of different approaches used in each area, we also highlight and illustrate some of the most successful design examples in each category and discuss their impact on performance and energy efficiency. We hope that this work captures the state-of-the-art research and development on customizable architectures and serves as a useful reference basis for further research, design, and implementation for large-scale deployment in future computing systems.
The emerging three-dimensional (3D) chip architectures, with their intrinsic capability of reducing the wire length, promise attractive solutions to reduce the delay of interconnects in future microprocessors. 3D memory stacking enables much higher memory bandwidth for future chip-multiprocessor design, mitigating the "e;memory wall"e; problem. In addition, heterogenous integration enabled by 3D technology can also result in innovative designs for future microprocessors. This book first provides a brief introduction to this emerging technology, and then presents a variety of approaches to designing future 3D microprocessor systems, by leveraging the benefits of low latency, high bandwidth, and heterogeneous integration capability which are offered by 3D technology.
Having hit power limitations to even more aggressive out-of-order execution in processor cores, many architects in the past decade have turned to single-instruction-multiple-data (SIMD) execution to increase single-threaded performance. SIMD execution, or having a single instruction drive execution of an identical operation on multiple data items, was already well established as a technique to efficiently exploit data parallelism. Furthermore, support for it was already included in many commodity processors. However, in the past decade, SIMD execution has seen a dramatic increase in the set of applications using it, which has motivated big improvements in hardware support in mainstream microprocessors. The easiest way to provide a big performance boost to SIMD hardware is to make it wider-i.e., increase the number of data items hardware operates on simultaneously. Indeed, microprocessor vendors have done this. However, as we exploit more data parallelism in applications, certain challenges can negatively impact performance. In particular, conditional execution, non-contiguous memory accesses, and the presence of some dependences across data items are key roadblocks to achieving peak performance with SIMD execution. This book first describes data parallelism, and why it is so common in popular applications. We then describe SIMD execution, and explain where its performance and energy benefits come from compared to other techniques to exploit parallelism. Finally, we describe SIMD hardware support in current commodity microprocessors. This includes both expected design tradeoffs, as well as unexpected ones, as we work to overcome challenges encountered when trying to map real software to SIMD execution.
As Moore's Law and Dennard scaling trends have slowed, the challenges of building high-performance computer architectures while maintaining acceptable power efficiency levels have heightened. Over the past ten years, architecture techniques for power efficiency have shifted from primarily focusing on module-level efficiencies, toward more holistic design styles based on parallelism and heterogeneity. This work highlights and synthesizes recent techniques and trends in power-efficient computer architecture. Table of Contents: Introduction / Voltage and Frequency Management / Heterogeneity and Specialization / Communication and Memory Systems / Conclusions / Bibliography / Authors' Biographies
To date, the most common form of simulators of computer systems are software-based running on standard computers. One promising approach to improve simulation performance is to apply hardware, specifically reconfigurable hardware in the form of field programmable gate arrays (FPGAs). This manuscript describes various approaches of using FPGAs to accelerate software-implemented simulation of computer systems and selected simulators that incorporate those techniques. More precisely, we describe a simulation architecture taxonomy that incorporates a simulation architecture specifically designed for FPGA accelerated simulation, survey the state-of-the-art in FPGA-accelerated simulation, and describe in detail selected instances of the described techniques. Table of Contents: Preface / Acknowledgments / Introduction / Simulator Background / Accelerating Computer System Simulators with FPGAs / Simulation Virtualization / Categorizing FPGA-based Simulators / Conclusion / Bibliography / Authors' Biographies
Design for security is an essential aspect of the design of future computers. However, security is not well understood by the computer architecture community. Many important security aspects have evolved over the last several decades in the cryptography, operating systems, and networking communities. This book attempts to introduce the computer architecture student, researcher, or practitioner to the basic concepts of security and threat-based design. Past work in different security communities can inform our thinking and provide a rich set of technologies for building architectural support for security into all future computers and embedded computing devices and appliances. I have tried to keep the book short, which means that many interesting topics and applications could not be included. What the book focuses on are the fundamental security concepts, across different security communities, that should be understood by any computer architect trying to design or evaluate security-aware computer architectures.
From driving, flying, and swimming, to digging for unknown objects in space exploration, autonomous robots take on varied shapes and sizes. In part, autonomous robots are designed to perform tasks that are too dirty, dull, or dangerous for humans. With nontrivial autonomy and volition, they may soon claim their own place in human society. These robots will be our allies as we strive for understanding our natural and man-made environments and build positive synergies around us. Although we may never perfect replication of biological capabilities in robots, we must harness the inevitable emergence of robots that synchronizes with our own capacities to live, learn, and grow. This book is a snapshot of motivations and methodologies for our collective attempts to transform our lives and enable us to cohabit with robots that work with and for us. It reviews and guides the reader to seminal and continual developments that are the foundations for successful paradigms. It attempts to demystify the abilities and limitations of robots. It is a progress report on the continuing work that will fuel future endeavors. Table of Contents: Part I: Preliminaries/Agency, Motion, and Anatomy/Behaviors / Architectures / Affect/Sensors / Manipulators/Part II: Mobility/Potential Fields/Roadmaps / Reactive Navigation / Multi-Robot Mapping: Brick and Mortar Strategy / Part III: State of the Art / Multi-Robotics Phenomena / Human-Robot Interaction / Fuzzy Control / Decision Theory and Game Theory / Part IV: On the Horizon / Applications: Macro and Micro Robots / References / Author Biography / Discussion
Shrinking feature size and diminishing supply voltage are making circuits sensitive to supply voltage fluctuations within the microprocessor, caused by normal workload activity changes. If left unattended, voltage fluctuations can lead to timing violations or even transistor lifetime issues that degrade processor robustness. Mechanisms that learn to tolerate, avoid, and eliminate voltage fluctuations based on program and microarchitectural events can help steer the processor clear of danger, thus enabling tighter voltage margins that improve performance or lower power consumption. We describe the problem of voltage variation and the factors that influence this variation during processor design and operation. We also describe a variety of runtime hardware and software mitigation techniques that either tolerate, avoid, and/or eliminate voltage violations. We hope processor architects will find the information useful since tolerance, avoidance, and elimination are generalizable constructs that can serve as a basis for addressing other reliability challenges as well. Table of Contents: Introduction / Modeling Voltage Variation / Understanding the Characteristics of Voltage Variation / Traditional Solutions and Emerging Solution Forecast / Allowing and Tolerating Voltage Emergencies / Predicting and Avoiding Voltage Emergencies / Eliminiating Recurring Voltage Emergencies / Future Directions on Resiliency
Multithreaded architectures now appear across the entire range of computing devices, from the highest-performing general purpose devices to low-end embedded processors. Multithreading enables a processor core to more effectively utilize its computational resources, as a stall in one thread need not cause execution resources to be idle. This enables the computer architect to maximize performance within area constraints, power constraints, or energy constraints. However, the architectural options for the processor designer or architect looking to implement multithreading are quite extensive and varied, as evidenced not only by the research literature but also by the variety of commercial implementations. This book introduces the basic concepts of multithreading, describes a number of models of multithreading, and then develops the three classic models (coarse-grain, fine-grain, and simultaneous multithreading) in greater detail. It describes a wide variety of architectural and software design tradeoffs, as well as opportunities specific to multithreading architectures. Finally, it details a number of important commercial and academic hardware implementations of multithreading. Table of Contents: Introduction / Multithreaded Execution Models / Coarse-Grain Multithreading / Fine-Grain Multithreading / Simultaneous Multithreading / Managing Contention / New Opportunities for Multithreaded Processors / Experimentation and Metrics / Implementations of Multithreaded Processors / Conclusion
General-purpose graphics processing units (GPGPU) have emerged as an important class of shared memory parallel processing architectures, with widespread deployment in every computer class from high-end supercomputers to embedded mobile platforms. Relative to more traditional multicore systems of today, GPGPUs have distinctly higher degrees of hardware multithreading (hundreds of hardware thread contexts vs. tens), a return to wide vector units (several tens vs. 1-10), memory architectures that deliver higher peak memory bandwidth (hundreds of gigabytes per second vs. tens), and smaller caches/scratchpad memories (less than 1 megabyte vs. 1-10 megabytes). In this book, we provide a high-level overview of current GPGPU architectures and programming models. We review the principles that are used in previous shared memory parallel platforms, focusing on recent results in both the theory and practice of parallel algorithms, and suggest a connection to GPGPU platforms. We aim to provide hints to architects about understanding algorithm aspect to GPGPU. We also provide detailed performance analysis and guide optimizations from high-level algorithms to low-level instruction level optimizations. As a case study, we use n-body particle simulations known as the fast multipole method (FMM) as an example. We also briefly survey the state-of-the-art in GPU performance analysis tools and techniques. Table of Contents: GPU Design, Programming, and Trends / Performance Principles / From Principles to Practice: Analysis and Tuning / Using Detailed Performance Analysis to Guide Optimization
Compiling for parallelism is a longstanding topic of compiler research. This book describes the fundamental principles of compiling "e;regular"e; numerical programs for parallelism. We begin with an explanation of analyses that allow a compiler to understand the interaction of data reads and writes in different statements and loop iterations during program execution. These analyses include dependence analysis, use-def analysis and pointer analysis. Next, we describe how the results of these analyses are used to enable transformations that make loops more amenable to parallelization, and discuss transformations that expose parallelism to target shared memory multicore and vector processors. We then discuss some problems that arise when parallelizing programs for execution on distributed memory machines. Finally, we conclude with an overview of solving Diophantine equations and suggestions for further readings in the topics of this book to enable the interested reader to delve deeper into the field. Table of Contents: Introduction and overview / Dependence analysis, dependence graphs and alias analysis / Program parallelization / Transformations to modify and eliminate dependences / Transformation of iterative and recursive constructs / Compiling for distributed memory machines / Solving Diophantine equations / A guide to further reading
As conventional memory technologies such as DRAM and Flash run into scaling challenges, architects and system designers are forced to look at alternative technologies for building future computer systems. This synthesis lecture begins by listing the requirements for a next generation memory technology and briefly surveys the landscape of novel non-volatile memories. Among these, Phase Change Memory (PCM) is emerging as a leading contender, and the authors discuss the material, device, and circuit advances underlying this exciting technology. The lecture then describes architectural solutions to enable PCM for main memories. Finally, the authors explore the impact of such byte-addressable non-volatile memories on future storage and system designs. Table of Contents: Next Generation Memory Technologies / Architecting PCM for Main Memories / Tolerating Slow Writes in PCM / Wear Leveling for Durability / Wear Leveling Under Adversarial Settings / Error Resilience in Phase Change Memories / Storage and System Design With Emerging Non-Volatile Memories
A key determinant of overall system performance and power dissipation is the cache hierarchy since access to off-chip memory consumes many more cycles and energy than on-chip accesses. In addition, multi-core processors are expected to place ever higher bandwidth demands on the memory system. All these issues make it important to avoid off-chip memory access by improving the efficiency of the on-chip cache. Future multi-core processors will have many large cache banks connected by a network and shared by many cores. Hence, many important problems must be solved: cache resources must be allocated across many cores, data must be placed in cache banks that are near the accessing core, and the most important data must be identified for retention. Finally, difficulties in scaling existing technologies require adapting to and exploiting new technology constraints. The book attempts a synthesis of recent cache research that has focused on innovations for multi-core processors. It is an excellent starting point for early-stage graduate students, researchers, and practitioners who wish to understand the landscape of recent cache research. The book is suitable as a reference for advanced computer architecture classes as well as for experienced researchers and VLSI engineers. Table of Contents: Basic Elements of Large Cache Design / Organizing Data in CMP Last Level Caches / Policies Impacting Cache Hit Rates / Interconnection Networks within Large Caches / Technology / Concluding Remarks
Dynamic binary modification tools form a software layer between a running application and the underlying operating system, providing the powerful opportunity to inspect and potentially modify every user-level guest application instruction that executes. Toolkits built upon this technology have enabled computer architects to build powerful simulators and emulators for design-space exploration, compiler writers to analyze and debug the code generated by their compilers, software developers to fully explore the features, bottlenecks, and performance of their software, and even end-users to extend the functionality of proprietary software running on their computers. Several dynamic binary modification systems are freely available today that place this power into the hands of the end user. While these systems are quite complex internally, they mask that complexity with an easy-to-learn API that allows a typical user to ramp up fairly quickly and build any of a number of powerful tools. Meanwhile, these tools are robust enough to form the foundation for software products in use today. This book serves as a primer for researchers interested in dynamic binary modification systems, their internal design structure, and the wide range of tools that can be built leveraging these systems. The hands-on examples presented throughout form a solid foundation for designing and constructing more complex tools, with an appreciation for the techniques necessary to make those tools robust and efficient. Meanwhile, the reader will get an appreciation for the internal design of the engines themselves. Table of Contents: Dynamic Binary Modification: Overview / Using a Dynamic Binary Modifier / Program Analysis and Debugging / Active Program Modification / Architectural Exploration / Advanced System Internals / Historical Perspectives / Summary and Observations
Provides insights into the principles behind the design of academic and commercial secure processor architectures. This book presents the different challenges of secure processor architecture design to graduate students interested in research on architecture and hardware security and computer architects working in industry.
Datacenter networks provide the communication substrate for large parallel computer systems that form the ecosystem for high performance computing (HPC) systems and modern Internet applications. The design of new datacenter networks is motivated by an array of applications ranging from communication intensive climatology, complex material simulations and molecular dynamics to such Internet applications as Web search, language translation, collaborative Internet applications, streaming video and voice-over-IP. For both Supercomputing and Cloud Computing the network enables distributed applications to communicate and interoperate in an orchestrated and efficient way. This book describes the design and engineering tradeoffs of datacenter networks. It describes interconnection networks from topology and network architecture to routing algorithms, and presents opportunities for taking advantage of the emerging technology trends that are influencing router microarchitecture. With the emergence of "e;many-core"e; processor chips, it is evident that we will also need "e;many-port"e; routing chips to provide a bandwidth-rich network to avoid the performance limiting effects of Amdahl's Law. We provide an overview of conventional topologies and their routing algorithms and show how technology, signaling rates and cost-effective optics are motivating new network topologies that scale up to millions of hosts. The book also provides detailed case studies of two high performance parallel computer systems and their networks. Table of Contents: Introduction / Background / Topology Basics / High-Radix Topologies / Routing / Scalable Switch Microarchitecture / System Packaging / Case Studies / Closing Remarks
This lecture presents a study of the microarchitecture of contemporary microprocessors. The focus is on implementation aspects, with discussions on their implications in terms of performance, power, and cost of state-of-the-art designs. The lecture starts with an overview of the different types of microprocessors and a review of the microarchitecture of cache memories. Then, it describes the implementation of the fetch unit, where special emphasis is made on the required support for branch prediction. The next section is devoted to instruction decode with special focus on the particular support to decoding x86 instructions. The next chapter presents the allocation stage and pays special attention to the implementation of register renaming. Afterward, the issue stage is studied. Here, the logic to implement out-of-order issue for both memory and non-memory instructions is thoroughly described. The following chapter focuses on the instruction execution and describes the different functional units that can be found in contemporary microprocessors, as well as the implementation of the bypass network, which has an important impact on the performance. Finally, the lecture concludes with the commit stage, where it describes how the architectural state is updated and recovered in case of exceptions or misspeculations. This lecture is intended for an advanced course on computer architecture, suitable for graduate students or senior undergrads who want to specialize in the area of computer architecture. It is also intended for practitioners in the industry in the area of microprocessor design. The book assumes that the reader is familiar with the main concepts regarding pipelining, out-of-order execution, cache memories, and virtual memory. Table of Contents: Introduction / Caches / The Instruction Fetch Unit / Decode / Allocation / The Issue Stage / Execute / The Commit Stage / References / Author Biographies
The advent of multicore processors has renewed interest in the idea of incorporating transactions into the programming model used to write parallel programs. This approach, known as transactional memory, offers an alternative, and hopefully better, way to coordinate concurrent threads. The ACI (atomicity, consistency, isolation) properties of transactions provide a foundation to ensure that concurrent reads and writes of shared data do not produce inconsistent or incorrect results. At a higher level, a computation wrapped in a transaction executes atomically - either it completes successfully and commits its result in its entirety or it aborts. In addition, isolation ensures the transaction produces the same result as if no other transactions were executing concurrently. Although transactions are not a parallel programming panacea, they shift much of the burden of synchronizing and coordinating parallel computations from a programmer to a compiler, to a language runtime system, or to hardware. The challenge for the system implementers is to build an efficient transactional memory infrastructure. This book presents an overview of the state of the art in the design and implementation of transactional memory systems, as of early spring 2010. Table of Contents: Introduction / Basic Transactions / Building on Basic Transactions / Software Transactional Memory / Hardware-Supported Transactional Memory / Conclusions
Performance evaluation is at the foundation of computer architecture research and development. Contemporary microprocessors are so complex that architects cannot design systems based on intuition and simple models only. Adequate performance evaluation methods are absolutely crucial to steer the research and development process in the right direction. However, rigorous performance evaluation is non-trivial as there are multiple aspects to performance evaluation, such as picking workloads, selecting an appropriate modeling or simulation approach, running the model and interpreting the results using meaningful metrics. Each of these aspects is equally important and a performance evaluation method that lacks rigor in any of these crucial aspects may lead to inaccurate performance data and may drive research and development in a wrong direction. The goal of this book is to present an overview of the current state-of-the-art in computer architecture performance evaluation, with a special emphasis on methods for exploring processor architectures. The book focuses on fundamental concepts and ideas for obtaining accurate performance data. The book covers various topics in performance evaluation, ranging from performance metrics, to workload selection, to various modeling approaches including mechanistic and empirical modeling. And because simulation is by far the most prevalent modeling technique, more than half the book's content is devoted to simulation. The book provides an overview of the simulation techniques in the computer designer's toolbox, followed by various simulation acceleration techniques including sampled simulation, statistical simulation, parallel simulation and hardware-accelerated simulation. Table of Contents: Introduction / Performance Metrics / Workload Design / Analytical Performance Modeling / Simulation / Sampled Simulation / Statistical Simulation / Parallel Simulation and Hardware Acceleration / Concluding Remarks
This book covers technologies, applications, tools, languages, procedures, advantages, and disadvantages of reconfigurable supercomputing using Field Programmable Gate Arrays (FPGAs). The target audience is the community of users of High Performance Computers (HPC) who may benefit from porting their applications into a reconfigurable environment. As such, this book is intended to guide the HPC user through the many algorithmic considerations, hardware alternatives, usability issues, programming languages, and design tools that need to be understood before embarking on the creation of reconfigurable parallel codes. We hope to show that FPGA acceleration, based on the exploitation of the data parallelism, pipelining and concurrency remains promising in view of the diminishing improvements in traditional processor and system design. Table of Contents: FPGA Technology / Reconfigurable Supercomputing / Algorithmic Considerations / FPGA Programming Languages / Case Study: Sorting / Alternative Technologies and Concluding Remarks
For many years, most computer architects have pursued one primary goal: performance. Architects have translated the ever-increasing abundance of ever-faster transistors provided by Moore's law into remarkable increases in performance. Recently, however, the bounty provided by Moore's law has been accompanied by several challenges that have arisen as devices have become smaller, including a decrease in dependability due to physical faults. In this book, we focus on the dependability challenge and the fault tolerance solutions that architects are developing to overcome it. The two main purposes of this book are to explore the key ideas in fault-tolerant computer architecture and to present the current state-of-the-art - over approximately the past 10 years - in academia and industry. Table of Contents: Introduction / Error Detection / Error Recovery / Diagnosis / Self-Repair / The Future
In the last few years, power dissipation has become an important design constraint, on par with performance, in the design of new computer systems. Whereas in the past, the primary job of the computer architect was to translate improvements in operating frequency and transistor count into performance, now power efficiency must be taken into account at every step of the design process. While for some time, architects have been successful in delivering 40% to 50% annual improvement in processor performance, costs that were previously brushed aside eventually caught up. The most critical of these costs is the inexorable increase in power dissipation and power density in processors. Power dissipation issues have catalyzed new topic areas in computer architecture, resulting in a substantial body of work on more power-efficient architectures. Power dissipation coupled with diminishing performance gains, was also the main cause for the switch from single-core to multi-core architectures and a slowdown in frequency increase. This book aims to document some of the most important architectural techniques that were invented, proposed, and applied to reduce both dynamic power and static power dissipation in processors and memory hierarchies. A significant number of techniques have been proposed for a wide range of situations and this book synthesizes those techniques by focusing on their common characteristics. Table of Contents: Introduction / Modeling, Simulation, and Measurement / Using Voltage and Frequency Adjustments to Manage Dynamic Power / Optimizing Capacitance and Switching Activity to Reduce Dynamic Power / Managing Static (Leakage) Power / Conclusions
Chip multiprocessors - also called multi-core microprocessors or CMPs for short - are now the only way to build high-performance microprocessors, for a variety of reasons. Large uniprocessors are no longer scaling in performance, because it is only possible to extract a limited amount of parallelism from a typical instruction stream using conventional superscalar instruction issue techniques. In addition, one cannot simply ratchet up the clock speed on today's processors, or the power dissipation will become prohibitive in all but water-cooled systems. Compounding these problems is the simple fact that with the immense numbers of transistors available on today's microprocessor chips, it is too costly to design and debug ever-larger processors every year or two. CMPs avoid these problems by filling up a processor die with multiple, relatively simpler processor cores instead of just one huge core. The exact size of a CMP's cores can vary from very simple pipelines to moderately complex superscalar processors, but once a core has been selected the CMP's performance can easily scale across silicon process generations simply by stamping down more copies of the hard-to-design, high-speed processor core in each successive chip generation. In addition, parallel code execution, obtained by spreading multiple threads of execution across the various cores, can achieve significantly higher performance than would be possible using only a single core. While parallel threads are already common in many useful workloads, there are still important workloads that are hard to divide into parallel threads. The low inter-processor communication latency between the cores in a CMP helps make a much wider range of applications viable candidates for parallel execution than was possible with conventional, multi-chip multiprocessors; nevertheless, limited parallelism in key applications is the main factor limiting acceptance of CMPs in some types of systems. After a discussion of the basic pros and cons of CMPs when they are compared with conventional uniprocessors, this book examines how CMPs can best be designed to handle two radically different kinds of workloads that are likely to be used with a CMP: highly parallel, throughput-sensitive applications at one end of the spectrum, and less parallel, latency-sensitive applications at the other. Throughput-sensitive applications, such as server workloads that handle many independent transactions at once, require careful balancing of all parts of a CMP that can limit throughput, such as the individual cores, on-chip cache memory, and off-chip memory interfaces. Several studies and example systems, such as the Sun Niagara, that examine the necessary tradeoffs are presented here. In contrast, latency-sensitive applications - many desktop applications fall into this category - require a focus on reducing inter-core communication latency and applying techniques to help programmers divide their programs into multiple threads as easily as possible. This book discusses many techniques that can be used in CMPs to simplify parallel programming, with an emphasis on research directions proposed at Stanford University. To illustrate the advantages possible with a CMP using a couple of solid examples, extra focus is given to thread-level speculation (TLS), a way to automatically break up nominally sequential applications into parallel threads on a CMP, and transactional memory. This model can greatly simplify manual parallel programming by using hardware - instead of conventional software locks - to enforce atomic code execution of blocks of instructions, a technique that makes parallel coding much less error-prone. Contents: The Case for CMPs / Improving Throughput / Improving Latency Automatically / Improving Latency using Manual Parallel Programming / A Multicore World: The Future of CMPs
This synthesis lecture presents the current state-of-the-art in applying low-latency, lossless hardware compression algorithms to cache, memory, and the memory/cache link. There are many non-trivial challenges that must be addressed to make data compression work well in this context. First, since compressed data must be decompressed before it can be accessed, decompression latency ends up on the critical memory access path. This imposes a significant constraint on the choice of compression algorithms. Second, while conventional memory systems store fixed-size entities like data types, cache blocks, and memory pages, these entities will suddenly vary in size in a memory system that employs compression. Dealing with variable size entities in a memory system using compression has a significant impact on the way caches are organized and how to manage the resources in main memory. We systematically discuss solutions in the open literature to these problems. Chapter 2 provides the foundations of data compression by first introducing the fundamental concept of value locality. We then introduce a taxonomy of compression algorithms and show how previously proposed algorithms fit within that logical framework. Chapter 3 discusses the different ways that cache memory systems can employ compression, focusing on the trade-offs between latency, capacity, and complexity of alternative ways to compact compressed cache blocks. Chapter 4 discusses issues in applying data compression to main memory and Chapter 5 covers techniques for compressing data on the cache-to-memory links. This book should help a skilled memory system designer understand the fundamental challenges in applying compression to the memory hierarchy and introduce him/her to the state-of-the-art techniques in addressing them.
Today, computer-system optimization, at both the hardware and software levels, must consider the details of the memory system in its analysis; failing to do so yields systems that are increasingly inefficient as those systems become more complex. This lecture seeks to introduce the reader to the most important details of the memory system; it targets both computer scientists and computer engineers in industry and in academia. Roughly speaking, computer scientists are the users of the memory system and computer engineers are the designers of the memory system. Both can benefit tremendously from a basic understanding of how the memory system really works: the computer scientist will be better equipped to create algorithms that perform well and the computer engineer will be better equipped to design systems that approach the optimal, given the resource limitations. Currently, there is consensus among architecture researchers that the memory system is "the bottleneck," and this consensus has held for over a decade. Somewhat inexplicably, most of the research in the field is still directed toward improving the CPU to better tolerate a slow memory system, as opposed to addressing the weaknesses of the memory system directly. This lecture should get the bulk of the computer science and computer engineering population up the steep part of the learning curve. Not every CS/CE researcher/developer needs to do work in the memory system, but, just as a carpenter can do his job more efficiently if he knows a little of architecture, and an architect can do his job more efficiently if he knows a little of carpentry, giving the CS/CE worlds better intuition about the memory system should help them build better systems, both software and hardware. Table of Contents: Primers / It Must Be Modeled Accurately / ...\ and It Will Change Soon
Since the 1970¿s, microprocessor-based digital platforms have been riding Moore¿s law, allowing for doubling of density for the same area roughly every two years. However, whereas microprocessor fabrication has focused on increasing instruction execution rate, memory fabrication technologies have focused primarily on an increase in capacity with negligible increase in speed. This divergent trend in performance between the processors and memory has led to a phenomenon referred to as the ¿Memory Wall.¿ To overcome the memory wall, designers have resorted to a hierarchy of cache memory levels, which rely on the principal of memory access locality to reduce the observed memory access time and the performance gap between processors and memory. Unfortunately, important workload classes exhibit adverse memory access patterns that baffle the simple policies built into modern cache hierarchies to move instructions and data across cache levels. As such, processors often spend much time idling upon a demand fetch of memory blocks that miss in higher cache levels. Prefetching¿predicting future memory accesses and issuing requests for the corresponding memory blocks in advance of explicit accesses¿is an effective approach to hide memory access latency. There have been a myriad of proposed prefetching techniques, and nearly every modern processor includes some hardware prefetching mechanisms targeting simple and regular memory access patterns. This primer offers an overview of the various classes of hardware prefetchers for instructions and data proposed in the research literature, and presents examples of techniques incorporated into modern microprocessors.
This book describes warehouse-scale computers (WSCs), the computing platforms that power cloud computing and all the great web services we use every day. It discusses how these new systems treat the datacenter itself as one massive computer designed at warehouse scale, with hardware and software working in concert to deliver good levels of internet service performance. The book details the architecture of WSCs and covers the main factors influencing their design, operation, and cost structure, and the characteristics of their software base. Each chapter contains multiple real-world examples, including detailed case studies and previously unpublished details of the infrastructure used to power Google's online services. Targeted at the architects and programmers of today's WSCs, this book provides a great foundation for those looking to innovate in this fascinating and important area, but the material will also be broadly interesting to those who just want to understandthe infrastructure powering the internet.The third edition reflects four years of advancements since the previous edition and nearly doubles the number of pictures and figures. New topics range from additional workloads like video streaming, machine learning, and public cloud to specialized silicon accelerators, storage and network building blocks, and a revised discussion of data center power and cooling, and uptime. Further discussions of emerging trends and opportunities ensure that this revised edition will remain an essential resource for educators and professionals working on the next generation of WSCs.
Many modern computer systems, including homogeneous and heterogeneous architectures, support shared memory in hardware. In a shared memory system, each of the processor cores may read and write to a single shared address space. For a shared memory machine, the memory consistency model defines the architecturally visible behavior of its memory system. Consistency definitions provide rules about loads and stores (or memory reads and writes) and how they act upon memory. As part of supporting a memory consistency model, many machines also provide cache coherence protocols that ensure that multiple cached copies of data are kept up-to-date. The goal of this primer is to provide readers with a basic understanding of consistency and coherence. This understanding includes both the issues that must be solved as well as a variety of solutions. We present both high-level concepts as well as specific, concrete examples from real-world systems.This second edition reflects a decade of advancements since the first edition and includes, among other more modest changes, two new chapters: one on consistency and coherence for non-CPU accelerators (with a focus on GPUs) and one that points to formal work and tools on consistency and coherence.
Tilmeld dig nyhedsbrevet og få gode tilbud og inspiration til din næste læsning.
Ved tilmelding accepterer du vores persondatapolitik.