Book Description
Parallel Computer Architecture, English Edition, 2nd Edition (PDF e-book download)
- By David E. Culler et al. (US)
- Publisher: China Machine Press, Beijing
- ISBN: 7111074408
- Publication year: 1999
- Listed page count: 1025
- File size: 58MB
- File page count: 1051
- Subject terms:
Table of Contents
1 Introduction 1
1.1 Why Parallel Architecture 4
1.1.1 Application Trends 6
1.1.2 Technology Trends 12
1.1.3 Architectural Trends 14
1.1.4 Supercomputers 21
1.1.5 Summary 23
1.2 Convergence of Parallel Architectures 25
1.2.1 Communication Architecture 25
1.2.2 Shared Address Space 28
1.2.3 Message Passing 37
1.2.4 Convergence 42
1.2.5 Data Parallel Processing 44
1.2.6 Other Parallel Architectures 47
1.2.7 A Generic Parallel Architecture 50
1.3 Fundamental Design Issues 52
1.3.1 Communication Abstraction 53
1.3.2 Programming Model Requirements 53
1.3.3 Communication and Replication 58
1.3.4 Performance 59
1.3.5 Summary 63
1.4 Concluding Remarks 63
1.5 Historical References 66
1.6 Exercises 70
2 Parallel Programs 75
2.1 Parallel Application Case Studies 76
2.1.1 Simulating Ocean Currents 77
2.1.2 Simulating the Evolution of Galaxies 78
2.1.3 Visualizing Complex Scenes Using Ray Tracing 79
2.1.4 Mining Data for Associations 80
2.2 The Parallelization Process 81
2.2.1 Steps in the Process 82
2.2.2 Parallelizing Computation versus Data 90
2.2.3 Goals of the Parallelization Process 91
2.3 Parallelization of an Example Program 92
2.3.1 The Equation Solver Kernel 92
2.3.2 Decomposition 93
2.3.3 Assignment 98
2.3.4 Orchestration under the Data Parallel Model 99
2.3.5 Orchestration under the Shared Address Space Model 101
2.3.6 Orchestration under the Message-Passing Model 108
2.4 Concluding Remarks 116
2.5 Exercises 117
3 Programming for Performance 121
3.1 Partitioning for Performance 123
3.1.1 Load Balance and Synchronization Wait Time 123
3.1.2 Reducing Inherent Communication 131
3.1.3 Reducing the Extra Work 135
3.1.4 Summary 136
3.2 Data Access and Communication in a Multimemory System 137
3.2.1 A Multiprocessor as an Extended Memory Hierarchy 138
3.2.2 Artifactual Communication in the Extended Memory Hierarchy 139
3.2.3 Artifactual Communication and Replication: The Working Set Perspective 140
3.3 Orchestration for Performance 142
3.3.1 Reducing Artifactual Communication 142
3.3.2 Structuring Communication to Reduce Cost 150
3.4 Performance Factors from the Processor's Perspective 156
3.5 The Parallel Application Case Studies:An In-Depth Look 160
3.5.1 Ocean 161
3.5.2 Barnes-Hut 166
3.5.3 Raytrace 174
3.5.4 Data Mining 178
3.6 Implications for Programming Models 182
3.6.1 Naming 184
3.6.2 Replication 184
3.6.3 Overhead and Granularity of Communication 186
3.6.4 Block Data Transfer 187
3.6.5 Synchronization 188
3.6.6 Hardware Cost and Design Complexity 188
3.6.7 Performance Model 189
3.6.8 Summary 189
3.7 Concluding Remarks 190
3.8 Exercises 192
4 Workload-Driven Evaluation 199
4.1 Scaling Workloads and Machines 202
4.1.1 Basic Measures of Multiprocessor Performance 202
4.1.2 Why Worry about Scaling? 204
4.1.3 Key Issues in Scaling 206
4.1.4 Scaling Models and Speedup Measures 207
4.1.5 Impact of Scaling Models on the Equation Solver Kernel 211
4.1.6 Scaling Workload Parameters 213
4.2 Evaluating a Real Machine 215
4.2.1 Performance Isolation Using Microbenchmarks 215
4.2.2 Choosing Workloads 216
4.2.3 Evaluating a Fixed-Size Machine 221
4.2.4 Varying Machine Size 226
4.2.5 Choosing Performance Metrics 228
4.3 Evaluating an Architectural Idea or Trade-off 231
4.3.1 Multiprocessor Simulation 233
4.3.2 Scaling Down Problem and Machine Parameters for Simulation 234
4.3.3 Dealing with the Parameter Space: An Example Evaluation 238
4.3.4 Summary 243
4.4 Illustrating Workload Characterization 243
4.4.1 Workload Case Studies 244
4.4.2 Workload Characteristics 253
4.5 Concluding Remarks 262
4.6 Exercises 263
5 Shared Memory Multiprocessors 269
5.1 Cache Coherence 273
5.1.1 The Cache Coherence Problem 273
5.1.2 Cache Coherence through Bus Snooping 277
5.2 Memory Consistency 283
5.2.1 Sequential Consistency 286
5.2.2 Sufficient Conditions for Preserving Sequential Consistency 289
5.3 Design Space for Snooping Protocols 291
5.3.1 A Three-State (MSI) Write-Back Invalidation Protocol 293
5.3.2 A Four-State (MESI) Write-Back Invalidation Protocol 299
5.3.3 A Four-State (Dragon) Write-Back Update Protocol 301
5.4 Assessing Protocol Design Trade-offs 305
5.4.1 Methodology 306
5.4.2 Bandwidth Requirement under the MESI Protocol 307
5.4.3 Impact of Protocol Optimizations 311
5.4.4 Trade-Offs in Cache Block Size 313
5.4.5 Update-Based versus Invalidation-Based Protocols 329
5.5 Synchronization 334
5.5.1 Components of a Synchronization Event 335
5.5.2 Role of the User and System 336
5.5.3 Mutual Exclusion 337
5.5.4 Point-to-Point Event Synchronization 352
5.5.5 Global (Barrier) Event Synchronization 353
5.5.6 Synchronization Summary 358
5.6 Implications for Software 359
5.7 Concluding Remarks 366
5.8 Exercises 367
6 Snoop-Based Multiprocessor Design 377
6.1 Correctness Requirements 378
6.2 Base Design: Single-Level Caches with an Atomic Bus 380
6.2.1 Cache Controller and Tag Design 381
6.2.2 Reporting Snoop Results 382
6.2.3 Dealing with Write Backs 384
6.2.4 Base Organization 385
6.2.5 Nonatomic State Transitions 385
6.2.6 Serialization 388
6.2.7 Deadlock 390
6.2.8 Livelock and Starvation 390
6.2.9 Implementing Atomic Operations 391
6.3 Multilevel Cache Hierarchies 393
6.3.1 Maintaining Inclusion 394
6.3.2 Propagating Transactions for Coherence in the Hierarchy 396
6.4 Split-Transaction Bus 398
6.4.1 An Example Split-Transaction Design 400
6.4.2 Bus Design and Request-Response Matching 400
6.4.3 Snoop Results and Conflicting Requests 402
6.4.4 Flow Control 404
6.4.5 Path of a Cache Miss 404
6.4.6 Serialization and Sequential Consistency 406
6.4.7 Alternative Design Choices 409
6.4.8 Split-Transaction Bus with Multilevel Caches 410
6.4.9 Supporting Multiple Outstanding Misses from a Processor 413
6.5 Case Studies: SGI Challenge and Sun Enterprise 6000 415
6.5.1 SGI Powerpath-2 System Bus 417
6.5.2 SGI Processor and Memory Subsystems 420
6.5.3 SGI I/O Subsystem 422
6.5.4 SGI Challenge Memory System Performance 424
6.5.5 Sun Gigaplane System Bus 424
6.5.6 Sun Processor and Memory Subsystem 427
6.5.7 Sun I/O Subsystem 429
6.5.8 Sun Enterprise Memory System Performance 429
6.5.9 Application Performance 429
6.6 Extending Cache Coherence 433
6.6.1 Shared Cache Designs 434
6.6.2 Coherence for Virtually Indexed Caches 437
6.6.3 Translation Lookaside Buffer Coherence 439
6.6.4 Snoop-Based Cache Coherence on Rings 441
6.6.5 Scaling Data and Snoop Bandwidth in Bus-Based Systems 445
6.7 Concluding Remarks 446
6.8 Exercises 446
7 Scalable Multiprocessors 453
7.1 Scalability 456
7.1.1 Bandwidth Scaling 457
7.1.2 Latency Scaling 460
7.1.3 Cost Scaling 461
7.1.4 Physical Scaling 462
7.1.5 Scaling in a Generic Parallel Architecture 467
7.2 Realizing Programming Models 468
7.2.1 Primitive Network Transactions 470
7.2.2 Shared Address Space 473
7.2.3 Message Passing 476
7.2.4 Active Messages 481
7.2.5 Common Challenges 482
7.2.6 Communication Architecture Design Space 485
7.3 Physical DMA 486
7.3.1 Node-to-Network Interface 486
7.3.2 Implementing Communication Abstractions 488
7.3.3 A Case Study: nCUBE/2 488
7.3.4 Typical LAN Interfaces 490
7.4 User-Level Access 491
7.4.1 Node-to-Network Interface 491
7.4.2 Case Study: Thinking Machines CM-5 493
7.4.3 User-Level Handlers 494
7.5 Dedicated Message Processing 496
7.5.1 Case Study: Intel Paragon 499
7.5.2 Case Study: Meiko CS-2 503
7.6 Shared Physical Address Space 506
7.6.1 Case Study: CRAY T3D 508
7.6.2 Case Study: CRAY T3E 512
7.6.3 Summary 513
7.7 Clusters and Networks of Workstations 513
7.7.1 Case Study: Myrinet SBUS Lanai 516
7.7.2 Case Study: PCI Memory Channel 518
7.8 Implications for Parallel Software 522
7.8.1 Network Transaction Performance 522
7.8.2 Shared Address Space Operations 527
7.8.3 Message-Passing Operations 528
7.8.4 Application-Level Performance 531
7.9 Synchronization 538
7.9.1 Algorithms for Locks 538
7.9.2 Algorithms for Barriers 542
7.10 Concluding Remarks 548
7.11 Exercises 548
8 Directory-Based Cache Coherence 553
8.1 Scalable Cache Coherence 558
8.2 Overview of Directory-Based Approaches 559
8.2.1 Operation of a Simple Directory Scheme 560
8.2.2 Scaling 564
8.2.3 Alternatives for Organizing Directories 565
8.3 Assessing Directory Protocols and Trade-Offs 571
8.3.1 Data Sharing Patterns for Directory Schemes 571
8.3.2 Local versus Remote Traffic 578
8.3.3 Cache Block Size Effects 579
8.4 Design Challenges for Directory Protocols 579
8.4.1 Performance 584
8.4.2 Correctness 589
8.5 Memory-Based Directory Protocols: The SGI Origin System 596
8.5.1 Cache Coherence Protocol 597
8.5.2 Dealing with Correctness Issues 604
8.5.3 Details of Directory Structure 609
8.5.4 Protocol Extensions 610
8.5.5 Overview of the Origin2000 Hardware 612
8.5.6 Hub Implementation 614
8.5.7 Performance Characteristics 618
8.6 Cache-Based Directory Protocols: The Sequent NUMA-Q 622
8.6.1 Cache Coherence Protocol 624
8.6.2 Dealing with Correctness Issues 632
8.6.3 Protocol Extensions 634
8.6.4 Overview of NUMA-Q Hardware 635
8.6.5 Protocol Interaction with SMP Node 637
8.6.6 IQ-Link Implementation 639
8.6.7 Performance Characteristics 641
8.6.8 Comparison Case Study: The HAL S1 Multiprocessor 643
8.7 Performance Parameters and Protocol Performance 645
8.8 Synchronization 648
8.8.1 Performance of Synchronization Algorithms 649
8.8.2 Implementing Atomic Primitives 651
8.9 Implications for Parallel Software 652
8.10 Advanced Topics 655
8.10.1 Reducing Directory Storage Overhead 655
8.10.2 Hierarchical Coherence 659
8.11 Concluding Remarks 669
8.12 Exercises 672
9 Hardware/Software Trade-Offs 679
9.1 Relaxed Memory Consistency Models 681
9.1.1 The System Specification 686
9.1.2 The Programmer's Interface 694
9.1.3 The Translation Mechanism 698
9.1.4 Consistency Models in Real Multiprocessor Systems 698
9.2 Overcoming Capacity Limitations 700
9.2.1 Tertiary Caches 700
9.2.2 Cache-Only Memory Architectures (COMA) 701
9.3 Reducing Hardware Cost 705
9.3.1 Hardware Access Control with a Decoupled Assist 707
9.3.2 Access Control through Code Instrumentation 707
9.3.3 Page-Based Access Control: Shared Virtual Memory 709
9.3.4 Access Control through Language and Compiler Support 721
9.4 Putting It All Together: A Taxonomy and Simple COMA 724
9.4.1 Putting It All Together: Simple COMA and Stache 726
9.5 Implications for Parallel Software 729
9.6 Advanced Topics 730
9.6.1 Flexibility and Address Constraints in CC-NUMA Systems 730
9.6.2 Implementing Relaxed Memory Consistency in Software 732
9.7 Concluding Remarks 739
9.8 Exercises 740
10 Interconnection Network Design 749
10.1 Basic Definitions 750
10.2 Basic Communication Performance 755
10.2.1 Latency 755
10.2.2 Bandwidth 761
10.3 Organizational Structure 764
10.3.1 Links 764
10.3.2 Switches 767
10.3.3 Network Interfaces 768
10.4 Interconnection Topologies 768
10.4.1 Fully Connected Network 768
10.4.2 Linear Arrays and Rings 769
10.4.3 Multidimensional Meshes and Tori 769
10.4.4 Trees 772
10.4.5 Butterflies 774
10.4.6 Hypercubes 778
10.5 Evaluating Design Trade-Offs in Network Topology 779
10.5.1 Unloaded Latency 780
10.5.2 Latency under Load 785
10.6 Routing 789
10.6.1 Routing Mechanisms 789
10.6.2 Deterministic Routing 790
10.6.3 Deadlock Freedom 791
10.6.4 Virtual Channels 795
10.6.5 Up*-Down* Routing 796
10.6.6 Turn-Model Routing 797
10.6.7 Adaptive Routing 799
10.7 Switch Design 801
10.7.1 Ports 802
10.7.2 Internal Datapath 802
10.7.3 Channel Buffers 804
10.7.4 Output Scheduling 808
10.7.5 Stacked Dimension Switches 810
10.8 Flow Control 811
10.8.1 Parallel Computer Networks versus LANs and WANs 811
10.8.2 Link-Level Flow Control 813
10.8.3 End-to-End Flow Control 816
10.9 Case Studies 818
10.9.1 CRAY T3D Network 818
10.9.2 IBM SP-1, SP-2 Network 820
10.9.3 Scalable Coherent Interface 822
10.9.4 SGI Origin Network 825
10.9.5 Myricom Network 826
10.10 Concluding Remarks 827
10.11 Exercises 828
11 Latency Tolerance 831
11.1 Overview of Latency Tolerance 834
11.1.1 Latency Tolerance and the Communication Pipeline 836
11.1.2 Approaches 837
11.1.3 Fundamental Requirements, Benefits, and Limitations 840
11.2 Latency Tolerance in Explicit Message Passing 847
11.2.1 Structure of Communication 848
11.2.2 Block Data Transfer 848
11.2.3 Precommunication 848
11.2.4 Proceeding Past Communication in the Same Thread 850
11.2.5 Multithreading 850
11.3 Latency Tolerance in a Shared Address Space 851
11.3.1 Structure of Communication 852
11.4 Block Data Transfer in a Shared Address Space 853
11.4.1 Techniques and Mechanisms 853
11.4.2 Policy Issues and Trade-Offs 854
11.4.3 Performance Benefits 856
11.5 Proceeding Past Long-Latency Events 863
11.5.1 Proceeding Past Writes 864
11.5.2 Proceeding Past Reads 868
11.5.3 Summary 876
11.6 Precommunication in a Shared Address Space 877
11.6.1 Shared Address Space without Caching of Shared Data 877
11.6.2 Cache-Coherent Shared Address Space 879
11.6.3 Performance Benefits 891
11.6.4 Summary 896
11.7 Multithreading in a Shared Address Space 896
11.7.1 Techniques and Mechanisms 898
11.7.2 Performance Benefits 910
11.7.3 Implementation Issues for the Blocked Scheme 914
11.7.4 Implementation Issues for the Interleaved Scheme 917
11.7.5 Integrating Multithreading with Multiple-Issue Processors 920
11.8 Lockup-Free Cache Design 922
11.9 Concluding Remarks 926
11.10 Exercises 927
12 Future Directions 935
12.1 Technology and Architecture 936
12.1.1 Evolutionary Scenario 937
12.1.2 Hitting a Wall 940
12.1.3 Potential Breakthroughs 944
12.2 Applications and System Software 955
12.2.1 Evolutionary Scenario 955
12.2.2 Hitting a Wall 960
12.2.3 Potential Breakthroughs 961
Appendix: Parallel Benchmark Suites 963
A.1 ScaLAPACK 963
A.2 TPC 963
A.3 SPLASH 965
A.4 NAS Parallel Benchmarks 966
A.5 PARKBENCH 967
A.6 Other Ongoing Efforts 968
References 969
Index 993