
Learn CUDA Programming (e-book)


Author: Jaegeun Han

Publisher: Packt Publishing

Publication date: 2019-09-27

Word count: 579,000


Explore different GPU programming methods using libraries and directives such as OpenACC, with extensions to languages such as C, C++, and Python.

Key Features
* Learn parallel programming principles and practices and performance analysis in GPU computing
* Get to grips with distributed multi-GPU programming and other approaches to GPU programming
* Understand how GPU acceleration in deep learning models can improve their performance

Book Description
Compute Unified Device Architecture (CUDA) is NVIDIA's GPU computing platform and application programming interface. It's designed to work with programming languages such as C, C++, and Python. With CUDA, you can leverage a GPU's parallel computing power for a range of high-performance computing applications in the fields of science, healthcare, and deep learning.

Learn CUDA Programming will help you learn GPU parallel programming and understand its modern applications. In this book, you'll discover CUDA programming approaches for modern GPU architectures. You'll not only be guided through GPU features, tools, and APIs, you'll also learn how to analyze performance with sample parallel programming algorithms. This book will help you optimize the performance of your apps by giving insights into CUDA programming platforms with various libraries, compiler directives (OpenACC), and other languages. As you progress, you'll learn how additional computing power can be generated using multiple GPUs in a box or in multiple boxes. Finally, you'll explore how CUDA accelerates deep learning algorithms, including convolutional neural networks (CNNs) and recurrent neural networks (RNNs). By the end of this CUDA book, you'll be equipped with the skills you need to integrate the power of GPU computing into your applications.

What you will learn
* Understand general GPU operations and programming patterns in CUDA
* Uncover the differences between GPU programming and CPU programming
* Analyze GPU application performance and implement optimization strategies
* Explore GPU programming, profiling, and debugging tools
* Grasp parallel programming algorithms and how to implement them
* Scale GPU-accelerated applications with multi-GPU and multi-node setups
* Delve into GPU programming platforms with accelerated libraries, Python, and OpenACC
* Gain insights into deep learning accelerators in CNNs and RNNs using GPUs

Who this book is for
This beginner-level book is for programmers who want to delve into parallel computing, become part of the high-performance computing community, and build modern applications. Basic C and C++ programming experience is assumed. For deep learning enthusiasts, this book covers Python InterOps, DL libraries, and practical examples on performance estimation.
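To give a sense of the material covered in the opening chapter (Hello World from CUDA, thread hierarchy, and vector addition using CUDA), below is a minimal CUDA C++ vector-addition sketch. It is a generic illustration of the programming model, not code reproduced from the book; the kernel name, array size, and launch configuration are arbitrary choices.

    // Minimal CUDA vector addition: each thread handles one element.
    #include <cstdio>
    #include <cuda_runtime.h>

    __global__ void vectorAdd(const float *a, const float *b, float *c, int n)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;  // global thread index
        if (i < n)                                      // guard against surplus threads
            c[i] = a[i] + b[i];
    }

    int main()
    {
        const int n = 1 << 20;                          // 1M elements
        size_t bytes = n * sizeof(float);

        // Unified memory keeps the example short; the book also covers
        // explicit cudaMalloc/cudaMemcpy device-memory management.
        float *a, *b, *c;
        cudaMallocManaged(&a, bytes);
        cudaMallocManaged(&b, bytes);
        cudaMallocManaged(&c, bytes);
        for (int i = 0; i < n; ++i) { a[i] = 1.0f; b[i] = 2.0f; }

        int threads = 256;
        int blocks = (n + threads - 1) / threads;       // enough blocks to cover n
        vectorAdd<<<blocks, threads>>>(a, b, c, n);
        cudaDeviceSynchronize();

        printf("c[0] = %.1f\n", c[0]);                  // expected: 3.0
        cudaFree(a); cudaFree(b); cudaFree(c);
        return 0;
    }

Compiled with nvcc and run on a CUDA-capable GPU, this should print c[0] = 3.0; the same block-and-thread pattern underlies the experiments listed in the table of contents below.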
Table of Contents

Dedication

About Packt

Why subscribe?

Contributors

About the authors

About the reviewers

Packt is searching for authors like you

Preface

Who this book is for

What this book covers

To get the most out of this book

Download the example code files

Download the color images

Conventions used

Get in touch

Reviews

Introduction to CUDA Programming

The history of high-performance computing

Heterogeneous computing

Programming paradigm

Low latency versus higher throughput

Programming approaches to GPU

Technical requirements

Hello World from CUDA

Thread hierarchy

GPU architecture

Vector addition using CUDA

Experiment 1 – creating multiple blocks

Experiment 2 – creating multiple threads

Experiment 3 – combining blocks and threads

Why bother with threads and blocks?

Launching kernels in multiple dimensions

Error reporting in CUDA

Data type support in CUDA

Summary

CUDA Memory Management

Technical requirements

NVIDIA Visual Profiler

Global memory/device memory

Vector addition on global memory

Coalesced versus uncoalesced global memory access

Memory throughput analysis

Shared memory

Matrix transpose on shared memory

Bank conflicts and their effect on shared memory

Read-only data/cache

Computer vision – image scaling using texture memory

Registers in GPU

Pinned memory

Bandwidth test – pinned versus pageable

Unified memory

Understanding unified memory page allocation and transfer

Optimizing unified memory with warp per page

Optimizing unified memory using data prefetching

GPU memory evolution

Why do GPUs have caches?

Summary

CUDA Thread Programming

Technical requirements

CUDA threads, blocks, and the GPU

Exploiting a CUDA block and warp

Understanding CUDA occupancy

Setting NVCC to report GPU resource usage

The settings for Linux

Settings for Windows

Analyzing the optimal occupancy using the Occupancy Calculator

Occupancy tuning – bounding register usage

Getting the achieved occupancy from the profiler

Understanding parallel reduction

Naive parallel reduction using global memory

Reducing kernels using shared memory

Writing performance measurement code

Performance comparison for the two reductions – global and shared memory

Identifying the application's performance limiter

Finding the performance limiter and optimization

Minimizing the CUDA warp divergence effect

Determining divergence as a performance bottleneck

Interleaved addressing

Sequential addressing

Performance modeling and balancing the limiter

The Roofline model

Maximizing memory bandwidth with grid-strided loops

Balancing the I/O throughput

Warp-level primitive programming

Parallel reduction with warp primitives

Cooperative Groups for flexible thread handling

Cooperative Groups in a CUDA thread block

Benefits of Cooperative Groups

Modularity

Explicit grouped threads' operation and race condition avoidance

Dynamic active thread selection

Applying to the parallel reduction

Cooperative Groups to avoid deadlock

Loop unrolling in the CUDA kernel

Atomic operations

Low/mixed precision operations

Half-precision operation

Dot product operations and accumulation for 8-bit integers and 16-bit data (DP4A and DP2A)

Measuring the performance

Summary

Kernel Execution Model and Optimization Strategies

Technical requirements

Kernel execution with CUDA streams

The usage of CUDA streams

Stream-level synchronization

Working with the default stream

Pipelining the GPU execution

Concept of GPU pipelining

Building a pipelining execution

The CUDA callback function

CUDA streams with priority

Priorities in CUDA

Stream execution with priorities

Kernel execution time estimation using CUDA events

Using CUDA events

Multiple stream estimation

CUDA dynamic parallelism

Understanding dynamic parallelism

Usage of dynamic parallelism

Recursion

Grid-level cooperative groups

Understanding grid-level cooperative groups

Usage of grid_group

CUDA kernel calls with OpenMP

OpenMP and CUDA calls

CUDA kernel calls with OpenMP

Multi-Process Service

Introduction to Message Passing Interface

Implementing an MPI-enabled application

Enabling MPS

Profiling an MPI application and understanding MPS operation

Kernel execution overhead comparison

Implementing three types of kernel executions

Comparison of three executions

Summary

CUDA Application Profiling and Debugging

Technical requirements

Profiling focused target ranges in GPU applications

Limiting the profiling target in code

Limiting the profiling target with time or GPU

Profiling with NVTX

Visual profiling against the remote machine

Debugging a CUDA application with CUDA error

Asserting local GPU values using CUDA assert

Debugging a CUDA application with Nsight Visual Studio Edition

Debugging a CUDA application with Nsight Eclipse Edition

Debugging a CUDA application with CUDA-GDB

Breakpoints of CUDA-GDB

Inspecting variables with CUDA-GDB

Listing kernel functions

Variables investigation

Runtime validation with CUDA-memcheck

Detecting memory out of bounds

Detecting other memory errors

Profiling GPU applications with Nsight Systems

Profiling a kernel with Nsight Compute

Profiling with the CLI

Profiling with the GUI

Performance analysis report

Baseline compare

Source view

Summary

Scalable Multi-GPU Programming

Technical requirements

Solving a linear equation using Gaussian elimination

Single GPU hotspot analysis of Gaussian elimination

GPUDirect peer to peer

Single node – multi-GPU Gaussian elimination

Brief introduction to MPI

GPUDirect RDMA

CUDA-aware MPI

Multinode – multi-GPU Gaussian elimination

CUDA streams

Application 1 – using multiple streams to overlap data transfers with kernel execution

Application 2 – using multiple streams to run kernels on multiple devices

Additional tricks

Benchmarking an existing system with an InfiniBand network card

NVIDIA Collective Communication Library (NCCL)

Collective communication acceleration using NCCL

Summary

Parallel Programming Patterns in CUDA

Technical requirements

Matrix multiplication optimization

Implementation of the tiling approach

Performance analysis of the tiling approach

Convolution

Convolution operation in CUDA

Optimization strategy

Filtering coefficients optimization using constant memory

Tiling input data using shared memory

Getting more performance

Prefix sum (scan)

Blelloch scan implementation

Building a global size scan

The pursuit of better performance

Other applications for the parallel prefix-sum operation

Compact and split

Implementing compact

Implementing split

N-body

Implementing an N-body simulation on GPU

Overview of an N-body simulation implementation

Histogram calculation

Compile and execution steps

Understanding a parallel histogram

Calculating a histogram with CUDA atomic functions

Quicksort in CUDA using dynamic parallelism

Quicksort and CUDA dynamic parallelism

Quicksort with CUDA

Dynamic parallelism guidelines and constraints

Radix sort

Two approaches

Approach 1 – warp-level primitives

Approach 2 – Thrust-based radix sort

Summary

Programming with Libraries and Other Languages

Linear algebra operation using cuBLAS

cuBLAS SGEMM operation

Multi-GPU operation

Mixed-precision operation using cuBLAS

GEMM with mixed precision

GEMM with TensorCore

cuRAND for parallel random number generation

cuRAND host API

cuRAND device API

cuRAND with mixed precision cuBLAS GEMM

cuFFT for Fast Fourier Transformation in GPU

Basic usage of cuFFT

cuFFT with mixed precision

cuFFT for multi-GPU

NPP for image and signal processing with GPU

Image processing with NPP

Signal processing with NPP

Applications of NPP

Writing GPU accelerated code in OpenCV

CUDA-enabled OpenCV installation

Implementing a CUDA-enabled blur filter

Enabling multi-stream processing

Writing Python code that works with CUDA

Numba – a high-performance Python compiler

Installing Numba

Using Numba with the @vectorize decorator

Using Numba with the @cuda.jit decorator

CuPy – GPU accelerated Python matrix library

Installing CuPy

Basic usage of CuPy

Implementing custom kernel functions

PyCUDA – Pythonic access to CUDA API

Installing PyCUDA

Matrix multiplication using PyCUDA

NVBLAS for zero coding acceleration in Octave and R

Configuration

Accelerating Octave's computation

Accelerating R's computation

CUDA acceleration in MATLAB

Summary

GPU Programming Using OpenACC

Technical requirements

Image merging on a GPU using OpenACC

OpenACC directives

Parallel and loop directives

Data directive

Applying the parallel, loop, and data directive to merge image code

Asynchronous programming in OpenACC

Structured data directive

Unstructured data directive

Asynchronous programming in OpenACC

Applying the unstructured data and async directives to merge image code

Additional important directives and clauses

Gang/vector/worker

Managed memory

Kernel directive

Collapse clause

Tile clause

CUDA interoperability

DevicePtr clause

Routine directive

Summary

Deep Learning Acceleration with CUDA

Technical requirements

Fully connected layer acceleration with cuBLAS

Neural network operations

Design of a neural network layer

Tensor and parameter containers

Implementing a fully connected layer

Implementing forward propagation

Implementing backward propagation

Layer termination

Activation layer with cuDNN

Layer configuration and initialization

Implementing layer operation

Implementing forward propagation

Implementing backward propagation

Softmax and loss functions in cuDNN/CUDA

Implementing the softmax layer

Implementing forward propagation

Implementing backward propagation

Implementing the loss function

MNIST dataloader

Managing and creating a model

Network training with the MNIST dataset

Convolutional neural networks with cuDNN

The convolution layer

Implementing forward propagation

Implementing backward propagation

Pooling layer with cuDNN

Implementing forward propagation

Implementing backward propagation

Network configuration

Mixed precision operations

Recurrent neural network optimization

Using the cuDNN LSTM operation

Implementing a virtual LSTM operation

Comparing the performance between cuDNN and SGEMM LSTM

Profiling deep learning frameworks

Profiling the PyTorch model

Profiling a TensorFlow model

Summary

Appendix

Useful nvidia-smi commands

Getting the GPU's information

Getting formatted information

Power management mode settings

Setting the GPU's clock speed

GPU device monitoring

Monitoring GPU utilization along with multiple processes

Getting GPU topology information

WDDM/TCC mode in Windows

Setting TCC/WDDM mode

Performance modeling

The Roofline model

Analyzing the Jacobi method

Exploring container-based development

NGC configuration for a host machine

Basic usage of the NGC container

Creating and saving a new container from the NGC container

Setting the default runtime as NVIDIA Docker

Another Book You May Enjoy

Leave a review - let other readers know what you think
