
Deep Learning

Ian Goodfellow, Yoshua Bengio and Aaron Courville

Contents

Website
Acknowledgments
Notation

1 Introduction  1
   1.1 Who Should Read This Book?
   1.2 Historical Trends in Deep Learning

I  Applied Math and Machine Learning Basics

2 Linear Algebra
   2.1 Scalars, Vectors, Matrices and Tensors
   2.2 Multiplying Matrices and Vectors
   2.3 Identity and Inverse Matrices
   2.4 Linear Dependence and Span
   2.5 Norms
   2.6 Special Kinds of Matrices and Vectors
   2.7 Eigendecomposition
   2.8 Singular Value Decomposition
   2.9 The Moore-Penrose Pseudoinverse
   2.10 The Trace Operator
   2.11 The Determinant
   2.12 Example: Principal Components Analysis

3 Probability and Information Theory  53
   3.1 Why Probability?  54
   3.2 Random Variables
   3.3 Probability Distributions
   3.4 Marginal Probability
   3.5 Conditional Probability
   3.6 The Chain Rule of Conditional Probabilities
   3.7 Independence and Conditional Independence
   3.8 Expectation, Variance and Covariance  60
   3.9 Common Probability Distributions
   3.10 Useful Properties of Common Functions
   3.11 Bayes' Rule  70
   3.12 Technical Details of Continuous Variables  71
   3.13 Information Theory
   3.14 Structured Probabilistic Models

4 Numerical Computation  80
   4.1 Overflow and Underflow
   4.2 Poor Conditioning
   4.3 Gradient-Based Optimization
   4.4 Constrained Optimization  93
   4.5 Example: Linear Least Squares

5 Machine Learning Basics  98
   5.1 Learning Algorithms
   5.2 Capacity, Overfitting and Underfitting
   5.3 Hyperparameters and Validation Sets  120
   5.4 Estimators, Bias and Variance
   5.5 Maximum Likelihood Estimation
   5.6 Bayesian Statistics  135
   5.7 Supervised Learning Algorithms
   5.8 Unsupervised Learning Algorithms
   5.9 Stochastic Gradient Descent  151
   5.10 Building a Machine Learning Algorithm
   5.11 Challenges Motivating Deep Learning

II  Deep Networks: Modern Practices

6 Deep Feedforward Networks  168
   6.1 Example: Learning XOR
   6.2 Gradient-Based Learning
   6.3 Hidden Units  191
   6.4 Architecture Design
   6.5 Back-Propagation and Other Differentiation Algorithms
   6.6 Historical Notes  224

7 Regularization for Deep Learning  228
   7.1 Parameter Norm Penalties
   7.2 Norm Penalties as Constrained Optimization
   7.3 Regularization and Under-Constrained Problems  239
   7.4 Dataset Augmentation
   7.5 Noise Robustness
   7.6 Semi-Supervised Learning  243
   7.7 Multitask Learning
   7.8 Early Stopping  246
   7.9 Parameter Tying and Parameter Sharing
   7.10 Sparse Representations
   7.11 Bagging and Other Ensemble Methods
   7.12 Dropout  258
   7.13 Adversarial Training
   7.14 Tangent Distance, Tangent Prop and Manifold Tangent Classifier

8 Optimization for Training Deep Models  274
   8.1 How Learning Differs from Pure Optimization  275
   8.2 Challenges in Neural Network Optimization
   8.3 Basic Algorithms
   8.4 Parameter Initialization Strategies  301
   8.5 Algorithms with Adaptive Learning Rates
   8.6 Approximate Second-Order Methods
   8.7 Optimization Strategies and Meta-Algorithms  317

9 Convolutional Networks  330
   9.1 The Convolution Operation
   9.2 Motivation  335
   9.3 Pooling
   9.4 Convolution and Pooling as an Infinitely Strong Prior
   9.5 Variants of the Basic Convolution Function  347
   9.6 Structured Outputs
   9.7 Data Types
   9.8 Efficient Convolution Algorithms  362
   9.9 Random or Unsupervised Features
   9.10 The Neuroscientific Basis for Convolutional Networks  364
   9.11 Convolutional Networks and the History of Deep Learning  371

10 Sequence Modeling: Recurrent and Recursive Nets  373
   10.1 Unfolding Computational Graphs
   10.2 Recurrent Neural Networks
   10.3 Bidirectional RNNs
   10.4 Encoder-Decoder Sequence-to-Sequence Architectures  396
   10.5 Deep Recurrent Networks  398
   10.6 Recursive Neural Networks
   10.7 The Challenge of Long-Term Dependencies
   10.8 Echo State Networks
   10.9 Leaky Units and Other Strategies for Multiple Time Scales
   10.10 The Long Short-Term Memory and Other Gated RNNs  408
   10.11 Optimization for Long-Term Dependencies  413
   10.12 Explicit Memory

11 Practical Methodology  421
   11.1 Performance Metrics
   11.2 Default Baseline Models
   11.3 Determining Whether to Gather More Data  426
   11.4 Selecting Hyperparameters  427
   11.5 Debugging Strategies
   11.6 Example: Multi-Digit Number Recognition

12 Applications  443
   12.1 Large-Scale Deep Learning
   12.2 Computer Vision  452
   12.3 Speech Recognition  458
   12.4 Natural Language Processing
   12.5 Other Applications

III  Deep Learning Research  486

13 Linear Factor Models  489
   13.1 Probabilistic PCA and Factor Analysis  490
   13.2 Independent Component Analysis (ICA)  491
   13.3 Slow Feature Analysis
   13.4 Sparse Coding
   13.5 Manifold Interpretation of PCA  499

14 Autoencoders  502
   14.1 Undercomplete Autoencoders
   14.2 Regularized Autoencoders
   14.3 Representational Power, Layer Size and Depth
   14.4 Stochastic Encoders and Decoders
   14.5 Denoising Autoencoders  510
   14.6 Learning Manifolds with Autoencoders  515
   14.7 Contractive Autoencoders
   14.8 Predictive Sparse Decomposition
   14.9 Applications of Autoencoders

15 Representation Learning  526
   15.1 Greedy Layer-Wise Unsupervised Pretraining  528
   15.2 Transfer Learning and Domain Adaptation  536
   15.3 Semi-Supervised Disentangling of Causal Factors
   15.4 Distributed Representation
   15.5 Exponential Gains from Depth
   15.6 Providing Clues to Discover Underlying Causes  554

16 Structured Probabilistic Models for Deep Learning  558
   16.1 The Challenge of Unstructured Modeling
   16.2 Using Graphs to Describe Model Structure
   16.3 Sampling from Graphical Models
   16.4 Advantages of Structured Modeling
   16.5 Learning about Dependencies  582
   16.6 Inference and Approximate Inference  584
   16.7 The Deep Learning Approach to Structured Probabilistic Models

17 Monte Carlo Methods  590
   17.1 Sampling and Monte Carlo Methods
   17.2 Importance Sampling
   17.3 Markov Chain Monte Carlo Methods  595
   17.4 Gibbs Sampling  599
   17.5 The Challenge of Mixing between Separated Modes

18 Confronting the Partition Function  605
   18.1 The Log-Likelihood Gradient
   18.2 Stochastic Maximum Likelihood and Contrastive Divergence
   18.3 Pseudolikelihood
   18.4 Score Matching and Ratio Matching
   18.5 Denoising Score Matching
   18.6 Noise-Contrastive Estimation  620
   18.7 Estimating the Partition Function  623

19 Approximate Inference  631
   19.1 Inference as Optimization
   19.2 Expectation Maximization
   19.3 MAP Inference and Sparse Coding
   19.4 Variational Inference and Learning  638
   19.5 Learned Approximate Inference  651

20 Deep Generative Models  654
   20.1 Boltzmann Machines
   20.2 Restricted Boltzmann Machines
   20.3 Deep Belief Networks
   20.4 Deep Boltzmann Machines  663
   20.5 Boltzmann Machines for Real-Valued Data  676
   20.6 Convolutional Boltzmann Machines
   20.7 Boltzmann Machines for Structured or Sequential Outputs
   20.8 Other Boltzmann Machines
   20.9 Back-Propagation through Random Operations
   20.10 Directed Generative Nets  692
   20.11 Drawing Samples from Autoencoders  711
   20.12 Generative Stochastic Networks  714
   20.13 Other Generation Schemes  716
   20.14 Evaluating Generative Models  717
   20.15 Conclusion  720

Bibliography
Index

Website

This book is accompanied by the website www.deeplearningbook.org, which provides a variety of supplementary material for both readers and instructors.

Acknowledgments

This book would not have been possible without the contributions of many people.

We would like to thank those who commented on our proposal for the book and helped plan its contents and organization: Guillaume Alain, Kyunghyun Cho, Çağlar Gülçehre, David Krueger, Hugo Larochelle, Razvan Pascanu and Thomas Rohée.


We would also like to thank those who provided feedback on individual chapters:

Chapter 5, Machine Learning Basics: Dzmitry Bahdanau, Justin Domingue, Nikhil Garg, Makoto Otsuka, Bob Pepin, Philip Popien, Emmanuel Rayner, Peter Shepard, Kee-Bong Song, Zheng Sun and Andy Wu.

Chapter 6, Deep Feedforward Networks: Uriel Berdugo, Fabrizio Bottarel, Elizabeth Burl, Ishan Durugkar, Jeff Hlywa, Jong Wook Kim, David Krueger and Aditya Kumar Praharaj.

Chapter 20, Deep Generative Models: Nicolas Chapados, Daniel Galvez, Wenming Ma, Fady Medhat, Shakir Mohamed and Grégoire Montavon.

We would like to thank the Google Brain team for providing an intellectual environment where Ian could devote a tremendous amount of time to writing this book and receive feedback and guidance from colleagues. We would especially like to thank Ian’s former manager, Greg Corrado, and his current manager, Samy Bengio, for their support of this project. Finally, we would like to thank Geoffrey Hinton for encouragement when writing was difficult.


Notation

Numbers and Arrays
   a                    A scalar (integer or real)
   a (bold lowercase)   A vector
   A (bold uppercase)   A matrix
   A (sans-serif)       A tensor
   I_n                  Identity matrix with n rows and n columns
   I                    Identity matrix with dimensionality implied by context
   e^(i)                Standard basis vector [0, ..., 0, 1, 0, ..., 0] with a 1 at position i
   diag(a)              A square, diagonal matrix with diagonal entries given by a

Sets and Graphs
   𝔸                    A set
   ℝ                    The set of real numbers
   {0, 1}               The set containing 0 and 1
   {0, 1, ..., n}       The set of all integers between 0 and n
   [a, b]               The real interval including a and b
   (a, b]               The real interval excluding a but including b
   𝔸 \ 𝔹                Set subtraction, i.e., the set containing the elements of 𝔸 that are not in 𝔹

Indexing
   a_i                  Element i of vector a, with indexing starting at 1
   a_{-i}               All elements of vector a except for element i
   A_{i,j}              Element i, j of matrix A
   A_{i,:}              Row i of matrix A
   A_{:,i}              Column i of matrix A
   A_{i,j,k}            Element (i, j, k) of a 3-D tensor A
   A_{:,:,i}            2-D slice of a 3-D tensor
   a_i                  Element i of the random vector a

Linear Algebra Operations
   A^T                  Transpose of matrix A
   A^+                  Moore-Penrose pseudoinverse of A
   A ⊙ B                Element-wise (Hadamard) product of A and B
   det(A)               Determinant of A

Calculus
   dy/dx                Derivative of y with respect to x
   ∂y/∂x                Partial derivative of y with respect to x
   ∇_x y                Gradient of y with respect to x
   ∇_X y                Matrix derivatives of y with respect to X
   ∂f/∂x                Jacobian matrix J ∈ ℝ^{m×n} of f : ℝ^n → ℝ^m
   ∇_x^2 f(x) or H(f)(x)   The Hessian matrix of f at input point x

Probability and Information Theory
   a ⊥ b                The random variables a and b are independent
   a ⊥ b | c            They are conditionally independent given c
   P(a)                 A probability distribution over a discrete variable
   p(a)                 A probability distribution over a continuous variable, or over a variable whose type has not been specified
   a ~ P                Random variable a has distribution P
   E_{x~P}[f(x)]        Expectation of f(x) with respect to P(x)
   Var(f(x))            Variance of f(x) under P(x)
   Cov(f(x), g(x))      Covariance of f(x) and g(x) under P(x)
   H(x)                 Shannon entropy of the random variable x
   D_KL(P ‖ Q)          Kullback-Leibler divergence of P and Q
   N(x; μ, Σ)           Gaussian distribution over x with mean μ and covariance Σ

Functions
   f : 𝔸 → 𝔹            The function f with domain 𝔸 and range 𝔹
   f ∘ g                Composition of the functions f and g
   f(x; θ)              A function of x parametrized by θ (sometimes we write f(x) and omit the argument θ to lighten notation)
   log x                Natural logarithm of x
   σ(x)                 Logistic sigmoid, 1 / (1 + exp(-x))
   ζ(x)                 Softplus, log(1 + exp(x))
   ||x||_p              L^p norm of x
   ||x||                L^2 norm of x
   x^+                  Positive part of x, i.e., max(0, x)

Sometimes we use a function f whose argument is a scalar but apply it to a vector, matrix, or tensor: f(x), f(X), or f(X). This denotes the application of f to the array element-wise. For example, if C = σ(X), then C_{i,j,k} = σ(X_{i,j,k}) for all valid values of i, j and k.
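A minimal sketch in Python with NumPy illustrates this element-wise convention; the function name sigmoid and the array shapes below are arbitrary choices for illustration, not notation from the book.

import numpy as np

def sigmoid(x):
    # Logistic sigmoid sigma(x) = 1 / (1 + exp(-x)); NumPy applies it to every entry.
    return 1.0 / (1.0 + np.exp(-x))

X = np.random.randn(2, 3, 4)   # a 3-D array standing in for the tensor X
C = sigmoid(X)                 # C = sigma(X), applied element-wise
assert C.shape == X.shape
assert np.isclose(C[0, 1, 2], sigmoid(X[0, 1, 2]))

The expression is evaluated for every entry of the array, which is exactly the convention described above.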

Datasets and Distributions
   𝕏                    A set of training examples
   x^(i)                The i-th example (input) from a dataset
   y^(i) (scalar or vector)   The target associated with x^(i) for supervised learning
   X                    The m × n matrix with input example x^(i) in row X_{i,:}

Introduction

Inventors have long dreamed of creating machines that think. This desire dates back to at least the time of ancient Greece. The mythical figures Pygmalion, Daedalus, and Hephaestus may all be interpreted as legendary inventors, and Galatea, Talos, and Pandora may all be regarded as artificial life (Ovid and Martin, 2004; Sparkes, 1996; Tandy, 1997).


If we draw a graph showing how these concepts are built on top of each other, the graph is deep, with many layers. For this reason, we call this approach to AI deep learning.

The difficulties faced by systems relying on hard-coded knowledge suggest that AI systems need the ability to acquire their own knowledge, by extracting patterns from raw data. This capability is known as machine learning. The introduction of machine learning enabled computers to tackle problems involving knowledge of the real world and make decisions that appear subjective.
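As a minimal sketch of what extracting patterns from raw data can look like, the following Python code fits a small logistic regression classifier to synthetic data by gradient descent; the data, learning rate and number of iterations are arbitrary illustrative choices.

import numpy as np

rng = np.random.default_rng(0)

# Synthetic "raw data": 200 two-dimensional points from two overlapping classes.
X = np.vstack([rng.normal(-1.0, 1.0, size=(100, 2)),
               rng.normal(+1.0, 1.0, size=(100, 2))])
y = np.concatenate([np.zeros(100), np.ones(100)])

w = np.zeros(2)   # weights learned from the data
b = 0.0           # bias term
lr = 0.1          # learning rate

for _ in range(500):
    # Predicted probability of class 1 for every example.
    p = 1.0 / (1.0 + np.exp(-(X @ w + b)))
    # Gradient step on the average log-loss with respect to w and b.
    w -= lr * (X.T @ (p - y)) / len(y)
    b -= lr * np.mean(p - y)

print("training accuracy:", np.mean((p > 0.5) == y))

The learned weights are knowledge extracted from the data rather than rules written by hand.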


Many artificial intelligence tasks can be solved by designing the right set of features to extract for that task, then providing these features to a simple machine learning algorithm. For example, a useful feature for speaker identification from sound is an estimate of the size of the speaker's vocal tract. It therefore gives a strong clue as to whether the speaker is a man, woman, or child.

However, for many tasks, it is difficult to know what features should be extracted. For example, suppose that we would like to write a program to detect cars in photographs. We know that cars have wheels, so we might like to use the presence of a wheel as a feature. Unfortunately, it is difficult to describe exactly what a wheel looks like in terms of pixel values. A wheel has a simple geometric shape but its image may be complicated by shadows falling on the wheel, the sun glaring off the metal parts of the wheel, the fender of the car or an object in the foreground obscuring part of the wheel, and so on.


Most applications require us to disentangle the factors of variation and discard the ones that we do not care about.

Of course, it can be very difficult to extract such high-level, abstract features from raw data. Many of these factors of variation, such as a speaker’s accent, can be identified only using sophisticated, nearly human-level understanding of the data. When it is nearly as difficult to obtain a representation as to solve the original problem, representation learning does not, at first glance, seem to help us.

Deep learning solves this central problem in representation learning by introducing representations that are expressed in terms of other, simpler representations. Deep learning allows the computer to build complex concepts out of simpler concepts. Figure 1.2 shows how a deep learning system can represent the concept of an image of a person by combining simpler concepts, such as corners and contours, which are in turn defined in terms of edges.

The idea of learning the right representation for the data provides one perspective on deep learning. Another perspective on deep learning is that depth allows the computer to learn a multi-step computer program. Each layer of the representation can be thought of as the state of the computer's memory after executing another set of instructions in parallel. Networks with greater depth can execute more instructions in sequence. Sequential instructions offer great power because later instructions can refer back to the results of earlier instructions.
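A minimal sketch of this view, written in Python with NumPy, treats a deep network as a list of layer functions applied one after another; the layer sizes and the ReLU nonlinearity here are arbitrary illustrative choices rather than a specific model from the book.

import numpy as np

rng = np.random.default_rng(1)

def layer(W, b):
    # One "instruction step": an affine map followed by an element-wise ReLU.
    return lambda h: np.maximum(0.0, h @ W + b)

# Three layer functions executed in sequence; the depth of the network
# is the length of this program.
layers = [layer(rng.normal(size=(4, 8)), np.zeros(8)),
          layer(rng.normal(size=(8, 8)), np.zeros(8)),
          layer(rng.normal(size=(8, 2)), np.zeros(2))]

h = rng.normal(size=(1, 4))    # the input, i.e. the initial "memory state"
for f in layers:
    h = f(h)                   # each layer rewrites the state left by the previous one
print(h.shape)                 # (1, 2): the output of the final step

Each pass through the loop is one more instruction executed in sequence, so adding layers lengthens the program the network can express.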
