Deep Learning
Ian Goodfellow
Yoshua Bengio
Aaron Courville
Contents

Website
Acknowledgments
Notation

1 Introduction
    1.1 Who Should Read This Book?
    1.2 Historical Trends in Deep Learning

I Applied Math and Machine Learning Basics

2 Linear Algebra
    2.1 Scalars, Vectors, Matrices and Tensors
    2.2 Multiplying Matrices and Vectors
    2.3 Identity and Inverse Matrices
    2.4 Linear Dependence and Span
    2.5 Norms
    2.6 Special Kinds of Matrices and Vectors
    2.7 Eigendecomposition
    2.8 Singular Value Decomposition
    2.9 The Moore-Penrose Pseudoinverse
    2.10 The Trace Operator
    2.11 The Determinant
    2.12 Example: Principal Components Analysis

3 Probability and Information Theory
    3.1 Why Probability?
    3.2 Random Variables
    3.3 Probability Distributions
    3.4 Marginal Probability
    3.5 Conditional Probability
    3.6 The Chain Rule of Conditional Probabilities
    3.7 Independence and Conditional Independence
    3.8 Expectation, Variance and Covariance
    3.9 Common Probability Distributions
    3.10 Useful Properties of Common Functions
    3.11 Bayes' Rule
    3.12 Technical Details of Continuous Variables
    3.13 Information Theory
    3.14 Structured Probabilistic Models

4 Numerical Computation
    4.1 Overflow and Underflow
    4.2 Poor Conditioning
    4.3 Gradient-Based Optimization
    4.4 Constrained Optimization
    4.5 Example: Linear Least Squares

5 Machine Learning Basics
    5.1 Learning Algorithms
    5.2 Capacity, Overfitting and Underfitting
    5.3 Hyperparameters and Validation Sets
    5.4 Estimators, Bias and Variance
    5.5 Maximum Likelihood Estimation
    5.6 Bayesian Statistics
    5.7 Supervised Learning Algorithms
    5.8 Unsupervised Learning Algorithms
    5.9 Stochastic Gradient Descent
    5.10 Building a Machine Learning Algorithm
    5.11 Challenges Motivating Deep Learning

II Deep Networks: Modern Practices

6 Deep Feedforward Networks
    6.1 Example: Learning XOR
    6.2 Gradient-Based Learning
    6.3 Hidden Units
    6.4 Architecture Design
    6.5 Back-Propagation and Other Differentiation Algorithms
    6.6 Historical Notes

7 Regularization for Deep Learning
    7.1 Parameter Norm Penalties
    7.2 Norm Penalties as Constrained Optimization
    7.3 Regularization and Under-Constrained Problems
    7.4 Dataset Augmentation
    7.5 Noise Robustness
    7.6 Semi-Supervised Learning
    7.7 Multi-Task Learning
    7.8 Early Stopping
    7.9 Parameter Tying and Parameter Sharing
    7.10 Sparse Representations
    7.11 Bagging and Other Ensemble Methods
    7.12 Dropout
    7.13 Adversarial Training
    7.14 Tangent Distance, Tangent Prop, and Manifold Tangent Classifier

8 Optimization for Training Deep Models
    8.1 How Learning Differs from Pure Optimization
    8.2 Challenges in Neural Network Optimization
    8.3 Basic Algorithms
    8.4 Parameter Initialization Strategies
    8.5 Algorithms with Adaptive Learning Rates
    8.6 Approximate Second-Order Methods
    8.7 Optimization Strategies and Meta-Algorithms

9 Convolutional Networks
    9.1 The Convolution Operation
    9.2 Motivation
    9.3 Pooling
    9.4 Convolution and Pooling as an Infinitely Strong Prior
    9.5 Variants of the Basic Convolution Function
    9.6 Structured Outputs
    9.7 Data Types
    9.8 Efficient Convolution Algorithms
    9.9 Random or Unsupervised Features
    9.10 The Neuroscientific Basis for Convolutional Networks
    9.11 Convolutional Networks and the History of Deep Learning

10 Sequence Modeling: Recurrent and Recursive Nets
    10.1 Unfolding Computational Graphs
    10.2 Recurrent Neural Networks
    10.3 Bidirectional RNNs
    10.4 Encoder-Decoder Sequence-to-Sequence Architectures
    10.5 Deep Recurrent Networks
    10.6 Recursive Neural Networks
    10.7 The Challenge of Long-Term Dependencies
    10.8 Echo State Networks
    10.9 Leaky Units and Other Strategies for Multiple Time Scales
    10.10 The Long Short-Term Memory and Other Gated RNNs
    10.11 Optimization for Long-Term Dependencies
    10.12 Explicit Memory

11 Practical Methodology
    11.1 Performance Metrics
    11.2 Default Baseline Models
    11.3 Determining Whether to Gather More Data
    11.4 Selecting Hyperparameters
    11.5 Debugging Strategies
    11.6 Example: Multi-Digit Number Recognition

12 Applications
    12.1 Large-Scale Deep Learning
    12.2 Computer Vision
    12.3 Speech Recognition
    12.4 Natural Language Processing
    12.5 Other Applications

III Deep Learning Research

13 Linear Factor Models
    13.1 Probabilistic PCA and Factor Analysis
    13.2 Independent Component Analysis (ICA)
    13.3 Slow Feature Analysis
    13.4 Sparse Coding
    13.5 Manifold Interpretation of PCA

14 Autoencoders
    14.1 Undercomplete Autoencoders
    14.2 Regularized Autoencoders
    14.3 Representational Power, Layer Size and Depth
    14.4 Stochastic Encoders and Decoders
    14.5 Denoising Autoencoders
    14.6 Learning Manifolds with Autoencoders
    14.7 Contractive Autoencoders
    14.8 Predictive Sparse Decomposition
    14.9 Applications of Autoencoders

15 Representation Learning
    15.1 Greedy Layer-Wise Unsupervised Pretraining
    15.2 Transfer Learning and Domain Adaptation
    15.3 Semi-Supervised Disentangling of Causal Factors
    15.4 Distributed Representation
    15.5 Exponential Gains from Depth
    15.6 Providing Clues to Discover Underlying Causes

16 Structured Probabilistic Models for Deep Learning
    16.1 The Challenge of Unstructured Modeling
    16.2 Using Graphs to Describe Model Structure
    16.3 Sampling from Graphical Models
    16.4 Advantages of Structured Modeling
    16.5 Learning about Dependencies
    16.6 Inference and Approximate Inference
    16.7 The Deep Learning Approach to Structured Probabilistic Models

17 Monte Carlo Methods
    17.1 Sampling and Monte Carlo Methods
    17.2 Importance Sampling
    17.3 Markov Chain Monte Carlo Methods
    17.4 Gibbs Sampling
    17.5 The Challenge of Mixing between Separated Modes

18 Confronting the Partition Function
    18.1 The Log-Likelihood Gradient
    18.2 Stochastic Maximum Likelihood and Contrastive Divergence
    18.3 Pseudolikelihood
    18.4 Score Matching and Ratio Matching
    18.5 Denoising Score Matching
    18.6 Noise-Contrastive Estimation
    18.7 Estimating the Partition Function

19 Approximate Inference
    19.1 Inference as Optimization
    19.2 Expectation Maximization
    19.3 MAP Inference and Sparse Coding
    19.4 Variational Inference and Learning
    19.5 Learned Approximate Inference

20 Deep Generative Models
    20.1 Boltzmann Machines
    20.2 Restricted Boltzmann Machines
    20.3 Deep Belief Networks
    20.4 Deep Boltzmann Machines
    20.5 Boltzmann Machines for Real-Valued Data
    20.6 Convolutional Boltzmann Machines
    20.7 Boltzmann Machines for Structured or Sequential Outputs
    20.8 Other Boltzmann Machines
    20.9 Back-Propagation through Random Operations
    20.10 Directed Generative Nets
    20.11 Drawing Samples from Autoencoders
    20.12 Generative Stochastic Networks
    20.13 Other Generation Schemes
    20.14 Evaluating Generative Models
    20.15 Conclusion
Acknowledgments
This book would not have been possible without the contributions of many people.
We would like to thank those who commented on our proposal for the book and helped plan its contents and organization: Guillaume Alain, Kyunghyun Cho, Çağlar Gülçehre, David Krueger, Hugo Larochelle, Razvan Pascanu and Thomas Rohée.
We would also like to thank those who provided us with useful feedback on individual chapters:
• Chapter 5, Machine Learning Basics: Dzmitry Bahdanau, Justin Domingue, Nikhil Garg, Makoto Otsuka, Bob Pepin, Philip Popien, Emmanuel Rayner, Peter Shepard, Kee-Bong Song, Zheng Sun and Andy Wu.
• Chapter 6, Deep Feedforward Networks: Uriel Berdugo, Fabrizio Bottarel, Elizabeth Burl, Ishan Durugkar, Jeff Hlywa, Jong Wook Kim, David Krueger and Aditya Kumar Praharaj.
• Chapter …: Wenming Ma, Fady Medhat, Shakir Mohamed and Grégoire Montavon.
We would like to thank the Google Brain team for providing an intellectual environment where Ian could devote a tremendous amount of time to writing this book and receive feedback and guidance from colleagues. We would especially like to thank Ian’s former manager, Greg Corrado, and his current manager, Samy Bengio, for their support of this project. Finally, we would like to thank Geoffrey Hinton for encouragement when writing was difficult.
Notation

Numbers and Arrays

In    Identity matrix with n rows and n columns
e(i)    Standard basis vector [0, . . . , 0, 1, 0, . . . , 0] with a 1 at position i
diag(a)    A square, diagonal matrix with diagonal entries given by a

Sets and Graphs

{0, 1}    The set containing 0 and 1
{0, 1, . . . , n}    The set of all integers between 0 and n
[a, b]    The real interval including a and b
(a, b]    The real interval excluding a but including b
A \ B    Set subtraction, i.e., the set containing the elements of A that are not in B

Indexing

ai    Element i of vector a, with indexing starting at 1
a−i    All elements of vector a except for element i
Ai,j    Element i, j of matrix A
Ai,:    Row i of matrix A
A:,i    Column i of matrix A
A:,:,i    2-D slice of a 3-D tensor
ai    Element i of the random vector a

Linear Algebra Operations

A⊤    Transpose of matrix A
A+    Moore-Penrose pseudoinverse of A
A ⊙ B    Element-wise (Hadamard) product of A and B
det(A)    Determinant of A

Calculus

dy/dx    Derivative of y with respect to x
∂y/∂x    Partial derivative of y with respect to x
∇x y    Gradient of y with respect to x
∇X y    Matrix containing derivatives of y with respect to X
∇X y    Tensor containing derivatives of y with respect to the tensor X
∂f/∂x    Jacobian matrix J ∈ Rm×n of f : Rn → Rm
∇2x f(x) or H(f)(x)    The Hessian matrix of f at input point x

Probability and Information Theory

a⊥b    The random variables a and b are independent
a⊥b | c    They are conditionally independent given c
P(a)    A probability distribution over a discrete variable
p(a)    A probability distribution over a continuous variable, or over a variable whose type has not been specified
a ∼ P    Random variable a has distribution P
Ex∼P[f(x)] or Ef(x)    Expectation of f(x) with respect to P(x)
Var(f(x))    Variance of f(x) under P(x)
Cov(f(x), g(x))    Covariance of f(x) and g(x) under P(x)

Functions

f : A → B    The function f with domain A and range B
f ◦ g    Composition of the functions f and g
f(x; θ)    A function of x parametrized by θ. (Sometimes we write f(x) and omit the argument θ to lighten notation.)
log x    Natural logarithm of x
σ(x)    Logistic sigmoid, 1/(1 + exp(−x))
||x||p    Lp norm of x
||x||    L2 norm of x
x+    Positive part of x, i.e., max(0, x)
Sometimes we use a function f whose argument is a scalar but apply it to a vector, matrix, or tensor: f(x), f(X), or f(X). This denotes the application of f to the array element-wise. For example, if C = σ(X), then Ci,j,k = σ(Xi,j,k) for all valid values of i, j and k.
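This element-wise convention can be checked directly in code. The following is a minimal sketch in NumPy (not part of the book; the names sigma, X and C simply mirror the notation above): applying a scalar function to a 3-D array produces a tensor C with Ci,j,k = σ(Xi,j,k).

```python
# Minimal sketch of element-wise application of a scalar function (assumed
# names, not from the book): C[i, j, k] = sigma(X[i, j, k]) for all i, j, k.
import numpy as np

def sigma(z):
    """Logistic sigmoid, applied to a scalar or element-wise to an array."""
    return 1.0 / (1.0 + np.exp(-z))

X = np.random.randn(2, 3, 4)   # a 3-D tensor X
C = sigma(X)                   # element-wise application of sigma

i, j, k = 1, 2, 3
assert np.isclose(C[i, j, k], sigma(X[i, j, k]))
```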
Datasets and Distributions

X    The m × n matrix with input example x(i) in row Xi,:
Introduction
Inventors have long dreamed of creating machines that think. This desire dates back to at least the time of ancient Greece. The mythical figures Pygmalion, Daedalus, and Hephaestus may all be interpreted as legendary inventors, and Galatea, Talos, and Pandora may all be regarded as artificial life (Ovid and Martin, 2004; Sparkes, 1996; Tandy, 1997).
If we draw a graph showing how these concepts are built on top of each other, the graph is deep, with many layers. For this reason, we call this approach to AI deep learning.
The difficulties faced by systems relying on hard-coded knowledge suggest that AI systems need the ability to acquire their own knowledge, by extracting patterns from raw data. This capability is known as machine learning. The
Many artificial intelligence tasks can be solved by designing the right set of features to extract for that task, then providing these features to a simple machine learning algorithm. For example, a useful feature for speaker identification from sound is an estimate of the size of the speaker's vocal tract. It therefore gives a strong clue as to whether the speaker is a man, woman, or child.
However, for many tasks, it is difficult to know what features should be extracted. For example, suppose that we would like to write a program to detect cars in photographs. We know that cars have wheels, so we might like to use the presence of a wheel as a feature. Unfortunately, it is difficult to describe exactly what a wheel looks like in terms of pixel values. A wheel has a simple geometric shape but its image may be complicated by shadows falling on the wheel, the sun glaring off the metal parts of the wheel, the fender of the car or an object in the foreground obscuring part of the wheel, and so on.
Most applications require us to disentangle the factors of variation and discard the ones that we do not care about.
Of course, it can be very difficult to extract such high-level, abstract features from raw data. Many of these factors of variation, such as a speaker’s accent, can be identified only using sophisticated, nearly human-level understanding of the data. When it is nearly as difficult to obtain a representation as to solve the original problem, representation learning does not, at first glance, seem to help us.
Deep learning solves this central problem in representation learning by introducing representations that are expressed in terms of other, simpler representations. Deep learning enables the computer to build complex concepts out of simpler concepts. Figure 1.2 shows how a deep learning system can represent the concept of an image of a person by combining simpler concepts, such as corners and contours, which are in turn defined in terms of edges.
The idea of learning the right representation for the data provides one perspective on deep learning. Another perspective on deep learning is that depth allows the computer to learn a multi-step computer program. Each layer of the representation can be thought of as the state of the computer's memory after executing another set of instructions in parallel. Networks with greater depth can execute more instructions in sequence. Sequential instructions offer great power because later
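To make this sequential-program view concrete, the following is a minimal sketch (ours, not from the book; the layer width, the tanh nonlinearity and the random weights are arbitrary assumptions) in which each layer applies one more set of instructions, and each intermediate result plays the role of the memory state read by the next step.

```python
# Minimal sketch (assumed example): depth viewed as a sequential program.
# Each "layer" executes one more set of instructions, and each intermediate
# variable is the memory state available to the following step.
import numpy as np

rng = np.random.default_rng(0)
W1, W2, W3 = (rng.standard_normal((4, 4)) for _ in range(3))

def layer(W, h):
    """One step of the program: an affine map followed by a nonlinearity."""
    return np.tanh(W @ h)

x = rng.standard_normal(4)
h1 = layer(W1, x)    # memory state after the first set of instructions
h2 = layer(W2, h1)   # a deeper network executes more such steps in sequence
h3 = layer(W3, h2)   # later steps can refer back to what earlier steps computed
```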