Abstract: What are the fundamental quantities needed to understand the learning process of a deep neural network? Why are some datasets easier than others? What does it mean for two tasks to have a similar structure? We argue that information-theoretic quantities, and in particular the amount of information that SGD stores in the weights, can be used to characterize the training process of a deep network. In fact, we show that the information in the weights bounds the generalization error and the invariance of the learned representation. It also allows us to connect the learning dynamics with the so-called "structure function" of the dataset, and to define a notion of distance between tasks, which relates to fine-tuning. The non-trivial dynamics of information during training give rise to phenomena, such as critical periods for learning, that closely mimic those observed in humans and may suggest that forgetting information about the training data is a necessary part of the learning process.