Until now, all of my blog posts have been about deep learning methods or their applications to NLP. Over the last couple of weeks, however, I have started learning about Automatic Speech Recognition (ASR), so I will now also be including speech-related articles in this publication.
The core logic of ASR is very simple (it’s just Bayes’ rule, like most other things in machine learning). Essentially, given a speech waveform, the objective is to transcribe it, i.e., to convert it into the most likely sequence of words.
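In symbols (my notation, not anything specific to this post): if $X$ denotes the acoustic observations and $W$ a candidate word sequence, we want

$$ \hat{W} = \arg\max_{W} P(W \mid X) = \arg\max_{W} \frac{p(X \mid W)\, P(W)}{p(X)} = \arg\max_{W} p(X \mid W)\, P(W), $$

where $p(X \mid W)$ is the acoustic model and $P(W)$ is the language model; the evidence $p(X)$ drops out of the maximization since it does not depend on $W$.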
Sparse vectors have become popular recently for 2 reasons:
1. Sparse matrices require much less storage, since they can be stored using various space-saving formats.
2. Sparse vectors are much more interpretable than dense vectors. For instance, the non-zero, non-negative components of a sparse word vector may be taken to denote the weights for certain features. In contrast, there is no interpretation for a value like $-0.1347$.

Sparsity is often induced through the use of L1 (or Lasso) regularization, as in the sketch below.
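As a quick sketch of that last point (illustrative only: the synthetic data, the `alpha` value, and the use of scikit-learn’s `Lasso` are my own choices), an L1 penalty drives most of the learned weights exactly to zero:

```python
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 50))        # 100 samples, 50 features
w_true = np.zeros(50)
w_true[:5] = rng.normal(size=5)       # only the first 5 features actually matter
y = X @ w_true + 0.01 * rng.normal(size=100)

model = Lasso(alpha=0.1).fit(X, y)    # alpha controls the L1 penalty strength
print(np.count_nonzero(model.coef_))  # most of the 50 coefficients are exactly zero
```

The surviving non-zero coefficients can then be mapped back to the handful of features that matter, which is exactly the interpretability argument above.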
I just finished reading Sebastian Ruder’s amazing article providing an overview of the most popular algorithms used for optimizing gradient descent. Here I’ll make very short notes on them primarily for purposes of recall.
**Momentum**: The update vector includes an additional term, the previous update vector weighted by $\gamma$. This helps the parameters move faster downhill, like a ball gathering momentum as it rolls.
$$ v_t = \gamma v_{t-1} + \eta \nabla_{\theta} J(\theta) $$
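The parameters are then updated as $\theta = \theta - v_t$. A minimal NumPy sketch of this rule (the toy quadratic objective, $\eta = 0.01$, and $\gamma = 0.9$ are arbitrary choices for illustration):

```python
import numpy as np

# Toy objective J(theta) = 0.5 * theta^T A theta, whose gradient is A @ theta.
A = np.diag([1.0, 10.0])            # ill-conditioned quadratic (made-up example)
grad = lambda theta: A @ theta

theta = np.array([5.0, 5.0])        # initial parameters
v = np.zeros_like(theta)            # the momentum (velocity) vector
eta, gamma = 0.01, 0.9              # learning rate and momentum coefficient

for _ in range(1000):
    v = gamma * v + eta * grad(theta)   # v_t = gamma * v_{t-1} + eta * grad J(theta)
    theta = theta - v                   # theta_t = theta_{t-1} - v_t

print(theta)  # approaches the minimum at the origin
```

Because past gradients keep contributing through $v$, steps along consistently downhill directions grow, while oscillating directions partly cancel out.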