Lecture 2. Deep learning.


It’s the second lecture (https://youtu.be/dqoEU9Ac3ek?si=mxjV2xJUdJLDcGNJ). I learned a lot about Deep Sequence Modeling. There are many variations of the setup for different tasks:

  • one to one - useful for binary classification,
  • many to one - for sentiment analysis,
  • one to many - for image captioning,
  • many to many - for machine translation.

We are mostly building Recurrent Neural Networks (RNNs), so they need to meet a few design criteria:

  • handle variable length sequences,
  • maintain information about order,
  • track long-term dependencies,
  • share parameters across the sequence.

To achieve this, the network has to maintain state as it processes the sequence over time. We do that by adding a hidden state, so we get y_t = f(x_t, h_{t-1}), where h_{t-1} is the memory carried over from the previous step.
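Here is a minimal NumPy sketch of that recurrence. It is not the lecture’s code; the weight names (W_hh, W_xh, W_hy) and the sizes are assumptions made just for illustration:

```python
import numpy as np

# Toy dimensions, chosen arbitrarily for the sketch.
hidden_dim, input_dim, output_dim = 8, 4, 3
rng = np.random.default_rng(0)

W_hh = rng.normal(size=(hidden_dim, hidden_dim)) * 0.1  # hidden-to-hidden weights
W_xh = rng.normal(size=(hidden_dim, input_dim)) * 0.1   # input-to-hidden weights
W_hy = rng.normal(size=(output_dim, hidden_dim)) * 0.1  # hidden-to-output weights

def rnn_step(x_t, h_prev):
    """One recurrence step: update the hidden state, then produce an output."""
    h_t = np.tanh(W_hh @ h_prev + W_xh @ x_t)  # new memory h_t from x_t and h_{t-1}
    y_t = W_hy @ h_t                           # output y_t = f(x_t, h_{t-1})
    return y_t, h_t

# Unroll over a toy sequence of 5 steps; the same weights are shared at every step.
h = np.zeros(hidden_dim)
for x in rng.normal(size=(5, input_dim)):
    y, h = rnn_step(x, h)
```

Note how the loop handles any sequence length and the weights are reused at every step, which is exactly the parameter-sharing criterion from the list above.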

To train such a network, we first need to encode the input somehow. This is usually done with embeddings: we take e.g. a vocabulary, assign each word an index, and map each index to a fixed-size vector.
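A tiny sketch of that lookup, with a made-up four-word vocabulary and a random (untrained) embedding matrix:

```python
import numpy as np

# Hypothetical vocabulary; in practice the embedding matrix is learned with the model.
vocab = {"deep": 0, "learning": 1, "is": 2, "fun": 3}
embedding_dim = 4
rng = np.random.default_rng(1)
embedding_matrix = rng.normal(size=(len(vocab), embedding_dim))  # one row per word

def embed(sentence):
    """Words -> indices -> fixed-size vectors (rows of the embedding matrix)."""
    indices = [vocab[word] for word in sentence.split()]
    return embedding_matrix[indices]  # shape: (sequence_length, embedding_dim)

vectors = embed("deep learning is fun")  # (4, 4) array, fed to the RNN step by step
```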

The problem here is that the data contains a lot of “noise” and the network does not focus on the important parts. That’s why we need something called attention. The self-attention mechanism focuses only on the important parts of the input, so you don’t need to feed the past data in over and over. You just compute attention weights to find the most relevant values and identify the most important features of the input. This brings us to the building block of the transformer.
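Below is a rough NumPy sketch of scaled dot-product self-attention, assuming learned projection matrices W_q, W_k, W_v and toy sizes; the lecture’s exact formulation and notation may differ:

```python
import numpy as np

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)  # subtract max for numerical stability
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(X, W_q, W_k, W_v):
    """Scaled dot-product self-attention over a sequence X of shape (T, d_model)."""
    Q, K, V = X @ W_q, X @ W_k, X @ W_v      # queries, keys, values
    scores = Q @ K.T / np.sqrt(K.shape[-1])  # how strongly each position attends to each other
    weights = softmax(scores, axis=-1)       # attention weights, each row sums to 1
    return weights @ V                       # weighted sum: focus on the important parts

# Toy example with assumed sizes: a sequence of 5 tokens, model dimension 4.
rng = np.random.default_rng(2)
T, d_model = 5, 4
X = rng.normal(size=(T, d_model))
W_q, W_k, W_v = (rng.normal(size=(d_model, d_model)) for _ in range(3))
out = self_attention(X, W_q, W_k, W_v)  # shape (5, 4), same length as the input sequence
```

Every position looks at every other position in one shot, so there is no recurrence at all; stacking this block with feed-forward layers is what gives you a transformer.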