Introduction
In today’s interconnected world, the ability to process and learn from sequential data is paramount, particularly in fields like natural language processing, time series analysis, and audio signal processing. In this second lecture, we will delve into the intricacies of sequence modeling and explore how to construct neural networks that excel in handling sequential data. Following the foundational content covered in the first lecture, we will build on existing knowledge to enhance our understanding of recurrent neural networks (RNNs) and their role in predictive modeling across various domains.
Understanding Sequential Data
What is Sequential Data?
Sequential data refers to a series of data points indexed in a temporal or ordered sequence. Unlike static data where observations are independent, sequential data is characterized by dependencies over time. For instance, consider the task of predicting where a moving ball will travel next. Without prior knowledge of its trajectory, any prediction would be mere speculation. However, by learning from its previous positions, the model can make informed guesses about future positions.
Applications of Sequential Modeling
- Natural Language Processing (NLP): Processing text data which includes predicting the next word in a sentence or determining sentiment from tweet text.
- Stock Price Predictions: Analyzing historical stock prices to forecast future market movement.
- Medical Signals: Interpreting sequences of data from EKG or other health-monitoring devices.
- Biological Sequences: Understanding patterns in DNA sequences and genetic information.
- Climate Patterns: Modeling sequences to predict weather changes over time.
The Importance of RNNs
As highlighted in our initial steps into neural networks, the traditional feedforward networks are inadequate for processing sequential inputs, as they do not maintain any memory of past information. Therefore, we introduce RNNs as a solution to this challenge.
Basics of Recurrent Neural Networks
RNNs are specifically designed to operate on sequences of data by maintaining a hidden state that carries information about previous inputs. Here’s how RNNs function:
- Recurrence Relation: At every time step, RNNs update their hidden state based not only on the input from the current time step but also on the hidden state from the previous step.
- State Update: The output prediction at any given time step is a function of the input at that time step combined with the hidden state representing the computation history of the RNN.
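As a minimal sketch of this update (the tanh non-linearity and the weight names below are conventional illustrative choices, not something prescribed by the lecture):

```python
import numpy as np

def rnn_step(x_t, h_prev, W_xh, W_hh, W_hy):
    """One recurrent step: update the hidden state, then produce an output."""
    h_t = np.tanh(W_hh @ h_prev + W_xh @ x_t)  # new state from current input + previous state
    y_t = W_hy @ h_t                           # prediction is a function of the new state
    return y_t, h_t
```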
Training RNNs
The training of RNNs involves backpropagation through time (BPTT), a method that computes gradients for each time step across the entire sequence. This allows the weights in the network to be updated based on the loss computed at each step, but it introduces challenges such as the vanishing and exploding gradient problems.
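Concretely, if $L_t$ denotes the loss at time step $t$, BPTT sums the per-step losses and their gradients (this is the standard formulation rather than anything specific to this course):

$$L = \sum_{t} L_t(\hat{y}_t, y_t), \qquad \frac{\partial L}{\partial W} = \sum_{t} \frac{\partial L_t}{\partial W}$$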
Addressing the Challenges of RNNs
Vanishing & Exploding Gradients
Two significant issues in training RNNs come from their dependence on prior states:
- Vanishing Gradients: Gradients shrink as they are propagated back through many time steps, making it difficult for the network to learn long-term dependencies.
- Exploding Gradients: Gradients grow excessively large, leading to unstable training and poor model performance.
Solutions: LSTMs
Long Short-Term Memory units (LSTMs) were created to combat these issues. They incorporate mechanisms called gates to control the flow of information, allowing the network to decide what information to keep or discard over long sequences.
Attention Mechanisms
Introducing Self-Attention
To further enhance sequence modeling capabilities, we explore attention mechanisms, which allow models to focus on different parts of the input sequence when making predictions.
- Self-Attention: This process computes a similarity score between different elements of the input sequence to judge which elements hold more significance (see the formula after this list). This aligns particularly well with natural language, where the relevance of words may depend on the context established by other words.
- Positional Encoding: Since the architecture does not inherently understand sequence order, embeddings or encodings are necessary to describe the position of each element in the sequence.
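For reference, the self-attention computation sketched above is usually written as scaled dot-product attention, where $Q$, $K$, and $V$ are the query, key, and value matrices and $d_k$ is the key dimension (this is the standard Transformer formulation, not a result derived in this lecture):

$$\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\!\left(\frac{Q K^\top}{\sqrt{d_k}}\right) V$$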
Implementing Transformers
Transformers utilize self-attention mechanisms to eliminate the need for recurrence entirely, processing sequences in parallel and capturing long-range dependencies efficiently, with robust performance across various domains including computer vision and linguistic applications such as language translation.
Conclusion
Throughout this lecture, we've outlined the fundamentals of sequence modeling with RNNs, LSTMs, and the breakthrough attention mechanisms that have redefined the landscape of deep learning. The flexibility and power of these models empower machine learning applications ranging from music generation to nuanced sentiment analysis in text. As we conclude, we encourage hands-on experimentation with RNNs in the upcoming lab exercises.
Through practice and exploration, you’ll gain invaluable experience in building neural networks capable of learning from and making predictions based on complex sequential data.
Hello everyone! I hope you enjoyed Alexander's
first lecture. I'm Ava and in this second lecture, Lecture 2, we're going to focus on this
question of sequence modeling -- how we can build neural networks that can
handle and learn from sequential data.
So in Alexander's first lecture he
introduced the essentials of neural networks starting with perceptrons building
up to feed forward models and how you can actually train these models and start
to think about deploying them forward.
Now we're going to turn our attention to
specific types of problems that involve sequential processing of data and we'll
realize why these types of problems require a different way of implementing and building
neural networks from what we've seen so far.
And I think some of the components in
this lecture traditionally can be a bit confusing or daunting at first but what I
really really want to do is to build this understanding up from the foundations walking
through step by step developing intuition
all the way to understanding the math and the
operations behind how these networks operate. Okay, so let's get started. To begin, I first want to motivate what exactly we mean when we talk about sequential data or sequential modeling.
So we're going to begin with a really simple
intuitive example let's say we have this picture of a ball and your task is to predict
where this ball is going to travel to next. Now if you don't have any prior information about
the trajectory of the ball, its motion, its history, any guess or prediction about its next position is going to be exactly that, a random guess. If however in addition to the current
location of the ball I gave you some
information about where it was moving in the
past, now the problem becomes much easier, and I think hopefully we can all agree that the most likely next prediction is that this ball is going to move forward to the right in the next frame.
So this is a really you know reduced down
bare bones intuitive example but the truth is that beyond this sequential
data is really all around us. As I'm speaking the words coming out of
my mouth form a sequence of sound waves
that define audio which we can split up
to think about in this sequential manner similarly text language can be split up into a
sequence of characters or a sequence of words and there are many many more examples in which
sequential processing sequential data is present
right from medical signals like EKGs to financial
markets and projecting stock prices to biological sequences encoded in DNA to patterns in the
climate to patterns of motion and many more and so already hopefully you're getting
a sense of what these types of questions
and problems may look like and where
they are relevant in the real world when we consider applications of sequential
modeling in the real world we can think about a number of different kinds of problem definitions that we can have in our arsenal and work with
in the first lecture Alexander introduced the
Notions of classification and the notion of regression where he talked about and we learned
about feed forward models that can operate one to one in this fixed and static setting right
given a single input predict a single output
the binary classification example of will you succeed or pass this class. Here there's no notion of sequence, there's no notion of time. Now if we introduce this idea of a sequential component we can
handle inputs that may be defined temporally
and potentially also produce a sequential
or temporal output so for as one example we can consider text language and maybe we want to
generate one prediction given a sequence of text classifying whether a message is a
positive sentiment or a negative sentiment
conversely we could have a single input let's say
an image and our goal may be now to generate text or a sequential description of this image right
given this image of a baseball player throwing a ball can we build a neural network that generates
that as a language caption finally we can also
consider applications and problems where we have sequence-to-sequence processing, for example if we want to translate between two languages, and indeed this type of thinking and this type of architecture is what powers the task of machine translation in your phones, in Google Translate, and many
other examples so hopefully right this has given
you a picture of what sequential data looks like what these types of problem definitions may look
like and from this we're going to start and build up our understanding of what neural networks we
can build and train for these types of problems
so first we're going to begin with the notion
of recurrence and build up from that to Define recurrent neural networks and in the last
portion of the lecture we'll talk about the underlying mechanisms underlying the Transformer
architectures that are very very very powerful in
terms of handling sequential data but as I said
at the beginning right the theme of this lecture is building up that understanding step by step
starting with the fundamentals and the intuition so to do that we're going to go back revisit
the perceptron and move forward from there
right so as Alexander introduced, where we studied the perceptron in lecture one, the perceptron is defined by this single neural operation where we have some set of inputs, let's say X1 through XM, and each of these
numbers are multiplied by a corresponding weight
pass through a non-linear activation function
that then generates a predicted output y hat here we can have multiple inputs
coming in to generate our output but still these inputs are not thought of as
points in a sequence or time steps in a sequence
even if we scale this perceptron and start
to stack multiple perceptrons together to Define these feed forward neural networks we still
don't have this notion of temporal processing or sequential information even though we are able to
translate and convert multiple inputs apply these
weight operations apply this non-linearity
to then Define multiple predicted outputs so taking a look at this diagram right on the left
in blue you have inputs on the right in purple you have these outputs and the green defines the
neural the single neural network layer that's
transforming these inputs to the outputs Next
Step I'm going to just simplify this diagram I'm going to collapse down those stack perceptrons
together and depict this with this green block still it's the same operation going
on right we have an input Vector being
being transformed to predict this output
vector now what I've introduced here which you may notice is this new variable T right
which I'm using to denote a single time step we are considering an input at a single time
step and using our neural network to generate
a single output corresponding to that how could
we start to extend and build off this to now think about multiple time steps and how we could
potentially process a sequence of information well what if we took this diagram all I've done
is just rotated it 90 degrees where we still have
this input vector and being fed in producing an
output vector and what if we can make a copy of this network right and just do this operation
multiple times to try to handle inputs that are fed in corresponding to different times right
we have an individual time step starting with
t0 and we can do the same thing the same operation
for the next time step again treating that as an isolated instance and keep doing this repeatedly
and what you'll notice hopefully is all these models are simply copies of each other just with
different inputs at each of these different time
steps and we can make this concrete right in terms
of what this functional transformation is doing the predicted output at a particular time step
y hat of T is a function of the input at that time step X of T and that function is what is
learned and defined by our neural network weights
okay so I've told you that our goal here is
Right trying to understand sequential data do sequential modeling but what could
be the issue with what this diagram is showing and what I've shown you
here? Well, yeah, go ahead.
Exactly, that's exactly right. So the student's answer was that X1 could be related to X naught and you have this temporal dependence, but these isolated replicas don't capture that at all, and that answers the question perfectly.
right here a predicted output at a later time
step could depend precisely on inputs at previous
time steps if this is truly a sequential problem with this temporal dependence so how could we
start to reason about this how could we Define a relation that links the Network's computations
at a particular time step to Prior history and
memory from previous time steps well what if we
did exactly that right what if we simply linked the computation and the information understood
by the network to these other replicas via what we call a recurrence relation what this means is
that something about what the network is Computing
at a particular time is passed on to those
later time steps and we Define that according to this variable H which we call this internal
state or you can think of it as a memory term that's maintained by the neurons and the
network, and it's this state that's being passed on from time step to time step as we read in and process this sequential information. What this means is that the network's output, its predictions, its computations, is not only a function of the input data X, but also we have this other variable H, which captures this notion of state, captures this notion of memory,
that's being computed by the network and passed on over time specifically right to walk through this
our predicted output y hat of T depends not only on the input at a time but also this past memory
this past state and it is this linkage of temporal
dependence and recurrence that defines this idea
of a recurrent neural unit what I've shown is this this connection that's being unrolled over time
but we can also depict this relationship according to a loop this computation to this internal State
variable h of T is being iteratively updated over
time, and that's fed back into the neuron's computation in this recurrence relation. This is how we define these recurrent cells that comprise recurrent neural networks, or RNNs, and the key here is that we have this idea of
this recurrence relation that captures the cyclic
temporal dependency and indeed it's this idea
that is really the intuitive Foundation behind recurrent neural networks or rnns and so let's
continue to build up our understanding from here and move forward into how we can actually Define
the RNN operations mathematically and in code
so all we're going to do is formalize this
relationship a little bit more the key idea here is that the RNN is maintaining the state
and it's updating the state at each of these time steps as the sequence is is processed we
Define this by applying this recurrence relation
and what the recurrence relation captures is how
we're actually updating that internal state h of t. Specifically, that state update is exactly like any other neural network operation that we've introduced so far, where again we're learning
a function defined by a set of Weights w we're
using that function to update the cell State h
of t, and the additional component, the newness here, is that that function depends both on the input and the prior time step h of t minus one. And what you'll note is that this function f sub
W is defined by a set of weights and it's the same
set of Weights the same set of parameters
that are used time step to time step as the recurrent neural network processes this temporal
information, the sequential data. Okay, so the key idea here hopefully is coming through: this RNN state update operation takes the state and updates it at each time step as the sequence is processed. We can also translate this to how we can think about implementing RNNs in Python code
or rather pseudocode hopefully getting a better understanding and intuition behind how these
networks work so what we do is we just start by
defining an RNN for now this is abstracted away
and we start we initialize its hidden State and we have some sentence right let's say this is
our input of Interest where we're interested in predicting maybe the next word that's occurring in
this sentence what we can do is Loop through these
individual words in the sentence that Define our
temporal input and at each step as We're looping through each word in that sentence is fed into
the RNN model along with the previous hidden state and this is what generates a prediction for
the next word and updates the RNN state in turn
finally our prediction for the final word in
the sentence the word that we're missing is simply the rnn's output after all the prior
words have been fed in through the model so this is really breaking down how the RNN Works
how it's processing the sequential information
and what you've noticed is that the
RNN computation includes both this update to the hidden State as well
as generating some predicted output at the end that is our ultimate
goal that we're interested in
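In rough, runnable form, that loop might look like this (the toy vocabulary, the random untrained weights, and the name my_rnn are illustrative stand-ins, so the "prediction" is meaningless here; the point is the structure of feeding each word plus the previous hidden state back in):

```python
import numpy as np

vocab = ["this", "morning", "I", "took", "my", "cat", "for", "a", "walk"]
rng = np.random.default_rng(0)
W_xh, W_hh, W_hy = (rng.normal(size=s) for s in [(4, 9), (4, 4), (9, 4)])  # untrained weights

def my_rnn(word, h_prev):
    x = np.eye(len(vocab))[vocab.index(word)]      # one-hot encode the current word
    h = np.tanh(W_hh @ h_prev + W_xh @ x)          # update the hidden state
    return vocab[int(np.argmax(W_hy @ h))], h      # next-word guess, new hidden state

hidden_state = np.zeros(4)                         # initialize the hidden state
sentence = ["this", "morning", "I", "took", "my", "cat", "for", "a"]

for word in sentence:
    # Each word is fed in along with the previous hidden state, producing a
    # prediction for the next word and an updated hidden state in turn.
    prediction, hidden_state = my_rnn(word, hidden_state)

next_word = prediction   # the RNN's output after all prior words have been fed in
```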
And so, to walk through how we're actually generating the output prediction itself, what the RNN computes is: given some input vector, it then performs this update to the hidden state, and this update to the hidden state is just a standard neural network operation just like
we saw in the first lecture where it consists of
taking a weight Matrix multiplying that by the previous hidden State taking another weight Matrix
multiplying that by the input at a time step and applying a non-linearity and in this case right
because we have these two input streams the input
data X of T and the previous state H we have these
two separate weight matrices that the network is learning over the course of its training that
comes together we apply the non-linearity and then we can generate an output at a given
time step by just modifying the hidden state
using a separate weight Matrix to update this
value and then generate a predicted output. And that's all there is to it, right: that's how the RNN in its single operation updates both the hidden state and also generates a predicted output.
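Written out, with tanh as the non-linearity (the common choice), the update just described is:

$$h_t = \tanh\!\left(W_{hh}\, h_{t-1} + W_{xh}\, x_t\right), \qquad \hat{y}_t = W_{hy}\, h_t$$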
Okay, so now this gives you the internal working of how the RNN computation occurs at a particular time step. Let's next think about how this looks over time and define the computational graph of the RNN as being unrolled or expanded across time. So far the dominant way I've been
showing the rnns is according to this loop-like
diagram on the Left Right feeding back in on itself another way we can visualize and think
about rnns is as kind of unrolling this recurrence over time over the individual time steps in our
sequence what this means is that we can take
the network at our first time step and continue
to iteratively unroll it across the time steps going on forward all the way until we process all
the time steps in our input now we can formalize this diagram a little bit more by defining the
weight matrices that connect the inputs to the
hidden State update and the weight matrices that
are used to update the internal State across time and finally the weight matrices that Define
the update to generate a predicted output. Now recall that in all these cases, right, for all these three weight matrices at all these time steps, we are simply reusing the same weight matrices, right, so it's one set of parameters, one set of weight matrices, that just processes this
information sequentially now you may be thinking okay so how do we actually start to be thinking
about how to train the RNN how to define the loss
given that we have this temporal processing in
this temporal dependence well a prediction at an individual time step will simply amount to
a computed loss at that particular time step so now we can compare those predictions time step
by time step to the true label and generate a loss
value for those time steps, and finally we can get our total loss by taking all these individual loss terms together and summing them, defining the total loss for a particular input to the RNN. Now we can walk through an example of how we implement this RNN in TensorFlow starting from scratch. The RNN can be defined as a layer
operation and a layer class that Alexander introduced in the first lecture and so we can
Define it according to an initialization of weight matrices initialization of a hidden state which
commonly amounts to initializing these two to zero
next we can Define how we can actually pass
forward through the RNN Network to process a given input X and what you'll notice is in this forward
operation the computations are exactly like we just walked through we first update the hidden
state according to that equation we introduced
earlier and then generate a predicted output that
is a transformed version of that hidden state, and finally at each time step we return both the output and the updated hidden state, as this is what is necessary to be stored to continue this RNN operation over time.
What is very convenient is that although you could define your RNN network and your RNN layer completely from scratch, TensorFlow abstracts this operation away for you, so you can simply define a simple RNN according to this call that you're seeing here, which makes all the computations very efficient and very easy, and you'll actually get practice implementing and working with RNNs in today's software lab.
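A hedged sketch of what such a from-scratch layer and the built-in call might look like (the class name, dimensions, and initialization details here are illustrative, not the exact code from the slides):

```python
import tensorflow as tf

class MyRNNCell(tf.keras.layers.Layer):
    def __init__(self, rnn_units, input_dim, output_dim):
        super().__init__()
        # Weight matrices: input-to-hidden, hidden-to-hidden, hidden-to-output
        self.W_xh = self.add_weight(shape=(rnn_units, input_dim))
        self.W_hh = self.add_weight(shape=(rnn_units, rnn_units))
        self.W_hy = self.add_weight(shape=(output_dim, rnn_units))
        # Hidden state, commonly initialized to zeros
        self.h = tf.zeros([rnn_units, 1])

    def call(self, x):
        # Update the hidden state from the previous state and the current input
        self.h = tf.math.tanh(self.W_hh @ self.h + self.W_xh @ x)
        # The output is a transformed version of the hidden state
        y = self.W_hy @ self.h
        return y, self.h

# In practice, TensorFlow provides this as a built-in layer:
rnn_layer = tf.keras.layers.SimpleRNN(units=64)
```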
Okay, so that gives us the understanding of RNNs. Going back to what I described as kind of the problem setups or the problem definitions at the beginning of this lecture, I just want to remind you of the types of sequence modeling problems on which we can apply RNNs. Right, we can think about taking a sequence of
inputs producing one predicted output
at the end of the sequence we can think
about taking a static single input and trying to generate text according
to according to that single input and finally we can think about taking a sequence
of inputs producing a prediction at every time
step in that sequence and then doing this sequence
to sequence type of prediction and translation. Okay, so this will be the foundation for the software lab today, which will focus on this problem of many-to-many processing and many-to-many sequential modeling,
taking a sequence, going to a sequence. What is common and what is universal across all these types of problems and tasks that we may want to consider with RNNs is what I like to think about as the type of design criteria we need to build a robust and reliable network for
processing these sequential modeling problems what I mean by that is what are the characteristics
what are the the design requirements that the RNN needs to fulfill in order to be able
to handle sequential data effectively
the first is that sequences can be of different
lengths right they may be short they may be long we want our RNN model or our neural network
model in general to be able to handle sequences of variable lengths secondly and really importantly
is as we were discussing earlier that the whole
point of thinking about things through the lens of
sequence is to try to track and learn dependencies in the data that are related over time so
our model really needs to be able to handle those different dependencies which may occur at
times that are very very distant from each other
next right sequence is all about order right
there's some notion of how current inputs depend on prior inputs and the specific order of the
observations we see makes a big effect on what prediction we may want to generate at the end
and finally in order to be able to process this
information effectively our Network needs to be
able to do what we call parameter sharing meaning that given one set of Weights that set of weights
should be able to apply to different time steps in the sequence and still result in a meaningful
prediction and so today we're going to focus on
how recurrent neural networks meet these design
criteria and how these design criteria motivate the need for even more powerful architectures
that can outperform rnns in sequence modeling so to understand these criteria very concretely
we're going to consider a sequence modeling
problem where given some series of words our task
is just to predict the next word in that sentence. So let's say we have this sentence: this morning I took my cat for a walk, and our task is to predict the last word in the sentence given the prior words: this morning I took my cat for a blank.
our goal is to take our RNN Define it and put
it to test on this task what is our first step to doing this well the very very first step before
we even think about defining the RNN is how we can actually represent this information to the network
in a way that it can process and understand
if we have a model that is processing this data
processing this text-based data and wanting to generate text as the output our problem can arise
in that the neural network itself is not equipped to handle language explicitly right remember
that neural networks are simply functional
operators they're just mathematical operations
and so we can't expect it right it doesn't have an understanding from the start of what a word is
or what language means which means that we need a way to represent language numerically so that
it can be passed in to the network to process
So what we do is we need to define a way to translate this text, this language information, into a numerical encoding, a vector, an array of numbers that can then be fed in to our neural network, generating a vector of numbers as its output.
so now right this raises the question
of how do we actually Define this transformation how can we transform
language into this numerical encoding the key solution and the key way that a
lot of these networks work is this notion
and concept of embedding. What that means is it's some transformation that takes indices, or something that can be represented as an index, into a numerical vector of a given size. So if we think about how this idea
of embedding works for language data
let's consider a vocabulary of words that we can
possibly have in our language and our goal is to be able to map these individual words in our
vocabulary to a numerical Vector of fixed size one way we could do this is by defining all the
possible words that could occur in this vocabulary
and then indexing them, assigning an index label to each of these distinct words: a corresponds to index one, cat corresponds to index two, so on and so forth, and this indexing maps these individual words to numbers, unique indices.
What these indices can then define is what we call an embedding vector, which is a fixed-length encoding where we've simply indicated a one value at the index for that word when we observe that word, and this is called a one-hot embedding, where we have this fixed-length vector of the size of our vocabulary and each instance of the vocabulary corresponds to a one at the corresponding index. This is a very sparse way to do this, and it's based purely on the count index; there's no notion of semantic information or meaning that's captured in this vector-based encoding.
Alternatively, what is very commonly done is to actually use a neural network to learn an encoding, to learn an embedding, and the goal here is that we can learn a neural network that then captures some inherent meaning or inherent semantics
in our input data and Maps related words or
related inputs closer together in this embedding space meaning that they'll have numerical
vectors that are more similar to each other. This concept is really, really foundational to how these sequence modeling networks work and how neural networks work in general.
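As a small illustration of the two options (the toy vocabulary and the embedding size are made up for this example):

```python
import tensorflow as tf

vocab = ["a", "cat", "dog", "the", "walk"]           # toy vocabulary
word_to_index = {w: i for i, w in enumerate(vocab)}  # each word gets a unique index

# One-hot embedding: sparse, purely index-based, no notion of meaning.
one_hot = tf.one_hot(word_to_index["cat"], depth=len(vocab))   # [0., 1., 0., 0., 0.]

# Learned embedding: a dense vector whose values are trained so that
# semantically related words end up close together in the embedding space.
embedding_layer = tf.keras.layers.Embedding(input_dim=len(vocab), output_dim=8)
dense_vector = embedding_layer(tf.constant([word_to_index["cat"]]))  # shape (1, 8)
```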
Okay, so with that in hand we can go back to our design criteria, thinking about the capabilities that we desire.
first we need to be able to handle variable length sequences if we again want to predict
the next word in the sequence we can have short
sequences we can have long sequences we can have
even longer sentences and our key task is that we want to be able to track dependencies across
all these different lengths and what we need what we mean by dependencies is that there could
be information very, very early on in a sequence that may not be relevant or come up until very much later in the sequence, and we need to be able to track these dependencies and maintain this information in our network. Dependencies relate to order, and sequences are
defined by their order and we know that same words
in a completely different order have completely
different meanings right so our model needs to be able to handle these differences in order and
the differences in length that could result in different predicted outputs okay so hopefully
that example going through the example in text
motivates how we can think about transforming
input data into a numerical encoding that can be passed into the RNN and also what are
the key criteria that we want to meet in handling these these types of problems so so
far we've painted the picture of rnn's how they
work intuition their mathematical operations and
what are the key criteria that they need to meet the final piece to this is how we actually train
and learn the weights in the RNN, and that's done through the backpropagation algorithm with a bit of a twist to just handle sequential information.
if we go back and think about how we train feed
forward neural network models the steps break down in thinking through starting with an input where
we first take this input and make a forward pass through the network going from input to Output the
key to back propagation that Alexander introduced
was this idea of taking the prediction and back
propagating gradients back through the network and using this operation to then Define and
update the loss with respect to each of the parameters in the network in order to gradually
adjust the parameters the weights of the network
in order to minimize the overall loss now with
rnns as we walked through earlier we have this temporal unrolling which means that we have these
individual losses across the individual steps in our sequence that sum together to comprise
the overall loss what this means is that when
we do back propagation we have to now instead of
back propagating errors through a single Network back propagate the loss through
each of these individual time steps and after we back propagate loss through each
of the individual time steps we then do that
across all time steps, all the way from our current time t back to the beginning of the sequence, and this is why this algorithm is called backpropagation through time, right, because as you can see the data and the predictions and the resulting errors are fed back in time
all the way from where we are currently to
the very beginning of the input data sequence. So backpropagation through time is actually a very tricky algorithm to implement in practice, and the reason for this is if we take a close look at how gradients flow across the RNN, what this algorithm involves is many,
many repeated computations and multiplications of these weight matrices repeatedly against each
other in order to compute the gradient with respect to the very first time step we have to
make many of these multiplicative repeats of
the weight matrix. Why might this be problematic? Well, if this weight matrix W is very, very big, what this can result in is what we call the exploding gradient problem, where our gradients that we're trying to use to optimize our network do exactly that: they blow up, they explode,
and they get really big and makes it infeasible
and not possible to train the network stably. What we do to mitigate this is a pretty simple solution called gradient clipping, which effectively scales back these very big gradients to try to constrain them in a more restricted way.
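A minimal sketch of gradient clipping in TensorFlow (the tiny model, the random data, and the clip threshold of 1.0 are all illustrative):

```python
import tensorflow as tf

model = tf.keras.Sequential([tf.keras.layers.SimpleRNN(64), tf.keras.layers.Dense(1)])
optimizer = tf.keras.optimizers.Adam()

x = tf.random.normal([8, 10, 4])   # dummy batch: 8 sequences, 10 time steps, 4 features
y = tf.random.normal([8, 1])

with tf.GradientTape() as tape:
    loss = tf.reduce_mean((model(x) - y) ** 2)

grads = tape.gradient(loss, model.trainable_variables)
# Scale the gradients back whenever their global norm exceeds the threshold.
clipped, _ = tf.clip_by_global_norm(grads, clip_norm=1.0)
optimizer.apply_gradients(zip(clipped, model.trainable_variables))
```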
conversely we can have the instance where the
weight matrices are very very small and if these weight matrices are very very small we end up with
a very very small value at the end as a result of these repeated weight Matrix computations and
these repeated multiplications, and this is a very real problem in RNNs in particular, where we can run into this phenomenon called a vanishing gradient, where now your gradient has just dropped down close to zero and again you can't train the network stably. Now there are particular tools that we can implement to try to mitigate the vanishing gradient problem,
and we'll touch on each of these three solutions briefly first being how we can Define the
activation function in our Network and how we can change the network architecture itself to try
to better handle this Vanishing gradient problem
Before we do that, I want to take just one step back to give you a little more intuition about why vanishing gradients can be a real issue for recurrent neural networks. The point I've kept trying to reiterate is this notion
of dependency in the sequential data and what
it means to track those dependencies well if the
dependencies are very constrained in a small space not separated out that much by time this repeated
gradient computation and the repeated weight matrix multiplication is not so much of a problem
if we have a very short sequence where the words
are very closely related to each other and it's
pretty obvious what our next output is going to be the RNN can use the immediately passed
information to make a prediction, and so there is not going to be that much of a requirement to learn effective weights if the related information is close to each other temporally. Conversely, now if we have a sentence
where we have a more long-term dependency what this means is that we need information from
way further back in the sequence to make our
prediction at the end and that gap between what's
relevant and where we are at currently becomes exceedingly large and therefore the vanishing
gradient problem is increasingly exacerbated, meaning that the RNN becomes unable to connect the dots and establish this long-term dependency, all because of this vanishing gradient issue. So the ways and modifications that we can make to our network to try to alleviate this problem are threefold. The first is that we can simply change
the activation functions in each of our
neural network layers to be such that they can effectively safeguard the gradients from shrinking in instances where the data is greater than zero, and this is in particular true for the ReLU activation function. The reason is that in all instances where x is greater than zero, with the ReLU function the derivative is one, and so that is not less than one, and therefore it helps in mitigating the vanishing gradient problem. Another trick is how we initialize the parameters
in the network itself to prevent them from
shrinking to zero too rapidly, and there are mathematical ways that we can do this, namely by initializing our weights to identity matrices, and this effectively helps in practice to prevent the weight updates from shrinking too rapidly to zero.
however the most robust solution to the vanishing
gradient problem is by introducing a slightly
more complicated uh version of the recurrent neural unit to be able to more effectively track
and handle long-term dependencies in the data and this is this idea of gating and what the
idea is is by controlling selectively the flow
of information into the neural unit to be able to
filter out what's not important while maintaining what is important and the key and the most popular
type of recurrent unit that achieves this gated computation is called the LSTM, or long short-term memory network. Today we're not going to go into detail on LSTMs, their mathematical details, their operations and so on, but I just want to convey the key idea and intuitive idea about why these LSTMs are effective at tracking long-term dependencies. The core is that the LSTM is able to control
the flow of information through these gates
to be able to more effectively filter out the
unimportant things and store the important things. What you can do is implement LSTMs in TensorFlow just as you would an RNN, but the core concept that I want you to take away when thinking about the LSTM is this idea of controlled information flow through gates. Very briefly, the way that the LSTM operates is by maintaining a cell state, just like a standard RNN, and that cell state
is independent from what is directly outputted the way the cell state is updated is according to
these Gates that control the flow of information
for getting and eliminating what is irrelevant
storing the information that is relevant updating the cell state in turn and then
filtering this this updated cell state to produce the predicted output just like the
standard RNN. And again, we can train the LSTM using the backpropagation through time algorithm, but the mathematics of how the LSTM is defined allows for a completely uninterrupted flow of the gradients, which largely eliminates the vanishing gradient problem that I introduced earlier.
Again, if you're interested in learning more about the mathematics and the details of LSTMs, please come and discuss with us after the lectures, but again, I'm just emphasizing the core concept and the intuition behind how the LSTM operates.
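As noted, in TensorFlow an LSTM layer can be dropped in just as you would a simple RNN layer; a minimal sketch (layer sizes are illustrative):

```python
import tensorflow as tf

model = tf.keras.Sequential([
    tf.keras.layers.LSTM(units=64),  # gated recurrent layer maintaining a cell state
    tf.keras.layers.Dense(1),        # task-specific output head
])
```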
Okay, so far we've covered a lot of ground. We've gone through the fundamental workings of RNNs, the architecture, the training, the type of problems that they've been applied to, and I'd like to close this part by considering some concrete examples
of how you're going to use rnns in your software
lab and that is going to be in the task of Music generation where you're going to work to build an
RNN that can predict the next musical note in a sequence and use it to generate brand new musical
sequences that have never been realized before
so to give you an example of just the quality
and and type of output that you can try to aim towards a few years ago there was a work that
trained an RNN on a corpus of classical music data, and famously there's this composer Schubert who wrote a famous unfinished symphony that consisted of two movements, but he was unable to finish his symphony before he died, and he left the third movement unfinished. So a few years ago a group trained an RNN-based model to actually try to generate the third movement to Schubert's famous unfinished symphony given the prior two movements, so I'm going to play the result right now. Okay, I paused it, I interrupted it quite abruptly there, but if there are any classical
music aficionados out there, hopefully you get an appreciation for kind of the quality that was generated in terms of the music, and this was already from a few years ago, and as we'll see in the next lectures, continuing with
this theme of generative AI the power of these
algorithms has advanced tremendously since
we first played this example, particularly in a whole range of domains which I'm excited to talk about, but not for now. Okay, so you'll tackle this problem head on in today's lab: RNN music generation. We can also think about the simple example of an input sequence to a single output with sentiment classification, where we can think about, for example, text like tweets and assigning positive or negative labels to these text examples based on the
okay so this kind of concludes the portion on rnns
and I think it's quite remarkable that using all the foundational Concepts and operations
that we've talked about so far we've been able to try to build up networks that handle
this complex problem of sequential modeling
but like any technology right and RNN is not
without limitations so what are some of those limitations and what are some potential issues
that can arise with using rnns or even lstms the first is this idea of encoding and and
dependency in terms of the the temporal separation
of data that we're trying to process while rnns
require is that the sequential information is fed in and processed time step by time step what that
imposes is what we call an encoding bottleneck right where we have we're trying to encode a lot
of content for example a very large body of text
many different words into a single output that
may be just at the very last time step how do we ensure that all that information leading up to
that time step was properly maintained and encoded and learned by the network in practice this is
very very challenging and a lot of information
can be lost another limitation is that by
doing this time step by time step processing rnns can be quite slow there is not really
an easy way to parallelize that computation and finally together these components of
the encoding bottleneck the requirement to
process this data step by step imposes the biggest
problem, which is when we talk about long memory: the capacity of the RNN and the LSTM is really not that long. We can't really handle sequential data of tens of thousands or hundreds of thousands of steps, or even beyond, effectively enough to learn the complete amount of information and patterns that are present within such a rich data source. And so because of this, very recently there's been
a lot of attention in how we can move Beyond this notion of step-by-step recurrent processing
to build even more powerful architectures for
processing sequential data to understand how we
do how we can start to do this let's take a big step back right think about the high level goal
of sequence modeling that I introduced at the very beginning given some input a sequence of data
we want to build a feature encoding and use our
neural network to learn that and then transform
that feature encoding into a predicted output what we saw is that rnns use this notion
of recurrence to maintain order information processing information time step by time step
but as I just mentioned we had these key three
bottlenecks to rnns what we really want to achieve
is to go beyond these bottlenecks and Achieve even higher capabilities in terms of the power of
these models rather than having an encoding bottleneck ideally we want to process information
continuously as a continuous stream of information
rather than being slow we want to be able to
parallelize computations to speed up processing and finally of course our main goal is
to really try to establish long memory that can build nuanced and Rich
understanding of sequential data
The limitation of RNNs that's linked to all these problems and issues, and our inability to achieve these capabilities, is that they require this time step by time step processing. So what if we could move beyond that? What if we could eliminate this need for recurrence entirely and not have to process the data time step by time step? Well, a first and naive approach would be to just squash all the data, all the time steps,
together to create a vector that's effectively concatenated right the time steps are eliminated
there's just one one stream where we have now one
vector input with the data from all time points
that's then fed into the model it calculates some feature vector and then generates some output
which hopefully makes sense and because we've squashed all these time steps together we
could simply think about maybe building a
feed forward Network that could that could do this
computation. Well, with that we'd eliminate the need for recurrence, but we still have the issues that it's not scalable, because the dense feed-forward network would have to be immensely large, defined by many, many different connections, and critically, we've completely lost our order information by just squashing everything together blindly. There's no temporal dependence, and we're then stuck in our ability to try to establish long-term memory. So what if instead we could still think
about bringing these time steps together
but be a bit more clever about how we try
to extract information from this input data the key idea is this idea of being able to
identify and attend to what is important in a potentially sequential stream of information and
this is the notion of attention or self-attention
which is an extremely, extremely powerful concept in modern deep learning and AI. I cannot understate, or rather overstate, I cannot emphasize enough how powerful this concept is. Attention is the foundational mechanism of the Transformer architecture, which many of you may have heard about, and the notion
of a transformer can often be very daunting because sometimes they're presented with these
really complex diagrams or deployed in complex applications and you may think okay how
do I even start to make sense of this
at its core though attention the key operation
is a very intuitive idea and we're going to in the last portion of this lecture break
it down step by step to see why it's so powerful and how we can use that as part of
a larger neural network like a Transformer
specifically we're going to be talking and
focusing on this idea of self-attention attending to the most important parts of an
input example so let's consider an image I think it's most intuitive to consider an image
first this is a picture of Iron Man and if our
goal is to try to extract information from this
image of what's important, what we could do maybe is use our eyes to naively scan over this image pixel by pixel, right, just going across the image. However, our brains maybe internally are doing some type of computation like this, but you and I can simply look at this image and be able to attend to the important parts. We can see that it's Iron Man coming at you
right in the image and then we can focus in a little further and say okay what are the
details about Iron Man that may be important, what is key. What you're doing is your brain is identifying which parts to attend to, and then extracting those features that deserve the highest attention. The first part of this problem is really the most interesting and challenging one,
and it's very similar to the concept of search
effectively that's what search is doing taking some larger body of information and trying
to extract and identify the important parts so let's go there next how does search work you're
thinking you're in this class how can I learn more
about neural networks well in this day and age one
thing you may do besides coming here and joining us is going to the internet having all the videos
out there trying to find something that matches doing a search operation so you have a giant
database like YouTube you want to find a video
you enter in your query, deep learning, and what comes out are some possible outputs. Right, for every video in the database there is going to be some key information related to that video, let's say the title. Now to do the search, the task is to find the overlaps between your query and each of these titles, right, the keys in the database. What we want to compute is a metric of similarity
and relevance between the query and these keys how similar are they to our desired query and we can
do this step by step let's say this first option
of a video about the elegant giant sea turtles
not that similar to our query about deep learning our second option introduction to deep learning
the first introductory lecture on this class yes highly relevant the third option a video about
the late and great Kobe Bryant not that relevant
The key operation here is that there is this similarity computation bringing the query and the key together. The final step is, now that we've identified which key is relevant, extracting the relevant information, what we want to pay attention to, and that's the video itself. We call this the value, and because the search is implemented well, right, we've successfully identified the relevant video on deep learning that you are going to want to pay attention to. And it's this idea, this intuition, of
giving a query trying to find similarity
trying to extract the related values
that form the basis of self-attention and how it works in neural networks like
Transformers so to go concretely into this right let's go back now to our text our language
example with the sentence our goal is to identify
and attend to features in this input that are
relevant to the semantic meaning of the sentence now first step we have sequence we have order
we've eliminated recurrence, right, we're feeding in all the time steps all at once, and we still need a way to encode and capture this information about order and this positional dependence. How this is done is this idea of positional encoding, which captures some inherent order information present in the sequence. I'm just going to touch on this very briefly, but the idea is related to this idea of embeddings which I introduced earlier.
what is done is a neural network layer is
used to encode positional information that captures the relative relationships in
terms of order within this text. That's the high-level concept, right: we're still able to process these time steps all at once, there is no notion of time step, rather the data is singular, but still we learn this encoding that captures the positional order information.
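A hedged sketch of a learned positional embedding, along the lines described: a layer maps each position index to a vector that is combined with the token embedding (the sizes, and the choice of a learned rather than fixed encoding, are illustrative):

```python
import tensorflow as tf

seq_len, d_model = 10, 32
tokens = tf.random.uniform([2, seq_len], maxval=100, dtype=tf.int32)  # dummy batch of token indices

token_embed = tf.keras.layers.Embedding(input_dim=100, output_dim=d_model)
pos_embed = tf.keras.layers.Embedding(input_dim=seq_len, output_dim=d_model)

positions = tf.range(seq_len)                   # [0, 1, ..., seq_len - 1]
# Each token's embedding is combined with an embedding of its position,
# so order information is carried even though all time steps are fed at once.
x = token_embed(tokens) + pos_embed(positions)  # shape (2, seq_len, d_model)
```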
Now our next step is to take this encoding and figure out what to attend to, exactly like that search
operation that I introduced with the YouTube
example extracting a query extracting a key
extracting a value and relating them to each other so we use neural network layers to do exactly
this. Given this positional encoding, what attention does is apply a neural network layer transforming that, first generating the query. We do this again using a separate neural network layer, and this is a different set of weights, a different set of parameters, that then transforms that positional embedding in a different way, generating a second output, the key. And finally this operation is repeated with a third layer, a third set of weights, generating the value. Now with these three in hand, the query, the key, and the value, we can compare them to each other to try to figure out where, in that self-input, the network should attend
to what is important and that's the key idea
behind this similarity metric or what you can
think of as an attention score. What we're doing is we're computing a similarity score between a query and the key, and remember that these query and key values are just arrays of numbers. We can define them as arrays of numbers, which
you can think of as vectors in space the query
Vector the query values are some Vector the key the key values are some other vector and
mathematically the way that we can compare these two vectors to understand how similar they are is
by taking the dot product and scaling it; this captures how similar these vectors are, whether or not they're pointing in the same direction, right. This is the similarity metric, and if you are familiar with a little bit of linear algebra, this is also known as the cosine similarity. The operation functions exactly the same way for matrices: if we apply this dot product operation to our query and key matrices, we get this similarity metric out. Now this is very, very key in defining our next step, computing the attention weighting in terms of what the network should actually attend to within
this input this operation gives us a score which
defines how how the components of the input data
are related to each other so given a sentence right when we compute this similarity score metric
we can then begin to think of Weights that Define the relationship between the sequential the
components of the sequential data to each other
So for example, in this example with a text sentence, he tossed the tennis ball to serve, the goal with the score is that words in the sequence that are related to each other should have high attention weights: ball related to toss, related to tennis. And this metric itself is our attention weighting. What we have done is passed that similarity score through a softmax function, which all it does is constrain those values to be between 0 and 1, and so you can think of these as relative scores, or
relative attention weights. Finally, now that we have this metric that captures this notion of similarity and these internal self-relationships, we can finally use this metric to extract features that are deserving of high attention, and that's the exact final step in this self-attention mechanism, in that we take that attention weighting matrix, multiply it by the value, and get a transformation of the initial data as our output, which in turn reflects the features that correspond to high attention.
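Putting the pieces together, a single self-attention head might be sketched like this (the layer sizes, and the use of freshly created Dense layers inside a function, are purely illustrative):

```python
import tensorflow as tf

def self_attention_head(x, d_model=64):
    # x: (batch, seq_len, features), already combined with a positional encoding.
    # Three separate learned layers generate the query, key, and value.
    q = tf.keras.layers.Dense(d_model)(x)
    k = tf.keras.layers.Dense(d_model)(x)
    v = tf.keras.layers.Dense(d_model)(x)

    # Similarity between queries and keys: scaled dot product.
    scores = tf.matmul(q, k, transpose_b=True) / tf.math.sqrt(float(d_model))

    # Softmax constrains the attention weights to lie between 0 and 1.
    weights = tf.nn.softmax(scores, axis=-1)

    # Multiply the attention weights by the values to extract the
    # features that deserve high attention.
    return tf.matmul(weights, v)

# Example: a batch of 2 sequences, 5 tokens each, 32 features per token.
out = self_attention_head(tf.random.normal([2, 5, 32]))   # shape (2, 5, 64)
```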
All right, let's take a breath and recap what we have just covered so far.
The goal with this idea of self-attention, the backbone of Transformers, is to eliminate recurrence and attend to the most important features in the input data in an architecture. How this is actually deployed is: first we take our input data, we compute these positional encodings, the neural network layers are applied threefold to transform the positional encoding into each of the key, query, and value matrices, we can then compute the self-attention weight score according to the dot product operation that we went through prior, and then self-attend to these features, to this information, to extract features that deserve high attention. What is so powerful about this approach, in taking
this attention weight, putting it together with the value to extract high-attention features, is that this operation, the scheme that I'm showing on
the right defines a single self-attention head
and multiple of these self-attention heads can be linked together to form larger Network
architectures where you can think about these different heads trying to extract different
information different relevant parts of the input
to now put together a very very rich encoding and
representation of the data that we're working with intuitively back to our Ironman example what
this idea of multiple self-attention heads can amount to is that different salient features and salient information in the data is extracted. First maybe you consider Iron Man, attention head one, and you may have additional attention heads that are picking out other relevant parts of
the data which maybe we did not realize before for example the building or the spaceship
in the background that's chasing iron
man. And so this is a key building block of many, many powerful architectures that are out there today. I again cannot emphasize enough how powerful this mechanism is, and indeed this backbone idea of self-attention that you just built up
understanding of is the key operation of some
of the most powerful neural networks and deep learning models out there today, ranging from the very powerful language models like GPT-3, which are capable of synthesizing natural language in a very human-like fashion, digesting large bodies of text information to understand relationships in text, to models that are being deployed for extremely impactful applications in biology and medicine, such as AlphaFold 2, which uses this notion of self-attention to look at data of protein sequences and be able to predict the
three-dimensional structure of a protein just
given sequence information alone and all the way even now to computer vision which will be the
topic of our next lecture tomorrow where the same idea of attention that was initially developed in
sequential data applications has now transformed
the field of computer vision and again using
this key concept of attending to the important features in an input to build these very rich
representations of complex High dimensional data okay so that concludes lectures for today I
know we have covered a lot of territory in a
pretty short amount of time but that is what this
boot camp program is all about so hopefully today you've gotten a sense of the foundations of neural
networks in the lecture with Alexander we talked about rnns how they're well suited for sequential
data how we can train them using back propagation
how we can deploy them for different applications
and finally how we can move Beyond recurrence to build this idea of self-attention for
building increasingly powerful models for deep learning in sequence modeling
All right, hopefully you enjoyed it. We have about 45 minutes left for the lab portion and open office hours, in which we welcome you to ask questions of us and the TAs and to start work on the labs. The information for the labs is up there. Thank you so much for your attention.