Introduction
In today’s interconnected world, the ability to process and learn from sequential data is paramount, particularly in fields like natural language processing, time series analysis, and audio signal processing. In this second lecture, we will delve into the intricacies of sequence modeling and explore how to construct neural networks that excel in handling sequential data. Following the foundational content covered in the first lecture, we will build on existing knowledge to enhance our understanding of recurrent neural networks (RNNs) and their role in predictive modeling across various domains.
Understanding Sequential Data
What is Sequential Data?
Sequential data refers to a series of data points indexed in a temporal or ordered sequence. Unlike static data where observations are independent, sequential data is characterized by dependencies over time. For instance, consider the task of predicting where a moving ball will travel next. Without prior knowledge of its trajectory, any prediction would be mere speculation. However, by learning from its previous positions, the model can make informed guesses about future positions.
Applications of Sequential Modeling
- Natural Language Processing (NLP): Processing text data which includes predicting the next word in a sentence or determining sentiment from tweet text.
- Stock Price Predictions: Analyzing historical stock prices to forecast future market movement.
- Medical Signals: Interpreting sequences of data from EKG or other health-monitoring devices.
- Biological Sequences: Understanding patterns in DNA sequences and genetic information.
- Climate Patterns: Modeling sequences to predict weather changes over time.
The Importance of RNNs
As highlighted in our initial steps into neural networks, the traditional feedforward networks are inadequate for processing sequential inputs, as they do not maintain any memory of past information. Therefore, we introduce RNNs as a solution to this challenge.
Basics of Recurrent Neural Networks
RNNs are specifically designed to operate on sequences of data by maintaining a hidden state that carries information about previous inputs. Here’s how RNNs function:
- Recurrence Relation: At every time step, RNNs update their hidden state based not only on the input from the current time step but also on the hidden state from the previous step.
- State Update: The output prediction at any given time step is a function of the input at that time step combined with the hidden state representing the computation history of the RNN.
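As a minimal sketch of this update (the tanh non-linearity and the weight names below are conventional illustrative choices, not something prescribed by the lecture):

```python
import numpy as np

def rnn_step(x_t, h_prev, W_xh, W_hh, W_hy):
    """One recurrent step: update the hidden state, then produce an output."""
    h_t = np.tanh(W_hh @ h_prev + W_xh @ x_t)  # new state from current input + previous state
    y_t = W_hy @ h_t                           # prediction is a function of the new state
    return y_t, h_t
```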
Training RNNs
The training of RNNs involves backpropagation through time (BPTT), a method that computes gradients for each time step across the entire sequence. This allows the weights in the network to be updated based on the loss computed at each step, but it introduces challenges such as the vanishing and exploding gradient problems.
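Concretely, if $L_t$ denotes the loss at time step $t$, BPTT sums the per-step losses and their gradients (this is the standard formulation rather than anything specific to this course):

$$L = \sum_{t} L_t(\hat{y}_t, y_t), \qquad \frac{\partial L}{\partial W} = \sum_{t} \frac{\partial L_t}{\partial W}$$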
Addressing the Challenges of RNNs
Vanishing & Exploding Gradients
Two significant issues in training RNNs come from their dependence on prior states:
- Vanishing Gradients: Gradients shrink as they are propagated back through many time steps, making it difficult for the network to learn long-term dependencies.
- Exploding Gradients: Gradients grow excessively large, leading to unstable training and poor model performance.
Solutions: LSTMs
Long Short-Term Memory units (LSTMs) were created to combat these issues. They incorporate mechanisms called gates to control the flow of information, allowing the network to decide what information to keep or discard over long sequences.
Attention Mechanisms
Introducing Self-Attention
To further enhance sequence modeling capabilities, we explore attention mechanisms, which allow models to focus on different parts of the input sequence when making predictions.
- Self-Attention: This process computes a similarity score between different elements of the input sequence to judge which elements hold more significance (see the formula after this list). This aligns particularly well with natural language, where the relevance of words may depend on the context established by other words.
- Positional Encoding: Since the architecture does not inherently understand sequence order, embeddings or encodings are necessary to describe the position of each element in the sequence.
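For reference, the self-attention computation sketched above is usually written as scaled dot-product attention, where $Q$, $K$, and $V$ are the query, key, and value matrices and $d_k$ is the key dimension (this is the standard Transformer formulation, not a result derived in this lecture):

$$\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\!\left(\frac{Q K^\top}{\sqrt{d_k}}\right) V$$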
Implementing Transformers
Transformers utilize self-attention mechanisms to eliminate the need for recurrence entirely, processing sequences in parallel and capturing long-range dependencies efficiently, with robust performance across various domains including computer vision and linguistic applications such as language translation.
Conclusion
Throughout this lecture, we've outlined the fundamentals of sequence modeling with RNNs, LSTMs, and the breakthrough attention mechanisms that have redefined the landscape of deep learning. The flexibility and power of these models empower machine learning applications ranging from music generation to nuanced sentiment analysis in text. As we conclude, we encourage hands-on experimentation with RNNs in the upcoming lab exercises.
Through practice and exploration, you’ll gain invaluable experience in building neural networks capable of learning from and making predictions based on complex sequential data.
Hello everyone! I hope you enjoyed Alexander's
first lecture. I'm Ava and in this second lecture, Lecture 2, we're going to focus on this
question of sequence modeling -- how we can build neural networks that can
handle and learn from sequential data.
So in Alexander's first lecture he
introduced the essentials of neural networks starting with perceptrons building
up to feed forward models and how you can actually train these models and start
to think about deploying them forward.
Now we're going to turn our attention to
specific types of problems that involve sequential processing of data and we'll
realize why these types of problems require a different way of implementing and building
neural networks from what we've seen so far.
And I think some of the components in
this lecture traditionally can be a bit confusing or daunting at first but what I
really really want to do is to build this understanding up from the foundations walking
through step by step developing intuition
all the way to understanding the math and the
operations behind how these networks operate. Okay, so let's get started. To begin, I first want to motivate what exactly we mean when we talk about sequential data or sequential modeling.
So we're going to begin with a really simple
intuitive example let's say we have this picture of a ball and your task is to predict
where this ball is going to travel to next. Now if you don't have any prior information about
the trajectory of the ball, its motion, its history, any guess or prediction about its next position is going to be exactly that, a random guess. If however in addition to the current
location of the ball I gave you some
information about where it was moving in the
past, now the problem becomes much easier, and I think hopefully we can all agree that the most likely next prediction is that this ball is going to move forward to the right in the next frame.
So this is a really you know reduced down
bare bones intuitive example but the truth is that beyond this sequential
data is really all around us. As I'm speaking the words coming out of
my mouth form a sequence of sound waves
that define audio which we can split up
to think about in this sequential manner similarly text language can be split up into a
sequence of characters or a sequence of words and there are many many more examples in which
sequential processing sequential data is present
right from medical signals like EKGs to financial
markets and projecting stock prices to biological sequences encoded in DNA to patterns in the
climate to patterns of motion and many more and so already hopefully you're getting
a sense of what these types of questions
and problems may look like and where
they are relevant in the real world when we consider applications of sequential
modeling in the real world we can think about a number of different kinds of problem definitions that we can have in our arsenal and work with
in the first lecture Alexander introduced the
Notions of classification and the notion of regression where he talked about and we learned
about feed forward models that can operate one to one in this fixed and static setting right
given a single input predict a single output
the binary classification example of will you succeed or pass this class. Here there's no notion of sequence, there's no notion of time. Now if we introduce this idea of a sequential component we can
handle inputs that may be defined temporally
and potentially also produce a sequential
or temporal output so for as one example we can consider text language and maybe we want to
generate one prediction given a sequence of text classifying whether a message is a
positive sentiment or a negative sentiment
conversely we could have a single input let's say
an image and our goal may be now to generate text or a sequential description of this image right
given this image of a baseball player throwing a ball can we build a neural network that generates
that as a language caption finally we can also
consider applications and problems where we have sequence-to-sequence processing, for example if we want to translate between two languages, and indeed this type of thinking and this type of architecture is what powers the task of machine translation in your phones, in Google Translate, and many
other examples so hopefully right this has given
you a picture of what sequential data looks like what these types of problem definitions may look
like and from this we're going to start and build up our understanding of what neural networks we
can build and train for these types of problems
so first we're going to begin with the notion
of recurrence and build up from that to Define recurrent neural networks and in the last
portion of the lecture we'll talk about the underlying mechanisms underlying the Transformer
architectures that are very very very powerful in
terms of handling sequential data but as I said
at the beginning right the theme of this lecture is building up that understanding step by step
starting with the fundamentals and the intuition so to do that we're going to go back revisit
the perceptron and move forward from there
right so as Alexander introduced, where we studied the perceptron in lecture one, the perceptron is defined by this single neural operation where we have some set of inputs, let's say X1 through XM, and each of these
numbers are multiplied by a corresponding weight
pass through a non-linear activation function
that then generates a predicted output y hat here we can have multiple inputs
coming in to generate our output but still these inputs are not thought of as
points in a sequence or time steps in a sequence
even if we scale this perceptron and start
to stack multiple perceptrons together to Define these feed forward neural networks we still
don't have this notion of temporal processing or sequential information even though we are able to
translate and convert multiple inputs apply these
weight operations apply this non-linearity
to then Define multiple predicted outputs so taking a look at this diagram right on the left
in blue you have inputs on the right in purple you have these outputs and the green defines the
neural the single neural network layer that's
transforming these inputs to the outputs Next
Step I'm going to just simplify this diagram I'm going to collapse down those stack perceptrons
together and depict this with this green block still it's the same operation going
on right we have an input Vector being
being transformed to predict this output
vector now what I've introduced here which you may notice is this new variable T right
which I'm using to denote a single time step we are considering an input at a single time
step and using our neural network to generate
a single output corresponding to that how could
we start to extend and build off this to now think about multiple time steps and how we could
potentially process a sequence of information well what if we took this diagram all I've done
is just rotated it 90 degrees where we still have
this input vector and being fed in producing an
output vector and what if we can make a copy of this network right and just do this operation
multiple times to try to handle inputs that are fed in corresponding to different times right
we have an individual time step starting with
t0 and we can do the same thing the same operation
for the next time step again treating that as an isolated instance and keep doing this repeatedly
and what you'll notice hopefully is all these models are simply copies of each other just with
different inputs at each of these different time
steps and we can make this concrete right in terms
of what this functional transformation is doing the predicted output at a particular time step
y hat of T is a function of the input at that time step X of T and that function is what is
learned and defined by our neural network weights
okay so I've told you that our goal here is
Right trying to understand sequential data do sequential modeling but what could
be the issue with what this diagram is showing and what I've shown you
here? Well, yeah, go ahead.
Exactly, that's exactly right. So the student's answer was that X1 could be related to X naught and you have this temporal dependence, but these isolated replicas don't capture that at all, and that answers the question perfectly.
right here a predicted output at a later time
step could depend precisely on inputs at previous
time steps if this is truly a sequential problem with this temporal dependence so how could we
start to reason about this how could we Define a relation that links the Network's computations
at a particular time step to Prior history and
memory from previous time steps well what if we
did exactly that right what if we simply linked the computation and the information understood
by the network to these other replicas via what we call a recurrence relation what this means is
that something about what the network is Computing
at a particular time is passed on to those
later time steps and we Define that according to this variable H which we call this internal
state or you can think of it as a memory term that's maintained by the neurons and the
network, and it's this state that's being passed on from time step to time step as we read in and process this sequential information. What this means is that the network's output, its predictions, its computations, is not only a function of the input data X, but also we have this other variable H, which captures this notion of state, captures this notion of memory,
that's being computed by the network and passed on over time specifically right to walk through this
our predicted output y hat of T depends not only on the input at a time but also this past memory
this past state and it is this linkage of temporal
dependence and recurrence that defines this idea
of a recurrent neural unit what I've shown is this this connection that's being unrolled over time
but we can also depict this relationship according to a loop this computation to this internal State
variable h of T is being iteratively updated over
time, and that's fed back into the neuron's computation in this recurrence relation. This is how we define these recurrent cells that comprise recurrent neural networks, or RNNs, and the key here is that we have this idea of
this recurrence relation that captures the cyclic
temporal dependency and indeed it's this idea
that is really the intuitive Foundation behind recurrent neural networks or rnns and so let's
continue to build up our understanding from here and move forward into how we can actually Define
the RNN operations mathematically and in code
so all we're going to do is formalize this
relationship a little bit more the key idea here is that the RNN is maintaining the state
and it's updating the state at each of these time steps as the sequence is is processed we
Define this by applying this recurrence relation
and what the recurrence relation captures is how
we're actually updating that internal state h of t. Specifically, that state update is exactly like any other neural network operation that we've introduced so far, where again we're learning
a function defined by a set of Weights w we're
using that function to update the cell State h
of t, and the additional component, the newness here, is that that function depends both on the input and the prior time step h of t minus one. And what you'll note is that this function f sub
W is defined by a set of weights and it's the same
set of Weights the same set of parameters
that are used time step to time step as the recurrent neural network processes this temporal
information, the sequential data. Okay, so the key idea here hopefully is coming through: this RNN state update operation takes the state and updates it at each time step as the sequence is processed. We can also translate this to how we can think about implementing RNNs in Python code
or rather pseudocode hopefully getting a better understanding and intuition behind how these
networks work so what we do is we just start by
defining an RNN for now this is abstracted away
and we start we initialize its hidden State and we have some sentence right let's say this is
our input of Interest where we're interested in predicting maybe the next word that's occurring in
this sentence what we can do is Loop through these
individual words in the sentence that Define our
temporal input and at each step as We're looping through each word in that sentence is fed into
the RNN model along with the previous hidden state and this is what generates a prediction for
the next word and updates the RNN state in turn
finally our prediction for the final word in
the sentence the word that we're missing is simply the rnn's output after all the prior
words have been fed in through the model so this is really breaking down how the RNN Works
how it's processing the sequential information
and what you've noticed is that the
RNN computation includes both this update to the hidden State as well
as generating some predicted output at the end that is our ultimate
goal that we're interested in
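In rough, runnable form, that loop might look like this (the toy vocabulary, the random untrained weights, and the name my_rnn are illustrative stand-ins, so the "prediction" is meaningless here; the point is the structure of feeding each word plus the previous hidden state back in):

```python
import numpy as np

vocab = ["this", "morning", "I", "took", "my", "cat", "for", "a", "walk"]
rng = np.random.default_rng(0)
W_xh, W_hh, W_hy = (rng.normal(size=s) for s in [(4, 9), (4, 4), (9, 4)])  # untrained weights

def my_rnn(word, h_prev):
    x = np.eye(len(vocab))[vocab.index(word)]      # one-hot encode the current word
    h = np.tanh(W_hh @ h_prev + W_xh @ x)          # update the hidden state
    return vocab[int(np.argmax(W_hy @ h))], h      # next-word guess, new hidden state

hidden_state = np.zeros(4)                         # initialize the hidden state
sentence = ["this", "morning", "I", "took", "my", "cat", "for", "a"]

for word in sentence:
    # Each word is fed in along with the previous hidden state, producing a
    # prediction for the next word and an updated hidden state in turn.
    prediction, hidden_state = my_rnn(word, hidden_state)

next_word = prediction   # the RNN's output after all prior words have been fed in
```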
And so, to walk through how we're actually generating the output prediction itself, what the RNN computes is: given some input vector, it then performs this update to the hidden state, and this update to the hidden state is just a standard neural network operation just like
we saw in the first lecture where it consists of
taking a weight Matrix multiplying that by the previous hidden State taking another weight Matrix
multiplying that by the input at a time step and applying a non-linearity and in this case right
because we have these two input streams the input
data X of T and the previous state H we have these
two separate weight matrices that the network is learning over the course of its training that
comes together we apply the non-linearity and then we can generate an output at a given
time step by just modifying the hidden state
using a separate weight Matrix to update this
value and then generate a predicted output. And that's all there is to it, right: that's how the RNN in its single operation updates both the hidden state and also generates a predicted output.
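Written out, with tanh as the non-linearity (the common choice), the update just described is:

$$h_t = \tanh\!\left(W_{hh}\, h_{t-1} + W_{xh}\, x_t\right), \qquad \hat{y}_t = W_{hy}\, h_t$$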
Okay, so now this gives you the internal working of how the RNN computation occurs at a particular time step. Let's next think about how this looks over time and define the computational graph of the RNN as being unrolled or expanded across time. So far the dominant way I've been
showing the rnns is according to this loop-like
diagram on the Left Right feeding back in on itself another way we can visualize and think
about rnns is as kind of unrolling this recurrence over time over the individual time steps in our
sequence what this means is that we can take
the network at our first time step and continue
to iteratively unroll it across the time steps going on forward all the way until we process all
the time steps in our input now we can formalize this diagram a little bit more by defining the
weight matrices that connect the inputs to the
hidden State update and the weight matrices that
are used to update the internal State across time and finally the weight matrices that Define
the update to generate a predicted output. Now recall that in all these cases, right, for all these three weight matrices at all these time steps, we are simply reusing the same weight matrices, right, so it's one set of parameters, one set of weight matrices, that just processes this
information sequentially now you may be thinking okay so how do we actually start to be thinking
about how to train the RNN how to define the loss
given that we have this temporal processing in
this temporal dependence well a prediction at an individual time step will simply amount to
a computed loss at that particular time step so now we can compare those predictions time step
by time step to the true label and generate a loss
value for those time steps, and finally we can get our total loss by taking all these individual loss terms together and summing them, defining the total loss for a particular input to the RNN. Now we can walk through an example of how we implement this RNN in TensorFlow starting from scratch. The RNN can be defined as a layer
operation and a layer class that Alexander introduced in the first lecture and so we can
Define it according to an initialization of weight matrices initialization of a hidden state which
commonly amounts to initializing these two to zero
next we can Define how we can actually pass
forward through the RNN Network to process a given input X and what you'll notice is in this forward
operation the computations are exactly like we just walked through we first update the hidden
state according to that equation we introduced
earlier and then generate a predicted output that
is a transformed version of that hidden state, and finally at each time step we return both the output and the updated hidden state, as this is what is necessary to be stored to continue this RNN operation over time.
What is very convenient is that although you could define your RNN network and your RNN layer completely from scratch, TensorFlow abstracts this operation away for you, so you can simply define a simple RNN according to this call that you're seeing here, which makes all the computations very efficient and very easy, and you'll actually get practice implementing and working with RNNs in today's software lab.
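A hedged sketch of what such a from-scratch layer and the built-in call might look like (the class name, dimensions, and initialization details here are illustrative, not the exact code from the slides):

```python
import tensorflow as tf

class MyRNNCell(tf.keras.layers.Layer):
    def __init__(self, rnn_units, input_dim, output_dim):
        super().__init__()
        # Weight matrices: input-to-hidden, hidden-to-hidden, hidden-to-output
        self.W_xh = self.add_weight(shape=(rnn_units, input_dim))
        self.W_hh = self.add_weight(shape=(rnn_units, rnn_units))
        self.W_hy = self.add_weight(shape=(output_dim, rnn_units))
        # Hidden state, commonly initialized to zeros
        self.h = tf.zeros([rnn_units, 1])

    def call(self, x):
        # Update the hidden state from the previous state and the current input
        self.h = tf.math.tanh(self.W_hh @ self.h + self.W_xh @ x)
        # The output is a transformed version of the hidden state
        y = self.W_hy @ self.h
        return y, self.h

# In practice, TensorFlow provides this as a built-in layer:
rnn_layer = tf.keras.layers.SimpleRNN(units=64)
```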
Okay, so that gives us the understanding of RNNs. Going back to what I described as kind of the problem setups or the problem definitions at the beginning of this lecture, I just want to remind you of the types of sequence modeling problems on which we can apply RNNs. Right, we can think about taking a sequence of
inputs producing one predicted output
at the end of the sequence we can think
about taking a static single input and trying to generate text according
to according to that single input and finally we can think about taking a sequence
of inputs producing a prediction at every time
step in that sequence and then doing this sequence
to sequence type of prediction and translation. Okay, so this will be the foundation for the software lab today, which will focus on this problem of many-to-many processing and many-to-many sequential modeling,
taking a sequence, going to a sequence. What is common and what is universal across all these types of problems and tasks that we may want to consider with RNNs is what I like to think about as the type of design criteria we need to build a robust and reliable network for
processing these sequential modeling problems what I mean by that is what are the characteristics
what are the the design requirements that the RNN needs to fulfill in order to be able
to handle sequential data effectively
the first is that sequences can be of different
lengths right they may be short they may be long we want our RNN model or our neural network
model in general to be able to handle sequences of variable lengths secondly and really importantly
is as we were discussing earlier that the whole
point of thinking about things through the lens of
sequence is to try to track and learn dependencies in the data that are related over time so
our model really needs to be able to handle those different dependencies which may occur at
times that are very very distant from each other
next right sequence is all about order right
there's some notion of how current inputs depend on prior inputs and the specific order of the
observations we see makes a big effect on what prediction we may want to generate at the end
and finally in order to be able to process this
information effectively our Network needs to be
able to do what we call parameter sharing meaning that given one set of Weights that set of weights
should be able to apply to different time steps in the sequence and still result in a meaningful
prediction and so today we're going to focus on
how recurrent neural networks meet these design
criteria and how these design criteria motivate the need for even more powerful architectures
that can outperform rnns in sequence modeling so to understand these criteria very concretely
we're going to consider a sequence modeling
problem where given some series of words our task
is just to predict the next word in that sentence. So let's say we have this sentence: this morning I took my cat for a walk, and our task is to predict the last word in the sentence given the prior words: this morning I took my cat for a blank.
our goal is to take our RNN Define it and put
it to test on this task what is our first step to doing this well the very very first step before
we even think about defining the RNN is how we can actually represent this information to the network
in a way that it can process and understand
if we have a model that is processing this data
processing this text-based data and wanting to generate text as the output our problem can arise
in that the neural network itself is not equipped to handle language explicitly right remember
that neural networks are simply functional
operators they're just mathematical operations
and so we can't expect it right it doesn't have an understanding from the start of what a word is
or what language means which means that we need a way to represent language numerically so that
it can be passed in to the network to process
So what we do is we need to define a way to translate this text, this language information, into a numerical encoding, a vector, an array of numbers that can then be fed in to our neural network, generating a vector of numbers as its output.
so now right this raises the question
of how do we actually Define this transformation how can we transform
language into this numerical encoding the key solution and the key way that a
lot of these networks work is this notion
and concept of embedding. What that means is it's some transformation that takes indices, or something that can be represented as an index, into a numerical vector of a given size. So if we think about how this idea
of embedding works for language data
let's consider a vocabulary of words that we can
possibly have in our language and our goal is to be able to map these individual words in our
vocabulary to a numerical Vector of fixed size one way we could do this is by defining all the
possible words that could occur in this vocabulary
and then indexing them, assigning an index label to each of these distinct words: a corresponds to index one, cat corresponds to index two, so on and so forth, and this indexing maps these individual words to numbers, unique indices.
What these indices can then define is what we call an embedding vector, which is a fixed-length encoding where we've simply indicated a one value at the index for that word when we observe that word, and this is called a one-hot embedding, where we have this fixed-length vector of the size of our vocabulary and each instance of the vocabulary corresponds to a one at the corresponding index. This is a very sparse way to do this, and it's based purely on the count index; there's no notion of semantic information or meaning that's captured in this vector-based encoding.
Alternatively, what is very commonly done is to actually use a neural network to learn an encoding, to learn an embedding, and the goal here is that we can learn a neural network that then captures some inherent meaning or inherent semantics
in our input data and Maps related words or
related inputs closer together in this embedding space meaning that they'll have numerical
vectors that are more similar to each other. This concept is really, really foundational to how these sequence modeling networks work and how neural networks work in general.
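As a small illustration of the two options (the toy vocabulary and the embedding size are made up for this example):

```python
import tensorflow as tf

vocab = ["a", "cat", "dog", "the", "walk"]           # toy vocabulary
word_to_index = {w: i for i, w in enumerate(vocab)}  # each word gets a unique index

# One-hot embedding: sparse, purely index-based, no notion of meaning.
one_hot = tf.one_hot(word_to_index["cat"], depth=len(vocab))   # [0., 1., 0., 0., 0.]

# Learned embedding: a dense vector whose values are trained so that
# semantically related words end up close together in the embedding space.
embedding_layer = tf.keras.layers.Embedding(input_dim=len(vocab), output_dim=8)
dense_vector = embedding_layer(tf.constant([word_to_index["cat"]]))  # shape (1, 8)
```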
Okay, so with that in hand we can go back to our design criteria, thinking about the capabilities that we desire.
first we need to be able to handle variable length sequences if we again want to predict
the next word in the sequence we can have short
sequences we can have long sequences we can have
even longer sentences and our key task is that we want to be able to track dependencies across
all these different lengths and what we need what we mean by dependencies is that there could
be information very, very early on in a sequence that may not be relevant or come up until very much later in the sequence, and we need to be able to track these dependencies and maintain this information in our network. Dependencies relate to order, and sequences are
defined by their order and we know that same words
in a completely different order have completely
different meanings right so our model needs to be able to handle these differences in order and
the differences in length that could result in different predicted outputs okay so hopefully
that example going through the example in text
motivates how we can think about transforming
input data into a numerical encoding that can be passed into the RNN and also what are
the key criteria that we want to meet in handling these these types of problems so so
far we've painted the picture of rnn's how they
work intuition their mathematical operations and
what are the key criteria that they need to meet the final piece to this is how we actually train
and learn the weights in the RNN, and that's done through the backpropagation algorithm with a bit of a twist to just handle sequential information.
if we go back and think about how we train feed
forward neural network models the steps break down in thinking through starting with an input where
we first take this input and make a forward pass through the network going from input to Output the
key to back propagation that Alexander introduced
was this idea of taking the prediction and back
propagating gradients back through the network and using this operation to then Define and
update the loss with respect to each of the parameters in the network in order to gradually
adjust the parameters the weights of the network
in order to minimize the overall loss now with
rnns as we walked through earlier we have this temporal unrolling which means that we have these
individual losses across the individual steps in our sequence that sum together to comprise
the overall loss what this means is that when
we do back propagation we have to now instead of
back propagating errors through a single Network back propagate the loss through
each of these individual time steps and after we back propagate loss through each
of the individual time steps we then do that
across all time steps, all the way from our current time t back to the beginning of the sequence, and this is why this algorithm is called backpropagation through time, right, because as you can see the data and the predictions and the resulting errors are fed back in time
all the way from where we are currently to
the very beginning of the input data sequence. So backpropagation through time is actually a very tricky algorithm to implement in practice, and the reason for this is if we take a close look at how gradients flow across the RNN, what this algorithm involves is many,
many repeated computations and multiplications of these weight matrices repeatedly against each
other in order to compute the gradient with respect to the very first time step we have to
make many of these multiplicative repeats of
the weight matrix. Why might this be problematic? Well, if this weight matrix W is very, very big, what this can result in is what we call the exploding gradient problem, where our gradients that we're trying to use to optimize our network do exactly that: they blow up, they explode,
and they get really big and makes it infeasible
and not possible to train the network stably. What we do to mitigate this is a pretty simple solution called gradient clipping, which effectively scales back these very big gradients to try to constrain them in a more restricted way.
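A minimal sketch of gradient clipping in TensorFlow (the tiny model, the random data, and the clip threshold of 1.0 are all illustrative):

```python
import tensorflow as tf

model = tf.keras.Sequential([tf.keras.layers.SimpleRNN(64), tf.keras.layers.Dense(1)])
optimizer = tf.keras.optimizers.Adam()

x = tf.random.normal([8, 10, 4])   # dummy batch: 8 sequences, 10 time steps, 4 features
y = tf.random.normal([8, 1])

with tf.GradientTape() as tape:
    loss = tf.reduce_mean((model(x) - y) ** 2)

grads = tape.gradient(loss, model.trainable_variables)
# Scale the gradients back whenever their global norm exceeds the threshold.
clipped, _ = tf.clip_by_global_norm(grads, clip_norm=1.0)
optimizer.apply_gradients(zip(clipped, model.trainable_variables))
```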
conversely we can have the instance where the
weight matrices are very very small and if these weight matrices are very very small we end up with
a very very small value at the end as a result of these repeated weight Matrix computations and
these repeated multiplications, and this is a very real problem in RNNs in particular, where we can run into this phenomenon called a vanishing gradient, where now your gradient has just dropped down close to zero and again you can't train the network stably. Now there are particular tools that we can implement to try to mitigate the vanishing gradient problem,
and we'll touch on each of these three solutions briefly first being how we can Define the
activation function in our Network and how we can change the network architecture itself to try
to better handle this Vanishing gradient problem
Before we do that, I want to take just one step back to give you a little more intuition about why vanishing gradients can be a real issue for recurrent neural networks. The point I've kept trying to reiterate is this notion
of dependency in the sequential data and what
it means to track those dependencies well if the
dependencies are very constrained in a small space not separated out that much by time this repeated
gradient computation and the repeated weight matrix multiplication is not so much of a problem
if we have a very short sequence where the words
are very closely related to each other and it's
pretty obvious what our next output is going to be the RNN can use the immediately passed
information to make a prediction, and so there is not going to be that much of a requirement to learn effective weights if the related information is close to each other temporally. Conversely, now if we have a sentence
where we have a more long-term dependency what this means is that we need information from
way further back in the sequence to make our
prediction at the end and that gap between what's
relevant and where we are at currently becomes exceedingly large and therefore the vanishing
gradient problem is increasingly exacerbated, meaning that the RNN becomes unable to connect the dots and establish this long-term dependency, all because of this vanishing gradient issue. So the ways and modifications that we can make to our network to try to alleviate this problem are threefold. The first is that we can simply change
the activation functions in each of our
neural network layers to be such that they can effectively safeguard the gradients from shrinking in instances where the data is greater than zero, and this is in particular true for the ReLU activation function. The reason is that in all instances where x is greater than zero, with the ReLU function the derivative is one, and so that is not less than one, and therefore it helps in mitigating the vanishing gradient problem. Another trick is how we initialize the parameters
in the network itself to prevent them from
shrinking to zero too rapidly, and there are mathematical ways that we can do this, namely by initializing our weights to identity matrices, and this effectively helps in practice to prevent the weight updates from shrinking too rapidly to zero.
however the most robust solution to the vanishing
gradient problem is by introducing a slightly
more complicated uh version of the recurrent neural unit to be able to more effectively track
and handle long-term dependencies in the data and this is this idea of gating and what the
idea is is by controlling selectively the flow
of information into the neural unit to be able to
filter out what's not important while maintaining what is important and the key and the most popular
type of recurrent unit that achieves this gated computation is called the LSTM, or long short-term memory network. Today we're not going to go into detail on LSTMs, their mathematical details, their operations and so on, but I just want to convey the key idea and intuitive idea about why these LSTMs are effective at tracking long-term dependencies. The core is that the LSTM is able to control
the flow of information through these gates
to be able to more effectively filter out the
unimportant things and store the important things. What you can do is implement LSTMs in TensorFlow just as you would an RNN, but the core concept that I want you to take away when thinking about the LSTM is this idea of controlled information flow through gates. Very briefly, the way that the LSTM operates is by maintaining a cell state, just like a standard RNN, and that cell state
is independent from what is directly outputted the way the cell state is updated is according to
these Gates that control the flow of information
for getting and eliminating what is irrelevant
storing the information that is relevant updating the cell state in turn and then
filtering this this updated cell state to produce the predicted output just like the
standard RNN. And again, we can train the LSTM using the backpropagation through time algorithm, but the mathematics of how the LSTM is defined allows for a completely uninterrupted flow of the gradients, which largely eliminates the vanishing gradient problem that I introduced earlier.
Again, if you're interested in learning more about the mathematics and the details of LSTMs, please come and discuss with us after the lectures, but again, I'm just emphasizing the core concept and the intuition behind how the LSTM operates.
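As noted, in TensorFlow an LSTM layer can be dropped in just as you would a simple RNN layer; a minimal sketch (layer sizes are illustrative):

```python
import tensorflow as tf

model = tf.keras.Sequential([
    tf.keras.layers.LSTM(units=64),  # gated recurrent layer maintaining a cell state
    tf.keras.layers.Dense(1),        # task-specific output head
])
```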
Okay, so far we've covered a lot of ground. We've gone through the fundamental workings of RNNs, the architecture, the training, the type of problems that they've been applied to, and I'd like to close this part by considering some concrete examples
of how you're going to use rnns in your software
lab and that is going to be in the task of Music generation where you're going to work to build an
RNN that can predict the next musical note in a sequence and use it to generate brand new musical
sequences that have never been realized before
so to give you an example of just the quality
and and type of output that you can try to aim towards a few years ago there was a work that
trained an RNN on a corpus of classical music data, and famously there's this composer Schubert who wrote a famous unfinished symphony that consisted of two movements, but he was unable to finish his symphony before he died, and he left the third movement unfinished. So a few years ago a group trained an RNN-based model to actually try to generate the third movement to Schubert's famous unfinished symphony given the prior two movements, so I'm going to play the result right now. Okay, I paused it, I interrupted it quite abruptly there, but if there are any classical
music aficionados out there, hopefully you get an appreciation for kind of the quality that was generated in terms of the music, and this was already from a few years ago, and as we'll see in the next lectures, continuing with
this theme of generative AI the power of these
algorithms has advanced tremendously since
we first played this example, particularly in a whole range of domains which I'm excited to talk about, but not for now. Okay, so you'll tackle this problem head on in today's lab: RNN music generation. We can also think about the simple example of an input sequence to a single output with sentiment classification, where we can think about, for example, text like tweets and assigning positive or negative labels to these text examples based on the
okay so this kind of concludes the portion on rnns
and I think it's quite remarkable that using all the foundational Concepts and operations
that we've talked about so far we've been able to try to build up networks that handle
this complex problem of sequential modeling
but like any technology right and RNN is not
without limitations so what are some of those limitations and what are some potential issues
that can arise with using rnns or even lstms the first is this idea of encoding and and
dependency in terms of the the temporal separation
of data that we're trying to process while rnns
require is that the sequential information is fed in and processed time step by time step what that
imposes is what we call an encoding bottleneck right where we have we're trying to encode a lot
of content for example a very large body of text
many different words into a single output that
may be just at the very last time step how do we ensure that all that information leading up to
that time step was properly maintained and encoded and learned by the network in practice this is
very very challenging and a lot of information
can be lost another limitation is that by
doing this time step by time step processing rnns can be quite slow there is not really
an easy way to parallelize that computation and finally together these components of
the encoding bottleneck the requirement to
process this data step by step imposes the biggest
problem, which is when we talk about long memory: the capacity of the RNN and the LSTM is really not that long. We can't really handle sequential data of tens of thousands or hundreds of thousands of steps, or even beyond, effectively enough to learn the complete amount of information and patterns that are present within such a rich data source. And so because of this, very recently there's been
a lot of attention in how we can move Beyond this notion of step-by-step recurrent processing
to build even more powerful architectures for
processing sequential data to understand how we
do how we can start to do this let's take a big step back right think about the high level goal
of sequence modeling that I introduced at the very beginning given some input a sequence of data
we want to build a feature encoding and use our
neural network to learn that and then transform
that feature encoding into a predicted output what we saw is that rnns use this notion
of recurrence to maintain order information processing information time step by time step
but as I just mentioned we had these key three
bottlenecks to rnns what we really want to achieve
is to go beyond these bottlenecks and Achieve even higher capabilities in terms of the power of
these models rather than having an encoding bottleneck ideally we want to process information
continuously as a continuous stream of information
rather than being slow we want to be able to
parallelize computations to speed up processing and finally of course our main goal is
to really try to establish long memory that can build nuanced and Rich
understanding of sequential data
The limitation of RNNs that's linked to all these problems and issues, and our inability to achieve these capabilities, is that they require this time step by time step processing. So what if we could move beyond that? What if we could eliminate this need for recurrence entirely and not have to process the data time step by time step? Well, a first and naive approach would be to just squash all the data, all the time steps,
together to create a vector that's effectively concatenated right the time steps are eliminated
there's just one one stream where we have now one
vector input with the data from all time points
that's then fed into the model it calculates some feature vector and then generates some output
which hopefully makes sense and because we've squashed all these time steps together we
could simply think about maybe building a
feed forward Network that could that could do this
computation. Well, with that we'd eliminate the need for recurrence, but we still have the issues that it's not scalable, because the dense feed-forward network would have to be immensely large, defined by many, many different connections, and critically, we've completely lost our order information by just squashing everything together blindly. There's no temporal dependence, and we're then stuck in our ability to try to establish long-term memory. So what if instead we could still think
about bringing these time steps together
but be a bit more clever about how we try
to extract information from this input data the key idea is this idea of being able to
identify and attend to what is important in a potentially sequential stream of information and
this is the notion of attention or self-attention
which is an extremely, extremely powerful concept in modern deep learning and AI. I cannot understate, or rather overstate, I cannot emphasize enough how powerful this concept is. Attention is the foundational mechanism of the Transformer architecture, which many of you may have heard about, and the notion
of a transformer can often be very daunting because sometimes they're presented with these
really complex diagrams or deployed in complex applications and you may think okay how
do I even start to make sense of this
at its core though attention the key operation
is a very intuitive idea and we're going to in the last portion of this lecture break
it down step by step to see why it's so powerful and how we can use that as part of
a larger neural network like a Transformer
specifically we're going to be talking and
focusing on this idea of self-attention attending to the most important parts of an
input example so let's consider an image I think it's most intuitive to consider an image
first this is a picture of Iron Man and if our
goal is to try to extract information from this
image of what's important, what we could do maybe is use our eyes to naively scan over this image pixel by pixel, right, just going across the image. However, our brains maybe internally are doing some type of computation like this, but you and I can simply look at this image and be able to attend to the important parts. We can see that it's Iron Man coming at you
right in the image and then we can focus in a little further and say okay what are the
details about Iron Man that may be important, what is key. What you're doing is your brain is identifying which parts to attend to, and then extracting those features that deserve the highest attention. The first part of this problem is really the most interesting and challenging one,
and it's very similar to the concept of search
effectively that's what search is doing taking some larger body of information and trying
to extract and identify the important parts so let's go there next how does search work you're
thinking you're in this class how can I learn more
about neural networks well in this day and age one
thing you may do besides coming here and joining us is going to the internet having all the videos
out there trying to find something that matches doing a search operation so you have a giant
database like YouTube you want to find a video
you enter in your query, deep learning, and what comes out are some possible outputs. Right, for every video in the database there is going to be some key information related to that video, let's say the title. Now to do the search, the task is to find the overlaps between your query and each of these titles, right, the keys in the database. What we want to compute is a metric of similarity
and relevance between the query and these keys how similar are they to our desired query and we can
do this step by step let's say this first option
of a video about the elegant giant sea turtles
not that similar to our query about deep learning our second option introduction to deep learning
the first introductory lecture on this class yes highly relevant the third option a video about
the late and great Kobe Bryant not that relevant
The key operation here is that there is this similarity computation bringing the query and the key together. The final step is, now that we've identified which key is relevant, extracting the relevant information, what we want to pay attention to, and that's the video itself. We call this the value, and because the search is implemented well, right, we've successfully identified the relevant video on deep learning that you are going to want to pay attention to. And it's this idea, this intuition, of
giving a query trying to find similarity
trying to extract the related values
that form the basis of self-attention and how it works in neural networks like
Transformers so to go concretely into this right let's go back now to our text our language
example with the sentence our goal is to identify
and attend to features in this input that are
relevant to the semantic meaning of the sentence now first step we have sequence we have order
we've eliminated recurrence, right, we're feeding in all the time steps all at once, and we still need a way to encode and capture this information about order and this positional dependence. How this is done is this idea of positional encoding, which captures some inherent order information present in the sequence. I'm just going to touch on this very briefly, but the idea is related to this idea of embeddings which I introduced earlier.
what is done is a neural network layer is
used to encode positional information that captures the relative relationships in
terms of order within this text. That's the high-level concept, right: we're still able to process these time steps all at once, there is no notion of time step, rather the data is singular, but still we learn this encoding that captures the positional order information.
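A hedged sketch of a learned positional embedding, along the lines described: a layer maps each position index to a vector that is combined with the token embedding (the sizes, and the choice of a learned rather than fixed encoding, are illustrative):

```python
import tensorflow as tf

seq_len, d_model = 10, 32
tokens = tf.random.uniform([2, seq_len], maxval=100, dtype=tf.int32)  # dummy batch of token indices

token_embed = tf.keras.layers.Embedding(input_dim=100, output_dim=d_model)
pos_embed = tf.keras.layers.Embedding(input_dim=seq_len, output_dim=d_model)

positions = tf.range(seq_len)                   # [0, 1, ..., seq_len - 1]
# Each token's embedding is combined with an embedding of its position,
# so order information is carried even though all time steps are fed at once.
x = token_embed(tokens) + pos_embed(positions)  # shape (2, seq_len, d_model)
```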
Now our next step is to take this encoding and figure out what to attend to, exactly like that search
operation that I introduced with the YouTube
example extracting a query extracting a key
extracting a value and relating them to each other so we use neural network layers to do exactly
this. Given this positional encoding, what attention does is apply a neural network layer transforming that, first generating the query. We do this again using a separate neural network layer, and this is a different set of weights, a different set of parameters, that then transforms that positional embedding in a different way, generating a second output, the key. And finally this operation is repeated with a third layer, a third set of weights, generating the value. Now with these three in hand, the query, the key, and the value, we can compare them to each other to try to figure out where, in that self-input, the network should attend
to what is important and that's the key idea
behind this similarity metric or what you can
think of as an attention score. What we're doing is we're computing a similarity score between a query and the key, and remember that these query and key values are just arrays of numbers. We can define them as arrays of numbers, which
you can think of as vectors in space the query
Vector the query values are some Vector the key the key values are some other vector and
mathematically the way that we can compare these two vectors to understand how similar they are is
by taking the dot product and scaling it; this captures how similar these vectors are, whether or not they're pointing in the same direction, right. This is the similarity metric, and if you are familiar with a little bit of linear algebra, this is also known as the cosine similarity. The operation functions exactly the same way for matrices: if we apply this dot product operation to our query and key matrices, we get this similarity metric out. Now this is very, very key in defining our next step, computing the attention weighting in terms of what the network should actually attend to within
this input this operation gives us a score which
defines how how the components of the input data
are related to each other so given a sentence right when we compute this similarity score metric
we can then begin to think of Weights that Define the relationship between the sequential the
components of the sequential data to each other
So for example, in this example with a text sentence, he tossed the tennis ball to serve, the goal with the score is that words in the sequence that are related to each other should have high attention weights: ball related to toss, related to tennis. And this metric itself is our attention weighting. What we have done is passed that similarity score through a softmax function, which all it does is constrain those values to be between 0 and 1, and so you can think of these as relative scores, or
relative attention weights. Finally, now that we have this metric that captures this notion of similarity and these internal self-relationships, we can finally use this metric to extract features that are deserving of high attention, and that's the exact final step in this self-attention mechanism, in that we take that attention weighting matrix, multiply it by the value, and get a transformation of the initial data as our output, which in turn reflects the features that correspond to high attention.
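Putting the pieces together, a single self-attention head might be sketched like this (the layer sizes, and the use of freshly created Dense layers inside a function, are purely illustrative):

```python
import tensorflow as tf

def self_attention_head(x, d_model=64):
    # x: (batch, seq_len, features), already combined with a positional encoding.
    # Three separate learned layers generate the query, key, and value.
    q = tf.keras.layers.Dense(d_model)(x)
    k = tf.keras.layers.Dense(d_model)(x)
    v = tf.keras.layers.Dense(d_model)(x)

    # Similarity between queries and keys: scaled dot product.
    scores = tf.matmul(q, k, transpose_b=True) / tf.math.sqrt(float(d_model))

    # Softmax constrains the attention weights to lie between 0 and 1.
    weights = tf.nn.softmax(scores, axis=-1)

    # Multiply the attention weights by the values to extract the
    # features that deserve high attention.
    return tf.matmul(weights, v)

# Example: a batch of 2 sequences, 5 tokens each, 32 features per token.
out = self_attention_head(tf.random.normal([2, 5, 32]))   # shape (2, 5, 64)
```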
All right, let's take a breath and recap what we have just covered so far.
The goal with this idea of self-attention, the backbone of Transformers, is to eliminate recurrence and attend to the most important features in the input data in an architecture. How this is actually deployed is: first we take our input data, we compute these positional encodings, the neural network layers are applied threefold to transform the positional encoding into each of the key, query, and value matrices, we can then compute the self-attention weight score according to the dot product operation that we went through prior, and then self-attend to these features, to this information, to extract features that deserve high attention. What is so powerful about this approach, in taking
this attention weight, putting it together with the value to extract high-attention features, is that this operation, the scheme that I'm showing on
the right defines a single self-attention head
and multiple of these self-attention heads can be linked together to form larger Network
architectures where you can think about these different heads trying to extract different
information different relevant parts of the input
to now put together a very very rich encoding and
representation of the data that we're working with intuitively back to our Ironman example what
this idea of multiple self-attention heads can amount to is that different salient features and salient information in the data is extracted. First maybe you consider Iron Man, attention head one, and you may have additional attention heads that are picking out other relevant parts of
the data which maybe we did not realize before for example the building or the spaceship
in the background that's chasing iron
man. And so this is a key building block of many, many powerful architectures that are out there today. I again cannot emphasize enough how powerful this mechanism is, and indeed this backbone idea of self-attention that you just built up
understanding of is the key operation of some
of the most powerful neural networks and deep learning models out there today, ranging from the very powerful language models like GPT-3, which are capable of synthesizing natural language in a very human-like fashion, digesting large bodies of text information to understand relationships in text, to models that are being deployed for extremely impactful applications in biology and medicine, such as AlphaFold 2, which uses this notion of self-attention to look at data of protein sequences and be able to predict the
three-dimensional structure of a protein just
given sequence information alone and all the way even now to computer vision which will be the
topic of our next lecture tomorrow where the same idea of attention that was initially developed in
sequential data applications has now transformed
the field of computer vision and again using
this key concept of attending to the important features in an input to build these very rich
representations of complex High dimensional data okay so that concludes lectures for today I
know we have covered a lot of territory in a
pretty short amount of time but that is what this
boot camp program is all about so hopefully today you've gotten a sense of the foundations of neural
networks in the lecture with Alexander we talked about rnns how they're well suited for sequential
data how we can train them using back propagation
how we can deploy them for different applications
and finally how we can move Beyond recurrence to build this idea of self-attention for
building increasingly powerful models for deep learning in sequence modeling
All right, hopefully you enjoyed it. We have about 45 minutes left for the lab portion and open office hours, in which we welcome you to ask questions of us and the TAs and to start work on the labs. The information for the labs is up there. Thank you so much for your attention.