“Gradient-Based Learning Algorithms for Recurrent Networks and Their Computational Complexity”, Ronald J. Williams & David Zipser, 1995:

[RTRL] In this chapter we describe several gradient-based approaches to training a recurrent network to perform a desired sequential behavior in response to input. In characterizing these approaches as “gradient-based” we mean that at least part of the learning algorithm involves computing, exactly or approximately, the gradient in weight space of some performance measure for the network, with the result then used in some appropriate fashion to determine the weight changes. For the type of task investigated here, the performance measure is a simple measure of error between actual and desired output.
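Such a performance measure might be sketched as follows, assuming a tiny fully recurrent network with squared error summed over the sequence; the network size, weight layout, and all names here are illustrative assumptions, not the chapter's notation:

```python
import math
import random

random.seed(0)
n_units, n_inputs = 3, 2

# One weight row per unit, over [previous activations, external inputs, bias].
weights = [[random.gauss(0.0, 0.1) for _ in range(n_units + n_inputs + 1)]
           for _ in range(n_units)]

def step(y, x):
    """One time step: each unit sees all prior activations, the current
    external input, and a constant bias input of 1."""
    z = y + x + [1.0]
    return [math.tanh(sum(w * v for w, v in zip(row, z))) for row in weights]

def total_error(inputs, targets):
    """Performance measure: summed squared error between actual and
    desired outputs over the whole sequence."""
    y = [0.0] * n_units
    J = 0.0
    for x, d in zip(inputs, targets):
        y = step(y, x)
        J += 0.5 * sum((di - yi) ** 2 for di, yi in zip(d, y))
    return J

inputs = [[0.5, -0.3], [0.1, 0.8], [0.0, 0.2]]
targets = [[0.0, 1.0, 0.0]] * 3
print(total_error(inputs, targets))
```

A gradient-based learning algorithm would then compute (exactly or approximately) the derivative of this scalar with respect to each entry of `weights`.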

Because we deal here only with gradient-based learning algorithms, our primary focus will be on techniques for computing this exact or approximate gradient information. It is to be understood that there may be various alternative ways to use this gradient information in a particular learning algorithm, including simple proportional descent along the error gradient or the use of “momentum” or other more sophisticated acceleration techniques.
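The two uses of gradient information mentioned above can be contrasted in a minimal sketch; the toy error surface E(w) = Σᵢ wᵢ² and the step sizes are illustrative assumptions, standing in for whatever gradient the recurrent-network algorithms produce:

```python
def grad(w):
    # Gradient of the toy error E(w) = sum(w_i^2); a stand-in for the
    # exact or approximate gradient a recurrent-network algorithm returns.
    return [2.0 * wi for wi in w]

def plain_descent(w, lr=0.1):
    # Simple proportional descent along the error gradient.
    g = grad(w)
    return [wi - lr * gi for wi, gi in zip(w, g)]

def momentum_descent(w, v, lr=0.1, mu=0.9):
    # "Momentum": accumulate a velocity that smooths successive gradients,
    # which can accelerate progress along consistent descent directions.
    g = grad(w)
    v = [mu * vi - lr * gi for vi, gi in zip(v, g)]
    return [wi + vi for wi, vi in zip(w, v)], v

w1 = [1.0, -2.0]
w2, v = [1.0, -2.0], [0.0, 0.0]
for _ in range(50):
    w1 = plain_descent(w1)
    w2, v = momentum_descent(w2, v)

print(sum(x * x for x in w1), sum(x * x for x in w2))
```

Either update rule (or a more sophisticated acceleration technique) can be layered on top of any of the gradient-computation methods discussed in this chapter; the gradient computation and its use are separable design choices.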

We discuss several approaches to performing the desired gradient computation, some based on the familiar backpropagation algorithm and some involving other ideas. Part of the intent of this chapter is to discuss the relationship between these various alternative approaches to gradient computation in recurrent networks. We begin by developing exact gradient computation algorithms, but later we note how they give rise to useful approximation strategies having more desirable computational features. For all these approaches to exact or approximate gradient computation we also provide an analysis of their computational requirements. The reader interested in performing digital computer simulation experiments with these various algorithms may find these analyses particularly helpful. In addition, we note some special architectures that readily lend themselves to specific hybrid strategies, giving rise to conceptually and/or computationally simpler algorithms for exact gradient computation. Additional topics discussed are teacher forcing, a useful adjunct to all of the techniques discussed, and some experimental comparisons of the performance of some of the algorithms.