“Comments on the Origin and Application of Markov Decision Processes”, 2002 ():
[Ronald Howard describes how he invented policy iteration (1960): as a grad student in the operations-research Group of Arthur D. Little Inc, he became involved in Sears catalogue mailing optimization, using value iteration. It was a success, increasing profits a few percentage points or millions of dollars annually, but Howard thought that the value iteration could be faster and invented policy iteration.]
…There were many options. A customer could receive up to 14 individual mailings a year, ranging from the general catalog to individual sales fliers. Or, he might receive any subset of these mailings. The cost of any particular mailing was easily determined, but what was the benefit?
To answer that question, let me take you back with me to a grey Chicago day, 20 years ago, when I first saw the Sears’ catalog information system. It was an unforgettable sight. Imagine, if you will, two or 3 acres of green steel filing cabinets. Each cabinet contained steel Addressograph plates, about 4 inches square. Each plate had a stencil for printing the customer’s name and address, and, as inserts, 3 small cards with several punched holes. These holes provided a limited summary over 3 seasons of the customer’s purchasing history as a Sears mail order client. About 100 young women continually circulated among the filing cabinets. They were supplied with one copy of the customer’s latest mail order, and it was their job to update the punches on the cards to incorporate the effect of the order. The information recorded was highly quantized in terms of the number and amount of the orders. The key to the system was the machine that used the steel plates to print labels for the catalog to be mailed. A drawer of plates was stacked into the machine. The machine examined the punched holes in the cards on each plate and determined, according to wired-board logic, whether this pattern of holes qualified the customer to receive the particular catalog being distributed at that time. If the pattern was favorable, a label was printed; otherwise, not. Thus, for example, the management could decide to send the general catalog only to customers who purchased more than $161.99$201959 during the present season. By wiring the machine appropriately, this decision was implemented simply by passing every drawer of cards through the labeling machine.
In making this decision, the management had traditionally looked at the direct profit to be expected from a customer as measured by the difference between the marginal profit on his purchases and the cost of sending him the catalog. As we examined the operation, we began to wonder whether it might be profitable to send catalogs, not just on the basis of the profit they might produce during the present season, but also for the impact they might have in moving the customer into more profitable categories in the future.
…The optimum policy, for both discounted present value and average reward criteria, was found by value iteration. This all took place in the days when computers still had vacuum tubes. And so the runs were fairly time-consuming, but still economical. The optimum policy was different from the policy that had previously been used. The optimum policy was not the policy that maximized expected immediate return, but rather a policy that balanced this return with the effect on future state transitions. The net result was a predicted few percent increase in the profitability of the catalog operation, which, however, amounted to several million dollars per year.
…The experience left me with the suspicion that there might be a way to go directly to the best policy without the need for value iteration. I worked on the problem for about 6 months under Dr. Kimball’s supervision and was able to develop a policy iteration method (1960). I would like to have described the motivating problem at that time, but the proprietary nature of the work with Sears, Roebuck and Company made that impossible. Perhaps, the cause of application might now be further advanced if this work had been presented in terms of its original application rather than by means of artificial examples (, 1971). Of course, on a broader scale this story makes one painfully aware of how thinly our professional journals cover important applications.
…I have discussed the general issue of application in a talk called “The Practicality Gap”. Finally, the only other important application I know of MDPs is concerned with metals futures and is also proprietary.
…I was not aware of Bellman’s ideas on policy iteration; I was already writing my thesis in 1957.