The yield curve has inverted, the trade war seems to be in early innings, and economic data is being revised downward. Forecasters are warning of a recession and even financial panic. CNBC has been replete with talk of markets in turmoil. A good part of the country is apoplectic about Trump’s actions on trade.
On the flip side, retail sales, consumer confidence, and employment remain robust. Important voices including former Fed Chairwoman Janet Yellen have offered a calming perspective. The YTD total return on the S&P 500 including dividends is 15% as of 8/23/19. Bond investors are having a great year. People seem to have forgotten that most of our trade is with countries not named China:
[table id=25 /]
Let’s expand the aperture and see what the data suggests for equity returns and a near-term recession. I built three neural networks using a dataset of 164 monthly economic and market data series going back to 1992. The results are below:
[table id=24 /]
Instead of using Recurrent Neural Networks, I employed fully connected neural networks and included time-series metadata. Also, I achieved higher accuracy results using the newly proposed RAdam optimization function.
TL;DR: A Recession is Not Likely
As I wrote here in July, the data at the time didn’t fit the hysteria narrative nor confirm a rate cut. An NLP analysis of Fed communications and statements suggested otherwise, and the Fed cut. Since then, Fed comments have focused on the lack of a trade war playbook and the desire to sustain the expansion (read: rate cuts). Falling oil prices may be counteracting the negative effects of tariffs. The models I built imply:
- Reasonable 12-month equity returns
- No current recession
- Employment at levels confirmed by other data
Let’s take a deep breath…and get back to work!
Full code for this post can be found on GitHub under post 60. Read on for detail on the models.
Model Training Data
The training set for this analysis is monthly data on 164 economic and market series going back to 1992. I use the most recent available data and acknowledge that restatements are possible (e.g. the Bureau of Labor Statistics recently restated nonfarm payrolls through March 2019 down by 501,000).
Currency and Bond Markets
Dollar Index, 2 Year Treasury Maturity Rate, Aaa Corporate Bond Yield to 10-Year, Economic Policy Uncertainty Index, Long-Term Government Bond Yield, CBOE Volatility Index, Gold Price, Baa Corporate Bond Yield to 10-Year, 90 Day Eurodollar, S&P Monthly Return, LIBOR, USD/Euro Exchange Rate, Spread Between 10-Year Treasury and 3-Month, Ted Spread, Fed Funds Rate, NY Fed Recession Indicator, 10-Year-2-Year
Consumer Discretionary, Information Technology, Healthcare, Energy, Utilities, Industrials, Materials, Real Estate, Financials, Consumer Staples, Russell 2000, NASDAQ Composite, S&P500 EPS, Amazon Stock Price, Walmart Stock Price
Total Nonfarm Payrolls, Civilian Unemployment Rate, Long-Term Unemployment Rate, Civilian Labor Force Participation Rate, Civilian Employment-Population Ratio, Employment Level: Part-Time, Initial Claims (average), Continuing Claims, Government Payrolls, Nonfarm Private Payrolls, Manufacturing Payrolls, Job Offers, Wages, Wage Growth, Wages in Manufacturing, Average Hourly Earnings, Average Weekly Hours, Challenger Job Cuts, Job Vacancies, Youth Unemployment
Inflation, Inflation Expectations, CPI, Core Inflation Rate, Producer Price Changes, Export Prices, Import Prices, Food Inflation, Core PCE Price Index, Core Producer Prices, CPI Housing Utilities, CPI Transportation, PCE Price Index, US Cass Freight Shipments, US Cass Freight Expenditures, Long Beach Inbound, Long Beach Outbound, Long Beach Empties, Truck Tonnage, Freight Transport Index, Rail Freight Carloads, Rail Freight Intermodal
US ISM PMI, Manufacturing PMI, Industrial Production Index, Manufacturing Production, Capacity Utilization, US Durable Goods, Durable Goods excluding Defense, Durable Goods excluding Transportation, Factory Orders excluding Transportation, Factory Orders, Factory Orders ex Transportation, New Orders, Business Inventories, Wholesale Inventories, NFIB Business Optimism Index, Chicago Fed National Activity Index, Dallas Fed Manufacturing Index, NY Empire State Manufacturing Index, Philadelphia Fed Manufacturing Index, Richmond Fed Manufacturing Index, Auto Assemblies, Light Vehicle Sales, Leading Index for the United States, Kansas Fed Manufacturing Index, Mining Production, Steel Production, Capital Goods Shipments, Capital Goods Orders, RV Unit Shipments YOY, Corporate Profits(quarterly)
Consumer & Housing
UMich Consumer Sentiment, Advance Retail Sales, Retail Sales, Retail Sales YOY, Retail Sales Ex Autos MoM, Disposable Personal Income, Consumer Spending, Personal Spending, Personal Income, Personal Savings, Consumer Credit, Private Sector Credit, Bank Lending Rate, Economic Optimism Index, Chain Store Sales, Gasoline Prices, Debt Service to Personal Income
Housing & Construction
S&P/Case-Shiller U.S. National Home Price Index, Single Housing Starts, Multi-Housing Starts, Building Permits, New Home Sales, Pending Home Sales, Existing Home Sales, Nonresidential Construction Spending, Housing Index, Nahb Housing Market Index
OECD Composite Business Confidence Index
Europe: Long-Term Government Bond Yields for Euro Area, Euro Area Services PMI, Euro Area Manufacturing PMI, GDP Growth
Germany: German Long-term Government Bond Yields, German 3-month Rates, German Registered Unemployment Rate, Production of Total industry in Germany, Germany New Orders, Services PMI, Manufacturing PMI, GDP Growth
China: Industrial Production, Electricity Production, Consumer Confidence, Composite PMI, Caixin Manufacturing PMI, GDP Growth
RNNs, particularly long short-term memory models (LSTMs), are often considered the best fit for time series data, since LSTMs were designed to mitigate the vanishing gradient problem that affects plain RNNs. But Jeremy Howard from fast.ai argues that this holds mainly for single-sequence time series or when the dataset is incredibly complex. In practice, an analysis often contains many data sequences, metadata, and other information. Per Howard, most state-of-the-art results in time series are coming from fully connected neural networks with added date-related metadata.
Fast.ai provides a useful function, add_datepart, to add date-related metadata. The full code to implement the function (and the list of metadata that gets generated) is below:
import re

import numpy as np
import pandas as pd

def add_datepart(df, fldname, drop=True, time=False):
    "Helper function that adds columns relevant to a date."
    fld = df[fldname]
    fld_dtype = fld.dtype
    # Treat timezone-aware datetimes as plain datetimes for the dtype check
    if isinstance(fld_dtype, pd.core.dtypes.dtypes.DatetimeTZDtype):
        fld_dtype = np.datetime64
    # Parse the column if it is not already a datetime
    if not np.issubdtype(fld_dtype, np.datetime64):
        df[fldname] = fld = pd.to_datetime(fld, infer_datetime_format=True)
    # Strip a trailing 'date'/'Date' so new columns get a clean prefix
    targ_pre = re.sub('[Dd]ate$', '', fldname)
    attr = ['Year', 'Month', 'Week', 'Day', 'Dayofweek', 'Dayofyear',
            'Is_month_end', 'Is_month_start', 'Is_quarter_end', 'Is_quarter_start',
            'Is_year_end', 'Is_year_start']
    if time:
        attr = attr + ['Hour', 'Minute', 'Second']
    for n in attr:
        df[targ_pre + n] = getattr(fld.dt, n.lower())
    # Seconds since the Unix epoch
    df[targ_pre + 'Elapsed'] = fld.astype(np.int64) // 10 ** 9
    if drop:
        df.drop(fldname, axis=1, inplace=True)
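To make the idea of date metadata concrete, here is a minimal, dependency-free sketch of the kind of features such a helper derives for month-end observations. It is not the fast.ai function: the name date_features and the choice of columns are my own assumptions for illustration, and it uses only the Python standard library.

```python
import calendar
from datetime import date

def date_features(d):
    """Return a dict of date-derived features for one observation date.

    Hypothetical helper for illustration; column names mirror the style
    of the attributes listed above but this is not fast.ai code.
    """
    last_day = calendar.monthrange(d.year, d.month)[1]
    is_month_end = d.day == last_day
    return {
        "Year": d.year,
        "Month": d.month,
        "Dayofyear": d.timetuple().tm_yday,
        "Is_month_end": is_month_end,
        "Is_quarter_end": is_month_end and d.month in (3, 6, 9, 12),
        "Is_year_end": d.month == 12 and d.day == 31,
        # Seconds since the Unix epoch at UTC midnight
        "Elapsed": calendar.timegm(d.timetuple()),
    }

# Three consecutive month-end observations
rows = [date_features(d) for d in (date(2019, 6, 30), date(2019, 7, 31), date(2019, 8, 31))]
for row in rows:
    print(row)
```

Each dictionary would become extra input columns alongside the economic series, which is how a fully connected network can pick up seasonality without a recurrent architecture.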
RAdam: A New Optimization Function
Previously From Vanilla SGD to Adam…
Stochastic Gradient Descent (SGD) optimizes a model’s parameters (or weights). Its goal is to minimize a loss, or cost, function, which measures the difference between actual and predicted values of the dependent variable. Initially, the model produces predictions from random parameters, which yields an initial loss. You then calculate the derivative (gradient) of the loss with respect to the parameters, multiply it by a user-defined learning rate, and subtract the result from the current parameters. This process repeats, each time on a small random batch of the data, until the loss stops decreasing.
The fast.ai library, which I use for all my models, employs Adam, a variation of basic SGD, as its default. This is because SGD can be slow to train. Adam is SGD with two additions: Momentum and RMSProp.
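The loop described above can be sketched in a few lines of plain Python. This is a toy illustration under my own assumptions (a one-feature linear model, mean squared error, a batch size of 2, and a fixed random seed), not the training code used for the models in this post.

```python
import random

def sgd_linear_fit(xs, ys, lr=0.02, epochs=2000, batch_size=2, seed=0):
    """Fit y = w*x + b with mini-batch SGD on mean squared error."""
    rng = random.Random(seed)
    # Start from random parameters, as described in the text
    w, b = rng.uniform(-1, 1), rng.uniform(-1, 1)
    indices = list(range(len(xs)))
    for _ in range(epochs):
        rng.shuffle(indices)  # the "stochastic" part: random mini-batches
        for start in range(0, len(indices), batch_size):
            batch = indices[start:start + batch_size]
            errors = [(w * xs[i] + b) - ys[i] for i in batch]
            # Derivatives of the batch's mean squared error w.r.t. w and b
            grad_w = sum(2 * e * xs[i] for e, i in zip(errors, batch)) / len(batch)
            grad_b = sum(2 * e for e in errors) / len(batch)
            # Step against the gradient, scaled by the learning rate
            w -= lr * grad_w
            b -= lr * grad_b
    return w, b

xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [2.0 * x + 1.0 for x in xs]  # true relationship: w = 2, b = 1
w, b = sgd_linear_fit(xs, ys)
print(w, b)
```

Because the toy data has no noise, the fitted parameters should land very close to the true slope and intercept.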
Momentum is a simple, yet powerful concept. Instead of updating parameters by the derivative, you factor in your most recent update as well, creating momentum in a certain direction. Specifically, you take your change in parameters at time t-1 and multiply by a user-defined momentum number (0.9 is common). And then you take (1-momentum) and multiply that by your current time t derivative, and finally add the two together. That is the amount by which you update your parameters at time t.
Why does this work? Imagine you are taking small steps with your learning rate in one direction toward loss function minimization. If you add in momentum, you amplify those steps toward your goal. Now if you start to go too far from the loss minimum, your gradient will point in a different direction than your recent momentum. You will take more, but increasingly smaller, steps in the wrong direction (i.e., increasingly smaller momentum in the wrong direction + increasingly bigger gradient in the right direction). And soon enough, you will move in the right direction. Or, imagine you are taking large steps with your learning rate, so that you are not converging to, but in fact overshooting, the loss minimum. Adding momentum produces an average of your repeated bi-directional moves, likely leading you to the loss minimum.
For financial markets technicians, this is actually a familiar concept. It is an exponentially weighted moving average of the derivative, because each t-1 input contains a portion of t-2, t-3, etc. So the direction of previous moves will be quite important in the adjustment of parameters in pursuit of the loss minimum.
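As a rough sketch of the momentum update described above, the snippet below minimizes a simple one-dimensional loss, (w - 3)^2. The function name, the toy loss, and the hyperparameter values are assumptions chosen for illustration.

```python
def momentum_descent(grad_fn, w0, lr=0.1, momentum=0.9, steps=500):
    """Gradient descent where each step is an exponentially weighted
    moving average of current and past gradients."""
    w, step = w0, 0.0
    for _ in range(steps):
        # momentum * previous step + (1 - momentum) * current derivative
        step = momentum * step + (1 - momentum) * grad_fn(w)
        w -= lr * step
    return w

# Toy loss (w - 3)^2 has derivative 2 * (w - 3) and its minimum at w = 3
w_final = momentum_descent(lambda w: 2 * (w - 3), w0=0.0)
print(w_final)
```

The variable step plays the role of the moving average: each new value carries 90% of the previous one, which in turn carries 90% of the one before it.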
Turing Award winner Geoffrey Hinton proposed RMSProp not in an academic paper, but in his Coursera MOOC, Neural Networks for Machine Learning.
RMSProp is an exponentially weighted moving average, not of the previous update, but of the squared gradient (derivative). The update function takes the learning rate times your gradient but divides the gradient by the square root of the EMA of the squared gradient.
Why does this make sense? If your gradient is consistently small, your EMA will be small. If the gradient is volatile or always big, the EMA will be a big number. 1 over the square root of a small number will be large, and vice versa. Thus, if you have a small, steady gradient, let’s take bigger steps because you are moving in the right direction. And of course, if the gradient is large or erratic, then take smaller steps, which 1/square root gives you.
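A minimal sketch of that scaling, again on the toy loss (w - 3)^2. The decay rate alpha, the small eps added for numerical safety, and the function name are assumptions for illustration, not values taken from the models in this post.

```python
import math

def rmsprop_descent(grad_fn, w0, lr=0.01, alpha=0.9, eps=1e-8, steps=2000):
    """Gradient descent where each gradient is divided by the square root
    of an exponentially weighted moving average of squared gradients."""
    w, sq_avg = w0, 0.0
    for _ in range(steps):
        g = grad_fn(w)
        # EMA of the squared gradient
        sq_avg = alpha * sq_avg + (1 - alpha) * g * g
        # Large recent gradients shrink the step; small ones enlarge it
        w -= lr * g / (math.sqrt(sq_avg) + eps)
    return w

w_rms = rmsprop_descent(lambda w: 2 * (w - 3), w0=0.0)
print(w_rms)
```

With a constant learning rate the step size stays roughly on the order of lr, so the result hovers near the minimum rather than settling exactly on it.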
In summary, Adam implements both Momentum and RMSProp with SGD as the optimization function. When released, it showed great results, such as below from the original paper. But over time, researchers have identified several weaknesses.
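For reference, the combination can be written out as follows. The notation is my own shorthand (g_t for the gradient, m_t for the momentum average, v_t for the RMSProp average, alpha for the learning rate), and the bias-correction step comes from the Adam paper rather than from the description above.

```latex
\begin{aligned}
m_t &= \beta_1 m_{t-1} + (1 - \beta_1)\, g_t
  && \text{(momentum: EMA of the gradient)} \\
v_t &= \beta_2 v_{t-1} + (1 - \beta_2)\, g_t^2
  && \text{(RMSProp: EMA of the squared gradient)} \\
\hat{m}_t &= \frac{m_t}{1 - \beta_1^t}, \qquad
\hat{v}_t = \frac{v_t}{1 - \beta_2^t}
  && \text{(bias correction)} \\
\theta_t &= \theta_{t-1} - \alpha\, \frac{\hat{m}_t}{\sqrt{\hat{v}_t} + \epsilon}
  && \text{(parameter update)}
\end{aligned}
```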
And Now From Adam to RAdam…
What is Warmup in Deep Learning?
The authors of the RAdam paper state that adaptive learning rate optimizers such as Adam risk converging to poor local optima if a warmup method is not implemented. What does that mean?
Deep learning models can quickly over-fit if data happens to be organized into related observations and features. In this case, the model will skew toward those features, even if they are not representative of the dataset. Warmup can reduce the effect of these early training examples.
FastAI’s Warmup Method
fit_one_cycle(learn:Learner, cyc_len:int, max_lr:Union[float, Collection[float], slice]=slice(None, 0.003, None), moms:Point=(0.95, 0.85), div_factor:float=25.0, pct_start:float=0.3, final_div:float=None, wd:float=None, callbacks:Optional[Collection[Callback]]=None, tot_epochs:int=None, start_epoch:int=None)
The basic arguments you would select regardless of a warmup policy are:
- cyc_len: how many epochs (full passes over your data) to train for
- max_lr: the maximum learning rate to use (0.003 is the default)
- wd: weight decay, a parameter that helps prevent overfitting
Before moving to warmup, let’s show how to select an optimal learning rate, using fast.ai’s learning rate finder (lr_find).
Per Sylvain Gugger,
“We have already seen how to implement the learning rate finder. Begin to train the model while increasing the learning rate from a very low to a very large one, stop when the loss starts to really get out of control. Plot the losses against the learning rates and pick a value a bit before the minimum, where the loss still improves.” Source
Once you select a learning rate, you implement the warmup policy. There are two other notable arguments with defaults in fit_one_cycle:
- moms: maximum and minimum of momentum to use (0.95 and 0.85 are defaults)
- div_factor: the number by which max_lr is divided to get the starting learning rate (default 25.0)
fit_one_cycle does three things:
- We progressively increase our learning rate from lr_max/div_factor to lr_max and at the same time we progressively decrease our momentum from mom_max to mom_min.
- We do the exact opposite: we progressively decrease our learning rate from lr_max to lr_max/div_factor and at the same time we progressively increase our momentum from mom_min to mom_max.
- We further decrease our learning rate from lr_max/div_factor to lr_max/(div_factor x 100) and we keep momentum steady at mom_max.
Source
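The three phases above can be sketched as a schedule of per-step learning rates and momentum values. This is a simplified illustration, not fast.ai’s implementation: the function name, the linear interpolation, and the split of steps between phases (up_frac and down_frac) are all assumptions of mine.

```python
def one_cycle_schedule(n_steps, lr_max=0.003, div_factor=25.0,
                       moms=(0.95, 0.85), up_frac=0.45, down_frac=0.45):
    """Return (learning_rates, momentums), one value per training step."""
    mom_max, mom_min = moms
    lr_low = lr_max / div_factor
    lr_final = lr_max / (div_factor * 100)
    n_up = int(n_steps * up_frac)
    n_down = int(n_steps * down_frac)
    n_final = n_steps - n_up - n_down

    def lerp(start, end, n):
        # n evenly spaced values from start to end, inclusive
        if n <= 0:
            return []
        if n == 1:
            return [end]
        return [start + (end - start) * i / (n - 1) for i in range(n)]

    # Phase 1: learning rate up, momentum down
    lrs = lerp(lr_low, lr_max, n_up)
    mom_values = lerp(mom_max, mom_min, n_up)
    # Phase 2: the exact opposite
    lrs += lerp(lr_max, lr_low, n_down)
    mom_values += lerp(mom_min, mom_max, n_down)
    # Phase 3: learning rate decays further, momentum held at its maximum
    lrs += lerp(lr_low, lr_final, n_final)
    mom_values += [mom_max] * n_final
    return lrs, mom_values

lrs, mom_values = one_cycle_schedule(100)
print(lrs[0], max(lrs), lrs[-1])
print(mom_values[0], min(mom_values), mom_values[-1])
```

Plotting lrs and mom_values against the step number gives the familiar picture of the policy: a learning rate that rises, falls, then tails off, with momentum moving in the opposite direction.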
Why does this make sense? The first part ensures that you start training with a smaller learning rate, so you don’t quickly overfit. And per Leslie Smith’s paper, the learning rate is progressively adjusted. As it goes upward, it prevents the optimizer from landing in steep local minima. Instead, the optimizer finds a flatter area of the loss surface that is more likely to lead to the ultimate minimum. In the last part of the training, lower learning rates ensure that you can descend into the final steep local minimum within that flat area.
Additionally, Smith found that decreasing momentum while the learning rate increases improved results. Early on, you want the optimization function to quickly find new directions to the flatter area. Once it is in the flatter area, you want to apply momentum to move it more quickly to the ultimate loss minimum.
We talked above about how warmup helps avoid Adam jumping into an incorrect local optimum in the early stages of training. The research team behind RAdam was motivated to investigate WHY Adam-type optimizers require a warmup. What they found was excessive variance at the start of training. Warmup can reduce this variance, but the application of it requires deciding how much warmup to use (recall fast.ai has some adjustable defaults). Further, these decisions vary from one dataset to another.
In response, the researchers proposed a rectification term that adjusts the adaptive learning rate as a function of the underlying variance. The adaptive learning rate does not go to so-called full speed until the variance settles down. The full paper goes into extensive detail on the mathematical derivation of the term.
In the two weeks since beginning this post, the deep learning community has brought other research to my attention. A new paper co-authored by Geoffrey Hinton proposes the LookAhead optimizer. Less Wright has merged RAdam and LookAhead into a single codebase that is deployable on fast.ai. And Jeremy Howard has suggested that researchers take a close look at Novograd. More to come on these optimizers in future posts!
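To give a feel for how this behaves, the sketch below computes the rectification term from the RAdam paper at different training steps. The function name is my own, and beta2 = 0.999 is assumed as the usual Adam default. It only shows the multiplier, not a full optimizer.

```python
import math

def radam_rectification(t, beta2=0.999):
    """Rectification multiplier at step t, or None when the variance of
    the adaptive learning rate is not yet tractable (rho_t <= 4)."""
    rho_inf = 2.0 / (1.0 - beta2) - 1.0
    beta2_t = beta2 ** t
    rho_t = rho_inf - 2.0 * t * beta2_t / (1.0 - beta2_t)
    if rho_t <= 4.0:
        # Too early: the paper falls back to a momentum-only update here
        return None
    return math.sqrt(((rho_t - 4.0) * (rho_t - 2.0) * rho_inf)
                     / ((rho_inf - 4.0) * (rho_inf - 2.0) * rho_t))

for t in (1, 5, 10, 100, 1000, 100000):
    print(t, radam_rectification(t))
```

The multiplier is undefined for the first few steps, starts out small, and climbs toward 1 as training proceeds. In effect it is a warmup schedule derived from the optimizer’s own statistics rather than chosen by hand.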
Any opinions or forecasts contained herein reflect the personal and subjective judgments and assumptions of the author only. There can be no assurance that developments will transpire as forecasted and actual results will be different. The accuracy of data is not guaranteed but represents the author’s best judgment and can be derived from a variety of sources. The information is subject to change at any time without notice.