Hands-On Machine Learning for Algorithmic Trading
Table of Contents
Preface  1
Chapter 1: Machine Learning for Trading  8
How to read this book  9
What to expect  10
Who should read this book  10
How the book is organized  11
Part 1 – the framework – from data to strategy design  11
Part 2 – ML fundamentals  12
Part 3 – natural language processing  13
Part 4 – deep and reinforcement learning  13
What you need to succeed  14
Data sources  14
GitHub repository  15
Python libraries  15
The rise of ML in the investment industry  15
From electronic to high-frequency trading  16
Factor investing and smart beta funds  18
Algorithmic pioneers outperform humans at scale  20
ML-driven funds attract $1 trillion AUM  21
The emergence of quantamental funds  22
Investments in strategic capabilities  23
ML and alternative data  23
Crowdsourcing of trading algorithms  25
Design and execution of a trading strategy  25
Sourcing and managing data  26
Alpha factor research and evaluation  27
Portfolio optimization and risk management  28
Strategy backtesting  28
ML and algorithmic trading strategies  29
Use cases of ML for trading  30
Data mining for feature extraction  30
Supervised learning for alpha factor creation and aggregation  31
Asset allocation  31
Testing trade ideas  32
Reinforcement learning  32
Summary  32
Chapter 2: Market and Fundamental Data  33
How to work with market data  34
Market microstructure  34
Marketplaces  34
Types of orders  36
Working with order book data  36
The FIX protocol  37
Nasdaq TotalView-ITCH Order Book data  38
Parsing binary ITCH messages  38
Reconstructing trades and the order book  42
Regularizing tick data  45
Tick bars  46
Time bars  47
Volume bars  49
Dollar bars  50
API access to market data  50
Remote data access using pandas  51
Reading HTML tables  51
pandas-datareader for market data  51
The Investor Exchange  52
Quantopian  53
Zipline  54
Quandl  55
Other market data providers  56
How to work with fundamental data  57
Financial statement data  57
Automated processing – XBRL  57
Building a fundamental data time series  58
Extracting the financial statements and notes dataset  58
Retrieving all quarterly Apple filings  60
Building a price/earnings time series  61
Other fundamental data sources  62
pandas_datareader – macro and industry data  63
Efficient data storage with pandas  63
Summary  64
Chapter 3: Alternative Data for Finance  65
The alternative data revolution  66
Sources of alternative data  67
Individuals  68
Business processes  68
Sensors  69
Satellites  70
Geolocation data  70
Evaluating alternative datasets  71
Evaluation criteria  72
Quality of the signal content  72
Asset classes  72
Investment style  72
Risk premiums  72
Alpha content and quality  73
Quality of the data  73
Legal and reputational risks  73
Exclusivity  74
Time horizon  74
Frequency  74
Reliability  75
Technical aspects  75
Latency  75
Format  75
The market for alternative data  75
Data providers and use cases  77
Social sentiment data  77
Dataminr  78
StockTwits  78
RavenPack  78
Satellite data  78
Geolocation data  79
Email receipt data  79
Working with alternative data  79
Scraping OpenTable data  79
Extracting data from HTML using requests and BeautifulSoup  80
Introducing Selenium – using browser automation  81
Building a dataset of restaurant bookings  82
One step further – Scrapy and Splash  83
Earnings call transcripts  84
Parsing HTML using regular expressions  85
Summary  87
Chapter 4: Alpha Factor Research  88
Engineering alpha factors  89
Important factor categories  90
Momentum and sentiment factors  90
Rationale  91
Key metrics  92
Value factors  93
Rationale  94
Key metrics  95
Volatility and size factors  96
Rationale  96
Key metrics  97
Quality factors  97
Rationale  98
Key metrics  98
How to transform data into factors  99
Useful pandas and NumPy methods  100
Loading the data  100
Resampling from daily to monthly frequency  100
Computing momentum factors  101
Using lagged returns and different holding periods  102
Compute factor betas  102
Built-in Quantopian factors  103
TA-Lib  103
Seeking signals – how to use zipline  104
The architecture – event-driven trading simulation  105
A single alpha factor from market data  106
Combining factors from diverse data sources  108
Separating signal and noise – how to use alphalens  110
Creating forward returns and factor quantiles  110
Predictive performance by factor quantiles  112
The information coefficient  114
Factor turnover  117
Alpha factor resources  117
Alternative algorithmic trading libraries  117
Summary  118
Chapter 5: Strategy Evaluation  119
How to build and test a portfolio with zipline  120
Scheduled trading and portfolio rebalancing  120
How to measure performance with pyfolio  122
The Sharpe ratio  122
The fundamental law of active management  123
In- and out-of-sample performance with pyfolio  124
Getting pyfolio input from alphalens  125
Getting pyfolio input from a zipline backtest  125
Walk-forward testing – out-of-sample returns  126
Summary performance statistics  127
Drawdown periods and factor exposure  128
Modeling event risk  129
How to avoid the pitfalls of backtesting  129
Data challenges  130
Look-ahead bias  130
Survivorship bias  130
Outlier control  131
Unrepresentative period  131
Implementation issues  131
Mark-to-market performance  131
Trading costs  132
Timing of trades  132
Data-snooping and backtest overfitting  132
The minimum backtest length and the deflated SR  133
Optimal stopping for backtests  133
How to manage portfolio risk and return  134
Mean-variance optimization  135
How it works  136
The efficient frontier in Python  136
Challenges and shortcomings  139
Alternatives to mean-variance optimization  140
The 1/n portfolio  140
The minimum-variance portfolio  141
Global Portfolio Optimization – the Black-Litterman approach  141
How to size your bets – the Kelly rule  142
The optimal size of a bet  142
Optimal investment – single asset  143
Optimal investment – multiple assets  144
Risk parity  144
Risk factor investment  145
Hierarchical risk parity  145
Summary  146
Chapter 6: The Machine Learning Process  147
Learning from data  148
Supervised learning  150
Unsupervised learning  150
Applications  151
Cluster algorithms  151
Dimensionality reduction  152
Reinforcement learning  152
The machine learning workflow  153
Basic walkthrough – k-nearest neighbors  154
Frame the problem – goals and metrics  154
Prediction versus inference  155
Causal inference  155
Regression problems  156
Classification problems  158
Receiver operating characteristics and the area under the curve  159
Precision-recall curves  159
Collecting and preparing the data  160
Explore, extract, and engineer features  161
Using information theory to evaluate features  161
Selecting an ML algorithm  162
Design and tune the model  162
The bias-variance trade-off  163
Underfitting versus overfitting  163
Managing the trade-off  164
Learning curves  165
How to use cross-validation for model selection  166
How to implement cross-validation in Python  167
Basic train-test split  167
Cross-validation  168
Using a holdout test set  168
KFold iterator  169
Leave-one-out CV  169
Leave-P-Out CV  170
ShuffleSplit  170
Parameter tuning with scikit-learn  170
Validation curves with yellowbrick  171
Learning curves  171
Parameter tuning using GridSearchCV and pipeline  172
Challenges with cross-validation in finance  172
Time series cross-validation with sklearn  173
Purging, embargoing, and combinatorial CV  173
Summary  174
Chapter 7: Linear Models  175
Linear regression for inference and prediction  176
The multiple linear regression model  177
How to formulate the model  177
How to train the model  178
Least squares  178
Maximum likelihood estimation  179
Gradient descent  180
The Gauss-Markov theorem  181
How to conduct statistical inference  182
How to diagnose and remedy problems  184
Goodness of fit  184
Heteroskedasticity  185
Serial correlation  186
Multicollinearity  187
How to run linear regression in practice  187
OLS with statsmodels  187
Stochastic gradient descent with sklearn  190
How to build a linear factor model  190
From the CAPM to the Fama-French five-factor model  191
Obtaining the risk factors  193
Fama-MacBeth regression  194
Shrinkage methods: regularization for linear regression  198
How to hedge against overfitting  198
How ridge regression works  199
How lasso regression works  201
How to use linear regression to predict returns  201
Prepare the data  201
Universe creation and time horizon  202
Target return computation  202
Alpha factor selection and transformation  203
Data cleaning – missing data  203
Data exploration  204
Dummy encoding of categorical variables  204
Creating forward returns  205
Linear OLS regression using statsmodels  206
Diagnostic statistics  206
Linear OLS regression using sklearn  207
Custom time series cross-validation  207
Select features and target  207
Cross-validating the model  208
Test results – information coefficient and RMSE  209
Ridge regression using sklearn  210
Tuning the regularization parameters using cross-validation  211
Cross-validation results and ridge coefficient paths  212
Top 10 coefficients  212
Lasso regression using sklearn  213
Cross-validated information coefficient and Lasso path  214
Linear classification  215
The logistic regression model  215
Objective function  216
The logistic function  216
Maximum likelihood estimation  217
How to conduct inference with statsmodels  218
How to use logistic regression for prediction  220
How to predict price movements using sklearn  220
Summary  222
Chapter 8: Time Series Models  224
Analytical tools for diagnostics and feature extraction  225
How to decompose time series patterns  226
How to compute rolling window statistics  227
Moving averages and exponential smoothing  228
How to measure autocorrelation  229
How to diagnose and achieve stationarity  229
Time series transformations  230
How to diagnose and address unit roots  231
Unit root tests  233
How to apply time series transformations  234
Univariate time series models  236
How to build autoregressive models  237
How to identify the number of lags  237
How to diagnose model fit  238
How to build moving average models  238
How to identify the number of lags  239
The relationship between AR and MA models  239
How to build ARIMA models and extensions  239
How to identify the number of AR and MA terms  240
Adding features – ARMAX  240
Adding seasonal differencing – SARIMAX  241
How to forecast macro fundamentals  241
How to use time series models to forecast volatility  243
The autoregressive conditional heteroskedasticity (ARCH) model  244
Generalizing ARCH – the GARCH model  245
Selecting the lag order  245
How to build a volatility-forecasting model  246
Multivariate time series models  250
Systems of equations  250
The vector autoregressive (VAR) model  251
How to use the VAR model for macro fundamentals forecasts  252
Cointegration – time series with a common trend  256
Testing for cointegration  257
How to use cointegration for a pairs-trading strategy  258
Summary  259
Chapter 9: Bayesian Machine Learning  260
How Bayesian machine learning works  261
How to update assumptions from empirical evidence  262
Exact inference: Maximum a Posteriori estimation  263
How to select priors  264
How to keep inference simple – conjugate priors  265
How to dynamically estimate the probabilities of asset price moves  265
Approximate inference: stochastic versus deterministic approaches  267
Sampling-based stochastic inference  268
Markov chain Monte Carlo sampling  268
Gibbs sampling  269
Metropolis-Hastings sampling  270
Hamiltonian Monte Carlo – going NUTS  270
Variational inference  270
Automatic Differentiation Variational Inference (ADVI)  271
Probabilistic programming with PyMC3  271
Bayesian machine learning with Theano  272
The PyMC3 workflow  272
Model definition – Bayesian logistic regression  273
Visualization and plate notation  274
The Generalized Linear Models module  275
MAP inference  275
Approximate inference – MCMC  275
Credible intervals  276
Approximate inference – variational Bayes  276
Model diagnostics  277
Convergence  277
Posterior predictive checks  279
Prediction  279
Practical applications  280
Bayesian Sharpe ratio and performance comparison  280
Model definition  281
Performance comparison  281
Bayesian time series models  282
Stochastic volatility models  283
Summary  283
Chapter 10: Decision Trees and Random Forests  284
Decision trees  285
How trees learn and apply decision rules  285
How to use decision trees in practice  287
How to prepare the data  287
How to code a custom cross-validation class  288
How to build a regression tree  288
How to build a classification tree  291
How to optimize for node purity  291
How to train a classification tree  292
How to visualize a decision tree  292
How to evaluate decision tree predictions  293
Feature importance  294
Overfitting and regularization  294
How to regularize a decision tree  295
Decision tree pruning  296
How to tune the hyperparameters  297
GridSearchCV for decision trees  297
How to inspect the tree structure  298
Learning curves  299
Strengths and weaknesses of decision trees  300
Random forests  301
Ensemble models  302
How bagging lowers model variance  303
Bagged decision trees  304
How to build a random forest  306
How to train and tune a random forest  307
Feature importance for random forests  310
Out-of-bag testing  311
Pros and cons of random forests  311
Summary  312
Chapter 11: Gradient Boosting Machines  313
Adaptive boosting  314
The AdaBoost algorithm  315
AdaBoost with sklearn  317
Gradient boosting machines  319
How to train and tune GBM models  321
Ensemble size and early stopping  321
Shrinkage and learning rate  322
Subsampling and stochastic gradient boosting  322
How to use gradient boosting with sklearn  323
How to tune parameters with GridSearchCV  324
Parameter impact on test scores  325
How to test on the holdout set  327
Fast scalable GBM implementations  327
How algorithmic innovations drive performance  328
Second-order loss function approximation  328
Simplified split-finding algorithms  330
Depth-wise versus leaf-wise growth  330
GPU-based training  331
DART – dropout for trees  331
Treatment of categorical features  332
Additional features and optimizations  333
How to use XGBoost, LightGBM, and CatBoost  333
How to create binary data formats  333
How to tune hyperparameters  335
Objectives and loss functions  335
Learning parameters  335
Regularization  336
Randomized grid search  336
How to evaluate the results  338
Cross-validation results across models  338
How to interpret GBM results  342
Feature importance  342
Partial dependence plots  343
SHapley Additive exPlanations  345
How to summarize SHAP values by feature  346
How to use force plots to explain a prediction  347
How to analyze feature interaction  349
Summary  350
Chapter 12: Unsupervised Learning  351
Dimensionality reduction  352
Linear and nonlinear algorithms  354
The curse of dimensionality  355
Linear dimensionality reduction  357
Principal Component Analysis  358
Visualizing PCA in 2D  358
The assumptions made by PCA  359
How the PCA algorithm works  360
PCA based on the covariance matrix  360
PCA using Singular Value Decomposition  362
PCA with sklearn  363
Independent Component Analysis  365
ICA assumptions  365
The ICA algorithm  366
ICA with sklearn  366
PCA for algorithmic trading  366
Data-driven risk factors  366
Eigen portfolios  369
Manifold learning  372
t-SNE  374
UMAP  375
Clustering  376
k-Means clustering  377
Evaluating cluster quality  379
Hierarchical clustering  381
Visualization – dendrograms  382
Density-based clustering  383
DBSCAN  383
Hierarchical DBSCAN  384
Gaussian mixture models  384
The expectation-maximization algorithm  385
Hierarchical risk parity  386
Summary  388
Chapter 13: Working with Text Data  389
How to extract features from text data  390
Challenges of NLP  390
The NLP workflow  391
Parsing and tokenizing text data  392
Linguistic annotation  392
Semantic annotation  393
Labeling  393
Use cases  393
From text to tokens – the NLP pipeline  394
NLP pipeline with spaCy and textacy  394
Parsing, tokenizing, and annotating a sentence  395
Batch-processing documents  396
Sentence boundary detection  397
Named entity recognition  397
N-grams  398
spaCy's streaming API  398
Multi-language NLP  398
NLP with TextBlob  400
Stemming  400
Sentiment polarity and subjectivity  401
From tokens to numbers – the document-term matrix  401
The BoW model  401
Measuring the similarity of documents  402
Document-term matrix with sklearn  403
Using CountVectorizer  404
Visualizing vocabulary distribution  404
Finding the most similar documents  405
TfidfTransformer and TfidfVectorizer  406
The effect of smoothing  407
How to summarize news articles using TfidfVectorizer  408
Text preprocessing – review  408
Text classification and sentiment analysis  408
The Naive Bayes classifier  409
Bayes' theorem refresher  409
The conditional independence assumption  410
News article classification  411
Training and evaluating a multinomial Naive Bayes classifier  411
Sentiment analysis  412
Twitter data  412
Multinomial Naive Bayes  412
Comparison with TextBlob sentiment scores  413
Business reviews – the Yelp dataset challenge  413
Benchmark accuracy  414
Multinomial Naive Bayes model  414
One-versus-all logistic regression  415
Combining text and numerical features  415
Multinomial logistic regression  416
Gradient-boosting machine  416
Summary  417
Chapter 14: Topic Modeling  418
Learning latent topics: goals and approaches  419
From linear algebra to hierarchical probabilistic models  420
Latent semantic indexing  420
How to implement LSI using sklearn  422
Pros and cons  424
Probabilistic latent semantic analysis  424
How to implement pLSA using sklearn  425
Latent Dirichlet allocation  427
How LDA works  427
The Dirichlet distribution  428
The generative model  428
Reverse-engineering the process  429
How to evaluate LDA topics  430
Perplexity  430
Topic coherence  430
How to implement LDA using sklearn  431
How to visualize LDA results using pyLDAvis  432
How to implement LDA using gensim  433
Topic modeling for earnings calls  436
Data preprocessing  437
Model training and evaluation  437
Running experiments  438
Topic modeling for Yelp business reviews  439
Summary  440
Chapter 15: Word Embeddings  441
How word embeddings encode semantics  442
How neural language models learn usage in context  442
The Word2vec model – learn embeddings at scale  443
Model objective – simplifying the softmax  444
Automatic phrase detection  445
How to evaluate embeddings – vector arithmetic and analogies  445
How to use pre-trained word vectors  447
GloVe – global vectors for word representation  448
How to train your own word vector embeddings  449
The Skip-Gram architecture in Keras  449
Noise-contrastive estimation  449
The model components  449
Visualizing embeddings using TensorBoard  450
Word vectors from SEC filings using gensim  450
Preprocessing  450
Automatic phrase detection  451
Model training  451
Model evaluation  452
Performance impact of parameter settings  452
Sentiment analysis with Doc2vec  453
Training Doc2vec on Yelp sentiment data  454
Create input data  454
Bonus – Word2vec for translation  457
Summary  457
Chapter 16: Next Steps  458
Key takeaways and lessons learned  459
Data is the single most important ingredient  459
Quality control  459
Data integration  460
Domain expertise helps unlock value in data  460
Feature engineering and alpha factor research  461
ML is a toolkit for solving problems with data  461
Model diagnostics help speed up optimization  462
Making do without a free lunch  462
Managing the bias-variance trade-off  463
Define targeted model objectives  463
The optimization verification test  464
Beware of backtest overfitting  464
How to gain insights from black-box models  464
ML for trading in practice  465
Data management technologies  465
Database systems  466
Big Data technologies – Hadoop and Spark  466
ML tools  467
Online trading platforms  468
Quantopian  468
QuantConnect  469
QuantRocket  469
Conclusion  469
Other Books You May Enjoy  470
Index  473
Preface
The availability of diverse data has increased the demand for expertise in algorithmic
trading strategies. With this book, you will select and apply machine learning (ML) to a
broad range of data sources and create powerful algorithmic strategies.
This book starts by introducing you to essential elements such as evaluating datasets, accessing data APIs using Python, using Quandl to access financial data, and managing prediction errors. We then cover a range of machine learning techniques and algorithms that can be used to build and train algorithmic models using pandas, Seaborn, StatsModels, and sklearn. We will build, estimate, and interpret AR(p), MA(q), and ARIMA(p, d, q) models using StatsModels. You will apply the Bayesian concepts of prior, evidence, and posterior to quantify uncertainty using PyMC3. We will then use NLTK, sklearn, and spaCy to assign sentiment scores to financial news and classify documents to extract trading signals. You will learn to design, build, tune, and evaluate feedforward neural networks, recurrent neural networks (RNNs), and convolutional neural networks (CNNs) using Keras, and you will apply transfer learning to satellite image data to predict economic activity. Finally, we will apply reinforcement learning for optimal trading results.
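As a small taste of the time series material, here is a minimal sketch of estimating an AR(1) model, y_t = c + phi * y_{t-1} + noise, by ordinary least squares. It uses only the standard library and simulated data with known parameters (c = 0.5, phi = 0.6), so the numbers are purely illustrative; the book itself works with StatsModels' far richer ARIMA machinery.

```python
import random

random.seed(42)

# Simulate an AR(1) process with known parameters (illustrative data only)
c, phi, n = 0.5, 0.6, 5000
y = [0.0]
for _ in range(n):
    y.append(c + phi * y[-1] + random.gauss(0, 1))

# Estimate phi and c by OLS on (y_{t-1}, y_t) pairs:
# phi_hat = cov(y_{t-1}, y_t) / var(y_{t-1}), c_hat from the means
x, t = y[:-1], y[1:]
mx, mt = sum(x) / len(x), sum(t) / len(t)
cov = sum((a - mx) * (b - mt) for a, b in zip(x, t))
var = sum((a - mx) ** 2 for a in x)
phi_hat = cov / var
c_hat = mt - phi_hat * mx

print(f"estimated phi: {phi_hat:.3f}, estimated c: {c_hat:.3f}")
```

With 5,000 observations, the estimates land close to the true values of 0.6 and 0.5, which is the kind of sanity check the book performs with proper diagnostics.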
By the end of the book, you will be able to adopt algorithmic trading to implement smart
investing strategies.
Who this book is for
The book is for data analysts, data scientists, and Python developers, as well as investment analysts and portfolio managers working in the finance and investment industry. If you want to perform efficient algorithmic trading by developing smart investing strategies using ML algorithms, this is the book you need. Some understanding of Python and ML techniques is required.
What this book covers
Chapter 1, Machine Learning for Trading, identifies the focus of the book by outlining how
ML matters in generating and evaluating signals for the design and execution of a trading
strategy. It outlines the strategy process from hypothesis generation and modeling, data
selection, and backtesting to evaluation and execution in a portfolio context, including risk
management.
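To make the backtesting step of that process concrete, the following toy sketch evaluates a hypothetical moving-average crossover rule on simulated prices. Everything here, the random-walk prices, the 10/50-day windows, and the rule itself, is an illustrative assumption, not a strategy from the book.

```python
import random

random.seed(1)

# Simulate a daily price series as a random walk with drift (toy data)
prices = [100.0]
for _ in range(500):
    prices.append(prices[-1] * (1 + random.gauss(0.0003, 0.01)))

def sma(series, window, t):
    """Simple moving average of `series` over `window` days ending at day t."""
    return sum(series[t - window + 1 : t + 1]) / window

# Backtest a crossover rule: hold one share while the 10-day SMA
# is above the 50-day SMA, otherwise stay flat
pnl = 0.0
for t in range(50, len(prices) - 1):
    position = 1 if sma(prices, 10, t) > sma(prices, 50, t) else 0
    pnl += position * (prices[t + 1] - prices[t])

print(f"strategy P&L over the simulated period: {pnl:.2f}")
```

A real backtest would also account for trading costs, timing of fills, and the biases the book discusses, such as look-ahead and survivorship bias.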
Chapter 2, Market and Fundamental Data, covers how to source and work with original exchange-provided tick data and financial reporting data, as well as how to access the numerous open-source data providers that we rely on throughout this book.
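To illustrate the kind of tick-data work that chapter involves, here is a toy sketch of maintaining a limit order book from a stream of simplified add/cancel messages. The (side, action, price, shares) message format is invented for illustration; the chapter parses real, binary Nasdaq TotalView-ITCH messages.

```python
from collections import defaultdict

# Toy limit order book: price level -> total resting shares, per side
book = {"bid": defaultdict(int), "ask": defaultdict(int)}

# Hypothetical message stream (real ITCH messages are binary and richer)
messages = [
    ("bid", "add", 100.1, 300),
    ("bid", "add", 100.0, 500),
    ("ask", "add", 100.3, 200),
    ("ask", "add", 100.2, 400),
    ("bid", "cancel", 100.1, 100),
]

# Replay the stream: adds increase a level's depth, cancels reduce it
for side, action, price, shares in messages:
    book[side][price] += shares if action == "add" else -shares

# The inside market: highest bid and lowest ask with remaining depth
best_bid = max(p for p, s in book["bid"].items() if s > 0)
best_ask = min(p for p, s in book["ask"].items() if s > 0)
print(f"best bid {best_bid} x {book['bid'][best_bid]}, "
      f"best ask {best_ask} x {book['ask'][best_ask]}")
```

After the cancel, 200 shares remain at the 100.1 bid, so the inside market is 100.1 / 100.2, which is exactly the kind of state the chapter reconstructs at scale from ITCH data.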
Chapter 3, Alternative Data for Finance, provides categories and criteria to assess the exploding number of sources and providers. It also demonstrates how to create alternative datasets by scraping websites, for example to collect restaurant booking data from OpenTable.