This is an unofficial PyTorch implementation by Ignacio Oguiza (oguiza@gmail.com) based on:
- Zerveas, G., Jayaraman, S., Patel, D., Bhamidipaty, A., & Eickhoff, C. (2020). A Transformer-based Framework for Multivariate Time Series Representation Learning. arXiv preprint arXiv:2010.02803v2.
- No official implementation available as far as I know (Oct 10th, 2020)
This paper uses 'Attention is all you need' as a major reference:
- Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A. N., ... & Polosukhin, I. (2017). Attention is all you need. In Advances in neural information processing systems (pp. 5998-6008).
This implementation is adapted to work with the rest of the tsai library, and it contains some hyperparameters that are not available in the original implementation. I included them so you can experiment with them.
Usual values are the ones that appear in the "Attention is all you need" and "A Transformer-based Framework for Multivariate Time Series Representation Learning" papers.
The default values are the ones selected as a default configuration in the latter.
- c_in: the number of features (aka variables, dimensions, channels) in the time series dataset. dls.var
- c_out: the number of target classes. dls.c
- seq_len: number of time steps in the time series. dls.len
- max_seq_len: useful to control the temporal resolution in long time series to avoid memory issues. Default: None.
- d_model: total dimension of the model (number of features created by the model). Usual values: 128-1024. Default: 128.
- n_heads: parallel attention heads. Usual values: 8-16. Default: 16.
- d_k: size of the learned linear projection of queries and keys in the MHA. Usual values: 16-512. Default: None -> (d_model/n_heads) = 32.
- d_v: size of the learned linear projection of values in the MHA. Usual values: 16-512. Default: None -> (d_model/n_heads) = 32.
- d_ff: the dimension of the feedforward network model. Usual values: 256-4096. Default: 256.
- res_dropout: amount of residual dropout applied in the encoder. Usual values: 0.-0.3. Default: 0.1.
- activation: the activation function of intermediate layer, relu or gelu. Default: 'gelu'.
- n_layers: the number of sub-encoder-layers in the encoder. Usual values: 2-8. Default: 3.
- fc_dropout: dropout applied to the final fully connected layer. Usual values: 0.-0.8. Default: 0.
- y_range: range of possible y values (used in regression tasks). Default: None
- kwargs: nn.Conv1d kwargs. If not {}, a nn.Conv1d with those kwargs will be applied to the original time series (see the sketch below).
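As a quick sketch of the interface described above (the dataset dimensions below are made up for illustration), a model with the default configuration only needs the first three arguments; the commented-out line shows how the extra `kwargs` would be used to apply a `nn.Conv1d` to the raw series:

```python
import torch
from tsai.models.TST import TST   # in this notebook, TST is already in scope

c_in, c_out, seq_len = 9, 2, 60             # made-up dataset dimensions
xb = torch.randn(32, c_in, seq_len)         # a random batch: (batch size, variables, time steps)

model = TST(c_in, c_out, seq_len)           # default configuration from the paper
# model = TST(c_in, c_out, seq_len, kernel_size=5, padding=2)  # kwargs -> nn.Conv1d applied to the raw series
print(model(xb).shape)                      # torch.Size([32, 2])
```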
In general, transformers require a lower lr than other time series models when trained on the same datasets. It's important to use `learn.lr_find()` to find out what a good lr may be.

The paper authors recommend standardizing the data by feature. This can be done by adding `TSStandardize(by_var=True)` as a `batch_tfms` when creating the `TSDataLoaders`.

The authors used `LabelSmoothingCrossEntropyFlat()` as the loss function.

When using TST with long time series, you can use `max_seq_len` to reduce the memory footprint and thus avoid GPU issues. In some of the cases where I've used it, you may need to increase `res_dropout` above 0.1 and/or `fc_dropout` above 0 in order to achieve good performance.
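Putting these recommendations together, here is a minimal end-to-end training sketch. It assumes `X` (samples x variables x steps), `y` and `splits` come from your own dataset (e.g. via `get_UCR_data`), and attribute names such as `dls.vars` may vary slightly between tsai versions:

```python
from tsai.all import *

# Assumed: X, y and splits come from your own data,
# e.g. X, y, splits = get_UCR_data('LSST', split_data=False)
dsets = TSDatasets(X, y, tfms=[None, Categorize()], splits=splits)
dls = TSDataLoaders.from_dsets(dsets.train, dsets.valid, bs=64,
                               batch_tfms=[TSStandardize(by_var=True)])  # standardize by feature

model = TST(dls.vars, dls.c, dls.len)                                    # default configuration
learn = Learner(dls, model, loss_func=LabelSmoothingCrossEntropyFlat(), metrics=accuracy)
learn.lr_find()                        # transformers usually need a lower lr than other TS models
# learn.fit_one_cycle(100, lr_max=1e-4)
```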
t = torch.rand(16, 50, 128)  # (bs, q_len, d_model)
output, attn = _MultiHeadAttention(d_model=128, n_heads=3, d_k=8, d_v=6)(t, t, t)
output.shape, attn.shape  # (torch.Size([16, 50, 128]), torch.Size([16, 3, 50, 50]))
t = torch.rand(16, 50, 128)  # (bs, q_len, d_model)
output = _TSTEncoderLayer(q_len=50, d_model=128, n_heads=3, d_k=None, d_v=None, d_ff=512, res_dropout=0.1, activation='gelu')(t)
output.shape  # torch.Size([16, 50, 128]) - the encoder layer preserves the input shape
bs = 32
c_in = 9 # aka channels, features, variables, dimensions
c_out = 2
seq_len = 5000
xb = torch.randn(bs, c_in, seq_len)
# standardize per channel (by_var), as TSStandardize(by_var=True) would do with training set statistics
xb = (xb - xb.mean((0, 2), keepdim=True)) / xb.std((0, 2), keepdim=True)
# Settings
max_seq_len = 256  # reduce the temporal resolution of the 5000-step series to avoid memory issues
d_model = 128
n_heads = 16
d_k = d_v = None # if None --> d_model // n_heads
d_ff = 256
res_dropout = 0.1
activation = "gelu"
n_layers = 3
fc_dropout = 0.1
kwargs = {}
model = TST(c_in, c_out, seq_len, max_seq_len=max_seq_len, d_model=d_model, n_heads=n_heads,
d_k=d_k, d_v=d_v, d_ff=d_ff, res_dropout=res_dropout, activation=activation, n_layers=n_layers,
fc_dropout=fc_dropout, **kwargs)
test_eq(model(xb).shape, [bs, c_out])
print(f'model parameters: {count_parameters(model)}')
bs = 32
c_in = 9 # aka channels, features, variables, dimensions
c_out = 2
seq_len = 60
xb = torch.randn(bs, c_in, seq_len)
# standardize per channel (by_var), as TSStandardize(by_var=True) would do with training set statistics
xb = (xb - xb.mean((0, 2), keepdim=True)) / xb.std((0, 2), keepdim=True)
# Settings
max_seq_len = 120
d_model = 128
n_heads = 16
d_k = d_v = None # if None --> d_model // n_heads
d_ff = 256
res_dropout = 0.1
activation = "gelu"
n_layers = 3
fc_dropout = 0.1
kwargs = {}
# kwargs = dict(kernel_size=5, padding=2)  # uncomment to apply a nn.Conv1d to the original time series
model = TST(c_in, c_out, seq_len, max_seq_len=max_seq_len, d_model=d_model, n_heads=n_heads,
d_k=d_k, d_v=d_v, d_ff=d_ff, res_dropout=res_dropout, activation=activation, n_layers=n_layers,
fc_dropout=fc_dropout, **kwargs)
test_eq(model(xb).shape, [bs, c_out])
print(f'model parameters: {count_parameters(model)}')