models.GPT
Calculates the total number of FLOPs; see Appendix F of the Chinchilla paper as a reference: https://arxiv.org/pdf/2203.15556.pdf
Copied from: https://github.com/karpathy/nanoGPT/blob/master/scaling_laws.ipynb
Reference:
- Hoffmann, Jordan, et al. "Training compute-optimal large language models." arXiv preprint arXiv:2203.15556 (2022). Appendix F.
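For orientation, the sketch below reproduces the per-sequence forward-pass FLOP accounting from Appendix F of the Chinchilla paper (as in the nanoGPT notebook linked above). The function name and argument names are illustrative, not necessarily the ones used in this module; `ffw_size` stands for the feed-forward width that `fc_dim_factor` controls here.

def chinchilla_forward_flops(seq_len, vocab_size, d_model,
                             key_size, num_heads, num_layers, ffw_size):
    """Forward-pass FLOPs for one sequence, per Chinchilla Appendix F (illustrative)."""
    # Token embedding matmul.
    embeddings = 2 * seq_len * vocab_size * d_model
    # One attention block: Q/K/V projections, QK^T logits, softmax,
    # attention-weighted sum over V, and the output projection.
    qkv = 2 * 3 * seq_len * d_model * (key_size * num_heads)
    att_logits = 2 * seq_len * seq_len * (key_size * num_heads)
    att_softmax = 3 * num_heads * seq_len * seq_len
    att_reduce = 2 * seq_len * seq_len * (key_size * num_heads)
    att_project = 2 * seq_len * (key_size * num_heads) * d_model
    attention = qkv + att_logits + att_softmax + att_reduce + att_project
    # One feed-forward block (two dense layers of width ffw_size).
    dense = 2 * seq_len * (d_model * ffw_size + d_model * ffw_size)
    # Final projection to vocabulary logits.
    logits = 2 * seq_len * d_model * vocab_size
    return embeddings + num_layers * (attention + dense) + logits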
GPT (Generative Pre-trained Transformer) layer.
This layer implements the GPT architecture: a stack of Transformer decoder layers followed by a linear language-modeling head.
Parameters:
- decoder (class): Class representing the decoder layer of the Transformer model.
- embeddings (class): Class representing the token embeddings.
- pos_embeddings (class): Class representing the positional embeddings.
- embedding_size (int): Size of the token embeddings. Default is 1280.
- vocab_size (int): Size of the vocabulary. Default is 8008.
- input_len (int): Length of the input sequence. Default is 64.
- num_decoders (int): Number of decoder layers in the GPT model. Default is 5.
- dropout_rate (float): Dropout rate. Default is 0.1.
- num_heads (int): Number of attention heads. Default is 32.
- head_dims (int): Dimensionality of each attention head. Default is 40.
- fc_dim_factor (int): Factor controlling the dimensionality of the fully connected layers. Default is 5.
Attributes:
- num_decoders (int): Number of decoder layers in the GPT model.
- decoders (list): List of decoder layer instances.
- embeddings (keras.layers.Layer): Token embeddings layer instance.
- lm_head (keras.layers.Dense): Dense layer for language modeling.
- _config (dict): Configuration dictionary storing the parameters used to initialize the GPT layer.
References:
- Radford, Alec, et al. "Improving language understanding by generative pre-training." (2018).
- Vaswani, Ashish, et al. "Attention is all you need." Advances in neural information processing systems 30 (2017).
Example:
>>> import keras
>>> from Corpus2GPT.models import *
>>> inputs = keras.Input(shape=(64,))
>>> outputs = GPT(decoder, TokenAndPositionEmbedding)(inputs)
>>> gpt = keras.Model(inputs=inputs, outputs=outputs)
>>> gpt.summary()
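Because the language-modeling head returns unnormalized logits, a wrapped model such as `gpt` above can be trained with a from-logits sparse categorical cross-entropy. The lines below are a hedged sketch with made-up toy data, not part of the documented API:

>>> import numpy as np
>>> token_ids = np.random.randint(0, 8008, size=(32, 64))   # hypothetical toy batch, shape (batch, input_len)
>>> next_token_ids = np.roll(token_ids, -1, axis=1)          # toy next-token targets
>>> gpt.compile(optimizer="adam",
...             loss=keras.losses.SparseCategoricalCrossentropy(from_logits=True))
>>> gpt.fit(token_ids, next_token_ids, epochs=1)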
Initializes the GPT layer.
Args:
- decoder (class): Class representing the decoder layer of the Transformer model.
- embeddings (class): Class representing the token embeddings.
- pos_embeddings (class): Class representing the positional embeddings.
- embedding_size (int): Size of the token embeddings. Default is 1280.
- vocab_size (int): Size of the vocabulary. Default is 8008.
- input_len (int): Length of the input sequence. Default is 64.
- num_decoders (int): Number of decoder layers in the GPT model. Default is 5.
- dropout_rate (float): Dropout rate. Default is 0.1.
- num_heads (int): Number of attention heads. Default is 32.
- head_dims (int): Dimensionality of each attention head. Default is 40.
- fc_dim_factor (int): Factor controlling the dimensionality of the fully connected layers. Default is 5.
Executes the forward pass of the GPT layer.
Args:
- inputs: Input tensor representing the token indices.
Returns:
- Tensor: Output tensor representing the logits for language modeling.
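As a shape sanity check (a sketch assuming the default hyperparameters, i.e. `input_len=64` and `vocab_size=8008`, and the `decoder`/`TokenAndPositionEmbedding` classes from the example above), the logits carry one score per vocabulary entry for every input position:

>>> import numpy as np
>>> layer = GPT(decoder, TokenAndPositionEmbedding)
>>> token_ids = np.random.randint(0, 8008, size=(2, 64))
>>> layer(token_ids).shape   # (2, 64, 8008) -> (batch, input_len, vocab_size)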
Inherited Members
- keras.src.layers.layer.Layer
  - get_build_config
  - build_from_config
  - add_variable
  - add_weight
  - trainable
  - variables
  - trainable_variables
  - non_trainable_variables
  - weights
  - trainable_weights
  - non_trainable_weights
  - metrics
  - metrics_variables
  - get_weights
  - set_weights
  - dtype
  - compute_dtype
  - variable_dtype
  - input_dtype
  - supports_masking
  - stateless_call
  - add_loss
  - losses
  - save_own_variables
  - load_own_variables
  - count_params
  - get_config
- keras.src.ops.operation.Operation
  - from_config
  - input
  - output
Builds a GPT (Generative Pre-trained Transformer) model.
Parameters:
- input_len (int): The length of the input sequence.
- vocab_size (int): The size of the vocabulary.
- embed_dim (int): The dimensionality of the token embeddings.
- num_decoders (int): The number of decoder layers.
- dropout_rate (float): The dropout rate to apply within the model.
- num_heads (int): The number of attention heads in each decoder layer.
- head_dims (int): The dimensionality of each attention head.
- fc_dim_factor (int): The factor to determine the dimensionality of the feedforward network within each decoder layer.
- optimizer (str, optional): The optimizer to use for training. Defaults to 'adam'.
Returns:
- tuple: A tuple containing the GPT model and the total number of floating-point operations (FLOPs) required for a single forward pass.
References:
- Radford, Alec, et al. "Improving language understanding by generative pre-training." (2018).
- Vaswani, Ashish, et al. "Attention is all you need." Advances in neural information processing systems 30 (2017).
Example:
>>> vocab_size = 8008  # e.g., the module's default vocabulary size
>>> gpt_model, flops = build_GPT(256, vocab_size, 1000, 2, 0, 50, 20, 5)
>>> # The total number of floating-point operations required for a single forward inference
>>> print(flops)
>>> gpt_model.summary()
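The returned `flops` value counts one forward pass over a single input sequence. As a rough, hedged extrapolation following the Chinchilla accounting (the backward pass is taken to cost roughly twice the forward pass), a training budget can be sketched as:

>>> batch_size, num_steps = 32, 100_000          # hypothetical training budget
>>> flops_per_step = 3 * flops * batch_size      # forward + backward (~2x forward)
>>> total_training_flops = flops_per_step * num_steps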