models.GPT

def FLOP(input_len, vocab_size, embed_dim, num_heads, num_decoders, fc_dim_factor):

Calculates the total number of FLOPs for a single forward pass, following Appendix F of the Chinchilla paper: https://arxiv.org/pdf/2203.15556.pdf

Copied from: https://github.com/karpathy/nanoGPT/blob/master/scaling_laws.ipynb

Reference:

  • Hoffmann, Jordan, et al. "Training compute-optimal large language models." arXiv preprint arXiv:2203.15556 (2022), Appendix F.
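
The docstring defers the arithmetic to the paper, so the following is a minimal sketch of the Appendix F forward-pass accounting, written with the same parameter names as FLOP. The per-head size (embed_dim // num_heads) and the feed-forward width (fc_dim_factor * embed_dim) are assumptions, since neither is spelled out above, and only forward-pass FLOPs are counted (training FLOPs are commonly taken as roughly 3x this figure).

    # Hedged sketch of Chinchilla-style (Appendix F) FLOP counting for one forward pass.
    # Assumed, not documented: key_size = embed_dim // num_heads and
    # feed-forward width = fc_dim_factor * embed_dim.
    def flop_estimate(input_len, vocab_size, embed_dim, num_heads, num_decoders, fc_dim_factor):
        key_size = embed_dim // num_heads
        ffw_size = fc_dim_factor * embed_dim

        embeddings = 2 * input_len * vocab_size * embed_dim

        attention = (
            2 * 3 * input_len * embed_dim * (key_size * num_heads)   # Q, K, V projections
            + 2 * input_len * input_len * (key_size * num_heads)     # QK^T attention logits
            + 3 * num_heads * input_len * input_len                  # softmax
            + 2 * input_len * input_len * (key_size * num_heads)     # attention-weighted values
            + 2 * input_len * (key_size * num_heads) * embed_dim     # output projection
        )

        dense = 2 * input_len * 2 * embed_dim * ffw_size             # two feed-forward matmuls

        logits = 2 * input_len * embed_dim * vocab_size              # final vocabulary projection

        return embeddings + num_decoders * (attention + dense) + logits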

class GPT(keras.src.layers.layer.Layer):

GPT (Generative Pre-trained Transformer) layer.

This layer implements the architecture of the GPT model, which consists of multiple decoder layers followed by a linear mapping head for language modeling.

Parameters:

  • decoder (class): Class representing the decoder layer of the Transformer model.
  • embeddings (class): Class representing the token embeddings.
  • pos_embeddings (class): Class representing the positional embeddings.
  • embedding_size (int): Size of the token embeddings. Default is 1280.
  • vocab_size (int): Size of the vocabulary. Default is 8008.
  • input_len (int): Length of the input sequence. Default is 64.
  • num_decoders (int): Number of decoder layers in the GPT model. Default is 5.
  • dropout_rate (float): Dropout rate. Default is 0.1.
  • num_heads (int): Number of attention heads. Default is 32.
  • head_dims (int): Dimensionality of each attention head. Default is 40.
  • fc_dim_factor (int): Factor controlling the dimensionality of the fully connected layers. Default is 5.
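
A quick check on the listed defaults (an observation about the values above, not a documented constraint): the attention heads tile the embedding exactly, and the fully connected width presumably scales as fc_dim_factor times the embedding size.

>>> 32 * 40    # num_heads * head_dims matches embedding_size
1280
>>> 5 * 1280   # assumed fully connected width: fc_dim_factor * embedding_size
6400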

Attributes:

  • num_decoders (int): Number of decoder layers in the GPT model.
  • decoders (list): List of decoder layer instances.
  • embeddings (keras.layers.Layer): Token embeddings layer instance.
  • lm_head (keras.layers.Dense): Dense layer for language modeling.
  • _config (dict): Configuration dictionary storing the parameters used to initialize the GPT layer.

References:

  • Radford, Alec, et al. "Improving language understanding by generative pre-training." (2018).
  • Vaswani, Ashish, et al. "Attention is all you need." Advances in neural information processing systems 30 (2017).

Example:

>>> import keras
>>> from Corpus2GPT.models import *
>>> inputs = keras.Input(shape=(64,))
>>> outputs = GPT(decoder, TokenAndPositionEmbedding)(inputs)
>>> gpt = keras.Model(inputs=inputs, outputs=outputs)
>>> gpt.summary()

GPT(decoder, embeddings, pos_embeddings=None, embedding_size=1280, vocab_size=8008, input_len=64, num_decoders=5, dropout_rate=0.1, num_heads=32, head_dims=40, fc_dim_factor=5)

Initializes the GPT layer.

Args:

  • decoder (class): Class representing the decoder layer of the Transformer model.
  • embeddings (class): Class representing the token embeddings.
  • pos_embeddings (class): Class representing the positional embeddings.
  • embedding_size (int): Size of the token embeddings. Default is 1280.
  • vocab_size (int): Size of the vocabulary. Default is 8008.
  • input_len (int): Length of the input sequence. Default is 64.
  • num_decoders (int): Number of decoder layers in the GPT model. Default is 5.
  • dropout_rate (float): Dropout rate. Default is 0.1.
  • num_heads (int): Number of attention heads. Default is 32.
  • head_dims (int): Dimensionality of each attention head. Default is 40.
  • fc_dim_factor (int): Factor controlling the dimensionality of the fully connected layers. Default is 5.
def call(self, inputs):

Executes the forward pass of the GPT layer.

Args:

  • inputs: Input tensor representing the token indices.

Returns:

  • Tensor: Output tensor representing the logits for language modeling.
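
The shape of the forward pass follows from the class description above (token embeddings, a stack of decoders, then the language-modeling head). The sketch below is one plausible rendering of call using the attributes listed for this layer; it is an illustration, not the repository's exact implementation.

    # Hedged sketch of the forward pass, assuming the attributes documented above
    # (self.embeddings, self.decoders, self.lm_head) behave as described.
    def call(self, inputs):
        x = self.embeddings(inputs)      # token (and positional) embeddings
        for decoder in self.decoders:    # num_decoders stacked decoder layers
            x = decoder(x)
        return self.lm_head(x)           # logits, presumably (batch, input_len, vocab_size)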

Inherited Members

  • keras.src.layers.layer.Layer: get_build_config, build_from_config, add_variable, add_weight, trainable, variables, trainable_variables, non_trainable_variables, weights, trainable_weights, non_trainable_weights, metrics, metrics_variables, get_weights, set_weights, dtype, compute_dtype, variable_dtype, input_dtype, supports_masking, stateless_call, add_loss, losses, save_own_variables, load_own_variables, count_params, get_config
  • keras.src.ops.operation.Operation: from_config, input, output
def build_GPT(input_len, vocab_size, embed_dim, num_decoders, dropout_rate, num_heads, head_dims, fc_dim_factor, optimizer='adam'):

Builds a GPT (Generative Pre-trained Transformer) model.

Parameters:

  • input_len (int): The length of the input sequence.
  • vocab_size (int): The size of the vocabulary.
  • embed_dim (int): The dimensionality of the token embeddings.
  • num_decoders (int): The number of decoder layers.
  • dropout_rate (float): The dropout rate to apply within the model.
  • num_heads (int): The number of attention heads in each decoder layer.
  • head_dims (int): The dimensionality of each attention head.
  • fc_dim_factor (int): The factor to determine the dimensionality of the feedforward network within each decoder layer.
  • optimizer (str, optional): The optimizer to use for training. Defaults to 'adam'.

Returns:

  • tuple: A tuple containing the GPT model and the total number of floating-point operations (FLOPs) for a single forward pass.

References:

  • Radford, Alec, et al. "Improving language understanding by generative pre-training." (2018).
  • Vaswani, Ashish, et al. "Attention is all you need." Advances in neural information processing systems 30 (2017).

Example:

>>> vocab_size = 8008
>>> gpt, flops = build_GPT(256, vocab_size, 1000, 2, 0, 50, 20, 5)
>>> # Total floating-point operations required for a single forward pass
>>> print(flops)
>>> gpt.summary()
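
Because build_GPT accepts an optimizer argument, the returned model is presumably compiled and ready to train. Continuing the example above, a hedged usage sketch follows; the random token data, batch size, and the assumption that a token-level loss was configured are illustrative, not part of the documented API.

>>> import numpy as np
>>> x = np.random.randint(0, vocab_size, size=(8, 256))  # token ids, shape (batch, input_len)
>>> y = np.random.randint(0, vocab_size, size=(8, 256))  # next-token targets
>>> gpt.fit(x, y, epochs=1)  # assumes the model was compiled with a token-level loss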