What is Corpus2GPT?
Corpus2GPT is a project that lets users train their own GPT models on diverse datasets, including local-language corpora and other corpus types. Built on Keras, it runs on the TensorFlow, PyTorch, or JAX backend, so the same code can be trained and benchmarked across frameworks. In contrast to toolkits with large, hard-to-follow codebases, Corpus2GPT is organized into small, modular components that are easy to navigate, modify, and understand, and it ships with documentation covering the supported corpora, backends, and scaling options. It is aimed at a broad audience, from researchers to industry practitioners and hobbyists.
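Because Corpus2GPT builds on Keras 3, the compute backend is selected the standard Keras way, through the `KERAS_BACKEND` environment variable set before Keras is imported. A minimal sketch of backend selection follows; the Corpus2GPT imports that would come afterwards are omitted, since the project's module layout is not described here.

```python
import os

# Choose the Keras 3 backend before importing keras.
# Valid values are "tensorflow", "jax", or "torch".
os.environ["KERAS_BACKEND"] = "torch"

import keras

# Confirm which backend is active, e.g. "torch".
print(keras.backend.backend())
```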
Key Features
- Classical Multihead Attention
- Decoder Module (see the sketch after this list)
- Random Sampling Search Strategies
- Multiple Language Support
- SentencePiece Tokenizer and Vectorizer
- GPT Builder
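To make the decoder and attention components above concrete, here is a minimal, illustrative sketch of a causal decoder block written with standard Keras 3 layers. The class name `DecoderBlock` and its arguments are hypothetical, chosen for illustration only; they are not Corpus2GPT's actual API, which is documented in the repository.

```python
import keras
from keras import layers


class DecoderBlock(layers.Layer):
    """Pre-norm transformer decoder block: causal multihead self-attention + feed-forward."""

    def __init__(self, embed_dim, num_heads, ff_dim, dropout=0.1, **kwargs):
        super().__init__(**kwargs)
        self.attn = layers.MultiHeadAttention(
            num_heads=num_heads, key_dim=embed_dim // num_heads
        )
        self.ffn = keras.Sequential(
            [layers.Dense(ff_dim, activation="gelu"), layers.Dense(embed_dim)]
        )
        self.norm1 = layers.LayerNormalization(epsilon=1e-6)
        self.norm2 = layers.LayerNormalization(epsilon=1e-6)
        self.drop = layers.Dropout(dropout)

    def call(self, x, training=False):
        # Causal self-attention: each token attends only to itself and earlier tokens.
        h = self.norm1(x)
        x = x + self.drop(self.attn(h, h, use_causal_mask=True), training=training)
        # Position-wise feed-forward network.
        h = self.norm2(x)
        return x + self.drop(self.ffn(h), training=training)
```

A stack of such blocks, together with token and position embeddings and a final projection to the vocabulary, is essentially what a GPT builder assembles.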
Upcoming Features
- Mixture-of-Experts (MoE) Support
- Distributed Model Loading
- Retrieval-Augmented Generation (RAG) Models
- Transformer Debugger
- Fine-Tuning Interface
- Hyperparameter Optimization
- Tree of Thoughts (Problem-Solving)
- Model Distillation
Get Involved
Corpus2GPT is an open-source project, and we welcome contributions from the community. Whether you're an experienced developer, a domain expert, or someone passionate about language modeling and NLP, there are many ways you can contribute.
Visit the GitHub repository to get started.