Clean, modular rewrite of GLADtoTEXT with modern C++ practices, comprehensive testing, and a CMake build system.
- ✅ Modern C++17: Clean interfaces, RAII, proper const-correctness
- ✅ CMake Build System: Cross-platform, IDE-friendly
- ✅ Comprehensive Tests: GoogleTest framework with 79 tests across 15 test suites
- ✅ Memory Efficient: Aligned allocations for SIMD, proper resource management
- ✅ Deterministic: Seeded RNG for reproducible experiments
- ✅ Modular Design: Clear separation of concerns
- ModelConfig: Centralized configuration (no feature flags)
- EmbeddingTable: Hash-based embedding storage with aligned memory
- WordEncoder: N-gram + phonetic encoding
- NGramGenerator: Character n-gram extraction
- PhoneticEncoder: Soundex-like phonetic encoding
- HashFunction: FNV-1a and MurmurHash3 implementations
- RNG: Deterministic random number generation (MT19937-64)
- Logger: Thread-safe logging with levels
- AlignedAlloc: SIMD-friendly memory allocation
- EnglishTokenizer: Simple whitespace + punctuation tokenizer
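To make the interplay of NGramGenerator, HashFunction, and EmbeddingTable more concrete, here is a minimal sketch of how character n-grams might be hashed into embedding buckets with 64-bit FNV-1a. The function names (`fnv1a64`, `char_ngrams`, `ngram_buckets`) are hypothetical and not taken from this repository, and the `<`/`>` word-boundary markers are an assumption borrowed from fastText-style models; the actual classes may differ.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <string>
#include <string_view>
#include <vector>

// 64-bit FNV-1a over raw bytes, using the standard offset basis and prime.
// Hypothetical helper; the repository's HashFunction interface may differ.
inline std::uint64_t fnv1a64(std::string_view bytes) {
    std::uint64_t hash = 14695981039346656037ULL;  // FNV offset basis
    for (unsigned char c : bytes) {
        hash ^= c;                  // XOR the byte in first (the "1a" variant)
        hash *= 1099511628211ULL;   // then multiply by the FNV prime
    }
    return hash;
}

// Byte-level n-grams of "<word>" for every length in [min_n, max_n].
// The boundary markers are an assumption, not confirmed by the source.
inline std::vector<std::string> char_ngrams(std::string_view word,
                                            std::size_t min_n,
                                            std::size_t max_n) {
    assert(min_n >= 1 && min_n <= max_n);
    const std::string padded = "<" + std::string(word) + ">";
    std::vector<std::string> out;
    for (std::size_t n = min_n; n <= max_n; ++n) {
        for (std::size_t i = 0; i + n <= padded.size(); ++i) {
            out.push_back(padded.substr(i, n));
        }
    }
    return out;
}

// Map each n-gram to a row index in a table with bucket_count rows.
inline std::vector<std::size_t> ngram_buckets(std::string_view word,
                                              std::size_t min_n,
                                              std::size_t max_n,
                                              std::size_t bucket_count) {
    assert(bucket_count > 0);
    std::vector<std::size_t> rows;
    for (const std::string& gram : char_ngrams(word, min_n, max_n)) {
        rows.push_back(static_cast<std::size_t>(fnv1a64(gram) % bucket_count));
    }
    return rows;
}
```

Because distinct n-grams can collide in the same bucket, the bucket count trades memory for collision rate. This sketch works on bytes, so multi-byte UTF-8 characters would be split across n-grams.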
# Run tests
./build_and_test.sh
# Or manually
mkdir build && cd build
cmake ..
cmake --build .
ctest --output-on-failure
# Run example
./build/example_usage
# All tests
./build/gladtotext_tests
# Specific test
./build/gladtotext_tests --gtest_filter=EmbeddingTableTest.*
# List tests
./build/gladtotext_tests --gtest_list_tests
- ✅ ModelConfig: defaults, equality, validation
- ✅ RNG: deterministic generation
- ✅ EmbeddingTable: construction, access, determinism
- ✅ NGramGenerator: generation, correctness
- ✅ HashFunction: FNV-1a, MurmurHash3
- ✅ Tokenizer: basic, punctuation, edge cases
- ✅ LinearClassifier: forward, backward, determinism
- ✅ Softmax & CrossEntropy: numerical stability, correctness
- ✅ WordEncoder: encoding, determinism, phonetic contribution
- ✅ MeanSentenceEncoder: averaging, empty handling, determinism
- ✅ PhoneticEncoder: soundex, case handling, edge cases
- ✅ Training: overfitting, determinism, convergence
- ✅ Edge Cases: long inputs, special chars, unicode, extreme values
- ✅ Integration: end-to-end pipeline, training reduces loss
Total: 79 tests across 15 test suites
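As an illustration of what the "numerical stability" checks for Softmax & CrossEntropy typically guard against, the sketch below uses the common max-subtraction trick. The names `stable_softmax` and `cross_entropy` are hypothetical, and this is not necessarily how the repository implements or tests these functions.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Softmax that subtracts the largest logit first, so exp() never overflows.
// Hypothetical helper, not the repository's API.
inline std::vector<float> stable_softmax(const std::vector<float>& logits) {
    assert(!logits.empty());
    const float max_logit = *std::max_element(logits.begin(), logits.end());
    std::vector<float> probs(logits.size());
    float sum = 0.0f;
    for (std::size_t i = 0; i < logits.size(); ++i) {
        probs[i] = std::exp(logits[i] - max_logit);  // every exponent is <= 0
        sum += probs[i];
    }
    for (float& p : probs) {
        p /= sum;  // sum >= 1 because the max logit contributes exp(0)
    }
    return probs;
}

// Cross-entropy computed as log-sum-exp minus the target logit, which
// avoids taking log(0) when the target probability underflows.
inline float cross_entropy(const std::vector<float>& logits,
                           std::size_t target) {
    assert(target < logits.size());
    const float max_logit = *std::max_element(logits.begin(), logits.end());
    float sum = 0.0f;
    for (float z : logits) {
        sum += std::exp(z - max_logit);
    }
    return std::log(sum) + max_logit - logits[target];
}
```

A naive `exp(1000.0f)` overflows to infinity in single precision, whereas the shifted form stays finite for logits of any magnitude.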
#include <vector>

#include "config/model_config.h"
#include "embedding/embedding_table.h"
#include "word_ecoder/word_encoder.h"
// Create config
ModelConfig config;
// Create embeddings
EmbeddingTable embeddings(config.bucket_count,
                          config.embedding_dim,
                          config.seed);
// Create encoder
NGramGenerator ngram(config.ngram_min, config.ngram_max);
PhoneticEncoder phonetic;
WordEncoder encoder(embeddings, ngram, &phonetic,
                    config.bucket_count, config.phonetic_gamma);
// Encode word
std::vector<float> embedding(config.embedding_dim);
encoder.encode("hello", embedding.data());
core/
├── config/ # Configuration
├── embedding/ # Embedding storage
├── word_ecoder/ # Word encoding logic
├── ngram/ # N-gram generation
├── phonetic/ # Phonetic encoding
├── hashing/ # Hash functions
├── tokenizer/ # Text tokenization
└── utils/ # Utilities (RNG, logger, memory)
tests/ # Unit tests
external/ # GoogleTest (auto-downloaded)
- Simpler Config: 14 parameters vs 20+ boolean flags
- Better Testing: Unit tests vs shell scripts only
- Modern Build: CMake vs basic Makefile
- Cleaner Code: Separation of concerns, RAII
- Performance: Aligned memory, scratch buffers
- Reproducibility: Seeded RNG throughout
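The combination of RAII and aligned memory mentioned above could look roughly like the following sketch, which relies on the C++17 aligned `operator new` and a `std::unique_ptr` with a custom deleter. The names `kAlignment`, `AlignedDeleter`, `AlignedFloats`, and `make_aligned_floats` are hypothetical, and the 64-byte alignment is an assumption; the repository's AlignedAlloc utility may use a different mechanism or alignment.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <memory>
#include <new>

// Assumed alignment: 64 bytes covers AVX-512 loads and a typical cache line.
constexpr std::size_t kAlignment = 64;

// Deleter that pairs with the aligned operator new used below.
struct AlignedDeleter {
    void operator()(float* p) const noexcept {
        ::operator delete(p, std::align_val_t(kAlignment));
    }
};

// Owning handle: the buffer is released automatically when it goes out of scope.
using AlignedFloats = std::unique_ptr<float[], AlignedDeleter>;

// Allocate `count` zero-initialized floats on a kAlignment-byte boundary.
// Hypothetical helper, not the repository's API.
inline AlignedFloats make_aligned_floats(std::size_t count) {
    assert(count > 0);
    void* raw = ::operator new(count * sizeof(float),
                               std::align_val_t(kAlignment));
    float* data = static_cast<float*>(raw);
    std::uninitialized_fill_n(data, count, 0.0f);  // construct the floats
    return AlignedFloats(data);
}
```

Tying the deallocation to the handle's destructor means no code path can leak the buffer or release it with a mismatched deallocation function.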
Example with default config:
- Embedding table: ~195 MiB (200k buckets × 256 dim × 4-byte floats)
- Per-word encoding: ~2 KB scratch space
- Total: Configurable via bucket_count and embedding_dim
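The embedding table figure can be reproduced with a few lines of arithmetic. The sketch below assumes 4-byte `float` values, which is consistent with the `std::vector<float>` used in the usage example; `embedding_table_bytes` and `to_mib` are hypothetical helper names.

```cpp
#include <cstddef>

// Bytes needed for a dense bucket_count x embedding_dim table.
// bytes_per_value defaults to sizeof(float), assumed to be 4 here.
constexpr std::size_t embedding_table_bytes(
        std::size_t bucket_count,
        std::size_t embedding_dim,
        std::size_t bytes_per_value = sizeof(float)) {
    return bucket_count * embedding_dim * bytes_per_value;
}

// Convert a byte count to mebibytes (1 MiB = 1024 * 1024 bytes).
constexpr double to_mib(std::size_t bytes) {
    return static_cast<double>(bytes) / (1024.0 * 1024.0);
}
```

With 200,000 buckets and 256 dimensions this gives 204,800,000 bytes, or about 195.3 MiB. Halving both values to 100,000 buckets and 128 dimensions cuts the table to 51,200,000 bytes, roughly 48.8 MiB.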
ModelConfig config;
config.embedding_dim = 128; // Smaller embeddings
config.bucket_count = 100000; // Fewer buckets
config.ngram_min = 2; // Shorter n-grams
config.ngram_max = 5;
config.phonetic_gamma = 0.1f; // Less phonetic weight
config.seed = 42; // Reproducibility
- Add attention mechanism
- Add training loop
- Add classification head
- Add model serialization
- Add Python bindings
- Add benchmarks
| Feature | Main | Revisit |
|---|---|---|
| Build System | Makefile | CMake |
| Testing | Shell scripts | GoogleTest |
| Config | 20+ flags | 14 parameters |
| Code Style | Header-only | Header + impl |
| Memory | Manual | RAII + aligned |
| Reproducibility | Limited | Full (seeded) |
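To illustrate what fully seeded reproducibility can mean in practice, here is a minimal sketch of deterministic initialization from `std::mt19937_64`. The function `seeded_uniform` is hypothetical and not part of this repository. It converts raw engine output to floats by hand because the C++ standard fixes the output sequence of `std::mt19937_64` for a given seed, but does not fix the algorithm behind distributions such as `std::uniform_real_distribution`, so those can differ between standard library implementations.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <random>
#include <vector>

// Fill `count` floats uniformly in [-scale, scale) from a seeded MT19937-64.
// Hypothetical helper, not the repository's RNG interface.
inline std::vector<float> seeded_uniform(std::size_t count,
                                         std::uint64_t seed,
                                         float scale) {
    assert(scale >= 0.0f);
    std::mt19937_64 engine(seed);
    std::vector<float> values(count);
    for (float& v : values) {
        // Keep the top 24 bits: every such integer is exactly representable
        // as a float, so dividing by 2^24 yields a value in [0, 1).
        const float unit =
            static_cast<float>(engine() >> 40) * (1.0f / 16777216.0f);
        v = (2.0f * unit - 1.0f) * scale;  // shift to [-scale, scale)
    }
    return values;
}
```

Two calls with the same seed return identical vectors, which is the property a determinism test would typically assert.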