# CodeMode Core Scripts

This directory contains the **modular core logic** for the CodeMode pipeline. It is designed to be cleaner and more production-ready than the experimental notebooks.

## Structure

### 1. Ingestion (`scripts/core/ingestion`)

Handles data collection and processing.

- `ingest.py`: The Git Crawler (formerly `git_crawler.py`).
- `chunk.py`: The Universal Chunker (formerly `repo_chunker.py`).
- `generate_data.py`: Creates training triplets (formerly `pairs_triplets_generator.py`).

**Usage:**

```bash
# Example: Ingest a repo
python -m scripts.core.ingestion.ingest --url https://github.com/crewAIInc/crewAI

# Example: Generate Triplets
python -m scripts.core.ingestion.generate_data --chunks data/processed/chunks.jsonl --output data/training
```
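
To give a feel for what "creating training triplets" from chunks can involve, here is a minimal sketch. It is not the logic in `generate_data.py`: the README does not describe the chunk schema or how positives and negatives are chosen, so the `file_path` and `text` keys, the same-file heuristic, and the helper names `read_jsonl`, `write_jsonl`, and `build_triplets` are all assumptions made for illustration.

```python
import json
import random
from collections import defaultdict


def read_jsonl(path):
    """Read one JSON object per non-empty line."""
    with open(path, encoding="utf-8") as handle:
        return [json.loads(line) for line in handle if line.strip()]


def write_jsonl(path, records):
    """Write one JSON object per line."""
    with open(path, "w", encoding="utf-8") as handle:
        for record in records:
            handle.write(json.dumps(record) + "\n")


def build_triplets(chunks, seed=0):
    """Build (anchor, positive, negative) triplets from chunk records.

    Assumed (hypothetical) schema: each chunk is a dict with "file_path"
    and "text". Illustrative heuristic: the positive is another chunk from
    the same file, the negative is a chunk from a different file.
    """
    rng = random.Random(seed)  # seeded so the output is reproducible
    by_file = defaultdict(list)
    for chunk in chunks:
        by_file[chunk["file_path"]].append(chunk["text"])

    files = sorted(by_file)
    triplets = []
    for path in files:
        texts = by_file[path]
        other_files = [f for f in files if f != path]
        # A file needs at least two chunks (anchor + positive), and there
        # must be at least one other file to draw a negative from.
        if len(texts) < 2 or not other_files:
            continue
        for i, anchor in enumerate(texts):
            positive = rng.choice(texts[:i] + texts[i + 1:])
            negative = rng.choice(by_file[rng.choice(other_files)])
            triplets.append(
                {"anchor": anchor, "positive": positive, "negative": negative}
            )
    return triplets
```

Under these assumptions, `write_jsonl("data/training/triplets.jsonl", build_triplets(read_jsonl("data/processed/chunks.jsonl")))` would mirror the CLI call above, but the real script may select pairs quite differently.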
### 2. Training (`scripts/core/training`)

Handles model training and embedding generation.

- `train.py`: The main training script, run from the command line.
- `model.py`: The `CodeEmbedder` model architecture.
- `trainer.py`: The training loop logic.

**Usage:**

```bash
# Example: Train the model
python -m scripts.core.training.train --data_path data/training/triplets.jsonl --epochs 3
```
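
The README does not say which objective `trainer.py` optimizes. Since the training data consists of triplets, a margin-based triplet loss is a common choice, and the sketch below shows that idea in plain Python on toy vectors. The function names and the `margin` default are assumptions for illustration, not the project's API.

```python
import math


def euclidean(u, v):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def triplet_margin_loss(anchor, positive, negative, margin=1.0):
    """Zero once the negative is at least `margin` farther than the positive.

    loss = max(0, d(anchor, positive) - d(anchor, negative) + margin)
    """
    return max(
        0.0,
        euclidean(anchor, positive) - euclidean(anchor, negative) + margin,
    )


# A well-separated triplet: positive at distance 1, negative at distance 5.
well_separated = triplet_margin_loss([0.0, 0.0], [0.0, 1.0], [3.0, 4.0])

# A violated triplet: the "positive" is farther away than the "negative".
violated = triplet_margin_loss([0.0, 0.0], [3.0, 4.0], [0.0, 1.0])
```

Here `well_separated` is `0.0` and `violated` is `5.0`: the loss only penalizes triplets whose negative is not yet sufficiently farther from the anchor than the positive.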
## Why this structure?

- **Separation of Concerns:** Training logic doesn't depend on web scraping libraries.
- **Reusability:** You can easily import `CodeEmbedder` or `RepoChunker` in other projects.
- **Production Ready:** Plain Python scripts instead of notebooks.