# CodeMode Core Scripts 🚀
This directory contains the **modular core logic** for the CodeMode pipeline. It is designed to be cleaner and more production-ready than the experimental notebooks.
## Structure
### 1. Ingestion (`scripts/core/ingestion`)
Handles data collection and processing.
- `ingest.py`: The Git Crawler (formerly `git_crawler.py`).
- `chunk.py`: The Universal Chunker (formerly `repo_chunker.py`).
- `generate_data.py`: Creates training triplets (formerly `pairs_triplets_generator.py`).
**Usage:**
```bash
# Example: Ingest a repo
python -m scripts.core.ingestion.ingest --url https://github.com/crewAIInc/crewAI
# Example: Generate Triplets
python -m scripts.core.ingestion.generate_data --chunks data/processed/chunks.jsonl --output data/training
```
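To make "training triplets" concrete, here is a minimal sketch of one way to derive (anchor, positive, negative) records from chunks. It is an illustration only, not the logic in `generate_data.py`: the `file` and `text` fields, the `build_triplets` helper, and the same-file/different-file heuristic are all assumptions, and the real `chunks.jsonl` schema may differ.

```python
import json
import random
from collections import defaultdict


def build_triplets(chunks, seed=0):
    """Pair each chunk with a same-file positive and a different-file negative.

    Hypothetical heuristic for illustration; not taken from generate_data.py.
    """
    rng = random.Random(seed)
    by_file = defaultdict(list)
    for chunk in chunks:
        by_file[chunk["file"]].append(chunk)

    triplets = []
    for file_name, group in by_file.items():
        others = [c for c in chunks if c["file"] != file_name]
        if len(group) < 2 or not others:
            continue  # need at least one positive and one negative candidate
        for anchor in group:
            positive = rng.choice([c for c in group if c is not anchor])
            negative = rng.choice(others)
            triplets.append(
                {
                    "anchor": anchor["text"],
                    "positive": positive["text"],
                    "negative": negative["text"],
                }
            )
    return triplets


# Hypothetical chunk records; the real chunks.jsonl schema may differ.
chunks = [
    {"file": "agent.py", "text": "class Agent: ..."},
    {"file": "agent.py", "text": "def run(self): ..."},
    {"file": "tools.py", "text": "def search(query): ..."},
    {"file": "tools.py", "text": "def fetch(url): ..."},
]

triplets = build_triplets(chunks)
# One JSON object per line, matching the .jsonl extension used above.
jsonl = "\n".join(json.dumps(t) for t in triplets)
```

Seeding the random generator keeps the sampled positives and negatives reproducible between runs, which helps when comparing training results.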
### 2. Training (`scripts/core/training`)
Handles model training and embedding generation.
- `train.py`: Command-line entry point that launches training (see the usage below).
- `model.py`: The CodeEmbedder model architecture.
- `trainer.py`: The training loop logic.
**Usage:**
```bash
# Example: Train the model
python -m scripts.core.training.train --data_path data/training/triplets.jsonl --epochs 3
```
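This README does not state which objective `train.py` optimizes. A common choice for triplet data is the triplet margin loss, sketched below in plain Python so the arithmetic is visible. The cosine distance, the `margin` value, and both function names are assumptions for illustration, not the project's implementation.

```python
import math


def cosine_distance(u, v):
    """Return 1 - cosine similarity of two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (norm_u * norm_v)


def triplet_margin_loss(anchor, positive, negative, margin=0.5):
    """Zero once the positive is closer to the anchor than the negative by `margin`."""
    d_pos = cosine_distance(anchor, positive)
    d_neg = cosine_distance(anchor, negative)
    return max(0.0, d_pos - d_neg + margin)


# Toy 2-D "embeddings" (assumed values, not model output).
anchor = [1.0, 0.0]
close = [1.0, 0.0]
far = [0.0, 1.0]

satisfied = triplet_margin_loss(anchor, positive=close, negative=far)
violated = triplet_margin_loss(anchor, positive=far, negative=close)
```

In the first call the positive coincides with the anchor and the negative is orthogonal to it, so the loss is zero. Swapping the two roles produces a positive loss, which is the signal that would push the embeddings apart during training.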
## Why this structure?
- **Separation of Concerns:** Training logic doesn't depend on web scraping libraries.
- **Reusability:** You can import `CodeEmbedder` or `RepoChunker` in other projects easily.
- **Production Ready:** Plain Python scripts that run from the command line instead of notebooks.
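Because each stage is a standalone module, the stages can be chained from a small driver script. The sketch below only assembles the three commands shown in the usage sections above, using the flags that appear there. The `pipeline_commands` and `run_pipeline` helpers are hypothetical, and no command is included for `chunk.py` because this README does not show one.

```python
import subprocess
import sys


def pipeline_commands(repo_url, chunks_path, training_dir, triplets_path, epochs):
    """Build argv lists for the commands documented in this README."""
    py = [sys.executable, "-m"]
    return [
        py + ["scripts.core.ingestion.ingest", "--url", repo_url],
        py + [
            "scripts.core.ingestion.generate_data",
            "--chunks", chunks_path,
            "--output", training_dir,
        ],
        py + [
            "scripts.core.training.train",
            "--data_path", triplets_path,
            "--epochs", str(epochs),
        ],
    ]


def run_pipeline(commands):
    """Run each stage in order; check=True stops at the first failing stage."""
    for argv in commands:
        subprocess.run(argv, check=True)


commands = pipeline_commands(
    repo_url="https://github.com/crewAIInc/crewAI",
    chunks_path="data/processed/chunks.jsonl",
    training_dir="data/training",
    triplets_path="data/training/triplets.jsonl",
    epochs=3,
)
# Call run_pipeline(commands) from the repository root to execute the stages.
```

Passing argument lists rather than a shell string avoids quoting problems with URLs and paths.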