	# CodeMode Core Scripts 🚀

	This directory contains the modular core logic for the CodeMode pipeline. It replaces the experimental notebooks with standalone scripts that are easier to maintain and run in production.

	## Structure

	### 1. Ingestion (`scripts/core/ingestion`)
	Handles data collection and processing.
	- `ingest.py`: The Git Crawler (formerly `git_crawler.py`).
	- `chunk.py`: The Universal Chunker (formerly `repo_chunker.py`).
	- `generate_data.py`: Creates training triplets (formerly `pairs_triplets_generator.py`).

	Usage:
	```bash
	# Example: Ingest a repo
	python -m scripts.core.ingestion.ingest --url https://github.com/crewAIInc/crewAI

	# Example: Generate Triplets
	python -m scripts.core.ingestion.generate_data --chunks data/processed/chunks.jsonl --output data/training
	```
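
The triplet-generation step can be pictured with a minimal Python sketch. It assumes each chunk record carries a `repo` and a `text` field and that two chunks from the same repository form a positive pair. The field names, the `make_triplets` helper, and the sampling rule are all hypothetical and are not taken from `generate_data.py`.

```python
import random
from collections import defaultdict


def make_triplets(chunks, seed=0):
    """Build (anchor, positive, negative) triplets from chunk records.

    Assumption: each chunk is a dict with hypothetical "repo" and "text" keys.
    Positive = another chunk from the same repo.
    Negative = a chunk from a different repo.
    """
    rng = random.Random(seed)

    # Group chunk texts by the repository they came from.
    by_repo = defaultdict(list)
    for chunk in chunks:
        by_repo[chunk["repo"]].append(chunk["text"])

    triplets = []
    for repo, texts in by_repo.items():
        # Candidate negatives: every chunk that lives outside this repo.
        negatives = [t for r, ts in by_repo.items() if r != repo for t in ts]
        if len(texts) < 2 or not negatives:
            continue  # need at least one positive and one negative
        for i, anchor in enumerate(texts):
            # Pick the positive from the same repo, excluding the anchor's slot.
            positive = rng.choice(texts[:i] + texts[i + 1:])
            triplets.append({
                "anchor": anchor,
                "positive": positive,
                "negative": rng.choice(negatives),
            })
    return triplets


chunks = [
    {"repo": "a", "text": "def add(x, y): return x + y"},
    {"repo": "a", "text": "def sub(x, y): return x - y"},
    {"repo": "b", "text": "class Agent: pass"},
]
triplets = make_triplets(chunks)
```

Repositories with a single chunk are skipped as anchors here because they cannot supply a positive, though their chunks still serve as negatives for other repositories.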

	### 2. Training (`scripts/core/training`)
	Handles model training and embedding generation.
	- `train.py`: The command-line entry point that launches a training run.
	- `model.py`: The CodeEmbedder model architecture.
	- `trainer.py`: The training loop logic.

	Usage:
	```bash
	# Example: Train the model
	python -m scripts.core.training.train --data_path data/training/triplets.jsonl --epochs 3
	```
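
This README does not say which objective the training loop optimizes. Since the input is a file of triplets, one common choice is a triplet margin loss, sketched below in plain Python under that assumption. The cosine distance, the `margin` value, and the function names are illustrative and are not taken from `trainer.py`.

```python
import math


def cosine_distance(u, v):
    """Return 1 - cosine similarity for two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (norm_u * norm_v)


def triplet_margin_loss(anchor, positive, negative, margin=0.5):
    """Penalize triplets where the positive is not closer than the negative
    by at least `margin` (an assumed hyperparameter)."""
    d_pos = cosine_distance(anchor, positive)
    d_neg = cosine_distance(anchor, negative)
    return max(0.0, d_pos - d_neg + margin)


anchor = [1.0, 0.0]
# Well-separated triplet: positive matches the anchor, negative is orthogonal.
easy = triplet_margin_loss(anchor, [1.0, 0.0], [0.0, 1.0])
# Inverted triplet: the "positive" is orthogonal, the "negative" matches.
hard = triplet_margin_loss(anchor, [0.0, 1.0], [1.0, 0.0])
```

The loss is zero once the positive is closer to the anchor than the negative by at least the margin, so only triplets that are still confused contribute to the gradient.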

	## Why this structure?
	- Separation of Concerns: Training logic doesn't depend on web scraping libraries.
	- Reusability: You can import `CodeEmbedder` or `RepoChunker` in other projects easily.
	- Production Ready: Plain Python scripts that run from the command line instead of notebooks.