Spaces:

Tyycha
/

Ru2SQL


Tyycha committed 29 days ago

Commit

8871df9

0 Parent(s):

initial commit


Files changed (46)

.env.example +23 -0
.gitignore +70 -0
README.md +184 -0
adapters/qwen-coder-pauq-lora/.gitattributes +36 -0
adapters/qwen-coder-pauq-lora/README.md +199 -0
adapters/qwen-coder-pauq-lora/adapter_config.json +48 -0
adapters/qwen-coder-pauq-lora/chat_template.jinja +54 -0
adapters/qwen-coder-pauq-lora/tokenizer.json +3 -0
adapters/qwen-coder-pauq-lora/tokenizer_config.json +30 -0
configs/example_vocabulary.yaml +33 -0
data/demo/sales.sqlite +0 -0
data/demo/sales.sqlite-journal +0 -0
data/demo/test.db +0 -0
data/demo/test.db-journal +0 -0
data/pauq_repo +1 -0
notebooks/kaggle_train_qwen_qlora.ipynb +428 -0
plan_VKR_text2sql_ru.md +264 -0
pyproject.toml +74 -0
requirements.txt +14 -0
src/__init__.py +3 -0
src/api/__init__.py +0 -0
src/api/dependencies.py +42 -0
src/api/main.py +110 -0
src/api/schemas.py +36 -0
src/business/__init__.py +3 -0
src/business/vocabulary.py +173 -0
src/config.py +48 -0
src/data/__init__.py +0 -0
src/data/loader.py +52 -0
src/data/prompt.py +29 -0
src/data/schema.py +76 -0
src/db/__init__.py +4 -0
src/db/connector.py +238 -0
src/db/executor.py +152 -0
src/evaluation/__init__.py +0 -0
src/evaluation/evaluate.py +72 -0
src/evaluation/metrics.py +89 -0
src/models/__init__.py +0 -0
src/models/inference.py +94 -0
src/models/postprocess.py +50 -0
streamlit_app.py +375 -0
tests/__init__.py +0 -0
tests/test_metrics.py +56 -0
tests/test_postprocess.py +46 -0
tests/test_prompt.py +32 -0
tests/test_schema.py +44 -0

.env.example ADDED

	@@ -0,0 +1,23 @@

+# Copy this file to .env and fill it in. The .env file is not committed to git.
+# API key for the baseline comparison (pick one provider)
+GIGACHAT_API_KEY=
+OPENAI_API_KEY=
+YANDEXGPT_API_KEY=
+YANDEXGPT_FOLDER_ID=
+# HuggingFace (needed to download private adapters)
+HF_TOKEN=
+# Local model
+BASE_MODEL_NAME=Qwen/Qwen2.5-Coder-3B-Instruct
+LORA_ADAPTER_PATH=./checkpoints/qwen-coder-pauq-lora
+DEVICE=cpu
+# Paths
+PAUQ_DATA_DIR=./data/pauq
+DATABASES_DIR=./data/databases
+# API
+API_HOST=127.0.0.1
+API_PORT=8000
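
A minimal sketch of how these variables might be read at startup, using only the standard library. The README describes `src/config.py` as built on pydantic-settings, and that file is not reproduced here, so the `Settings` dataclass and `load_settings` helper below are hypothetical names for illustration; the defaults simply mirror the values in `.env.example`.

```python
import os
from dataclasses import dataclass


@dataclass(frozen=True)
class Settings:
    """Hypothetical settings container; field names mirror .env.example."""

    base_model_name: str
    lora_adapter_path: str
    device: str
    api_host: str
    api_port: int


def load_settings(env=None) -> Settings:
    """Read settings from a mapping (os.environ by default), using the example defaults as fallbacks."""
    env = os.environ if env is None else env
    return Settings(
        base_model_name=env.get("BASE_MODEL_NAME", "Qwen/Qwen2.5-Coder-3B-Instruct"),
        lora_adapter_path=env.get("LORA_ADAPTER_PATH", "./checkpoints/qwen-coder-pauq-lora"),
        device=env.get("DEVICE", "cpu"),
        api_host=env.get("API_HOST", "127.0.0.1"),
        # Environment values are strings, so the port is converted explicitly.
        api_port=int(env.get("API_PORT", "8000")),
    )
```

Passing a plain dict instead of `os.environ` keeps the helper easy to exercise in tests without touching the real environment.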

.gitignore ADDED

	@@ -0,0 +1,70 @@

+# Python
+__pycache__/
+*.py[cod]
+*$py.class
+*.so
+.Python
+*.egg-info/
+.eggs/
+dist/
+build/
+# Virtual environments
+.venv/
+venv/
+env/
+.python-version
+# uv
+.uv/
+uv.lock
+# IDE
+.vscode/
+.idea/
+*.swp
+*.swo
+# Jupyter
+.ipynb_checkpoints/
+*.ipynb_checkpoints
+# Environment variables
+.env
+.env.local
+.env.*.local
+# ML artifacts
+checkpoints/
+wandb/
+*.bin
+*.safetensors
+*.gguf
+# Data
+data/pauq/
+data/databases/
+data/processed/
+data/*.json
+data/*.sqlite
+data/*.db
+# The demo database must stay in the repository
+!data/demo/sales.sqlite
+# Logs
+*.log
+logs/
+# OS
+.DS_Store
+Thumbs.db
+desktop.ini
+# Test artifacts
+.pytest_cache/
+.coverage
+htmlcov/
+# Outputs
+outputs/
+results/

README.md ADDED

	@@ -0,0 +1,184 @@

+---
+title: Ru2SQL
+emoji: 🗄️
+colorFrom: blue
+colorTo: purple
+sdk: streamlit
+sdk_version: 1.35.0
+app_file: streamlit_app.py
+pinned: false
+---
+# ru2sql
+A generative model that converts questions in Russian into SQL queries.
+Practical part of a bachelor's thesis (VKR), Software Engineering program, 4th year.
+**Stack:** Python 3.10+, PyTorch, transformers, PEFT (LoRA), FastAPI, sqlglot.
+**Main model:** Qwen2.5-Coder-3B-Instruct, fine-tuned with QLoRA on the PAUQ dataset.
+**Comparison:** ruT5-base baseline + GigaChat API.
+See `plan_VKR_text2sql_ru.md` for the full one-month work plan.
+---
+## Quick start (desktop)
+### 1. Installation
+```bash
+# Install uv (https://docs.astral.sh/uv/) if you don't have it yet
+pip install uv
+# Clone the repository and install the dependencies
+git clone <your-repo> ru2sql
+cd ru2sql
+uv venv
+.venv\Scripts\activate          # Windows
+# source .venv/bin/activate     # Linux/Mac
+uv pip install -e ".[dev]"
+```
+### 2. Configuration
+```bash
+copy .env.example .env          # Windows
+# cp .env.example .env          # Linux/Mac
+```
+Open `.env` and fill in the keys (at minimum `GIGACHAT_API_KEY` for the baseline comparison; the rest are optional).
+### 3. Download PAUQ
+```bash
+git clone https://github.com/ai-forever/pauq.git data/pauq_repo
+# Then place train.json/dev.json/test.json in data/pauq/
+# and the SQLite files in data/databases/{db_id}/{db_id}.sqlite
+```
+### 4. Tests
+```bash
+pytest -v
+```
+The tests for the `prompt`, `postprocess`, `metrics`, and `schema` modules should pass
+without downloading the model or the dataset.
+### 5. Running the API
+```bash
+uvicorn src.api.main:app --reload
+# Swagger UI: http://127.0.0.1:8000/docs
+```
+On first launch the Qwen2.5-Coder-3B model (~6 GB) is downloaded from the HuggingFace Hub.
+On CPU, inference takes 15–30 seconds per request; this is expected.
+### 6. Querying the API
+```bash
+curl -X POST http://127.0.0.1:8000/generate-sql \
+     -H "Content-Type: application/json" \
+     -d '{"question": "Сколько студентов на факультете ПИ?", "db_id": "university"}'
+```
+---
+## Training the model
+Training runs **in a Kaggle Notebook** (free T4 GPU). Training a 3B model locally
+on a CPU or AMD GPU is not feasible.
+Steps:
+1. Open `notebooks/kaggle_train_qwen_qlora.ipynb` on kaggle.com.
+2. In Settings, choose Accelerator: GPU T4 x1 (or x2 for speed).
+3. Add-ons → Secrets → add `HF_TOKEN` and `WANDB_API_KEY`.
+4. Run all cells. Training takes ~4–6 hours.
+5. When it finishes, the adapter is pushed to your private HF repo.
+6. Download it to your desktop:
+   ```bash
+   huggingface-cli download your-username/qwen-coder-pauq-lora \
+       --local-dir checkpoints/qwen-coder-pauq-lora
+   ```
+After that, `LORA_ADAPTER_PATH` in `.env` points to the downloaded adapter,
+and the API uses the fine-tuned model.
+---
+## Project structure
+```
+ru2sql/
+├── pyproject.toml              # dependencies (uv)
+├── .env.example                # configuration template
+├── plan_VKR_text2sql_ru.md     # one-month work plan
+├── notebooks/
+│   └── kaggle_train_qwen_qlora.ipynb
+├── src/
+│   ├── config.py               # settings via pydantic-settings
+│   ├── data/
+│   │   ├── loader.py           # reads PAUQ JSON
+│   │   ├── schema.py           # SchemaRetriever (DDL from SQLite)
+│   │   └── prompt.py           # PromptBuilder + chat template
+│   ├── models/
+│   │   ├── inference.py        # InferenceEngine (model + LoRA)
+│   │   └── postprocess.py      # SQL cleanup + sqlglot validation
+│   ├── evaluation/
+│   │   ├── metrics.py          # Exact Match + Execution Accuracy
+│   │   └── evaluate.py         # CLI for running on a split
+│   └── api/
+│       ├── main.py             # FastAPI app
+│       ├── schemas.py          # Pydantic models
+│       └── dependencies.py     # lifespan + DI
+└── tests/
+    ├── test_prompt.py
+    ├── test_postprocess.py
+    ├── test_metrics.py
+    └── test_schema.py
+```
+---
+## Running the evaluation
+```bash
+# Full run on the dev split
+python -m src.evaluation.evaluate --split dev
+# Quick check on 50 examples
+python -m src.evaluation.evaluate --split dev --limit 50
+```
+The results are saved to `results/predictions.jsonl`; the metrics are printed to stdout.
+---
+## Metrics (planned)
+| Model | EM | Execution Accuracy |
+|---|---|---|
+| ruT5-base (baseline) | 25–35% | 30–40% |
+| **Qwen2.5-Coder-3B + QLoRA** | **50–60%** | **55–70%** |
+| GigaChat API (zero-shot) | 55–70% | 65–80% |
+---
+## What is NOT in the MVP
+Deliberately left for the "future work" section:
+- Few-shot retrieval of similar examples.
+- Schema linking (automatic selection of relevant tables).
+- Self-correction based on SQL execution errors.
+- Constrained decoding (SQL grammar).
+- Fine-tuning on synthetic data.
+---
+## License and attribution
+Educational project. It uses:
+- PAUQ — Apache 2.0, https://github.com/ai-forever/pauq
+- Qwen2.5-Coder — Apache 2.0, https://huggingface.co/Qwen/Qwen2.5-Coder-3B-Instruct
+- ruT5 — MIT, https://huggingface.co/ai-forever/ruT5-base
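
The README names Exact Match and Execution Accuracy as the evaluation metrics. The sketch below shows one plausible way to compute them per example with the standard `sqlite3` module. It is an illustration, not the contents of `src/evaluation/metrics.py`, which this section does not show; the helper names and the tiny `students` table are invented for the example. The string normalization is deliberately crude (it also lowercases string literals), so treat it as a rough stand-in.

```python
import sqlite3


def normalize_sql(sql: str) -> str:
    """Collapse whitespace, drop a trailing semicolon, and lowercase the text."""
    return " ".join(sql.strip().rstrip(";").split()).lower()


def exact_match(pred: str, gold: str) -> bool:
    """String-level comparison after normalization."""
    return normalize_sql(pred) == normalize_sql(gold)


def execution_match(conn: sqlite3.Connection, pred: str, gold: str) -> bool:
    """True when both queries run and return the same rows, ignoring row order."""
    try:
        pred_rows = conn.execute(pred).fetchall()
    except sqlite3.Error:
        # A prediction that fails to execute counts as a miss.
        return False
    gold_rows = conn.execute(gold).fetchall()
    return sorted(map(repr, pred_rows)) == sorted(map(repr, gold_rows))


# Tiny in-memory database, invented for this example.
conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE students (id INTEGER PRIMARY KEY, faculty TEXT);
    INSERT INTO students (faculty) VALUES ('PI'), ('PI'), ('Math');
    """
)
gold = "SELECT COUNT(*) FROM students WHERE faculty = 'PI'"
pred = "select count(id) from students where faculty = 'PI';"

print(exact_match(pred, gold), execution_match(conn, pred, gold))
```

The two queries differ as strings but return the same rows, which is why Execution Accuracy is usually at least as high as Exact Match, as in the planned ranges in the table above.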

adapters/qwen-coder-pauq-lora/.gitattributes ADDED

	@@ -0,0 +1,36 @@

+*.7z filter=lfs diff=lfs merge=lfs -text
+*.arrow filter=lfs diff=lfs merge=lfs -text
+*.bin filter=lfs diff=lfs merge=lfs -text
+*.bz2 filter=lfs diff=lfs merge=lfs -text
+*.ckpt filter=lfs diff=lfs merge=lfs -text
+*.ftz filter=lfs diff=lfs merge=lfs -text
+*.gz filter=lfs diff=lfs merge=lfs -text
+*.h5 filter=lfs diff=lfs merge=lfs -text
+*.joblib filter=lfs diff=lfs merge=lfs -text
+*.lfs.* filter=lfs diff=lfs merge=lfs -text
+*.mlmodel filter=lfs diff=lfs merge=lfs -text
+*.model filter=lfs diff=lfs merge=lfs -text
+*.msgpack filter=lfs diff=lfs merge=lfs -text
+*.npy filter=lfs diff=lfs merge=lfs -text
+*.npz filter=lfs diff=lfs merge=lfs -text
+*.onnx filter=lfs diff=lfs merge=lfs -text
+*.ot filter=lfs diff=lfs merge=lfs -text
+*.parquet filter=lfs diff=lfs merge=lfs -text
+*.pb filter=lfs diff=lfs merge=lfs -text
+*.pickle filter=lfs diff=lfs merge=lfs -text
+*.pkl filter=lfs diff=lfs merge=lfs -text
+*.pt filter=lfs diff=lfs merge=lfs -text
+*.pth filter=lfs diff=lfs merge=lfs -text
+*.rar filter=lfs diff=lfs merge=lfs -text
+*.safetensors filter=lfs diff=lfs merge=lfs -text
+saved_model/**/* filter=lfs diff=lfs merge=lfs -text
+*.tar.* filter=lfs diff=lfs merge=lfs -text
+*.tar filter=lfs diff=lfs merge=lfs -text
+*.tflite filter=lfs diff=lfs merge=lfs -text
+*.tgz filter=lfs diff=lfs merge=lfs -text
+*.wasm filter=lfs diff=lfs merge=lfs -text
+*.xz filter=lfs diff=lfs merge=lfs -text
+*.zip filter=lfs diff=lfs merge=lfs -text
+*.zst filter=lfs diff=lfs merge=lfs -text
+*tfevents* filter=lfs diff=lfs merge=lfs -text
+tokenizer.json filter=lfs diff=lfs merge=lfs -text

adapters/qwen-coder-pauq-lora/README.md ADDED

	@@ -0,0 +1,199 @@

+---
+library_name: transformers
+tags: []
+---
+# Model Card for Model ID
+<!-- Provide a quick summary of what the model is/does. -->
+## Model Details
+### Model Description
+<!-- Provide a longer summary of what this model is. -->
+This is the model card of a 🤗 transformers model that has been pushed on the Hub. This model card has been automatically generated.
+- **Developed by:** [More Information Needed]
+- **Funded by [optional]:** [More Information Needed]
+- **Shared by [optional]:** [More Information Needed]
+- **Model type:** [More Information Needed]
+- **Language(s) (NLP):** [More Information Needed]
+- **License:** [More Information Needed]
+- **Finetuned from model [optional]:** [More Information Needed]
+### Model Sources [optional]
+<!-- Provide the basic links for the model. -->
+- **Repository:** [More Information Needed]
+- **Paper [optional]:** [More Information Needed]
+- **Demo [optional]:** [More Information Needed]
+## Uses
+<!-- Address questions around how the model is intended to be used, including the foreseeable users of the model and those affected by the model. -->
+### Direct Use
+<!-- This section is for the model use without fine-tuning or plugging into a larger ecosystem/app. -->
+[More Information Needed]
+### Downstream Use [optional]
+<!-- This section is for the model use when fine-tuned for a task, or when plugged into a larger ecosystem/app -->
+[More Information Needed]
+### Out-of-Scope Use
+<!-- This section addresses misuse, malicious use, and uses that the model will not work well for. -->
+[More Information Needed]
+## Bias, Risks, and Limitations
+<!-- This section is meant to convey both technical and sociotechnical limitations. -->
+[More Information Needed]
+### Recommendations
+<!-- This section is meant to convey recommendations with respect to the bias, risk, and technical limitations. -->
+Users (both direct and downstream) should be made aware of the risks, biases and limitations of the model. More information needed for further recommendations.
+## How to Get Started with the Model
+Use the code below to get started with the model.
+[More Information Needed]
+## Training Details
+### Training Data
+<!-- This should link to a Dataset Card, perhaps with a short stub of information on what the training data is all about as well as documentation related to data pre-processing or additional filtering. -->
+[More Information Needed]
+### Training Procedure
+<!-- This relates heavily to the Technical Specifications. Content here should link to that section when it is relevant to the training procedure. -->
+#### Preprocessing [optional]
+[More Information Needed]
+#### Training Hyperparameters
+- **Training regime:** [More Information Needed] <!--fp32, fp16 mixed precision, bf16 mixed precision, bf16 non-mixed precision, fp16 non-mixed precision, fp8 mixed precision -->
+#### Speeds, Sizes, Times [optional]
+<!-- This section provides information about throughput, start/end time, checkpoint size if relevant, etc. -->
+[More Information Needed]
+## Evaluation
+<!-- This section describes the evaluation protocols and provides the results. -->
+### Testing Data, Factors & Metrics
+#### Testing Data
+<!-- This should link to a Dataset Card if possible. -->
+[More Information Needed]
+#### Factors
+<!-- These are the things the evaluation is disaggregating by, e.g., subpopulations or domains. -->
+[More Information Needed]
+#### Metrics
+<!-- These are the evaluation metrics being used, ideally with a description of why. -->
+[More Information Needed]
+### Results
+[More Information Needed]
+#### Summary
+## Model Examination [optional]
+<!-- Relevant interpretability work for the model goes here -->
+[More Information Needed]
+## Environmental Impact
+<!-- Total emissions (in grams of CO2eq) and additional considerations, such as electricity usage, go here. Edit the suggested text below accordingly -->
+Carbon emissions can be estimated using the [Machine Learning Impact calculator](https://mlco2.github.io/impact#compute) presented in [Lacoste et al. (2019)](https://arxiv.org/abs/1910.09700).
+- **Hardware Type:** [More Information Needed]
+- **Hours used:** [More Information Needed]
+- **Cloud Provider:** [More Information Needed]
+- **Compute Region:** [More Information Needed]
+- **Carbon Emitted:** [More Information Needed]
+## Technical Specifications [optional]
+### Model Architecture and Objective
+[More Information Needed]
+### Compute Infrastructure
+[More Information Needed]
+#### Hardware
+[More Information Needed]
+#### Software
+[More Information Needed]
+## Citation [optional]
+<!-- If there is a paper or blog post introducing the model, the APA and Bibtex information for that should go in this section. -->
+**BibTeX:**
+[More Information Needed]
+**APA:**
+[More Information Needed]
+## Glossary [optional]
+<!-- If relevant, include terms and calculations in this section that can help readers understand the model or model card. -->
+[More Information Needed]
+## More Information [optional]
+[More Information Needed]
+## Model Card Authors [optional]
+[More Information Needed]
+## Model Card Contact
+[More Information Needed]

adapters/qwen-coder-pauq-lora/adapter_config.json ADDED

	@@ -0,0 +1,48 @@

+{
+  "alora_invocation_tokens": null,
+  "alpha_pattern": {},
+  "arrow_config": null,
+  "auto_mapping": null,
+  "base_model_name_or_path": "Qwen/Qwen2.5-Coder-3B-Instruct",
+  "bias": "none",
+  "corda_config": null,
+  "ensure_weight_tying": false,
+  "eva_config": null,
+  "exclude_modules": null,
+  "fan_in_fan_out": false,
+  "inference_mode": true,
+  "init_lora_weights": true,
+  "layer_replication": null,
+  "layers_pattern": null,
+  "layers_to_transform": null,
+  "loftq_config": {},
+  "lora_alpha": 32,
+  "lora_bias": false,
+  "lora_dropout": 0.05,
+  "lora_ga_config": null,
+  "megatron_config": null,
+  "megatron_core": "megatron.core",
+  "modules_to_save": null,
+  "peft_type": "LORA",
+  "peft_version": "0.19.1",
+  "qalora_group_size": 16,
+  "r": 16,
+  "rank_pattern": {},
+  "revision": null,
+  "target_modules": [
+    "up_proj",
+    "v_proj",
+    "gate_proj",
+    "k_proj",
+    "o_proj",
+    "q_proj",
+    "down_proj"
+  ],
+  "target_parameters": null,
+  "task_type": "CAUSAL_LM",
+  "trainable_token_indices": null,
+  "use_bdlora": null,
+  "use_dora": false,
+  "use_qalora": false,
+  "use_rslora": false
+}
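
For orientation, the `r` and `lora_alpha` values in this config determine how strongly the low-rank update is scaled, and `r` also sets how many parameters each adapted layer trains. The sketch below works through that arithmetic under the standard LoRA formulation; the function names and the 2048×2048 layer shape are assumptions for illustration, not values read from this config.

```python
def lora_scaling(r: int, lora_alpha: int, use_rslora: bool = False) -> float:
    """Scale applied to the low-rank update: alpha / r, or alpha / sqrt(r) with rank-stabilized LoRA."""
    return lora_alpha / (r ** 0.5) if use_rslora else lora_alpha / r


def lora_param_count(d_in: int, d_out: int, r: int) -> int:
    """Trainable parameters for one adapted linear layer: A is (r x d_in), B is (d_out x r)."""
    return r * (d_in + d_out)


# Values from adapter_config.json: r=16, lora_alpha=32, use_rslora=false.
print(lora_scaling(16, 32))

# Hypothetical square projection, used only to show the arithmetic.
print(lora_param_count(2048, 2048, 16))
```

With `r=16` and `lora_alpha=32` the update is scaled by 2, and each adapted layer of the assumed shape adds 65,536 trainable parameters.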

adapters/qwen-coder-pauq-lora/chat_template.jinja ADDED

	@@ -0,0 +1,54 @@

+{%- if tools %}
+    {{- '<|im_start|>system\n' }}
+    {%- if messages[0]['role'] == 'system' %}
+        {{- messages[0]['content'] }}
+    {%- else %}
+        {{- 'You are Qwen, created by Alibaba Cloud. You are a helpful assistant.' }}
+    {%- endif %}
+    {{- "\n\n# Tools\n\nYou may call one or more functions to assist with the user query.\n\nYou are provided with function signatures within <tools></tools> XML tags:\n<tools>" }}
+    {%- for tool in tools %}
+        {{- "\n" }}
+        {{- tool | tojson }}
+    {%- endfor %}
+    {{- "\n</tools>\n\nFor each function call, return a json object with function name and arguments within <tool_call></tool_call> XML tags:\n<tool_call>\n{\"name\": <function-name>, \"arguments\": <args-json-object>}\n</tool_call><|im_end|>\n" }}
+{%- else %}
+    {%- if messages[0]['role'] == 'system' %}
+        {{- '<|im_start|>system\n' + messages[0]['content'] + '<|im_end|>\n' }}
+    {%- else %}
+        {{- '<|im_start|>system\nYou are Qwen, created by Alibaba Cloud. You are a helpful assistant.<|im_end|>\n' }}
+    {%- endif %}
+{%- endif %}
+{%- for message in messages %}
+    {%- if (message.role == "user") or (message.role == "system" and not loop.first) or (message.role == "assistant" and not message.tool_calls) %}
+        {{- '<|im_start|>' + message.role + '\n' + message.content + '<|im_end|>' + '\n' }}
+    {%- elif message.role == "assistant" %}
+        {{- '<|im_start|>' + message.role }}
+        {%- if message.content %}
+            {{- '\n' + message.content }}
+        {%- endif %}
+        {%- for tool_call in message.tool_calls %}
+            {%- if tool_call.function is defined %}
+                {%- set tool_call = tool_call.function %}
+            {%- endif %}
+            {{- '\n<tool_call>\n{"name": "' }}
+            {{- tool_call.name }}
+            {{- '", "arguments": ' }}
+            {{- tool_call.arguments | tojson }}
+            {{- '}\n</tool_call>' }}
+        {%- endfor %}
+        {{- '<|im_end|>\n' }}
+    {%- elif message.role == "tool" %}
+        {%- if (loop.index0 == 0) or (messages[loop.index0 - 1].role != "tool") %}
+            {{- '<|im_start|>user' }}
+        {%- endif %}
+        {{- '\n<tool_response>\n' }}
+        {{- message.content }}
+        {{- '\n</tool_response>' }}
+        {%- if loop.last or (messages[loop.index0 + 1].role != "tool") %}
+            {{- '<|im_end|>\n' }}
+        {%- endif %}
+    {%- endif %}
+{%- endfor %}
+{%- if add_generation_prompt %}
+    {{- '<|im_start|>assistant\n' }}
+{%- endif %}
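
To make the template easier to follow, here is a plain-Python sketch of what its no-tools branch produces for ordinary system, user, and assistant messages. It ignores tool calls and tool responses entirely, and `render_chatml` is a hypothetical helper name; in practice the tokenizer's `apply_chat_template` performs this rendering.

```python
DEFAULT_SYSTEM = "You are Qwen, created by Alibaba Cloud. You are a helpful assistant."


def render_chatml(messages, add_generation_prompt=False):
    """Render plain chat messages the way the template's no-tools branch does."""
    if messages and messages[0]["role"] == "system":
        system, rest = messages[0]["content"], messages[1:]
    else:
        # Without a leading system message the template inserts the default one.
        system, rest = DEFAULT_SYSTEM, messages
    parts = [f"<|im_start|>system\n{system}<|im_end|>\n"]
    for message in rest:
        parts.append(f"<|im_start|>{message['role']}\n{message['content']}<|im_end|>\n")
    if add_generation_prompt:
        # Leaves the assistant turn open so the model continues from here.
        parts.append("<|im_start|>assistant\n")
    return "".join(parts)


print(render_chatml([{"role": "user", "content": "hi"}], add_generation_prompt=True))
```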

adapters/qwen-coder-pauq-lora/tokenizer.json ADDED

	@@ -0,0 +1,3 @@

+version https://git-lfs.github.com/spec/v1
+oid sha256:3fd169731d2cbde95e10bf356d66d5997fd885dd8dbb6fb4684da3f23b2585d8
+size 11421892

adapters/qwen-coder-pauq-lora/tokenizer_config.json ADDED

	@@ -0,0 +1,30 @@

+{
+  "add_prefix_space": false,
+  "backend": "tokenizers",
+  "bos_token": null,
+  "clean_up_tokenization_spaces": false,
+  "eos_token": "<|im_end|>",
+  "errors": "replace",
+  "extra_special_tokens": [
+    "<|im_start|>",
+    "<|im_end|>",
+    "<|object_ref_start|>",
+    "<|object_ref_end|>",
+    "<|box_start|>",
+    "<|box_end|>",
+    "<|quad_start|>",
+    "<|quad_end|>",
+    "<|vision_start|>",
+    "<|vision_end|>",
+    "<|vision_pad|>",
+    "<|image_pad|>",
+    "<|video_pad|>"
+  ],
+  "is_local": false,
+  "local_files_only": false,
+  "model_max_length": 32768,
+  "pad_token": "<|endoftext|>",
+  "split_special_tokens": false,
+  "tokenizer_class": "Qwen2Tokenizer",
+  "unk_token": null
+}

configs/example_vocabulary.yaml ADDED

	@@ -0,0 +1,33 @@

+# Company business vocabulary: a sample to fill in
+# Copy this file, rename it for your company, and fill in your own terms.
+# The path to the file is passed when the utility is launched.
+company: "ООО Ромашка"
+# Business terms and metrics
+# Key: the word or phrase as the analyst says it
+# Value: what it means in terms of SQL / data
+terms:
+  выручка: "SUM(orders.amount) при условии orders.status = 'paid'"
+  оборот: "SUM(orders.amount) по всем заказам включая отменённые"
+  активный клиент: "клиент, совершивший хотя бы одну покупку за последние 90 дней"
+  новый клиент: "клиент, зарегистрированный менее 30 дней назад"
+  этот год: "YEAR(order_date) совпадает с текущим годом"
+  прошлый месяц: "месяц предшествующий текущему"
+  этот квартал: "текущий квартал календарного года (Q1=янв-март, Q2=апр-июн и т.д.)"
+  средний чек: "AVG(orders.amount) по оплаченным заказам"
+  конверсия: "доля оплаченных заказов от общего числа"
+# Standard filter conditions (applied by default unless the analyst explicitly says otherwise)
+filters:
+  только_оплаченные: "orders.status = 'paid'"
+  без_возвратов: "orders.is_return = 0 или orders.is_return IS NULL"
+  только_активные_товары: "products.is_active = 1"
+# Additional rules and schema specifics
+notes:
+  - "Таблица orders содержит все заказы. Колонка amount — сумма в рублях."
+  - "Клиенты хранятся в таблице customers, товары — в products."
+  - "Связь заказ-товар через таблицу order_items (order_id, product_id, quantity, price)."
+  - "Даты хранятся в формате YYYY-MM-DD в колонке order_date."
+  - "Менеджеры хранятся в таблице managers, связь с заказами через orders.manager_id."
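
A vocabulary like this presumably ends up as extra context in the prompt. The sketch below shows one way such a mapping could be flattened into text. It is an assumption-laden illustration: `src/business/vocabulary.py` is not shown in this section, `render_vocabulary` is a hypothetical name, and the dict stands in for the parsed YAML so that the example needs no third-party parser.

```python
def render_vocabulary(vocab: dict) -> str:
    """Flatten a vocabulary mapping into plain text that could be prepended to a prompt."""
    lines = [f"Company: {vocab.get('company', '')}"]
    for section in ("terms", "filters"):
        for key, meaning in vocab.get(section, {}).items():
            lines.append(f"- {key}: {meaning}")
    lines.extend(f"* {note}" for note in vocab.get("notes", []))
    return "\n".join(lines)


# A few entries copied from the example file above.
vocab = {
    "company": "ООО Ромашка",
    "terms": {"выручка": "SUM(orders.amount) при условии orders.status = 'paid'"},
    "filters": {"только_оплаченные": "orders.status = 'paid'"},
    "notes": ["Даты хранятся в формате YYYY-MM-DD в колонке order_date."],
}

print(render_vocabulary(vocab))
```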

data/demo/sales.sqlite ADDED

Binary file (57.3 kB).

data/demo/sales.sqlite-journal ADDED

Binary file (512 Bytes).

data/demo/test.db ADDED

Binary file (8.19 kB).

data/demo/test.db-journal ADDED

Binary file (512 Bytes).

data/pauq_repo ADDED

	@@ -0,0 +1 @@


+Subproject commit 1c4a286e30c883f9b9bd5ca59b27cee76d4544ab

notebooks/kaggle_train_qwen_qlora.ipynb ADDED

	@@ -0,0 +1,428 @@

+{
+ "cells": [
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "# Training Qwen2.5-Coder-3B on PAUQ with QLoRA\n",
+    "\n",
+    "**Where to run:** a Kaggle Notebook with a T4 GPU (Settings → Accelerator → GPU T4 x2 or T4 x1).\n",
+    "\n",
+    "**What we do:**\n",
+    "1. Install the dependencies.\n",
+    "2. Download PAUQ.\n",
+    "3. Load Qwen2.5-Coder-3B in 4-bit.\n",
+    "4. Prepare the dataset in chat format.\n",
+    "5. Fine-tune a LoRA adapter with `SFTTrainer`.\n",
+    "6. Save the adapter locally and (optionally) push it to the HuggingFace Hub.\n",
+    "\n",
+    "**Time on a T4:** ~2–3 hours per epoch (assuming ~10k examples, max_seq_length=1024)."
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "## 1. Installing dependencies"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "!pip install -q -U \\\n",
+    "    transformers==4.44.2 \\\n",
+    "    peft==0.12.0 \\\n",
+    "    accelerate==0.33.0 \\\n",
+    "    bitsandbytes==0.43.3 \\\n",
+    "    trl==0.10.1 \\\n",
+    "    datasets==2.20.0 \\\n",
+    "    sqlglot==25.5.1 \\\n",
+    "    wandb"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "## 2. HuggingFace and W&B authentication\n",
+    "\n",
+    "In Kaggle, add the secrets `HF_TOKEN` and `WANDB_API_KEY` via Add-ons → Secrets. They will then be picked up automatically."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "import os\n",
+    "from kaggle_secrets import UserSecretsClient\n",
+    "\n",
+    "secrets = UserSecretsClient()\n",
+    "os.environ[\"HF_TOKEN\"] = secrets.get_secret(\"HF_TOKEN\")\n",
+    "os.environ[\"WANDB_API_KEY\"] = secrets.get_secret(\"WANDB_API_KEY\")\n",
+    "\n",
+    "from huggingface_hub import login\n",
+    "login(token=os.environ[\"HF_TOKEN\"])\n",
+    "\n",
+    "import wandb\n",
+    "wandb.login()"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "## 3. Downloading PAUQ\n",
+    "\n",
+    "Alternative: upload PAUQ as a Kaggle Dataset and attach it via `/kaggle/input/`."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "!git clone https://github.com/ai-forever/pauq.git /kaggle/working/pauq_repo\n",
+    "!ls /kaggle/working/pauq_repo"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "import json\n",
+    "from pathlib import Path\n",
+    "\n",
+    "# The exact paths depend on the repository layout. Locate the train/dev/test files:\n",
+    "for p in Path(\"/kaggle/working/pauq_repo\").rglob(\"*.json\"):\n",
+    "    print(p)"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "# UPDATE the paths after checking the `ls` output above\n",
+    "TRAIN_JSON = Path(\"/kaggle/working/pauq_repo/path/to/train.json\")\n",
+    "DEV_JSON = Path(\"/kaggle/working/pauq_repo/path/to/dev.json\")\n",
+    "DATABASES_DIR = Path(\"/kaggle/working/pauq_repo/path/to/databases\")\n",
+    "\n",
+    "with TRAIN_JSON.open(encoding=\"utf-8\") as f:\n",
+    "    train_raw = json.load(f)\n",
+    "with DEV_JSON.open(encoding=\"utf-8\") as f:\n",
+    "    dev_raw = json.load(f)\n",
+    "\n",
+    "print(f\"train: {len(train_raw)}, dev: {len(dev_raw)}\")\n",
+    "print(\"Example:\", train_raw[0])"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "## 4. SchemaRetriever and PromptBuilder (inline)\n",
+    "\n",
+    "Kaggle does not have our `src/` package, so we copy the minimum required code right here."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "import sqlite3\n",
+    "from functools import lru_cache\n",
+    "\n",
+    "SYSTEM_PROMPT = (\n",
+    "    \"Ты — ассистент, который преобразует вопросы на русском языке в корректные SQL-запросы. \"\n",
+    "    \"Тебе даётся схема базы данных в виде CREATE TABLE statements и пример нескольких строк. \"\n",
+    "    \"Сгенерируй один SQL-запрос, который отвечает на вопрос пользователя. \"\n",
+    "    \"Возвращай ТОЛЬКО SQL без объяснений, без markdown, без префиксов.\"\n",
+    ")\n",
+    "\n",
+    "@lru_cache(maxsize=512)\n",
+    "def render_schema(db_id: str, n_samples: int = 2) -> str:\n",
+    "    db_path = DATABASES_DIR / db_id / f\"{db_id}.sqlite\"\n",
+    "    if not db_path.exists():\n",
+    "        return \"\"\n",
+    "    conn = sqlite3.connect(f\"file:{db_path}?mode=ro\", uri=True)\n",
+    "    conn.text_factory = lambda b: b.decode(\"utf-8\", errors=\"replace\")\n",
+    "    cur = conn.cursor()\n",
+    "    cur.execute(\"SELECT name, sql FROM sqlite_master WHERE type='table' AND name NOT LIKE 'sqlite_%'\")\n",
+    "    parts = []\n",
+    "    for name, ddl in cur.fetchall():\n",
+    "        if not ddl:\n",
+    "            continue\n",
+    "        parts.append(ddl.strip() + \";\")\n",
+    "        try:\n",
+    "            cur.execute(f'SELECT * FROM \"{name}\" LIMIT {n_samples}')\n",
+    "            rows = cur.fetchall()\n",
+    "            for r in rows:\n",
+    "                parts.append(f\"-- {r}\")\n",
+    "        except sqlite3.Error:\n",
+    "            pass\n",
+    "        parts.append(\"\")\n",
+    "    conn.close()\n",
+    "    return \"\\n\".join(parts).strip()\n",
+    "\n",
+    "def build_messages(schema: str, question: str, sql: str | None = None):\n",
+    "    user = f\"### Schema:\\n{schema}\\n\\n### Question:\\n{question}\\n\\n### SQL:\\n\"\n",
+    "    msgs = [\n",
+    "        {\"role\": \"system\", \"content\": SYSTEM_PROMPT},\n",
+    "        {\"role\": \"user\", \"content\": user},\n",
+    "    ]\n",
+    "    if sql is not None:\n",
+    "        msgs.append({\"role\": \"assistant\", \"content\": sql.strip()})\n",
+    "    return msgs"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "## 5. Preparing the dataset for SFT"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "from datasets import Dataset\n",
+    "\n",
+    "def to_record(item):\n",
+    "    q = item.get(\"question\") or item.get(\"question_ru\") or \"\"\n",
+    "    sql = item.get(\"query\") or item.get(\"sql_query\") or item.get(\"sql\") or \"\"\n",
+    "    db_id = item.get(\"db_id\") or item.get(\"database\") or \"\"\n",
+    "    if not (q and sql and db_id):\n",
+    "        return None\n",
+    "    schema = render_schema(db_id)\n",
+    "    if not schema:\n",
+    "        return None\n",
+    "    return {\"messages\": build_messages(schema, q.strip(), sql.strip())}\n",
+    "\n",
+    "train_records = [r for r in (to_record(x) for x in train_raw) if r]\n",
+    "dev_records = [r for r in (to_record(x) for x in dev_raw) if r]\n",
+    "print(f\"train usable: {len(train_records)}, dev usable: {len(dev_records)}\")\n",
+    "\n",
+    "train_ds = Dataset.from_list(train_records)\n",
+    "dev_ds = Dataset.from_list(dev_records)"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "## 6. Loading the model in 4-bit"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "import torch\n",
+    "from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig\n",
+    "\n",
+    "MODEL_NAME = \"Qwen/Qwen2.5-Coder-3B-Instruct\"\n",
+    "\n",
+    "bnb_config = BitsAndBytesConfig(\n",
+    "    load_in_4bit=True,\n",
+    "    bnb_4bit_quant_type=\"nf4\",\n",
+    "    bnb_4bit_compute_dtype=torch.bfloat16,\n",
+    "    bnb_4bit_use_double_quant=True,\n",
+    ")\n",
+    "\n",
+    "tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)\n",
+    "if tokenizer.pad_token is None:\n",
+    "    tokenizer.pad_token = tokenizer.eos_token\n",
+    "\n",
+    "model = AutoModelForCausalLM.from_pretrained(\n",
+    "    MODEL_NAME,\n",
+    "    quantization_config=bnb_config,\n",
+    "    device_map=\"auto\",\n",
+    "    torch_dtype=torch.bfloat16,\n",
+    ")\n",
+    "model.config.use_cache = False"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "## 7. LoRA config"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "from peft import LoraConfig, prepare_model_for_kbit_training\n",
+    "\n",
+    "model = prepare_model_for_kbit_training(model)\n",
+    "\n",
+    "lora_config = LoraConfig(\n",
+    "    r=16,\n",
+    "    lora_alpha=32,\n",
+    "    lora_dropout=0.05,\n",
+    "    bias=\"none\",\n",
+    "    task_type=\"CAUSAL_LM\",\n",
+    "    target_modules=[\"q_proj\", \"k_proj\", \"v_proj\", \"o_proj\",\n",
+    "                    \"gate_proj\", \"up_proj\", \"down_proj\"],\n",
+    ")"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "## 8. Training with SFTTrainer"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "from trl import SFTConfig, SFTTrainer\n",
+    "\n",
+    "OUTPUT_DIR = \"/kaggle/working/qwen-coder-pauq-lora\"\n",
+    "\n",
+    "sft_config = SFTConfig(\n",
+    "    output_dir=OUTPUT_DIR,\n",
+    "    num_train_epochs=2,\n",
+    "    per_device_train_batch_size=1,\n",
+    "    gradient_accumulation_steps=8,\n",
+    "    gradient_checkpointing=True,\n",
+    "    learning_rate=2e-4,\n",
+    "    lr_scheduler_type=\"cosine\",\n",
+    "    warmup_ratio=0.03,\n",
+    "    optim=\"paged_adamw_8bit\",\n",
+    "    bf16=True,\n",
+    "    logging_steps=20,\n",
+    "    save_strategy=\"epoch\",\n",
+    "    save_total_limit=2,\n",
+    "    eval_strategy=\"no\",  # evaluation runs separately after training\n",
+    "    max_seq_length=1024,\n",
+    "    packing=False,\n",
+    "    report_to=\"wandb\",\n",
+    "    run_name=\"qwen3b-pauq-qlora\",\n",
+    ")\n",
+    "\n",
+    "trainer = SFTTrainer(\n",
+    "    model=model,\n",
+    "    tokenizer=tokenizer,\n",
+    "    train_dataset=train_ds,\n",
+    "    peft_config=lora_config,\n",
+    "    args=sft_config,\n",
+    ")\n",
+    "\n",
+    "trainer.train()"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "trainer.save_model(OUTPUT_DIR)\n",
+    "tokenizer.save_pretrained(OUTPUT_DIR)\n",
+    "print(\"Saved to\", OUTPUT_DIR)"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "## 9. Quick inference check"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "model.config.use_cache = True\n",
+    "model.eval()\n",
+    "\n",
+    "ex = dev_records[0]\n",
+    "prompt_msgs = ex[\"messages\"][:2]  # without the assistant reply\n",
+    "prompt = tokenizer.apply_chat_template(prompt_msgs, tokenize=False, add_generation_prompt=True)\n",
+    "inputs = tokenizer(prompt, return_tensors=\"pt\").to(model.device)\n",
+    "\n",
+    "with torch.no_grad():\n",
+    "    out = model.generate(**inputs, max_new_tokens=256, do_sample=False,\n",
+    "                          pad_token_id=tokenizer.eos_token_id)\n",
+    "new_tokens = out[0][inputs[\"input_ids\"].shape[1]:]\n",
+    "print(\"Pred:\", tokenizer.decode(new_tokens, skip_special_tokens=True))\n",
+    "print(\"Gold:\", ex[\"messages\"][2][\"content\"])"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "## 10. Uploading the adapter to the HuggingFace Hub (private repo)"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "HF_REPO = \"your-username/qwen-coder-pauq-lora\"  # replace with your own\n",
+    "\n",
+    "trainer.model.push_to_hub(HF_REPO, private=True)\n",
+    "tokenizer.push_to_hub(HF_REPO, private=True)\n",
+    "print(\"Pushed to\", HF_REPO)"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "## Next steps\n",
+    "\n",
+    "1. Download the adapter to the desktop: `huggingface-cli download your-username/qwen-coder-pauq-lora --local-dir checkpoints/qwen-coder-pauq-lora`.\n",
+    "2. Run `python -m src.evaluation.evaluate --split dev --limit 100` locally, or run the full eval right here on Kaggle.\n",
+    "3. If the metrics are low: check the prompt format, increase the number of epochs, lower the learning rate."
+   ]
+  }
+ ],
+ "metadata": {
+  "kernelspec": {
+   "display_name": "Python 3",
+   "language": "python",
+   "name": "python3"
+  },
+  "language_info": {
+   "codemirror_mode": {"name": "ipython", "version": 3},
+   "file_extension": ".py",
+   "mimetype": "text/x-python",
+   "name": "python",
+   "nbconvert_exporter": "python",
+   "pygments_lexer": "ipython3",
+   "version": "3.10"
+  }
+ },
+ "nbformat": 4,
+ "nbformat_minor": 4
+}
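As a rough sanity check on the hyperparameters used in the notebook's LoRA and SFT cells (`r=16`, `lora_alpha=32`, batch size 1 with 8 accumulation steps), the arithmetic can be sketched as below. The layer dimensions are assumptions for illustration only and should be verified against `model.config` of the base model actually loaded; the helper names are hypothetical and not part of the repository.

```python
# Assumed layer dimensions (NOT taken from the notebook): verify them
# against model.config of the base model before relying on the numbers.
HIDDEN = 2048
KV_DIM = 256
INTERMEDIATE = 11008
NUM_LAYERS = 36

# (in_features, out_features) for every module listed in target_modules.
TARGET_SHAPES = {
    "q_proj": (HIDDEN, HIDDEN),
    "k_proj": (HIDDEN, KV_DIM),
    "v_proj": (HIDDEN, KV_DIM),
    "o_proj": (HIDDEN, HIDDEN),
    "gate_proj": (HIDDEN, INTERMEDIATE),
    "up_proj": (HIDDEN, INTERMEDIATE),
    "down_proj": (INTERMEDIATE, HIDDEN),
}


def lora_params_per_layer(shapes, r):
    """Each adapted linear layer adds A (r x in) and B (out x r) matrices."""
    return sum(r * (d_in + d_out) for d_in, d_out in shapes.values())


def lora_scaling(lora_alpha, r):
    """Standard LoRA scales the low-rank update by alpha / r."""
    return lora_alpha / r


def effective_batch_size(per_device, grad_accum, n_devices=1):
    """Number of examples contributing to one optimizer step."""
    return per_device * grad_accum * n_devices


trainable = NUM_LAYERS * lora_params_per_layer(TARGET_SHAPES, r=16)
print("trainable LoRA params:", trainable)
print("LoRA scaling:", lora_scaling(32, 16))
print("effective batch size:", effective_batch_size(1, 8))
```

Under these assumed dimensions the adapter stays in the tens of millions of parameters, which is why only the adapter (not the full model) needs to be saved and pushed to the Hub.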

plan_VKR_text2sql_ru.md ADDED Viewed

	@@ -0,0 +1,264 @@

+# Plan for the practical part of the graduation thesis (VKR): "Natural Language → SQL utility for business analytics"
+**Student:** Danis, PI program, 4th year
+**Timeframe:** 4 weeks
+**Date:** April 29, 2026
+---
+## 0. Solution outline
+**Final product:** a utility that lets an analyst at a small or medium-sized business ask questions in Russian and get ready-to-use data from the corporate database, with no SQL knowledge required.
+Pipeline: question in Russian → company business vocabulary → DB schema → SQL → execution → result.
+Approach: fine-tune **Qwen2.5-Coder-3B-Instruct** with QLoRA on the **PAUQ** dataset and wrap it in **FastAPI**, with additional modules for connecting to an arbitrary database, a configurable business vocabulary, and a Streamlit web interface.
+For the scientific comparison, the **GigaChat API** (or OpenAI) and a **ruT5-base** baseline are run in parallel.
+Infrastructure:
+- Training: **Kaggle Notebooks** (T4 16 GB, free).
+- Code and API development: **desktop**, Ryzen 5 3600X + 16 GB RAM.
+- Demo at the defense: **laptop**, Ryzen 5 5500U + 16 GB RAM, CPU inference.
+Thesis artifacts:
+- A working utility with a web interface (Streamlit)
+- A module for connecting to an arbitrary database (SQLite / PostgreSQL / MySQL)
+- A business vocabulary module (YAML config with the company's metric definitions)
+- A comparison table of metrics (EM, Execution Accuracy)
+- Error analysis on 30+ examples
+---
+## 1. Technology stack
+### 1.1 Development environment
+| Component | Choice |
+|---|---|
+| Language | Python 3.10+ |
+| Package manager | uv (fast, modern) |
+| Version control | Git + GitHub |
+| IDE | VS Code |
+### 1.2 ML and training
+| Component | Role | Where it is used |
+|---|---|---|
+| PyTorch 2.x | foundation | Kaggle |
+| transformers | models and tokenization | Kaggle + desktop |
+| peft | LoRA/QLoRA | Kaggle + desktop (loading the adapter) |
+| bitsandbytes | 4-bit quantization | Kaggle (not needed on CPU) |
+| trl | SFTTrainer | Kaggle |
+| datasets | working with PAUQ | Kaggle + desktop |
+| W&B | experiment logging | Kaggle |
+### 1.3 Inference on the desktop and laptop
+There are two options for local inference without a GPU:
+| Option | Speed | Complexity | Use |
+|---|---|---|---|
+| transformers on CPU (int8) | 15–30 s/query | simpler | development, debugging |
+| llama.cpp (gguf int4) | 5–15 s/query | harder | final demo |
+**Recommendation:** transformers for development, llama.cpp for the defense.
+### 1.4 API and SQL
+| Component | Choice |
+|---|---|
+| FastAPI + Uvicorn | REST API |
+| Pydantic v2 | validation |
+| sqlite3 (stdlib) | working with the PAUQ databases |
+| sqlglot | SQL parsing and validation |
+| pytest | tests |
+---
+## 2. Architecture
+```
+┌──────────────────────────────────────────────────────────────┐
+│              Streamlit Web Interface                         │
+│  Question field | DB selector | Business vocabulary editor   │
+│  Results table | Query history                               │
+└──────────────────────────┬───────────────────────────────────┘
+                           │ HTTP
+┌──────────────────────────▼───────────────────────────────────┐
+│                  FastAPI REST API                            │
+│  POST /query  {question_ru, db_id} → {sql, result, ...}      │
+└──────┬──────────────┬───────────────┬────────────────────────┘
+       │              │               │
+       ▼              ▼               ▼
+┌────────────┐ ┌────────────┐ ┌─────────────────┐
+│ DbConnector│ │ Business   │ │ SchemaRetriever │
+│ SQLite /   │ │ Vocabulary │ │ (DDL from DB)   │
+│ Postgres / │ │ (YAML-     │ └────────┬────────┘
+│ MySQL      │ │ config)    │          │
+└─────┬──────┘ └─────┬──────┘          │
+      │              │                 │
+      │         ┌────▼─────────────────▼──┐
+      │         │      PromptBuilder      │
+      │         │  question + schema +    │
+      │         │  metric definitions     │
+      │         └────────────┬────────────┘
+      │                      ▼
+      │         ┌────────────────────────┐
+      │         │    InferenceEngine     │
+      │         │  Qwen2.5-Coder-3B      │
+      │         │  + LoRA adapter        │
+      │         └────────────┬───────────┘
+      │                      ▼
+      │         ┌────────────────────────┐
+      │         │   SqlPostProcessor     │
+      │         │   (sqlglot validation) │
+      │         └────────────┬───────────┘
+      │                      │
+      └──────────────────────┘
+                      │ execute SQL
+                      ▼
+             ┌─────────────────┐
+             │   SqlExecutor   │
+             │  result →       │
+             │  to the analyst │
+             └─────────────────┘
+```
+Project structure (see the files in the repository):
+```
+ru2sql/
+├── README.md
+├── pyproject.toml
+├── .gitignore
+├── notebooks/
+│   └── kaggle_train_qwen_qlora.ipynb
+├── src/
+│   ├── config.py
+│   ├── data/        — loader, schema, prompt
+│   ├── models/      — inference, postprocess
+│   ├── evaluation/  — metrics, evaluate
+│   └── api/         — main, schemas, dependencies
+├── tests/
+└── scripts/
+```
+---
+## 3. Week-by-week plan
+### Week 1. Environment, data, baseline
+**Goal:** a working pipeline from question to SQL on a small model.
+| Day | Task |
+|---|---|
+| 1 | Install Python 3.10+, uv, Git. Clone the repository. `uv sync`. Check that FastAPI starts. |
+| 2 | Register on Kaggle, HuggingFace, W&B. Download PAUQ (https://github.com/ai-forever/pauq). |
+| 3 | Analyze the dataset in a notebook: distributions, difficulty levels, examples. Implement `SchemaRetriever`. |
+| 4 | Implement `PromptBuilder`. Tests: `pytest tests/test_prompt.py`. |
+| 5–6 | Kaggle notebook: train **ruT5-base** for 2 epochs. Save the checkpoint. |
+| 7 | Implement `metrics.py` (EM + Execution Accuracy). Run ruT5 on dev. Log to W&B. |
+Weekly checkpoint: ruT5-base reaches 25–35% EM on PAUQ dev.
+### Week 2. Main model (Qwen2.5-Coder-3B + QLoRA)
+**Goal:** a trained LoRA adapter for Qwen with metrics above the baseline.
+| Day | Task |
+|---|---|
+| 1 | Kaggle notebook: load Qwen2.5-Coder-3B in 4-bit, test inference. |
+| 2 | Prepare PAUQ in the chat format the model expects. |
+| 3–4 | SFTTrainer + LoRA (r=16, alpha=32). Run 2–3 epochs (~4–6 hours in total). |
+| 5 | Save the LoRA adapter to the HuggingFace Hub (private repository). |
+| 6 | Download the adapter to the desktop. Local CPU inference via transformers. |
+| 7 | Run on the dev split, compute metrics, do error analysis on 30 examples. |
+Weekly checkpoint: Qwen+LoRA reaches 50–60% EM on PAUQ dev and runs on the desktop.
+### Week 3. Business utility: connector + vocabulary + SQL execution
+**Goal:** turn the API into a full business utility: connection to a real database, per-company configuration, returning data.
+| Day | Task |
+|---|---|
+| 1 | FastAPI: `/generate-sql`, `/query`, `/databases`, `/health`. Lifespan for loading the model. |
+| 2 | `DbConnector` module: connect to SQLite/PostgreSQL/MySQL via a connection string. Automatic schema reading (`INFORMATION_SCHEMA` for PostgreSQL/MySQL; SQLite has no `INFORMATION_SCHEMA`, so the schema is read from `sqlite_master` there). |
+| 3 | `BusinessVocabulary` module: load a YAML config with metric definitions. Inject the definitions into the prompt before SQL generation. Example config: `выручка: "SUM(orders.amount) WHERE status='paid'"`. |
+| 4 | `/query` endpoint: accepts a question, generates SQL, executes it on the connected database, returns the result as JSON (a table of rows). |
+| 5 | Obtain a GigaChat (or YandexGPT) API key and write a script that runs it on the same examples. Comparison table: ruT5 vs Qwen+LoRA vs GigaChat on EM and EX. |
+| 6 | `SqlPostProcessor` via sqlglot. pytest tests for all new modules. |
+| 7 | Create a demo database (SQLite) with realistic business data: sales, customers, products. Write a business vocabulary for this database. |
+Weekly checkpoint: the analyst enters "Какая выручка за январь?" ("What was the revenue for January?") and the utility returns a number from a real database.
+### Week 4. Streamlit interface, demo, thesis materials
+**Goal:** a polished working product for the defense plus ready materials for the thesis text.
+| Day | Task |
+|---|---|
+| 1 | Streamlit interface: question input field, database selector, display of the generated SQL and the results table. |
+| 2 | In the interface: a tab for configuring the business vocabulary (editing YAML right in the browser). Query history. |
+| 3 | Error analysis: review 30 Qwen+LoRA errors and classify them by category (wrong JOIN, wrong WHERE condition, etc.). |
+| 4 | Convert the LoRA adapter + base model to gguf via llama.cpp for fast CPU inference. |
+| 5 | Architecture diagrams (draw.io), interface screenshots, metric plots (matplotlib). |
+| 6 | The "Implementation" and "Practical application" chapters of the thesis text. |
+| 7 | Run the full scenario on the laptop with the demo database. Back up the checkpoint to HuggingFace. |
+---
+## 4. Quality metrics
+The standard for Text-to-SQL:
+- **Exact Match (EM)**: normalize both SQL queries and compare them character by character.
+- **Execution Accuracy (EX)**: execute both SQL queries on a real SQLite database and compare the results as sets of tuples.
+EX matters more than EM because different SQL queries can produce the same result.
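The two metrics described above can be sketched in a few lines. This is a minimal illustration, not the code from `src/evaluation/metrics.py` (which is not shown here): the function names, the normalization rules, and the demo table are all assumptions made for the example.

```python
import re
import sqlite3


def normalize_sql(sql: str) -> str:
    """Crude normalization: trim, drop a trailing ';', collapse whitespace, lowercase.

    Lowercasing also changes string literals, so this is only an approximation.
    """
    sql = sql.strip().rstrip(";").strip()
    return re.sub(r"\s+", " ", sql).lower()


def exact_match(pred: str, gold: str) -> bool:
    """EM: character-by-character comparison of the normalized queries."""
    return normalize_sql(pred) == normalize_sql(gold)


def execution_match(conn: sqlite3.Connection, pred: str, gold: str) -> bool:
    """EX: compare result sets as sets of tuples; a failing prediction is a miss."""
    try:
        pred_rows = set(conn.execute(pred).fetchall())
    except sqlite3.Error:
        return False
    gold_rows = set(conn.execute(gold).fetchall())
    return pred_rows == gold_rows


# Hypothetical demo table, used only to exercise the two functions.
conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE orders (id INTEGER PRIMARY KEY, amount REAL, status TEXT);
    INSERT INTO orders VALUES (1, 100.0, 'paid'), (2, 50.0, 'new'), (3, 70.0, 'paid');
    """
)
gold = "SELECT id FROM orders WHERE status = 'paid'"
pred = "SELECT id FROM orders WHERE status IN ('paid') ORDER BY id"

print(exact_match(pred, gold), execution_match(conn, pred, gold))  # prints: False True
```

The example pair differs textually but returns the same rows, which is exactly the case where EM undercounts and EX does not. Comparing as sets ignores row order and duplicates, so queries whose correctness depends on `ORDER BY` would need a stricter comparison.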
+Target numbers on PAUQ dev (approximate):
+- ruT5-base: 25–35% EM, 30–40% EX.
+- Qwen2.5-Coder-3B + LoRA: 50–60% EM, 55–70% EX.
+- GigaChat / GPT-4 (zero-shot, via API): 55–70% EM, 65–80% EX.
+The Qwen model after QLoRA should come close to the API models. That is the result to be defended.
+---
+## 5. Risks and plan B
+| Risk | Plan B |
+|---|---|
+| Kaggle quota runs out | Switch to Google Colab Free or rent a GPU on vast.ai (~$2 per training run) |
+| Qwen-3B converges poorly | Lower the learning rate to 1e-4, increase epochs to 5, check the prompt format |
+| No time to set up llama.cpp before the defense | Demo directly via transformers on CPU: slower, but it works |
+| GigaChat is unavailable | YandexGPT or OpenAI via VPN: there is a single Pydantic wrapper, and the provider is switched with one line |
+| Not enough time for error analysis | Minimum: 20 errors reviewed by hand, simple classification in Excel |
+---
+## 6. What to move to "directions for future work"
+These improvements are **not done** within the month, but they are mentioned in the thesis:
+- Few-shot retrieval (finding similar examples via embeddings).
+- Schema linking (automatic table selection).
+- Self-correction (executing the SQL and fixing it based on the error).
+- Constrained decoding (restricting tokens to a valid SQL grammar).
+- Further fine-tuning on synthetic data from GPT-4.
+---
+## 7. Final checklist before starting
+- [ ] Python 3.10+, uv, Git, and VS Code are installed on the desktop
+- [ ] The ru2sql repository is created on GitHub
+- [ ] Kaggle, HuggingFace, and W&B accounts are registered
+- [ ] A GigaChat (or OpenAI) key is obtained
+- [ ] PAUQ is downloaded
+- [ ] `uv sync` completes without errors
+- [ ] `uvicorn src.api.main:app --reload` starts
+- [ ] Papers read: Spider (2018), QLoRA (2023), a short overview of Qwen2.5-Coder
+Once the checklist is done, you can start with Day 3 of Week 1.

pyproject.toml ADDED Viewed

	@@ -0,0 +1,74 @@

+[project]
+name = "ru2sql"
+version = "0.1.0"
+description = "Russian-to-SQL generative model for graduation thesis"
+authors = [{ name = "Danis", email = "[email protected]" }]
+requires-python = ">=3.10,<3.13"
+readme = "README.md"
+dependencies = [
+    # API
+    "fastapi>=0.115.0",
+    "uvicorn[standard]>=0.30.0",
+    "pydantic>=2.7.0",
+    "pydantic-settings>=2.4.0",
+    # SQL parsing / validation
+    "sqlglot>=25.0.0",
+    # Data
+    "datasets>=2.20.0",
+    "pandas>=2.2.0",
+    # ML inference (CPU-friendly versions for desktop/laptop)
+    # Heavy training deps (bitsandbytes, trl, wandb) live in [training] and run on Kaggle
+    "torch>=2.3.0",
+    "transformers>=4.44.0",
+    "accelerate>=0.33.0",
+    "peft>=0.12.0",  # for loading LoRA adapter at inference time
+    # Misc
+    "python-dotenv>=1.0.0",
+    "httpx>=0.27.0",  # for GigaChat/OpenAI API client
+    "tqdm>=4.66.0",
+    # UI
+    "streamlit>=1.35.0",
+    "pyyaml>=6.0",
+]
+[project.optional-dependencies]
+dev = [
+    "pytest>=8.3.0",
+    "pytest-asyncio>=0.23.0",
+    "ruff>=0.6.0",
+    "ipykernel>=6.29.0",
+    "matplotlib>=3.9.0",
+    "seaborn>=0.13.0",
+]
+# Heavy GPU-only deps. Install on Kaggle: `pip install -e .[training]`
+training = [
+    "bitsandbytes>=0.43.0",
+    "trl>=0.10.0",
+    "wandb>=0.17.0",
+]
+[build-system]
+requires = ["hatchling"]
+build-backend = "hatchling.build"
+[tool.hatch.build.targets.wheel]
+packages = ["src"]
+[tool.ruff]
+line-length = 100
+target-version = "py310"
+[tool.ruff.lint]
+select = ["E", "F", "W", "I", "B", "UP"]
+ignore = ["E501"]
+[tool.pytest.ini_options]
+testpaths = ["tests"]
+pythonpath = ["."]

requirements.txt ADDED Viewed

	@@ -0,0 +1,14 @@

+streamlit>=1.35.0
+torch>=2.3.0
+transformers>=4.44.0
+accelerate>=0.33.0
+peft>=0.12.0
+pydantic>=2.7.0
+pydantic-settings>=2.4.0
+sqlglot>=25.0.0
+pandas>=2.2.0
+python-dotenv>=1.0.0
+huggingface_hub>=1.0.0
+pyyaml>=6.0
+tqdm>=4.66.0
+httpx>=0.27.0

src/__init__.py ADDED Viewed

	@@ -0,0 +1,3 @@


+"""ru2sql: Russian-to-SQL generative model."""
+
+__version__ = "0.1.0"

src/api/__init__.py ADDED Viewed

File without changes

src/api/dependencies.py ADDED Viewed

	@@ -0,0 +1,42 @@

+"""FastAPI lifespan and DI: load the model once at startup."""
+from __future__ import annotations
+from contextlib import asynccontextmanager
+from fastapi import FastAPI
+from src.config import settings
+from src.data.schema import SchemaRetriever
+from src.models.inference import InferenceEngine
+class AppState:
+    engine: InferenceEngine | None = None
+    schema_retriever: SchemaRetriever | None = None
+state = AppState()
+@asynccontextmanager
+async def lifespan(app: FastAPI):
+    """Load the model at startup, release it at shutdown."""
+    state.engine = InferenceEngine()
+    state.engine.load()
+    state.schema_retriever = SchemaRetriever(settings.databases_dir)
+    yield
+    state.engine = None
+    state.schema_retriever = None
+def get_engine() -> InferenceEngine:
+    if state.engine is None:
+        raise RuntimeError("Inference engine not initialized")
+    return state.engine
+def get_schema_retriever() -> SchemaRetriever:
+    if state.schema_retriever is None:
+        raise RuntimeError("SchemaRetriever not initialized")
+    return state.schema_retriever

src/api/main.py ADDED Viewed

	@@ -0,0 +1,110 @@

+"""FastAPI application.
+Run:
+    uvicorn src.api.main:app --reload
+    # Swagger UI: http://127.0.0.1:8000/docs
+"""
+from __future__ import annotations
+import sqlite3
+from fastapi import Depends, FastAPI, HTTPException
+from fastapi.concurrency import run_in_threadpool
+from src.api.dependencies import get_engine, get_schema_retriever, lifespan
+from src.api.schemas import (
+    DatabaseInfo,
+    ExecutionResult,
+    GenerateRequest,
+    GenerateResponse,
+    HealthResponse,
+)
+from src.config import settings
+from src.data.schema import SchemaRetriever
+from src.models.inference import InferenceEngine
+from src.models.postprocess import is_valid_sql
+app = FastAPI(
+    title="ru2sql",
+    description="Преобразование вопросов на русском в SQL-запросы",
+    version="0.1.0",
+    lifespan=lifespan,
+)
+@app.get("/health", response_model=HealthResponse)
+def health(engine: InferenceEngine = Depends(get_engine)):
+    return HealthResponse(
+        status="ok",
+        model_loaded=engine._loaded,
+        base_model=engine.base_model_name,
+    )
+@app.get("/databases", response_model=list[DatabaseInfo])
+def list_databases(retriever: SchemaRetriever = Depends(get_schema_retriever)):
+    out: list[DatabaseInfo] = []
+    for db_id in retriever.list_databases():
+        try:
+            tables = [t.name for t in retriever.get_tables(db_id, n_sample_rows=0)]
+            out.append(DatabaseInfo(db_id=db_id, tables=tables))
+        except FileNotFoundError:
+            continue
+    return out
+@app.post("/generate-sql", response_model=GenerateResponse)
+async def generate_sql(
+    req: GenerateRequest,
+    engine: InferenceEngine = Depends(get_engine),
+    retriever: SchemaRetriever = Depends(get_schema_retriever),
+):
+    try:
+        schema_text = retriever.render_schema(req.db_id)
+    except FileNotFoundError as e:
+        raise HTTPException(status_code=404, detail=str(e)) from e
+    # Inference is synchronous and heavy, so run it in the threadpool
+    result = await run_in_threadpool(engine.generate, schema_text, req.question)
+    valid = is_valid_sql(result.sql)
+    response = GenerateResponse(
+        sql=result.sql,
+        raw_output=result.raw_output,
+        is_valid_sql=valid,
+    )
+    if req.execute and valid:
+        try:
+            response.execution = await run_in_threadpool(
+                _execute_sql, req.db_id, result.sql, retriever
+            )
+        except sqlite3.Error as e:
+            response.error = f"SQL execution error: {e}"
+    return response
+def _execute_sql(db_id: str, sql: str, retriever: SchemaRetriever) -> ExecutionResult:
+    db_path = retriever.db_path(db_id)
+    conn = sqlite3.connect(f"file:{db_path}?mode=ro", uri=True)
+    try:
+        conn.text_factory = lambda b: b.decode("utf-8", errors="replace")
+        cur = conn.cursor()
+        cur.execute(sql)
+        rows = cur.fetchmany(100)
+        cols = [d[0] for d in cur.description] if cur.description else []
+        return ExecutionResult(
+            columns=cols,
+            rows=[list(r) for r in rows],
+            row_count=len(rows),
+        )
+    finally:
+        conn.close()
+if __name__ == "__main__":
+    import uvicorn
+    uvicorn.run("src.api.main:app", host=settings.api_host, port=settings.api_port, reload=True)
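The `_execute_sql` helper above relies on SQLite's URI mode `mode=ro` so that generated SQL cannot modify the database, and on `fetchmany` to cap the result size. The sketch below isolates that pattern with the standard library only; `run_readonly` and the throwaway `sales` table are hypothetical names used for illustration.

```python
import os
import sqlite3
import tempfile


def run_readonly(db_path: str, sql: str, max_rows: int = 100):
    """Open the database read-only and return (columns, rows), capped at max_rows."""
    conn = sqlite3.connect(f"file:{db_path}?mode=ro", uri=True)
    try:
        cur = conn.execute(sql)
        cols = [d[0] for d in cur.description] if cur.description else []
        return cols, cur.fetchmany(max_rows)
    finally:
        conn.close()


# Build a throwaway database file with hypothetical demo data.
db_path = os.path.join(tempfile.mkdtemp(), "demo.sqlite")
rw = sqlite3.connect(db_path)
rw.execute("CREATE TABLE sales (id INTEGER PRIMARY KEY, amount REAL)")
rw.executemany("INSERT INTO sales (amount) VALUES (?)", [(10.0,), (20.0,)])
rw.commit()
rw.close()

cols, rows = run_readonly(db_path, "SELECT id, amount FROM sales ORDER BY id")
print(cols, rows)

# A write statement is rejected by SQLite itself, whatever the SQL validator decided.
try:
    run_readonly(db_path, "DELETE FROM sales")
    write_blocked = False
except sqlite3.OperationalError as exc:
    write_blocked = True
    print("write rejected:", exc)
```

Opening the connection read-only is a useful second line of defense: a syntactic validity check says nothing about whether a statement is a `SELECT`, so the database connection is where writes are actually prevented.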

src/api/schemas.py ADDED Viewed

	@@ -0,0 +1,36 @@

+"""Pydantic models for the FastAPI endpoints."""
+from __future__ import annotations
+from pydantic import BaseModel, Field
+class GenerateRequest(BaseModel):
+    question: str = Field(..., min_length=1, max_length=2000, description="Вопрос на русском")
+    db_id: str = Field(..., min_length=1, description="Идентификатор БД из PAUQ")
+    execute: bool = Field(default=False, description="Прогнать сгенерированный SQL на БД")
+class ExecutionResult(BaseModel):
+    columns: list[str]
+    rows: list[list]
+    row_count: int
+class GenerateResponse(BaseModel):
+    sql: str
+    raw_output: str
+    is_valid_sql: bool
+    execution: ExecutionResult | None = None
+    error: str | None = None
+class DatabaseInfo(BaseModel):
+    db_id: str
+    tables: list[str]
+class HealthResponse(BaseModel):
+    status: str
+    model_loaded: bool
+    base_model: str

src/business/__init__.py ADDED Viewed

	@@ -0,0 +1,3 @@


+from .vocabulary import BusinessVocabulary
+
+__all__ = ["BusinessVocabulary"]

src/business/vocabulary.py ADDED Viewed

	@@ -0,0 +1,173 @@

+"""BusinessVocabulary: a configurable company business vocabulary.
+Lets an analyst describe the company's business terms and metrics once in a YAML file,
+so that the model can interpret them in SQL queries.
+Example YAML config (configs/example_vocabulary.yaml):
+    company: "ООО Ромашка"
+    terms:
+      выручка: "SUM(orders.amount) WHERE orders.status = 'paid'"
+      активный клиент: "клиент, совершивший покупку за последние 90 дней"
+      этот год: "strftime('%Y', order_date) = strftime('%Y', 'now')"
+      прошлый месяц: "strftime('%Y-%m', order_date) = strftime('%Y-%m', 'now', '-1 month')"
+    filters:
+      только_оплаченные: "orders.status = 'paid'"
+      без_возвратов: "orders.is_return = 0"
+Usage example:
+    vocab = BusinessVocabulary.from_yaml("configs/my_company.yaml")
+    enriched_prompt = vocab.enrich_prompt("Какая выручка за январь?")
+"""
+from __future__ import annotations
+from dataclasses import dataclass, field
+from pathlib import Path
+try:
+    import yaml  # type: ignore
+    _YAML_AVAILABLE = True
+except ImportError:
+    _YAML_AVAILABLE = False
+@dataclass
+class BusinessVocabulary:
+    """Stores the company's business terms and metrics and injects them into the model prompt."""
+    company: str = ""
+    terms: dict[str, str] = field(default_factory=dict)
+    filters: dict[str, str] = field(default_factory=dict)
+    notes: list[str] = field(default_factory=list)
+    # ------------------------------------------------------------------
+    # Loading
+    # ------------------------------------------------------------------
+    @classmethod
+    def from_yaml(cls, path: str | Path) -> "BusinessVocabulary":
+        """Load the vocabulary from a YAML file."""
+        if not _YAML_AVAILABLE:
+            raise ImportError("Установи PyYAML: pip install pyyaml")
+        path = Path(path)
+        if not path.exists():
+            raise FileNotFoundError(f"Файл бизнес-словаря не найден: {path}")
+        with open(path, encoding="utf-8") as f:
+            data = yaml.safe_load(f) or {}
+        return cls(
+            company=data.get("company", ""),
+            terms=data.get("terms", {}),
+            filters=data.get("filters", {}),
+            notes=data.get("notes", []),
+        )
+    @classmethod
+    def from_dict(cls, data: dict) -> "BusinessVocabulary":
+        """Create the vocabulary from a Python dict (handy for the API and Streamlit)."""
+        return cls(
+            company=data.get("company", ""),
+            terms=data.get("terms", {}),
+            filters=data.get("filters", {}),
+            notes=data.get("notes", []),
+        )
+    @classmethod
+    def empty(cls) -> "BusinessVocabulary":
+        """Empty vocabulary, for when the company has not configured its terms yet."""
+        return cls()
+    # ------------------------------------------------------------------
+    # Usage
+    # ------------------------------------------------------------------
+    def enrich_prompt(self, question: str) -> str:
+        """Add business-vocabulary context to the user's question.
+        Definitions are added only for terms that occur in the question; filters and notes
+        are always appended when present.
+        Returns the enriched question to be placed into the model prompt.
+        """
+        if not self.terms and not self.filters and not self.notes:
+            return question
+        context_lines: list[str] = []
+        # Find the terms mentioned in the question (case-insensitive substring match)
+        question_lower = question.lower()
+        relevant_terms = {
+            term: definition
+            for term, definition in self.terms.items()
+            if term.lower() in question_lower
+        }
+        if relevant_terms:
+            context_lines.append("Определения терминов компании:")
+            for term, definition in relevant_terms.items():
+                context_lines.append(f"  - {term}: {definition}")
+        if self.filters:
+            context_lines.append("Стандартные фильтры компании:")
+            for name, condition in self.filters.items():
+                context_lines.append(f"  - {name}: {condition}")
+        if self.notes:
+            context_lines.append("Дополнительные правила:")
+            for note in self.notes:
+                context_lines.append(f"  - {note}")
+        if not context_lines:
+            return question
+        context = "\n".join(context_lines)
+        return f"{question}\n\n[Контекст компании]\n{context}"
+    def render_system_context(self) -> str:
+        """Text for the system prompt: describes all of the company's terms."""
+        if not self.terms and not self.filters and not self.notes:
+            return ""
+        lines: list[str] = []
+        if self.company:
+            lines.append(f"Компания: {self.company}")
+            lines.append("")
+        if self.terms:
+            lines.append("Бизнес-термины и метрики:")
+            for term, definition in self.terms.items():
+                lines.append(f"  - «{term}» означает: {definition}")
+        if self.filters:
+            lines.append("")
+            lines.append("Стандартные условия фильтрации:")
+            for name, condition in self.filters.items():
+                lines.append(f"  - {name}: {condition}")
+        if self.notes:
+            lines.append("")
+            lines.append("Важные правила:")
+            for note in self.notes:
+                lines.append(f"  - {note}")
+        return "\n".join(lines)
+    def to_yaml_string(self) -> str:
+        """Serialize the vocabulary back to a YAML string (for the Streamlit editor)."""
+        if not _YAML_AVAILABLE:
+            raise ImportError("Установи PyYAML: pip install pyyaml")
+        data = {
+            "company": self.company,
+            "terms": self.terms,
+            "filters": self.filters,
+            "notes": self.notes,
+        }
+        return yaml.dump(data, allow_unicode=True, sort_keys=False, default_flow_style=False)
+    def save_yaml(self, path: str | Path) -> None:
+        """Save the vocabulary to a YAML file."""
+        path = Path(path)
+        path.parent.mkdir(parents=True, exist_ok=True)
+        with open(path, "w", encoding="utf-8") as f:
+            f.write(self.to_yaml_string())
+    def __bool__(self) -> bool:
+        return bool(self.terms or self.filters or self.notes)
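The term lookup in `enrich_prompt` is a case-insensitive substring match, which is worth testing against inflected Russian forms. The sketch below mirrors that check in a standalone function and contrasts it with a deliberately crude prefix-based fallback. Both function names and the `stem_len` parameter are hypothetical and not part of the module; a real morphological analyzer would be more robust than prefix truncation.

```python
def find_relevant_terms(question: str, terms: dict[str, str]) -> dict[str, str]:
    """Case-insensitive substring match, mirroring the check in enrich_prompt."""
    q = question.lower()
    return {t: d for t, d in terms.items() if t.lower() in q}


def find_relevant_terms_by_stem(
    question: str, terms: dict[str, str], stem_len: int = 5
) -> dict[str, str]:
    """Hypothetical fallback: compare truncated word prefixes instead of full terms."""
    q_stems = {w[:stem_len] for w in question.lower().split()}
    return {
        t: d
        for t, d in terms.items()
        if all(w[:stem_len] in q_stems for w in t.lower().split())
    }


terms = {"выручка": "SUM(orders.amount) WHERE orders.status = 'paid'"}

nominative = "Какая выручка за январь?"   # term appears in its dictionary form
accusative = "Покажи выручку за январь"   # same term, inflected ("выручку")

print(find_relevant_terms(nominative, terms))          # the term is found
print(find_relevant_terms(accusative, terms))          # empty: substring match misses it
print(find_relevant_terms_by_stem(accusative, terms))  # the prefix match finds it
```

If inflected questions are expected, one low-effort option is to list the common forms of each term as separate keys in the YAML config, which keeps the matching logic unchanged.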

src/config.py ADDED Viewed

	@@ -0,0 +1,48 @@

+"""Project configuration, read from .env via pydantic-settings."""
+from __future__ import annotations
+from pathlib import Path
+from pydantic_settings import BaseSettings, SettingsConfigDict
+ROOT_DIR = Path(__file__).resolve().parent.parent
+class Settings(BaseSettings):
+    """All application settings. Values come from .env, environment variables, or defaults."""
+    model_config = SettingsConfigDict(
+        env_file=str(ROOT_DIR / ".env"),
+        env_file_encoding="utf-8",
+        extra="ignore",
+    )
+    # Model
+    base_model_name: str = "Qwen/Qwen2.5-Coder-3B-Instruct"
+    lora_adapter_path: str = str(ROOT_DIR / "checkpoints" / "qwen-coder-pauq-lora")
+    device: str = "cpu"  # "cpu" | "cuda" | "mps"
+    # Data
+    pauq_data_dir: Path = ROOT_DIR / "data" / "pauq"
+    databases_dir: Path = ROOT_DIR / "data" / "databases"
+    # API keys (only the one that is filled in is used)
+    gigachat_api_key: str = ""
+    openai_api_key: str = ""
+    yandexgpt_api_key: str = ""
+    yandexgpt_folder_id: str = ""
+    hf_token: str = ""
+    # FastAPI
+    api_host: str = "127.0.0.1"
+    api_port: int = 8000
+    # Inference defaults
+    max_new_tokens: int = 256
+    temperature: float = 0.0  # determinism is preferable for SQL
+    do_sample: bool = False
+# Singleton instance, imported across the project: `from src.config import settings`
+settings = Settings()

src/data/__init__.py ADDED Viewed

File without changes

src/data/loader.py ADDED Viewed

	@@ -0,0 +1,52 @@

+"""PAUQ dataset loader.
+PAUQ is distributed in JSON format with the fields question, query, db_id, etc.
+See https://github.com/ai-forever/pauq
+"""
+from __future__ import annotations
+import json
+from dataclasses import dataclass
+from pathlib import Path
+from typing import Iterator
+@dataclass
+class PauqExample:
+    question: str
+    query: str  # gold SQL
+    db_id: str
+    query_type: str | None = None  # easy/medium/hard/extra, if present
+    raw: dict | None = None
+def load_pauq_split(path: Path | str) -> list[PauqExample]:
+    """Read train.json / dev.json / test.json from PAUQ."""
+    path = Path(path)
+    with path.open("r", encoding="utf-8") as f:
+        raw = json.load(f)
+    examples: list[PauqExample] = []
+    for item in raw:
+        # PAUQ has several format revisions; try the most common field names
+        question = item.get("question") or item.get("question_ru") or ""
+        query = item.get("query") or item.get("sql_query") or item.get("sql") or ""
+        db_id = item.get("db_id") or item.get("database") or ""
+        if not (question and query and db_id):
+            continue
+        examples.append(
+            PauqExample(
+                question=question.strip(),
+                query=query.strip(),
+                db_id=db_id.strip(),
+                query_type=item.get("query_type") or item.get("hardness"),
+                raw=item,
+            )
+        )
+    return examples
+def iter_pauq_split(path: Path | str) -> Iterator[PauqExample]:
+    """Generator wrapper over load_pauq_split; the whole file is still loaded into memory first."""
+    yield from load_pauq_split(path)
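The field-name fallback in `load_pauq_split` can be checked on a tiny hand-made file before pointing it at the real dataset. The sketch below is a self-contained approximation of that logic; `first_nonempty`, `load_records`, and the sample records are hypothetical and only illustrate the behavior (alternative keys accepted, incomplete items skipped, values stripped).

```python
import json
import os
import tempfile


def first_nonempty(item: dict, *keys: str) -> str:
    """Return the first non-empty value among the candidate keys, stripped."""
    for key in keys:
        value = item.get(key)
        if value:
            return str(value).strip()
    return ""


def load_records(path: str) -> list[dict]:
    """Read a JSON list and keep only items that have question, query, and db_id."""
    with open(path, encoding="utf-8") as f:
        raw = json.load(f)
    records = []
    for item in raw:
        rec = {
            "question": first_nonempty(item, "question", "question_ru"),
            "query": first_nonempty(item, "query", "sql_query", "sql"),
            "db_id": first_nonempty(item, "db_id", "database"),
        }
        if all(rec.values()):  # skip incomplete items, as the loader does
            records.append(rec)
    return records


# Hypothetical sample: two field-name variants and one item without a query.
sample = [
    {"question": "Сколько клиентов?", "query": "SELECT count(*) FROM customers", "db_id": "shop"},
    {"question_ru": "Список товаров", "sql": "SELECT name FROM products ", "database": "shop"},
    {"question": "Без запроса", "db_id": "shop"},
]
path = os.path.join(tempfile.mkdtemp(), "dev.json")
with open(path, "w", encoding="utf-8") as f:
    json.dump(sample, f, ensure_ascii=False)

records = load_records(path)
print(len(records), records[1]["query"])
```

Because incomplete items are dropped silently, it may be worth logging how many records were skipped when loading a new revision of the dataset.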

src/data/prompt.py ADDED Viewed

	@@ -0,0 +1,29 @@

+"""PromptBuilder: builds the model input in chat-template format."""
+from __future__ import annotations
+SYSTEM_PROMPT = (
+    "Ты — ассистент, который преобразует вопросы на русском языке в корректные SQL-запросы. "
+    "Тебе даётся схема базы данных в виде CREATE TABLE statements и пример нескольких строк. "
+    "Сгенерируй один SQL-запрос, который отвечает на вопрос пользователя. "
+    "Возвращай ТОЛЬКО SQL без объяснений, без markdown, без префиксов."
+)
+def build_user_message(schema: str, question: str) -> str:
+    return f"### Schema:\n{schema}\n\n### Question:\n{question}\n\n### SQL:\n"
+def build_chat_messages(schema: str, question: str) -> list[dict]:
+    """Message format for tokenizer.apply_chat_template."""
+    return [
+        {"role": "system", "content": SYSTEM_PROMPT},
+        {"role": "user", "content": build_user_message(schema, question)},
+    ]
+def build_training_example(schema: str, question: str, sql: str) -> list[dict]:
+    """Full dialogue for SFT, including the assistant's answer."""
+    msgs = build_chat_messages(schema, question)
+    msgs.append({"role": "assistant", "content": sql.strip()})
+    return msgs

src/data/schema.py ADDED Viewed

	@@ -0,0 +1,76 @@

+"""SchemaRetriever: extracts DDL and sample rows from PAUQ/Spider SQLite files."""
+from __future__ import annotations
+import sqlite3
+from dataclasses import dataclass
+from pathlib import Path
+@dataclass
+class TableInfo:
+    name: str
+    create_sql: str
+    sample_rows: list[tuple]
+class SchemaRetriever:
+    """Reads the structure of a SQLite database to feed into the model prompt."""
+    def __init__(self, databases_dir: Path | str):
+        self.databases_dir = Path(databases_dir)
+    def db_path(self, db_id: str) -> Path:
+        """In Spider/PAUQ each database lives at databases_dir/{db_id}/{db_id}.sqlite."""
+        path = self.databases_dir / db_id / f"{db_id}.sqlite"
+        if not path.exists():
+            raise FileNotFoundError(f"Database file not found: {path}")
+        return path
+    def get_tables(self, db_id: str, n_sample_rows: int = 3) -> list[TableInfo]:
+        """Returns the list of tables with their CREATE SQL and sample rows."""
+        path = self.db_path(db_id)
+        conn = sqlite3.connect(f"file:{path}?mode=ro", uri=True)
+        try:
+            conn.text_factory = lambda b: b.decode("utf-8", errors="replace")
+            cur = conn.cursor()
+            cur.execute(
+                "SELECT name, sql FROM sqlite_master "
+                "WHERE type='table' AND name NOT LIKE 'sqlite_%'"
+            )
+            rows = cur.fetchall()
+            tables: list[TableInfo] = []
+            for table_name, create_sql in rows:
+                if not create_sql:
+                    continue
+                try:
+                    cur.execute(f'SELECT * FROM "{table_name}" LIMIT {n_sample_rows}')
+                    samples = cur.fetchall()
+                except sqlite3.Error:
+                    samples = []
+                tables.append(
+                    TableInfo(name=table_name, create_sql=create_sql.strip(), sample_rows=samples)
+                )
+            return tables
+        finally:
+            conn.close()
+    def render_schema(self, db_id: str, include_samples: bool = True) -> str:
+        """Text representation of the schema for the prompt."""
+        tables = self.get_tables(db_id)
+        parts: list[str] = []
+        for t in tables:
+            parts.append(t.create_sql + ";")
+            if include_samples and t.sample_rows:
+                parts.append(f"-- Примеры строк из {t.name}:")
+                for row in t.sample_rows:
+                    parts.append(f"-- {row}")
+            parts.append("")
+        return "\n".join(parts).strip()
+    def list_databases(self) -> list[str]:
+        """List of available db_ids."""
+        if not self.databases_dir.exists():
+            return []
+        return sorted(p.name for p in self.databases_dir.iterdir() if p.is_dir())

src/db/__init__.py ADDED Viewed

	@@ -0,0 +1,4 @@

+from .connector import DbConnector
+from .executor import SqlExecutor, QueryResult
+__all__ = ["DbConnector", "SqlExecutor", "QueryResult"]

src/db/connector.py ADDED Viewed

	@@ -0,0 +1,238 @@

+"""DbConnector: connects to an arbitrary database and reads its schema.
+Supported database types:
+    SQLite     -- file path: "sqlite:///path/to/db.sqlite" or a bare path
+    PostgreSQL -- "postgresql://user:pass@host:port/dbname"  (requires psycopg2)
+    MySQL      -- "mysql://user:pass@host:port/dbname"       (requires pymysql)
+Example:
+    conn = DbConnector("sqlite:///data/demo/sales.sqlite")
+    print(conn.render_schema())
+    tables = conn.list_tables()
+"""
+from __future__ import annotations
+import sqlite3
+from dataclasses import dataclass, field
+from pathlib import Path
+from urllib.parse import urlparse
+@dataclass
+class ColumnInfo:
+    name: str
+    type: str
+    nullable: bool = True
+    primary_key: bool = False
+@dataclass
+class TableInfo:
+    name: str
+    columns: list[ColumnInfo] = field(default_factory=list)
+    sample_rows: list[tuple] = field(default_factory=list)
+    def to_ddl(self) -> str:
+        """Generates a CREATE TABLE statement from the metadata."""
+        col_parts = []
+        for col in self.columns:
+            line = f"    {col.name} {col.type}"
+            if col.primary_key:
+                line += " PRIMARY KEY"
+            if not col.nullable:
+                line += " NOT NULL"
+            col_parts.append(line)
+        return f"CREATE TABLE {self.name} (\n" + ",\n".join(col_parts) + "\n);"
+class DbConnector:
+    """Generic database connector. Reads the schema for substitution into the prompt."""
+    def __init__(self, connection_string: str, n_sample_rows: int = 2):
+        self.connection_string = self._normalize(connection_string)
+        self.n_sample_rows = n_sample_rows
+        self._db_type = self._detect_type(self.connection_string)
+    def list_tables(self) -> list[str]:
+        return [t.name for t in self._get_tables(n_sample_rows=0)]
+    def get_schema(self, include_samples: bool = True) -> list[TableInfo]:
+        return self._get_tables(n_sample_rows=self.n_sample_rows if include_samples else 0)
+    def render_schema(self, include_samples: bool = True) -> str:
+        tables = self.get_schema(include_samples=include_samples)
+        parts: list[str] = []
+        for t in tables:
+            parts.append(t.to_ddl())
+            if include_samples and t.sample_rows:
+                parts.append(f"-- Primery strok iz {t.name}:")
+                for row in t.sample_rows:
+                    parts.append(f"--   {row}")
+            parts.append("")
+        return "\n".join(parts).strip()
+    def test_connection(self) -> bool:
+        try:
+            self._get_tables(n_sample_rows=0)
+            return True
+        except Exception:
+            return False
+    def _get_tables(self, n_sample_rows: int) -> list[TableInfo]:
+        if self._db_type == "sqlite":
+            return self._get_tables_sqlite(n_sample_rows)
+        elif self._db_type == "postgresql":
+            return self._get_tables_postgres(n_sample_rows)
+        elif self._db_type == "mysql":
+            return self._get_tables_mysql(n_sample_rows)
+        else:
+            raise ValueError(f"Neizvestnyy tip BD: {self._db_type}")
+    def _get_tables_sqlite(self, n_sample_rows: int) -> list[TableInfo]:
+        path = self._safe_sqlite_path(self._sqlite_path())
+        conn = sqlite3.connect(str(path))
+        conn.text_factory = lambda b: b.decode("utf-8", errors="replace")
+        try:
+            cur = conn.cursor()
+            cur.execute(
+                "SELECT name FROM sqlite_master "
+                "WHERE type='table' AND name NOT LIKE 'sqlite_%' "
+                "ORDER BY name"
+            )
+            table_names = [r[0] for r in cur.fetchall()]
+            tables: list[TableInfo] = []
+            for name in table_names:
+                cur.execute(f'PRAGMA table_info("{name}")')
+                cols = [
+                    ColumnInfo(
+                        name=row[1],
+                        type=row[2] or "TEXT",
+                        nullable=not row[3],
+                        primary_key=bool(row[5]),
+                    )
+                    for row in cur.fetchall()
+                ]
+                samples: list[tuple] = []
+                if n_sample_rows > 0:
+                    try:
+                        cur.execute(f'SELECT * FROM "{name}" LIMIT {n_sample_rows}')
+                        samples = cur.fetchall()
+                    except sqlite3.Error:
+                        pass
+                tables.append(TableInfo(name=name, columns=cols, sample_rows=samples))
+            return tables
+        finally:
+            conn.close()
+    def _get_tables_postgres(self, n_sample_rows: int) -> list[TableInfo]:
+        try:
+            import psycopg2  # type: ignore
+        except ImportError as e:
+            raise ImportError("Ustanovi psycopg2: pip install psycopg2-binary") from e
+        conn = psycopg2.connect(self.connection_string)
+        try:
+            cur = conn.cursor()
+            cur.execute(
+                "SELECT table_name FROM information_schema.tables "
+                "WHERE table_schema = 'public' AND table_type = 'BASE TABLE' "
+                "ORDER BY table_name"
+            )
+            table_names = [r[0] for r in cur.fetchall()]
+            tables: list[TableInfo] = []
+            for name in table_names:
+                cur.execute(
+                    "SELECT column_name, data_type, is_nullable "
+                    "FROM information_schema.columns "
+                    "WHERE table_name = %s AND table_schema = 'public' "
+                    "ORDER BY ordinal_position",
+                    (name,),
+                )
+                cols = [
+                    ColumnInfo(name=r[0], type=r[1], nullable=(r[2] == "YES"))
+                    for r in cur.fetchall()
+                ]
+                samples: list[tuple] = []
+                if n_sample_rows > 0:
+                    cur.execute(f'SELECT * FROM "{name}" LIMIT {n_sample_rows}')
+                    samples = cur.fetchall()
+                tables.append(TableInfo(name=name, columns=cols, sample_rows=samples))
+            return tables
+        finally:
+            conn.close()
+    def _get_tables_mysql(self, n_sample_rows: int) -> list[TableInfo]:
+        try:
+            import pymysql  # type: ignore
+        except ImportError as e:
+            raise ImportError("Ustanovi pymysql: pip install pymysql") from e
+        parsed = urlparse(self.connection_string)
+        conn = pymysql.connect(
+            host=parsed.hostname,
+            port=parsed.port or 3306,
+            user=parsed.username,
+            password=parsed.password,
+            database=parsed.path.lstrip("/"),
+        )
+        try:
+            cur = conn.cursor()
+            cur.execute("SHOW TABLES")
+            table_names = [r[0] for r in cur.fetchall()]
+            tables: list[TableInfo] = []
+            for name in table_names:
+                cur.execute(f"DESCRIBE `{name}`")
+                cols = [
+                    ColumnInfo(
+                        name=r[0], type=r[1],
+                        nullable=(r[2] == "YES"),
+                        primary_key=(r[3] == "PRI"),
+                    )
+                    for r in cur.fetchall()
+                ]
+                samples: list[tuple] = []
+                if n_sample_rows > 0:
+                    cur.execute(f"SELECT * FROM `{name}` LIMIT {n_sample_rows}")
+                    samples = cur.fetchall()
+                tables.append(TableInfo(name=name, columns=cols, sample_rows=samples))
+            return tables
+        finally:
+            conn.close()
+    def _sqlite_path(self) -> Path:
+        cs = self.connection_string
+        if cs.startswith("sqlite:///"):
+            return Path(cs[10:])
+        return Path(cs)
+    @staticmethod
+    def _safe_sqlite_path(path: Path) -> Path:
+        """If a journal or WAL file sits next to the DB, copy the DB file to a temporary directory."""
+        import shutil
+        import tempfile
+        journal = Path(str(path) + "-journal")
+        wal = Path(str(path) + "-wal")
+        if journal.exists() or wal.exists():
+            # mkdtemp creates a private directory; tempfile.mktemp is deprecated and race-prone
+            tmp = Path(tempfile.mkdtemp()) / path.name
+            shutil.copy2(path, tmp)
+            return tmp
+        return path
+    @staticmethod
+    def _normalize(cs: str) -> str:
+        """If a bare file path is passed, turn it into a sqlite:/// URI."""
+        cs = cs.strip()
+        # Leave strings that already carry a scheme (e.g. "sqlite:///x.sqlite") untouched
+        if "://" not in cs and cs.endswith((".sqlite", ".db")):
+            return f"sqlite:///{cs}"
+        return cs
+    @staticmethod
+    def _detect_type(cs: str) -> str:
+        if cs.startswith("sqlite"):
+            return "sqlite"
+        if cs.startswith("postgresql") or cs.startswith("postgres"):
+            return "postgresql"
+        if cs.startswith("mysql"):
+            return "mysql"
+        raise ValueError(f"Ne udalos opredelit tip BD: {cs}")
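
Connection-string handling is easy to get subtly wrong, so a small standalone sketch may help. It assumes that a "bare path" is any string without a `://` scheme separator that ends in `.sqlite` or `.db`; the module-level function names are illustrative and are not the repository's methods.

```python
def normalize(cs: str) -> str:
    """Turn a bare SQLite file path into a sqlite:/// URI; leave full URIs as they are."""
    cs = cs.strip()
    if "://" not in cs and cs.endswith((".sqlite", ".db")):
        return f"sqlite:///{cs}"
    return cs


def detect_type(cs: str) -> str:
    """Map a normalized connection string to a backend name."""
    if cs.startswith("sqlite"):
        return "sqlite"
    if cs.startswith("postgres"):  # covers both postgres:// and postgresql://
        return "postgresql"
    if cs.startswith("mysql"):
        return "mysql"
    raise ValueError(f"Cannot detect database type: {cs}")


bare = normalize("data/demo/sales.sqlite")
already_uri = normalize("sqlite:///data/demo/sales.sqlite")
print(bare, already_uri, detect_type("postgres://user@host/db"))
```

Checking for a scheme before adding the prefix keeps normalization idempotent: passing an already normalized URI through `normalize` again returns it unchanged.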

src/db/executor.py ADDED Viewed

	@@ -0,0 +1,152 @@

+"""SqlExecutor: runs a SQL query against the connected database and returns the result.
+Example:
+    executor = SqlExecutor("sqlite:///data/demo/sales.sqlite")
+    result = executor.run("SELECT SUM(amount) FROM orders WHERE status='paid'")
+    print(result.columns)
+    print(result.rows)
+"""
+from __future__ import annotations
+import sqlite3
+from dataclasses import dataclass, field
+from pathlib import Path
+from urllib.parse import urlparse
+@dataclass
+class QueryResult:
+    """Result of executing a SQL query."""
+    columns: list[str]
+    rows: list[list]
+    row_count: int
+    sql: str
+    error: str | None = None
+    @property
+    def success(self) -> bool:
+        return self.error is None
+    def to_dict(self) -> dict:
+        return {
+            "columns": self.columns,
+            "rows": self.rows,
+            "row_count": self.row_count,
+            "sql": self.sql,
+            "error": self.error,
+        }
+    def to_markdown_table(self) -> str:
+        if self.error:
+            return f"Oshibka: {self.error}"
+        if not self.rows:
+            return "(pustoy rezultat)"
+        header = " | ".join(self.columns)
+        sep = " | ".join(["---"] * len(self.columns))
+        rows = "\n".join(" | ".join(str(v) for v in row) for row in self.rows)
+        return f"{header}\n{sep}\n{rows}"
+class SqlExecutor:
+    """Runs SQL against the connected database."""
+    MAX_ROWS = 500
+    def __init__(self, connection_string: str):
+        self.connection_string = connection_string.strip()
+        self._db_type = self._detect_type(self.connection_string)
+    def run(self, sql: str) -> QueryResult:
+        try:
+            if self._db_type == "sqlite":
+                return self._run_sqlite(sql)
+            elif self._db_type == "postgresql":
+                return self._run_postgres(sql)
+            elif self._db_type == "mysql":
+                return self._run_mysql(sql)
+            else:
+                return QueryResult(columns=[], rows=[], row_count=0, sql=sql,
+                                   error=f"Neizvestnyy tip BD: {self._db_type}")
+        except Exception as e:
+            return QueryResult(columns=[], rows=[], row_count=0, sql=sql, error=str(e))
+    def _run_sqlite(self, sql: str) -> QueryResult:
+        path = self._safe_sqlite_path(self._sqlite_path())
+        conn = sqlite3.connect(str(path))
+        conn.text_factory = lambda b: b.decode("utf-8", errors="replace")
+        try:
+            cur = conn.cursor()
+            cur.execute(sql)
+            cols = [d[0] for d in (cur.description or [])]
+            rows = [list(r) for r in cur.fetchmany(self.MAX_ROWS)]
+            return QueryResult(columns=cols, rows=rows, row_count=len(rows), sql=sql)
+        finally:
+            conn.close()
+    def _run_postgres(self, sql: str) -> QueryResult:
+        try:
+            import psycopg2  # type: ignore
+        except ImportError as e:
+            raise ImportError("Ustanovi psycopg2: pip install psycopg2-binary") from e
+        conn = psycopg2.connect(self.connection_string)
+        try:
+            cur = conn.cursor()
+            cur.execute(sql)
+            cols = [d[0] for d in (cur.description or [])]
+            rows = [list(r) for r in cur.fetchmany(self.MAX_ROWS)]
+            return QueryResult(columns=cols, rows=rows, row_count=len(rows), sql=sql)
+        finally:
+            conn.close()
+    def _run_mysql(self, sql: str) -> QueryResult:
+        try:
+            import pymysql  # type: ignore
+        except ImportError as e:
+            raise ImportError("Ustanovi pymysql: pip install pymysql") from e
+        parsed = urlparse(self.connection_string)
+        conn = pymysql.connect(
+            host=parsed.hostname,
+            port=parsed.port or 3306,
+            user=parsed.username,
+            password=parsed.password,
+            database=parsed.path.lstrip("/"),
+        )
+        try:
+            cur = conn.cursor()
+            cur.execute(sql)
+            cols = [d[0] for d in (cur.description or [])]
+            rows = [list(r) for r in cur.fetchmany(self.MAX_ROWS)]
+            return QueryResult(columns=cols, rows=rows, row_count=len(rows), sql=sql)
+        finally:
+            conn.close()
+    def _sqlite_path(self) -> Path:
+        cs = self.connection_string
+        if cs.startswith("sqlite:///"):
+            return Path(cs[10:])
+        return Path(cs)
+    @staticmethod
+    def _safe_sqlite_path(path: Path) -> Path:
+        import shutil
+        import tempfile
+        journal = Path(str(path) + "-journal")
+        wal = Path(str(path) + "-wal")
+        if journal.exists() or wal.exists():
+            # mkdtemp creates a private directory; tempfile.mktemp is deprecated and race-prone
+            tmp = Path(tempfile.mkdtemp()) / path.name
+            shutil.copy2(path, tmp)
+            return tmp
+        return path
+    @staticmethod
+    def _detect_type(cs: str) -> str:
+        if cs.startswith("sqlite") or cs.endswith(".sqlite") or cs.endswith(".db"):
+            return "sqlite"
+        if cs.startswith("postgresql") or cs.startswith("postgres"):
+            return "postgresql"
+        if cs.startswith("mysql"):
+            return "mysql"
+        raise ValueError(f"Ne udalos opredelit tip BD: {cs}")
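
The row cap and the Markdown rendering can be tried without any project code. The sketch below uses an in-memory SQLite database with a made-up table `t`, and a cap of 3 rows instead of 500, purely for illustration.

```python
import sqlite3

MAX_ROWS = 3  # illustrative cap; the executor above uses 500

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t (n INTEGER, sq INTEGER)")
conn.executemany("INSERT INTO t VALUES (?, ?)", [(i, i * i) for i in range(10)])

cur = conn.execute("SELECT n, sq FROM t ORDER BY n")
columns = [d[0] for d in (cur.description or [])]
# fetchmany returns at most MAX_ROWS rows, so a huge result set never fills memory
rows = [list(r) for r in cur.fetchmany(MAX_ROWS)]
conn.close()

header = " | ".join(columns)
separator = " | ".join(["---"] * len(columns))
body = "\n".join(" | ".join(str(v) for v in row) for row in rows)
table = f"{header}\n{separator}\n{body}"
print(table)
```

With this approach `row_count` reflects the number of rows fetched, not the total number of rows the query would return, which is worth keeping in mind when showing it to users.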

src/evaluation/__init__.py ADDED Viewed

File without changes

src/evaluation/evaluate.py ADDED Viewed

	@@ -0,0 +1,72 @@

+"""Script that runs the model on a PAUQ split (dev by default).
+Usage:
+    python -m src.evaluation.evaluate --split dev --limit 50
+"""
+from __future__ import annotations
+import argparse
+import json
+from pathlib import Path
+from tqdm import tqdm
+from src.config import settings
+from src.data.loader import load_pauq_split
+from src.data.schema import SchemaRetriever
+from src.evaluation.metrics import compute_metrics
+from src.models.inference import InferenceEngine
+def main():
+    parser = argparse.ArgumentParser()
+    parser.add_argument("--split", default="dev", choices=["train", "dev", "test"])
+    parser.add_argument("--limit", type=int, default=None, help="Ограничить число примеров")
+    parser.add_argument("--output", type=Path, default=Path("results/predictions.jsonl"))
+    args = parser.parse_args()
+    split_path = settings.pauq_data_dir / f"{args.split}.json"
+    examples = load_pauq_split(split_path)
+    if args.limit:
+        examples = examples[: args.limit]
+    schema_ret = SchemaRetriever(settings.databases_dir)
+    engine = InferenceEngine()
+    engine.load()
+    predictions: list[str] = []
+    golds: list[str] = []
+    db_ids: list[str] = []
+    rows = []
+    for ex in tqdm(examples, desc="Inference"):
+        try:
+            schema = schema_ret.render_schema(ex.db_id)
+        except FileNotFoundError:
+            continue
+        result = engine.generate(schema, ex.question)
+        predictions.append(result.sql)
+        golds.append(ex.query)
+        db_ids.append(ex.db_id)
+        rows.append(
+            {
+                "db_id": ex.db_id,
+                "question": ex.question,
+                "gold": ex.query,
+                "pred": result.sql,
+                "raw": result.raw_output,
+            }
+        )
+    args.output.parent.mkdir(parents=True, exist_ok=True)
+    with args.output.open("w", encoding="utf-8") as f:
+        for r in rows:
+            f.write(json.dumps(r, ensure_ascii=False) + "\n")
+    metrics = compute_metrics(predictions, golds, db_ids, settings.databases_dir)
+    print(json.dumps(metrics, indent=2, ensure_ascii=False))
+if __name__ == "__main__":
+    main()

src/evaluation/metrics.py ADDED Viewed

	@@ -0,0 +1,89 @@

+"""Text-to-SQL metrics: Exact Match and Execution Accuracy."""
+from __future__ import annotations
+import sqlite3
+from pathlib import Path
+from src.models.postprocess import normalize_sql
+def exact_match(predicted: str, gold: str, dialect: str = "sqlite") -> bool:
+    """Compares the normalized SQL strings character by character. A crude but honest metric."""
+    return normalize_sql(predicted, dialect) == normalize_sql(gold, dialect)
+def execution_accuracy(
+    predicted_sql: str,
+    gold_sql: str,
+    db_path: Path | str,
+    timeout_seconds: float = 5.0,
+) -> bool:
+    """Runs both queries on SQLite. True if the results match as multisets (row order is ignored)."""
+    db_path = Path(db_path)
+    conn = sqlite3.connect(f"file:{db_path}?mode=ro", uri=True, timeout=timeout_seconds)
+    try:
+        conn.text_factory = lambda b: b.decode("utf-8", errors="replace")
+        try:
+            pred_rows = _run(conn, predicted_sql)
+        except sqlite3.Error:
+            return False
+        try:
+            gold_rows = _run(conn, gold_sql)
+        except sqlite3.Error:
+            return False
+        return _rows_equal(pred_rows, gold_rows)
+    finally:
+        conn.close()
+def _run(conn: sqlite3.Connection, sql: str) -> list[tuple]:
+    cur = conn.cursor()
+    cur.execute(sql)
+    return cur.fetchall()
+def _rows_equal(a: list[tuple], b: list[tuple]) -> bool:
+    """Compares rows as multisets. Order is always ignored, even when the SQL contains ORDER BY."""
+    if len(a) != len(b):
+        return False
+    return sorted(map(_row_key, a)) == sorted(map(_row_key, b))
+def _row_key(row: tuple) -> tuple:
+    return tuple(str(x) for x in row)
+def compute_metrics(
+    predictions: list[str],
+    golds: list[str],
+    db_ids: list[str],
+    databases_dir: Path | str,
+) -> dict:
+    """Runs over the whole dataset. Returns a dict with EM, EX, and parse_fail (examples whose database file is missing)."""
+    databases_dir = Path(databases_dir)
+    n = len(predictions)
+    assert n == len(golds) == len(db_ids), "Mismatched lengths"
+    em_count = 0
+    ex_count = 0
+    parse_fail = 0
+    for pred, gold, db_id in zip(predictions, golds, db_ids):
+        if exact_match(pred, gold):
+            em_count += 1
+        db_path = databases_dir / db_id / f"{db_id}.sqlite"
+        if not db_path.exists():
+            parse_fail += 1
+            continue
+        if execution_accuracy(pred, gold, db_path):
+            ex_count += 1
+    return {
+        "n": n,
+        "exact_match": em_count / n if n else 0.0,
+        "execution_accuracy": ex_count / n if n else 0.0,
+        "parse_fail": parse_fail,
+    }
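
The multiset comparison behind Execution Accuracy can be checked on its own. The sketch below uses `collections.Counter` instead of the sorted lists used above; this is a swapped-in but equivalent way to compare multisets, and the sample rows are invented.

```python
from collections import Counter


def rows_equal(a, b):
    """Order-insensitive comparison that still respects duplicate rows."""
    def key(row):
        # Values are stringified, so 1 and "1" compare as equal
        return tuple(str(x) for x in row)
    return Counter(map(key, a)) == Counter(map(key, b))


same_rows_reordered = rows_equal([(1, "a"), (2, "b")], [(2, "b"), (1, "a")])
duplicate_dropped = rows_equal([(1,), (1,)], [(1,)])
int_vs_text = rows_equal([(1,)], [("1",)])
print(same_rows_reordered, duplicate_dropped, int_vs_text)
```

Two consequences follow from this design: a prediction that returns the right rows in the wrong order still counts as correct, and type differences between columns are hidden by the string conversion. Both make the metric more lenient than a strict comparison.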

src/models/__init__.py ADDED Viewed

File without changes

src/models/inference.py ADDED Viewed

	@@ -0,0 +1,94 @@

+"""Loads the base model plus the LoRA adapter and runs inference.
+On a desktop or laptop without a GPU it runs on CPU: slow, but sufficient for development and demos.
+On Kaggle/Colab it runs on GPU, which is faster.
+"""
+from __future__ import annotations
+from dataclasses import dataclass
+from pathlib import Path
+import torch
+from transformers import AutoModelForCausalLM, AutoTokenizer
+from src.config import settings
+from src.data.prompt import build_chat_messages
+from src.models.postprocess import postprocess
+@dataclass
+class GenerationResult:
+    sql: str
+    raw_output: str
+class InferenceEngine:
+    """Wrapper around the model, intended to be used as a singleton: loaded once at API startup."""
+    def __init__(
+        self,
+        base_model_name: str | None = None,
+        lora_adapter_path: str | None = None,
+        device: str | None = None,
+    ):
+        self.base_model_name = base_model_name or settings.base_model_name
+        self.lora_adapter_path = lora_adapter_path or settings.lora_adapter_path
+        self.device = device or settings.device
+        self.tokenizer = None
+        self.model = None
+        self._loaded = False
+    def load(self) -> None:
+        """Loads the model on the first call; later calls are no-ops. No quantization on CPU."""
+        if self._loaded:
+            return
+        self.tokenizer = AutoTokenizer.from_pretrained(self.base_model_name)
+        # bfloat16 takes half the memory of float32 (~6 GB vs ~12 GB) and is supported on CPU
+        self.model = AutoModelForCausalLM.from_pretrained(
+            self.base_model_name,
+            dtype=torch.bfloat16,
+            device_map=self.device if self.device != "cpu" else None,
+        )
+        # Attach the LoRA adapter: use the local path if it exists, otherwise treat the value as an HF Hub ID
+        adapter_path = Path(self.lora_adapter_path)
+        adapter_id = str(adapter_path) if adapter_path.exists() else self.lora_adapter_path
+        try:
+            from peft import PeftModel
+            self.model = PeftModel.from_pretrained(self.model, adapter_id)
+        except ImportError:
+            pass  # peft is not installed; fall back to the base model
+        self.model.eval()
+        self._loaded = True
+    def generate(
+        self,
+        schema: str,
+        question: str,
+        max_new_tokens: int | None = None,
+    ) -> GenerationResult:
+        """Takes the schema (DDL text) and a question, returns SQL."""
+        if not self._loaded:
+            self.load()
+        messages = build_chat_messages(schema, question)
+        prompt = self.tokenizer.apply_chat_template(
+            messages, tokenize=False, add_generation_prompt=True
+        )
+        inputs = self.tokenizer(prompt, return_tensors="pt").to(self.model.device)
+        with torch.no_grad():
+            output_ids = self.model.generate(
+                **inputs,
+                max_new_tokens=max_new_tokens or settings.max_new_tokens,
+                do_sample=settings.do_sample,
+                temperature=settings.temperature if settings.do_sample else 1.0,
+                pad_token_id=self.tokenizer.eos_token_id,
+            )
+        new_tokens = output_ids[0][inputs["input_ids"].shape[1] :]
+        raw = self.tokenizer.decode(new_tokens, skip_special_tokens=True)
+        return GenerationResult(sql=postprocess(raw), raw_output=raw)

src/models/postprocess.py ADDED Viewed

	@@ -0,0 +1,50 @@

+"""SQL post-processing: cleans the model output and does basic validation via sqlglot."""
+from __future__ import annotations
+import re
+import sqlglot
+from sqlglot.errors import ParseError
+def strip_model_artifacts(text: str) -> str:
+    """Removes markdown blocks, prefixes, and extra text after the SQL."""
+    # ```sql ... ```
+    m = re.search(r"```(?:sql)?\s*(.*?)```", text, re.DOTALL | re.IGNORECASE)
+    if m:
+        text = m.group(1)
+    # Strip a leading "SQL:", "Ответ:" ("Answer:"), and similar prefixes
+    text = re.sub(r"^\s*(?:SQL|Ответ|Answer)\s*:\s*", "", text, flags=re.IGNORECASE)
+    # If there are several SQL statements, keep the first one, up to the semicolon
+    text = text.strip()
+    if ";" in text:
+        head, _, _ = text.partition(";")
+        text = head.strip() + ";"
+    return text.strip()
+def is_valid_sql(sql: str, dialect: str = "sqlite") -> bool:
+    """Whether sqlglot can parse the SQL."""
+    try:
+        sqlglot.parse_one(sql, dialect=dialect)
+        return True
+    except (ParseError, sqlglot.errors.TokenError):
+        # TokenError covers tokenizer failures such as an unterminated string literal
+        return False
+def normalize_sql(sql: str, dialect: str = "sqlite") -> str:
+    """Normalization for Exact Match: lowercases the whole query (string literals included) and unifies whitespace."""
+    try:
+        return sqlglot.parse_one(sql, dialect=dialect).sql(dialect=dialect, pretty=False).lower()
+    except (ParseError, sqlglot.errors.TokenError):
+        # If it does not parse, just lowercase and collapse whitespace
+        return re.sub(r"\s+", " ", sql.lower()).strip().rstrip(";")
+def postprocess(raw_output: str) -> str:
+    """Full post-processing pipeline."""
+    return strip_model_artifacts(raw_output)
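
The cleanup steps can be exercised without sqlglot. The following is a hedged re-implementation of the artifact-stripping idea using only `re`. The function name and the sample model outputs are made up, and the fence marker is built from single backticks only so that the example can sit inside a fenced block.

```python
import re

FENCE = "`" * 3  # the markdown code-fence marker
FENCED_BLOCK = re.compile(
    FENCE + r"(?:sql)?\s*(.*?)" + FENCE, re.DOTALL | re.IGNORECASE
)
PREFIX = re.compile(r"^\s*(?:SQL|Ответ|Answer)\s*:\s*", re.IGNORECASE)


def strip_artifacts(text: str) -> str:
    """Keep only the first SQL statement from a raw model reply."""
    match = FENCED_BLOCK.search(text)
    if match:
        text = match.group(1)  # content of the first fenced block
    text = PREFIX.sub("", text).strip()
    if ";" in text:
        # Cut at the first semicolon and keep it as the terminator
        text = text.partition(";")[0].strip() + ";"
    return text


fenced = strip_artifacts(FENCE + "sql\nSELECT 1;\n" + FENCE + "\nThis query returns one.")
prefixed = strip_artifacts("SQL: SELECT a FROM t; SELECT b FROM t;")
print(fenced, prefixed)
```

Cutting at the first semicolon is a heuristic: a semicolon inside a string literal (for example `WHERE name = 'a;b'`) would truncate the statement, so a tokenizer-aware split may be preferable if such queries are expected.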

streamlit_app.py ADDED Viewed

	@@ -0,0 +1,375 @@

+"""Streamlit interface for the Ru2SQL utility.
+Run:
+    streamlit run streamlit_app.py
+Features:
+    - Connect to any SQLite/PostgreSQL/MySQL database
+    - Load a company business vocabulary from a YAML file or edit it directly in the browser
+    - Take a question in Russian → generate SQL → execute it → show the result
+    - Keep the query history for the current session
+"""
+from __future__ import annotations
+import sys
+import time
+from pathlib import Path
+import streamlit as st
+# Make the src/ package importable
+ROOT = Path(__file__).resolve().parent
+sys.path.insert(0, str(ROOT))
+# ──────────────────────────────────────────────
+# Page configuration
+# ──────────────────────────────────────────────
+st.set_page_config(
+    page_title="Ru2SQL — Natural Language → SQL",
+    page_icon="🗄️",
+    layout="wide",
+    initial_sidebar_state="expanded",
+)
+# ──────────────────────────────────────────────
+# CSS
+# ──────────────────────────────────────────────
+st.markdown("""
+<style>
+    .sql-box {
+        background: #1e1e2e;
+        color: #cdd6f4;
+        font-family: 'Courier New', monospace;
+        font-size: 14px;
+        padding: 16px;
+        border-radius: 8px;
+        border-left: 4px solid #89b4fa;
+        white-space: pre-wrap;
+        margin: 8px 0;
+    }
+    .metric-card {
+        background: #313244;
+        padding: 12px 16px;
+        border-radius: 8px;
+        text-align: center;
+    }
+    .status-ok  { color: #a6e3a1; font-weight: bold; }
+    .status-err { color: #f38ba8; font-weight: bold; }
+    .history-item {
+        border-left: 3px solid #89b4fa;
+        padding: 8px 12px;
+        margin: 6px 0;
+        background: #1e1e2e;
+        border-radius: 0 6px 6px 0;
+    }
+</style>
+""", unsafe_allow_html=True)
+# ──────────────────────────────────────────────
+# Session state
+# ──────────────────────────────────────────────
+def _default_vocab_yaml() -> str:
+    example = ROOT / "configs" / "example_vocabulary.yaml"
+    if example.exists():
+        return example.read_text(encoding="utf-8")
+    return (
+        "company: Моя компания\n\n"
+        "terms:\n"
+        "  выручка: SUM(orders.amount) WHERE status = 'paid'\n\n"
+        "filters:\n"
+        "  только_оплаченные: orders.status = 'paid'\n\n"
+        "notes: []\n"
+    )
+def _init_state():
+    defaults = {
+        "history": [],
+        "model_loaded": False,
+        "engine": None,
+        "db_connector": None,
+        "db_executor": None,
+        "vocabulary": None,
+        "db_connection_string": "",
+        "vocab_yaml": _default_vocab_yaml(),
+    }
+    for k, v in defaults.items():
+        if k not in st.session_state:
+            st.session_state[k] = v
+_init_state()
+# ──────────────────────────────────────────────
+# Helper functions
+# ──────────────────────────────────────────────
+@st.cache_resource(show_spinner="Загружаю модель… (~30 с на первый раз)")
+def _load_engine():
+    from src.models.inference import InferenceEngine
+    engine = InferenceEngine()
+    engine.load()
+    return engine
+def _connect_db(cs: str):
+    from src.db.connector import DbConnector
+    from src.db.executor import SqlExecutor
+    connector = DbConnector(cs)
+    executor = SqlExecutor(cs)
+    return connector, executor
+def _load_vocab_from_yaml(yaml_text: str):
+    import tempfile
+    from src.business.vocabulary import BusinessVocabulary
+    # mkdtemp creates a private directory; tempfile.mktemp is deprecated and race-prone
+    tmp = Path(tempfile.mkdtemp()) / "vocabulary.yaml"
+    tmp.write_text(yaml_text, encoding="utf-8")
+    vocab = BusinessVocabulary.from_yaml(tmp)
+    tmp.unlink(missing_ok=True)
+    return vocab
+# ──────────────────────────────────────────────
+# Sidebar
+# ──────────────────────────────────────────────
+with st.sidebar:
+    st.title("⚙️ Настройки")
+    # ── Model: loaded automatically at startup ──
+    st.subheader("🤖 Модель")
+    if not st.session_state.model_loaded:
+        with st.spinner("Загружаю модель…"):
+            try:
+                st.session_state.engine = _load_engine()
+                st.session_state.model_loaded = True
+            except Exception as e:
+                st.error(f"Ошибка загрузки модели: {e}")
+    if st.session_state.model_loaded:
+        st.markdown('<span class="status-ok">✅ Модель готова</span>', unsafe_allow_html=True)
+    else:
+        st.markdown('<span class="status-err">⚠️ Модель не загружена</span>', unsafe_allow_html=True)
+    st.divider()
+    # ── Database ──
+    st.subheader("🗄️ База данных")
+    db_type = st.radio("Тип подключения", ["SQLite файл", "Строка подключения"],
+                       horizontal=True)
+    if db_type == "SQLite файл":
+        uploaded = st.file_uploader("Загрузить .sqlite файл", type=["sqlite", "db"])
+        use_demo = st.checkbox("Использовать демо-базу", value=True)
+        if use_demo:
+            demo_path = ROOT / "data" / "demo" / "sales.sqlite"
+            cs = str(demo_path)
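+        elif uploaded:
+            import tempfile
+            # tempfile.mktemp() is deprecated and race-prone; create the file atomically.
+            # getvalue() returns the full upload regardless of the read position.
+            with tempfile.NamedTemporaryFile(suffix=".sqlite", delete=False) as fh:
+                fh.write(uploaded.getvalue())
+            cs = fh.name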
+        else:
+            cs = ""
+    else:
+        cs = st.text_input(
+            "Строка подключения",
+            placeholder="postgresql://user:pass@localhost/mydb",
+            value=st.session_state.db_connection_string,
+        )
+    if cs and st.button("Подключиться к БД", use_container_width=True):
+        try:
+            connector, executor = _connect_db(cs)
+            tables = connector.list_tables()
+            st.session_state.db_connector = connector
+            st.session_state.db_executor = executor
+            st.session_state.db_connection_string = cs
+            st.success(f"Подключено! Таблиц: {len(tables)}")
+        except Exception as e:
+            st.error(f"Ошибка подключения: {e}")
+    if st.session_state.db_connector:
+        tables = st.session_state.db_connector.list_tables()
+        st.markdown('<span class="status-ok">✅ БД подключена</span>', unsafe_allow_html=True)
+        with st.expander("Таблицы"):
+            for t in tables:
+                st.code(t)
+    st.divider()
+    # ── Business vocabulary ──
+    st.subheader("📖 Бизнес-словарь")
+    vocab_yaml = st.text_area(
+        "YAML-конфигурация",
+        value=st.session_state.vocab_yaml,
+        height=260,
+        help="Определите термины вашей компании — модель будет их учитывать при генерации SQL",
+    )
+    st.session_state.vocab_yaml = vocab_yaml
+    if st.button("Применить словарь", use_container_width=True):
+        try:
+            st.session_state.vocabulary = _load_vocab_from_yaml(vocab_yaml)
+            st.success("Словарь применён!")
+        except Exception as e:
+            st.error(f"Ошибка в YAML: {e}")
+    if st.session_state.vocabulary:
+        v = st.session_state.vocabulary
+        st.markdown(f'<span class="status-ok">✅ Словарь: {v.company or "загружен"}</span>',
+                    unsafe_allow_html=True)
+        terms_count = len(v.terms)
+        if terms_count:
+            st.caption(f"{terms_count} терминов определено")
+# ──────────────────────────────────────────────
+# Main area
+# ──────────────────────────────────────────────
+st.title("🗄️ Ru2SQL — Бизнес-аналитика на русском языке")
+st.caption("Задайте вопрос на русском → получите SQL и данные из вашей базы")
+tab_query, tab_schema, tab_history = st.tabs(["💬 Запрос", "📐 Схема БД", "🕓 История"])
+# ──────────── Tab: Query ────────────
+with tab_query:
+    ready = st.session_state.model_loaded and st.session_state.db_connector is not None
+    if not ready:
+        cols = st.columns(2)
+        with cols[0]:
+            if not st.session_state.model_loaded:
+                st.warning("⚠️ Загрузите модель в левой панели")
+        with cols[1]:
+            if st.session_state.db_connector is None:
+                st.warning("⚠️ Подключитесь к базе данных в левой панели")
+    question = st.text_area(
+        "Ваш вопрос",
+        placeholder="Например: Какая выручка за январь этого года?",
+        height=100,
+        disabled=not ready,
+    )
+    col_btn, col_hint = st.columns([1, 4])
+    with col_btn:
+        run_btn = st.button("▶ Выполнить", type="primary",
+                            disabled=not ready or not question.strip(),
+                            use_container_width=True)
+    with col_hint:
+        if ready:
+            st.caption("Модель сгенерирует SQL и выполнит его на вашей БД")
+    # Quick examples: shown only when both the model and the DB are ready,
+    # otherwise clicking one would call generate() on a missing engine.
+    if ready and "sales" in st.session_state.db_connection_string:
+        st.caption("💡 Попробуйте:")
+        example_cols = st.columns(3)
+        examples = [
+            "Какая выручка за 2026 год?",
+            "Топ-5 клиентов по сумме заказов",
+            "Сколько заказов по каждому менеджеру?",
+        ]
+        for i, ex in enumerate(examples):
+            with example_cols[i]:
+                if st.button(ex, key=f"ex_{i}", use_container_width=True):
+                    question = ex
+                    run_btn = True
+    if run_btn and question.strip():
+        engine = st.session_state.engine
+        connector = st.session_state.db_connector
+        executor = st.session_state.db_executor
+        vocab = st.session_state.vocabulary
+        # Enrich the question with the business vocabulary
+        enriched_question = vocab.enrich_prompt(question) if vocab else question
+        # Fetch the schema
+        schema = connector.render_schema(include_samples=True)
+        with st.spinner("Генерирую SQL…"):
+            t0 = time.time()
+            result = engine.generate(schema, enriched_question)
+            gen_time = time.time() - t0
+        st.subheader("Сгенерированный SQL")
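+        from html import escape
+        # Model output is untrusted text; escape it before rendering as raw HTML.
+        st.markdown(f'<div class="sql-box">{escape(result.sql)}</div>', unsafe_allow_html=True)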
+        col1, col2 = st.columns(2)
+        col1.metric("Время генерации", f"{gen_time:.1f} с")
+        # Execute the SQL
+        if result.sql.strip():
+            with st.spinner("Выполняю запрос…"):
+                qr = executor.run(result.sql)
+            if qr.success:
+                col2.metric("Строк в результате", qr.row_count)
+                st.subheader("Результат")
+                if qr.rows:
+                    import pandas as pd
+                    df = pd.DataFrame(qr.rows, columns=qr.columns)
+                    st.dataframe(df, use_container_width=True)
+                else:
+                    st.info("Запрос выполнен успешно, результат пустой")
+            else:
+                col2.error("Ошибка выполнения")
+                st.error(f"SQL ошибка: {qr.error}")
+        # Append to the history
+        st.session_state.history.append({
+            "question": question,
+            "sql": result.sql,
+            "success": qr.success if result.sql.strip() else False,
+            "rows": qr.row_count if result.sql.strip() and qr.success else 0,
+            "time": gen_time,
+        })
+# ──────────── Tab: DB schema ────────────
+with tab_schema:
+    if st.session_state.db_connector is None:
+        st.info("Подключитесь к базе данных в левой панели")
+    else:
+        connector = st.session_state.db_connector
+        st.subheader("Структура базы данных")
+        show_samples = st.toggle("Показывать примеры строк", value=True)
+        for table in connector.get_schema(include_samples=show_samples):
+            with st.expander(f"📋 {table.name}  ({len(table.columns)} колонок)"):
+                st.code(table.to_ddl(), language="sql")
+                if show_samples and table.sample_rows:
+                    import pandas as pd
+                    cols = [c.name for c in table.columns]
+                    st.caption("Примеры строк:")
+                    st.dataframe(
+                        pd.DataFrame(table.sample_rows, columns=cols),
+                        use_container_width=True,
+                    )
+# ──────────── Tab: History ────────────
+with tab_history:
+    history = st.session_state.history
+    if not history:
+        st.info("История запросов пуста. Задайте первый вопрос на вкладке «Запрос».")
+    else:
+        st.subheader(f"История запросов ({len(history)})")
+        if st.button("Очистить историю"):
+            st.session_state.history = []
+            st.rerun()
+        for i, item in enumerate(reversed(history)):
+            status = "✅" if item["success"] else "❌"
+            with st.expander(f"{status} {item['question']}", expanded=(i == 0)):
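+                from html import escape
+                # Stored SQL is model output; escape it before rendering as raw HTML.
+                st.markdown(f'<div class="sql-box">{escape(item["sql"])}</div>', unsafe_allow_html=True)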
+                cols = st.columns(3)
+                cols[0].metric("Время генерации", f"{item['time']:.1f} с")
+                cols[1].metric("Строк", item["rows"])
+                cols[2].metric("Статус", "OK" if item["success"] else "Ошибка")
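
The app places model-generated SQL inside a raw HTML `<div class="sql-box">` rendered with `unsafe_allow_html=True`. A minimal sketch of escaping that text first, using only the standard library. `render_sql_box` is a hypothetical helper name and is not part of this commit:

```python
from html import escape


def render_sql_box(sql: str) -> str:
    """Wrap SQL text in the sql-box div, escaping HTML metacharacters.

    Hypothetical helper: the class name "sql-box" is taken from the app,
    everything else here is an illustration.
    """
    # escape() converts &, <, > and quotes, so the SQL is shown as literal text
    # instead of being interpreted as markup.
    return f'<div class="sql-box">{escape(sql)}</div>'


if __name__ == "__main__":
    print(render_sql_box("SELECT name FROM users WHERE age < 30"))
```

The returned string could then be passed to `st.markdown(..., unsafe_allow_html=True)` in both the query tab and the history tab.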

tests/__init__.py ADDED

File without changes

tests/test_metrics.py ADDED

	@@ -0,0 +1,56 @@

+"""Tests for the EM and EX metrics."""
+import sqlite3
+from pathlib import Path
+import pytest
+from src.evaluation.metrics import exact_match, execution_accuracy
+def test_exact_match_simple():
+    assert exact_match("SELECT * FROM t", "select * from t")
+def test_exact_match_whitespace():
+    assert exact_match("SELECT  *  FROM  t", "SELECT * FROM t")
+def test_exact_match_negative():
+    assert not exact_match("SELECT a FROM t", "SELECT b FROM t")
+@pytest.fixture
+def tmp_sqlite(tmp_path: Path) -> Path:
+    db = tmp_path / "tiny.sqlite"
+    conn = sqlite3.connect(db)
+    conn.execute("CREATE TABLE users (id INT, name TEXT)")
+    conn.executemany("INSERT INTO users VALUES (?, ?)", [(1, "a"), (2, "b")])
+    conn.commit()
+    conn.close()
+    return db
+def test_execution_accuracy_match(tmp_sqlite: Path):
+    pred = "SELECT id FROM users ORDER BY id"
+    gold = "SELECT id FROM users ORDER BY id"
+    assert execution_accuracy(pred, gold, tmp_sqlite)
+def test_execution_accuracy_set_equal(tmp_sqlite: Path):
+    pred = "SELECT id FROM users ORDER BY id DESC"
+    gold = "SELECT id FROM users ORDER BY id ASC"
+    # Row order is not checked: compared as sets, the results are equal
+    assert execution_accuracy(pred, gold, tmp_sqlite)
+def test_execution_accuracy_mismatch(tmp_sqlite: Path):
+    pred = "SELECT id FROM users WHERE id = 1"
+    gold = "SELECT id FROM users WHERE id = 2"
+    assert not execution_accuracy(pred, gold, tmp_sqlite)
+def test_execution_accuracy_invalid_pred(tmp_sqlite: Path):
+    pred = "SELEC bad sql"
+    gold = "SELECT id FROM users"
+    assert not execution_accuracy(pred, gold, tmp_sqlite)
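
The tests above pin down the behaviour expected from `exact_match` and `execution_accuracy`: case- and whitespace-insensitive comparison, order-insensitive result comparison, and `False` for SQL that fails to run. `src/evaluation/metrics.py` is not shown in this part of the diff, so the functions below are hypothetical stand-ins that would satisfy the same assertions, not the repository's implementation:

```python
import re
import sqlite3
from collections import Counter


def exact_match_sketch(pred: str, gold: str) -> bool:
    """Compare two queries after lowercasing and collapsing whitespace."""

    def norm(sql: str) -> str:
        # Caveat: lowercasing also changes string literals inside the query.
        return re.sub(r"\s+", " ", sql.strip().rstrip(";")).lower()

    return norm(pred) == norm(gold)


def execution_accuracy_sketch(pred: str, gold: str, db_path) -> bool:
    """Run both queries on a SQLite file and compare rows, ignoring order."""
    conn = sqlite3.connect(str(db_path))
    try:
        gold_rows = conn.execute(gold).fetchall()
        pred_rows = conn.execute(pred).fetchall()
    except sqlite3.Error:
        # A query that does not execute counts as a miss.
        return False
    finally:
        conn.close()
    # Counter keeps duplicate rows, so this is a multiset comparison.
    return Counter(pred_rows) == Counter(gold_rows)
```

Comparing multisets rather than sets keeps duplicate rows significant, while still treating `ORDER BY id ASC` and `ORDER BY id DESC` as equivalent.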

tests/test_postprocess.py ADDED

	@@ -0,0 +1,46 @@

+"""Tests for SQL post-processing."""
+from src.models.postprocess import (
+    is_valid_sql,
+    normalize_sql,
+    postprocess,
+    strip_model_artifacts,
+)
+def test_strip_markdown_block():
+    raw = "```sql\nSELECT * FROM users;\n```"
+    assert strip_model_artifacts(raw).startswith("SELECT")
+def test_strip_sql_prefix():
+    raw = "SQL: SELECT 1;"
+    assert strip_model_artifacts(raw).startswith("SELECT")
+def test_keeps_first_statement():
+    raw = "SELECT 1; SELECT 2;"
+    out = strip_model_artifacts(raw)
+    assert "SELECT 1" in out
+    assert "SELECT 2" not in out
+def test_valid_sql():
+    assert is_valid_sql("SELECT * FROM students WHERE id = 1")
+def test_invalid_sql():
+    assert not is_valid_sql("SELEC * FRM where")
+def test_normalize_em():
+    a = "SELECT  *  FROM  Users"
+    b = "select * from users"
+    assert normalize_sql(a) == normalize_sql(b)
+def test_postprocess_full():
+    raw = "```sql\nSELECT name FROM students WHERE group_id = 1;\nSELECT 2;\n```"
+    out = postprocess(raw)
+    assert out.startswith("SELECT name")
+    assert "SELECT 2" not in out
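
These tests describe three clean-up steps for raw model output: unwrap a Markdown code fence, drop a leading `SQL:` label, and keep only the first statement. A rough sketch of one way to do that follows. `strip_artifacts_sketch` is a hypothetical name, and the real `strip_model_artifacts` in `src/models/postprocess.py` may work differently:

```python
import re

# Build the fence marker programmatically to keep this example easy to embed.
_FENCE = "`" * 3
_FENCE_RE = re.compile(
    _FENCE + r"(?:sql)?\s*(.*?)" + _FENCE, re.IGNORECASE | re.DOTALL
)
_PREFIX_RE = re.compile(r"^\s*sql\s*:\s*", re.IGNORECASE)


def strip_artifacts_sketch(raw: str) -> str:
    """Return the first SQL statement found in raw model output."""
    text = raw.strip()
    # 1. If the output is wrapped in a Markdown fence, keep only its contents.
    match = _FENCE_RE.search(text)
    if match:
        text = match.group(1).strip()
    # 2. Drop a leading "SQL:" label.
    text = _PREFIX_RE.sub("", text)
    # 3. Keep the first statement. This split is naive: a ';' inside a
    #    string literal would also cut the statement short.
    return text.split(";", 1)[0].strip()
```

A parser-based splitter would be more robust than `str.split` when string literals can contain semicolons.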

tests/test_prompt.py ADDED

	@@ -0,0 +1,32 @@

+"""Tests for PromptBuilder."""
+from src.data.prompt import (
+    SYSTEM_PROMPT,
+    build_chat_messages,
+    build_training_example,
+    build_user_message,
+)
+def test_user_message_contains_parts():
+    msg = build_user_message("CREATE TABLE t (id INT);", "Покажи всё")
+    assert "Schema:" in msg
+    assert "Question:" in msg
+    assert "SQL:" in msg
+    assert "CREATE TABLE" in msg
+    assert "Покажи всё" in msg
+def test_chat_messages_have_system_and_user():
+    msgs = build_chat_messages("schema", "question")
+    assert len(msgs) == 2
+    assert msgs[0]["role"] == "system"
+    assert msgs[0]["content"] == SYSTEM_PROMPT
+    assert msgs[1]["role"] == "user"
+def test_training_example_has_assistant():
+    msgs = build_training_example("schema", "question", "SELECT 1")
+    assert len(msgs) == 3
+    assert msgs[2]["role"] == "assistant"
+    assert msgs[2]["content"] == "SELECT 1"
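
The assertions above imply a simple message layout: a system message, a user message containing `Schema:`, `Question:` and `SQL:` markers, and, for training, a final assistant message holding the target query. A minimal sketch of builders with that shape, assuming a placeholder system prompt. The `_sketch` names and the prompt text are illustrative, and the actual `src/data/prompt.py` is not shown here:

```python
# Placeholder text: the real SYSTEM_PROMPT is defined in src/data/prompt.py.
SYSTEM_PROMPT_SKETCH = "You translate questions into SQL for the given schema."


def build_user_message_sketch(schema: str, question: str) -> str:
    """Combine the schema and the question into a single user turn."""
    # The trailing "SQL:" cues the model to answer with a query only.
    return f"Schema:\n{schema}\n\nQuestion: {question}\n\nSQL:"


def build_chat_messages_sketch(schema: str, question: str) -> list:
    """Inference-time messages: system prompt plus one user turn."""
    return [
        {"role": "system", "content": SYSTEM_PROMPT_SKETCH},
        {"role": "user", "content": build_user_message_sketch(schema, question)},
    ]


def build_training_example_sketch(schema: str, question: str, sql: str) -> list:
    """Training-time messages: the inference messages plus the gold SQL."""
    messages = build_chat_messages_sketch(schema, question)
    messages.append({"role": "assistant", "content": sql})
    return messages
```

Reusing the inference builder for training keeps the two prompt formats identical, which matters for a fine-tuned adapter.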

tests/test_schema.py ADDED

	@@ -0,0 +1,44 @@

+"""Tests for SchemaRetriever."""
+import sqlite3
+from pathlib import Path
+import pytest
+from src.data.schema import SchemaRetriever
+@pytest.fixture
+def fake_databases_dir(tmp_path: Path) -> Path:
+    """Create the databases/uni/uni.sqlite layout with two tables."""
+    db_id = "uni"
+    (tmp_path / db_id).mkdir()
+    db_path = tmp_path / db_id / f"{db_id}.sqlite"
+    conn = sqlite3.connect(db_path)
+    conn.execute("CREATE TABLE students (id INTEGER PRIMARY KEY, name TEXT)")
+    conn.execute("CREATE TABLE groups (id INTEGER PRIMARY KEY, faculty TEXT)")
+    conn.execute("INSERT INTO students VALUES (1, 'Иван')")
+    conn.execute("INSERT INTO groups VALUES (10, 'ПИ')")
+    conn.commit()
+    conn.close()
+    return tmp_path
+def test_list_databases(fake_databases_dir: Path):
+    r = SchemaRetriever(fake_databases_dir)
+    assert r.list_databases() == ["uni"]
+def test_get_tables(fake_databases_dir: Path):
+    r = SchemaRetriever(fake_databases_dir)
+    tables = r.get_tables("uni")
+    names = sorted(t.name for t in tables)
+    assert names == ["groups", "students"]
+def test_render_schema_contains_create(fake_databases_dir: Path):
+    r = SchemaRetriever(fake_databases_dir)
+    text = r.render_schema("uni")
+    assert "CREATE TABLE" in text
+    assert "students" in text
+    assert "groups" in text
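
The fixture above uses a `<db_id>/<db_id>.sqlite` directory layout, and the tests expect the rendered schema to contain the `CREATE TABLE` statements. One plausible way to get both from SQLite is to read the `sqlite_master` catalog. The sketch below is an assumption about the approach, with hypothetical function names, and is not the code in `src/data/schema.py`:

```python
from __future__ import annotations

import sqlite3
from pathlib import Path


def list_databases_sketch(root: Path) -> list[str]:
    """List db ids, i.e. sub-directories that contain <db_id>/<db_id>.sqlite."""
    return sorted(
        p.name for p in root.iterdir() if (p / f"{p.name}.sqlite").is_file()
    )


def render_schema_sketch(db_path: Path) -> str:
    """Return the CREATE TABLE statements of all user tables as one string."""
    conn = sqlite3.connect(str(db_path))
    try:
        # sqlite_master stores the original DDL text of every table;
        # names starting with "sqlite" are internal bookkeeping tables.
        rows = conn.execute(
            "SELECT sql FROM sqlite_master "
            "WHERE type = 'table' AND name NOT LIKE 'sqlite_%' "
            "ORDER BY name"
        ).fetchall()
    finally:
        conn.close()
    return "\n\n".join(f"{sql};" for (sql,) in rows if sql)
```

Reading the stored DDL keeps primary-key and type information exactly as the database author wrote it, which is usually what a text-to-SQL prompt needs.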