Upload folder using huggingface_hub
Browse files- README.md +61 -11
- generation_config.json +3 -1
- tokenizer_config.json +2 -1
README.md
CHANGED
|
@@ -1,14 +1,3 @@
|
|
| 1 |
-
---
|
| 2 |
-
license: other
|
| 3 |
-
license_name: nvidia-open-model-license
|
| 4 |
-
license_link: >-
|
| 5 |
-
https://www.nvidia.com/en-us/agreements/enterprise-software/nvidia-open-model-license/
|
| 6 |
-
pipeline_tag: image-text-to-text
|
| 7 |
-
library_name: transformers
|
| 8 |
-
tags:
|
| 9 |
-
- nvidia
|
| 10 |
-
- VLM
|
| 11 |
-
---
|
| 12 |
# Nemotron-Parse-Lite Overview
|
| 13 |
|
| 14 |
nemotron-parse-lite is a general-purpose text-extraction model, specifically designed to handle documents. Given an image, nemotron-parse-lite is able to extract formatted text, with bounding boxes and the corresponding semantic class. This has downstream benefits for several tasks, such as increasing the availability of training data for Large Language Models (LLMs), improving the accuracy of retriever systems, and enhancing document-understanding pipelines.
|
|
@@ -160,6 +149,67 @@ for bbox in bboxes:
|
|
| 160 |
draw.rectangle((bbox[0], bbox[1], bbox[2], bbox[3]), outline="red")
|
| 161 |
```
|
| 162 |
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 163 |
|
| 164 |
## Training, Testing, and Evaluation Datasets:
|
| 165 |
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 1 |
# Nemotron-Parse-Lite Overview
|
| 2 |
|
| 3 |
nemotron-parse-lite is a general-purpose text-extraction model, specifically designed to handle documents. Given an image, nemotron-parse-lite is able to extract formatted text, with bounding boxes and the corresponding semantic class. This has downstream benefits for several tasks, such as increasing the availability of training data for Large Language Models (LLMs), improving the accuracy of retriever systems, and enhancing document-understanding pipelines.
|
|
|
|
| 149 |
draw.rectangle((bbox[0], bbox[1], bbox[2], bbox[3]), outline="red")
|
| 150 |
```
|
| 151 |
|
| 152 |
+
## Inference with VLLM
|
| 153 |
+
|
| 154 |
+
### Install dependencies
|
| 155 |
+
|
| 156 |
+
```bash
|
| 157 |
+
uv venv --python 3.12 --seed
|
| 158 |
+
source .venv/bin/activate
|
| 159 |
+
uv pip install "git+https://github.com/amalad/vllm.git@nemotron_parse"
|
| 160 |
+
uv pip install timm albumentations
|
| 161 |
+
```
|
| 162 |
+
|
| 163 |
+
### Inference example
|
| 164 |
+
|
| 165 |
+
```python
|
| 166 |
+
from vllm import LLM, SamplingParams
|
| 167 |
+
from PIL import Image
|
| 168 |
+
|
| 169 |
+
|
| 170 |
+
sampling_params = SamplingParams(
|
| 171 |
+
temperature=0,
|
| 172 |
+
top_k=1,
|
| 173 |
+
repetition_penalty=1.1,
|
| 174 |
+
max_tokens=9000,
|
| 175 |
+
skip_special_tokens=False,
|
| 176 |
+
)
|
| 177 |
+
|
| 178 |
+
llm = LLM(
|
| 179 |
+
model="nvidia/NVIDIA-Nemotron-Parse-v1.1-Lite",
|
| 180 |
+
max_num_seqs=64,
|
| 181 |
+
limit_mm_per_prompt={"image": 1},
|
| 182 |
+
dtype="bfloat16",
|
| 183 |
+
trust_remote_code=True,
|
| 184 |
+
)
|
| 185 |
+
|
| 186 |
+
image = Image.open("<YOUR-IMAGE-PATH>")
|
| 187 |
+
|
| 188 |
+
prompts = [
|
| 189 |
+
{ # Implicit prompt
|
| 190 |
+
"prompt": "</s><s><predict_bbox><predict_classes><output_markdown>",
|
| 191 |
+
"multi_modal_data": {
|
| 192 |
+
"image": image
|
| 193 |
+
},
|
| 194 |
+
},
|
| 195 |
+
{ # Explicit encoder/decoder prompt
|
| 196 |
+
"encoder_prompt": {
|
| 197 |
+
"prompt": "",
|
| 198 |
+
"multi_modal_data": {
|
| 199 |
+
"image": image
|
| 200 |
+
},
|
| 201 |
+
},
|
| 202 |
+
"decoder_prompt": "</s><s><predict_bbox><predict_classes><output_markdown>",
|
| 203 |
+
},
|
| 204 |
+
]
|
| 205 |
+
|
| 206 |
+
outputs = llm.generate(prompts, sampling_params)
|
| 207 |
+
|
| 208 |
+
for output in outputs:
|
| 209 |
+
prompt = output.prompt
|
| 210 |
+
generated_text = output.outputs[0].text
|
| 211 |
+
print(f"Decoder prompt: {prompt!r}, Generated text: {generated_text!r}")
|
| 212 |
+
```
|
| 213 |
|
| 214 |
## Training, Testing, and Evaluation Datasets:
|
| 215 |
|
generation_config.json
CHANGED
|
@@ -9,5 +9,7 @@
|
|
| 9 |
"do_sample": false,
|
| 10 |
"num_beams": 1,
|
| 11 |
"repetition_penalty": 1.1,
|
| 12 |
-
"transformers_version": "4.51.3"
|
|
|
|
|
|
|
| 13 |
}
|
|
|
|
| 9 |
"do_sample": false,
|
| 10 |
"num_beams": 1,
|
| 11 |
"repetition_penalty": 1.1,
|
| 12 |
+
"transformers_version": "4.51.3",
|
| 13 |
+
"top_k": 1,
|
| 14 |
+
"temperature": 0
|
| 15 |
}
|
tokenizer_config.json
CHANGED
|
@@ -18820,5 +18820,6 @@
|
|
| 18820 |
"truncation_side": "right",
|
| 18821 |
"truncation_strategy": "longest_first",
|
| 18822 |
"unk_token": "<unk>",
|
| 18823 |
-
"vocab_file": null
|
|
|
|
| 18824 |
}
|
|
|
|
| 18820 |
"truncation_side": "right",
|
| 18821 |
"truncation_strategy": "longest_first",
|
| 18822 |
"unk_token": "<unk>",
|
| 18823 |
+
"vocab_file": null,
|
| 18824 |
+
"chat_template": "{%- for message in messages -%}{%- for part in message['content'] -%}{{ part['text'] if part['type'] == 'text' else '' }}{%- endfor -%}{%- endfor -%}"
|
| 18825 |
}
|