# Historical Document Layout Detection Model (Co-DETR / DINO)
A fine-tuned Co-DINO (Vision Transformer-based detector via MMDetection) model for detecting layout elements in historical Swedish medical journal pages.
This model is a more advanced successor to the earlier Mask R-CNN-based approach, cdhu-uu/SweMPer-layout-lite, offering improved detection performance and robustness on complex layouts.
This model was developed as part of the research project:
Communicating Medicine (SweMPer): Digitalisation of Swedish Medical Periodicals, 1781–2011
(Project ID: IN22-0017), funded by Riksbankens Jubileumsfond.
## Model Details
- Model type: Co-DINO (Vision Transformer backbone)
- Framework: MMDetection
- Fine-tuned for: Historical document layout analysis
- Language of source documents: Swedish
- Strengths: improved detection precision on complex layouts
## Supported Labels
| Label |
|---|
| Advertisement |
| Author |
| Header or Footer |
| Image |
| List |
| Page Number |
| Table |
| Text |
| Title |
## Evaluation Metrics

The model achieves the following COCO-style average-precision (AP) scores:
| AP | AP50 | AP75 | APs | APm | APl |
|---|---|---|---|---|---|
| 80.7 | 98.4 | 87.4 | 51.5 | 69.6 | 88.2 |
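Following COCO conventions, AP is averaged over IoU thresholds from 0.50 to 0.95, AP50 and AP75 are computed at fixed IoU thresholds of 0.50 and 0.75, and APs/APm/APl split results by object area (small, medium, large). The intersection-over-union computation behind these thresholds can be sketched as:

```python
def iou(box_a, box_b):
    """Intersection-over-union of two (x1, y1, x2, y2) boxes."""
    ax1, ay1, ax2, ay2 = box_a
    bx1, by1, bx2, by2 = box_b
    # Width and height of the overlap region (zero if the boxes are disjoint)
    iw = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    ih = max(0.0, min(ay2, by2) - max(ay1, by1))
    inter = iw * ih
    union = (ax2 - ax1) * (ay2 - ay1) + (bx2 - bx1) * (by2 - by1) - inter
    return inter / union if union > 0 else 0.0

# A predicted box with IoU ~0.667 against the ground truth counts as a
# hit for AP50 (IoU >= 0.5) but as a miss for AP75 (IoU >= 0.75).
print(iou((0, 0, 100, 100), (20, 0, 120, 100)))  # ~0.667
```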
## Usage

### Installation

Installation and fine-tuning instructions are available in the Co-DETR repository:
https://github.com/Sense-X/Co-DETR?tab=readme-ov-file

### Inference
```python
import cv2
import layoutparser as lp
import matplotlib.pyplot as plt
from mmdet.apis import init_detector, inference_detector

# Configuration
config_file = "co_dino_5scale_vit_large_coco.py"
checkpoint_file = "SweMPer-layout.pth"
score_thr = 0.50
device = "cuda:0"

# Initialize model
model = init_detector(config_file, checkpoint_file, device=device)

# Get class names from the model (older MMDetection versions store them in
# CLASSES, newer ones in dataset_meta)
def get_classes(model):
    m = getattr(model, "module", model)
    classes = getattr(m, "CLASSES", None)
    if classes:
        return list(classes)
    meta = getattr(m, "dataset_meta", None)
    if meta and isinstance(meta, dict) and "classes" in meta:
        return list(meta["classes"])
    return None

classes = get_classes(model)

# Convert MMDet results to a LayoutParser Layout
def mmdet_to_layout(result, classes, thr=0.50):
    bbox_result = result[0] if isinstance(result, tuple) else result
    blocks = []
    for cls_id, dets in enumerate(bbox_result):
        if dets is None or len(dets) == 0:
            continue
        cls_name = classes[cls_id].lower() if classes else str(cls_id)
        # Each detection row is (x1, y1, x2, y2, score)
        for x1, y1, x2, y2, score in dets[dets[:, -1] >= thr]:
            rect = lp.Rectangle(float(x1), float(y1), float(x2), float(y2))
            blocks.append(
                lp.TextBlock(block=rect, type=cls_name, score=float(score))
            )
    return lp.Layout(blocks)

# Run inference
image_path = "<path_to_image>"
result = inference_detector(model, image_path)
layout = mmdet_to_layout(result, classes, thr=score_thr)

# Print detected elements
for block in layout:
    print(f"Type: {block.type}, Score: {block.score:.3f}, Box: {block.coordinates}")

# Visualize results
image = cv2.imread(image_path)[..., ::-1]  # BGR to RGB
viz = lp.draw_box(image, layout, box_width=3, show_element_type=True)
plt.figure(figsize=(12, 16))
plt.imshow(viz)
plt.axis("off")
plt.show()
```
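Downstream processing (e.g. OCR of historical journal pages) usually needs the detected blocks in reading order rather than in detection order. A minimal, library-free sketch of one possible heuristic, using plain tuples in place of the `TextBlock` objects above (the row-grouping tolerance is an assumption, not part of the model):

```python
# Sort detected blocks into top-to-bottom, left-to-right reading order.
# Each block is (label, score, (x1, y1, x2, y2)); in practice these values
# would come from the layout produced by the inference code above.
def reading_order(blocks, line_tol=20.0):
    """Group blocks into horizontal bands whose top edges lie within
    line_tol pixels of the band's first block, then sort each band
    left to right."""
    ordered = sorted(blocks, key=lambda b: (b[2][1], b[2][0]))
    rows, current, row_top = [], [], None
    for b in ordered:
        top = b[2][1]
        if row_top is None or top - row_top <= line_tol:
            current.append(b)
            row_top = top if row_top is None else row_top
        else:
            rows.append(sorted(current, key=lambda b: b[2][0]))
            current, row_top = [b], top
    if current:
        rows.append(sorted(current, key=lambda b: b[2][0]))
    return [b for row in rows for b in row]

# Hypothetical detections from a two-column page
blocks = [
    ("text", 0.97, (300.0, 100.0, 580.0, 400.0)),  # right column
    ("title", 0.99, (40.0, 30.0, 560.0, 80.0)),    # page title
    ("text", 0.95, (40.0, 105.0, 280.0, 400.0)),   # left column
]
for label, score, box in reading_order(blocks):
    print(label, score, box)  # title first, then left column, then right
```

Note this simple band-based heuristic treats side-by-side columns whose tops align as one left-to-right row; multi-column pages with long columns may need a column-splitting step first.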
## Acknowledgements
This work was carried out within the project:
Communicating Medicine (SweMPer): Digitalisation of Swedish Medical Periodicals, 1781–2011
(Project ID: IN22-0017), funded by Riksbankens Jubileumsfond.
We gratefully acknowledge the support of the funder and project collaborators.
This model builds upon the excellent work of:

- Co-DETR (https://github.com/Sense-X/Co-DETR)
- MMDetection

We thank the contributors and maintainers of these projects for making their tools publicly available and supporting research.