Open-Set Surgical Instrument Segmentation with Endoscopic Vision-Language Model
Automatic segmentation of instruments from laparoscopic surgery images or videos plays a key role in providing advanced assistance to the clinical team, with emerging applications in enhancing surgery conducted by humans and, ultimately, in semi-autonomous robot-human delivered surgery and surgical imaging techniques. The state-of-the-art approach is to train deep convolutional networks on manually annotated datasets. Annotating such datasets from surgical videos, however, is tedious and time-consuming.
To achieve high robustness and accuracy in surgical segmentation, it is critical to exploit the large amounts of non-annotated data available in the open set (the real world). Moreover, owing to the evolving nature of surgical technology, the set of segmentation concepts (classes) keeps growing: the open set contains data with both base and new classes, whereas surgical instrument detectors trained on a curated set can only detect the base classes. Equipping such detectors, trained on base classes in a curated set, with the ability to detect new classes in the open set is therefore another critical problem.
This project aims to overcome these limitations of open-set surgical segmentation with the help of an endoscopic vision-language model. Vision-language models are a recently developed technique that has proved highly effective for open-set visual recognition: they are pretrained on millions of natural image-text pairs covering a vast range of classes, and were established mainly for the image classification task. To bring this technique to the medical imaging domain, we propose to train a new endoscopic vision-language model dedicated to robotic surgery, given access to Cambridge-1. The model will be made publicly available so that the broader community can use it for many downstream autonomous endoscopic tasks.
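To illustrate how a vision-language model enables open-set recognition through text prompts, the following minimal Python sketch queries a publicly available CLIP-style checkpoint (openai/clip-vit-base-patch32, via the HuggingFace Transformers library) with instrument class names. The checkpoint, the class names, and the input frame path are illustrative assumptions, not the proposed endoscopic model; the same prompting mechanism would apply once an endoscopic vision-language model is trained.

# Minimal sketch (illustrative only): zero-shot, open-set recognition with a
# CLIP-style vision-language model. The natural-image checkpoint and the
# instrument class names are placeholders for the proposed endoscopic model.
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

# Both base and new (previously unseen) classes can be queried simply by
# listing them as text prompts -- no retraining is needed.
classes = ["grasper", "scissors", "clip applier", "suction tube"]
prompts = [f"an endoscopic image of a {c}" for c in classes]

image = Image.open("laparoscopic_frame.png")  # hypothetical input frame
inputs = processor(text=prompts, images=image, return_tensors="pt", padding=True)

with torch.no_grad():
    outputs = model(**inputs)

# Image-text similarity scores, normalised over the queried classes.
probs = outputs.logits_per_image.softmax(dim=-1)
for c, p in zip(classes, probs[0].tolist()):
    print(f"{c}: {p:.3f}")

Because the class vocabulary is defined purely by text at inference time, new instrument types can be added to the query list without retraining, which is the property the proposed endoscopic model is intended to bring to surgical instrument segmentation.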