Proposed architecture. (a) Proposed video-specific data augmentation. Here, W denotes one of the temporally consistent, static warping transformations, which corrupts (e.g., blurs) each frame independently but applies the same corruption consistently across the video. (b) Proposed curriculum learning framework.
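To make the notion of a temporally consistent, static transformation concrete, the following is a minimal sketch, not the implementation used in this work. It assumes a box blur as the corruption and NumPy arrays of shape T x H x W x C as the video; the names `box_blur` and `consistent_warp` and the candidate kernel sizes are hypothetical. The key point it illustrates is that the corruption parameters are sampled once per video and then reused for every frame.

```python
import numpy as np


def box_blur(frame: np.ndarray, k: int) -> np.ndarray:
    """Blur one H x W x C frame with a k x k box filter (edges wrap around)."""
    acc = np.zeros_like(frame, dtype=np.float64)
    offsets = range(-(k // 2), k // 2 + 1)
    for dy in offsets:
        for dx in offsets:
            # Shift the frame spatially and accumulate; the mean of all
            # shifted copies is the box-filtered frame.
            acc += np.roll(frame, shift=(dy, dx), axis=(0, 1))
    return acc / (len(offsets) ** 2)


def consistent_warp(video: np.ndarray, rng: np.random.Generator) -> np.ndarray:
    """Apply one blur, with parameters fixed for the whole video, to every frame."""
    # Assumption: kernel size is the only parameter; it is sampled once per
    # video, not once per frame, so the corruption is identical across time.
    k = int(rng.choice([3, 5, 7]))
    return np.stack([box_blur(frame, k) for frame in video])
```

Because the same fixed operator is applied to every frame, two identical input frames always map to identical output frames, whereas per-frame random parameters would introduce spurious frame-to-frame changes.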
Temporal Action Segmentation (TAS) of surgical video is an important first step for a variety of video analysis tasks, such as skills assessment, surgical assistance, and robotic surgery. Because acquisition and annotation are costly, data is limited, which makes data augmentation imperative. However, most video augmentation techniques are direct extensions of image-augmentation strategies and disturb the optical flow information while generating an augmented sample, which makes training difficult. In this paper, we propose a simple yet efficient, flow-consistent, video-specific data augmentation technique suitable for TAS in scarce-data conditions. We observe that TAS errors commonly occur at action boundaries because boundaries are scarce in the datasets. Hence, we propose a novel strategy that generates pseudo-action boundaries without affecting optical flow elsewhere. We further propose a sample-hardness-inspired curriculum in which the model is first trained on easy samples, i.e., those with only a single label observed in the temporal window. Additionally, we contribute the first-ever non-robotic Neuro-endoscopic Trainee Simulator (NETS) dataset for the task of TAS. We validate our approach on the proposed NETS dataset, along with the publicly available JIGSAWS and Cholec T-50 datasets. Compared with training without any data augmentation, our technique improves the edit score by an average of 7.89%, 5.53%, and 2.80% on the three datasets, respectively. The reported numbers are improvements averaged over 9 state-of-the-art (SOTA) action segmentation models using two different temporal feature extractors (I3D and VideoMAE). On average, the proposed technique outperforms the best-performing SOTA data augmentation technique by 3.94%, thus setting a new SOTA for action segmentation on each of these datasets.
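The easy-first ordering described above can be sketched as follows. This is an illustrative assumption rather than the exact procedure: it treats a temporal window as easy when all of its frame-level labels are identical and as hard when it contains at least one action boundary. The function names, the window size, and the stride are hypothetical, and the actual hardness measure and training schedule may differ.

```python
from typing import List, Sequence, Tuple

Window = Tuple[int, int]  # (start, end) frame indices, end exclusive


def split_windows_by_hardness(
    labels: Sequence[int], window: int, stride: int
) -> Tuple[List[Window], List[Window]]:
    """Split sliding windows into easy (single label) and hard (mixed labels)."""
    easy: List[Window] = []
    hard: List[Window] = []
    for start in range(0, len(labels) - window + 1, stride):
        end = start + window
        # A window is easy when exactly one distinct label is observed in it.
        if len(set(labels[start:end])) == 1:
            easy.append((start, end))
        else:
            hard.append((start, end))
    return easy, hard


def curriculum_order(labels: Sequence[int], window: int, stride: int) -> List[Window]:
    """Return windows ordered so that all easy windows come before hard ones."""
    easy, hard = split_windows_by_hardness(labels, window, stride)
    return easy + hard
```

For a label sequence with a single boundary in the middle, windows that lie entirely inside one action are scheduled first, and windows that straddle the boundary are scheduled last.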