International Journal of Machine Learning and Cybernetics, 2024 (SCI-Expanded)
The Temporal Convolutional Network (TCN) has received extensive attention in the field of speech synthesis, and many researchers have adopted TCN-based models for action segmentation, since both tasks rely on contextual dependencies. However, TCN captures only the long-term dependencies of the input and ignores short-term dependencies, which can lead to over-segmentation: a single action interval is divided into multiple action categories. This paper proposes a Multi-Stage Linear-Index Dilated TCN (MSLID-TCN) model, in which each layer has an appropriate receptive field so that both the short-term and long-term dependencies of the video are passed to the next layer, thereby mitigating the over-segmentation problem. MSLID-TCN has a four-stage structure: the first stage is a LID-TCN, while the remaining stages are Single-Stage TCNs (SS-TCNs). The I3D features of the video are used as the input to MSLID-TCN. In the first stage, the LID-TCN makes initial predictions on the frame features to obtain predicted probability values. In the remaining three stages, these probabilities serve as input, and each SS-TCN refines the predictions of the previous stage, ultimately yielding the action segmentation results. Comparative experiments show that our model achieves excellent performance on three datasets: 50Salads, Georgia Tech Egocentric Activities (GTEA), and Breakfast.
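To make the staged structure concrete, the following is a minimal PyTorch sketch of a multi-stage TCN with a linear-index dilated first stage and probability-refinement stages. The layer counts, channel width, kernel size, feature dimension, and the exact dilation schedules are assumptions borrowed from the standard MS-TCN design, not details taken from the paper.

```python
# Hedged sketch of the described multi-stage structure; hyperparameters are assumed.
import torch
import torch.nn as nn
import torch.nn.functional as F


class DilatedResidualLayer(nn.Module):
    """Dilated 1-D convolution with a residual connection (as in SS-TCN)."""
    def __init__(self, channels, dilation):
        super().__init__()
        self.conv_dilated = nn.Conv1d(channels, channels, kernel_size=3,
                                      padding=dilation, dilation=dilation)
        self.conv_1x1 = nn.Conv1d(channels, channels, kernel_size=1)

    def forward(self, x):
        out = F.relu(self.conv_dilated(x))
        out = self.conv_1x1(out)
        return x + out


class TCNStage(nn.Module):
    """One stage: input projection, stacked dilated layers, frame-wise classifier."""
    def __init__(self, in_dim, channels, num_classes, dilations):
        super().__init__()
        self.conv_in = nn.Conv1d(in_dim, channels, kernel_size=1)
        self.layers = nn.ModuleList(
            [DilatedResidualLayer(channels, d) for d in dilations])
        self.conv_out = nn.Conv1d(channels, num_classes, kernel_size=1)

    def forward(self, x):
        x = self.conv_in(x)
        for layer in self.layers:
            x = layer(x)
        return self.conv_out(x)  # frame-wise class logits


class MSLIDTCN(nn.Module):
    """Four-stage model: a linear-index dilated stage followed by three
    refinement stages operating on predicted probabilities (assumed layout)."""
    def __init__(self, feat_dim=2048, channels=64, num_classes=19,
                 num_layers=10, num_refine_stages=3):
        super().__init__()
        # First stage: dilation grows linearly with the layer index (LID-TCN),
        # so early layers keep short-term context while later layers widen it.
        self.stage1 = TCNStage(feat_dim, channels, num_classes,
                               dilations=[i + 1 for i in range(num_layers)])
        # Remaining stages: SS-TCN refinement with exponential dilation (as in MS-TCN).
        self.refine = nn.ModuleList(
            [TCNStage(num_classes, channels, num_classes,
                      dilations=[2 ** i for i in range(num_layers)])
             for _ in range(num_refine_stages)])

    def forward(self, x):  # x: (batch, feat_dim, num_frames) I3D features
        outputs = [self.stage1(x)]
        for stage in self.refine:
            outputs.append(stage(F.softmax(outputs[-1], dim=1)))
        return outputs  # per-stage logits; the last stage gives the segmentation
```

The sketch only illustrates the data flow (I3D features into the first stage, softmax probabilities refined stage by stage); losses, smoothing terms, and the paper's specific LID dilation schedule are omitted.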