International Journal of Machine Learning and Cybernetics, 2024 (SCI-Expanded)
The Temporal Convolutional Network (TCN) has received extensive attention in the field of speech synthesis, and many researchers have adopted TCN-based models for action segmentation, since both tasks rely on contextual dependencies. However, TCN captures only the long-term dependencies of the input and ignores short-term dependencies, which can lead to over-segmentation: a single action interval is divided into multiple action categories. This paper proposes a Multi-Stage Linear-Index Dilated TCN (MSLID-TCN) model, in which each layer has an appropriate receptive field so that both the short-term and long-term dependencies of the video are passed to the next layer, thereby mitigating the over-segmentation problem. MSLID-TCN has a four-stage structure: the first stage is a LID-TCN, while the remaining stages are Single-Stage TCNs (SS-TCNs). The I3D features of the video are used as the input to MSLID-TCN. In the first stage, the LID-TCN makes initial predictions on the frame features to obtain predicted probability values. In the remaining three stages, these probabilities serve as input, and each SS-TCN refines the predictions of the previous stage, ultimately yielding the action segmentation results. Comparative experiments show that our model achieves excellent performance on three datasets: 50Salads, Georgia Tech Egocentric Activities (GTEA), and Breakfast.
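To make the staged structure concrete, the following is a minimal PyTorch sketch of a multi-stage TCN with a linear-index dilated first stage and probability-refinement stages. The layer counts, channel width, kernel size, feature dimension, and the exact dilation schedules are assumptions borrowed from the standard MS-TCN design, not details taken from the paper.

```python
# Hedged sketch of the described multi-stage structure; hyperparameters are assumed.
import torch
import torch.nn as nn
import torch.nn.functional as F


class DilatedResidualLayer(nn.Module):
    """Dilated 1-D convolution with a residual connection (as in SS-TCN)."""
    def __init__(self, channels, dilation):
        super().__init__()
        self.conv_dilated = nn.Conv1d(channels, channels, kernel_size=3,
                                      padding=dilation, dilation=dilation)
        self.conv_1x1 = nn.Conv1d(channels, channels, kernel_size=1)

    def forward(self, x):
        out = F.relu(self.conv_dilated(x))
        out = self.conv_1x1(out)
        return x + out


class TCNStage(nn.Module):
    """One stage: input projection, stacked dilated layers, frame-wise classifier."""
    def __init__(self, in_dim, channels, num_classes, dilations):
        super().__init__()
        self.conv_in = nn.Conv1d(in_dim, channels, kernel_size=1)
        self.layers = nn.ModuleList(
            [DilatedResidualLayer(channels, d) for d in dilations])
        self.conv_out = nn.Conv1d(channels, num_classes, kernel_size=1)

    def forward(self, x):
        x = self.conv_in(x)
        for layer in self.layers:
            x = layer(x)
        return self.conv_out(x)  # frame-wise class logits


class MSLIDTCN(nn.Module):
    """Four-stage model: a linear-index dilated stage followed by three
    refinement stages operating on predicted probabilities (assumed layout)."""
    def __init__(self, feat_dim=2048, channels=64, num_classes=19,
                 num_layers=10, num_refine_stages=3):
        super().__init__()
        # First stage: dilation grows linearly with the layer index (LID-TCN),
        # so early layers keep short-term context while later layers widen it.
        self.stage1 = TCNStage(feat_dim, channels, num_classes,
                               dilations=[i + 1 for i in range(num_layers)])
        # Remaining stages: SS-TCN refinement with exponential dilation (as in MS-TCN).
        self.refine = nn.ModuleList(
            [TCNStage(num_classes, channels, num_classes,
                      dilations=[2 ** i for i in range(num_layers)])
             for _ in range(num_refine_stages)])

    def forward(self, x):  # x: (batch, feat_dim, num_frames) I3D features
        outputs = [self.stage1(x)]
        for stage in self.refine:
            outputs.append(stage(F.softmax(outputs[-1], dim=1)))
        return outputs  # per-stage logits; the last stage gives the segmentation
```

The sketch only illustrates the data flow (I3D features into the first stage, softmax probabilities refined stage by stage); losses, smoothing terms, and the paper's specific LID dilation schedule are omitted.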