Towards aligning Motion, IMU, EMG and Text

Aadeeb, Md Shadman

Date

2026-05-11

Author

Aadeeb, Md Shadman

Abstract

Multimodal alignment and multimodal generation are two widely used techniques in Artificial Intelligence (AI). Multimodal alignment lets a model identify similarities between different data modalities by mapping them into a shared representation of the same dimensionality. Multimodal generation, in turn, lets a model generate one type of data from another through this common representation. Together, they provide a way to compare information across modalities and to translate data from one modality to another. In this research, multimodal alignment and generation were performed across four modalities: three-dimensional (3D) keypoint sequences, inertial measurement unit (IMU) signals, electromyography (EMG) signals, and text. To this end, four encoders and four decoders were developed so that each modality could be mapped into a shared embedding space and then reconstructed or generated across modalities. To overcome the lack of fully paired datasets, an unpaired cross-modal training method was used. This approach led to an Explainable AI-based system that provides meaningful insights into a person’s movement by connecting information from multiple sources. The results showed that the proposed models performed well, especially in preserving motion-related information across modalities. With the developed encoders and decoders, the models matched or surpassed several previous benchmarks. The text generation model achieved a BERTScore of 40.57, higher than previous models. For EMG-to-pose generation, the 3D keypoint position RMSE was 0.0873, very close to the performance reported by the authors of the dataset paper. Similarly, for IMU-to-pose generation, the average rotation error was approximately 17.6589°, also close to the dataset authors’ result. Further analysis revealed interesting patterns in how movement-related information is carried across modalities. In addition, embedding arithmetic showed that two embeddings could be combined to produce mixed results, suggesting that the shared embedding space learned meaningful relationships between modalities. Overall, these findings show that the proposed multimodal framework can effectively capture,

URI

http://dspace.uiu.ac.bd/handle/52243/3470

Collections

M.Sc Thesis/Project