Towards aligning Motion, IMU, EMG and Text

Aadeeb, Md Shadman

dc.contributor.author	Aadeeb, Md Shadman
dc.date.accessioned	2026-05-17T07:22:57Z
dc.date.available	2026-05-17T07:22:57Z
dc.date.issued	2026-05-11
dc.identifier.uri	http://dspace.uiu.ac.bd/handle/52243/3470
dc.description.abstract	Multimodal alignment and multimodal generation are two widely used techniques in Artificial Intelligence (AI). Multimodal alignment lets a model identify similarities between different data modalities by mapping them into a shared representation of the same dimensionality. Multimodal generation, in turn, lets a model generate one type of data from another through this common representation. Together, they make it possible to compare information across modalities and to translate data from one modality to another. In this research, multimodal alignment and generation were performed across four modalities: three-dimensional keypoint sequences, inertial measurement unit (IMU) signals, electromyography (EMG) signals, and text. For this purpose, four encoders and four decoders were developed so that each modality could be mapped into a shared embedding space and then reconstructed or generated across modalities. To overcome the lack of fully paired datasets, an unpaired cross-modal training method was used. This approach led to an Explainable AI-based system that can provide meaningful insights into a person’s movement by connecting information from multiple sources. The results showed that the proposed models performed well, especially in preserving motion-related information across modalities. With the developed encoders and decoders, the models matched or surpassed several previous benchmarks. The text generation model achieved a BERTScore of 40.57, higher than previous models. For EMG-to-pose generation, the 3D keypoint position RMSE was 0.0873, very close to the performance reported by the authors of the dataset paper.
Similarly, for IMU-to-pose generation, the average rotation error was approximately 17.6589°, also close to the dataset authors’ result. Further analysis also revealed interesting patterns in how movement-related information is carried across different modalities. In addition, embedding arithmetic showed that two embeddings could be combined to produce mixed results, suggesting that the shared embedding space learned meaningful relationships between modalities. Overall, these findings show that the proposed multimodal framework can effectively capture,	en_US
dc.publisher	UIU	en_US
dc.subject	EMG	en_US
dc.subject	IMU	en_US
dc.subject	3D Keypoints	en_US
dc.subject	Encoder	en_US
dc.subject	XAI	en_US
dc.title	Towards aligning Motion, IMU, EMG and Text	en_US

Files in this item

Name: Thesis Final Notebook.pdf
Size: 6.198 MB
Format: PDF
Description: Final Thesis

This item appears in the following Collection(s)

M.Sc Thesis/Project [167]
