Towards aligning Motion, IMU, EMG and Text
Abstract
Multimodal alignment and multimodal generation are two widely used techniques in
Artificial Intelligence (AI). Multimodal alignment allows AI models to identify
similarities between two different modalities of data by mapping them into a shared
representation with the same dimensionality. Multimodal generation, on the other hand,
enables an AI model to generate one type of data from another using this common
representation. Together, these techniques provide a way to compare information from
different modalities and to generate data in one modality from another.
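As a rough illustration of what comparing two modalities in a shared representation can look like, the sketch below scores embeddings with cosine similarity and picks the closest candidate. The function names and the use of cosine similarity are assumptions made for illustration; the abstract does not state which similarity measure the thesis uses.

```python
import math


def cosine_similarity(u, v):
    """Cosine similarity between two embeddings of the same dimension."""
    if len(u) != len(v):
        raise ValueError("embeddings must share the same dimension")
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_u * norm_v)


def best_match(query, candidates):
    """Index of the candidate embedding most similar to the query.

    `query` could be, for example, a text embedding and `candidates`
    motion embeddings, provided both live in the same shared space.
    """
    scores = [cosine_similarity(query, c) for c in candidates]
    return max(range(len(scores)), key=scores.__getitem__)
```

Because both inputs are assumed to live in the same space, the same function can compare any pair of modalities without modality-specific logic.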
In this research, multimodal alignment and multimodal generation were performed
across four different modalities: three-dimensional keypoint sequences, inertial measurement
unit signals, electromyography signals, and textual data. For this purpose, four
encoders and four decoders were developed so that each modality could be mapped into a
shared embedding space and then reconstructed or generated across modalities. To overcome
the lack of fully paired datasets, an unpaired cross-modal training method was used.
This approach led to the development of an Explainable AI-based system that can provide
meaningful insights into a person’s movement by connecting information from
multiple sources.
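The encoder and decoder arrangement described above can be sketched as a simple routing step: encode with the source modality's encoder, then decode with the target modality's decoder. Everything in this sketch is hypothetical (the modality keys, the `translate` helper, and the toy encoders and decoders); it only shows how one encoder and one decoder per modality are enough to cover every source and target pair, not how the thesis models are built.

```python
from typing import Callable, Dict, List

Vector = List[float]

# Hypothetical modality keys for the four modalities named in the abstract.
MODALITIES = ("keypoints", "imu", "emg", "text")


def translate(
    x: Vector,
    src: str,
    dst: str,
    encoders: Dict[str, Callable[[Vector], Vector]],
    decoders: Dict[str, Callable[[Vector], Vector]],
) -> Vector:
    """Map `x` from modality `src` to modality `dst` via the shared space."""
    z = encoders[src](x)      # source modality -> shared embedding
    return decoders[dst](z)   # shared embedding -> target modality


# Toy stand-ins: each "encoder" scales its input, each "decoder" inverts
# a modality-specific scale. Real models would be trained networks.
scales = {"keypoints": 1.0, "imu": 2.0, "emg": 4.0, "text": 8.0}
encoders = {m: (lambda x, s=scales[m]: [v / s for v in x]) for m in MODALITIES}
decoders = {m: (lambda z, s=scales[m]: [v * s for v in z]) for m in MODALITIES}

# With same-modality reconstruction included, 4 encoders and 4 decoders
# cover 4 * 4 source/target combinations.
num_paths = len(encoders) * len(decoders)
```

Setting `src` equal to `dst` corresponds to reconstruction, while any other pair corresponds to cross-modal generation.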
The results showed that the proposed models performed well, especially in preserving
motion-related information across the different modalities. With the help of the developed
encoders and decoders, the models were able to match or surpass several previous
benchmarks. The text generation model achieved a BERTScore of 40.57, higher than that of
previous models. For EMG-to-pose generation, the 3D keypoint position RMSE was
0.0873, which is very close to the performance reported by the authors of the dataset paper.
Similarly, for IMU-to-pose generation, the average rotation error was approximately
17.6589°, also close to the dataset authors’ result. Further analysis also revealed interesting
patterns in how movement-related information is carried across different modalities.
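For readers unfamiliar with the two pose metrics quoted above, the following minimal sketch shows one common way to compute each. Both definitions are assumptions: the RMSE here averages squared Euclidean distances over keypoints, and the rotation error is the geodesic angle between two rotation matrices. The abstract does not specify the exact conventions (units, averaging order, rotation representation) used in the thesis or the dataset papers.

```python
import math


def keypoint_rmse(pred, target):
    """RMSE over 3D keypoints: sqrt of the mean squared Euclidean distance.

    `pred` and `target` are equal-length, non-empty lists of [x, y, z] points.
    """
    if len(pred) != len(target) or not pred:
        raise ValueError("pred and target must be non-empty and equal in length")
    sq_dists = [
        sum((p - t) ** 2 for p, t in zip(p_xyz, t_xyz))
        for p_xyz, t_xyz in zip(pred, target)
    ]
    return math.sqrt(sum(sq_dists) / len(sq_dists))


def rotation_error_deg(r_pred, r_true):
    """Geodesic angle in degrees between two 3x3 rotation matrices.

    Uses angle = arccos((trace(R_pred^T R_true) - 1) / 2), where the trace
    equals the sum of element-wise products of the two matrices.
    """
    trace = sum(r_pred[i][j] * r_true[i][j] for i in range(3) for j in range(3))
    # Clamp to guard against values slightly outside [-1, 1] from rounding.
    cos_angle = max(-1.0, min(1.0, (trace - 1.0) / 2.0))
    return math.degrees(math.acos(cos_angle))
```

An average rotation error would then be the mean of `rotation_error_deg` over all joints and frames, again assuming this is the convention being reported.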
In addition, embedding arithmetic showed that two embeddings could be combined to
produce mixed results, suggesting that the shared embedding space learned meaningful
relationships between modalities. Overall, these findings show that the proposed multimodal
framework can effectively capture,
