Skeletal Keypoint-Based Transformer Model for Human Action Recognition in Aerial Videos