Extracting urban spatial perception attributes and scene elements by integrating VGG-16 and CBAM