Time-Interval-Guided Event Representation for Scene Understanding