Self-supervised speech representation learning based on positive sample comparison and masking reconstruction