Multi-word units (and tokenization more generally): a multi-dimensional and largely information-theoretic approach