Cosine similarity is probably not the way to go for the reasons you mentioned, so an implementation that ignores semantic similarity is probably safer. Fuzzy matching against a dictionary of known-good slugs wouldn’t handle every situation, but it would handle enough to be valuable. I don’t know enough about it to work out the specifics, but I’ve seen it in action for things like URL case handling.
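For what it's worth, here is a rough sketch of what fuzzy matching against a list of known-good slugs could look like, using Python's standard-library `difflib`. The slug list, the `resolve_slug` name, and the 0.8 cutoff are all made up for illustration; this is not how any particular site or framework does it.

```python
from difflib import get_close_matches

# Hypothetical list of valid slugs; a real site would load these from its routes or CMS.
KNOWN_SLUGS = ["getting-started", "pricing", "contact-us", "blog/url-redirects"]


def resolve_slug(requested, known_slugs=KNOWN_SLUGS, cutoff=0.8):
    """Return the closest known slug, or None if nothing is similar enough."""
    # Normalize case and surrounding slashes first (covers the URL case handling scenario).
    normalized = requested.strip("/").lower()
    if normalized in known_slugs:
        return normalized
    # Character-level similarity only; no semantic similarity is involved.
    matches = get_close_matches(normalized, known_slugs, n=1, cutoff=cutoff)
    return matches[0] if matches else None


print(resolve_slug("/Pricing/"))       # pricing
print(resolve_slug("geting-started"))  # getting-started
print(resolve_slug("zzzz"))            # None
```

The cutoff is the knob that matters: set it too low and unrelated URLs start redirecting to the wrong page, set it too high and only near-exact typos get caught. Returning `None` lets the caller fall back to a normal 404.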
Semantic similarity can be used as a helping metric, but not a deciding factor.