regular expression – Alex Eames' personal blog

I needed to remove accented characters from a string so that I could make order ID codes which work properly as filenames on a Unix system.

Now, strictly speaking, the Unix system has no problem with accented characters in filenames. The problem arises when you try to use them in browsers as GET arguments at the end of the URL, e.g. http://tranfree.com/blah?6672837áàß

My old WS-FTP software copes with these as filenames. My newer SSH software doesn’t – can’t see them, can’t delete them, just can’t cope with them at all. :no:

But the biggest problem of all is that when I give clients their download URLs, if their surname has accented characters (diacritics) in it, their downloads usually won't work because of the way their browser or email client translates the URL. Perhaps it’s a Unicode thing? Who knows/cares?

Well, having spent a large part of the day failing to install a Perl module called Text::Unidecode, which is meant to be a red hot fix for this issue, I hit upon an incredibly simple way of getting the job done. In fact, using Perl pattern matching and substitution, you can get the whole job done very elegantly in one line of code. And it is indeed a beautiful line of code. :-*

$surname =~s/[^A-Za-z0-9]//g;

Which translates into English as: “If any character in this surname is not A-Z, a-z or 0-9, replace it with nothing.” This should make the system more robust, avoiding disappointed clients and extra product support emails. :yes:
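To see what that substitution does to a couple of real-looking surnames, here is a minimal sketch. The helper name strip_non_alnum and the sample surnames are made up for illustration; only the regex itself comes from the line above.

```perl
use strict;
use warnings;

# strip_non_alnum is a hypothetical helper name wrapping the one-liner above.
sub strip_non_alnum {
    my ($surname) = @_;
    # [^A-Za-z0-9] matches any single character outside the three ranges;
    # the empty replacement deletes it, and /g repeats this for every match.
    $surname =~ s/[^A-Za-z0-9]//g;
    return $surname;
}

# Accented letters are written as \x{..} escapes so this file stays pure ASCII.
print strip_non_alnum("O'Brien"), "\n";                 # OBrien
print strip_non_alnum("Garc\x{ed}a-L\x{f3}pez"), "\n";  # GarcaLpez
```

Note that the accented letters are dropped, not converted, so the second surname loses two letters. Punctuation such as apostrophes and hyphens goes too.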

That’s what you get when you write your own systems. You get the good and the bad. Still, it was very nice to get a working solution in just one line of code (if you discount the lines above it that manually substitute a for á, and a lot of others, so the surname reads OK in the order ID; it also prevents Mr áàßáàßáàßáàß from having a blank at the end of his OID :clap: ).
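The manual substitutions mentioned in that parenthesis aren't shown in this post, so the following is only a rough sketch of how the two steps might fit together. The %plain table and the ascii_surname name are hypothetical, the mapping is nowhere near complete, and it assumes the surname is already a decoded character string (not raw UTF-8 bytes off the wire).

```perl
use strict;
use warnings;

# Hypothetical mapping table; the real list of substitutions is not shown in the post.
my %plain = (
    "\x{e1}" => 'a',    # a with acute accent
    "\x{e0}" => 'a',    # a with grave accent
    "\x{df}" => 'ss',   # German sharp s
    "\x{e9}" => 'e',    # e with acute accent
    "\x{fc}" => 'u',    # u with umlaut
);

sub ascii_surname {
    my ($surname) = @_;
    # Step 1: swap each known accented character for its plain equivalent;
    # characters not in the table are left alone for now.
    $surname =~ s/([^\x00-\x7f])/exists $plain{$1} ? $plain{$1} : $1/ge;
    # Step 2: the one-liner removes anything still outside A-Z, a-z, 0-9.
    $surname =~ s/[^A-Za-z0-9]//g;
    return $surname;
}

print ascii_surname("\x{e1}\x{e0}\x{df}"), "\n";  # aass
print ascii_surname("M\x{fc}ller"), "\n";         # Muller
```

Doing the mapping first is what keeps an all-accented surname from being stripped down to an empty string. Anything missing from the table still falls through to the one-liner and gets dropped.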

I was so pleased I blogged it.