The Sad State of HTML Email Input Fields and IDNs

CM30 · on Oct 23, 2016

This reminds me of another HTML input type that's really poorly implemented across browsers. Namely, date and time input fields.

In that case, it varies so much across browsers as to be almost unusable. Some (like Firefox) don't seem to support the calendar aspect, some have a terrible calendar UI, some fall back to the device calendar UI, and some let you select invalid dates despite greying them out. And that's before you get to the actual data, how it's formatted, or the lack of any events associated with the fields.

Honestly, it seems like today's browsers just can't handle any of the newer input types in any reasonable way.

janci · on Oct 23, 2016

It's not only date/datetime/time fields that are broken. type=number also does not work consistently across different locales and different browsers. Some don't let you type a decimal comma.

zbraniecki · on Oct 24, 2016

Hi, I'm one of the engineers working on date/time fields [0] (you can see the design document for them here[1]) for Firefox. We're on the way to support them and hopefully support them well.

Now, I also happen to be working on a refactor of our Intl code, so I should be able to help with the number field being inconsistent. If you can file a bug and prepare a minimized testcase, I'd be happy to fix it! :)

[0] https://bugzilla.mozilla.org/show_bug.cgi?id=888320

[1] https://bugzilla.mozilla.org/show_bug.cgi?id=1069609

err4nt · on Oct 24, 2016

Hey @zbraniecki, thanks for the work you're doing. I use type=number cautiously for self-built tools for my own use, but so far have avoided it in production code. Thanks to your effort maybe someday I'll be able to use this 'for real'!

freditup · on Oct 23, 2016

Also with type=number fields: some support showing a leading "0" while others don't. Seems like a small detail, but it all adds up to death by a thousand cuts and custom implementations of everything.

edoceo · on Oct 23, 2016

IE doesn't understand the max attribute. It sometimes flags an error on a large (but inconsistent) value.

Id est: use max="100000". Then on one IE/Edge box 16000 is an error but 15000 is OK. On a different IE/Edge box the cap is 18000.

I have to make a custom UI component to make this work (riotjs).

adrianratnapala · on Oct 23, 2016

Is a "decimal comma" a thousands separator, or is it a decimal point for locales where that is expressed by a comma?

rhizome · on Oct 23, 2016

I'm not sure this is universal, but locales which use a period for a thousands separator use a comma as a decimal point, and vice-versa.

m_t · on Oct 24, 2016

It's not universal, that wouldn't be complicated enough!

heinrich5991 · on Oct 23, 2016

It's the separator between the whole and fractional parts of the number.

nattaylor · on Oct 23, 2016

I think this is because it's so difficult to support all the different options, like languages and different calendars ...but you're right: getting English + Gregorian consistently right would be a start!

kuschku · on Oct 23, 2016

That alone is hard.

I’ve got a version of Chrome here that, if you query with .value from JS the value of a datetime-local field, returns you a String in "MM/DD YYYY" format. Despite the user’s locale being de_DE.

I’ve also discovered one browser only supporting ISO8601, and another only supporting it if one removes the time section. Another supports it only if you have a time section but remove the time zone part at the end.

It’s such a horrible mess.

jstanley · on Oct 23, 2016

> Despite the user’s locale being de_DE.

Surely you're not suggesting that the representation of the date that is returned by the browser should be locale-dependent? That would be a nightmare.

kuschku · on Oct 23, 2016

Well, it’d be less of a nightmare than returning it in a random locale!

At least when it’s locale-dependent I can use a look up table and parse it somehow, but getting it back in MM/DD YYYY?

I’d love to just get ISO8601 ideally, btw, but who knows if that will ever happen.

(Also, you should check out Microsoft Excel, its date type and CSV converter are both locale-dependent. Locale of the program opening the file, not the file itself. A file created in de_DE will be unusable in en_US)

rch · on Oct 23, 2016

Do not look at Excel when thinking about appropriate handling of dates and times.

mathw · on Oct 24, 2016

It makes a fantastic checklist of things you shouldn't do while handling dates though!

TeMPOraL · on Oct 23, 2016

One would think it should be simple in principle.

User side: just show/accept whatever the site owner wants, or if it's too difficult, stick to ISO 8601 (it's the standard, damn it).

JS side: just ISO 8601.

skrebbel · on Oct 23, 2016

Why ISO? JavaScript has had a Date type since forever.

TeMPOraL · on Oct 23, 2016

So Date type, then, but if we're talking input fields, they're also meant to be used with forms submitted without all that JavaScript cruft on top. So if you have to convert the date input to text for sending (or direct retrieval in code), I'd say just stick to ISO. Why? Because it's a standard.
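
To make the "just stick to ISO" suggestion concrete, here is a minimal Python sketch of what server-side handling could look like if every browser submitted ISO 8601 style strings for date and datetime-local fields. The function names and the sample inputs are illustrative assumptions, not taken from any browser or framework.

```python
from datetime import date, datetime


def parse_date_field(raw: str) -> date:
    """Parse a submitted type=date value, assumed to be YYYY-MM-DD."""
    return date.fromisoformat(raw.strip())


def parse_datetime_local_field(raw: str) -> datetime:
    """Parse a submitted type=datetime-local value, assumed to be
    YYYY-MM-DDTHH:MM with no time zone part."""
    return datetime.fromisoformat(raw.strip())


def try_parse(parser, raw: str):
    """Return the parsed value, or None when the string is not ISO-like
    (for example a locale-flavoured 'MM/DD YYYY' string)."""
    try:
        return parser(raw)
    except ValueError:
        return None
```

With a single wire format the server needs no per-locale lookup table; anything that fails to parse can simply be rejected.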

skrebbel · on Oct 23, 2016

Ah yes, I'm so stuck in JS-o-world that I forgot there's an actual text format called HTML. Thanks :-)

hibbelig · on Oct 23, 2016

Why the format the site owner wants? Why not the format the user wants?

TeMPOraL · on Oct 23, 2016

That could be a reasonable default, but you need to give the programmer control over the format, or else everyone will keep using JS-powered replacements.

adrianratnapala · on Oct 23, 2016

WPF Datagrids have a "currency" format. This defaults to dollars regardless of your locale.

bazzargh · on Oct 23, 2016

The article points at the regexp but fails to point at what the standard says about how that's supposed to be used.

https://www.w3.org/TR/html5/forms.html#e-mail-state-(type=em...

    User agents may transform the value for display
    and editing; in particular, user agents should
    convert punycode in the value to IDN in the
    display and vice versa.

i.e., you're supposed to be able to type human-readable addresses, but they're converted to punycode for submission, because that's what the server will need in order to use the address. The regexp is used to validate the punycode, not what the user types.
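
As a rough illustration of the conversion described here, the following Python sketch converts only the domain part of an address between its Unicode and punycode forms. It relies on Python's built-in "idna" codec, which implements the older IDNA 2003 rules, so it is an approximation of what a browser would do rather than a statement of it. The function names are hypothetical.

```python
def address_to_ascii(address: str) -> str:
    """Convert the domain of an address to its ASCII (punycode) form.

    The local part is passed through unchanged. Uses the stdlib "idna"
    codec (IDNA 2003), which is an assumption of this sketch.
    """
    local, sep, domain = address.rpartition("@")
    if not sep:
        raise ValueError("no @ in address")
    return local + "@" + domain.encode("idna").decode("ascii")


def address_to_unicode(address: str) -> str:
    """Inverse direction: show punycode labels as readable Unicode."""
    local, sep, domain = address.rpartition("@")
    if not sep:
        raise ValueError("no @ in address")
    return local + "@" + domain.encode("ascii").decode("idna")
```

For example, `address_to_ascii("person@ü.example.com")` gives `person@xn--tda.example.com`, and `address_to_unicode` maps it back.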

Which isn't to say this isn't a thorny problem; more discussion of the issue here https://www.w3.org/Bugs/Public/show_bug.cgi?id=15489

mike-cardwell · on Oct 23, 2016

"they're converted to punycode for submission, because that's what the server will need in order to use the address"

I disagree with that. Neither the client nor the server needs to know anything about punycode. The only time punycode is required is at the very last moment when it comes time to actually send an email to that address. Whether it's the email sending library or the mail server itself, it doesn't matter.

A user should be able to type "person@ü.example.com" into a form. I should receive "person@ü.example.com" on the server side. I should save "person@ü.example.com" to the database. I should be able to send "person@ü.example.com" back to the browser for display, and I should be able to pass "person@ü.example.com" as the To address to my mail sending library.

Punycode is an implementation detail that I shouldn't need to think about.

bazzargh · on Oct 23, 2016

That's a perfectly reasonable objection, and answered to an extent over here https://www.w3.org/Bugs/Public/show_bug.cgi?id=15489#c18 tl;dr: "backwards compatibility"

I'd also agree that my code should be able to handle unicode, for a different reason: I never trust validation by a browser (old clients, bad clients, malicious clients)

mnarayan01 · on Oct 23, 2016

Having both "person@ü.example.com" and "person@xn--tda.example.com" be both valid and equivalent makes just abstracting over it more problematic.

Animats · on Oct 23, 2016

It's much worse than that. This doesn't even cover the rules for domain names, which are mapped to a subset of Unicode. Read the horrors of "Unicode IDNA Compatibility Processing".[1] This is an incredibly complicated scheme to deal with homoglyphs - characters which look the same, but are not. There's filtering for specific characters that look the same. There's detection of mixed left-to-right and right-to-left characters. This is all to prevent attacks where the domain name in a link looks like some trusted site, but isn't.

The "." in a domain name is no longer always a period. It can be any of

    U+002E ( . ) FULL STOP
    U+FF0E ( ． ) FULLWIDTH FULL STOP
    U+3002 ( 。 ) IDEOGRAPHIC FULL STOP
    U+FF61 ( ｡ ) HALFWIDTH IDEOGRAPHIC FULL STOP
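
A minimal Python sketch of handling just this one detail: mapping the four separator code points listed above to U+002E before splitting a domain into labels. It deliberately ignores the rest of the UTS #46 processing (mapping tables, bidi checks and so on), and the names are hypothetical.

```python
from typing import List

# The three non-ASCII separators from the list above, mapped to U+002E.
LABEL_SEPARATORS = str.maketrans({
    "\uFF0E": ".",  # FULLWIDTH FULL STOP
    "\u3002": ".",  # IDEOGRAPHIC FULL STOP
    "\uFF61": ".",  # HALFWIDTH IDEOGRAPHIC FULL STOP
})


def split_labels(domain: str) -> List[str]:
    """Split a domain into labels, treating all four dots alike."""
    return domain.translate(LABEL_SEPARATORS).split(".")
```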

Validating an email address is quite hard if done right.

[1] http://www.unicode.org/reports/tr46/

STRML · on Oct 23, 2016

This is why we badly need an alternative to input's `type` attribute, as the type attribute encapsulates many different things:

1. Validation

2. Autofill hints

3. Native helper widgets (date calendar, number input spinner)

4. Mobile keyboard layouts

And, confusingly, while most `type`s are text inputs with differing values of the above, others are very, very different (radio, checkbox, select, file, etc). Still others use completely different tags (like <textarea>) and you even have to switch on `value` or `checked` for checkboxes.

Many people use `type="tel"` or `type="number"` just for the mobile layouts, and spend a long amount of time working around all the awful bugs in number inputs. Our own `<NumberInput>` React component works around multiple browser bugs and took weeks to get right. The incidental complexity even creates very hairy React bugs (https://github.com/facebook/react/issues/7253).

Even if you get around all the bizarre failures in differing implementations, you still have ridiculous spec bugs like the (intended) lack of selectionEnd/selectionStart on number inputs (https://www.w3.org/Bugs/Public/show_bug.cgi?id=24796) and the like.

I don't know if anyone is championing splitting these concerns into separate attributes, but the web really needs it.

snowl · on Oct 23, 2016

The problem goes much deeper than just HTML fields - sometimes the punycode will be converted to Unicode and then get rejected further down the chain. For example, Google accepts punycode domains for custom email domains but will convert it to a Unicode domain name when receiving it. It handles most IDN domains fine, but it fails at domains with emoji in them (my domain which fails is http://xn--p38h.ws).

The real problem here is all the systems we use are more complex than people realize. Things like names (http://www.kalzumeus.com/2010/06/17/falsehoods-programmers-b...) & time (http://infiniteundo.com/post/25326999628/falsehoods-programm...) just cause issues that most developers will end up just ignoring, because in the end it's just edge cases that will happen maybe once or twice, ever. Why put in the effort?

AndyMcConachie · on Oct 23, 2016

The current standard for IDNs is IDNA2008, which does not allow emojis. There are some ccTLDs that don't abide by the IDNA standard, and I think you'll find different browsers will show the emoji and others won't. Permitted code points for gTLDs are listed here. http://www.iana.org/domains/idn-tables

So strictly speaking xn--p38h.ws is not an IDN, and Google is doing the right thing by not allowing it. This doesn't make your job any easier, of course.

If you really want to explore this rabbit hole, here is some good reading: RFC 4290, RFC 5891, RFC 6912, RFC 7940.

jacksonsabey · on Oct 23, 2016

Emojis are disallowed in IDNA2008. U+1F63B is in the disallowed set, but you can still register it on some TLDs: https://whois.domaintools.com/xn--238h.com vs https://whois.domaintools.com/xn--238h.cf

  Gmail for example still allows me to email myself at U+1F63B.*.com or xn--238h.*.com

http://unicode.org/reports/tr46/#Table_Data_File_Fields

You can see the complete list of characters here:

ftp://ftp.unicode.org/%2F/Public/idna/8.0.0/IdnaMappingTable.txt

darkhorn · on Oct 23, 2016

I don't understand why he quotes from w3.org. HTML5 is being developed by whatwg.org

"The WHATWG was founded by individuals of Apple, the Mozilla Foundation, and Opera Software in 2004, after a W3C workshop. Apple, Mozilla and Opera were becoming increasingly concerned about the W3C’s direction with XHTML, lack of interest in HTML and apparent disregard for the needs of real-world authors. So, in response, these organisations set out with a mission to address these concerns and the Web Hypertext Application Technology Working Group was born."

gsnedders · on Oct 23, 2016

Because we're now in a horrendous mess where the W3C has a fork[0] of the WHATWG spec[1], which semi-regularly takes selected patches from the WHATWG spec… but at times ends up with the spec in an inconsistent state. As it is, basically all implementers are working from the WHATWG spec.

[0]: https://w3c.github.io/html/

[1]: https://html.spec.whatwg.org/multipage/

kuschku · on Oct 23, 2016

Because the WHATWG is obviously incompetent, and has created this mess in the first place.

Sometimes we need someone to enforce standards, and not just say "anything is valid".

We need the W3C again.

kgwxd · on Oct 23, 2016

Validation is logic, it doesn't belong in markup. A rectangle to type text into is as far as the abstraction should go.

doytch · on Oct 23, 2016

No one's asking for bulletproof validation. What I want out of input[type="fancy"] elements is:

1. Better detection of the platform than I can do myself. That leads to a UI targeted at the platform (different default phone keyboards, etc).

2. Uniform UIs across websites. I hate having to learn a new calendar widget on every website, it's worse for regular users, not to mention any users that need accessibility features that are prooobably not included in most home-rolled (or even popular) widgets.

3. Basic validation. Just to help the user a bit. I'll validate it again on the back-end, and I might even be validating it myself on the front-end. But basic validation helps and, once again, that uniformity thing shows up again here since the user could be aware of how that widget reports errors on their platform already and could be looking for them.

matt4077 · on Oct 23, 2016

Logic is also logic, so most texts (your comment excepted) shouldn't be allowed in markup.

Note that nobody actually wants to embed the code for validation within the document, but you want to be able to properly identify a form element's attributes so as to allow clients to make a smart choice about validation, presentation, defaults etc.

JPEG decoding is also logic, but it's also pretty useful for a client to know that the binary stream it's reading may best be handled with the jpeg library.

grzm · on Oct 23, 2016

I'm torn. For the most part I agree with you.

The niggling part that doesn't is where we decide where our level of abstraction is. It can be very convenient to be able to declare what the type is. If you don't do it in the markup, you're likely going to want this in some kind of library. And that library will need to be extensible or you've just shifted the problem, and now it's the library. So that part of it needs to be well thought out and portable. But in the end, I do think that the solution is more tractable in logic rather than markup.

Though maybe markup tries to do too much. Look at all the additions to HTML5. It's great to be able to express so much more in the markup. Yet there are still going to be times where you're going to come across a situation that doesn't fit neatly into the existing elements. So you're going to add some extra domain-specific meaning to the markup you're using. You can work around this with class attributes, but that's just it: it's a workaround.

Enough rambling.

I think there's a corresponding issue with display and behavior and the intersection of markup, CSS, and JavaScript, boiling down to inadequate separation of concerns.

Silhouette · on Oct 23, 2016

This is a reasonable concern, but far smaller IMHO than the main problem with email fields on web forms today: there is no way to verify that someone actually entered a working email address, because fear of spammers has meant all the plausible techniques for doing so get closed down.

We've occasionally had an unusual but syntactically correct address cause problems in the past because it wasn't processed properly, but we get problems with people who have accidentally mistyped their address when signing up and then can't log in to our systems all the time. If you have a system where the email address is the primary ID for a user's account and you're charging real money, this is not a trivial problem and it does not have a simple solution: requiring active confirmation at sign-up time does horrible things to conversion rates, but anything else is potentially vulnerable to security issues later.

Only the local part of an email address potentially being case sensitive causes us more headaches in this area today.

jacksonsabey · on Oct 23, 2016

you could inform your users that email addresses are case sensitive much like passwords usually are or just normalize the local part along with the host and use that as the primary ID and have less issues in the future

servers that have case sensitive mailboxes are more likely to be used for throwaways or the user may control the server anyways so they could still respond to normalized local parts

as for verifying the email at registration you could check to see what the remote smtp server responds to when you issue the RCPT command to check if they consider the email valid
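
The normalization option mentioned in this comment could be sketched in Python roughly as follows. Lowercasing the local part is a policy choice rather than something mail standards guarantee to be safe, and converting the domain with the stdlib "idna" codec (IDNA 2003) is an extra assumption added here so that the Unicode and punycode spellings of a domain map to the same key. The function name is hypothetical.

```python
def normalize_address(address: str) -> str:
    """Build a lookup key for an account from an e-mail address.

    Assumption: treating the local part as case-insensitive is
    acceptable for this system. That is a policy decision, not a rule.
    """
    local, sep, domain = address.strip().rpartition("@")
    if not sep or not local or not domain:
        raise ValueError("not an address: %r" % address)
    # Lowercase first, since the codec leaves pure-ASCII labels as-is.
    ascii_domain = domain.lower().encode("idna").decode("ascii")
    return local.casefold() + "@" + ascii_domain
```

The original spelling would still be stored separately for actually sending mail; only the key used for account lookup is normalized.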

Silhouette · on Oct 23, 2016

> you could inform your users that email addresses are case sensitive much like passwords usually are or just normalize the local part along with the host and use that as the primary ID and have less issues in the future

We could normalise, and as far as I'm aware none of the largest e-mail services allow distinctions based on case so most of the time it would be OK. It would still be a security risk, though.

> as for verifying the email at registration you could check to see what the remote smtp server responds to when you issue the RCPT command to check if they consider the email valid

You can, but many servers including some of the major services will just return a false positive for any mailbox to prevent that technique being used to collect addresses to spam, and even those that don't may consider such requests when not followed with a real message to be a black mark on the sender.

Buge · on Oct 24, 2016

Your blog post has a problem in that it seems to be converting double hyphens into a single en-dash.

boubiyeah · on Oct 23, 2016

I only use type=text inputs and implement custom logic on top of it; the others are crap, implemented in a rush and very inconsistently across browsers.

mike-cardwell · on Oct 23, 2016

What about type="number" ? There are clear benefits to using it when a user is entering a number and I'm not aware of any drawbacks? One of the benefits is the way some software keyboards display a number based keyboard instead of a qwerty one...

DCoder · on Oct 23, 2016

1. type=number only accepts the period as a decimal separator, and there are lots of locales using comma for that purpose. Customers often ask us to support commas, which requires type=text and manual validation.

2. One customer requested the ability to enter decimals in a field, but to keep up/down spinner in steps of whole numbers. You can't do that, the step attribute makes any values other than (min + N * step) invalid. I can see some sense in that, but I think the step attribute should only affect the increments/decrements done by the spinner, not completely reject certain values. (Though you can use step=any to mark all values valid and step by 1.0, which is good enough for common cases.)
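
The step rule from point 2 can be written down in a few lines. This Python sketch assumes that min is the step base and that values arrive as strings, and it uses Decimal so that steps like 0.1 do not run into binary floating-point rounding. The function name is hypothetical.

```python
from decimal import Decimal


def satisfies_step(value: str, minimum: str, step: str) -> bool:
    """Return True if value equals minimum + N * step for an integer N.

    step="any" switches the check off, as described above.
    """
    if step == "any":
        return True
    offset = Decimal(value) - Decimal(minimum)
    return offset % Decimal(step) == 0
```

So with min=0 and step=1, "3" passes and "2.5" fails, while step="any" lets "2.5" through.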

robocat · on Oct 23, 2016

There is an awful UI problem on iOS and Android with type=number: they both allow invalid values to be entered, but the value given is just "" blank.

The user thinks they have typed in something valid, but JavaScript (or value from form submit) only gets to see "". E.g. use a thousands separator or paste in a trailing space and you will get an error "input must not be left blank".
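
One possible way around the silent blank is to submit the field as plain text and parse it leniently on the server. The Python sketch below is an assumption about how that might be done, not a description of any browser behaviour: the caller has to say which separators the user's locale uses, and all names are hypothetical.

```python
import re
from decimal import Decimal
from typing import Optional

# Plain digits with an optional sign and an optional fractional part.
_PLAIN_NUMBER = re.compile(r"[+-]?[0-9]+(?:\.[0-9]+)?")


def parse_user_number(text: str, decimal_sep: str = ",",
                      group_sep: str = ".") -> Optional[Decimal]:
    """Parse user-typed text such as ' 1.234,5 ' into a Decimal.

    Returns None instead of an empty string when the text is not a
    number, so the caller can show a meaningful error message.
    """
    cleaned = "".join(text.split())  # drop all whitespace
    if group_sep:
        cleaned = cleaned.replace(group_sep, "")
    cleaned = cleaned.replace(decimal_sep, ".")
    if not _PLAIN_NUMBER.fullmatch(cleaned):
        return None
    return Decimal(cleaned)
```

Note the inherent ambiguity: with the default separators, "3.14" is read as 314, which is why the separators must come from the user's locale rather than be guessed.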

There are worse issues with the less common input types.

kgwxd · on Oct 23, 2016

I may be remembering it wrong but, less than 2 years ago, there was some issue about Android not firing change events on those for some reason and it was in a version of the browser that a lot of users were going to be using for a long time to come because they weren't getting OTA updates anymore. It may not be a real concern anymore but it's junk like that that makes these things a pain. Unfortunately, there's no other good way to get the number pad, which I don't think is a real issue unless individual users will be filling out your forms a lot, like for a data entry application, in which case a custom number pad might be worth considering.

Illniyar · on Oct 23, 2016

Then you are purposely limiting your users' UX on mobile; different keyboards are very helpful.

Of course it's a valid choice, but many people will consider this a serious issue.

sildur · on Oct 23, 2016

You have just defined the whole HTML5 mess. But who needs the W3C, right?

jlebrech · on Oct 23, 2016

form/validation should be what HTML excels at, it should even be one of the few things it does. and we should use another technology to do other things we've been commandeering it to do.

grzm · on Oct 23, 2016

Client-side validation can be very useful. At this point I think that the inconsistencies between implementations are the primary pain points, similar to what prompted the Web Standards Project.[0] Is there anything similar to the -{webkit,moz} in CSS for HTML? Or abstraction libraries comparable to jQuery?

It would also be very useful if the validation is extensible, as different projects require different validation rules.

I'd also want validation rules that are portable between the client and the server to reduce code duplication.

One combination I've been toying with is using clojure.spec[1] with ClojureScript/Clojure with varying levels of success. I imagine you can do similar things using server-side JavaScript.

> we should use another technology to do other things we've been commandeering it to do.

What types of things are you thinking of?

[0]: https://en.wikipedia.org/wiki/Web_Standards_Project

[1]: http://clojure.org/about/spec

jlebrech · on Oct 23, 2016

there should be a form language that works client/server and is just for classic web such as forms, something similar to meteor or volt. i'd even resort to using those in future or write a dsl on top that makes web development a breeze.

jacksonsabey · on Oct 23, 2016

client side validation is pointless if you can't validate server side

http://www.w3schools.com/html/tryit.asp?filename=tryhtml_inp...

the firefox implementation will gladly accept the invalid test.@example.com but won't accept the valid "test"@example.com

this is just a case of poor validation which we are now stuck with
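
The behaviour described here is consistent with the pattern the HTML specification gives for a valid e-mail address. The Python sketch below uses a transcription of that pattern from memory, so treat the exact expression as an assumption and check it against the spec text before relying on it. The helper name is hypothetical.

```python
import re

# Transcribed pattern (assumption: matches the HTML spec's definition).
# The ^ and $ anchors are dropped because fullmatch() is used instead.
EMAIL_PATTERN = re.compile(
    r"[a-zA-Z0-9.!#$%&'*+/=?^_`{|}~-]+"
    r"@[a-zA-Z0-9](?:[a-zA-Z0-9-]{0,61}[a-zA-Z0-9])?"
    r"(?:\.[a-zA-Z0-9](?:[a-zA-Z0-9-]{0,61}[a-zA-Z0-9])?)*"
)


def passes_html_email_pattern(address: str) -> bool:
    """True if the whole address matches the transcribed pattern."""
    return EMAIL_PATTERN.fullmatch(address) is not None
```

The local part is just a run of permitted characters, so a trailing dot is accepted and a quoted local part is not. Non-ASCII domain labels are rejected as well, which is why the punycode form is what gets validated.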

jgalt212 · on Oct 23, 2016

The internet is interoperability. And IDN does not seem to be interoperable.

As such, it's just not a good idea to use one for now.

As an aside, I don't even know why IDNs are allowed; they offer too many opportunities for domain spoofing and phishing.

kijeda · on Oct 23, 2016

Because a Latin-only Internet disenfranchises billions of people who are comfortable with other writing systems?

IDNs will take years to be pervasive, just like Unicode before it, because it is a painful upgrade to something designed in a different era. It doesn't mean the endeavour isn't worthwhile.

jgalt212 · on Oct 23, 2016

> Because a Latin-only Internet disenfranchises billions of people who are comfortable with other writing systems

You know that statement clearly is not true. Just look at keyboard designs used the world over.

mathw · on Oct 24, 2016

I'm not sure what their keyboards have to do with it. If you were a native Arabic speaker, wouldn't you want to be able to see domain names in Arabic that are easy for you to read and type? If you're a company operating in an Arabic market and you want to put your domain name on the posters you're designing to display at bus stops, you probably would prefer an Arabic domain name so that your primarily Arabic-speaking audience are more likely to remember it instead of dealing with an alphabet which belongs to a completely different language they may not speak at all (or not very well).

I know English is increasingly pervasive, but there are still billions of people who don't speak it at all or speak it only very poorly (and it's not exactly very easy to learn). The internet is a global network, and so should allow people to communicate easily in their own languages. This current pain with IDN is just a legacy of the internet's origins in America and Western Europe.

Heck, the examples in the article are just using one letter found in German. You don't need to go very far from English to find these problems.

jgalt212 · on Oct 24, 2016

All of the above are fair points, but IDN makes clicking on

www.cítíbank.com

instead of

www.citibank.com

a very real problem and will make the web more dangerous and will encourage the further growth of "walled gardens" on the internet which I think most here agree is a bad thing.
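
One mitigation that is sometimes applied to lookalikes of this kind is to show or compare the ASCII form of a host name, where the two spellings no longer look alike. The Python sketch below is only an illustration of that idea using the stdlib "idna" codec (IDNA 2003). The function names are hypothetical, and this is not a complete anti-phishing check.

```python
def to_ascii_host(host: str) -> str:
    """Return the ASCII (punycode) form of a host name."""
    return host.lower().encode("idna").decode("ascii")


def looks_internationalized(host: str) -> bool:
    """True if any label needed punycode, i.e. contains non-ASCII."""
    labels = to_ascii_host(host).split(".")
    return any(label.startswith("xn--") for label in labels)
```

Here `looks_internationalized("www.cítíbank.com")` is true while the plain ASCII name is not flagged, so a client could choose to display the `xn--` form when a name mixes in unexpected characters.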