I am happy to announce that based on some research and a Greasemonkey hack to make people aware of the consequences, Yahoo! is now a search engine that has natural language search results.
HTML has a wonderful attribute called lang that allows you to define the language of the text in the current HTML element. This may seem a bit superfluous, as it has nothing to do with displaying language-specific characters (that is the job of the encoding, which is another issue). However, defining the language has other benefits.
The first one is that search engines and other robots know what language the text is in and thus have a much easier job differentiating between keywords and stopwords.
The second, and most important, has to do with accessibility. If you cannot see the text but have it read out to you, then the pronunciation is very important. Visually impaired surfers use screen readers to tell them what is on the current page, and by defining the language, you make this a lot easier. Screen readers have different voices for different languages, each with the correct pronunciation rules. This is best explained with an example.
Small attribute, great difference - with lang and without lang
The following files are recordings of what screen readers will tell visually impaired users on a search result page. They have been slowed down to make it easier for sighted people to follow; normally, users will have the speed set a lot faster.
- Search result page in English with French content not flagged up with a lang attribute (mp3, 42 seconds, 670 KB)
- Search result page in English with French content flagged up as French using the lang attribute (mp3, 46 seconds, 720 KB)
- Italian search result page with French and English content without lang attributes (OGG Vorbis, 51 seconds, 400 KB)
- Italian search result page with French and English content with lang attributes (OGG Vorbis, 49 seconds, 300 KB)
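To make the difference between the recordings concrete, here is a minimal sketch of what flagging a result with lang could look like. It is not Yahoo!'s actual markup: the list structure, the example.com URLs and the result texts are made up for illustration. The only parts that matter are the lang attribute on the html element, which sets the page language, and the lang attribute on the French result, which overrides it for that one element.

```html
<!DOCTYPE html>
<!-- Illustrative only: structure, URLs and texts are hypothetical -->
<html lang="en">
<head>
  <meta charset="utf-8">
  <title>Search results</title>
</head>
<body>
  <h1>Search results</h1>
  <ol>
    <!-- French result: lang="fr" tells a screen reader to use French pronunciation here -->
    <li lang="fr">
      <a href="https://example.com/fr/">Bienvenue sur notre site</a>
      <p>Toute l'actualité en français.</p>
    </li>
    <!-- No lang attribute: this result inherits lang="en" from the html element -->
    <li>
      <a href="https://example.com/en/">Welcome to our site</a>
      <p>All the news in English.</p>
    </li>
  </ol>
</body>
</html>
```

Without the lang="fr" on the first list item, a screen reader would have no hint that the text is French and would most likely read it with the English voice of the surrounding page, which is what the recordings without lang attributes demonstrate.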
Thanks must go to Artur Ortega for testing with JAWS and Ryan Grove for adding the necessary information.