TL;DR version
To make this whole process as simple as possible, a variation of the third approach below (Rack middleware + Selenium WebDriver + no caching) is available here as a Gem. Drop it into your project, install its dependencies, and may the SEO gods bless you!
The whole story
Much has been said about how hard it is to build a single-page app that responds well to crawlers. Most search crawlers don’t support Javascript, which means applications that rely completely on client-side rendering will serve them blank pages.
Luckily, there are several approaches one can take to circumvent the lack of faith from certain crawlers – there’s obviously no “one size fits all” approach, so let’s take a minute to go through three of the most commonly used approaches, highlighting pros and cons of each one of them.
Render everything in two “modes” (the no script approach)
This strategy consists of rendering your app normally, BUT with pieces of “static” content already baked in (usually inside a <noscript> block). In other words, the client-side templates your app serves will have at least some level of server-side logic in them – hopefully, very little.
Although it’s a somewhat natural approach for a developer used to rendering templates and content on the server side, it means everything has to be implemented twice – once in the Javascript templates/views, once in the server-rendered pages – which makes everything hard to maintain (and prone to falling out of sync) real quick. That Product detail view now includes a “featured review”? No problem, just render its text on the server side (and also return it from the JSON endpoint your client app will use to render it again in a Javascript template).
This DOES work well for mostly-static, single-content apps (eg.: blogs), where even the javascript app itself would benefit from having the content already pre-loaded (the javascript views/snippets/segments would effectively fiddle with blocks of content, instead of fetching them from the server).
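To make the duplication concrete, here’s a minimal sketch of what a view following this approach could look like in a Rails/ERB app – the products#show view, @product and @featured_review are hypothetical names, purely for illustration:

<%# app/views/products/show.html.erb (hypothetical example) %>
<%# The client-side app mounts here and fetches its data from a JSON endpoint. %>
<div id="app" data-product-url="<%= product_path(@product, format: :json) %>"></div>

<%# The same content, rendered a second time on the server for clients without Javascript. %>
<noscript>
  <h1><%= @product.name %></h1>
  <p><%= @product.description %></p>
  <blockquote><%= @featured_review.body %></blockquote>
</noscript>

Every time the Javascript view changes, the noscript block has to change with it – that’s the maintenance cost described above.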
It’s worth noting that you should NOT rely on rendering just bits of content when serving pages to bots, as some of them state that they expect the full content of the page. It’s also worth pointing out that Google uses the snapshots it takes when crawling to compose the tiny thumbnails you see on search results, so you want these to be as close to the real thing as possible – which just compounds the maintenance issues of this approach.
The hash fragment approach
This technique is supported mainly by Googlebot (with limited support from a few other bots – Facebook’s crawler works too, for instance) and is explained in detail here.
In short, the process happens as follows:
- The search bot detects that there are hash fragment parameters in your URL (eg.: www.example.com/something#!foo=bar).
- The search bot then makes a SECOND request to your server, passing a special parameter (_escaped_fragment_) back – eg.: www.example.com/something?_escaped_fragment_=foo=bar. It’s now up to your server-side implementation to return a static HTML representation of the page (a sketch of this follows below).
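As a rough illustration of that last step, here’s a minimal sketch – assuming a Rails app – of catching those requests in a before_filter. serve_snapshot_to_crawlers and render_snapshot_for are hypothetical names, the latter standing in for whatever static-rendering strategy you end up picking:

# app/controllers/application_controller.rb (hypothetical sketch)
class ApplicationController < ActionController::Base
  before_filter :serve_snapshot_to_crawlers

  private

  # When the crawler sends the _escaped_fragment_ parameter, return a
  # pre-rendered HTML snapshot instead of the empty client-side shell.
  def serve_snapshot_to_crawlers
    return unless params.key?("_escaped_fragment_")
    render text: render_snapshot_for(request.path), content_type: "text/html"
  end
end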
Notice that for pages that don’t have a hash-bang in their URL (eg.: your site root), this approach also requires that you add a meta tag to your pages, letting the bot know that those pages are crawlable:
<meta name="fragment" content="!">
Notice the meta tag above is mandatory if your URLs don’t use hash fragments (which is becoming the norm these days, thanks to the widespread adoption of HTML5’s History API across browsers) – conversely, this is probably the only one of the three techniques that will work if you depend on hash-bang URLs on your site (please don’t!).
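If you’re on Rails, the tag can simply live in the head of your application layout (hypothetical placement):

<%# app/views/layouts/application.html.erb %>
<head>
  <title>My app</title>
  <meta name="fragment" content="!">
</head>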
You still have to figure out a way to handle the _escaped_fragment_ requests and render the content they are supposed to return (just like in the previous approach), but at least it moves that content out of the body of the pages served to regular users (reducing, but not eliminating, the duplication issue). This works somewhat well on sites where only part of the content is dynamic – not so much on single-page apps. The lack of universal search bot support is another obvious downside. Plus, you still have to pick a strategy to render the content without Javascript when the _escaped_fragment_ request arrives. Which leads us directly to the third approach…
Crawl your own content when the client is a bot
Although this seems counterintuitive at first, this is one of the approaches Google suggests when you’re dealing with sites where the majority of the content is generated via Javascript (all single-page apps fall into this category).
The idea is simple: if the requester of your content is a search bot, spawn a SECOND request to your own server, render the page using a headless browser (thankfully, there are many, many options to choose from in Ruby) and return that “pre-compiled” content back to the search bot. Boom.
The biggest advantage of this approach is that you don’t have to duplicate anything: with a single intercepting script, you can render any page and return it as needed. Another plus is that the content search bots see is exactly what an end user would.
You can implement this rendering in several ways (from worst to best):
- A before_filter on your controllers checks the user-agent making the request, then fetches the desired content and returns it. PROS: all vanilla-Rails approach. CONS: you’re hitting the entire Rails stack TWICE.
- Have a Rack middleware detect the user-agent and initiate the request for the rendered content (see the sketch after this list). PROS: still a self-contained, in-app approach. CONS: you need to be careful about which content gets served, since the middleware will intercept all requests.
- Have the web server (nginx, Apache) inspect the user-agent and route bot requests to a different endpoint on your server (eg.: example.com/static/original/route/here) that will serve the static content. PROS: only one request hits your app. CONS: requires poking around the underlying server infrastructure.
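Since the Gem from the TL;DR follows the middleware route, here’s a rough sketch of what that option can look like. The class name, the bot list and the use of Capybara with a Javascript-capable driver (selenium, capybara-webkit, etc.) are all illustrative assumptions – a real implementation also needs caching, timeouts and error handling:

# Hypothetical Rack middleware sketch (not the actual Gem's code).
require "rack"
require "capybara"

class RenderForCrawlers
  BOT_PATTERN = /googlebot|bingbot|yandex|baiduspider|facebookexternalhit/i

  def initialize(app)
    @app = app
  end

  def call(env)
    request = Rack::Request.new(env)
    return @app.call(env) unless request.user_agent.to_s =~ BOT_PATTERN

    # Make a SECOND request against our own app through a headless,
    # Javascript-capable browser, then hand the resulting HTML to the bot.
    # (The browser sends its own user-agent, so it won't match BOT_PATTERN
    # and trigger this middleware again.)
    session = Capybara::Session.new(Capybara.javascript_driver)
    session.visit(request.url)
    [200, { "Content-Type" => "text/html" }, [session.html]]
  end
end

In a Rails app, something like this could be plugged in with config.middleware.use RenderForCrawlers in config/application.rb.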
As for how to store the server-side rendered content (again, from worst to best):
- Re-render things on every request. PROS: no cache-invalidation hell. CONS: performance.
- Cache rendered pages, ideally with a reasonable expiration time (see the sketch after this list). PROS: much faster than re-fetching/rendering pages every time. CONS: cache maintenance might be an issue.
- Cache rendered pages in a way that the web server can fetch them directly (eg.: save them as temp files). PROS: INSANELY FAST. CONS: huge cache maintenance overhead.
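As a tiny sketch of the middle option, the expensive render can be wrapped in Rails.cache with an expiration time – render_snapshot_for is the same hypothetical placeholder used earlier, and 12.hours is an arbitrary choice:

# Hypothetical caching wrapper around the headless-browser render.
def cached_snapshot_for(url)
  Rails.cache.fetch(["crawler-snapshot", url], expires_in: 12.hours) do
    render_snapshot_for(url) # the slow part: spinning up the headless browser
  end
end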
There are a few obvious issues you have to keep in mind when using this approach:
- Every request made by a search bot will tie up two processes on your web server (since you’re calling your own app back to get the content to return to the bot).
- The render time is a major factor when search engines rank your content, so fast responses here are paramount (caching the static versions of pages, therefore, is very important).
- Some hosting services (I’m looking at you, Heroku) might not support the tools you use to render the content on the server side (capybara-webkit, selenium, etc). Either switch servers or simply host your “staticfying” layer somewhere else.