Yesod sitemap.xml generation (small size)

Posted on Oct 21, 2016 by Alexej Bondarenko

Yesod provides a simple sitemap.xml mechanism through the yesod-sitemap plugin. It works well for smaller projects (let's say a blog with 10 to 500 blog post entries). This blog post will show you how to use the plugin.

First of all, let's create our scenario. We will have a (simplified) blog post model which contains all blog contents, and a main blog page. We would like to collect all blog posts and add the corresponding links to our sitemap.xml, with the publication date as the last-modified value. Furthermore, we would like to attach a static URL for the main blog entry page to the sitemap.xml, with its last-modified value set to the date of the latest blog entry. And, as the last step, we need to add the sitemap location to the robots.txt so any bot can discover it easily. Since we are in the world of type safety (Haskell), we would like to handle all links in a type-safe manner so we don't mess things up if we change the links later.
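Throughout the post I will assume a simplified Persistent model along these lines (the field accessors blogPostSlug and blogPostPublishedAt used later derive from it; the exact fields are an assumption and your model may differ):

```
BlogPost
    title Text
    slug Text
    content Textarea
    publishedAt UTCTime
```

I will also assume a type-safe route BlogPostR for a single post and BlogR for the main blog page.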

Step 1: Building the skeleton

Let's create a skeleton Handler to see if our invocations of sitemap.xml and robots.txt work properly. For that we need to add yesod-sitemap as a new dependency to our project. Afterwards we can create a new Handler like this:

module Handler.Sitemap where

import Import
import Yesod.Sitemap
import qualified Data.Text as T

-- | Deliver a sitemap.xml to the client (usually GET request to /sitemap.xml)
getSitemapR :: Handler TypedContent
getSitemapR = sitemapList []

-- | Deliver a robots.txt to the client (usually GET request to /robots.txt)
getRobotsR :: Handler Text
getRobotsR =
  return $ T.unlines
    [ "User-agent: *"
    ]

Sweet! Nothing special going on here. We render a sitemap with no entries (empty list []) and a very minimal robots.txt. Now we need to add the routes to our config/routes file like this:

/sitemap.xml SitemapR GET
/robots.txt RobotsR GET

Hint: You'll need to import the Handler module in Application.hs so our functions getSitemapR and getRobotsR can be discovered.

To test our new routes we start our Yesod application, open up a browser window and call /sitemap.xml and /robots.txt. You should see very simple responses popping up.
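The responses are not very exciting yet. The empty sitemap should render roughly like this (the exact XML layout may differ slightly between yesod-sitemap versions):

```xml
<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9"/>
```

And /robots.txt simply returns the single line `User-agent: *`.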

Step 2: Generating links

Let's fetch some data and create links. I will use Esqueleto, but you can use any SQL fetching function you like. What we get as a result is a list of Persistent Entities, each consisting of its Key and its value. For readability we will create a function which transforms a data model into a URL (you can shortcut this if you like):

-- | Create a blog post URL
blogPostResource :: BlogPost -> Route App
blogPostResource blogPost =
  BlogPostR (blogPostSlug blogPost)

Here the slug of the post is a part of the blog post URL. With this simple function (sidenote: it can get much more complicated if you add date and category to the blog post URL) we can now create an entry for our sitemap.xml:
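As a side note: if your model only stores a title, a slug can be derived from it. Here is a minimal sketch of such a helper (mkSlug is hypothetical and not part of yesod-sitemap; adjust it to your slug rules):

```haskell
{-# LANGUAGE OverloadedStrings #-}

import           Data.Char (isAlphaNum)
import           Data.Text (Text)
import qualified Data.Text as T

-- | Turn a post title into a URL-friendly slug (hypothetical helper):
-- lower-case the title, keep alphanumeric characters and join the
-- remaining words with dashes.
mkSlug :: Text -> Text
mkSlug = T.intercalate "-"
       . filter (not . T.null)
       . T.split (not . isAlphaNum)
       . T.toLower
```

For example, mkSlug "Hello, World!" yields "hello-world". In a real application you would store the slug on the model so the URL stays stable even if the title changes.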

-- | Convert a BlogPost into a SitemapUrl
blogPostToSitemapUrl :: BlogPost -> SitemapUrl (Route App)
blogPostToSitemapUrl blogPost =
  SitemapUrl (blogPostResource blogPost) (Just $ blogPostPublishedAt blogPost) (Just Monthly) (Just 1.0)

SitemapUrl takes four parameters: the route (the location of the page), an optional last-modified date as UTCTime, an optional change frequency and an optional priority. I have set the change frequency and priority to some example values which you can adjust to your needs. We are now able to fetch a list of blog posts and map them to a list of SitemapUrls which we can pass to our sitemapList function inside the Handler:

import qualified Database.Esqueleto as E
import           Database.Esqueleto ((^.))

-- | Deliver a sitemap.xml to the client (usually GET request to /sitemap.xml)
getSitemapR :: Handler TypedContent
getSitemapR = do
  now <- liftIO getCurrentTime
  blogEntries <- runDB $
    E.select $
    E.from $ \entry -> do
      E.where_  (entry ^. BlogPostPublishedAt E.<=. E.val now)
      E.orderBy [E.desc (entry ^. BlogPostPublishedAt)]
      E.limit 1000
      E.offset 0
      return entry
  let sitemapUrls = map (\(Entity _ blogPost) -> blogPostToSitemapUrl blogPost) blogEntries
  sitemapList sitemapUrls

I have added two additional things to the query: 1. A filter on the publication date (we don't want unpublished blog posts to appear in the sitemap.xml). 2. For demonstration I have set the limit to 1000 blog post entries. If you have more than this amount, I suggest switching to a background job which creates a GZIP-compressed sitemap file on disk and letting Yesod stream this file to the client.
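As a rough sketch for the large-site case: the sitemap protocol caps a single file at 50,000 URLs, so the first step of such a background job would be splitting the URL list into chunks, one per sitemap file referenced from a sitemap index. A plain helper for that (chunkUrls is hypothetical, not part of yesod-sitemap, and assumes n > 0):

```haskell
-- | Split sitemap entries into chunks of at most n elements, e.g. to
-- honour the 50,000-URL limit a single sitemap file may contain
-- (hypothetical helper; assumes n > 0).
chunkUrls :: Int -> [a] -> [[a]]
chunkUrls _ [] = []
chunkUrls n xs =
  let (chunk, rest) = splitAt n xs
  in  chunk : chunkUrls n rest
```

Each chunk would then be rendered and compressed to its own file, and a sitemap index file would list all of them.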

By calling /sitemap.xml in the browser you should now see a nicely rendered sitemap with the blog post links inside it.
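For a hypothetical post with slug my-first-post on a site rooted at http://www.example.com (and a blog route under /blog, which is an assumption about your routes), each entry should look roughly like this:

```xml
<url>
  <loc>http://www.example.com/blog/my-first-post</loc>
  <lastmod>2016-10-21T10:00:00Z</lastmod>
  <changefreq>monthly</changefreq>
  <priority>1.0</priority>
</url>
```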

Step 3: Work on the details

To complete our tutorial we would like to add a "static" link to our blog entry page with the last modified date set to the latest blog post publication date. And - of course - we would like to add the sitemap.xml link to our robots.txt:

-- | Generate a last modification date from the newest blog post
blogModificationDate :: [BlogPost] -> Maybe UTCTime
blogModificationDate []           = Nothing
blogModificationDate (blogPost:_) = Just (blogPostPublishedAt blogPost)
We create a helper function to extract the first element of a list of blog posts and optionally return the publication date of this blog post. In our Handler function getSitemapR we can now add our blog entry page:
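The same helper can be expressed more compactly with listToMaybe from Data.Maybe. A minimal sketch using a simplified stand-in type (the real BlogPost stores the publication date as UTCTime; an Int timestamp keeps the example self-contained):

```haskell
import Data.Maybe (listToMaybe)

-- Simplified stand-in for the BlogPost entity (the real model stores
-- the publication date as UTCTime).
data Post = Post { postTitle :: String, postPublishedAt :: Int }

-- | Same idea as blogModificationDate: Nothing for an empty list,
-- the first (newest) post's publication date otherwise.
modificationDate :: [Post] -> Maybe Int
modificationDate = fmap postPublishedAt . listToMaybe
```

This relies on the query already sorting the posts newest-first, which our Esqueleto query does via orderBy.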

let staticRoutes = [ SitemapUrl BlogR (blogModificationDate (map entityVal blogEntries)) (Just Weekly) (Just 0.9) ]
sitemapList (sitemapUrls ++ staticRoutes)

To add our sitemap URL to the robots.txt we need to render the SitemapR route as text. We will use Yesod's getUrlRender to do so:

-- | Render robots.txt
getRobotsR :: Handler Text
getRobotsR = do
   ur <- getUrlRender
   return $ T.unlines
       [ "Sitemap: " `T.append` ur SitemapR
       , "User-agent: *"
       ]
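Assuming the hypothetical root http://www.example.com, the rendered robots.txt now reads:

```
Sitemap: http://www.example.com/sitemap.xml
User-agent: *
```

Because the URL is produced by the renderer from the type-safe SitemapR route, it stays correct even if we later move the sitemap to a different path in config/routes.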

This completes our sitemap.xml tutorial. I hope you were able to follow along and that it helps you implement your own sitemap.xml to improve the search engine visibility of your project. If you have any comments, thoughts or questions, feel free to use the section below.