Generating sitemap diagrams for massive websites
Software designers generally work on two types of projects: redesigning existing products or creating new ones from scratch. For an existing platform, understanding the current structure is one of the first steps a User Experience Designer should take: you need to understand the content or functionality you'll be working with, as well as who the users are.
A good way of doing this is by creating sitemaps or flow charts. Sitemaps also contribute to UX design processes beyond initial research: they can be used to communicate a site's content and structure to others, and to inform and act as a reference for changes to the information architecture, navigation, site structure or UI.
Great — but what happens when the site is too large to do this manually?
Spending a few hours creating these diagrams manually in a program like Omnigraffle is fine for a smaller site, and going through a site manually is useful for familiarising yourself with it deeply. However, once a platform reaches around 100 screens, manual sitemap creation becomes untenable and a programmatic solution is needed.
The size of the websites and software I've worked on has grown exponentially over the years. This is mostly due to company size, but might also reflect progress in data transfer and storage. In 2014 I was working on a web app with screen states numbering in the hundreds, which was starting to hit the limit of manual sitemap creation. Between 2015 and 2016 I worked on a site with pages in the tens of thousands, and next I'll be working for an institution whose website has pages numbering in the hundreds of thousands.
For the ‘tens of thousands’ site, I used Omnigraffle and an AppleScript (available here) to generate sitemaps. Although the script took a couple of minutes to run after set-up, that method worked okay. The benefit was the ability to customise the sitemap layout inside Omnigraffle, allowing the creation of nicer-looking diagrams. The downside is that it's an old script that hasn't been updated for a while, and it can't handle a website with pages in the hundreds of thousands. Fortunately, websites this size can be mapped programmatically following this tutorial on generating sitemap diagrams with Python and graphviz.
The first step is crawling the website. For a really large site this is going to take a while. You can write a Python script to do the crawling using a library such as Beautiful Soup (as the tutorial above does), but I've found that on Australian ‘broadband’, with a mediocre computer and a website of more than 100k pages, that can take at least 12 hours, usually more. For such a big crawl I prefer to use a software crawler that can manage any unforeseen stops or starts and make intermittent saves. There are a few site mappers on the market, e.g. Dynomapper and PowerMapper. I use Screaming Frog SEO Spider; despite the weird name (great SEO!), I find it has the best functionality and price for what it does, including generating useful reports about site depth and link structure.
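For reference, here's roughly what the script-based approach looks like: a minimal breadth-first crawler sketch using requests and Beautiful Soup, with a hypothetical start URL. A real run on a 100k-page site would also need rate limiting, robots.txt handling and intermittent saves, which is exactly why a dedicated crawler earns its keep at this scale.

```python
# Minimal breadth-first crawler sketch (requests + Beautiful Soup).
# START_URL is a hypothetical placeholder; swap in the site being mapped.
from collections import deque
from urllib.parse import urljoin, urlparse

import requests
from bs4 import BeautifulSoup

START_URL = "https://example.com/"
DOMAIN = urlparse(START_URL).netloc

seen = {START_URL}
queue = deque([START_URL])

while queue:
    url = queue.popleft()
    try:
        resp = requests.get(url, timeout=10)
    except requests.RequestException:
        continue  # skip pages that error out rather than halting the crawl
    soup = BeautifulSoup(resp.text, "html.parser")
    for a in soup.find_all("a", href=True):
        link = urljoin(url, a["href"]).split("#")[0]
        # Only queue internal links so the crawl stays on one domain
        if urlparse(link).netloc == DOMAIN and link not in seen:
            seen.add(link)
            queue.append(link)

# Dump the collected URLs for the diagram-generation step
with open("urls.txt", "w") as f:
    f.write("\n".join(sorted(seen)))
```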
After collecting and loading the URLs, the script splits the folders and tokenizes them to generate levels in the graphviz diagrams. The outputs are fairly good by graphviz standards; it's known for its pared-down diagrams. For this particular website the pages surpassed 300k and there was an enormous number of folders and subdirectories. The diagrams can get dense and hard to read beyond 3 levels, which can be overcome by running each folder and subdirectory separately, generating a series of diagrams to represent the full site. Often multiple diagrams are needed to organise the site information in a digestible way.
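As a sketch of what that folder-splitting step can look like (assuming the urls.txt dump from the crawl above, and using the Python graphviz bindings; this is illustrative, not the tutorial's exact code):

```python
# Tokenize each URL path into folder segments and build a two-level diagram.
# Assumes urls.txt from the crawl sketch above; styling is deliberately minimal.
from collections import Counter
from urllib.parse import urlparse

from graphviz import Digraph

with open("urls.txt") as f:
    urls = [line.strip() for line in f if line.strip()]

# Count pages under each folder, e.g. /news/2016/post contributes to
# ("news",) and ("news", "2016")
counts = Counter()
for url in urls:
    parts = [p for p in urlparse(url).path.split("/") if p]
    for depth in range(min(len(parts), 2)):  # cap at 2 levels for readability
        counts[tuple(parts[: depth + 1])] += 1

dot = Digraph("sitemap", graph_attr={"rankdir": "LR"})
dot.node("root", "home")
for path, n in counts.items():
    node_id = "/".join(path)
    parent = "/".join(path[:-1]) or "root"
    dot.node(node_id, path[-1])
    # Label each connecting line with the number of pages nested below it
    dot.edge(parent, node_id, label=str(n))

dot.render("sitemap", format="pdf", cleanup=True)
```

Running the same loop with a folder-prefix filter over the URL list is all it takes to generate the per-directory diagrams just mentioned.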
One of these diagrams is shown here as an example of what the output looks like for 2 levels. There is an option for either light or dark styles, and the number of nested pages is shown on the connecting lines.
Some more examples from this particular website can be found in the repo I reproduced for this project here.
Customising layouts
There is a bit of an art to organising sitemap information like this. Usually, sitemap diagrams of websites end up very broad and shallow, since most of the IA is achieved through navigation and other UI links. Most of the 3-level diagrams included in the examples above demonstrate this, and are too long to include in one piece here. Usually some form of post-editing is required to arrange the information so it is easier to take in.
With the Omnigraffle method this was as simple as editing in the program. For the Python method, the diagrams can be saved as SVG files for further editing in Illustrator or a similar program. This allows the levels to be laid out with content and user groupings in mind, rather than just web-directory depth. Obviously, this is fiddlier than doing it in a purpose-built program like Omnigraffle, but it is an option.
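If the diagrams are built with the Python graphviz bindings as sketched above, switching to an editable output is a one-line change (the graph here is a trivial stand-in):

```python
from graphviz import Digraph

# Trivial stand-in graph; in practice, reuse the Digraph built from the
# crawled URLs. The edge label/count is hypothetical.
dot = Digraph("sitemap")
dot.edge("home", "news", label="1240")

# format="svg" keeps nodes and labels as separate editable vector objects,
# ready for rearranging in Illustrator or Inkscape.
dot.render("sitemap", format="svg", cleanup=True)
```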
Closing thoughts on sitemaps
Sitemap diagrams are a key part of the UX, UI design, content management and information architecture of a website. They are an incredibly useful tool for quickly learning about and communicating a website’s structure and content.
Obviously, web-based sitemaps don't convey the entirety of the information architecture (how navigation is achieved in the UI), but they do give a good overview of the type, bulk and nature of content and features, and can act as a great template for thinking about and re-organising content, both in the UI and in the website itself if audits and cleanups are necessary. They can also be modified in various ways to represent how IA is achieved in the UI, or used to create flow charts.
When dealing with very large websites, I find them even handier. Getting a good overall grasp of a very large website's contents is almost impossible by simply browsing it; it's far easier when the site is condensed visually, which is exactly what sitemaps, especially programmatically generated ones, are perfect for.