Clustering documentation websites

Apr 5, 2018

This was one of my first Python data analysis projects. At the time I was working in the Information Experience (IX) team at the Australian software company Atlassian and had software documentation on my mind. (BTW, ‘IX’ is an approach to tech writing that incorporates design thinking.) I wanted to know what the industry standard was for documentation websites, so I set out to do a competitor analysis.

For the analysis, I scraped the technical documentation websites of 17 top enterprise software companies. The web-scraper collected a mountain of data, but I was initially interested in how article lengths differed across and within sites.
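As a rough illustration of how an article length could be measured from a scraped page, here is a minimal sketch using only the standard library. It is not the scraper from this project: the `TextExtractor` class, the `article_word_count` helper and the sample HTML are all hypothetical, and it assumes "length" means whitespace-separated words in the visible text.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collects visible text, skipping <script> and <style> contents."""

    SKIPPED = {"script", "style"}

    def __init__(self):
        super().__init__()
        self._skip_depth = 0
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIPPED and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Only keep text that is outside script/style blocks.
        if self._skip_depth == 0:
            self.chunks.append(data)


def article_word_count(html):
    """Return the number of whitespace-separated words in the page text."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return len(" ".join(parser.chunks).split())


# Made-up page used purely for illustration.
sample = (
    "<html><head><style>p {color: red}</style></head>"
    "<body><h1>Install guide</h1><p>Run the installer, then restart.</p></body></html>"
)
print(article_word_count(sample))  # 7
```

A real documentation page would also need its navigation, sidebars and footers stripped before counting, which is site-specific and left out here.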

There is an assumption in tech writing that programmer audiences prefer longer articles, and are happy to use ctrl-F to search through single long-form docs. On the other hand, documentation for more ‘consumer-facing’ products is assumed to be shorter and more succinct.

There’s also the issue of the size of the whole site: is it better to be as comprehensive as possible, at the risk of content management problems, or is it better to be sparse?

The first step was to plot the distribution of article sizes to have a look at these factors — and they do show a lot about the websites. I decided to go with a rainbow spectrum colour scheme which would come in handy later for graphing the clusters. I used the matplotlib and seaborn libraries for the plots.
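A plot of this kind could be sketched roughly as follows. This is an assumption-laden illustration rather than the original plotting code: it uses matplotlib only (the project also used seaborn), the `plot_article_lengths` helper is hypothetical, the word counts are randomly generated, and matplotlib's built-in `rainbow` colormap stands in for the colour scheme described above.

```python
import matplotlib

matplotlib.use("Agg")  # headless backend so the sketch runs without a display
import matplotlib.pyplot as plt
import numpy as np


def plot_article_lengths(word_counts_by_site, bins=30):
    """Draw one step histogram of article word counts per site."""
    sites = sorted(word_counts_by_site)
    # One colour per site, spread evenly across the rainbow colormap.
    colours = plt.cm.rainbow(np.linspace(0, 1, len(sites)))
    # Shared bin edges so the sites are directly comparable.
    longest = max(max(counts) for counts in word_counts_by_site.values())
    edges = np.linspace(0, longest, bins + 1)

    fig, ax = plt.subplots(figsize=(8, 4))
    for site, colour in zip(sites, colours):
        ax.hist(word_counts_by_site[site], bins=edges, histtype="step",
                color=tuple(colour), label=site)
    ax.set_xlabel("Words per article")
    ax.set_ylabel("Number of articles")
    ax.legend()
    return fig, ax


# Made-up word counts for three hypothetical sites.
rng = np.random.default_rng(0)
demo = {
    "site_a": rng.lognormal(5.0, 0.5, 500),
    "site_b": rng.lognormal(6.0, 0.7, 200),
    "site_c": rng.lognormal(5.5, 0.4, 50),
}
fig, ax = plot_article_lengths(demo)
```

Calling `fig.savefig("article_lengths.png")` afterwards would write the figure to disk. Because each site keeps the same colour, the same palette can be reused when the sites are later grouped into clusters.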

Article distributions for the documentation sites

I love distribution graphs! They are one of the first things I want to look at in data, and rightly so: they can tell you whether an average is a sensible summary and whether certain statistical tests are appropriate.

Here we can see both the overall size of each website and its articles, as well as how varied article size is within a site. Some of the documentation sites are massive, with thousands and thousands of pages. Some of them are so much smaller that I wondered what was going on. On inspection, it’s mostly due to login requirements (you need to be a customer to access some docs).
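To put numbers on "size" and "how varied", each site's word counts can be boiled down to a few summary figures. The sketch below is one possible choice, not necessarily the metrics used in this analysis: the `summarise_site` helper and the sample counts are hypothetical.

```python
from statistics import mean, median, quantiles


def summarise_site(word_counts):
    """Summarise one site's article lengths (needs at least two articles)."""
    q1, _, q3 = quantiles(word_counts, n=4)
    return {
        "pages": len(word_counts),      # overall size of the site
        "mean": mean(word_counts),      # sensitive to a few huge pages
        "median": median(word_counts),  # robust "typical" article length
        "iqr": q3 - q1,                 # spread of the middle half of articles
    }


# Made-up word counts: mostly short pages plus one very long one.
counts = [120, 150, 180, 200, 220, 250, 300, 4000]
summary = summarise_site(counts)
print(summary)
```

With these made-up numbers the mean (677.5) sits far above the median (210) because of the single 4000-word page, which is the kind of skew a distribution plot reveals before you decide whether an average is a fair summary.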

The next step was to see if there were any patterns in these distributions — and whether or not they were related to the nature of the software being documented. The groups formed with KMeans clustering on the distribution metrics are plotted below.
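The clustering step might look something like the sketch below, which uses scikit-learn's `KMeans`. The site names, the two metrics and their values are made up for illustration, and scaling the features first with `StandardScaler` is an assumption about preprocessing, not a description of what was done here.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

# Hypothetical per-site metrics: [median words, IQR of words]. Made-up numbers.
sites = ["site_a", "site_b", "site_c", "site_d", "site_e", "site_f"]
metrics = np.array([
    [150.0, 100.0],
    [170.0, 120.0],
    [160.0, 110.0],
    [900.0, 700.0],
    [950.0, 650.0],
    [880.0, 720.0],
])

# Put the metrics on a common scale so no single one dominates the distances.
scaled = StandardScaler().fit_transform(metrics)

# n_init restarts guard against a poor random initialisation;
# random_state makes the run repeatable.
model = KMeans(n_clusters=2, n_init=10, random_state=0)
labels = model.fit_predict(scaled)

for site, label in zip(sites, labels):
    print(site, label)
```

The cluster numbers themselves are arbitrary labels. What matters is which sites end up sharing one.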

Cluster Results

1. Oracle, Salesforce, Adobe, Autodesk, Slack
2. Intuit, Atlassian
3. Synopsys, Akamai
4. CA, Teradata, Trello, GitHub
5. Microsoft, Symantec, Nuance, Dropbox

The companies looked at here range from single to multi-product and vary a lot in other ways, like size and customer base, which is reflected in their documentation coverage.

The silhouette test gave 3 as the best number of clusters, with 5 coming in second. The 3-cluster result didn’t give much of interest to explore, so I ran with 5 clusters, with that disclaimer*. I was concerned about the small N (only 17 sites); however, the clustering looks to have done as good a job of grouping the distributions as I would have done by eye.
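For readers unfamiliar with the silhouette approach, the sketch below shows one common way to compare candidate cluster counts with scikit-learn's `silhouette_score`. It runs on randomly generated points rather than the real site metrics, and the `silhouette_by_k` helper is hypothetical.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score


def silhouette_by_k(features, k_values, seed=0):
    """Return {k: mean silhouette score} for each candidate cluster count."""
    scores = {}
    for k in k_values:
        labels = KMeans(n_clusters=k, n_init=10, random_state=seed).fit_predict(features)
        scores[k] = silhouette_score(features, labels)
    return scores


# Made-up data: three tight, well-separated groups of four points each.
rng = np.random.default_rng(0)
centres = np.array([[0.0, 0.0], [10.0, 0.0], [0.0, 10.0]])
features = np.vstack([c + rng.normal(0, 0.3, size=(4, 2)) for c in centres])

scores = silhouette_by_k(features, k_values=range(2, 6))
best_k = max(scores, key=scores.get)  # highest mean silhouette wins
print(scores, best_k)
```

The score ranges from -1 to 1, with higher values meaning points sit closer to their own cluster than to the nearest other one. It is only defined for between 2 and N - 1 clusters, so a small N limits how many candidates can be compared.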

Do high-tech readers prefer longer docs?

Getting back to the question of whether more ‘technical’ documentation means longer articles: the result here points to yes. Intuit, Atlassian, Synopsys and Akamai (Clusters 2 and 3) are companies that have a high number of technical customers.

Cluster 1, containing Autodesk, Salesforce, Adobe etc., has comparatively tiny articles and appears to be formed of companies that serve professional, but not exclusively technical, customers in fields like architecture, media, publishing and sales. Cluster 4 contains the other “short-article” companies, with GitHub and Teradata appearing to be outliers in being more tech-focused than the others, but perhaps with more diverse audiences.

With the exception of Cluster 3, which contains the two smallest sites (Akamai and Synopsys), the overall size of the sites doesn’t seem to impact the clustering, even when run on un-normalised data.

The Cluster 5 distributions sit somewhere in the middle, with the widest range of short and long articles. These differences may have to do with the customer types the docs are aimed at. For example, Dropbox documentation may be directed more at system admins than at the regular day-to-day user. Microsoft likewise caters to different types of users and has a broad distribution.

Overall, most of the companies included didn’t have a large number of 1000+ word pages in their technical documentation, indicating a trend toward condensed content rather than ‘mega-pages’ with loads of links and text. The results do seem to support the assumption that more complex, technical products with technically specialised audiences are being given longer docs. Whether those audiences find them the best way to absorb information remains open, however.

Github repo can be found here

Written by A.L. Parker