“Websites as Graphs” is a cool idea for depicting the structure of a webpage graphically. Each node in the graph represents a tag in the page, and the graph is a tree with the page’s html tag as the root. The webpage parser can take any URL and make a graph from it. The finished graph is informative, but so is the process of making it, as the structure unfolds before you, tag by tag.
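To make the "each tag is a node, the html tag is the root" idea concrete, here is a minimal sketch in Python. It is not the code behind “Websites as Graphs”; it uses the standard library's html.parser, and the names TagTreeBuilder and build_tag_tree are made up for illustration. It only records parent-to-child edges between tags and does not draw anything.

```python
from html.parser import HTMLParser

# Tags that never have a closing tag, so they cannot have children.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img", "input",
             "link", "meta", "source", "track", "wbr"}


class TagTreeBuilder(HTMLParser):
    """Hypothetical helper: records one node per tag and one edge per nesting."""

    def __init__(self):
        super().__init__()
        self.tags = []    # tag name for each node, indexed by node number
        self.edges = []   # (parent node number, child node number)
        self._stack = []  # node numbers of the tags that are currently open

    def handle_starttag(self, tag, attrs):
        index = len(self.tags)
        self.tags.append(tag)
        if self._stack:
            # The innermost open tag is the parent of this new node.
            self.edges.append((self._stack[-1], index))
        if tag not in VOID_TAGS:
            self._stack.append(index)

    def handle_endtag(self, tag):
        # Close the most recent matching open tag, tolerating sloppy markup.
        for depth in range(len(self._stack) - 1, -1, -1):
            if self.tags[self._stack[depth]] == tag:
                del self._stack[depth:]
                break


def build_tag_tree(html_text):
    """Return (tags, edges) for the tags found in html_text."""
    builder = TagTreeBuilder()
    builder.feed(html_text)
    builder.close()
    return builder.tags, builder.edges


sample = ("<html><head><title>t</title></head>"
          "<body><p>a<br>b</p><a href='#'>x</a></body></html>")
tags, edges = build_tag_tree(sample)
print(tags)
print(edges)
```

Feeding the edge list to a graph layout tool would be the next step; the text of a real page could be fetched first with urllib.request.urlopen.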

The weakest part of this visualization is that you need to know the key to the color-coded nodes of the graph. I would suggest keeping the following legend visible while you examine the graphs:

blue: Links are blue.
red: Tables and table elements (table, tr, and td tags) are red.
green: The div tag is green. The div tag is used to control layout and organize the page in modern CSS-based designs.
violet: Images are violet.
yellow: Forms (form, input, textarea, select, and option tags) are yellow.
orange: Paragraph-type content (p, br, and blockquote tags) is orange.
black: The html tag, the root tag of the page, is black. It is often hard to find; look for a small, all-gray flower next to it.
gray: All other tags are gray.
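The legend amounts to a lookup from tag name to color, which can be sketched in a few lines of Python. The table and function names here are made up, and two entries are my assumptions: that "links" means the a tag and that "images" means the img tag. The other tag names come straight from the legend.

```python
# Color legend as a lookup table. "a" for links and "img" for images
# are assumptions; the remaining tags are named in the legend itself.
LEGEND = {
    "blue": {"a"},
    "red": {"table", "tr", "td"},
    "green": {"div"},
    "violet": {"img"},
    "yellow": {"form", "input", "textarea", "select", "option"},
    "orange": {"p", "br", "blockquote"},
    "black": {"html"},
}


def node_color(tag):
    """Return the legend color for a tag name; unlisted tags are gray."""
    tag = tag.lower()
    for color, tags in LEGEND.items():
        if tag in tags:
            return color
    return "gray"


for name in ["html", "div", "TD", "span"]:
    print(name, node_color(name))
```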

Above is the graph of michaelgalloy.com. It is a WordPress layout, a typical CSS layout with a lot of div tags. Several of the features of the page are easily identified. The sidebar is on the left, the header is in the middle sticking upwards, and the main content is the entire right side of the graph.

The head tag, pictured on the left, contains basic metadata about the website and links to other resources needed by the page, such as style sheets and scripts. The black dot is the html tag, the root of the entire page. The navigation links at the top of the page, shown on the right, are easy to spot: blue links inside gray list elements. The calendar, shown on the left, is the only table in the page because WordPress (and the theme I use) follows a modern CSS-based design that uses div tags and CSS to do the layout. Only truly tabular content is displayed in tables. This is one of the easiest differences to spot using these graphs. The plugin I use for including sections of code in an article makes a span (gray) and br (orange) tag for each line of code. The long code example, MG_N_SMALLEST, from the IDLdoc tags article is pictured on the right.

Above is the graph of a typical IDLdoc-generated page. There is a fair amount of red for tables, but it is content that is actually tabular: tables of routines, of parameters and keywords, and of the complexity statistics. There are a fair number of div tags (green) to mark the various sections, but not a lot of paragraphs (orange) because of the nature of the content. You can see the thirteen routines in this file in several places. In a distant way, the graph represents the IDL code (by the transitive property: the graph represents the page, which represents the code). Each routine is a spike from the big flower at the top of the graph. Each spike has one orange node (this is the paragraph containing the syntax of the routine). The number of blue nodes (links) from that orange node is the number of parameters of the routine.

Hello 1999! Above is the graph of rsinc.com. All that red indicates a table-based design. One more reason not to like the website of RSI (now ITT Visual Information Solutions).

It would be nice for the parser to give a total of the number of each type of node. The total number of tags is a measure of the complexity of the page, but I didn’t bother to count them all up (although the author does in his examples). The other feature that would be nice is to label those tags that have an id attribute.
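Both wishes are easy to sketch outside the visualization. The following Python is my own rough take, not a feature of the parser: it uses the standard library's html.parser and collections.Counter, and TagStats and summarize are made-up names. It counts how many times each tag appears and collects the tags that carry an id attribute.

```python
from collections import Counter
from html.parser import HTMLParser


class TagStats(HTMLParser):
    """Hypothetical helper: tallies tags and remembers those with an id."""

    def __init__(self):
        super().__init__()
        self.counts = Counter()  # tag name -> number of occurrences
        self.ids = []            # (tag name, id value) in document order

    def handle_starttag(self, tag, attrs):
        self.counts[tag] += 1
        id_value = dict(attrs).get("id")
        if id_value:
            self.ids.append((tag, id_value))


def summarize(html_text):
    """Return (per-tag counts, list of labeled tags) for html_text."""
    stats = TagStats()
    stats.feed(html_text)
    stats.close()
    return stats.counts, stats.ids


sample = ('<html><body><div id="sidebar"><p>a</p><p>b</p></div>'
          '<div id="content"></div></body></html>')
counts, ids = summarize(sample)
print(sum(counts.values()), "tags in total")
print(counts.most_common())
print(ids)
```

The sum of the counts is the total number of tags, the rough complexity measure mentioned above.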

Link via a discussion on Tufte’s website.