Reverse-Engineering the Internet
I recently read an article on www.physicsweb.org by Albert-Laszlo Barabasi titled “The Physics of the Web.” The full text of the article can be found here [http://physicsweb.org/articles/world/14/7/9]. The article was a feature in Physics World in July 2001, which makes it quite old given the constantly evolving nature of the internet, but I found it immensely interesting. Mr. Barabasi attempts to make sense of the physical and virtual web by analyzing each network with graph theory, a branch of mathematics that examines the nodes and links that make up a network.
First, he distinguishes between the terms “internet” and “world wide web,” or simply “web.” The internet is made up of physical workstations and connections that carry information. The internet that we know is a system of computers, routers, switches, ethernet cables, phone lines, and fiber optic cable. Each node of the internet (a computer, router, etc.) is connected to another node via a hard-wired or wireless data link. The web, however, is made up of documents that are stored, transferred, and viewed on the internet. The web is the endless number of HTML, PDF, and other files that you view on your computer, often in a web browser. Nodes on the web (documents) are connected to other nodes by URL links. Unlike most networks, however, on the web a connection from one node to the other (a link from one site to another) does not imply a reverse link. For example, I have a link on this page to the University of Alabama, Huntsville, but UAH has no link to my page on their site. In contrast, all links on the internet are two-way links.
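To make the one-way versus two-way distinction concrete, here is a minimal sketch in Python. The node names (`my_blog`, `uah.edu`, `home_pc`, `isp_router`) and the helper functions are made up for illustration; nothing here comes from the article itself.

```python
# The web as a directed graph: each page maps to the set of pages it links to.
# All names below are hypothetical examples.
web_links = {
    "my_blog": {"uah.edu"},  # this page links to UAH
    "uah.edu": set(),        # UAH does not link back
}

def links_to(graph, source, target):
    """Return True if `source` holds a connection to `target`."""
    return target in graph.get(source, set())

def add_cable(graph, a, b):
    """A physical connection carries traffic both ways, so record both directions."""
    graph.setdefault(a, set()).add(b)
    graph.setdefault(b, set()).add(a)

# The internet as an undirected graph: every cable is stored in both directions.
internet = {}
add_cable(internet, "home_pc", "isp_router")

print(links_to(web_links, "my_blog", "uah.edu"))     # a link exists this way
print(links_to(web_links, "uah.edu", "my_blog"))     # but not the other way
print(links_to(internet, "isp_router", "home_pc"))   # cables work in both directions
```

The only difference between the two structures is whether adding a connection also records its reverse.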
The study set out to determine whether the seemingly random construction of the internet and web was actually random at all. New nodes are added to each network every second of every day, with no one directing the growth or, for that matter, able to keep track of it all. The study determined that both networks are what the author terms “scale-free,” meaning that the distribution of links among web pages and connections among internet nodes follows a power-law distribution rather than a binomial distribution. I understand this to mean the following: the author expected to find that most nodes on the network would have close to an average number of links to other nodes, let’s arbitrarily say five, and that the number of nodes with significantly more or fewer than five links would be small, and would get smaller the further away from five links you get, similar to a bell curve. Instead, the study found that most nodes have very few links, and that nodes with a great number of links are rare but far more common than a bell curve would predict.
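The difference between the two distributions is easier to see with numbers. The sketch below compares a binomial distribution with an average of five links against a power law. The exponent `GAMMA = 2.1`, the pool of 1000 possible neighbours, and the cutoff `K_MAX` are my own assumptions chosen for illustration, not figures I am quoting from the article.

```python
from math import comb

def binomial_pmf(k, n, p):
    """Probability of exactly k links when each of n possible links exists with chance p."""
    return comb(n, k) * p**k * (1 - p) ** (n - k)

def power_law_pmf(k, gamma, k_max):
    """Probability of exactly k links under P(k) proportional to k**-gamma, for 1 <= k <= k_max."""
    norm = sum(j ** -gamma for j in range(1, k_max + 1))
    return k ** -gamma / norm

N_OTHERS = 1000   # assumed: each node could link to 1000 others
MEAN_LINKS = 5    # the arbitrary average used in the text above
GAMMA = 2.1       # assumed power-law exponent, for illustration only
K_MAX = 1000      # assumed cutoff on the number of links

for k in (1, 5, 20, 100):
    bell = binomial_pmf(k, N_OTHERS, MEAN_LINKS / N_OTHERS)
    scale_free = power_law_pmf(k, GAMMA, K_MAX)
    print(f"{k:>3} links: binomial {bell:.2e}, power law {scale_free:.2e}")
```

Near the average of five links the binomial is the larger of the two, but at 100 links it has collapsed to practically zero while the power law still leaves a small, real chance of finding such a node.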
This has a few implications. First, both the internet and the web were found to follow approximately the same pattern, that of scale-free networks. Even though they are intimately connected in our minds, there is no real reason I can think of that this should have happened. Second, this implies both networks are “clustered,” meaning that just a few nodes are responsible for a great many links. This makes sense in the light of search engines, which, according to the article, map a maximum of only 16% of the web as of 2001. I also believe this may be due to groups that share similar interests, such as a corporation or a gaming forum. These kinds of groups have a vested interest in being connected, the former physically and the latter virtually, to others that share their interests. For example, I would venture a guess that ISPs provide internet service to a vast multitude of home users that make up the “very few links” category. In contrast, they probably also provide service to just a handful of large corporations that make up the “very many links” category. On the web you can see a similar pattern. Sites like Google, Yahoo, ESPN, Digg, and various other news blogs provide a huge number of links to documents all over the web. In contrast, the sites they link to are often smaller sites (this one, perhaps) with a comparatively smaller number of links. In both networks the vast majority of nodes have just a few links, and it is the rare node that has an abundance of links.
Another interesting point is the principle of “19 clicks of separation.” Much like the six degrees of separation between all of us and Kevin Bacon (or so I’m told), on the world wide web an average of 19 links (or clicks) separate any two randomly selected web pages, assuming a path exists between the two (yes, a big assumption). The study took the number of web pages thought to exist in 1999, 800 million, and calculated that an average of 19 links separate any two. This number grows with the number of nodes on the network, but only slowly (logarithmically, as I understand it), so it is probably only a little larger today. The path assumption is a pretty large one, but it’s still neat to think that you might be able to start at the home page of the British Embassy in South Africa and 20-21 clicks later (assuming the growth of the web since 1999) end up at Midnight in Iraq.
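For the curious, here is a rough sketch of the arithmetic. As best I recall, the 1999 study behind the article fitted the average separation to a logarithmic formula of the form d = 0.35 + 2.06 log10(N), where N is the number of pages. Treat both coefficients as my assumption rather than a quotation, and the function name `average_clicks` is my own.

```python
from math import log10

def average_clicks(n_pages, offset=0.35, slope=2.06):
    """Estimated average number of clicks between two pages.

    The default offset and slope are assumed values recalled from the
    1999 study, not figures taken from the article text.
    """
    return offset + slope * log10(n_pages)

# 800 million pages (the 1999 estimate), then ten and a hundred times as many.
for n_pages in (8e8, 8e9, 8e10):
    print(f"{n_pages:.0e} pages: about {average_clicks(n_pages):.1f} clicks")
```

Because the formula depends on the logarithm of N, every tenfold increase in the size of the web adds only about two clicks, which is where a guess of 20-21 clicks for a much larger web comes from.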
Also present in the web is a principle of “the rich get richer,” with respect to links and nodes, of course. The author found that some web pages were inherently more attractive to links than others because of their content. A well-designed page with regularly updated content will be linked to more often than a poorer page. These web sites often continue to grow by leaps and bounds at a pace that outstrips that of slightly lesser sites. If you have a good web site, it will continue to grow and receive links, and this will only compound over time as long as it retains its attractiveness. Poor web sites, however, are locked into a vicious cycle of never having many links.
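One common way to model “the rich get richer” is preferential attachment: each new node links to existing nodes with a probability proportional to how many links they already have. I am assuming that this is the kind of mechanism the article has in mind. The function name, the network size, and the two links per new node below are arbitrary choices for illustration.

```python
import random
from statistics import median

def grow_network(n_nodes, links_per_node=2, seed=0):
    """Grow a network by preferential attachment and return each node's link count."""
    rng = random.Random(seed)
    degree = {}
    endpoints = []  # every node appears here once per link it holds

    def connect(a, b):
        for node in (a, b):
            degree[node] = degree.get(node, 0) + 1
            endpoints.append(node)

    # Start from a small core in which every node is linked to every other.
    core = links_per_node + 1
    for a in range(core):
        for b in range(a + 1, core):
            connect(a, b)

    # Each newcomer picks distinct targets; drawing from `endpoints` makes
    # well-linked nodes proportionally more likely to be chosen.
    for new in range(core, n_nodes):
        targets = set()
        while len(targets) < links_per_node:
            targets.add(rng.choice(endpoints))
        for target in targets:
            connect(new, target)
    return degree

degrees = grow_network(2000)
print("typical (median) links:", median(degrees.values()))
print("best-linked node:      ", max(degrees.values()))
```

Even though every newcomer follows the same simple rule, the earliest and luckiest nodes end up with many times more links than the typical node.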
The study goes on to investigate the resilience of scale-free networks to failures and to the efforts of hackers. It found the internet surprisingly resistant to random failures, noting that a large number of random nodes have to be removed before the network will fragment into small, unconnected parts. It quotes the statistic that 3% of the internet’s routers are down at any given time, and this doesn’t even come close to fragmenting the network. Unfortunately, a deliberate attack is a different story, as the removal of a few of the most-connected nodes can cripple the network. As we know, attacks from hackers are never random, and are usually directed at the targets of greatest value, i.e., those with the largest numbers of nodes connected to them.
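The contrast between random failures and targeted attacks can be tried out on a toy network. This is only a sketch under my own assumptions: the network is grown by preferential attachment (a stand-in for a scale-free network, not real internet data), and the size of 2000 nodes and the 5% removal rate are arbitrary. All function names are my own.

```python
import random

def build_scale_free(n_nodes, links_per_node=2, seed=0):
    """Grow an undirected network by preferential attachment (assumed model)."""
    rng = random.Random(seed)
    neighbours = {node: set() for node in range(n_nodes)}
    endpoints = []  # every node appears here once per link it holds

    def connect(a, b):
        neighbours[a].add(b)
        neighbours[b].add(a)
        endpoints.extend((a, b))

    core = links_per_node + 1
    for a in range(core):
        for b in range(a + 1, core):
            connect(a, b)
    for new in range(core, n_nodes):
        targets = set()
        while len(targets) < links_per_node:
            targets.add(rng.choice(endpoints))
        for target in targets:
            connect(new, target)
    return neighbours

def largest_component(neighbours, removed):
    """Size of the biggest connected piece once the `removed` nodes are gone."""
    seen = set(removed)  # removed nodes are treated as already visited
    best = 0
    for start in neighbours:
        if start in seen:
            continue
        seen.add(start)
        stack, size = [start], 0
        while stack:
            node = stack.pop()
            size += 1
            for nxt in neighbours[node]:
                if nxt not in seen:
                    seen.add(nxt)
                    stack.append(nxt)
        best = max(best, size)
    return best

network = build_scale_free(2000)
n_removed = len(network) // 20  # knock out 5% of the nodes

# Random failure: nodes picked with no regard to how connected they are.
random_failures = random.Random(1).sample(sorted(network), n_removed)
# Targeted attack: the most-connected nodes go first.
hubs_first = sorted(network, key=lambda node: len(network[node]), reverse=True)
targeted_attack = hubs_first[:n_removed]

random_fraction = largest_component(network, random_failures) / len(network)
attack_fraction = largest_component(network, targeted_attack) / len(network)
print(f"still connected after random failures: {random_fraction:.0%}")
print(f"still connected after targeted attack: {attack_fraction:.0%}")
```

Removing the same number of nodes does far more damage when the hubs are chosen on purpose, because so many small nodes depend on them for their only connections.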
So what does this all mean besides a mental workout? Well, I think there are some lessons to be learned here for publicizing web pages: the effort to become that rare node with a multitude of links. First, have a website with “attractive” content. Easier said than done, huh? If your site lacks good content it appears you will never get off the ground. However, the presence of good content will cause you to grow exponentially. Attractive content seems to be directly linked to new or regularly updated web pages. Second, try to be a site with lots of incoming links. I believe you achieve this by connecting different communities. When I have unrestricted high-bandwidth access to the internet my surfing generally spans a few different genres: shooting and firearms, video gaming, electronics and technology, and news. Each of these genres has popular sites with many similarly themed sites linked to them. If you can get sites from multiple genres to link to your site, then you have expanded your audience to the many web surfers of each of those genres. Third, get over the hump. The study quoted predicts that above a certain level of popularity (linked-ness) a website will flourish and grow exponentially, and below that threshold it will stagnate. I can’t tell you where the “hump” is, but apparently it’s an important obstacle to overcome.
The last point I want to address was not really addressed in the original article at all, but is a product of my own reasoning. One quote from the article reads “we all feel that behind every complex system there is an underlying network with non-random topology.” This brought to mind the hotly debated topic of evolution versus intelligent design and what is allowed to be taught in public schools. With all the negative press attributed to intelligent design of late, I found this statement intriguing. I believe most critically thinking individuals look at a system as complex as our world and know that “there is an underlying network with non-random topology.” Just like the internet, somebody made it. The second law of thermodynamics states (in a nutshell) that all natural systems will tend towards disorder over time. This raises the question, “If the earth is a natural system, why has it tended towards order in the form of the creation of life?” I’m not here to preach, but I couldn’t help but notice the correlation between finding a structure at the heart of the internet and world wide web, and the idea that there was and is structure present in this place we call earth.
Wow, interesting. Kinda hard to understand all that Internet vs. Web stuff though. :) I’ve always been under the impression that they were one and the same. Just goes to show how much I know.