Comments:"Tech-Foo: Visualising the Ubuntu Package Repository"
URL:http://tech-foo.blogspot.se/2013/01/visualising-ubuntu-package-repository.html
Like most geeks, I like data. I also like making pretty pictures. This weekend, I found a way to make pretty pictures from some data I had lying around.
The Data: Ubuntu Packages
Ubuntu is made up of thousands of packages. Each package contains a control file that provides some meta-data about the package. For example, here is the control data for the autopilot tool:Package: python-autopilot Priority: optional Section: universe/python Installed-Size: 1827 Maintainer: Ubuntu Developers Original-Maintainer: Thomi Richards Architecture: all Source: autopilot Version: 1.2daily12.12.10-0ubuntu1 Depends: python (>= 2.7.1-0ubuntu2), python (<< 2.8), gir1.2-gconf-2.0, gir1.2-glib-2.0, gir1.2-gtk-2.0, gir1.2-ibus-1.0, python-compizconfig, python-dbus, python-junitxml, python-qt4, python-qt4-dbus, python-testscenarios, python-testtools, python-xdg, python-xlib, python-zeitgeist Filename: pool/universe/a/autopilot/python-autopilot_1.2daily12.12.10-0ubuntu1_all.deb Size: 578972 MD5sum: c36f6bbab8b5ee10053b63b41ad7189a SHA1: 749cb0df1c94630f2b3f7a4a1cd50357e0bf0e4d SHA256: 948eeee40ad025bfb84645f68012e6677bc4447784e4214a5512786aa023467c Description-en: Utility to write and run integration tests easily The autopilot engine enables to ease the writing of python tests for your application manipulating your inputs like the mouse and keyboard. It also provides a lot of utilities linked to the X server and detecting applications. Homepage: https://launchpad.net/autopilot Description-md5: 1cea8e2d895c31846b8d3482f96a24d4 Bugs: https://bugs.launchpad.net/ubuntu/+filebug Origin: Ubuntu
As you can see, there's a lot of information here. The bit I'm interested in is the 'Depends' line. This lists all the packages that are required in order for this package to work correctly. When you install autopilot, your package manager will install all it's dependencies, and the dependencies of all those packages etc. This is (in my opinion), the best feature of a modern Linux distribution, compared with Windows.
Packages and their dependant packages form a directed graph. My goal is to make pretty pictures of this graph to see if I can learn anything useful.
Process: The Python
First, I wanted to extract the data from the apt package manager and create a graph data structure I could fiddle with. Using the excellent graph-tool library, I came up with this horrible horrible piece of python code:#!/usr/bin/env python from re import match, split from subprocess import check_output from debian.deb822 import Deb822 from graph_tool.all import * graph = Graph() package_nodes = {} package_info = {} def get_package_list(): """Return a list of packages, both installed and uninstalled.""" output = check_output(['dpkg','-l','*']) packages = [] for line in output.split('\n'): parts = line.split() if not parts or not match('[uirph][nicurhWt]', parts[0]): continue packages.append(parts[1]) return packages def get_package_info(pkg_name): """Get a dict-like object containing information for a specified package.""" global package_info if pkg_name in package_info: return package_info.get(pkg_name) else: try: yaml_stream = check_output(['apt-cache','show',pkg_name]) except: print "Unable to find info for package: '%s'" % pkg_name package_info[pkg_name] = {} return {} d = Deb822(yaml_stream) package_info[pkg_name] = d return d def get_graph_node_for_package(pkg_name): """Given a package name, return the graph node for that package. If the graph node does not exist, it is created, and it's meta-data filled. """ global graph global package_nodes if pkg_name not in package_nodes: n = graph.add_vertex() package_nodes[pkg_name] = n # add node properties: pkg_info = get_package_info(pkg_name) graph.vertex_properties["package-name"][n] = pkg_name graph.vertex_properties["installed-size"][n] = int(pkg_info.get('Installed-Size', 0)) return n else: return package_nodes.get(pkg_name) def get_sanitised_depends_list(depends_string): """Given a Depends string, return a list of package names. Versions are stripped, and alternatives are all shown. """ if depends_string == '': return [] parts = split('[,\|]', depends_string) return [ match('(\S*)', p.strip()).groups()[0] for p in parts] if __name__ == '__main__': # Create property lists we need: graph.vertex_properties["package-name"] = graph.new_vertex_property("string") graph.vertex_properties["installed-size"] = graph.new_vertex_property("int") # get list of packages: packages = get_package_list() # graph_nodes = graph.add_vertex(n=len(packages)) n = 0 for pkg_name in packages: node = get_graph_node_for_package(pkg_name) pkg_info = get_package_info(pkg_name) # purely virtual packages won't have a package info object: if pkg_info: depends = pkg_info.get('Depends', '') depends = get_sanitised_depends_list(depends) for dependancy in depends: graph.add_edge(node, get_graph_node_for_package(dependancy)) n += 1 if n % 10 == 0: print "%d / /%d" % (n, len(packages)) graph.save('graph.gml')
Yes, I realise this is terrible code. However, I also wrote it in 10 minutes time, and I'm not planning on using it for anything serious - this is an experiment!
Running this script gives me a 2.6MB .gml file (it also takes about half an hour - did I mention that the code is terrible?). I can then import this file into gephi, run a layout algorithm over it for the best part of an hour (during which time my laptop starts sounding a lot like a vacuum cleaner), and start making pretty pictures!
The Pretties:
Without further ado - here's the first rendering. This is the entire graph. The node colouring indicates the node degree (the number of edges connected to the node) - blue is low, red is high. Edges are coloured according to their target node.These images are all rendered small enough to fit on the web page. Click on them to get the full image.
The graph has several other distinct communities as well. I've produced a number of images below that show the various communities, along with a short comment.
C++
These two large nodes are libgcc1 (top), and libstdc++ (bottom). As we'll see soon, the bottom-right corder of the graph is dominated by C++ projects.Qt and KDE
This entire island of nodes is made of up the Qt and KDE libraries. The higher nodes are the Qt libraries (QtCore, QtGui, QtXml etc), and the nodes lower down are KDE libraries (kcalcore4, akonadi, kmime4 etc).Python
The two large nodes here are 'python' and 'python2.7'. Interestingly, 'python3' is a much smaller community, just above the main python group.System
Just below the python community there's a large, loosely-connected network of system tools. Notable members of this community include the Linux kernel packages, upstart, netbase, adduser, and many others.Gnome
This is GNOME. At it's core is 'libglib', and it expands out to libgtk, libgdk-pixbuf (along with many other libraries), and from there to various applications that use these libraries (gnome-settings-daemon for example).
Mono
At the very top of the graph, off on an island by themselves are the mono packages.Others
The wonderful thing about this graph is that the neighbourhoods are fractal. I've outlined several of the large ones, but looking closer reveals small clusters of related packages. For example: multimedia packages:This is Just the Beginning...
This is an amazing dataset, and really this is just the beginning. There's a number of things I want to look into, including:- Adding 'Recommends' and 'Suggests' links between packages - with a lower edge weight than Depends.
- Colour coding nodes according to which repository section the package can be found in.
- Try to categorise libraries vs applications - do applications end up clustered like libraries do?