
Linux.com

Feature: PHP

Filesystem data visualization using JPGraph

By Glenn Mullikin on March 10, 2005 (8:00:00 AM)


JPGraph is a set of programs written in PHP that plots data into a wide range of graphs and formats the results. Licensed under the Trolltech QPL, JPGraph is now at Version 1.17. Whatever your data, JPGraph can help you view it graphically, letting you see relationships more clearly. Such data visualization may not be important to a computer, but to a person it can make a lot of difference to analysis.

To see what JPGraph can do, let's look at the executable binary files in the /usr/bin directory. I'll exclude the symbolic links, and I'll also omit the more than 130 files in the /usr/bin/X11 subdirectory. My purpose isn't to be comprehensive, just to show what JPGraph can do. Specifically, I'll be using JPGraph to look at three basic questions:

  • Is there a relation between a binary executable's file size and the number of shared libraries that it uses? I used ls -l to get file sizes. I also used ldd [filename] to count the number of shared libraries used.

  • When was the last time each binary executable was accessed? To get the last access times, I used the command stat -c %X [filename]

  • How many files use the shared library files? I used the ldd command to get shared library listings and then counted up all the hits I got on each shared library.

I printed the data for the first two questions to plain text files. For the third, I used a MySQL database for more flexibility.
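The data collection described above can be sketched in a few lines. This is only an illustration in Python, not the scripts used for the article: the function names are hypothetical, and the rule for deciding which lines of ldd output count as a shared library is an assumption.

```python
import os

def regular_files(directory):
    """Paths of regular files directly in `directory`, skipping symbolic links
    and subdirectories (so /usr/bin/X11 would not be descended into)."""
    return sorted(e.path for e in os.scandir(directory)
                  if e.is_file(follow_symlinks=False))

def shared_libraries(ldd_output):
    """Library names found in the text printed by one `ldd [filename]` run.

    Assumption: a line names a library when its first field contains ".so".
    A script makes ldd print "not a dynamic executable", which yields [].
    """
    libs = []
    for line in ldd_output.splitlines():
        fields = line.split()
        if fields and ".so" in fields[0]:
            libs.append(os.path.basename(fields[0]))
    return libs

def last_access(path):
    """Last access time in seconds since the epoch, as `stat -c %X` reports."""
    return int(os.lstat(path).st_atime)

# Illustrative ldd output; the addresses and paths are made up.
sample = (
    "\tlibm.so.6 => /lib/libm.so.6 (0x40021000)\n"
    "\tlibc.so.6 => /lib/libc.so.6 (0x40043000)\n"
    "\t/lib/ld-linux.so.2 (0x40000000)\n"
)
print(len(shared_libraries(sample)), "shared libraries")
```

Counting the entries per file answers the first question, the access time answers the second, and tallying the names across all files answers the third.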

While looking at the graphs that result, I'll also comment on some of the formatting features offered by JPGraph. I'll jump around a bit, but seeing the features in action shows their usefulness better than talking about them separately.

Graph 1: Is there any relation between file size and the number of shared libraries used?

Figure 1 shows the results of running the ldd command on the /usr/bin directory. I've also used this graph to showcase some of the features JPGraph offers.

Figure 1 (Mullikin-s1-scaled.png)

The blue line represents the file size, and you can see how the file size decreases along the X-axis. To see whether the number of shared libraries decreases as the file sizes decrease, I created a second Y-axis on the right-hand side of the graph. Once I collected the data, I was able to plot those same files in /usr/bin by the number of shared libraries they used.

But you'll notice that there are green, red and black circles, all in slightly different sizes. What's going on there? Well, JPGraph lets you do function callbacks in which you can alter the color and sizes of your plot points according to the Y-axis value, or, in this case, the Y2-axis value. The green circles represent files that use between 10 and 20 shared libraries. Black circles represent 0-9 links to shared libraries, and red circles over 20.

I could have simply used the same color (and size) for all the Y2-axis data points, but then the results wouldn't be so obvious. This way, you can immediately see that the green circles are heavily outnumbered by the red circles. In turn, the red circles are heavily outnumbered by the black circles.

Any circles in the pink cross-hatched area represent files that use 10-20 shared libraries. Because of the way I defined the callback function, any circles lying in the cross-hatched area are also going to be green. Circles lying above the cross-hatched area represent executables that use more than 20 shared libraries -- although, of course, they don't all use the same ones.
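The idea behind such a callback is simply a function from a data value to a marker style. The following is a minimal Python sketch of that logic, not JPGraph's PHP API; the function name is hypothetical, and the marker sizes are invented, since the article only says the circles differ slightly in size.

```python
def mark_style(lib_count):
    """Return (color, size) for one plot point, given its Y2-axis value.

    Thresholds follow the article: 0-9 black, 10-20 green, over 20 red.
    The sizes 4, 5 and 6 are placeholders chosen for illustration.
    """
    if lib_count > 20:
        return ("red", 6)
    if lib_count >= 10:
        return ("green", 5)
    return ("black", 4)

# One style per data point, in the same order as the data.
styles = [mark_style(n) for n in (0, 3, 12, 25)]
```

Because a single number drives both color and size here, the two properties always change together.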

Notice the Y-axis and how it uses a logarithmic scale. That was necessary because our filesizes range from less than 100 bytes all the way up to somewhere between one and 10 megabytes. One megabyte is 10 to the sixth power. JPGraph uses 10^6 to represent 1,000,000 because it is easier to read.
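To see why a logarithmic axis helps with such a range, compare where a file lands on a linear axis and on a logarithmic one. This Python sketch assumes axis limits of 10 bytes and 10^7 bytes, which are guesses that merely bracket the sizes mentioned; the function names are hypothetical.

```python
import math

AXIS_MIN = 10        # assumed lower limit, in bytes
AXIS_MAX = 10 ** 7   # assumed upper limit, in bytes

def linear_fraction(size_bytes):
    """Position along a linear axis, from 0.0 (bottom) to 1.0 (top)."""
    return (size_bytes - AXIS_MIN) / (AXIS_MAX - AXIS_MIN)

def log_fraction(size_bytes):
    """Position along a logarithmic axis: each power of 10 gets equal height."""
    span = math.log10(AXIS_MAX) - math.log10(AXIS_MIN)
    return (math.log10(size_bytes) - math.log10(AXIS_MIN)) / span

for size in (100, 10 ** 4, 10 ** 6):
    print(size, round(linear_fraction(size), 5), round(log_fraction(size), 3))
```

On the linear axis, everything below about 100 kilobytes is squeezed into the bottom one percent of the plot; on the logarithmic axis, a 10^4-byte file sits halfway up.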

So, after gathering the data and figuring out how to present it, what did I learn? The first thing I learned is that there are 1,253 files in my /usr/bin directory -- excluding the X11 subdirectory -- that are not symbolic links. It turns out that around 450-500 of these files are not dynamic executable files, but, presumably, text scripts that call other executable files. These files are represented by the black circles on the Y2-axis zero value line.

Perhaps I should have excluded such files from consideration. However, they do not affect the main idea that I get from looking at the graph. Although files linked to over 20 shared libraries (the red circles) are slightly more numerous for the first 600 files than they are for the next 600, the pattern is not nearly consistent enough for us to say that smaller files are generally linked to fewer libraries. I can conclude, though, that the majority of the binary executable files lie under the pink cross-hatched area, which means that they use fewer than ten shared libraries.

Before moving on, note that the graph in Figure 1 also showcases many of JPGraph's features, including:

  • Using TrueType fonts. I used one named Bazooka for my X- and Y-axis titles.

  • Shading under the line graph from one x value to another x value. I shaded under the graph in a subdued yellow color to highlight the files that lie between 10^4 bytes and 10^5 bytes in size.

  • Shading an entire vertical strip from one x value to another x value. I shaded in very light brown all the files that lay between 10^3 and 10^4 bytes, and had the computer figure out where to start and stop.

  • Using a gradient color scheme for the margins, while leaving the plot area in a solid color. The blend colors I used were red and black, but you can specify other colors.

  • Using alpha blending to specify a transparency value between zero and one. A typical value that I might use is 0.5 or so. In Figure 1, you can see how I used it to allow the circles to show through even though two sections of the graph have vertical fills. If the vertical fills simply covered the circles up, that would defeat the purpose of the graph.
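The effect of the transparency value in the last item can be illustrated with the usual blending arithmetic. This Python sketch is not JPGraph code; the function name is hypothetical, and it assumes that 0 means a fully opaque fill and 1 a fully transparent one.

```python
def blend(fill, behind, transparency):
    """Mix an RGB fill color with whatever lies behind it.

    Assumed convention: transparency 0.0 shows only the fill,
    1.0 shows only what is behind it.
    """
    return tuple(round(f * (1 - transparency) + b * transparency)
                 for f, b in zip(fill, behind))

# A fill drawn over a marker at 0.5 lets half of the marker's color through.
fill_color = (200, 100, 0)
marker_color = (0, 0, 100)
print(blend(fill_color, marker_color, 0.5))
```

With a transparency of 0, the fill would hide the circles completely, which is the situation the article wanted to avoid.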

Graph 2: When were binary executables last accessed?

To answer this question, I decided that a scatter plot would help us see when files were last accessed. I also decided to check file sizes, since a multi-megabyte file that hasn't been accessed in two years might be more of a candidate for deletion than one that only uses 100 kilobytes. Plotting this information required two Y-axes, one for the last access of each file (in Unix timestamp format, seconds since the epoch) and one for its size in bytes. To enhance the graph, I added the Tux logo after tweaking it slightly in the GIMP.

Figure 2 (Mullikin-s2-scaled.png)

The graph in Figure 2 is the result. It shows a large mass of red squares on the right-hand side that are stacked on or about the x value of January 6, 2005. This means that all of those files were last accessed on that date. The majority of the files in /usr/bin fall into this category. However, as we move to the left, another small cluster of red squares centers on September 12, 2004. The next large mass of red squares doesn't appear until the interval between Dec 15, 2002 and Feb 11, 2003.

How many such files are there in that last cluster? Our Y2-axis was designed to answer that question. The blue triangles with the numbers above them show that 293 files are represented by all the red squares stacked between Dec 15, 2002 and Feb 11, 2003.

More generally, one can see that the cumulative file count -- the orange area -- grows relatively slowly until the very far end of the graph at January 6, 2005. The graph shows that 477 files were last accessed before November 9, 2004. The remainder of the 1,253 files were accessed after that time.
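A cumulative count like this is straightforward to derive from the list of last access times. The Python sketch below shows one way to do it under the assumption that the timestamps are plain integers; it is not the article's code, and the function names and sample values are made up.

```python
from bisect import bisect_left

def cumulative_counts(access_times):
    """Pairs of (timestamp, number of files last accessed up to that time)."""
    ordered = sorted(access_times)
    return [(t, i) for i, t in enumerate(ordered, start=1)]

def accessed_before(access_times, cutoff):
    """How many files were last accessed strictly before `cutoff`."""
    return bisect_left(sorted(access_times), cutoff)

# Tiny invented data set: five files, timestamps in seconds since the epoch.
times = [300, 100, 500, 100, 400]
print(cumulative_counts(times))
print(accessed_before(times, 400))
```

Plotting the pairs from cumulative_counts against the second Y-axis gives the kind of rising area described here.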

What conclusions can be drawn from this graph? What strikes me is how long those vertical strips of squares are. It doesn't appear to make much of a difference what the file size is (the Y-axis value). Files of all sizes have the same distribution of last access times. I am not exactly sure why no files have a last access time earlier than Dec 15, 2002, but that may have been when I installed the system.

This particular graph allowed me to experiment with function callbacks for formatting text labels on the X-axis. When doing this graph, both the Y and Y2 axes were required to have the same x values so that the plots could be overlaid. However, I couldn't supply the dates in a form such as December 13, 2002, because JPGraph couldn't figure out how to order them by time. I had to use the Unix timestamp as the time value, and then use a callback function to reformat it into a human-readable date.
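The reason for plotting timestamps and formatting them only at display time can be shown with a short Python sketch. It is an illustration of the idea rather than JPGraph's callback interface; the function name, the label format, and the use of UTC are assumptions.

```python
from datetime import datetime, timezone

def format_tick(timestamp):
    """Turn seconds since the epoch into a label such as 'Jan 01, 1970'."""
    moment = datetime.fromtimestamp(timestamp, tz=timezone.utc)
    return moment.strftime("%b %d, %Y")

# Two dates from the graph, built as timestamps (midnight UTC assumed).
sep_2004 = int(datetime(2004, 9, 12, tzinfo=timezone.utc).timestamp())
jan_2005 = int(datetime(2005, 1, 6, tzinfo=timezone.utc).timestamp())

# Numbers sort chronologically; the formatted labels sort alphabetically.
by_time = [format_tick(t) for t in sorted([jan_2005, sep_2004])]
by_text = sorted(format_tick(t) for t in [jan_2005, sep_2004])
print(by_time)
print(by_text)
```

Sorting the labels as text puts January 2005 ahead of September 2004, which is why the numeric timestamp has to be the plotted value and the readable date only the label.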

Figure 2 also allowed a few other niceties, such as:

  • Using the text feature of JPGraph to place text at an arbitrary location on the graph while specifying color and transparency. (I used a transparency setting of 0.4 to allow any red squares to show through. Cumulative File Count is the text that I placed in white using a custom TrueType font.)

  • Printing only the lines on the major divisions of the Y-axis and making them dotted lines. This formatting was useful to maintain a sense of the Y-value at each power of 10, since the graph was logarithmic on the Y-scale.

  • Using the tab title feature to display the title of the graph. The text color, background color and frame color can all be set -- I used magenta, black and green.

Graph 3: How many files use the shared library files?

This graph highlights more of JPGraph's abilities. The previous graphs were 640 by 480, but for this graph I needed more vertical space, so I opted to make it 480 by 740. Even so, I had to confine myself to using only the top 50 shared libraries.

Figure 3 (Mullikin-s3-scaled.png)

Among the top 50 are two shared libraries that are each used by 832 files in /usr/bin. That is probably all the binary executables. Then we see libm.so.6, used by 414 files, libdl.so.2, used by 250, and then the rest.

I designed the graph so that it would be readable and understandable without the need for X- and Y-axis titles. I chose a rotated graph type, the horizontal bar graph. The blended bars going from green to blue make the consecutive bars stand out from one another. I decided to put the value inside each bar, instead of to the side of it, because the farther you get from the top, the harder it is to tell exactly what X-value you are at. I almost tilted the names of the shared libraries at a five degree angle, but I decided that hurt readability slightly. The SetLabelAngle method takes one argument, the number of degrees, positive or negative.

If you look at the graph in Figure 3, you'll notice that, because the bars are decreasing in size, there's empty space on the right side. Rather than leave it blank, I placed a legend there. Red on a yellow background is what the JPGraph documentation and examples use and I saw no reason to change that.

One last thing: For the other two graphs, I used simple text files for my data storage. For this one, I used MySQL. If I wanted to graph more than the top 50 libraries, all I would need to do is change my query. That's the type of power and flexibility that MySQL can provide when working with JPGraph. With text files, varying the display would be much harder.
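The kind of query involved might look like the following. This is a sketch only: it uses Python's built-in SQLite module as a stand-in for MySQL so that it runs without a server, and the table name, column names and sample rows are all hypothetical, since the article does not show its schema.

```python
import sqlite3

def top_libraries(conn, limit=50):
    """Return (library, number of files using it), most used first."""
    cursor = conn.execute(
        "SELECT library, COUNT(*) AS files "
        "FROM lib_usage "
        "GROUP BY library "
        "ORDER BY files DESC, library "
        "LIMIT ?",
        (limit,),
    )
    return cursor.fetchall()

# One row per (file, library) pair found by ldd; the rows are invented.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE lib_usage (file TEXT, library TEXT)")
conn.executemany(
    "INSERT INTO lib_usage VALUES (?, ?)",
    [("ls", "libc.so.6"), ("awk", "libc.so.6"), ("awk", "libm.so.6"),
     ("bc", "libc.so.6"), ("bc", "libm.so.6"), ("bc", "libdl.so.2")],
)
print(top_libraries(conn, limit=2))
```

Graphing the top 100 instead of the top 50 then means changing only the limit, while the counting stays in the database.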

Conclusion

Of course, JPGraph doesn't have everything. For example, I would like the ability to do three-dimensional graphs. A three-dimensional bar graph can show relationships that might be impossible to observe in two-dimensional graphs, such as last access time versus file size versus number of shared libraries. Another feature that needs enhancing is the callback function. In Figure 1, I used the callback function to set the color and size of the filled circles. The problem is that I was only able to use one number to determine both the size and the color. It would be nice if those could be determined by other arrays. Similarly, I would like to be able to use different colors on the individual bars of a single bar plot instead of having to use multiple bar plots. Overall, I am not disappointed with JPGraph's functions, but a few could offer finer-grained control.

You may not be interested in filesystems and how many shared libraries your system has, but you don't need to share my interests to appreciate JPGraph. No matter what your data, you might like to take advantage of JPGraph to perform your own data visualization. And with the comprehensive documentation and the hundreds of sample graphs that you can modify with your own data (located in http://localhost/jpgraph/src/Examples), JPGraph can have you up and running in no time.

Download JPGraph and see for yourself.

Glenn Mullikin is a professional Linux journalist.


Comments

on Filesystem data visualization using JPGraph

Note: Comments are owned by the poster. We are not responsible for their content.

Chartjunk

Posted by: Anonymous Coward on March 11, 2005 03:09 AM
These charts do look useful, but they contain lots of what Edward Tufte[1] would describe as chartjunk. I would suggest the JPGraph authors limit these interior decoration features, thus encouraging the user to maximize the data-ink of the charts.

[1] http://www.edwardtufte.com/tufte/books_vdqi

#

Re:Chartjunk

Posted by: orv on March 11, 2005 06:20 PM
True, but I think the point of the article is showing off the features rather than producing simple clean graphs.

#

Re:Chartjunk

Posted by: Anonymous Coward on April 09, 2005 04:55 AM
The actual elements on the chart are completely determined by the scripts using JpGraph. In the examples above, the author of this article has chosen to use a lot of what you call "chartjunk" by his own construction.

If you review the JpGraph page, there are many examples of very "pure" graphs with no distraction.

#

Source?

Posted by: Anonymous Coward on March 13, 2005 03:44 AM
Any chance of the source for these examples? Would be interesting to see.

#

Re: Different colors on bars

Posted by: Anonymous Coward on April 09, 2005 04:59 AM
The article wrongly states that the feature of having different colors on different bars isn't available.

This feature exists and is easily accessed by submitting an array of colors to the SetColor() method.

This feature is documented both in the manual and in the FAQ.

#
