
Linux.com

Feature: Desktop Software

PSPP brings an industry standard statistical tool to Linux

By Andrew Choens on October 16, 2008 (7:00:00 PM)


Today's information systems give organizations and governments the ability to collect and access metaphorical mountains of information. But this information is useless unless we can find and understand the relationships and trends hidden in those mountains. For projects involving complex research protocols, high-end statistical analysis tools such as SPSS and SAS are useful, but they come with high price tags and proprietary licenses. PSPP is an open source clone of SPSS, one of the most commonly used proprietary statistical packages.

Major distributions like Fedora and Ubuntu include PSPP in their package repositories, but the versions they currently ship are outdated. Upcoming releases of Ubuntu, Fedora, and openSUSE will include version 0.6.0. Until those are released, if you want to try PSPP, you can compile the current version from source, or look on PSPP's wiki to see whether a volunteer has provided a binary for your distribution.

SPSS: A proprietary standard

Before introducing you to PSPP, I want to introduce you to SPSS. SPSS was originally designed for researchers in the social sciences but is now used in many other fields, and by analysts working for federal and state agencies, large corporations, and academia. SPSS is a remarkable tool because it offers a robust programming language for the analysis of complex data and a user interface that gives less technical users unfettered access to the power of the underlying system.

SPSS's intuitive GUI makes it accessible to users who have little or no programming experience. Other statistical packages such as R (open source) and SAS (proprietary) are used almost exclusively by experienced programmers. SPSS provides users with an interface that is remarkably similar to a spreadsheet, with which analysts can design complex data transforms or build mind-bogglingly detailed cross-tabs.

Although the GUI is one of the package's killer features, SPSS also provides many opportunities for programmers to write scripts. SPSS Syntax is an easy-to-use functional programming language designed specifically for data analysis. As the expectations of programmers have evolved, SPSS has offered additional programmability through plugins and language enhancements.

Newer versions of SPSS work on Linux, thanks to the cross-platform magic of Java, but a fully enabled (non-student) license costs nearly $1,700, and annual maintenance costs an additional $425. Worse, the license is time-limited -- in 2011 my legally purchased license of SPSS 11 will expire. The cost and licensing provisions of SPSS create an opportunity for our community to develop an alternative.

PSPP: An open source alternative

As an open source alternative, PSPP 0.6 is an incomplete yet compelling product that should grab the attention of developers and end users alike. It gives Linux a general purpose data analysis tool with the accessibility advantages of its proprietary cousin. If you're in the market for an open source statistical package, there are two reasons PSPP should be on your short list: its new GUI, psppire, and its high degree of compatibility with SPSS syntax.

For many users, the newly introduced GUI is probably the most important new feature in the 0.6 series. Earlier versions of PSPP were command-line only, which limited the software's appeal to programmers. The new GUI mimics the familiar dialog boxes found in SPSS's interface, making the transition easier. As in SPSS, psppire's interface gives non-programmers full access to the power of the underlying system. The dialog boxes are clear and easy to use. For repetitive analysis, writing a script will always be easier, but psppire gives users access to the same tools available to the programmers.

I could find only one significant limitation in psppire: PSPP still lacks many statistical tools found in similar products. Naturally, the GUI reflects this limitation, and users familiar with SPSS will notice that psppire's menus are somewhat empty.

PSPP's compatibility with SPSS syntax is as important as the GUI. SPSS syntax is a widely understood standard in many companies and government agencies. In my job I often help state governments calculate a set of complex outcome measures developed by the federal government as part of the Child and Family Services Review. The Feds publish the thousands of lines of syntax necessary to compute the measures, but only for SPSS users. The code could be ported to another tool, but this task is decidedly non-trivial. More importantly, porting the syntax could easily introduce subtle errors in the calculation of the measures. It is ironic that most states use a proprietary product to run code that is available for free on the Internet.

Usable today

Although the 0.6 series of PSPP is not a finished product, it can already perform data transformations and basic statistical analysis. Users can also create tables of univariate statistics or build complex cross-tabs of multiple variables. Weighting cases according to a variable works as expected. As PSPP continues to mature, it will help more professionals and students who are comfortable with SPSS convert to open source.
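To make the idea of a weighted cross-tab concrete, here is a minimal sketch in Python. It is not PSPP or SPSS syntax, and it does not reflect how either package is implemented; the function name, the variable names, and the survey records are all made up for illustration. Each case contributes its weight, rather than a count of one, to the cell it falls in.

```python
from collections import defaultdict


def weighted_crosstab(cases, row_var, col_var, weight_var=None):
    """Return {(row value, column value): total weight}.

    When weight_var is None, every case counts as 1, which gives an
    ordinary (unweighted) cross-tab.
    """
    table = defaultdict(float)
    for case in cases:
        weight = case[weight_var] if weight_var else 1.0
        table[(case[row_var], case[col_var])] += weight
    return dict(table)


# Hypothetical survey records; "wt" is an invented sampling weight.
cases = [
    {"region": "north", "employed": "yes", "wt": 1.5},
    {"region": "north", "employed": "no", "wt": 0.5},
    {"region": "south", "employed": "yes", "wt": 2.0},
    {"region": "north", "employed": "yes", "wt": 1.0},
]

table = weighted_crosstab(cases, "region", "employed", weight_var="wt")
for (row, col), total in sorted(table.items()):
    print(f"{row:<6} {col:<4} {total:.1f}")
```

Calling the same function without `weight_var` gives plain counts, which makes it easy to see how much the weights change each cell.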

There are some limitations to PSPP. Many advanced statistical analysis methods, such as MANOVA, are not yet implemented. Tables and charts produced by PSPP are less customizable than the output generated by SPSS or R. Most importantly, version 0.6.0 incorrectly calculates linear regressions. On October 10, the project released version 0.6.1 to fix the regression error. Improving the tables and charts generated by PSPP is a high priority for future releases, and the developers are working hard to expand the suite of tools.
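A regression bug like the one in 0.6.0 is a reminder that it is worth sanity-checking any package's output on a small case you can verify by hand. The following Python sketch computes an ordinary least-squares fit for a single predictor using the textbook closed-form formulas. It is an independent illustration, not PSPP's code, and the function name and data are assumptions chosen for the example.

```python
def simple_linear_regression(xs, ys):
    """Ordinary least squares for y = intercept + slope * x.

    Uses the closed-form solution: slope = Sxy / Sxx and
    intercept = mean(y) - slope * mean(x).
    """
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two equal-length sequences with at least 2 points")
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    if sxx == 0:
        raise ValueError("all x values are identical; slope is undefined")
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Made-up points that lie exactly on y = 2x + 1, so the correct
# answer is known in advance.
xs = [0.0, 1.0, 2.0, 3.0]
ys = [1.0, 3.0, 5.0, 7.0]

slope, intercept = simple_linear_regression(xs, ys)
print(f"slope={slope:.3f} intercept={intercept:.3f}")
```

Feeding the same small dataset to the statistical package and comparing coefficients is a quick way to confirm that the installed version is not affected.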

The manual for PSPP is available at the GNU Web site, with detailed documentation for programmers for each implemented function. The manual also includes a full list of SPSS functions not yet implemented in PSPP. Unfortunately, the current manual is singularly focused on the implementation of the programming language. The PSPP community has not produced a similar manual for the psppire GUI.

PSPP has an active mailing list on which the developers participate. Discussions often focus on compiling PSPP, but other questions are welcome. The project also welcomes help; developers with a strong background in statistics are especially needed.

Andrew Choens, MSW, works as a research policy analyst for a consulting firm specializing in human services. He abandoned Windows in 2000 for SUSE Linux. He can sometimes be found on the Ubuntu Forums as gunksta.


Comments on PSPP brings an industry standard statistical tool to Linux

Note: Comments are owned by the poster. We are not responsible for their content.

Off-Topic

Posted by: Anonymous [ip: unknown] on October 16, 2008 07:51 PM
Considering the above comment about Politics, wouldn't a link to flag "improper content" be useful?

#

PSPP brings an industry standard statistical tool to Linux

Posted by: Anonymous [ip: 151.59.50.63] on October 16, 2008 08:52 PM
I wonder if a history of SPSS is available somewhere. I have the feeling it once had a real community working on its development - or at least lots of students. The old manuals by Mrs. Norusis are still the best school for illiterates like me. Today, SPSS ships with a manual that talks about menus and where to click... there are more good reasons for PSPP than just SPSS' price.

#

PSPP brings an industry standard statistical tool to Linux

Posted by: Anonymous [ip: 24.17.159.247] on October 16, 2008 09:34 PM
It would be nice to know how it compares to R, another open source statistical package which can be found here:

http://www.r-project.org/

The current stable release of R is 2.7.2

AC

#

Re: PSPP brings an industry standard statistical tool to Linux

Posted by: Andy Choens on October 16, 2008 11:28 PM
Unfortunately, the editors removed the parts of the article that did compare PSPP to R more thoroughly. Here's the slimmed down conclusion. If you already know how to use R, and don't need compatibility with SPSS, you should stick with R. It is a more mature project and has more capacity.

However, if you need compatibility with SPSS syntax (which R does not provide, it can only import/export data) or need a tightly integrated GUI, PSPP is worth looking at.

In my job, compatibility with SPSS Syntax is absolutely critical. But, R has a much more mature feature set, as long as you don't mind writing the code to make it work.

#

Re(1): PSPP brings an industry standard statistical tool to Linux

Posted by: Anonymous [ip: 38.100.42.25] on October 17, 2008 02:50 AM
You can do more statistical analysis in R, but it performs dismally when presented with huge datasets, whereas PSPP is designed to accommodate even the most enormous data without affecting performance, even data that exceeds the capacity of the machine's memory. Also, many people think that PSPP/SPSS is better for visualising and manipulating data.

#

Re: PSPP brings an industry standard statistical tool to Linux

Posted by: Anonymous [ip: 87.61.53.234] on October 23, 2008 06:34 PM
pspp doesn't hold a candle to R, period! And frankly, IMHO that's true of spss too. I use R with RKWard. RKWard turns R into an IDE for all your statistical work. I've used spss and sas for years and still use them occasionally, but R+RKWard beats most commercial packages hands down! In R you have the newest versions of the methods you need. You are not forced to live with incorrectly applied or outdated methods for years, and you can create your own versions according to the absolute latest in the literature. Or, if someone else has been there before you and made a package of it, simply install their package and you are set to go. That's progress you can believe in ;-)

Btw, interestingly, R's import function for spss files comes directly from the pspp project. In that sense they are related, which is the most interesting thing about pspp. It only supports a small subset of spss 7 functions while spss is now at version 17, so to me the pspp project, as much as I admire it, is a bit past its time. Spss is in many ways outdated (regarding statistics), so producing a clone is not as important as it was when pspp started many years ago.

#

PSPP brings an industry standard statistical tool to Linux

Posted by: Anonymous [ip: 91.67.242.59] on October 17, 2008 12:22 PM
Why not use R in PSPP? This way they would have even more methods than SPSS. The GUI could be like SPSS; only the scripts wouldn't be compatible with SPSS, but heck, SPSS 16 also supports R. I think this would really ease adoption.

#

Re: PSPP brings an industry standard statistical tool to Linux

Posted by: Anonymous [ip: 91.67.242.59] on October 17, 2008 12:36 PM
Update: As a compromise, PSPP could integrate R so that you have two modes, like "Native mode" = R and "SPSS compatible mode" = PSPP engine. It would be good to unite the free statistical software movement with a rich, SPSS-like GUI for R. Besides that, PSPP is a really great name :) and many thanx for the good work.

#

Speed of R

Posted by: Anonymous [ip: 10.241.128.10] on October 17, 2008 02:19 PM
-- You can do more statistical analysis in R, but it performs dismally when presented with huge datasets

Can you provide some benchmarks for this? Most I have seen have R outperforming SPSS and SAS by miles. I have never seen an R vs PSPP comparison. I take on board your comment about datasets in R being confined to size of physical memory.

I'm genuinely interested, not trolling. I use a bit of R for simple stats at work and most statisticians I have spoken to are very positive about it.

#

Re: Speed of R

Posted by: Anonymous [ip: 87.61.53.234] on October 23, 2008 07:08 PM
Hi (I'm not the one you're quoting)
I've used both and I've seen the issue with large datasets in R ('huge' probably means millions of records).

My experience is that as long as you have enough memory you are not likely to see any problems. But if you run out of memory, R fails most disgracefully. When R is running close to the memory limit, it slows down tremendously.

It's a well known problem that stems from two factors. 1) R needs all its data in memory 2) R doesn't do 'call by reference' which means every time an object is passed to a function a new copy is made in memory. This is further aggravated by the typical coding style of R where you often end up with many levels of nested functions necessitating a new copy at every level. In some instances I had to untangle the whole process to bypass the most memory intensive steps or I had to write my own custom versions to make my analyses fit the available memory.

I know there have been some efforts regarding R's performance with large datasets, though I haven't looked into it recently. Spss (at least until version 15) had a lot of trickery to deal with large datasets. I didn't know that about pspp before, so I'm going to check it out.
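For readers wondering how statistics can be computed on data that does not fit in memory, here is a rough Python sketch of the general case-at-a-time approach. It uses Welford's one-pass algorithm for the mean and standard deviation. This is only an illustration of the technique; it makes no claim about how PSPP or SPSS are implemented internally, and the function names and generated data are invented for the example.

```python
import math


def streaming_mean_sd(values):
    """One-pass mean and sample standard deviation (Welford's algorithm).

    Only three running numbers are kept, so memory use stays constant
    no matter how many cases are read.
    """
    n = 0
    mean = 0.0
    m2 = 0.0  # running sum of squared deviations from the current mean
    for x in values:
        n += 1
        delta = x - mean
        mean += delta / n
        m2 += delta * (x - mean)
    if n < 2:
        return n, mean, float("nan")
    return n, mean, math.sqrt(m2 / (n - 1))


def generate_cases(count):
    """Stand-in for reading cases from disk one at a time (hypothetical data)."""
    for i in range(count):
        yield float(i % 10)


# The generator never materialises the full dataset in memory.
n, mean, sd = streaming_mean_sd(generate_cases(100_000))
print(f"n={n} mean={mean:.3f} sd={sd:.3f}")
```

Not every analysis can be written in a single pass like this, which is one reason in-memory tools remain attractive when the data fits.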

#

PSPP brings an industry standard statistical tool to Linux

Posted by: Andy Choens on October 17, 2008 03:28 PM
There is no reason for PSPP to use the R engine. Go read the PSPP FAQ @
http://www.gnu.org/software/pspp/faq.html

This page offers a pretty good explanation for why PSPP should not simply use R to process data.

I would like to see PSPP offer R as an add-on, in a way similar to recent versions of SPSS, but I do not want to see PSPP lose the ability to process SPSS Syntax. For me, _the_ biggest selling point for PSPP is that it can process SPSS Syntax. I work in a field where SPSS is the most commonly used tool. I must be able to use SPSS syntax sent to me by others in the office and elsewhere. I often must design my analysis in SPSS Syntax, so that others in the office can check it for bugs / errors. I am the only person in the office who can use R. Switching to R, or any other tool that relies on R, is an absolute non-starter for the office I work in. I would like to convert these folks to Linux, and PSPP's current feature set makes that more likely (in the future) than any R-based tool ever will.

#

Re: PSPP brings an industry standard statistical tool to Linux

Posted by: Anonymous [ip: 10.241.128.10] on October 17, 2008 05:33 PM
To take things slightly offtopic (but prob appropriate for Linux.com), yet again I gasp in disbelief at the FSF attitude towards copyright:

"Many of these other programs have numerous authors, each holding copyright on their respective parts. By contrast, the copyright in PSPP is held by one entity (the Free Software Foundation), which makes for easier copyright tracking and enforcement. Merging with another program would mean losing this advantage, or needing to ask scores of authors to re-assign their copyright. This would be neither practical nor polite."

R (and gretl) are released under the _GPL_, for pity's sake - the very licence the FSF have devised to absorb such problems. There is no need or requirement to reassign copyright but the FSF continue to insist on this for contributions to their projects. Thus they could (however fanciful it may seem) close the project in the future or relicense under a scheme which would not be acceptable to the original authors.

Sorry for the rant. I take the other points on board.

#

Re(1): PSPP brings an industry standard statistical tool to Linux

Posted by: Anonymous [ip: 130.95.80.194] on October 18, 2008 02:43 AM
Firstly, the FSF don't "insist" on this and never have done. Gretl and R are both examples of GNU projects whose copyright is not FSF owned. However, in the case of projects with lots of contributors they do encourage the practice of having the copyright owned by a single body.

Secondly, after the "SCO-owns-the-copyright-to-the-linux-kernel" episode, I'm astounded that anyone can still question the wisdom of this.

#

Re(2): PSPP brings an industry standard statistical tool to Linux

Posted by: Anonymous [ip: 91.84.13.8] on October 18, 2008 11:20 AM
And did SCO-own-the copyright-to-the-linux-kernel?

No.

If every kernel hacker had reassigned copyright to Linus, would it have made any difference to the court case?

No.

The FSF's attitude towards copyright is crippling. For example I have followed the mailing lists for GnuPDF. This project does not have "lots of contributors", being very early in development. But they have been happy to turn away willing coders over copyright reassignment. If contributors are willing to offer code under "GPLvX or above" this should be enough.

FSF/GNU remains a cathedral rather than a bazaar.

#

PSPP brings an industry standard statistical tool to Linux

Posted by: Anonymous [ip: 79.109.214.169] on October 17, 2008 04:52 PM
For me Gretl is a very useful program! You must have a look at it!

#




 