Reading the NCBI's GEO microarray SOFT files in R/BioConductor

The parent page discusses how to load GEO SOFT format microarray data from the Gene Expression Omnibus database (GEO) into R/BioConductor using GEOquery by Sean Davis.

There are of course other options. Saurin Jani at the MUSC Proteogenomics Facility had written some code which relied on the command line utility grep (e.g. here). After the NCBI file format changes in 2005 (new !dataset_table_begin and !dataset_table_end "marker" lines before and after the expression data), he changed to something based my code below. See Method 14 on this page.

There is also some discussion of parsing GEO SOFT files buried in this presentation by Vince Carey at the CSC (Finnish IT center for science), including extracting the "soft meta data" as he calls it from the header section (the text before the expression data).

My parser based on AnnBuilder's queryGEO

The BioConductor package AnnBuilder has a function queryGEO which appears to try and download a GEO GPL file from its identifier, and parse it. It can't download GDS files (which use a different URL), but the rest of their code can be adapted to load a GEO SOFT file from the harddisk:

library(Biobase)

loadGeoFile <- function(geoFilename) {
    temp <- readLines(geoFilename)    # Load the file
    temp <- temp[grep("\t", temp)]    # Keep only lines with tabs
    temp <- gsub("\t$", "\tNA", temp) # Deal with NA
    temp <- strsplit(temp, "\t")      # Split the strings at each tab
    temp <- t(sapply(temp, unlist))   # Turn each line into a vector, transpose
    colnames(temp) <- temp[1, ]
    rownames(temp) <- temp[ ,1]
    #Remove the row/col names from the data, and return it.
    #Note that all the entries are strings/characters, not numeric!
    temp[-1,-1]
}

esetFromGeoFile <- function(geoFilename) {
    temp <- loadGeoFile(geoFilename)
    if (colnames(temp)[1] == "IDENTIFIER") {
        #Remove the IDENTITY column, which
        #seems to be the GenBank Accession number
        temp <- temp[,-1]
    }
    #Make the remaining columns (the samples) numeric...
    #This is clumsy, there should be a better way.
    data <- matrix(as.numeric(temp), nrow(temp), ncol(temp))
    colnames(data) <- colnames(temp)
    rownames(data) <- rownames(temp)
    rm(temp)
    #Setup simple phenotype data (just a sample number)...
    pheno <- read.phenoData(NULL, colnames(data), FALSE)
    #Turn data into an expression set...
    eset <- new("exprSet", exprs = data, phenoData=pheno)
    rm(data)
    #return the new expression set:
    eset
}

I had lots of ideas for improvements (e.g. logarithms) but now plan to use GEOquery instead.

Using the Expression Set object

If you copy & paste my bit of code in at the R command prompt, you should then be able to load a GEO SOFT file into an expression set object as follows:

> eset <- esetFromGeoFile("GDS858.soft")
> eset
Expression Set (exprSet) with
        22283 genes
        19 samples
                 phenoData object with 1 variables and 19 cases
         varLabels
                sample: arbitrary numbering

> pData(eset)
         sample
GSM14498      1
GSM14499      2
GSM14500      3
GSM14501      4
GSM14513      5
GSM14514      6
GSM14515      7
GSM14516      8
GSM14506      9
GSM14507     10
GSM14508     11
GSM14502     12
GSM14503     13
GSM14504     14
GSM14505     15
GSM14509     16
GSM14510     17
GSM14511     18
GSM14512     19

> geneNames(eset)[1:10]
 [1] "1007_s_at" "1053_at"   "117_at"    "121_at"    "1255_g_at" "1294_at"
 [7] "1316_at"   "1320_at"   "1405_i_at" "1431_at"

> sampleNames(eset)
 [1] "GSM14498" "GSM14499" "GSM14500" "GSM14501" "GSM14513" "GSM14514"
 [7] "GSM14515" "GSM14516" "GSM14506" "GSM14507" "GSM14508" "GSM14502"
[13] "GSM14503" "GSM14504" "GSM14505" "GSM14509" "GSM14510" "GSM14511"
[19] "GSM14512"

Note that there is no proper phenotype information extracted from the GEO file, unlike the expression set object created by GEOquery.

Loading the GEO Annotation file

In addition to loading a GDS file to get the expression levels, you can also load the associated platform annotation file. In the case of GDS858, its GPL96, Affymetrix GeneChip Human Genome U133 Array Set HG-U133A which is available in compressed form as GPL96.annot.gz.

Again, GEOquery does a much nicer job of this. However, my simple code is sufficient for most purposes...

> anntGPL96  <- loadGeoFile("GPL96.annot")
> anntGPL96[1:5,1:3]
           Gene                                           Unigene      UniGene title      
1007_s_at  "discoidin domain receptor family, member 1"   ""           ""                 
1053_at    "replication factor C (activator 1) 2, 40kDa"  ""           ""                 
117_at     "heat shock 70kDa protein 6 (HSP70B')"         ""           ""                 
121_at     "paired box gene 8"                            "Hs.469728"  "Paired box gene 8"
1255_g_at  "guanylate cyclase activator 1A (retina)"      ""           ""

> colnames(anntGPL96)
 [1] "Gene"              "Unigene"           "UniGene title"     "Nucleotide"       
 [5] "Protein"           "GI"                "GenBank Accession" "Gene symbol"      
 [9] "Platform_CLONEID"  "Platform_ORF"      "Platform_SPOTID"   "Platform_SPACC"   
[13] "Platform_PTACC"

Once you have identified a gene of interest from your expression data, say 121_at, you can use the annotation table to find out more about it:

> anntGPL96["121_at","Gene"]
[1] "paired box gene 8"

> anntGPL96["121_at","GenBank Accession"]
[1] "X69699"

> anntGPL96["121_at","Unigene"]
[1] "Hs.469728"

> anntGPL96["121_at","Gene symbol"]
[1] "PAX8"

> anntGPL96["121_at","GI"]
[1] "38425"

Instead of loading the GEO annotation file for GPL96/HG-U133A, we could use an existing annotation package from the BioConductor annotation sets. This is discussed here.