UNDERSTANDING STATISTICS CANADA'S ESHELF PRODUCTS IN ASCII FORMAT: Statistics Canada's EShelf products of aggregate census statistics in flat-ascii format are organized according to the following: - the unit of observation is the geographic area. One or more physical records are used to describe each unit of observation. - one dimension (variable) is chosen to define the columns which comprise each record. - additional dimensions (variables) define the individual records that together make up a group of records (a logical record) that contains all the statistics for one unit of observation. Eg a file with one column dimension and 2 row dimensions (the first with 3 categories (eg total, male, and female), and the second with 5 categories) will have the following structure: Geog_code Row_dim1 Row_dim2 col_cat1 col_cat2 col_cat3 ... col_catn 00 0001 0001 [nnn] [nnn] [nnn] [nnn] 00 0001 0002 [nnn] [nnn] [nnn] [nnn] 00 0001 0003 [nnn] [nnn] [nnn] [nnn] 00 0001 0004 [nnn] [nnn] [nnn] [nnn] 00 0001 0005 [nnn] [nnn] [nnn] [nnn] 00 0002 0001 [nnn] [nnn] [nnn] [nnn] 00 0002 0002 00 0002 0003 etc 00 0002 0004 00 0002 0005 00 0003 0001 00 0003 0002 00 0003 0003 00 0003 0004 00 0003 0005 10 0001 0001 10 0001 0002 10 0001 0003 10 0001 0004 10 0001 0005 etc Ie, the first 15 rows describe the first geographic entity '00' (unit of observation), the next 15 records describe the second geographic entity '10', etc. The row_dim[n] fields indicate the order in which the row dimension categories cycle through the physical records that make up one logical record. Each physical record contains the geographic identifier, an indicator of the category on row dimension 1, an indicator of the category on row dimension 2, and the statistics on the column dimension categories that pertain to that combination of geography and row dimension categories. Eg. if row_dim1 is 'sex' and the values are respectively 1='total', 2='male' and 3='female' And row_dim2 is 'marital status', and the values are respectively: 1='total' 2='single' 3='married' 4='divorced' 5='widowed' And columns are defined by 'mother tongue', where col_cat1='total' col_cat2='English only' col_cat3='French only' col_cat4='English and French' all the way through to col_catn='other mother tongue n.e.s.' Then in the above example, - the 1st number in the col_cat1 column is: the number of total sex, total marital statuses, total mother tongues - the 2nd number in the col_cat1 column is: the number of total sex, single, total mother tongues - the 3rd number in the col_cat1 column is: the number of total sex, married, total mother tongues - the 4th number in the col_cat1 column is: the number of total sex, divorced, total mother tongues - the 5th number in the col_cat1 column is: the number of total sex, widowed, total mother tongues - the 6th number in the col_cat1 column is: the number of males, total marital statuses, total mother tongues - the 7th number in the col_cat1 column is: the number of males, single, total mother tongues - the 7th number in the col_cat2 column is: the number of males, single, English only mother tongue - the 7th number in the col_cat3 column is: the number of males, single, French only mother tongue - the 7th number in the col_catn column is: the number of males, single, other mother tongue n.e.s. etc UNCOMPRESSING THE ORIGINAL ZIP FILE FROM STATISTICS CANADA RESULTS IN THE FOLLOWING FILES: Colst001.eng Colst001.fra - contain the substantive definitions of the column dimension categories (English and French) - one file in English, one in French DatRec.txt - overall record structure EShelf.dat - contains the data - since every one of these products contains the data in the file called 'EShelf.dat', it is a very good idea to rename the file to the product code. EShelf.def - filename, date and time created - definition (and number of categories in) the column dimension - definition (and number of categories in) the row dimension(s) GeoRec.txt - contains the column layout of the Geostub.txt file Geostub.txt - contains the full geographic identifiers of each geographic area, including names (eg FED names, etc), and data quality codes Notes.eng Notes.fra - contain variable definitions, variable labels, and notes pertaining to specific dimensions and items. Rowst001.eng Rowst001.fra - contain dimension identifier and row dimension category labels - one file in English, one in French - there may be additional Rowst00[n] files, one for each row dimension WRITING AN SPSS SYNTAX FILE Note: the procedure to write a syntax file for SAS or Stata would be similar, but is not covered in this document. It is assumed that users who prefer to use SAS or Stata are sufficiently conversant with how the syntax files are structured that they can follow the procedures outlined here. 1. The table identifier is given in the Eshelf.def file. I normally use this on the title record of the SPSS (or SAS or Stata) syntax file: title 'Census of Canada, 1006 : EA-level 95f018696003'. subtitle SPSS syntax by L Ruus, UT/DLS 2010-05-11. 2. Number of records per geographic unit = number of non-0 rows defined in Rowst00[n].eng or .fra files. eg. Rowst001.eng contains 0000Age(110) -- we ignore this record, and start with the next 0001Total - Age 00020-4 0003 under 1 0004 1 0005 2 0006 3 0007 4 00085-9 0009 5 0010 6 0011 7 0012 8 0013 9 001410-14 0015 10 0016 11 0017 12 0018 13 0019 14 002015-19 0021 15 0022 16 0023 17 0024 18 0025 19 002620-24 0027 20 0028 21 0029 22 0030 23 0031 24 003225-29 0033 25 0034 26 0035 27 0036 28 0037 29 003830-34 0039 30 ... 009880-84 0099 80 0100 81 0101 82 0102 83 0103 84 010485-89 0105 85 0106 86 0107 87 0108 88 0109 89 011090+ -- this means there are 110 records for each unit of observation (geographic unit). And the resulting SPSS data list statement has: data list file='Eshelf.dat' records=110 3. The geographic portion of the data record is defined in DatRec.txt: GeoCode char(20) Dim1 Numeric(4) Col1 char(10) Col2 char(10) Col3 char(10) translates into SPSS as: EAuid 1-8 [or] prov 1-2 fed87 3-5 ea 6-7 dim1 21-24 col1 25-34 col2 35-44 col3 45-54 Columns 9-20 of the enumeration area level data set should be blank, but do check to make sure. Files at other levels of geography (eg CSD-level) which have geographic fields that contain names may have data in these columns. Note also that this example is based on an enumeration area level record. The layout of the geographic portion of each record in eg a census tract or census subdivision level record will be different. You are expected to know that census tract records have cma(3)+census tract name(f7.2) or cma(3)+census tract code(4) census subdivision records have prov(2)+cd(2)+csd(3) 4. Next each variable, on each of the records that makes up one unit of observation, needs to have a unique name, and its column location defined. One way to do this is to use the sequential numbering in the RowSt001.eng file to define rows, and the column numbering in ColSt001.eng to define columns. Ie name each variable according to the row and column in which it is located in the logical record. Eg variab 'r13c3' is on record (or row) 13 column 3 of the physical records that make up one logical record. The advantage of this approach is that you can use the Rowst001.eng file to define the variable labels later on. The lines in Rowst001.eng that read like this: 0001Total - Age 00020-4 0003 under 1 0004 1 0005 2 0006 3 0007 4 00085-9 0009 5 0010 6 0011 7 0012 8 0013 9 are edited to conform to the rules of SPSS syntax, to look like this: /1 r1c1 25-34 r1c2 35-44 r1c3 45-54 /2 r2c1 25-34 r2c2 35-44 r2c3 45-54 /3 r3c1 25-34 r3c2 35-44 r3c3 45-54 /4 r4c1 25-34 r4c2 35-44 r4c3 45-54 /5 r5c1 25-34 r5c2 35-44 r5c3 45-54 /6 r6c1 25-34 r6c2 35-44 r6c3 45-54 /7 r7c1 25-34 r7c2 35-44 r7c3 45-54 /8 r8c1 25-34 r8c2 35-44 r8c3 45-54 /9 r9c1 25-34 r9c2 35-44 r9c3 45-54 /10 r10c1 25-34 r10c2 35-44 r10c3 45-54 /11 r11c1 25-34 r11c2 35-44 r11c3 45-54 /12 r12c1 25-34 r12c2 35-44 r12c3 45-54 /13 r13c1 25-34 r13c2 35-44 r13c3 45-54 ....... /108 r108c1 25-34 r108c2 35-44 r108c3 45-54 /109 r109c1 25-34 r109c2 35-44 r109c3 45-54 /110 r110c1 25-34 r110c2 35-44 r110c3 45-54 Note that the formard slash '/' instructs SPSS to read the next physical record, and the number immediately following it confirms to SPSS the record number it should be reading. This number must always be less than or equal to the number of records given in the 'records=[n]' portion of the 'data list' statement. 5. Once all the variables have been defined, the next step is to define the variable labels. With a large file like this, I prefer to do this in a Unix/Linux environment, in which I can define a discrete file for each column, 'cat' them into one file, and sort them. In this example, we have three columns (total, males, and females) for each year of age and 5-year age groups. The records in Rowst001.eng which started off looking like this: 0001Total - Age 00020-4 0003 under 1 0004 1 0005 2 0006 3 0007 4 00085-9 0009 5 0010 6 0011 7 0012 8 0013 9 001410-14 0015 10 0016 11 0017 12 0018 13 0019 14 are edited to become SPSS variable labels matching the respective variable names assigned above: r1c1 'Total - Age' r2c1 'Age 0-4 - total' r3c1 'Age under 1 - total' r4c1 'Age 1 - total' r5c1 'Age 2 - total' r6c1 'Age 3 - total' r7c1 'Age 4 - total' r8c1 'Age 5-9 - total' r9c1 'Age 5 - total' r10c1 'Age 6 - total' r11c1 'Age 7 - total' r12c1 'Age 8 - total' r13c1 'Age 9 - total' r14c1 'Age 10-14 - total' r15c1 'Age 10 - total' r16c1 'Age 11 - total' r17c1 'Age 12 - total' r18c1 'Age 13 - total' r19c1 'Age 14 - total' a second file looks like this: r1c2 'Total - Age - males' r2c2 'Age 0-4 - males' r3c2 'Age under 1 - males' r4c2 'Age 1 - males' r5c2 'Age 2 - males' r6c2 'Age 3 - males' r7c2 'Age 4 - males' r8c2 'Age 5-9 - males' r9c2 'Age 5 - males' r10c2 'Age 6 - males' r11c2 'Age 7 - males' r12c2 'Age 8 - males' r13c2 'Age 9 - males' r14c2 'Age 10-14 - males' r15c2 'Age 10 - males' r16c2 'Age 11 - males' r17c2 'Age 12 - males' r18c2 'Age 13 - males' r19c2 'Age 14 - males' and a third file like this: r1c3 'Total - Age - females' r2c3 'Age 0-4 - females' r3c3 'Age under 1 - females' r4c3 'Age 1 - females' r5c3 'Age 2 - females' r6c3 'Age 3 - females' r7c3 'Age 4 - females' r8c3 'Age 5-9 - females' r9c3 'Age 5 - females' r10c3 'Age 6 - females' r11c3 'Age 7 - females' r12c3 'Age 8 - females' r13c3 'Age 9 - females' r14c3 'Age 10-14 - females' r15c3 'Age 10 - females' r16c3 'Age 11 - females' r17c3 'Age 12 - females' r18c3 'Age 13 - females' r19c3 'Age 14 - females' In Unix/Linux, use 'cat' to combine the 3 files into one. Then use 'sort' to sort them into the right order. This is not really necessary for SPSS, but does make editing the file by 'eyeballing' easier. They can then be added to the SPSS syntax file, after a 'variable labels' command. 6. The final SPSS syntax file looks like this: title 'Census of Canada, 1006 : EA-level 95f018696003'. subtitle SPSS syntax by L Ruus, UT/DLS 2010-05-11. data list file='Eshelf.dat' records=110 /1 eauid 1-8 prov 1-2 fed 3-5 ea 6-8 r1c1 25-34 r1c2 35-44 r1c3 45-54 /2 r2c1 25-34 r2c2 35-44 r2c3 45-54 /3 r3c1 25-34 r3c2 35-44 r3c3 45-54 /4 r4c1 25-34 r4c2 35-44 r4c3 45-54 /5 r5c1 25-34 r5c2 35-44 r5c3 45-54 /6 r6c1 25-34 r6c2 35-44 r6c3 45-54 /7 r7c1 25-34 r7c2 35-44 r7c3 45-54 /8 r8c1 25-34 r8c2 35-44 r8c3 45-54 /9 r9c1 25-34 r9c2 35-44 r9c3 45-54 /10 r10c1 25-34 r10c2 35-44 r10c3 45-54 /11 r11c1 25-34 r11c2 35-44 r11c3 45-54 /12 r12c1 25-34 r12c2 35-44 r12c3 45-54 /13 r13c1 25-34 r13c2 35-44 r13c3 45-54 /14 r14c1 25-34 r14c2 35-44 r14c3 45-54 /15 r15c1 25-34 r15c2 35-44 r15c3 45-54 /16 r16c1 25-34 r16c2 35-44 r16c3 45-54 /17 r17c1 25-34 r17c2 35-44 r17c3 45-54 /18 r18c1 25-34 r18c2 35-44 r18c3 45-54 /19 r19c1 25-34 r19c2 35-44 r19c3 45-54 /20 r20c1 25-34 r20c2 35-44 r20c3 45-54 /21 r21c1 25-34 r21c2 35-44 r21c3 45-54 /22 r22c1 25-34 r22c2 35-44 r22c3 45-54 ... /104 r104c1 25-34 r104c2 35-44 r104c3 45-54 /105 r105c1 25-34 r105c2 35-44 r105c3 45-54 /106 r106c1 25-34 r106c2 35-44 r106c3 45-54 /107 r107c1 25-34 r107c2 35-44 r107c3 45-54 /108 r108c1 25-34 r108c2 35-44 r108c3 45-54 /109 r109c1 25-34 r109c2 35-44 r109c3 45-54 /110 r110c1 25-34 r110c2 35-44 r110c3 45-54. variable labels r1c1 'Total - Age' r1c2 'Total - Age - males' r1c3 'Total - Age - females' r2c1 'Age 0-4 - total' r2c2 'Age 0-4 - males' r2c3 'Age 0-4 - females' r3c1 'Age under 1 - total' r3c2 'Age under 1 - males' r3c3 'Age under 1 - females' r4c1 'Age 1 - total' r4c2 'Age 1 - males' r4c3 'Age 1 - females' ... r109c1 'Age 89 - total' r109c2 'Age 89 - males' r109c3 'Age 89 - females' r110c1 'Age 90 plus - total' r110c2 'Age 90 plus - males' r110c3 'Age 90 plus - females'. execute. 7. If you wish to do a match-merge with the geographic information in the Geostub.txt file, you now have a file that has the same geographic level, ie both files now have one record per unit of observation. Laine G.M. Ruus University of Toronto. Data Library Service 2010-05-11