************************;
* Contents of directory ;
************************;

Title   : The Consequences of Industrialization: Evidence from Water Pollution
          and Digestive Cancers in China
Date    : November 2009
Purpose : In this readme file, I trace the paths of each of the data sets for
          the project and the programs underlying each of the empirical
          results. First, I list the subdirectories of ~/research/pollution.
          Then, I reference the dofile underlying each of the paper's tables.
          These files are sufficient for replication, but you will have to
          download the data and change the paths to fit your local machine.
Note    : To view the contents of the directory, delete "readme.txt" from the
          URL in your browser's address bar and hit return. That allows you to
          access the files. Enjoy!

FILES found in ~/research/pollution/ (symbolically linked with "pollution")

**************************;
* Data directories        ;
**************************;

1. GIS. The GIS directory is where most of the data are stored. This is roughly
   similar to C:/GIS_DATA/Scott/ on my PC. The subdirectories are:

   a. water_points. This contains the 484 water quality monitoring points
      assigned to river basins. See Pollutants_Counties_watersheds.dbf.

   b. air_pollution. This contains the measures of long-term particulates by
      level 6 river basin. See longterm_particulatesbybasin.csv.

   c. rainfall. This contains precipitation rates by river basin. See
      precipitation_by_basinlevel6.csv.

   d. stream_lengths. This is the stream data, which identifies how far each
      stream segment is from the headwaters/outlet.

   e. stream_nodes. This contains the points of intersection of the streams.

   f. Hydro1k. This is the base data from the Hydro1k project, with China
      divided into river basins. The key file here is
      china_hydro1k_pfafcoded.csv, which is converted into hydro1k_pfaf.dta.
      For each river basin, hydro1k_pfaf.dta records the upstream/downstream
      basin at the level3 through level6 aggregation.
      Note that currently the upstream manufacturing instrument is based on
      the level4 basin upstream of each DSP point. The data in
      hydro1k_pfaf.dta has 1,709 points, corresponding to the river basins in
      the pan-Asia region.

   g. Dissolved_Watersheds. This is the dissolved Hydro1k data, used to
      analyze river basins at higher levels of aggregation than the level 6
      basins.

   h. county_output. This is the county output point data (provided by Jing
      Cao with a latitude/longitude marker) assigned to river basins. This is
      how I figure out how much manufacturing output occurs in each basin. The
      key file I later use is
      Manufacturing_Points_wHydro1k_FullJoin_NoBad.csv, which has each of the
      3,470 points assigned to a river basin, as well as the manufacturing
      output by industry recorded at each point.

   i. county_points. This is the original Harvard Geospatial Library copy of
      the census, with each county assigned information on which river basin
      it is located in, when the center of the county (its "centroid") is
      chosen as the location of the county. Note that since there are 2,873
      counties and only 989 river basins, the choice of centroid is usually
      not critical. See countycentroids_watersheds.dbf/dta, which has 2,873
      points and is how I assign each DSP point to a particular location,
      using a correspondence between the DSP location and the county
      (gbcode). See the DSP directory for an explanation of how this is
      executed (gbcode.do, dsplabels.xls). I also assign each county to its
      nearest water point and nearest stream.

   j. death_points. This is just a spatial join of each of the 145 DSP points
      (using the county centroid for the point) with the closest water
      monitoring station. Not used.

   k. fertilizer. This is provincial fertilizer use in terms of tonnage of
      nitrogen and phosphate. See 2004fertilizer_clean.csv, which I am not
      currently using.

   l. basemap. This is the slightly tweaked copy of China's 2000 census
      shapefile provided by the Harvard Geospatial Library (HGL). See
      ch2000longfinal.shp.
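
The aggregation described under county_output (each manufacturing point is
assigned to a river basin, and output is then totalled by basin) can be
sketched in a few lines of Python for readers without Stata. This is only an
illustration: the column names basin_id and output and the sample rows are
hypothetical and are not taken from
Manufacturing_Points_wHydro1k_FullJoin_NoBad.csv. The project's own dofiles
remain the reference implementation.

```python
import csv
import io
from collections import defaultdict


def output_by_basin(rows, basin_col="basin_id", output_col="output"):
    """Sum manufacturing output over all points that fall in the same basin.

    `rows` is any iterable of dict-like records, e.g. a csv.DictReader.
    The column names are hypothetical placeholders, not the real file's.
    """
    totals = defaultdict(float)
    for row in rows:
        totals[row[basin_col]] += float(row[output_col])
    return dict(totals)


# Made-up sample data standing in for the point-level file.
SAMPLE = """basin_id,output
4401,10.5
4401,2.0
4402,7.25
"""

totals = output_by_basin(csv.DictReader(io.StringIO(SAMPLE)))
print(totals)
```

To use it on a real file, pass csv.DictReader(open(path, newline="")) and
supply the actual column names through basin_col and output_col.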

*****************************;
* Executable files directory ;
*****************************;

2. dofiles. These are the executable files that make the data, tables, and
   figures for the paper. An extensive readme file is in this directory.

*********************************************************;
* Intermediate data files used in the analysis directory ;
*********************************************************;

3. datafiles. This is the data for the paper. The key data set here is
   dsp_basins.dta. It has the 145 DSP points and all the other information
   assigned to them, sufficient to recreate almost the entire paper. See also
   levies_dumping_output_1992to2002.dta, which is used for Table 9. The
   basin-level data for the water quality-dumping regressions is also
   included in this directory.

4. dumping. This directory contains the .csv files for the water dumping data
   by province and year (e.g. 2005data.csv), and the Stata data sets created
   from these files. It also has the dumping-by-industry data. It is a direct
   subdirectory of the program directory because it is parallel on my machine
   and on Ali's PC. The integration of the dumping data is executed from the
   dofiles directory, but the data are saved here as
   levies_dumping_output_1992to2002.dta.

4b. industries. This directory contains the industrial data, which is mostly
   not used in the paper. It is kept here as a historical artifact.

***************************************************************;
* Data files that I need to bring to windows (my home machine) ;
***************************************************************;

5. outfiles. Here is where I outsheet the regressions and the data for making
   maps. It's also where I upload the older cancer data (cancer73.csv).

************************************;
* Literature review for the project ;
************************************;

6. references. These are papers I directly cite in the paper, in pdf form.
   In some cases, my citations were newspaper articles (with weblinks) or
   books (which have neither).

7. litreview. These are files that are relevant to the project, which may or
   may not be directly cited.

************************************;
* Data Appendix                     ;
************************************;

8. data_appendix. This contains a cleaner version of the files that I
   reference throughout the data appendix in the text. This is the easiest
   way for an average user to exploit the data in the paper.

************************************;
* Powerpoints                       ;
************************************;

9. powerpoints. My slides from presentations of this material.

****************************************************************;
* Accounting for each table, (mostly) in the order of the paper ;
****************************************************************;

* For each table I list the outfile, or PTS if the table Prints to Screen.
* There are 9 main tables and 7 appendix tables.
* To recreate the tables, execute the following:

 1. tables.do. PTS
 2. tables.do. PTS
 3. olsregs.do ->olsregs.out.
 4. sexregs.do. PTS
 5. tables.do. PTS
 6. tributaryregs.do ->tributaryregs.out.
 7. regs_cd.do. PTS
 8. wqregs.do ->chemicalregs2004.out
 9. levyregs.do ->levyregs.out
10. Internal calculations and 2001 Yearbook [operating expenses+10% of
    equipment value].
11. tables.do. PTS
12. waterpoints.do.
13. olsregs_step.do.
14. tables.do.
15. county_dumping.do.
16. cancer73-finish.do.

(appendix table 6 is currently not used).
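
The table list above can also be written down as a small lookup, which makes
it easy to see which dofiles a full replication needs. The Python sketch
below is a convenience and not part of the project. It assumes entry 5 refers
to tables.do, that the dofiles sit in a directory named dofiles, and that
Stata can be started in batch mode as "stata -b do <file>" (the executable
may be called something else on your machine, such as stata-se or stata-mp).
It only builds the command strings and does not run anything.

```python
# Entry number in the table list -> dofile that produces it.
# None marks entry 10, which comes from internal calculations and the
# 2001 Yearbook rather than from a dofile.
DOFILES = {
    1: "tables.do",
    2: "tables.do",
    3: "olsregs.do",
    4: "sexregs.do",
    5: "tables.do",  # assumed to be the same tables.do as entries 1 and 2
    6: "tributaryregs.do",
    7: "regs_cd.do",
    8: "wqregs.do",
    9: "levyregs.do",
    10: None,
    11: "tables.do",
    12: "waterpoints.do",
    13: "olsregs_step.do",
    14: "tables.do",
    15: "county_dumping.do",
    16: "cancer73-finish.do",
}


def replication_commands(dofile_dir="dofiles", stata="stata"):
    """Return one batch-mode command per distinct dofile, in list order.

    `dofile_dir` and `stata` are assumptions about the local setup.
    """
    seen = []
    for entry in sorted(DOFILES):
        name = DOFILES[entry]
        if name is not None and name not in seen:
            seen.append(name)
    return [f"{stata} -b do {dofile_dir}/{name}" for name in seen]


commands = replication_commands()
for command in commands:
    print(command)
```

The paths inside each dofile still have to be changed to fit your local
machine before any of these commands will run.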