The Need for an International Data Base for Arsenic

15 Feb 98 draft of a talk given at: International Conference on Arsenic Pollution of Ground Water in Bangladesh: Causes, Effects and Remedies, February 1998

Richard Wilson, D. Phil, Harvard University

Introduction

It is nearly 30 years since the data from Taiwan on the incidence and prevalence of skin cancers became available and were attributed to arsenic in the well waters (Tseng et al 1968). It is 10 years since Professor Chien-Jen Chen (Wu and Chen 198x, Chen et al 1988, Chen and Wang 1990) shocked us all by his reports that in the same area there seemed to be an increase in fatal skin cancers and, more important, that fatal cancers appeared at several other sites.

After the initial shock wore off, many scientists throughout the world started looking for places where the same effects might be occurring, and other scientists in less fortunate parts of the world began to realize that some of the problems that were present in their countries might be due to arsenic. Several conferences are now taking stock of the situation, and it seems now appropriate to discuss what might be achieved by having an internationally available data base, or set of data files that might be called a data network.

My personal interest in airborne arsenic dates back 20 years, to when I started thinking about comparisons of carcinogenic potency between species. Arsenic seemed to be, and still does seem to be, an exception to the "usual" rules, which were admittedly developed from a study of organic carcinogens.

In 1991 I and others (Byrd et al) were trying to understand acceptable levels of natural arsenic from mine tailings when we came across Dr Chen's work while we were looking at interspecies comparisons. On a trip to Taiwan I called on Dr Chen and reported at once that his results seemed to me completely believable. We did our own numerical analysis, which we reported at a medical meeting in April 1992 and published some time later (Byrd et al 1996). Professor Smith came to similar conclusions.

On another trip to Taiwan in 1994, Dr Chen put me in touch with Dr Luo in Huhhot, who had been one of those following the literature and had deduced that the problems in Inner Mongolia were probably due to arsenic. Since then we have been trying to help bring Dr Luo's data to international attention (Luo et al 1995) and to get international help for remediation. I have also been urging public health and environmental authorities in the USA to take the data seriously. This has been hard because there is a tendency in the USA to worry about man-made sources of pollution but not about natural ones.

This talk is my own responsibility, but the ideas therein were gained from the people who have worked in the field and helped me put them into perspective: firstly Professor Chen, then the various scientists in Mongolia led by Dr Luo and the Americans in the Inner Mongolia Cooperative group who are working with him, and Dr Alan Smith. The talk has been modified in the first two and a half days I have been here because of further interactions with the delegates. I acknowledge their ideas freely.

Examples of Data Bases

There are many examples of internationally available data bases to which one can refer. I select a few.

In basic physics and chemistry there are enormous data compilations. The best known is perhaps the Handbook of Chemistry and Physics, called affectionately when I was a student the Rubber Bible because it was compiled by the Chemical Rubber Publishing Co and used like a bible; it was also printed with the same soft cover that Oxford University Press used for bibles.

In my first field, high energy particle physics, there are now huge collaborations of 600 scientists from 30 or more countries. To keep in touch the World Wide Web was invented (by the computer division of the European Centre for Nuclear Research - CERN). Data for millions of particle interactions are stored on a computer accessible to all participants. To satisfy the international aspect, fast data links were established all over the world, to Beijing, Tokyo, Geneva, London and now Moscow. This data base often has 100 Gigabytes of memory - which is cheap at 2 cents per Megabyte.

Cancer registries in the world are a loosely coupled data base. The existence of compatible data files throughout the world enabled comparisons to be made and enabled, for example, both Doll and Higginson to explain why the origin of most cancers is probably environmental rather than genetic.

In 1975 the National Toxicology Program of the USA started a program of animal cancer bioassays for over 400 chemical substances. These data, including pathology on more than 400,000 animals, are on computer and available for analysis. My colleagues and I have found them to be a gold mine from which we are steadily extracting valuable information.

In contrast, all the data from the village of Chandripur we saw on Sunday should be less than 1 Megabyte, and all the arsenic data in the world less than 100 Gigabytes. It should be easy to put all the information onto a computer data base to make it available.

Purposes of a Data Base

The first reason for a data base is similar to the reason for publishing scientific papers: to distribute information about the problems so that they may be known to a wider community than the local scientists, and so that the wider community can help to understand them. One writes a paper because people keep asking questions and it seems best to answer them all at once. But every paper raises further questions: in particular, it invites questions about the data that underlie the conclusions. To answer these it is now simplest to provide access to the raw data.
Scientists other than the original investigators can go back to the raw data, follow the assumptions of the first investigators, and confirm that there is no accidental error. They can try different assumptions and confirm that the results are robust.
Scientists in geographically isolated locations can now be linked up by the Internet with the World Wide Web and can participate. It took me nearly 30 hours to get here. The data can travel at the speed of light and be here in less than a second. We have high energy physicists contributing from obscure locations such as Nigeria or Nablus. Why not arsenic experts?
It becomes easier to ensure, and to demonstrate, that methodological errors are not made in analysis. For example, in epidemiology one must be careful to go through three independent steps:

- choose the study population
- define the outcome (or outcomes)
- measure the dose

If the study population is restricted AFTER the outcome is defined and the dose measured, there is a great danger of creating a tautology and rendering any normal statistical analysis invalid. (A good example was shown by Feynman in his freshman lectures (Goodstein 1989).) By putting on computer ALL the data, with a larger study population than actually selected, it is easier to demonstrate that invalid selection procedures were not used.
If the data are only presented in epidemiological papers the editors of these specialized journals cut out the detail needed for an "outsider" to understand the subject. Access to the original data enables another group of scientists to contribute: the modelers. The modelers have an overall concept or model which needs to be tested quantitatively. For that they do not want P values: they want data.
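The danger of restricting the study population after the outcome is known can be illustrated with a small sketch. Everything in it is invented for illustration: the population, the `doubtful` flag and all the counts are assumptions, not data from any study discussed here. The population is built so that dose and outcome are unrelated, and a post hoc exclusion then manufactures an apparent effect.

```python
from fractions import Fraction


def risk_ratio(records):
    """Risk in the high-dose group divided by risk in the low-dose group."""
    def risk(group):
        members = [r for r in records if r["dose"] == group]
        cases = sum(1 for r in members if r["case"])
        return Fraction(cases, len(members))
    return risk("high") / risk("low")


# Invented population: 100 people per dose group with 10 cases in each,
# so by construction there is no association between dose and outcome.
population = (
    [{"dose": "high", "case": i < 10, "doubtful": False} for i in range(100)]
    + [{"dose": "low", "case": i < 10, "doubtful": i < 5} for i in range(100)]
)

full_rr = risk_ratio(population)

# Post hoc restriction: records flagged "doubtful" are dropped AFTER the
# outcome is known. In this invented example the flag removes only
# low-dose cases, which is exactly the kind of selection to be avoided.
restricted = [r for r in population if not r["doubtful"]]
restricted_rr = risk_ratio(restricted)

print(float(full_rr), float(restricted_rr))
```

Because the full population is kept on file, anyone can rerun the calculation without the exclusion and see that the apparent effect comes from the selection alone.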

With computers the setting up and even the use of a data base is easy. The computer handles data more easily than a person. It makes very few mistakes. Computer memory is now about 5 times cheaper than paper (2 cents a Megabyte). The rules of setting up the data base are simple but not obvious to someone who has not done it. I go through a few here.

Because the handling of data is simpler, it pays to record more data than might be considered necessary IF it is cheap to do so. It is also very easy (often too easy) to find a DELETE button to throw away what is not used in a particular application. So:

(1) Imagine all the questions that might be asked later and record the data that might be useful to answer them.
(2) Record on computer all comments that are on the paper record or, if the data are being put directly on the computer, what would be on paper if the older method were used.
(3) In so far as the technology permits, take measurements with devices that permit direct computer entry.

Examples of (1)

We want a dose response relationship for each of the outcomes. This may not be possible in an absolute way if there is doubt about the dose. But one can ask about the comparison of dose response relationships between the various outcomes. Since there are several outcomes, one cannot put all possibilities on paper. If the original data are on computer anyone can go back and make this comparison as and when it seems important.
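A minimal sketch of such a comparison follows, assuming one record per person with the arsenic concentration of the usual well and a yes/no flag for each outcome. The field names, the bin limits and the six records are all hypothetical and chosen only to show the mechanics.

```python
from collections import defaultdict


def prevalence_by_dose(records, outcome, bin_edges):
    """Fraction of people with `outcome` in each concentration bin.

    bin_edges are ascending bin limits in ug/l. Bin 0 is below the first
    limit and the last bin is open-ended.
    """
    counts = defaultdict(lambda: [0, 0])  # bin index -> [cases, people]
    for rec in records:
        idx = sum(1 for edge in bin_edges if rec["arsenic_ug_l"] >= edge)
        counts[idx][1] += 1
        if rec[outcome]:
            counts[idx][0] += 1
    return {idx: cases / people for idx, (cases, people) in sorted(counts.items())}


# Hypothetical records, for illustration only.
records = [
    {"arsenic_ug_l": 20, "melanosis": False, "keratosis": False},
    {"arsenic_ug_l": 40, "melanosis": False, "keratosis": False},
    {"arsenic_ug_l": 150, "melanosis": True, "keratosis": False},
    {"arsenic_ug_l": 200, "melanosis": False, "keratosis": False},
    {"arsenic_ug_l": 600, "melanosis": True, "keratosis": True},
    {"arsenic_ug_l": 800, "melanosis": True, "keratosis": False},
]
bin_edges = [50, 500]

melanosis_curve = prevalence_by_dose(records, "melanosis", bin_edges)
keratosis_curve = prevalence_by_dose(records, "keratosis", bin_edges)
print(melanosis_curve, keratosis_curve)
```

The same function serves every outcome, so a new comparison needs only a different outcome name or different bin limits, not a new table on paper.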

We may want to ask for correlation and anticorrelation between outcomes. Does a keratosis have to be preceded by melanosis? In the same location or another location? Is either a precursor for lung cancer or a different end point? The possibilities need a computer to understand, so the sooner the data are on computer the better. My group are now finding correlations in cancers produced by chemicals that have not been seen before; this would not have been possible without access to the NTP data base (Linkov et al in press).

One hypothesis is that arsenic is only hazardous in areas where selenium concentrations are low. This suggests that the selenium concentration be included (if available) in the data base.

The Ease of a Modern Data Base

An example of (2): a poster presentation (PP21) of the EGIS system may be seen below. They have data on thousands of wells. It is on computer with each well identified by latitude and longitude measured by a Geographical Information System. The data are presumably available.

Another example of (2). A simple device for rapid measurement of arsenic content in water is proposed in PP-7 (Moderegger et al.). The data from this would go into a computer automatically.

A third example of (3). In the discussion of paper OP-18 (Ali et al), where an X ray analytic technique has been tested, I noted that there are two portable hand held instruments (by Abdul Rahman Nussainov, of Petersburg Nuclear Physics Institute, Gatchina, Russia and Grodzins of MIT, USA) for measuring trace metals by X ray scattering, each of which can store 500 measurements on a chip for downloading at night. If these prove useful they would enable large quantities of reliable data to be recorded.

Suggested Data Files

As noted above all data must be entered exactly as taken in order to prevent errors. A data base might then have several files that are correlated by computer only. I outline a possible set of files for discussing epidemiological problems.

Clearly there will be data on each and every person involved in the population. This should have a name, and an identification number IN THIS STUDY. The name would of course be removed in most copies to ensure confidentiality. This file will need to have:

(A) POPULATION File

- Name
- Identification number
- Date of Birth
- Place of Birth
- Residence
- Usual well
- Years of Residence
- Occupation
- Name of person recording the data
- Date of survey

This population file might be different or it might be the same as the CASE file. In many cases (and for reasons noted above hopefully in most cases) the CASE file will only include a subset of the people in the POPULATION file. There might be several population files and an important task for later analysis will be to reconcile them: a mere mixing will be inadequate because often the same person will be recorded in different ways and appear twice in a combined file. We all know examples of such recording problems, as when (as happened to me on coming here) British Airways recorded the name WILSON once when there were two WILSONs on the same plane! Fortunately there was space and we both flew.
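One way to approach the reconciliation is sketched below. It assumes that name plus date of birth is enough to recognize the same person in two files, which will not always be true; the field names, the matching rule and the example people are all hypothetical. Records are grouped rather than overwritten, so two people who share a name are not collapsed into one and nothing is thrown away.

```python
import unicodedata


def match_key(record):
    """Normalized (name, date of birth) used to spot the same person twice."""
    name = unicodedata.normalize("NFKD", record["name"])
    # Drop punctuation and accents, ignore case and the order of name parts.
    name = "".join(ch for ch in name if ch.isalnum() or ch.isspace())
    name = " ".join(sorted(name.upper().split()))
    return (name, record["date_of_birth"])


def reconcile(*files):
    """Group records from several population files by match key."""
    merged = {}
    for file_records in files:
        for rec in file_records:
            merged.setdefault(match_key(rec), []).append(rec)
    return merged


def deidentify(record):
    """Copy of a record with the name removed, for circulated files."""
    return {k: v for k, v in record.items() if k != "name"}


# Hypothetical people, for illustration only.
file_a = [{"name": "Abdul Karim", "date_of_birth": "1960-01-15", "study_id": "A-001"}]
file_b = [
    {"name": "KARIM, Abdul", "date_of_birth": "1960-01-15", "study_id": "B-017"},
    {"name": "Abdul Karim", "date_of_birth": "1972-03-02", "study_id": "B-018"},
]

merged = reconcile(file_a, file_b)
for key, recs in merged.items():
    print(key, [r["study_id"] for r in recs])
```

Groups with more than one record are candidates for a human check, not automatic merges.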

(B) WELL DATA FILE

- Well identification number
- Well Latitude and Longitude (from GIS)
- Well size
- Well depth
- Knowledge if any of aquifer
- Name of surveyor

(C) WATER DATA FILE

- Well number
- Arsenic concentration
- Trivalent/pentavalent ratio (if measured in spite of Alan Smith)
- Humic acid (and this must be specified in more than the usual detail)
- Fluorine, since this may contribute to cancers
- Selenium

Of course one should also record any of the usual set of chemicals from nitrates to cadmium that happen to be measured at the same time.

The water data file might be the same file as the well data file (the present situation in Inner Mongolia). It will depend upon how the data are recorded. A computer can easily connect the two whenever desired provided that the identification number is consistent. Use of latitude and longitude by GIS would make this easy to do.
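As a rough sketch of that connection, assuming both files carry the same well identification number under a field here called `well_id` (a hypothetical name, as are the other fields and values):

```python
def join_by_well(well_file, water_file):
    """Attach well details to each water measurement with the same well_id.

    Returns (joined, unmatched). Measurements whose well_id is missing from
    the well file are kept in `unmatched` so that nothing is silently lost.
    """
    wells = {w["well_id"]: w for w in well_file}
    joined, unmatched = [], []
    for measurement in water_file:
        well = wells.get(measurement["well_id"])
        if well is None:
            unmatched.append(measurement)
        else:
            joined.append({**well, **measurement})
    return joined, unmatched


# Hypothetical entries, for illustration only.
well_file = [
    {"well_id": "W1", "latitude": 23.10, "longitude": 89.20, "depth_m": 30},
]
water_file = [
    {"well_id": "W1", "arsenic_ug_l": 250},
    {"well_id": "W9", "arsenic_ug_l": 40},
]

joined, unmatched = join_by_well(well_file, water_file)
print(joined, unmatched)
```

The unmatched list is the useful by-product: it shows at once where the identification numbers in the two files are not consistent.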

(D) Occupancy record

- Person number
- Location (village name)
- House (latitude and longitude)
- Well used (number)
- Date of start of use
- Date of end of use

(REPEAT for all wells used: hopefully there will be occupancy records covering entire life)

From the product of (C) and (D) one may estimate EXPOSURE (and hence - with assumptions - DOSE). But it is now conventional to try to use biomarkers when possible. Three are discussed - URINE, NAILS and HAIR. This suggests three further files (E), (F), and (G).
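The product of the water and occupancy records can be sketched as a sum of concentration times duration over all the wells a person has used. The sketch assumes one fixed concentration per well over the whole period of use, and the step from exposure to dose assumes a daily water intake that is a free parameter here (2 litres per day is only an illustrative guess). Field names and values are hypothetical.

```python
from datetime import date


def cumulative_exposure(occupancy, concentration_by_well):
    """Sum of concentration x duration over all periods, in (ug/l)-years."""
    total = 0.0
    for period in occupancy:
        years = (period["end"] - period["start"]).days / 365.25
        total += concentration_by_well[period["well_id"]] * years
    return total


def ingested_dose_mg(exposure_ug_l_years, litres_per_day):
    """Convert exposure to ingested arsenic in mg for an ASSUMED daily intake."""
    return exposure_ug_l_years * litres_per_day * 365.25 / 1000.0


# Hypothetical person who used two wells in succession.
occupancy = [
    {"well_id": "W1", "start": date(1980, 1, 1), "end": date(1990, 1, 1)},
    {"well_id": "W2", "start": date(1990, 1, 1), "end": date(1995, 1, 1)},
]
concentration_by_well = {"W1": 100.0, "W2": 500.0}  # ug/l, invented values

exposure = cumulative_exposure(occupancy, concentration_by_well)
dose = ingested_dose_mg(exposure, litres_per_day=2.0)
print(round(exposure), round(dose))
```

Keeping the intake assumption in a separate function means the exposure figure can be reused when someone prefers a different intake or a different dose model.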

(E) Biomarker file - Urine

- Person identification number
- Arsenic level in urine
- Trivalent/Pentavalent ratio if measured
- Creatinine concentration (for normalization since creatinine is produced at a constant rate)
- Selenium
- Date of Measurement
- Time of Measurement
- Measured by:
- Instrument for measurement
- Calibrated by:
- Calibrated when:
- Comment

Arsenic is rapidly excreted in urine, so that it records the dose for that day: it is not therefore surprising that in Bangladesh a good correlation has been found between urine and water concentrations. However, I note that the slopes deduced for the villages of Samta and Ramganj Thana are 30% different. Are they the same for Inner Mongolia? Or Chile? Hopefully a comparison will tell us. Since the urine, like the water combined with occupancy, only gives us an "instantaneous" exposure record, there seems little to choose between them. But when well concentrations are low, the urine measurements may still be high because they include arsenic exposure by ingestion of food.
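With the raw urine and water records on file, both the creatinine normalization and the slope comparison can be repeated by anyone. The sketch below assumes an ordinary least-squares line with an intercept, which may not be the fit used in the Bangladesh studies, and the village numbers are invented so that the two slopes differ by 30%.

```python
def creatinine_adjusted(arsenic_ug_l, creatinine_g_l):
    """Urinary arsenic per gram of creatinine (ug/g)."""
    return arsenic_ug_l / creatinine_g_l


def ols_slope(x, y):
    """Slope of the ordinary least-squares line of y on x."""
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    return sxy / sxx


# Invented values for two hypothetical villages: water (ug/l) against urine.
water = [100.0, 200.0, 300.0]
urine_village_1 = [150.0, 250.0, 350.0]
urine_village_2 = [130.0, 260.0, 390.0]

slope_1 = ols_slope(water, urine_village_1)
slope_2 = ols_slope(water, urine_village_2)
relative_difference = (slope_2 - slope_1) / slope_1
print(slope_1, slope_2, relative_difference)
```

The same two functions can be applied unchanged to files from another country, which is the comparison asked for above.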

(F) Biomarkers file - nails

(G) Biomarkers file - hair

These can be similar to the urine file, but one should note that, as usually done, they integrate exposure over a period of time. However, one should also note that by X ray scattering one can measure the variation of arsenic along the length of a hair and estimate PAST exposure. This does not seem to be in anyone's data files yet.

(H) Medical Record

- Person ID Number
- Date of Examination
- Skin Hyperpigmentation yes/no
- Skin Hypopigmentation yes/no
- Keratosis yes/no
- Severe? yes/no
- Year of Occurrence
- Repeat for all Lesions at each location
- Examining physician or medical assistant:
- Physician Comments

Note one record for each examination and for each examining physician: the computer can compare and correlate later. Medical examiners may keep records differently. But they should be recorded exactly as each examiner prefers, and (later) a translation code prepared.

An example of this came up in paper OP-5 yesterday (Oshikawa et al presented by Alan Geater). He assigned the cases to four grades of arsenicosis, Grades I through IV. These were from various combinations of symptoms. The file with this assignment should NOT be the primary data file: the grouping of symptoms into grades should be a subsequent computer program that can be varied by others or by the same investigators later. That is simpler and more accurate. It also enables different algorithms for the groupings, and examinations by different physicians, to be compared to see whether that affects any conclusions.
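A sketch of grading as a separate program is given below. The two grading rules are hypothetical and are NOT the scheme of paper OP-5; they only show that, when the primary file holds the symptoms as recorded, alternative rules can be applied to the same records and their disagreements listed. Field names and examination records are invented.

```python
def grade_rule_a(exam):
    """Hypothetical rule A: grade by the most severe finding."""
    if exam["keratosis"] and exam["keratosis_severe"]:
        return "III"
    if exam["keratosis"]:
        return "II"
    if exam["hyperpigmentation"] or exam["hypopigmentation"]:
        return "I"
    return "0"


def grade_rule_b(exam):
    """Hypothetical rule B: grade by the number of findings present."""
    findings = sum(
        bool(exam[field])
        for field in ("hyperpigmentation", "hypopigmentation",
                      "keratosis", "keratosis_severe")
    )
    return ["0", "I", "II", "III", "III"][findings]


def disagreements(exams, rule_1, rule_2):
    """Person IDs for which two grading rules give different grades."""
    return [e["person_id"] for e in exams if rule_1(e) != rule_2(e)]


# Invented primary records: symptoms as recorded, with no grade stored.
exams = [
    {"person_id": 1, "hyperpigmentation": True, "hypopigmentation": False,
     "keratosis": False, "keratosis_severe": False},
    {"person_id": 2, "hyperpigmentation": True, "hypopigmentation": True,
     "keratosis": False, "keratosis_severe": False},
    {"person_id": 3, "hyperpigmentation": True, "hypopigmentation": True,
     "keratosis": True, "keratosis_severe": True},
]

grades_a = [grade_rule_a(e) for e in exams]
grades_b = [grade_rule_b(e) for e in exams]
differing = disagreements(exams, grade_rule_a, grade_rule_b)
print(grades_a, grades_b, differing)
```

Because the grade is computed and never stored in the primary file, changing the rule later costs nothing and leaves the original observations untouched.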

(I) Cancer Record

- Person ID number
- Type of Cancer (Name, International code)
- Location
- Date of cancer
- Examining Physician

(note that this record may be derived later from a cancer registry)

The Special Case of Cancer

Because cancer is a very late reaction (and none were seen at Chandripur) it is likely that a separate data record will be needed for cancers. Here it is useful to recognize that they are of especial interest for developed countries (where the skin lesions do not occur), which can therefore be a source of funding in their own self interest. This is because many theories of cancer and the present US regulatory schemes suggest that cancer incidence is proportional to dose and is therefore important even at cumulative doses where melanoses and keratoses do not seem to occur. My interest in these (noted already in my introduction) was shown most clearly in a paper by Byrd et al (1996). Figure 2 from that paper (reproduced here) shows my fit to the skin cancer PREVALENCE of Tseng et al. The error bars are NOT 95th percentiles but the (smaller) one standard deviation that is used by physical scientists. It can easily be seen that a threshold at about 120 ppb (ug/l) in the water is easily accommodated. Since skin cancer is fatal only 10% of the time, the data on fatal cancers in men from Wu and Chen are necessarily less precise but could also indicate a threshold. But the shocker to me and to others was the graph showing that fatal bladder cancers (figures 7 and 8 from Byrd et al.), both male and female, fit an excellent straight line without a threshold. The dose response relationships appear to be different.

Of course these are "ecological" studies and epidemiologists warn us of the "ecological fallacy". It is impossible to DERIVE a dose-response relationship from such data. But one CAN check hypotheses - and one CAN (and should) search for other data. In 1991 Byrd et al were aware of the studies of the persons who were given Fowler's solution in England. We pointed out that these data showed a slight increase which was statistically insignificant. Therefore we were delighted to see the follow up by Evans et al (paper OP-2 at this conference) and intrigued that these data, with an accumulated dose similar to drinking water with 30 ppb (ug/l) of arsenic for 40 years, give a significant result with an odds ratio of 5/1.6. Smith et al (paper Il-4) have given us the results in Chile at levels 20 times higher. It will be important to put these into alternate MODELS to check whether the models fit or do not fit. One model might be that bladder cancer incidence is directly proportional to dose without a threshold (as suggested by the fit to the "ecological" data) and another might be that there is a threshold, and one can ask what is the possible range of values for this threshold. One of many postulated models might be that skin cancers truly display a threshold with respect to dose, and that internal cancers display a true linearity at least down to levels below what is normally seen.
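The two kinds of model can be written down in a few lines and tested against whatever data are on file. The sketch below is only one possible formulation: a response proportional to dose, and a "hockey stick" that is zero up to a threshold and linear above it, each fitted by unweighted least squares with the threshold scanned over a grid. The doses and responses are invented to follow a threshold model exactly; they are not from Tseng et al, Byrd et al or any other study.

```python
def excess_dose(dose, threshold):
    """Dose above the threshold; a threshold of 0 gives the linear model."""
    return max(0.0, dose - threshold)


def fit_slope(doses, responses, threshold):
    """Least-squares slope and sum of squared errors for a fixed threshold."""
    g = [excess_dose(d, threshold) for d in doses]
    denom = sum(v * v for v in g)
    slope = sum(v * y for v, y in zip(g, responses)) / denom if denom else 0.0
    sse = sum((y - slope * v) ** 2 for v, y in zip(g, responses))
    return slope, sse


def scan_thresholds(doses, responses, thresholds):
    """Fit each candidate threshold; returns {threshold: (slope, sse)}."""
    return {t: fit_slope(doses, responses, t) for t in thresholds}


# Invented data (dose in ug/l, response in arbitrary units).
doses = [0.0, 50.0, 100.0, 200.0, 400.0]
responses = [0.0, 0.0, 0.0, 1.0, 3.0]

fits = scan_thresholds(doses, responses, [0.0, 50.0, 100.0, 150.0, 200.0])
best_threshold = min(fits, key=lambda t: fits[t][1])
linear_slope, linear_sse = fits[0.0]
print(best_threshold, fits[best_threshold], linear_sse)
```

With real data one would weight each point by its error bar, and the thresholds whose fit is not significantly worse than the best one would give the possible range of values asked about above.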

For this reason it is interesting to be sure that cancer records are collected especially in the affected regions.

While one can and must argue about the relationship of concentrations to dose, the difference in the dose response of the two outcomes discussed in Wu and Chen and above seems real. We should not be surprised at such differences; we must expect them, and the data base must be set up to study them. For one well studied agent, ionizing radiation, the dose response for inducing Acute Myelogenous Leukemia (AML) appears to be different from that for inducing Chronic Myelogenous Leukemia (CML) (Preston and Pierce 198x).

Miscellaneous Problems

A data base is only as useful as the people who use it. Therefore anyone offering a file for that data base (or, as noted before, data network) will want to ensure that it is in a form that is easily understood by others. I suggest that it will therefore be useful if everyone chooses someone else to help go over the data, in the hope of making their use transparent. I note that the EGIS data files do not yet seem to have been used by others, and I urge work to make them easily available, on the one side, and to understand them, on the other.

An important question arises in all endeavors, including scientific ones, of intellectual property rights. It is important that those who collected the data and contribute them to the data base be recognized for their work, both financially and in scientific and public recognition. Once a data base is established this is likely to be achieved automatically. Funding agencies are likely to give adequate funding and insist that the data be available: and the more people use the data (and refer to them) the more the recognition.

Funding is always a severe problem. Agencies tend to be compartmentalized. But a multi-purpose data base that is easily available to all scientists is likely to attract funds. Several agencies might combine to make it a reality.

Thus the international "funds" are primarily interested in water supplies and perhaps remediation. They might fund (as UNDP has funded EGIS) the water part of the base. The funds might include the World Bank, the Asian Development Bank and, since some nations with large Islamic populations are concerned, the Kuwait Fund for Social and Economic Development. The resources available for the medical examinations are probably less, but the WHO should be interested. As noted above, the US Environmental Protection Agency (EPA) and the US Center for Disease Control (CDC) will be interested if work on cancer at concentrations below 5 ppb (ug/l) is included. It is a great pity Dr Luo from Huhhot, Inner Mongolia is not here. He and his scientists at his Anti Epidemic Station, with Professor He and Dr Piao from Beijing and Dr Lamm from the USA, are making a data base which we hope will soon be available. In the discussions above I have benefitted from what they are doing.

Conclusion

There are several reasons for a good international data base or network of data files: to communicate information in the most reliable and simplest way, to enable data from different countries, locations and times to be compared, and to enable persons in other, isolated, countries to participate.

I was very pleased to hear the last talk in the conference by Professor Quasi Quamruzzaman about his "Rapid Action Program" (paper OP-36). The plan seems to be to record data from 200 villages along the lines I suggested in the talk. I am pleased that he plans to have all data immediately recorded on computer. This should enable the data to be widely disseminated both nationally and internationally.

But as we carry out these essential tasks of detail it is important to remember at all times why we are here. The field trip on the first day reminded us that we are discussing the suffering and sadness of many people in many villages. We are here because we share their sadness in some small measure and want to reduce it as much as we can. We must remember the priorities:

- To help each villager get pure water
- To alleviate the suffering if we can
- To ensure, in so far as is possible, that the exposure is not prolonged
- To ensure that it will NEVER be repeated, here or elsewhere in the world

It is to these ends, especially the last, that the data base can and must be addressed.