The Need for an International Data Base for Arsenic
15 Feb 98 draft of a talk given at: International Conference on Arsenic Pollution of Ground
Water in Bangladesh: Causes Effects and Remedies, February 1998
Richard Wilson, D. Phil,
It is nearly 30 years since the data from Taiwan on the incidence and
prevalence of skin cancers became available and were attributed to arsenic
in the well waters (Tseng et al 1968). It is 10 years since Professor Chien-
Jen Chen (Wu and Chen 198x, Chen et al 1988, Chen and Wang 1990) shocked
us all by his reports that in the same area there seemed to be an increase
in fatal skin cancers, and more important, that fatal cancers appeared
in several other sites.
After the initial shock wore off, many scientists throughout the world
started looking for places where the same effects might be occurring, and
other scientists in less fortunate parts of the world began to realize
that some of the problems that were present in their countries might be
due to arsenic. Several conferences are now taking stock of the situation,
and it seems now appropriate to discuss what might be achieved by having
an internationally available data base, or set of data files that might
be called a data network.
My personal interest in airborne arsenic dates back 20 years when I
started thinking about comparison of carcinogenic potency between species.
Arsenic seemed to be, and still does seem to be an exception to the "usual"
rules which were admittedly developed from a study of organic carcinogens.
In 1991 I and others (Byrd et al) were trying to understand acceptable
levels of natural arsenic from mine tailings when we came across Dr Chen's
we were looking at interspecies. On a trip to Taiwan I called on Dr Chen
and reported at once that his results seemed to me completely believable.
We did our own numerical analysis which we reported at a medical meeting
in April 1992 and published sometime later (Byrd et al 1996). Professor
Smith came to similar conclusions.
On another trip to Taiwan in 1994, Dr Chen put me in touch with Dr Luo
in Huhhot, who had been one of those who was following the literature and
deduced that the problems in Inner Mongolia were probably due to arsenic.
Since then we have been trying to help to bring Dr Luo's data to international
attention, (Luo et al 1995) and to get international help for remediation.
I have also been urging public health and environmental authorities in
the USA to take the data seriously. This has been hard because there is
a tendency in the USA to worry about man made sources of pollution but
not to worry about natural ones.
This talk will be my own responsibility but the ideas therein are gained
from the people who have worked in the field and helped me put them into
perspective: firstly Professor Chen, then the various scientists in Mongolia
led by Dr Luo, and the Americans in the Inner Mongolia Cooperative group
who are working with him, and Dr Alan Smith. The talk has been modified
in the first two and a half days I have been here because of further interactions
with the delegates here. I acknowledge their ideas freely.
Examples of Data Bases
There are many examples of internationally available data bases to which
one can refer. I select a few.
In basic physics and chemistry there are enormous data compilations.
The best known is perhaps the Handbook of Physics and Chemistry - called
affectionately when I was a student the Rubber Bible because it was compiled
by the Chemical Rubber Publishing Co, and used like a bible: also printed
with the same soft cover that Oxford University Press used for bibles.
In my first field, high energy particle physics, there are now huge
collaborations of 600 scientists from 30 or more countries. To keep in
touch the World Wide Web was invented (by the computer division of the
European Centre for Nuclear Research - CERN). Data for millions of particle
interactions are stored on a computer accessible to all participants. To
satisfy the international aspect fast data links were established all over
the world, to Beijing, Tokyo, Geneva, London and now Moscow. This data
base often has 100 Gigabytes of memory - which is cheap at 2 cents per
Cancer registries in the world are a loosely coupled data base. The
existence of compatible data files throughout the world enabled comparisons
to be made and enabled, for example, both Doll and Higginson to explain
why the origin of most cancers is probably environmental rather than genetic.
In 1975 the National Toxicology program of the USA started a program
of animal cancer bioassays for over 400 chemical substances. These data,
including pathology on more than 400,000 animals, are on computer and available
for analysis. My colleagues and I have found it to be a gold mine from
which we are steadily extracting valuable information.
In contrast all the data from the village of Chandripur we saw on Sunday
should be less than 1 Megabyte and all the arsenic data in the world
less than 100 Gigabytes. It should be easy to put all the information onto
a computer data base and make it available.
Purposes of a Data Base
- The first reasons for a data base are similar to the reasons for publishing
scientific papers: to distribute information about the problems so that
they may be known to a wider community than the local scientists, and the
wider community can help to understand them. One writes a paper because
people keep asking questions and it seems best to answer them all at once.
But every paper raises further questions: in particular, they invite questions
about the data that underlie the conclusions. To answer these it is now
simplest to provide access to the raw data.
- Other scientists than the original investigators can go back to the
raw data; follow the assumptions of the first investigators and confirm
that there is no accidental error. They can try different assumptions and
confirm that the results are robust.
- Scientists in geographically isolated locations can now be linked up
by the Internet with the World Wide Web and can participate. It took me
nearly 30 hours to get here. The data can travel at the speed of light
and be here in less than a second. We have high energy physicists contributing
from obscure locations such as Nigeria or Nablus. Why not arsenic experts?
- Ease of ensuring and demonstrating that methodological errors are not
made in analysis. For example in epidemiology one must be careful to go
through three independent steps:
- - choose the study population
- - define the outcome (or outcomes)
- - measure the dose
- If the study population is restricted AFTER the outcome is defined
and the dose measured there is a great danger of creating a tautology and
rendering any normal statistical analysis invalid. (A good example was
shown by Feynman in his freshman lectures (Goodstein 1989).) By putting
on computer ALL the data with a larger study population than actually selected,
it is easier to demonstrate that invalid selection procedures were not
- If the data are only presented in epidemiological papers the editors
of these specialized journals cut out the detail needed for an "outsider"
to understand the subject. Access to the original data enables another
group of scientists to contribute: the modelers. The modelers have an overall
concept or model which needs to be tested quantitatively. For that they
do not want P values: they want data.
With computers the setting up and even the use of a data base is easy.
The computer handles data more easily than a person. It makes very few
mistakes. Computer memory is now about 5 times cheaper than paper (2 cents
a Megabyte). The rules of setting up the data base are simple but not obvious
to someone who has not done it. I go through a few here.
Because the handling of data is simpler, it pays to record more data
than might be considered necessary IF it is cheap to do so. It is also
very easy (often too easy) to find a DELETE button to throw away what is
not used in a particular application. So:
(1) imagine all the questions that might be asked later and record the
data that might be useful to answer them.
(2) record on computer all comments that are on the paper record, or if
the data are being put directly on the computer, what would be on paper
if the older method were used.
(3) in so far as the technology permits, take measurements with devices
that permit direct computer entry.
Examples of (1)
We want a dose response relationship for each of the outcomes. This
may not be possible in an absolute way if there is doubt about the dose.
But one can ask about the comparison of dose response relationships between
the various outcomes. Since there are several outcomes, one cannot put
all possibilities on paper. If the original data are on computer anyone
can go back and make this comparison as and when it seems important.
We may want to ask for correlation and anticorrelation between outcomes.
Does a keratosis have to be preceded by melanosis? In the same location
or another location? Is either a precursor for lung cancer or a different
end point? The possibilities need a computer to understand so the sooner
the data are on computer the better. My group are now finding correlations
in cancer produced by chemicals that have not been seen before and would
not have been possible without access to the NTP data base (Linkov et al
One hypothesis is that arsenic is only hazardous in areas where selenium
concentrations are low. This suggests that selenium be included (if available) in
the data base.
The Ease of a Modern Data Base
An example of (2): a poster presentation (PP21) of the EGIS system may
be seen below. They have data on thousands of wells. It is on computer
with each well identified by latitude and longitude measured by a Geographical
Information System. The data are presumably available.
Another example of (2). A simple device for rapid measurement of arsenic
content in water is proposed in PP-7 (Moderegger et al.). The data from
this would go into a computer automatically.
A third example of (3). In discussion of paper OP-18 (Ali et
al), where an X ray analytic technique has been tested, I noted that there
are two portable hand held instruments (by Abdul Rahman Nussainov, of Petersburg
Nuclear Physics Institute, Gatchina, Russia and Grodzins of MIT, USA) for
measuring trace metals by X ray scattering, each of which can store 500
measurements on a chip for downloading at night. If these prove useful
they would enable large quantities of reliable data to be recorded.
Suggested Data Files
As noted above all data must be entered exactly as taken in order to
prevent errors. A data base might then have several files that are correlated
by computer only. I outline a possible set of files for discussing epidemiological
Clearly there will be data on each and every person involved in the
population. This should have a name, and an identification number IN THIS
STUDY. The name would of course be removed in most copies to ensure confidentiality.
This file will need to have:
(A) POPULATION File
- Name
- Identification number
- Date of Birth
- Place of Birth
- Residence
- Usual well
- Years of Residence
- Occupation
- Name of person recording the data
- Date of survey
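The population file and the removal of names from distributed copies can be sketched in a few lines. This is only an illustration under stated assumptions: the Python field names, the use of text strings for dates, and the helper `anonymized` are hypothetical choices, not a format prescribed by any of the groups mentioned here.

```python
from dataclasses import dataclass, replace
from typing import Optional


@dataclass(frozen=True)
class PopulationRecord:
    """One row of a hypothetical POPULATION file (field names are assumptions)."""
    person_id: int             # identification number IN THIS STUDY
    name: Optional[str]        # removed in most copies for confidentiality
    date_of_birth: str         # stored exactly as taken, here as text
    place_of_birth: str
    residence: str
    usual_well: str            # well identification number
    years_of_residence: float
    occupation: str
    recorded_by: str           # name of person recording the data
    survey_date: str


def anonymized(record: PopulationRecord) -> PopulationRecord:
    """Return a copy with the name removed; the study ID still links the files."""
    return replace(record, name=None)
```

The identification number, not the name, is what the other files refer to, so the anonymized copy loses nothing that later analysis needs.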
This population file might be different or it might be the same as the
CASE file. In many cases (and for reasons noted above hopefully in most
cases) the CASE file will only include a subset of the people in the POPULATION
file. There might be several population files and an important task for
later analysis will be to reconcile them: a mere mixing will be inadequate
because often the same person will be recorded in different ways and appear
twice in a combined file. We all know examples of this when (as happened
to me on coming here) British Airways recorded the name WILSON once when
there were two WILSONs on the same plane! Fortunately there was space and
we both flew.
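The reconciliation of several population files might begin along the following lines. This is a rough sketch, not a complete record-linkage method: the matching key (name tokens plus date of birth) and the dictionary field names are assumptions made for illustration, and the output is a list of candidates for a person to review rather than an automatic merge.

```python
import re
from collections import defaultdict


def match_key(name, date_of_birth):
    """Crude key: lower-cased name tokens in sorted order, plus date of birth.

    Using the name alone would merge two different WILSONs; adding the
    date of birth keeps them apart.
    """
    tokens = tuple(sorted(re.findall(r"\w+", name.casefold())))
    return (tokens, date_of_birth)


def candidate_duplicates(*population_files):
    """Group records from several files that share a match key.

    Each record is a dict with "source", "person_id", "name" and
    "date_of_birth". Only groups with more than one record are returned.
    """
    groups = defaultdict(list)
    for records in population_files:
        for rec in records:
            groups[match_key(rec["name"], rec["date_of_birth"])].append(rec)
    return [group for group in groups.values() if len(group) > 1]
```

Nothing is deleted by this step: the original files stay as recorded, and the decision that two entries are the same person is kept as a separate, revisable table.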
(B) WELL DATA FILE
- Well identification number
- Well Latitude and Longitude (from GIS)
- Well size
- Well depth
- Knowledge if any of aquifer
- Name of surveyor
(C) WATER data File
- Well number
- Arsenic concentration
- Trivalent/pentavalent ratio (if measured in spite of Alan Smith)
- Humic acid (and this must be specified in more than the usual detail)
- Fluorine since this may contribute to cancers
- Selenium
Of course one should also record any of the usual set of chemicals from
nitrates to cadmium that happen to be measured at the same time.
The water data file might be the same file as the well data file (the
present situation in Inner Mongolia). It will depend upon how the data
are recorded. A computer can easily connect the two whenever desired provided
that the identification number is consistent. Use of latitude and longitude
by GIS would make this easy to do.
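Connecting the water file to the well file by a consistent identification number might look like the sketch below. The key name `well_id` and the dictionary layout are assumptions for illustration only.

```python
def join_water_to_wells(wells, water_measurements):
    """Attach each water measurement to its well record by well_id.

    Returns (joined, unmatched). Measurements whose well_id is not in the
    well file are kept in `unmatched` rather than discarded, so that
    recording errors can be traced later.
    """
    wells_by_id = {well["well_id"]: well for well in wells}
    joined, unmatched = [], []
    for measurement in water_measurements:
        well = wells_by_id.get(measurement["well_id"])
        if well is None:
            unmatched.append(measurement)
        else:
            # Merge the two records; measurement fields win on a name clash.
            joined.append({**well, **measurement})
    return joined, unmatched
```

A non-empty `unmatched` list is itself useful information: it shows where the identification numbers have not been kept consistent between the two files.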
(D) Occupancy record
- Person number
- Location (village name)
- House - (latitude and longitude)
- Well used (number)
- Date of start of use
- Date of end of use
(REPEAT for all wells used: hopefully there will be occupancy records covering entire
From the product of (C) and (D) one may estimate EXPOSURE (and hence
- with assumptions - DOSE). But it is now conventional to try to use biomarkers
when possible. Three are discussed - URINE, NAILS and HAIR. This suggests
three further files (E), (F), and (G).
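The "product" of the water file and the occupancy record can be made concrete as follows. This is a minimal sketch under two loudly stated assumptions: each well is given a single arsenic concentration that is treated as constant over the whole period of use, and exposure is expressed in ppb-years. Turning exposure into DOSE would need further assumptions (for example daily water intake) that are not made here. The function names are hypothetical.

```python
from datetime import date


def years_between(start: date, end: date) -> float:
    """Length of a period of well use, in years."""
    return (end - start).days / 365.25


def cumulative_exposure(occupancy, arsenic_ppb_by_well):
    """Sum of concentration x duration over all wells used, in ppb-years.

    `occupancy` is a list of (well_id, start_date, end_date) for one person.
    """
    return sum(
        arsenic_ppb_by_well[well_id] * years_between(start, end)
        for well_id, start, end in occupancy
    )


def time_weighted_average(occupancy, arsenic_ppb_by_well):
    """Average concentration (ppb) over the total recorded period of use."""
    total_years = sum(years_between(start, end) for _, start, end in occupancy)
    return cumulative_exposure(occupancy, arsenic_ppb_by_well) / total_years
```

Because the calculation is a separate program run on the raw files, anyone can repeat it with different assumptions about how concentrations varied over time.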
(E) Biomarker file - Urine
- Person identification number
- Arsenic level in urine
- Trivalent/Pentavalent ratio if measured
- Creatinine concentration (for normalization since creatinine is produced at a constant rate)
- Selenium
- Date of Measurement
- Time of Measurement
- Measured by:
- Instrument for measurement
- Calibrated by:
- Calibrated when:
Arsenic is rapidly excreted in urine, so that it records the dose for
that day: it is not therefore surprising that in Bangladesh a good correlation
has been found between urine and water concentrations. However, I note
that the slopes deduced for the villages of Samta and Ramganj
Thana are 30% different. Are they the same for Inner Mongolia? Or Chile?
Hopefully a comparison will tell us. Since the urine, like the water combined
with occupancy, only gives us an "instantaneous" exposure record,
there seems little to choose between them. But when well concentrations
are low, the urine measurements may still be high because they include
arsenic exposure by ingestion of food.
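Two small calculations are implied by the urine file: normalization by creatinine, and the slope of urine concentration against water concentration that one would compare between villages. The sketch below assumes units of ug/l for arsenic and g/l for creatinine, and uses an ordinary least-squares slope; both the units and the choice of fitting method are assumptions for illustration, not the procedure used in the studies cited.

```python
def creatinine_adjusted(arsenic_ug_per_l, creatinine_g_per_l):
    """Urinary arsenic per gram of creatinine (ug/g)."""
    if creatinine_g_per_l <= 0:
        raise ValueError("creatinine concentration must be positive")
    return arsenic_ug_per_l / creatinine_g_per_l


def least_squares_slope(water_ppb, urine_values):
    """Ordinary least-squares slope of urine values against water concentrations."""
    n = len(water_ppb)
    mean_x = sum(water_ppb) / n
    mean_y = sum(urine_values) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(water_ppb, urine_values))
    sxx = sum((x - mean_x) ** 2 for x in water_ppb)
    return sxy / sxx
```

Running `least_squares_slope` separately on the records for each village, or each country, gives the slopes to be compared.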
(F) Biomarkers file - nails
(G) Biomarkers file - hair
These can be similar to (E), but one should note that as usually done they
integrate exposure over a period of time. By X ray scattering, however,
one can measure the variation of arsenic in hair along its length and estimate
PAST exposure. This does not seem to be in anyone's data files yet.
(H) Medical Record
- Person ID Number
- Date of Examination
- Skin Hyperpigmentation yes/no
- Skin Hypopigmentation yes/no
- Keratosis yes/no
- Severe? yes/no
- Year of Occurrence
- Repeat for all Lesions at each location.
- Examining physician or medical assistant:
- Physician Comments
Note one record for each examination and for each examining physician:
the computer can compare and correlate later. Medical examiners may keep
records differently. But they should be recorded exactly as each examiner
prefers, and (later) a translation code prepared.
An example of this came up in paper OP-5 yesterday (Oshikawa et al presented
by Alan Geater). He assigned the cases into four grades of arsenicosis,
Grades I through IV. These were from various combinations of symptoms.
The file with this assignment should NOT be the primary data file: the
grouping of symptoms into grades should be a subsequent computer program
that can be varied by others or by the same investigators later. That is
simpler and more accurate. It also enables different algorithms for the
groupings, and examinations by different physicians to be compared to see
whether that affects any conclusions.
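Keeping the grading as a subsequent program, separate from the primary file, might be sketched as below. The two grading rules are entirely hypothetical and are NOT the scheme of paper OP-5; they exist only to show that different algorithms can be run over the same untouched examination records and then compared.

```python
SYMPTOMS = ("hyperpigmentation", "hypopigmentation", "keratosis", "severe")


def grade_rule_a(exam):
    """Hypothetical rule A: the number of positive findings, 0 to 4."""
    return sum(bool(exam[symptom]) for symptom in SYMPTOMS)


def grade_rule_b(exam):
    """Hypothetical rule B: keratosis outranks pigmentation changes."""
    if exam["keratosis"]:
        return 4 if exam["severe"] else 3
    if exam["hyperpigmentation"] and exam["hypopigmentation"]:
        return 2
    if exam["hyperpigmentation"] or exam["hypopigmentation"]:
        return 1
    return 0


def apply_grading(exams, rule):
    """Derive grades from raw examination records without altering them.

    One grade per examination, keyed by person, date and examining physician.
    """
    return {
        (exam["person_id"], exam["exam_date"], exam["physician"]): rule(exam)
        for exam in exams
    }
```

Calling `apply_grading` twice with different rules, or on the records of different physicians, gives tables that can be compared directly to see whether the choice affects any conclusion.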
(I) Cancer Record
- Person ID number
- Type of Cancer (Name, International code)
- Location
- Date of cancer
- Examining Physician
(note that this record may be derived later from a cancer registry)
The Special Case of Cancer
Because cancers are a very late reaction (and none were seen at Chandripur)
it is likely that a separate data record will be needed for them. Here
it is useful to recognize that they are of especial interest for developed
countries (where the skin lesions do not occur) and therefore can be a
source of funding in their own self interest. This is because many theories
of cancer and the present US regulatory schemes suggest that cancer incidence
is proportional to dose and is therefore important even at cumulative doses
where melanoses and keratoses do not seem to occur. My interest in these
(noted already in my introduction) was shown most clearly in a paper by
Byrd et al (1996). Figure 2 from that paper (reproduced here) shows my
fit to the skin cancer PREVALENCE of Tseng et al. The error bars are NOT
95th percentiles but the (smaller) one standard deviation that is used
by physical scientists. It can easily be seen that a threshold at about
120 ppb (ug/l) in the water is easily accommodated. Since skin cancer is
fatal only 10% of the time, the data on fatal cancers in men from the data
of Wu and Chen are necessarily less precise but could also indicate a threshold.
But the shocker to me and to others was the graph showing that fatal bladder cancers
(figures 7 and 8 from Byrd et al.), both male and female, fit an excellent
straight line without a threshold. The dose response relationships appear
to be different.
Of course these are "ecological" studies and epidemiologists
warn us of the "ecological fallacy". It is impossible to DERIVE
a dose-response relationship from such data. But one CAN check hypotheses
- and one CAN (and should) search for other data. In 1991 Byrd et al were
aware of the studies of the persons who were given Fowler's solution in
England. We pointed out that these data showed a slight increase which
was statistically insignificant. Therefore we were delighted to see the
follow up by Evans et al (paper OP-2 at this conference) and intrigued
that these data, with an accumulated dose similar to drinking water with
30 ppb (ug/l) of arsenic for 40 years, give a significant result with an
odds ratio of 5/1.6. Smith et al (paper Il-4) have given us the results
in Chile at levels 20 times higher. It will be important to put these into
alternate MODELS to check whether the models fit or do not fit. One model
might be that bladder cancer incidence is directly proportional to dose
without a threshold (as suggested by the fit to the "ecological"
data) and another might be that there is a threshold and one can ask what
is the possible range of values for this threshold. One of many postulated
models might be that skin cancers truly display a threshold with respect
to dose, and that internal cancers display a true linearity at least down
to levels below what is normally seen.
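The two kinds of model can be written down and compared against the same data in a few lines. The sketch below is an illustration only: it assumes a response that is linear above a threshold, uses a weighted least-squares slope with one standard deviation error bars, and scans a list of trial thresholds. It is not the analysis of Byrd et al, and all names are hypothetical.

```python
def linear_with_threshold(dose, slope, threshold=0.0):
    """Response proportional to dose above a threshold; threshold=0 means no threshold."""
    return slope * max(0.0, dose - threshold)


def best_slope(doses, observed, sigmas, threshold):
    """Weighted least-squares slope for a fixed threshold."""
    xs = [max(0.0, d - threshold) for d in doses]
    num = sum(x * y / s ** 2 for x, y, s in zip(xs, observed, sigmas))
    den = sum(x * x / s ** 2 for x, s in zip(xs, sigmas))
    return num / den if den > 0 else 0.0


def chi_square(doses, observed, sigmas, slope, threshold):
    """Sum of squared deviations in units of one standard deviation."""
    return sum(
        ((y - linear_with_threshold(d, slope, threshold)) / s) ** 2
        for d, y, s in zip(doses, observed, sigmas)
    )


def scan_thresholds(doses, observed, sigmas, thresholds):
    """For each trial threshold return (threshold, fitted slope, chi-square)."""
    results = []
    for t in thresholds:
        slope = best_slope(doses, observed, sigmas, t)
        results.append((t, slope, chi_square(doses, observed, sigmas, slope, t)))
    return results
```

The range of thresholds for which the chi-square stays acceptably small is one answer to the question of what threshold values the data allow; a threshold of zero corresponds to the proportional model.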
For this reason it is interesting to be sure that cancer records are
collected especially in the affected regions.
While one can and must argue about the relationship of concentrations
to dose, the difference in the dose response of the two outcomes discussed
in Wu and Chen and above seems real. We should not be surprised at such
differences and must expect them and the data base must be set up to study
them. On one well studied agent, ionizing radiation, the dose response
for inducing Acute Myelogenous Leukemia (AML) appears to be different from
that for inducing Chronic Myelogenous Leukemia (CML) (Preston and Pierce
A data base is only as useful as the people who use it. Therefore anyone
offering a file for that data base (or as before noted data network) will
want to ensure that it is in a form that is easily understood by others.
I suggest that it will therefore be useful if everyone chooses someone else
to help go over the data to help make their use transparent. I note that
the EGIS data files do not yet seem to have been used by others, and I
urge work to make them easily available, on one side, and to understand
them on the other.
An important question arises in all endeavors including scientific ones,
of intellectual property rights. It is important that those who collected
the data and contribute them to the data base be recognized for their
work, both financially and in scientific and public recognition. Once a
data base is established this is likely to be achieved automatically. Funding
agencies are likely to give adequate funding and insist that the data be
available: and the more people use the data (and refer to them) the more
Funding is always a severe problem. Agencies tend to be compartmentalized.
But a multi-purpose data base that is easily available to all scientists
is likely to attract funds. Several agencies might combine to make it a
Thus the international "funds" are primarily interested in
water supplies and perhaps remediation. They might fund (as UNDP has funded
EGIS) the water part of the base. The funds might include the World Bank,
the Asian Development Bank, and since some nations with large Islamic populations
are concerned, the Kuwait Fund for Social and Economic Development. The
resources available for the medical examinations are probably less, but
the WHO should be interested. As noted above the US Environmental Protection
Agency (EPA), and the US Center for Disease Control (CDC) will be interested
if work on cancer at concentrations below 5 ppb (ug/l) is included. It
is a great pity Dr Luo from Huhhot, Inner Mongolia is not here. He and his
scientists at his Anti Epidemic Station, with Professor He and Dr Piao
from Beijing and Dr Lamm from USA are making a data base which we hope
will soon be available. In the discussions above I have benefitted from
what they are doing.
There are several reasons for a good international data base or network
of data files: to communicate information in the most reliable and simplest
way; to enable data from different countries, locations and times to be
compared; and to enable persons in other, isolated, countries to participate.
I was very pleased to hear the last talk in the conference by Professor
Quasi Quamruzzaman about his "Rapid Action Program" (paper OP-36).
The plan seems to be to record data from 200 villages along the lines I
suggested in the talk. I am pleased that he plans to have all data immediately
recorded on computer. This should enable the data to be widely disseminated
both nationally and internationally.
But as we carry out these essential tasks of detail it is important
to remember at all times why we are here. The field trip on the first day
reminded us that we are discussing the suffering and sadness of many people
in many villages. We are here because we share their sadness in some small
measure and want to reduce it as much as we can. We must remember the priorities:
- To help each villager get pure water
- To alleviate the suffering if we can
- To ensure, in so far as is possible, that the exposure is not prolonged
- To ensure that it will NEVER be repeated, here or elsewhere in the world
It is to these ends, especially the last, that the data base can and
must be addressed.