Most servers log every access. The log usually includes the IP address and/or host name, the time of the download, the user's name (if known by user authentication or obtained by the identd protocol), the URL requested (including the values of any variables from a form submitted using the GET method), the status of the request, and the size of the data transmitted. Some browsers also provide the client the reader is using, the URL that the client came from, and the user's e-mail address. Servers can log this information as well, or make it available to CGI scripts. Most WWW clients are probably run from single-user machines, thus a download can be attributed to an individual. Revealing any of those datums could be potentially damaging to a reader.
For example, XYZ.com downloading financial reports on ABC.com could signal a corporate takeover. The accesses to a internal job posting reveals who might be interested in changing jobs. The time a cartoon was downloaded reveals that the reader is misusing company resources. A referral log entry might contain something like:
file://prez.xyz.com/hotlists/stocks2sellshort.html -> http://www.xyz.com/
The pattern of accesses made by an individual can reveal how they intend to use the information. And the input to searches can be particularly revealing.
Another way Web usage can be revealed locally is via browser history, hotlists, and cache. If someone has access to the reader's machine, they can check the contents of those databases. An obvious example is shared machines in an open lab or public library.
Proxy servers used for access to Web services outside an organization's firewall are in a particularly sensitive position. A proxy server will log every access to the outside Web made by every member of the organization and track both the IP number of the host making the request and the requested URL. A carelessly managed proxy server can therefore represent a significant invasion of privacy.
If you are a government site, you may be required by law to protect the privacy of your readers. For example, U.S. Federal agencies are not allowed to collect or publish many types of data about their clients.
In most U.S. states, it is illegal for libraries and video stores to sell or otherwise distribute records of the materials that patrons have checked out. While the courts have yet to apply the same legal standard to be applied to electronic information services, it is not unreasonable for users to have the same expectation of privacy on the Web. In other countries, for example Germany, the law explicitly forbids the disclosure of online access lists. If your site chooses to use the Web logs to populate your mailing lists or to resell to other businesses, make sure you clearly advertise that fact.
The easiest way to avoid collecting too much information is to use a server that allows you to tailor the output logs, so that you can throw away everything but the essentials. Another way is to regularly summarize and discard the raw logs. Since the logs of popular sites tend to grow quickly, you probably will need to do that anyway.
You can protect outsiders by summarizing your logs. You can help protect insiders by:
If your site does not want to reveal certain Web accesses from your site's domain, you may need to get Web client accounts from another Internet provider that can provide anonymous access.