Data proliferation

Data proliferation refers to the prodigious amount of data, structured and unstructured, that businesses and governments continue to generate at an unprecedented rate and the usability problems that result from attempting to store and manage that data. While originally pertaining to problems associated with paper documentation, data proliferation has become a major problem in primary and secondary data storage on computers.

While digital storage itself has become cheaper, the associated costs (raw power, maintenance, metadata management, and search) have not fallen at the same rate as data has proliferated. Although the power required to maintain a unit of data has fallen, the cost of the facilities that house digital storage has tended to rise.[1]

At the simplest level, company e-mail systems spawn large amounts of data. Business e-mail – some of it important to the enterprise, some much less so – is estimated to be growing at a rate of 25-30% annually. And whether it’s relevant or not, the load on the system is being magnified by practices such as multiple addressing and the attaching of large text, audio and even video files.

—IBM Global Technology Services[2]

Data proliferation has been documented as a problem for the U.S. military since August 1971, particularly regarding the excessive documentation submitted during the acquisition of major weapon systems.[3] Efforts to mitigate data proliferation and its associated problems are ongoing.[4]


Problems caused

The problem of data proliferation is affecting all areas of commerce as the result of the availability of relatively inexpensive data storage devices. This has made it very easy to dump data into secondary storage immediately after its window of usability has passed. This masks problems that could gravely affect the profitability of businesses and the efficient functioning of health services, police and security forces, local and national governments, and many other types of organization.[2] Data proliferation is problematic for several reasons:

  • Difficulty in finding and retrieving information. At Xerox, employees spend on average more than one hour per week finding hard-copy documents, which cost $2,152 a year to manage and store. For businesses with more than 10 employees, this increases to almost two hours per week at $5,760 per year.[5] In large networks of primary and secondary data storage, finding electronic data poses problems analogous to finding hard-copy data.
  • Data loss and legal liability when data is disorganized, not properly replicated, or cannot be found in a timely manner. In April 2005, Ameritrade Holding Corporation told 200,000 current and past customers that a tape containing confidential information had been lost or destroyed in transit. In May of the same year, Time Warner Incorporated reported that 40 tapes containing personal data on 600,000 current and former employees had been lost en route to a storage facility. In March 2005, a Florida judge hearing a $2.7 billion lawsuit against Morgan Stanley issued an "adverse inference order" against the company for "willful and gross abuse of its discovery obligations." The judge cited Morgan Stanley for repeatedly finding misplaced tapes of e-mail messages long after the company had claimed that it had turned over all such tapes to the court.[6]
  • Increased manpower requirements to manage increasingly chaotic data storage resources.
  • Slower networks and application performance due to excess traffic as users search and search again for the material they need.[2]
  • High cost in terms of the energy resources required to operate storage hardware. A 100 terabyte system can cost up to $35,040 a year to run, not counting cooling costs.[7]
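The $35,040 figure above can be reproduced under one set of illustrative assumptions; the power draw and electricity price below are hypothetical values chosen only to match the cited total, and are not taken from reference [7]:

```python
# Hypothetical sketch: one way to arrive at ~$35,040/year for a 100 TB system.
# The power draw and electricity price are assumptions, not figures from [7].
power_kw = 40.0            # assumed continuous draw of the storage system
price_per_kwh = 0.10       # assumed electricity price in USD
hours_per_year = 24 * 365  # 8,760 hours in a non-leap year

annual_energy_cost = power_kw * price_per_kwh * hours_per_year
print(f"${annual_energy_cost:,.0f} per year")  # prints "$35,040 per year"
```

Any combination of draw and price whose product is $4.00 per hour yields the same total, so the split between the two factors here is arbitrary.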

Proposed solutions

  • Applications that better utilize modern technology
  • Reductions in duplicate data (especially as caused by data movement)
  • Improvement of metadata structures
  • Improvement of file and storage transfer structures
  • User education and discipline[3]
  • The implementation of Information Lifecycle Management (ILM) solutions, which eliminate low-value information as early as possible and move the remainder into actively managed long-term storage where it can be accessed quickly and cheaply.[2]
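Several of these measures, in particular the reduction of duplicate data, are commonly implemented by fingerprinting content with a cryptographic hash. A minimal sketch, with an illustrative function rather than any specific product's API:

```python
import hashlib
from pathlib import Path

def dedupe(paths):
    """Return one representative path per unique file content.

    Files are fingerprinted with SHA-256; later files whose bytes match an
    earlier file are treated as duplicates and dropped.
    """
    seen = {}
    for p in paths:
        digest = hashlib.sha256(Path(p).read_bytes()).hexdigest()
        seen.setdefault(digest, p)  # keep the first occurrence only
    return list(seen.values())
```

Production deduplication systems typically work at the block or chunk level rather than on whole files, and keep the hash index persistent so duplicates can be detected at write time instead of during a later scan.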

References


Wikimedia Foundation. 2010.
