Web scraping (also called Web harvesting or Web data extraction) is a software technique for extracting information from websites. Such programs usually simulate human exploration of the Web, either by implementing the low-level Hypertext Transfer Protocol (HTTP) directly or by embedding a full-fledged Web browser such as Internet Explorer (IE) or the Mozilla Web browser. Web scraping is closely related to Web indexing, in which a bot indexes Web content; most search engines use this technique. In contrast, Web scraping focuses on transforming unstructured Web content, typically in HTML format, into structured data that can be stored and analyzed in a central local database or spreadsheet. Web scraping is also related to Web automation, which simulates human Web browsing using computer software. Typical uses of Web scraping include online price comparison, weather data monitoring, website change detection, Web research, Web content mashup and Web data integration.
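The transformation from HTML into structured data described above can be illustrated with a minimal sketch. It assumes a hypothetical price table held in the string `SAMPLE_HTML` and uses only Python's standard `html.parser` module. The class `TableRowParser` and the function `scrape_prices` are names invented for this illustration, and a real scraper would first fetch the page over HTTP rather than read an inline string.

```python
from html.parser import HTMLParser


class TableRowParser(HTMLParser):
    """Collect the text of every <td> cell, grouped by <tr> row."""

    def __init__(self):
        super().__init__()
        self.rows = []      # completed rows, each a list of cell strings
        self._row = None    # cells of the row currently being read
        self._cell = None   # text fragments of the cell currently being read

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag == "td" and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        # Only text that appears inside a <td> cell is kept.
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag == "td" and self._cell is not None:
            self._row.append("".join(self._cell).strip())
            self._cell = None
        elif tag == "tr" and self._row is not None:
            if self._row:   # skip rows without <td> cells, such as headers
                self.rows.append(self._row)
            self._row = None


def scrape_prices(html):
    """Turn a two-column product/price table into a list of records."""
    parser = TableRowParser()
    parser.feed(html)
    parser.close()
    return [{"product": name, "price": float(price.lstrip("$"))}
            for name, price in parser.rows]


# Hypothetical page fragment standing in for a downloaded Web page.
SAMPLE_HTML = """
<table>
  <tr><th>Product</th><th>Price</th></tr>
  <tr><td>Widget</td><td>$9.99</td></tr>
  <tr><td>Gadget</td><td>$24.50</td></tr>
</table>
"""

records = scrape_prices(SAMPLE_HTML)
```

The resulting `records` list can be written to a spreadsheet or database row by row. The sketch assumes every data row has exactly two cells; a page with a different layout would need its own extraction rules.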
Techniques for Web scraping

Web scraping is the process of automatically collecting Web information.[1] It is a field under active development that shares a common goal with the semantic Web vision, an ambitious initiative that still requires breakthroughs in text processing, semantic understanding, artificial intelligence and human-computer interaction. Web scraping instead favors practical solutions based on existing technologies, even though some of these solutions are entirely ad hoc. Existing Web-scraping technologies therefore provide different levels of automation:
Legal issues

Web scraping may be against the terms of use of some websites. The enforceability of these terms is unclear.[3] While outright duplication of original expression will in many cases be illegal, in the United States the courts ruled in Feist Publications v. Rural Telephone Service that duplication of facts is allowable. In a February 2006 ruling, the Danish Maritime and Commercial Court (Copenhagen) found that systematic crawling, indexing and deep linking by the portal site ofir.dk of the real estate site Home.dk did not conflict with Danish law or the database directive of the European Union.[4]

U.S. courts have acknowledged that users of "scrapers" or "robots" may be held liable for committing trespass to chattels,[5][6] a claim that treats the computer system itself as personal property upon which the user of a scraper is trespassing. To succeed on such a claim, however, the plaintiff must demonstrate that the defendant intentionally and without authorization interfered with the plaintiff's possessory interest in the computer system, and that the defendant's unauthorized use caused damage to the plaintiff. Not all cases of web spidering brought before the courts have been considered trespass to chattels.[7]

In Australia, the Spam Act 2003 outlaws some forms of web harvesting.[8][9]

Technical measures to stop bots

A webmaster can use various measures to stop or slow a bot. Some techniques include:
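As one illustration of slowing a bot, the following minimal sketch limits how many requests a single client may make within a time window. Per-client rate limiting is an assumption chosen for this example and is not necessarily among the techniques the original list named. The class `SlidingWindowLimiter`, its parameters, and the idea of identifying clients by a string such as an IP address are all hypothetical.

```python
import time
from collections import defaultdict, deque


class SlidingWindowLimiter:
    """Allow at most max_requests per client within window_seconds."""

    def __init__(self, max_requests, window_seconds, clock=time.monotonic):
        self.max_requests = max_requests
        self.window_seconds = window_seconds
        self.clock = clock                  # injectable so it can be tested
        self._hits = defaultdict(deque)     # client id -> request timestamps

    def allow(self, client_id):
        """Return True if the request may proceed, False if it is throttled."""
        now = self.clock()
        hits = self._hits[client_id]
        # Forget requests that have aged out of the window.
        while hits and now - hits[0] >= self.window_seconds:
            hits.popleft()
        if len(hits) >= self.max_requests:
            return False
        hits.append(now)
        return True
```

A server would call `allow()` with some client identifier before serving each request and answer with an error or a delay when it returns False. This throttles a fast bot while leaving ordinary visitors unaffected, although a bot that spreads its requests over many addresses would not be caught by this sketch alone.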