Thursday 21 September 2017

Deep Web Datamining

Academic research estimates that more than 99% of the content on the World Wide Web remains hidden in the Deep Web.

To illustrate the difference in accessible data and information: this means tens of trillions of pages, as opposed to the mere billions on the surface of the "regular" internet.

The reason some sites can be found and deep websites cannot lies in the programming and functionality of regular search engines: they are simply not made to browse the Deep Web effectively.

Datamining the Deep Web creates more chances for success because it includes any website that cannot be detected by the ‘crawlers’ that Google, Yahoo and similar search tools use to search the internet for sites to fill their results pages.

The Deep Web consists mainly of database-driven websites and any part of a website that sits behind a login page. Going into the Deep Web, you can reach temporary sites, sites that are blocked (by local webmasters) and even sites with special formats.
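To make the distinction concrete, here is a minimal Python sketch, not taken from any of the tools named in this post, that separates what a link-following crawler can see on a page (plain links) from the entry points it usually cannot pass (search forms and login forms). The class name `EntryPointFinder`, the function `find_entry_points` and the sample page are all made up for illustration.

```python
from html.parser import HTMLParser


class EntryPointFinder(HTMLParser):
    """Collects plain links and forms from one HTML page (illustrative only)."""

    def __init__(self):
        super().__init__()
        self.links = []   # href values a link-following crawler can visit
        self.forms = []   # one dict per <form>: action, method, field names
        self._current_form = None

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "a" and attrs.get("href"):
            self.links.append(attrs["href"])
        elif tag == "form":
            self._current_form = {
                "action": attrs.get("action") or "",
                "method": (attrs.get("method") or "get").lower(),
                "fields": [],
            }
            self.forms.append(self._current_form)
        elif tag in ("input", "select", "textarea") and self._current_form is not None:
            # Only named fields are sent when the form is submitted.
            name = attrs.get("name")
            if name:
                self._current_form["fields"].append(name)

    def handle_endtag(self, tag):
        if tag == "form":
            self._current_form = None


def find_entry_points(html):
    """Return (links, forms) found in the given HTML string."""
    finder = EntryPointFinder()
    finder.feed(html)
    return finder.links, finder.forms


# Hypothetical page: one ordinary link, one search form, one login form.
SAMPLE_PAGE = """
<html><body>
  <a href="/about.html">About</a>
  <form action="/search" method="get">
    <input type="text" name="q">
    <select name="year"><option>2017</option></select>
  </form>
  <form action="/login" method="post">
    <input type="text" name="user">
    <input type="password" name="password">
  </form>
</body></html>
"""

links, forms = find_entry_points(SAMPLE_PAGE)
print(links)
for form in forms:
    print(form["method"], form["action"], form["fields"])
```

The single link is all a regular crawler would follow here. Whatever sits behind the two forms (database results, logged-in pages) stays out of its reach unless something fills the forms in.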

Appealing as datamining the Deep Web may seem, it is an impossible endeavor to do by hand because of the vastness of the available data.


There are some bots today, however, that can work around the problem, and it is worth the effort to do some homework before you begin.
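As a rough idea of what such a bot does, the Python sketch below generates one query URL for every combination of candidate values for a search form's fields, which is one simple way to probe a database-driven site. It is an assumption-laden illustration, not the method of any specific tool: the function name `candidate_queries`, the example.org address, the field names and the values are all hypothetical, and nothing is actually fetched.

```python
from itertools import product
from urllib.parse import urlencode, urljoin


def candidate_queries(base_url, action, field_values):
    """Yield one GET URL per combination of candidate field values.

    field_values maps a form field name to the list of values to try.
    """
    names = sorted(field_values)          # fixed order keeps output stable
    target = urljoin(base_url, action)    # resolve the form's action URL
    for combo in product(*(field_values[name] for name in names)):
        yield target + "?" + urlencode(dict(zip(names, combo)))


# Hypothetical search form with two fields and two candidate values each.
urls = list(candidate_queries(
    "http://example.org/catalog/",
    "/search",
    {"q": ["cancer", "diabetes"], "year": ["2016", "2017"]},
))
for url in urls:
    print(url)
```

Two fields with two values each already give four requests. The number grows quickly with every extra field, which is why this work is left to bots, and why a real one would also need to pace its requests and respect each site's terms.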

Here are the names of some helpful tools to remember when considering datamining in the Deep Web:

Tor: short for The Onion Router
 
HiWE (Hidden Web Exposer): Stanford's prototype engine

Infoplease

PubMed

University of California's Infomine
 
BrightPlanet’s Big Data Mining tool: the Deep Web Monitor

MAS (Multi-agent Information System): a newly proposed deep web data mining algorithm, currently being further developed by researchers from Hebei University in China.

Lifehack Finders Nice to Know:

Keep in mind that Tor Browser (although widely known and used) is often run with JavaScript disabled, making it difficult for JavaScript-based analytics software to mine data from its users.
