2017-05-20

Bots, metadata, big data, linkity, and that time I broke the family googles.

I guess this started a long time ago when companies started collecting data from users on the internet.  What triggered me to write linkity was our government deciding to make ISPs collect and store our metadata.

I wrote linkity to create lots of metadata.  It looks at lots of things at random on the internet.  I wanted a program that didn't download much (and so didn't cost the user much), didn't cause much traffic on any one remote site, but visited lots of random websites and generated lots of metadata.  I believe it's completely legal to use this, and I hope it stays that way.

If our government wants to store our metadata, this program helps give them more.  It lowers the signal-to-noise ratio of your metadata.  It's a bit of security by obscurity and probably only an annoyance.  Don't rely on it to do anything much.  It may give you a bit of plausible deniability.  If many people use it, it will create a massive haystack to look for needles in, so I want the program to scale without being too annoying on the internet.

I thought it would be relatively easy at first.  I would run a few searches for random words on google and scrape and follow the links.

That seemed to go well, but while I was still testing and tuning it, google objected and banned my whole house from searching, for using a bot.  My family were not impressed.  I hadn't noticed, because I don't use google that much.  They told me, "fix the googles!"  Who'd've thought that google didn't like people using bots on google.  Google, who run more bots, and more powerful bots, than anyone else in the world, or at least the public part of it.

I decided to stop and rethink my plan.  The googles came back to our house after a day or so.  Family happy again.

I guess google uses its massive machine learning to try and improve its responses to people's questions, and runs a lot of checks to make sure it's really people asking.  I hadn't realised, or hadn't thought about, the massive number of robots and the amount of robotic activity out there on the wild internet, doing all sorts of things, probably related to making money in good or bad ways.

I had planned to make scrapers for other search engines, so I brought those plans forward.  I changed the program so it collected the links in a database instead of just viewing them and throwing them away.  Then I found I could harvest links from any page I visited, so I didn't have to go to search sites very much.
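
To give a feel for the loop, here's a rough sketch.  It is not the actual linkity code, links.db is just a name I made up, and example.com is only a placeholder seed, but it uses the same tools, lynx and sqlite3:

sqlite3 links.db "CREATE TABLE IF NOT EXISTS links (url TEXT UNIQUE);"
sqlite3 links.db "INSERT OR IGNORE INTO links VALUES ('https://example.com/');"  # seed with any starting page
url=$(sqlite3 links.db "SELECT url FROM links ORDER BY RANDOM() LIMIT 1;")       # pick a random stored link
# visit it, pull out the links on the page, and harvest any new ones
# (URLs containing a quote are skipped rather than escaped; it's only a sketch)
lynx -dump -listonly "$url" | grep -Eo 'https?://[^[:space:]]+' | grep -v "'" |
while read -r link; do
    sqlite3 links.db "INSERT OR IGNORE INTO links VALUES ('$link');"
done

Run that over and over and the script wanders from page to page on its own, leaving a trail of metadata and not much else.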

I did manage to break the googles again, but this time their AI only seemed to block my script briefly, and not the rest of the household.  Clever googles.

Now that I have it working, I can watch how my script wanders around the net.  Sometimes it fixates on certain sites, ones that have a lot of subdomains.  It's fascinating how websites are networked through links.

Bug reports, ideas and comments welcome.

Oh, and lynx developers: why have a non-removable warning message when you use a non-Lynx user agent?  Why are we forced to hand over information about ourselves when we use the web?
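
If you want to see what I mean, lynx lets you hand it a different user agent on the command line, something like the line below, and that's what sets off the warning.  The URL and the user agent string here are just examples:

lynx -dump -useragent="Mozilla/5.0 (X11; Linux x86_64)" https://example.com/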

Technical details:
Download: linkity

It runs on Linux, BSD, or Mac systems.  It uses some Unix utilities: lynx, sqlite3, and perl with the URI::Escape module.

On Debian/Ubuntu systems you can install them like this:
sudo apt-get install sqlite3 lynx liburi-perl

On a Mac with MacPorts:
sudo port install lynx sqlite3
I'm pretty sure MacPorts installs perl5 automatically, and URI::Escape is:
sudo port install p5.24-uri

You could use CPAN to get URI::Escape, but I can't really help you with that.
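
For what it's worth, the usual incantation is something like this, though your CPAN setup may differ:

sudo cpan URI::Escape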


