This series is written by a representative of the latter group, which is comprised mostly of what might be called "productivity users" (perhaps "tinkerly productivity users?"). Though my lack of training precludes me from writing code or improving anyone else's, I can, nonetheless, try and figure out creative ways of utilizing open source programs. And again, because of my lack of expertise, though I may be capable of deploying open source programs in creative ways, my modest technical acumen hinders me from utilizing those programs in what may be the most optimal ways. The open-source character, then, of this series, consists in my presentation to the community of open source users and programmers of my own crude and halting attempts at accomplishing computing tasks, in the hope that those who are more knowledgeable than me can offer advice, alternatives, and corrections. The desired end result is the discovery, through a communal process, of optimal and/or alternate ways of accomplishing the sorts of tasks that I and other open source productivity users need to perform.

Wednesday, January 23, 2013

11th installment: lynx; your own personal google scraper

Ok, I'll admit it: there's certainly hyperbole in this entry's title. What I'm doing with the text-mode browser lynx isn't really scraping--it's just something that bears some conceptual (in my view) similarities. It might appear similar because what I've done is to come up with a way of invoking lynx (or any other text-mode browser for that matter), with search terms already entered, from the command line. The end product is just the text results google finds relative to your query--sans all the bells and whistles google's search portal has been foisting on us in recent years. Why is this a significant accomplishment? Well, consider the following.


Have you found google's search portal to be increasingly cluttered and bothersome? I certainly have. Things like pop-out previews do nothing for me but create distraction, and auto-completion is far more often an irritation to me than a help: as a liberal estimate, perhaps 25% of my searches have benefited from the auto-completion feature. For what it's worth, if google wished to provide better service to users like me, they would create two separate search portals: one would be a fuzzy-feely search portal for those who might be uncertain as to what they're seeking and who could benefit from auto-completion and/or pop-out previews; the other would be google's old, streamlined search page and would involve little more than short text summaries and relevant links.

Once upon a time there was a google scraper site at itself more as a search anonymizer than as an interface unclutterer--that provided a results page pretty much like the old google one. I used to use scroogle in the days before google introduced some of the more irritating "enhancements" that now plague their site, and came to appreciate above all its spartan appearance. But, alas, scroogle closed its doors in mid-2012 and so is no longer an option. I've been stuck since, resentfully, using google.

In a recent fit of frustration, I decided to see whether there might be any other such scrapers around. As I searched, I wondered as well whether one might not be able to set up their own, personal scraper, on their own personal computer: I had certainly heard and read about the possibilities for conducting web searches from the command line, and this seemed a promising avenue for my query. I ended up finding some results that, while providing but a primitive approximation, look like they may nonetheless have given me a workable way to do the sort of pseudo-scraping I need. Thus, the following entry.

More about the task

Conducting web searches from the command line is another way of describing the task I aimed to accomplish. Granted, doing this sort of thing is nothing especially new. surfraw, for example, created by the infamous Julian Assange, has been around for a number of years and more properly fits into the category of web-search-from-the-command-line utilities than does the solution I propose--which just invokes a text-mode browser. There are actually several means of doing something that could be classified as "searching the web from the command line" (google that and you'll see), including the interesting "google shell" project, called "goosh."

Still, the solution I've cobbled together using bits found in web searches, and which involves a bash function that calls the text-mode browser lynx, seemed on-target enough and something worth writing an entry about. Details below.

The meat of the matter: bash function

To begin with, some credits. The template I cannibalized for my solution is found here: I only did some minor modifications to that code so that it would work more to my liking. There's another interesting proposition in that same thread, by the way, that uses lynx--though it pipes output through less. I tried that one and it got me thinking in the direction of using lynx for this. But I liked the way the output looked in lynx much more than when piped through less, so I decided to try further adapting the bash function for my uses and came up with the following.

The bash function outlined at that site actually uses google search and calls a graphical browser to display the output. The graphical browser part was the one I was trying to obviate so that would be the first change to make. I mostly use elinks these days for text-mode browsing, but having revisited lynx while experimenting with the other solution posed there, I decided I would try it out. And I must say that it does have an advantage over elinks in that URL's can be more easily copied from within lynx (no need to hold down the shift key).

I could not get the google URL given in that example to work in my initial trials, however. This is likely owing to changes google has made to its addressing scheme in the intervening interval since that post was made. So I first used a different URL from the search engine startpage.

After some additional web searching and tweaking, I was finally able to find the correct URL to return google search results. Though that URL is likely to change in the future, I include it in the example below.

What I have working on this system results from the code below, which I have entered into my .bashrc file:

Once that has been entered, simply issue . .bashrc so that your system will re-source your .bashrc file, and you're ready for command-line web searching/pseudo-scraping. To begin searching, simply enter the new terminal command you just created, search, followed by the word or phrase you wish to search for on google: search word, search my wordsearch "my own word", search my+very+own+word, or seemingly just about any other search term or phrase you might otherwise enter into google's graphical search portal seem to work fine.

lynx will then open in the current terminal to the google search results page for your query. You can have a quick read of summaries or follow results links. Should any of the entries merit graphical inspection, you can copy and paste the URL into your graphical browser of choice.

You'll probably want to tell lynx (by modifying the relevant option in lynx.cfg) either to accept or reject all cookies so as to save yourself some keystrokes. If you do not do so, it will, on receiving a cookie, await your input prior to displaying results. Of course you could use any other text-mode browser--such as w3m, the old links or xlinks, retawq, netrik, or any other text-mode-browser candidates as well.

Suggestions for improvements to my solution or offerings of alternative approaches will be appreciated. Happy pseudo-scraping/command-line searching!

AFTERTHOUGHT: I happened upon some other interesting-looking bash functions at another site that are supposed to allow other types of operations from the command line; e.g., defining words, checking weather, translating words. These are rather dated, though (2007), and I couldn't get them to work. Interpreting their workings and determing where the problem(s) lie is a bit above my pay grade: anyone have ideas for making any of these functions once again operable?

No comments:

Post a Comment