Ok, I'll admit it: there's certainly hyperbole in this entry's title. What I'm doing with the text-mode browser lynx isn't really scraping--it's just something that bears, in my view, some conceptual similarities. It might appear similar because what I've done is come up with a way of invoking lynx (or any other text-mode browser, for that matter), with search terms already entered, from the command line. The end product is just the text results google finds relevant to your query--sans all the bells and whistles google's search portal has been foisting on us in recent years. Why is this a significant accomplishment? Well, consider the following.
Background
Have you found google's search portal to be increasingly cluttered and bothersome? I certainly have. Things like pop-out previews do nothing for me but create distraction, and auto-completion is far more often an irritation to me than a help: as a liberal estimate, perhaps 25% of my searches have benefited from the auto-completion feature. For what it's worth, if google wished to provide better service to users like me, they would create two separate search portals: one would be a fuzzy-feely search portal for those who might be uncertain as to what they're seeking and who could benefit from auto-completion and/or pop-out previews; the other would be google's old, streamlined search page and would involve little more than short text summaries and relevant links.
Once upon a time there was a google scraper site at www.scroogle.org--billing itself more as a search anonymizer than as an interface unclutterer--that provided a results page pretty much like the old google one. I used scroogle in the days before google introduced some of the more irritating "enhancements" that now plague their site, and came to appreciate above all its spartan appearance. But, alas, scroogle closed its doors in mid-2012 and so is no longer an option. Since then I've been stuck, resentfully, using google.
In a recent fit of frustration, I decided to see whether there might be any other such scrapers around. As I searched, I wondered as well whether one might be able to set up one's own personal scraper on one's own computer: I had certainly heard and read about the possibilities for conducting web searches from the command line, and this seemed a promising avenue for my query. I ended up finding some results that, while providing but a primitive approximation, look like they may nonetheless have given me a workable way to do the sort of pseudo-scraping I need. Thus, the following entry.
More about the task
Conducting web searches from the command line is another way of describing the task I aimed to accomplish. Granted, doing this sort of thing is nothing especially new. surfraw, for example, created by the infamous Julian Assange, has been around for a number of years and more properly fits into the category of web-search-from-the-command-line utilities than does the solution I propose--which just invokes a text-mode browser. There are actually several means of doing something that could be classified as "searching the web from the command line" (google that and you'll see), including the interesting "google shell" project, called "goosh."
Still, the solution I've cobbled together using bits found in web searches, and which involves a bash function that calls the text-mode browser lynx, seemed on-target enough and something worth writing an entry about. Details below.
The meat of the matter: bash function
To begin with, some credits. The template I cannibalized for my solution is found here: I made only some minor modifications to that code so that it would work more to my liking. There's another interesting proposition in that same thread, by the way, that uses lynx--though it pipes output through less. I tried that one, and it got me thinking in the direction of using lynx for this. But I liked the way the output looked in lynx much more than when piped through less, so I decided to try further adapting the bash function for my uses and came up with the following.
The bash function outlined at that site actually uses google search but calls a graphical browser to display the output. The graphical browser part was the one I was trying to obviate, so that would be the first change to make. I mostly use elinks these days for text-mode browsing, but having revisited lynx while experimenting with the other solution posed there, I decided I would try it out. And I must say that it does have an advantage over elinks in that URLs can be more easily copied from within lynx (no need to hold down the shift key).
I could not get the google URL given in that example to work in my initial trials, however. This is likely owing to changes google has made to its addressing scheme in the interval since that post was made. So I initially used a different URL, one from the search engine startpage.
After some additional web searching and tweaking, I was finally able to find the correct URL to return google search results. Though that URL is likely to change in the future, I include it in the example below.
What I have working on this system results from the code below, which I have entered into my .bashrc file:
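What follows is a minimal sketch of that function: the name, search, matches the usage described below, while the exact google URL parameters are an approximation on my part and may need adjusting as google revises its addressing scheme.

search() {
    # join all the arguments into a single query string, with '+' between terms
    local query="${*// /+}"
    # open the results page in lynx; substitute your preferred text-mode browser here
    lynx "http://www.google.com/search?q=${query}"
}

(Note that this sketch does no URL-encoding beyond the '+' joining, so queries containing characters like & or ? would need quoting or manual encoding.)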
Once that has been entered, simply issue . .bashrc so that your system will re-source your .bashrc file, and you're ready for command-line web searching/pseudo-scraping. To begin searching, simply enter the new terminal command you just created, search, followed by the word or phrase you wish to search for on google: search word, search my word, search "my own word", search my+very+own+word, and seemingly just about any other search term or phrase you might otherwise enter into google's graphical search portal all seem to work fine.
lynx will then open in the current terminal to the google search results page for your query. You can have a quick read of summaries or follow results links. Should any of the entries merit graphical inspection, you can copy and paste the URL into your graphical browser of choice.
You'll probably want to tell lynx (by modifying the relevant option in lynx.cfg) either to accept or reject all cookies, so as to save yourself some keystrokes. If you do not do so, it will, on receiving a cookie, await your input before displaying results. Of course you could use any other text-mode browser for this as well--w3m, the old links or xlinks, retawq, netrik, or other such candidates.
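As to that cookie option, the relevant line in lynx.cfg for accepting everything without prompting would be the following (set it to FALSE instead if you prefer to keep the prompts):

# accept all cookies without asking
ACCEPT_ALL_COOKIES:TRUE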
Suggestions for improvements to my solution or offerings of alternative approaches will be appreciated. Happy pseudo-scraping/command-line searching!
AFTERTHOUGHT: I happened upon some other interesting-looking bash functions at another site that are supposed to allow other types of operations from the command line, e.g., defining words, checking weather, translating words. These are rather dated, though (2007), and I couldn't get them to work. Interpreting their workings and determining where the problem(s) lie is a bit above my pay grade: anyone have ideas for making any of these functions operable once again?
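For what it's worth, one possible replacement for the word-definition function--a sketch resting on the assumption that the dict.org server still answers DICT-protocol queries, a protocol curl speaks natively--would be:

define() {
    # ask dict.org for definitions of the first argument, e.g.: define serendipity
    curl -s "dict://dict.org/d:$1"
}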
The Field notes of an Audacious Amateur series is offered in the spirit of the open source movement. While the concept of open source is typically associated with computer programmers, there is a growing body of those who don't know jack about programming, but who nevertheless use the creations of open source programmers. . . .
This series is written by a representative of the latter group, which is composed mostly of what might be called "productivity users" (perhaps "tinkerly productivity users"?). Though my lack of training precludes me from writing code or improving anyone else's, I can nonetheless try to figure out creative ways of utilizing open source programs. At the same time, that same lack of expertise means that, though I may be capable of deploying open source programs in creative ways, my modest technical acumen hinders me from utilizing them in the most effective ways. The open-source character of this series, then, consists in my presenting to the community of open source users and programmers my own crude and halting attempts at accomplishing computing tasks, in the hope that those more knowledgeable than me can offer advice, alternatives, and corrections. The desired end result is the discovery, through a communal process, of optimal and/or alternate ways of accomplishing the sorts of tasks that I and other open source productivity users need to perform.
Thursday, January 10, 2013
10th installment: resume an scp file transfer
NOTE: as a knowledgeable commenter later pointed out, "[c]urrently (and since around 2004) the default transfer protocol in rsync *IS* ssh. There is no need for the '-e ssh' unless you directly connect to a remote rsync daemon." I have tested this claim and it is, indeed, true that file transfer resumption using rsync does not require the -e ssh bit I stipulated in the instructions below. Bear in mind, though, that rsync's own -p switch means --perms, not "port," so an alternate ssh port still has to be handed to ssh itself--via -e 'ssh -p 1234' or a Port entry in ~/.ssh/config--as described below.
I recently went on vacation and, since my mythtv set-up was, for some crazy reason, not allowing me to do a direct download of recorded programming through the mythweb interface, I needed to find an alternate way of snagging those files. I have an ssh server running on my home LAN, so using scp for this seemed like it should work, though I knew it would take a bit of tinkering. Read on to see what sort of tinkering I did and, just as importantly, a way I discovered of resuming the disrupted download.
I first investigated the possibility of setting up an ssh tunnel, since the computer on my LAN that contains the video files is not the one running the ssh server. But doing that looked a tad beyond my skill level. So I decided I'd just copy those files over to the computer running the ssh server manually, then scp them to the remote computer from there.
These were very large files--i.e., > 2 GB--and, given the fairly limited rate at which I could transfer them, I expected there might be some disruption or disconnection during the download. Prior to beginning the downloads, then, I searched google under "scp" and "resume," and I immediately came across results that showed how to use the rsync utility to resume disrupted downloads. This encouraged me to go ahead and try the scp download method.
As wikipedia informs us, "rsync is a software application and network protocol for Unix-like systems . . . that synchronizes files and directories from one location to another while minimizing data transfer." Though I had, when previously considering differing ways to back up certain directories on my computers, looked at some documentation on rsync, I had no prior experience with actually using it. Nonetheless, that's the solution I ended up employing--though I needed to do a slight adaptation for my circumstances. It seemed the slight variation I stumbled upon might warrant an entry on this blog.
Before describing in greater detail what I did, I should first at least mention a couple of other results I found that used differing utilities. One candidate used curl and sftp instead of rsync, while the other used the dd command. Since I did not attempt to implement either of those solutions, I will, after simply making note of the fact that those utilities apparently can be used for this, move on.
Getting back to rsync, the bulk of instructions I found for resuming scp transfers using it would not work "out of the box" for me because I run ssh on a non-standard port. For purposes of this blog entry, let's say that's port 1234. The question for me, then, was how to adapt the directions I'd found to the scenario involving the non-standard ssh port my LAN uses.
The resolution turned out to be fairly simple. I finally ran across an incantation very close to what I needed here. A simplified sample entry follows (a slightly more complex rendition can be seen in the description of setting up an alias below):
rsync -P -e 'ssh -p 1234' user@remote.host.IP:path-to/remotefile.mpg localfile.mpg
Essentially, the command tells rsync to use ssh as the shell on the remote end (the -e switch), while the -P switch--shorthand for --partial --progress--tells it two things: that it should display the progress of the transfer, and that it should keep any partially transferred file so the copy can be resumed. What falls between the inverted commas are the options that get passed to ssh--in this case -p 1234, stipulating the port to connect to on the remote end.
To simplify this resuming process yet further, an alias could, theoretically, be created as suggested here. That would not work in my case, however, since an unquoted alias does not pass the port option along to ssh: entering alias scpresume='rsync -Pazhv -e ssh -p 1234' at the command line caused the port specification to be received as an option by rsync--an option it was unable to interpret (presumably quoting the remote-shell command, as in -e 'ssh -p 1234', would also cure this, but I went another route). Thus, the more permanent solution of adding that line to .bashrc would not work for me either.
To make the alias work for me, I had to set up an ~/.ssh/config file with the following content (as discussed here):
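Host remote
    HostName remote.host.IP
    Port 1234
    User user

(The host alias remote, the address remote.host.IP, and the user name are, of course, the same placeholders used elsewhere in this entry; substitute your own values.)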
After doing that, I was able to create the alias as alias scpresume='rsync -Pazhv -e ssh' (note that the alternate port no longer needs to be specified, since it's entered in your ~/.ssh/config file). It can now simply be run as scpresume remote:path-to/remotefile localfile. Once your ~/.ssh/config file is set up, the alias scpresume='rsync -Pazhv -e ssh' line is what needs to be entered into your .bashrc to make scpresume a permanent part of your command-line environment.
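Putting the pieces together, then, the whole arrangement looks something like this (same placeholders as above):

alias scpresume='rsync -Pazhv -e ssh'
scpresume remote:path-to/remotefile.mpg localfile.mpg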
Feel free to offer any improvements you may have or other suggestions for using alternate command-line utilities for resuming downloads. This method did the trick for me, and I was able to resume transfer of files that had petered out at varying points during the download process, but of course there could be other methods that are in some way superior.
ADDENDUM: In light of the knowledgeable reader's comment above, and bearing in mind that rsync's own -p switch means --perms rather than a port, the full, non-redundant command for resuming a transfer over a non-standard ssh port would be
rsync -Pazhv -e 'ssh -p 1234' user@remote.host.IP:path-to/remotefile.mpg localfile.mpg
or, with the Port entry already in ~/.ssh/config as shown above, more simply
rsync -Pazhv remote:path-to/remotefile.mpg localfile.mpg
Friday, January 4, 2013
Ninth Installment: Xmobar to the rescue!
Regular readers of this blog will be aware of my inclination towards minimalist desktops or window managers and my preference, within reasonable limits, for low-resource and/or command-line tools and applications. I've previously mentioned making use lately of the evilwm window manager, which I've pretty much settled on now in preference to two other minimalist window managers--dwm and ion3--with which I experimented. One of the things I've missed about the more full-blown desktops I've used, however, is the monitors or applets one can configure to run in the panel(s), which give quite helpful system information like memory and/or CPU usage.
I became aware some time ago of conky, a minimalist utility that can display these--and many other--types of helpful information under differing window managers or desktops. Though I'd seen screenshots of it configured to run as a sort of panel, most conky configurations I've come across actually have it display on the desktop background--not particularly desirable for me, since I tend to run applications full-screen on my evilwm desktops. But lately I came across information about another utility--Xmobar--that can display the sorts of system information I want and that seems to be configured to run mainly as a panel. So I decided to have a go with it. I was able to configure it to my liking fairly easily and decided to offer in this entry a further description of the program and to post the configuration I am using. That information follows.
To begin with a bit more information on Xmobar, it seems originally to have been written to complement the minimalist window manager Xmonad (which, incidentally, I've not tried). As the Arch Wiki entry--my main source for setting up and configuring Xmobar--informs us, it is written, as is Xmonad, in the Haskell programming language. In case you might be intimidated at the prospect of potentially having to learn something about that programming language, take heart: as the wiki entry further elucidates, "while xmobar is written in Haskell, no knowledge of the language is required to install and use it."
As with many other GNU/Linux utilities, Xmobar relies on a hidden configuration file--named, predictably, .xmobarrc--located in the user's home directory. The Arch wiki contains a sample configuration file, and that's the one on which I based my initial experiments with Xmobar.
To start Xmobar, I simply call it as the last line in my .xinitrc file. That will, of course, not be the universally applicable way of starting the utility: those using a log-in manager will undoubtedly need to invoke it in some other way. Being the GUI-averse type I am, however, this is what works for me.
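For illustration, the tail end of the .xinitrc would look something like the following (a sketch rather than a verbatim transcription of my file; the ordering simply reflects xmobar being called last):

# start the window manager in the background . . .
evilwm &
# . . . then let xmobar hold the X session as the final command
exec xmobar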
Below I include a screenshot of the lower section of one of my desktops, which shows Xmobar running as it is currently configured on one of my systems.
Though a good deal will be obvious from the picture, I will nonetheless offer a verbal description of each section of the panel. After the description, I will provide the content of my .xmobarrc file for further reference.
On the left side of the panel we see, on the far left, a CPU meter of sorts (a CPU usage percentage). Next to that is the memory meter, showing the percentage of main RAM and swap in use. Then follows a network meter showing current upload/download speed: the dash to its left indicates Xmobar did not find eth0 since, in the instance when the screenshot was taken, no network cable was attached to it. After the network speed indicator follows a keyboard layout indicator: at the moment the US keyboard layout is in use, but I also have Russian and Greek keyboards configured for this machine (more on keyboard layouts and switching between them in a future entry).
On the right side of the panel are, first, the date and time. Next to that is a battery meter that displays the percentage of battery charge left as well as the estimated remaining time; the screenshot comes from a laptop, of course. That is followed by a location indicator and outside temperature reading: I happened to be near the town of Mikkeli in Finland at the time I wrote this entry--thus the EFMI weather station code in the configuration file below. Finally, the kernel version and distribution are listed--this being derived, of course, from uname output.
Below, then, is the content of my .xmobarrc file. I added a few tweaks to the one I found on the Arch Wiki, mainly the battery meter as well as the keyboard layout indicator. I also did a bit of color tweaking since it seemed to me the section dividers (the pipe character--|) needed to be in a different color so as to more readily draw attention to the field delimitations. A bottom rather than top orientation was more to my liking, so I made that modification as well.
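The version below is a close reconstruction based on the Arch wiki sample with the adaptations just described; treat the particulars (the eth0 interface name, the BAT0 battery designation, the layout codes, and the EFMI station) as values to adjust for your own machine:

Config { font = "-misc-fixed-*-*-*-*-10-*-*-*-*-*-*-*"
       , bgColor = "black"
       , fgColor = "grey"
       , position = Bottom
       , commands = [ Run Cpu ["-L","3","-H","50","--normal","green","--high","red"] 10
                    , Run Memory ["-t","Mem: <usedratio>%"] 10
                    , Run Swap [] 10
                    , Run Network "eth0" ["-L","0","-H","32","--normal","green","--high","red"] 10
                    , Run Kbd [("us", "US"), ("ru", "RU"), ("gr", "GR")]
                    , Run Date "%a %b %_d %Y %H:%M:%S" "date" 10
                    , Run BatteryP ["BAT0"] ["-t", "Batt: <left>% (<timeleft>)"] 600
                    , Run Weather "EFMI" ["-t", "<station>: <tempC>C"] 36000
                    , Run Com "uname" ["-s", "-r"] "" 36000
                    ]
       , sepChar = "%"
       , alignSep = "}{"
       , template = "%cpu% <fc=#ee9a00>|</fc> %memory% * %swap% <fc=#ee9a00>|</fc> %eth0% <fc=#ee9a00>|</fc> %kbd% }{ <fc=#ee9a00>%date%</fc> <fc=#ee9a00>|</fc> %battery% <fc=#ee9a00>|</fc> %EFMI% <fc=#ee9a00>|</fc> %uname%"
       }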
I am thus far quite happy with Xmobar. At the same time, I would be interested to hear from conky users who have their layout configured as a panel like this. Feel free to pipe in with your input on these or other minimalist panel utilities.