Field Notes of an Audacious Amateur: Addendum to 11th installment: Lynx; scraping credentialed web pages

Wednesday, January 13, 2016

Addendum to 11th installment: Lynx; scraping credentialed web pages

Sort of a dramatized headline for what I've accomplished using the command-line Lynx browser, but not too far from the mark. I've described in previous entries how I've used lynx to accomplish similar goals of extracting target information from web pages, so this entry is a continuation along those same lines.

I recently signed up for a prepaid cellular plan touted as being free, though it is limited to a certain (unreasonably low, for most) number of minutes per month. The plan has thus far worked well for me. The only real issue I have come across is that I had not yet discovered any easy way to check how many minutes I've used and how many are left. The company providing the service is, of course, not very forthcoming with that sort of information: they have a vested interest in getting you to use up your free minutes, hoping thereby that you'll realize you should buy a paid plan from them, one that includes more minutes. The only way I'd found for checking current usage status is to log in to their web site and click around until you reach a page showing that data.

Of course I am generally aware of the phenomenon of web-page scraping and have also heard of Python and/or Perl scripts that can perform more or less automated interactions with web pages (youtube-dl being one example). So I initially thought my task would require something along these lines--quite the tall order for someone such as myself, knowing next to nothing about programming in either Python or Perl. But then I ran across promising information that led me to believe I might well be able to accomplish this task using the tried and true lynx browser, and some experimentation proved that this would, indeed, allow me to realize my goal.

The information I discovered came from this page. It describes how to record to a log file all keystrokes entered during a particular lynx browsing session--something reminiscent of the way I used to create macros under Microsoft Word when I was using that software years ago. The generated log file can then, in turn, be fed to a subsequent lynx session, effectively automating certain browsing tasks, such as logging into a site, navigating to a page, then printing it (to a file, in my case). Add a few other utilities like cron, sed, and mail, and I have a good recipe for getting the cellular information I need into an e-mail that gets delivered to my inbox on a regular basis.

The initial step was to create the log file. An example of the command issued is as follows:
lynx -cmd_log=/tmp/mysite.txt http://www.mysite.com

That, of course, opens the specified URL in lynx. The next step is to enter such keystrokes as are necessary to get to the target page. In my case, I needed to press the down arrow key a few times to reach the login and password entry blanks. I then typed in the credentials, hit the down arrow again, then the "enter" key to submit the credentials. I then needed to hit the "end" key on the next page, which took me all the way to the bottom of that page, then the up arrow key a couple of times to get to the link leading to the target page. Once I got to the target page, I pressed the "p" key (for print), then the "enter" key (for print to file), at which point I was prompted for a file name. Once I'd entered the desired file name and pressed the "enter" key again, I hit the "q" key to exit lynx. In this way, I produced the log file I could then use for a future automated session at that same site. Subsequent testing using the command
lynx -cmd_script=/tmp/mysite.txt http://www.mysite.com

confirmed that I had, in fact, a working log file that could be used for retrieving the desired content from the target page.
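To make the replay step repeatable, it can be wrapped in a small shell function. What follows is only a sketch under a few assumptions: the output file name /tmp/prepaid-page.txt is hypothetical (it stands in for whatever name was typed at the print prompt while recording), the URL is the same placeholder used above, and I am assuming lynx asks for confirmation before overwriting an existing file, which would throw the recorded keystrokes out of step. Since the log contains the typed password in plain text, the sketch also tightens its permissions.

```shell
#!/bin/sh
# Sketch: replay a recorded lynx keystroke log without typing anything.
# All paths and the URL are placeholders, not a real site.

LOG="${LOG:-/tmp/mysite.txt}"          # keystroke log recorded with -cmd_log
URL="${URL:-http://www.mysite.com}"    # placeholder URL from the examples above
OUT="${OUT:-/tmp/prepaid-page.txt}"    # hypothetical name given at the print prompt

fetch_page() {
    # The log holds the login and password as typed, so keep it private.
    chmod 600 "$LOG" || return 1

    # Remove any earlier printout first (assumption: lynx would otherwise
    # prompt about the existing file and derail the recorded keystrokes).
    rm -f "$OUT"

    # Replay the keystrokes; the screen drawing itself is not needed.
    lynx -cmd_script="$LOG" "$URL" >/dev/null 2>&1

    # Report success only if the replay left a non-empty file behind.
    [ -s "$OUT" ]
}
```

Calling `fetch_page || echo "replay failed" >&2` afterwards gives a quick signal when the site layout has changed and the recorded keystrokes no longer land on the right links.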

The additional steps for my scenario were to turn this into a cron job (no systemd silliness here!), use sed to strip out extraneous content from the beginning and end of the page I'd printed/retrieved, and to get the resulting material into the body of an e-mail that I would have sent to myself at given intervals. The sed/mail part of this goes something like
sed -n 24,32p filename | mail -s prepaid-status me@mymail.com

A note on the mail command: I can't go into particulars of the mail program here, but suffice to say at least that you need a properly edited configuration file for your mail sending utility (I use msmtp) for this to work.
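The sed/mail step and the cron scheduling can be pulled together roughly as follows. This is a sketch, not my exact setup: the saved-page path, the script location in the crontab comment, and the 07:00 schedule are all made up for illustration, and the line range 24,32 is simply what happened to fit my target page. The function names are mine as well.

```shell
#!/bin/sh
# Sketch: trim the page saved by lynx and mail the interesting part.

PAGE="${PAGE:-/tmp/prepaid-page.txt}"   # hypothetical file saved by the lynx replay
TO="${TO:-me@mymail.com}"               # placeholder address

# Print only lines FIRST through LAST of FILE.
# usage: extract_lines FILE FIRST LAST
extract_lines() {
    sed -n "${2},${3}p" "$1"
}

send_status() {
    # Lines 24-32 held the usage figures on my page; adjust for yours.
    body=$(extract_lines "$PAGE" 24 32)

    # Do not send an empty message if the page is missing or too short.
    if [ -z "$body" ]; then
        echo "no usage data found in $PAGE" >&2
        return 1
    fi

    # mail needs a working sending setup behind it (msmtp in my case).
    printf '%s\n' "$body" | mail -s prepaid-status "$TO"
}

# Example crontab entry (hypothetical script path), once a day at 07:00:
#   0 7 * * * /home/me/bin/prepaid-status.sh
```

A script that defines these functions and then calls `send_status` as its last line is what the crontab entry would point at. Keeping the range extraction in its own function makes it easy to test the line numbers against a saved copy of the page before any mail goes out.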
