I have been webscraping for a few years now as a hobby. I’d like to share some tips I have learned.
1) Interface with Python’s “google” library
Simply use: pip install google.
This lets your webscraping code or bot find what you want easily. In my experience, you should never need to follow links more than two levels deep on the websites Google suggests.
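As a rough sketch of the "two levels deep" rule, here is a depth-limited, breadth-first crawl. The names crawl, get_links and max_depth are made up for illustration, and get_links stands in for whatever code you use to fetch a page and pull out its links. The starting URLs would be whatever your search returns.

```python
from collections import deque
from typing import Callable, Iterable


def crawl(start_urls: Iterable[str],
          get_links: Callable[[str], Iterable[str]],
          max_depth: int = 2) -> list[str]:
    """Visit pages breadth-first, following links at most max_depth levels.

    get_links is a hypothetical callable that returns the links found on a page.
    """
    seen = set()
    order = []
    queue = deque((url, 0) for url in start_urls)
    while queue:
        url, depth = queue.popleft()
        if url in seen:
            continue
        seen.add(url)
        order.append(url)
        # Only expand a page if its links would still be within the depth limit.
        if depth < max_depth:
            for link in get_links(url):
                if link not in seen:
                    queue.append((link, depth + 1))
    return order


# Toy "website" so the example runs without any network access.
fake_site = {"a": ["b"], "b": ["c"], "c": ["d"], "d": []}
print(crawl(["a"], lambda url: fake_site.get(url, [])))  # ['a', 'b', 'c']
```

Page "d" is three links away from the start, so it is never visited with the default limit of 2.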
The issue I find with using Google is that there is a daily quota on how many times you can search. You can pay to raise that quota.
Also, I have found that keeping multiple scripts in separate sub-folders can help increase the quota. The quota does not appear to be per day but on a shorter time frame. If anyone has more insight on the quota, I’d love to hear about it. I have my scripts pause for a while when they are unable to search and then check periodically until searching works again.
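The pause-and-check idea might look something like the sketch below. This is not my exact code: QuotaExceeded, search_with_pause and search_fn are hypothetical names, and you would need to raise QuotaExceeded yourself from whatever error your search call produces when it is refused. The sleep function is passed in only so the example can be tried without really waiting.

```python
import time
from typing import Callable, Iterable


class QuotaExceeded(Exception):
    """Hypothetical error: raise this when a search is refused."""


def search_with_pause(query: str,
                      search_fn: Callable[[str], Iterable[str]],
                      wait_seconds: float = 600.0,
                      max_attempts: int = 6,
                      sleep: Callable[[float], None] = time.sleep) -> list[str]:
    """Run search_fn(query); if the quota is hit, pause and check again."""
    for attempt in range(1, max_attempts + 1):
        try:
            return list(search_fn(query))
        except QuotaExceeded:
            if attempt == max_attempts:
                raise  # still refused after the last allowed attempt
            sleep(wait_seconds)  # wait before checking again
    return []  # only reached if max_attempts is less than 1
```

The wait time of 600 seconds and the limit of 6 attempts are arbitrary placeholders; tune them to however quickly the quota seems to come back for you.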
2) If your previously working code stops working because of your own operating system…
Even with developer mode turned on, my computer likes to quarantine my scripts or simply shut them down. You can obviously turn off your anti-malware software, but that may not be the best idea if your bot is actually downloading files while it is webscraping (which I like to do). It is important not to automatically download content that you can access only through special privileges on a local network and then share it, because you will definitely run into copyright issues. Of course, you could easily run into copyright issues just by letting your bot download whatever it finds as well.
The best workaround, in my opinion, is to run the Python script directly rather than turning it into an executable file with PyInstaller.
3) If your previously working code stops working because of the website you are scraping…
In this case you are going to want a dynamic IP address. I have not personally gone as far as masking my IP address or hiding my real one, but I do have my code check whether it is being blocked. If it is, the script can release its IP address by talking directly to the operating system. On Windows, for example:
import os
os.system('ipconfig /release')
Note that if you do this, the Wi-Fi will be disconnected and you will have to reconnect. Your Python script can do this too. If you want some of my code, I am happy to share it: admin@pharmacoengineering.com.
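Here is a minimal sketch of the check-then-reconnect idea, assuming Windows (ipconfig is a Windows command). The function names and the list of status codes are my own illustration: treating HTTP 403 and 429 as "blocked" is only a rule of thumb, and websites can signal blocking in other ways. Renewing the lease does not guarantee a different address, and whether the address the website sees changes at all depends on your network setup.

```python
import subprocess

# Assumption: these HTTP status codes often indicate blocking or rate limiting.
BLOCK_STATUS_CODES = {403, 429}


def looks_blocked(status_code: int) -> bool:
    """Rough check on the HTTP status code of a response."""
    return status_code in BLOCK_STATUS_CODES


def renew_ip_address(run=subprocess.run) -> None:
    """Windows only: drop the current address, then ask for a new one."""
    run(["ipconfig", "/release"], check=False)  # disconnects the network
    run(["ipconfig", "/renew"], check=False)    # reconnects and requests an address
```

In a scraper you would call looks_blocked on each response's status code and call renew_ip_address() only when it returns True. The run parameter exists so the function can be tried with a stand-in instead of actually touching the network.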