I'm about to embark upon job search. To keep it in the background, I've written a python script that crawls Craigslist's job boards nationwide and emails me any updates via smtp. This job agent of sorts could be easily adapted for any other crawling and scraping task, so here's the full source. I'll go over parts of it in what follows.
To get started, we need to satisfy the dependencies: sudo easy_install the mechanize API to query and traverse Craigslist and the BeautifulSoup library to scale the document tree and select elements. In addition, I strip html tags with html2text and store the results in a buzhug table.
#Create/open existing database: mydb = Base('./mydb') try: mydb.open() # close the db at the end with mydb.close() except IOError: mydb.create(('PostingID',str), # stores Craigslist post id ('dt',datetime), # post date, a datetime object ('title',str), # post title ('location',str), # post location ('url',unicode), # post url ('body',unicode), # digest of post ('notified',bool) # True if record emailed )
We'll now create a browser instance
#...and set headers: br = mechanize.Browser() br.set_handle_equiv(True) br.set_handle_gzip(True) br.set_handle_redirect(True) br.set_handle_referer(True) br.set_handle_robots(False) br.addheaders = [('User-agent', 'Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.9.0.10) Gecko/2009042523 Ubuntu/9.04 (jaunty) Firefox/3.0.10')] # optionally set debug levels #br.set_debug_http(True) #br.set_debug_redirects(True) #br.set_debug_responses(True)
We're now ready to start at the root of Craigslist and traverse the links from there:
#...CRAIG_ROOT ="http://geo.craigslist.org/iso/us" root = br.open(CRAIG_ROOT) root_html=root.read() root_soup=BeautifulSoup(root_html) # get a list of site links under <div #list> --> get anchors and the location strings they enclose divblock = root_soup.findAll(id="list") anchors = divblock[0].findAll('a') locs=[] for a in anchors: try: locs.append(a.findChildren()[0].string.strip()) except: locs.append(a.string.strip())
We now have a list of (link,location) tuples:
directory= zip([l["href"] for l in anchors],locs)
What we'll do is follow each link in turn, click on the "arch / engineering" section, and submit the search form. My pseudocode is in fact not far from mechanize syntax:
#...here l is the root link for given Craigslist site (e.g. "http://sfbay.craigslist.org/") r=br.open(l) req=br.click_link(text="arch / engineering") br.open(req) # select the first form on the page try: br.select_form(nr=0) except: # if the form is not found (Craigslist's "scam alert" page), follow the link to the form req=br.click_link(text="Continue to arch / engineering job postings") br.open(req) br.select_form(nr=0) # perform search br.form['catAbbreviation']=['egr'] br.form['query'] = QUERY_STR #QUERY_STR could be 'structural -"mechanical engineer" -drafter -draftsman -someSpammerString' br.submit()
Let's use a regular expression to extract the post links:
# ...a typical link looks like: "*/egr/1560953114.html" urlre = re.compile('(/egr/([0-9]+).html)') all_links = [l for l in br.links(url_regex=urlre)]
We can now extract the PostingID (and title) from the link (postid = link.url.split('/')[-1].split('.')[0]) and query our table to see if the record exists.
#... If not, we follow the link with: br.follow_link(link) html=br.response().read() soup=BeautifulSoup(html) # Extract the date and url: dates=re.findall(r'Date:([ 0-9-,:A-Z]+)', html) if dates: dates=string.join(dates[0].strip().split()[:2],' ') d=datetime.strptime(dates, "%Y-%m-%d, %H:%M%p") else: d=date.today() # if Date couldn't be found for some reason (a recently expired post?) # fetch post url post_url=br.response().geturl()
Now let's gnaw at the body of the post. The problem here is the amount of redundancy in Craigslist posts. We could write a custom filter, e.g. by matching sets of posts and setting some upper bar of overlap. On the other hand, some verbatim reposts might merely change the job title or some other essential detail that would be overlooked as a result of overzealous screening. Accordingly, I chose to rely on the search string instead. With some tweaking, this filters out the more aggressive spammers.
# ...Let's fetch the first N words (<tt>DIGEST_SIZE</tt>) from the post body and enter the record into our database: userbody=soup.findAll(id='userbody') if userbody: #if no body, don't bother processing userbody=unicode(userbody[0]) userbody=string.join(userbody.split(' ')[:DIGEST_SIZE],' ') userbody = html2text(userbody) # see if there's an exact same record in db (a repost?): recs = mydb.select(None, location = post_location, body = userbody) if len(recs)==0: # if no duplicates found, enter the record into table recid=mydb.insert(PostingID=postid, dt=d, title=ti, location=post_location, url=post_url, body=userbody, notified=False)
Finally, we can send an email update:
# ...using Google's or Yahoo's SMTP server recs=mydb.select_for_update(['dt','title','location','url','body','notified'],notified=False) logfile.write("\nThere's a total of %d records to email" %len(recs)) mesg='' for i in recs: mesg += "\n%s \nDate: %s \nTitle: %s (%s) \nDigest: %s" %(i.url.encode('utf-8'), i.dt.strftime("%A, %d %B %Y, %I:%M%p"), i.title,i.location, i.body.encode('utf-8')) mesg +="\n\n----------------------\n" encoding = 'iso-8859-15' msg = MIMEText(_text=mesg, _charset='charset=%s' % encoding) msg['subject'] = "Craigslist updates" sender = 'craigslistJobs.py <%s>' %("from.me@gmail.com") # substitute your gmail account msg['from'] = sender s = smtplib.SMTP('smtp.gmail.com',587) # gmail uses port 587 s.ehlo() s.starttls() s.ehlo() s.login("from.me@gmail.com","my password") s.sendmail(sender, "to.you@gmail.com", msg.as_string()) s.close() # if success, set notified=True mydb.update(recs, notified = True)
My script adds basic error logging and table maintenance (Craigslist job postings expire after 30 days). All that remains is to schedule a cron job!