Pass Values Into Scrapy Callback

I'm trying to get started crawling and scraping a website to disk, but I'm having trouble getting the callback function working as I would like. The code below will visit the start_url

Solution 1:

I'm not 100% sure, but I don't think you can rename the image files Scrapy downloads however you want; Scrapy handles the naming itself.

What you want to do looks like a job for CrawlSpider instead of Spider.

CrawlSpider follows every link it finds on every page recursively, and you can set rules on which pages you want to scrape. Here are the docs.
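For example, here is a minimal sketch (the spider name, domain, and parse_page callback are placeholders, not from the original question):

import scrapy
from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor

class SiteSpider(CrawlSpider):
    name = "site"
    allowed_domains = ["example.com"]
    start_urls = ["https://example.com/"]

    # Follow every link found on every page and hand each
    # response to parse_page for extraction
    rules = (
        Rule(LinkExtractor(), callback="parse_page", follow=True),
    )

    def parse_page(self, response):
        # Extract whatever fields you need with Scrapy selectors
        yield {
            "url": response.url,
            "title": response.css("title::text").get(),
        }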

If you are stubborn enough to keep Spider, you can use the meta attribute on requests to pass items (and save links in them) along to the callback.

for link in soup.find_all("a"):
    item = crawlerItem()
    item['url'] = response.urljoin(link.get('href'))
    # stash the item on the request so the callback can retrieve it
    request = scrapy.Request(item['url'], callback=self.scrape_page)
    request.meta['item'] = item
    yield request

To get the item back in the callback, just look it up on the response:

def scrape_page(self, response):
    item = response.meta['item']

In this specific example the passed item['url'] is redundant, as you can always get the current URL from response.url.
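So the callback could skip passing the URL along and fill it in on arrival instead, for example (a sketch reusing the scrape_page callback above):

def scrape_page(self, response):
    item = response.meta['item']
    # response.url is the URL of the page being parsed,
    # so there was no need to pass it through meta
    item['url'] = response.url
    yield item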

Also, it's a bad idea to use BeautifulSoup inside Scrapy, as it just slows you down; Scrapy's selectors are well developed to the extent that you don't need anything else to extract data!
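As an illustration (not from the original answer, and assuming a recent Scrapy version where .getall() is available), the BeautifulSoup loop above can be rewritten with Scrapy selectors alone:

for href in response.css("a::attr(href)").getall():
    item = crawlerItem()
    item['url'] = response.urljoin(href)
    # meta can also be passed directly to the Request constructor
    yield scrapy.Request(item['url'], callback=self.scrape_page,
                         meta={'item': item})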
