
Recently, I decided to contribute to open source software as a beginner programmer. As the official site says1:
Blog about Scrapy. Tell the world how you’re using Scrapy. This will help newcomers with more examples and will help the Scrapy project to increase its visibility.
Therefore, I decided to write another post about this great Python library.
Selector
Scrapy provides two methods to locate data: XPath and CSS expressions. XPath is supported by many libraries out of the box and is the de facto standard for extracting data from HTML. CSS expressions can even select pseudo-elements, which is a Scrapy-/Parsel-specific extension.2
The selector can be used once a response is given.
response.selector.xpath("…").get()     # returns a single element
response.selector.xpath("…").getall()  # returns a list
# or with a CSS selector
response.selector.css("…").get()       # or getall()
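
For a fuller picture, here is a minimal sketch of a spider that uses selectors inside its parse callback. The spider name, URL, and CSS expression are placeholders for illustration, not from the original post:

import scrapy


class ExampleSpider(scrapy.Spider):
    name = "example"
    # placeholder URL; point it at the site you actually want to scrape
    start_urls = ["https://example.com"]

    def parse(self, response):
        # response.css() and response.xpath() are shortcuts for response.selector.css() / .xpath()
        for li in response.css("li.item"):
            # relative selector: extract the text node of each matched <li>
            yield {"text": li.css("::text").get()}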
XPath
An XPath expression looks like //div/p. At first glance it may scare you, but it is just a path through the HTML DOM. Given the following HTML text:
<html>
  <body>
    <div>
      <ul>
        <li class="item">Dog</li>
        <li class="item">Cat</li>
      </ul>
    </div>
  </body>
</html>
The XPath of the li tag containing Dog is simply /html/body/div/ul/li[1]. It is like a path in the operating system. You can use a relative path as well: //ul/li[1].
You can get the text of the element by:
response.selector.xpath("//ul/li[1]/text()").get()
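
If you want to try this outside a spider, Scrapy's Selector can also be constructed from raw HTML text. A small sketch using the sample document above:

from scrapy.selector import Selector

html = """
<html>
  <body>
    <div>
      <ul>
        <li class="item">Dog</li>
        <li class="item">Cat</li>
      </ul>
    </div>
  </body>
</html>
"""

sel = Selector(text=html)
print(sel.xpath("//ul/li[1]/text()").get())   # Dog
print(sel.xpath("//ul/li/text()").getall())   # ['Dog', 'Cat']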
CSS expressions
The CSS selector can be handy when the HTML source names its CSS classes in a consistent way.
To get the text of an element, first select it by its class. Scrapy implements pseudo-elements which make selecting elements easier:

# select text with ::text
response.selector.css(".item::text").get()
# select attributes with ::attr(name), e.g. the class attribute of a li
response.selector.css("li::attr(class)").get()
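
Tried against the sample HTML from the XPath section (reusing the sel object from the earlier sketch), these pseudo-elements return:

print(sel.css(".item::text").getall())    # ['Dog', 'Cat']
print(sel.css("li::attr(class)").get())   # 'item'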
Copy the selector of an element
An easy way to obtain an XPath or CSS expression is through the browser. Chrome and Firefox both provide convenient functions to let you do this.
Press F12, right-click on the element, then in the Copy menu select XPath or CSS Path. You get an absolute path this way. Chrome even has a function to copy a relative path.
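
A handy way to check a copied expression is the Scrapy shell: open it on the page and paste the selector at the prompt. A small sketch, with the URL as a placeholder:

scrapy shell "https://example.com"
# at the interactive prompt, paste the copied expression:
response.xpath("/html/body/div/ul/li[1]/text()").get()
response.css("li.item::text").get()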
Final words
I was amazed by the functions provided by Scrapy. The authors of the package truly know what is essential for a web crawler. Its official documentation is also easy to understand; I highly recommend reading it.
How do you use Scrapy? You are very welcome to share in the comments.