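I'm building a web scraper with Node and Cheerio, and for one particular website I'm getting the following error (it only happens on this one website, not on any of the others I scrape).
It happens at a different place every time: sometimes url x throws the error, and other times url x is fine and a completely different URL fails: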
Error!: Error: socket hang up using [insert random URL, it's different every time]
Error: socket hang up
    at createHangUpError (http.js:1445:15)
    at Socket.socketOnEnd [as onend] (http.js:1541:23)
    at Socket.g (events.js:175:14)
    at Socket.EventEmitter.emit (events.js:117:20)
    at _stream_readable.js:910:16
    at process._tickCallback (node.js:415:13)
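This is very tricky to debug, and I don't really know where to start. To begin with, what is a socket hang up error? Is it a 404 error or something similar? Or does it just mean that the server refused the connection?
I can't find an explanation of this anywhere!
EDIT: Here's a sample of code that (sometimes) returns the error: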
function scrapeNexts(url, oncomplete) {
    request(url, function(err, resp, body) {
        if (err) {
            console.log("Uh-oh, ScrapeNexts Error!: " + err + " using " + url);
            errors.nexts.push(url);
            return; // body is undefined when the request fails, so skip parsing
        }
        $ = cheerio.load(body);
        // do stuff with the '$' cheerio content here
    });
}
There is no direct call to close the connection, but I'm using Node Request which (as far as I can tell) uses http.get, so this is not required. Correct me if I'm wrong!
EDIT 2: Here's an actual, in-use bit of code that is causing errors. prodURL and the other variables are mostly jQuery selectors that are defined earlier. This uses the async library for Node.
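function scrapeNexts(url, oncomplete) {
    request(url, function (err, resp, body) {
        if (err) {
            console.log("Uh-oh, ScrapeNexts Error!: " + err + " using " + url);
            errors.nexts.push(url);
            return; // body is undefined when the request fails, so skip parsing
        }
        async.series([
            // Parse the response body into a cheerio document.
            function (callback) {
                $ = cheerio.load(body);
                callback();
            },
            // Collect the product URLs, then hand the "next page" link back.
            function (callback) {
                $(prodURL).each(function () {
                    var theHref = $(this).attr('href');
                    urls.push(baseURL + theHref);
                });
                var next = $(next_select).first().attr('href');
                oncomplete(next);
                callback(); // signal async.series that this step is done
            }
        ]);
    });
}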
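If you're getting this error over an https connection and it happens instantly, it could be a problem setting up the SSL connection.
For me it was this issue: https://github.com/nodejs/node/issues/9845, but for you it could be something else. If the problem is with SSL, you should be able to reproduce it with the Node.js tls/ssl package by simply trying to connect to the domain.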
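As a rough sketch of that check, something like the following could be used to attempt a bare TLS handshake with Node's built-in tls module, without request or cheerio involved. The helper name checkTls, its default port and timeout, and the shape of the result object are assumptions made for illustration; they are not part of the original answer.

```javascript
// Hypothetical helper (not from the original answer): try a plain TLS
// handshake against a host and report whether it completed.
const tls = require('tls');

function checkTls(host, port, timeoutMs) {
  return new Promise(function (resolve) {
    let settled = false;
    const socket = tls.connect({ host: host, port: port || 443 });

    // Resolve only once, and always release the socket.
    function finish(result) {
      if (settled) return;
      settled = true;
      socket.destroy();
      resolve(result);
    }

    // 'secureConnect' fires once the TLS handshake has succeeded.
    socket.once('secureConnect', function () {
      finish({ ok: true, protocol: socket.getProtocol() });
    });
    // Handshake failures, resets and certificate problems surface here.
    socket.once('error', function (err) {
      finish({ ok: false, error: err });
    });
    // Fallback in case the socket closes without emitting an error.
    socket.once('close', function () {
      finish({ ok: false, error: new Error('closed before handshake completed') });
    });
    // Give up if nothing happens within the timeout (assumed default: 5s).
    socket.setTimeout(timeoutMs || 5000, function () {
      finish({ ok: false, error: new Error('timed out') });
    });
  });
}

// Example usage (replace the placeholder domain with the site being scraped):
// checkTls('example.com').then(function (result) { console.log(result); });
```

If this fails immediately for the domain in question while other domains connect fine, that would point at the TLS setup rather than at the scraping code.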