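Recently, NetEase Cloud Music responded to antitrust policy by removing the "exclusive" label from its music, and as a result a lot of good songs can no longer be played there. I happened to be working through a number of web-scraping tutorials at the time, so I followed up on the problem and pulled together what I had learned. I originally meant to optimize the approach, but the problem turned out to be fairly complicated, so in the end I took a shortcut and settled on a simple method.

First we need to find the song's id. If you only want to download a single song, say 静茹-勇气, there is no need for a crawler at all. Note, however, that the pages NetEase returns load their data dynamically with JavaScript, so the HTML a crawler fetches differs in content and structure from the DOM the browser ends up with. In particular, the search results page as fetched by a plain crawler contains no song ids at all. So how do we handle this?

The solution is to use Selenium with a headless browser. If we need to fetch data in bulk, we must also go through a proxy; that is standard practice for any crawler. Since we are using Python, things should stay simple: I prefer to avoid complicated machinery and take the easy route whenever one exists, so for the proxy we will just use the simplest option, dynamic forwarding.

- from selenium import webdriver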
- import string
- import zipfile
- # Proxy server (product site: www.16yun.cn)
- proxyHost = "t.16yun.cn"
- proxyPort = "31111"
- # Proxy credentials
- proxyUser = "username"
- proxyPass = "password"
- def create_proxy_auth_extension(proxy_host, proxy_port,
-                                 proxy_username, proxy_password,
-                                 scheme='http', plugin_path=None):
-     # Build a small Chrome extension (zip) that sets the proxy and answers its auth prompt
-     if plugin_path is None:
-         plugin_path = r'D:/{}_{}@t.16yun.zip'.format(proxy_username, proxy_password)
-     manifest_json = """
-     {
-         "version": "1.0.0",
-         "manifest_version": 2,
-         "name": "16YUN Proxy",
-         "permissions": [
-             "proxy",
-             "tabs",
-             "unlimitedStorage",
-             "storage",
-             "<all_urls>",
-             "webRequest",
-             "webRequestBlocking"
-         ],
-         "background": {
-             "scripts": ["background.js"]
-         },
-         "minimum_chrome_version":"22.0.0"
-     }
-     """
-     background_js = string.Template(
-         """
-         var config = {
-             mode: "fixed_servers",
-             rules: {
-                 singleProxy: {
-                     scheme: "${scheme}",
-                     host: "${host}",
-                     port: parseInt(${port})
-                 },
-                 bypassList: ["foobar.com"]
-             }
-         };
-         chrome.proxy.settings.set({value: config, scope: "regular"}, function() {});
-         function callbackFn(details) {
-             return {
-                 authCredentials: {
-                     username: "${username}",
-                     password: "${password}"
-                 }
-             };
-         }
-         chrome.webRequest.onAuthRequired.addListener(
-             callbackFn,
-             {urls: ["<all_urls>"]},
-             ['blocking']
-         );
-         """
-     ).substitute(
-         host=proxy_host,
-         port=proxy_port,
-         username=proxy_username,
-         password=proxy_password,
-         scheme=scheme,
-     )
-     # Write both files into the extension archive
-     with zipfile.ZipFile(plugin_path, 'w') as zp:
-         zp.writestr("manifest.json", manifest_json)
-         zp.writestr("background.js", background_js)
-     return plugin_path
- proxy_auth_plugin_path = create_proxy_auth_extension(
-     proxy_host=proxyHost,
-     proxy_port=proxyPort,
-     proxy_username=proxyUser,
-     proxy_password=proxyPass)
- option = webdriver.ChromeOptions()
- option.add_argument("--start-maximized")
- # If Chrome reports a chrome-extensions error, try:
- # option.add_argument("--disable-extensions")
- option.add_extension(proxy_auth_plugin_path)
- # Hide some of the flags that reveal webdriver automation
- # option.add_experimental_option('excludeSwitches', ['enable-automation'])
- driver = webdriver.Chrome(options=option)
- # Override the getter for navigator.webdriver
- # script = '''
- # Object.defineProperty(navigator, 'webdriver', {
- #     get: () => undefined
- # })
- # '''
- # driver.execute_cdp_cmd("Page.addScriptToEvaluateOnNewDocument", {"source": script})
- driver.get("https://music.163.com/#/song?id=557581284")
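Since everything here hinges on the song id, it can help to pull that id back out of a URL like the one loaded above. The following is only a minimal sketch: `extract_song_id` is a hypothetical helper that is not part of the original script, and it assumes the id always appears as an `id=` parameter, either in the normal query string or inside the `#/song?id=...` fragment.

```python
from typing import Optional
from urllib.parse import parse_qs, urlparse


def extract_song_id(url: str) -> Optional[str]:
    """Return the numeric `id` parameter of a song URL, or None if absent.

    Hypothetical helper for illustration; not part of the original post.
    """
    parts = urlparse(url)
    # Hash-routed URLs (https://music.163.com/#/song?id=...) carry the query
    # string inside the fragment, so the fragment is parsed as a URL as well.
    candidates = [parts.query, urlparse(parts.fragment).query]
    for query in candidates:
        ids = parse_qs(query).get("id")
        if ids and ids[0].isdigit():
            return ids[0]
    return None


if __name__ == "__main__":
    print(extract_song_id("https://music.163.com/#/song?id=557581284"))
```

With a live Selenium session, the same helper could be called as `extract_song_id(driver.current_url)` once the page has loaded.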