A Selenium-based crawler (chromedriver) in a Java environment
First, the use case:

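Selenium is a widely used framework for automated web testing. My use case was this: the project's crawler was extended to cover a new site. The site still uses the same single sign-on as before, but when it later checks whether a request is logged in and authorized, it relies on cookies that front-end JavaScript adds dynamically. That logic generates cookies such as sessionId, using mechanisms such as https://github.com/broofa/node-uuid, and from what I observed it could not, at least in the short term, be reproduced in a back-end crawler that does not depend on a browser engine.
As a result, authentication failed when the crawler reused the cookie obtained from the existing single sign-on. If I copied the cookie carrying sessionId and the other values straight out of the browser, subsequent requests worked, but a copied cookie obviously expires after a while.
My solution was to use Selenium to obtain the generated cookie dynamically. Because the original project was already fairly bloated, I used Spring + CXF + Selenium and exposed the cookie as a REST service for the original httpclient + jsoup crawler to consume.
The end result met my expectations. Note that when combining Spring with Selenium, a guava library has to be removed from the pom.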
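On the consuming side, the httpclient + jsoup crawler has to turn whatever cookie string the service returns into something it can attach to its requests. The following is only a minimal sketch, assuming the cookies travel as `name=value` pairs joined in `Cookie` header form. The class `CookieHeaderSketch` and its methods are hypothetical, and the actual format produced by the service may differ.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.StringJoiner;

// Hypothetical helper: converts between a cookie map and a "Cookie" header value.
final class CookieHeaderSketch {

    // Joins name/value pairs as "a=1; b=2", the form used in a Cookie request header.
    static String toHeader(Map<String, String> cookies) {
        StringJoiner joiner = new StringJoiner("; ");
        for (Map.Entry<String, String> entry : cookies.entrySet()) {
            joiner.add(entry.getKey() + "=" + entry.getValue());
        }
        return joiner.toString();
    }

    // Splits a header value back into a map, keeping the original order.
    // Only the first '=' separates name from value, so values may contain '='.
    static Map<String, String> parseHeader(String header) {
        Map<String, String> cookies = new LinkedHashMap<>();
        if (header == null || header.trim().isEmpty()) {
            return cookies;
        }
        for (String part : header.split(";")) {
            String trimmed = part.trim();
            int eq = trimmed.indexOf('=');
            if (eq > 0) {
                cookies.put(trimmed.substring(0, eq), trimmed.substring(eq + 1));
            }
        }
        return cookies;
    }
}
```

A map built this way could then be handed to jsoup through `Connection.cookies(Map<String, String>)`, or the joined string could be set as a `Cookie` header on an httpclient request.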
An example microservice based on Apache CXF
Service interface and implementation:

import javax.ws.rs.GET;
import javax.ws.rs.POST;
import javax.ws.rs.Path;
import javax.ws.rs.PathParam;
import javax.ws.rs.Produces;

import org.openqa.selenium.By;
import org.openqa.selenium.JavascriptExecutor;
import org.openqa.selenium.WebDriver;
import org.openqa.selenium.WebElement;
import org.openqa.selenium.chrome.ChromeDriver;
import org.openqa.selenium.chrome.ChromeOptions;
import org.openqa.selenium.support.ui.ExpectedConditions;
import org.openqa.selenium.support.ui.WebDriverWait;
import org.springframework.stereotype.Service;
// XXXModel and Base64 come from the project itself; their imports are not shown in the original.

@Path("/r")
@Produces("application/json")
public interface XXXService {
    // headless: 0 shows the browser window, any other value runs Chrome headless
    @GET
    @Path("/{uid}/{password}/{headless}")
    public XXXModel get(@PathParam("uid") String uid,
                        @PathParam("password") String password,
                        @PathParam("headless") Integer headless);

    @POST
    public void post(XXXModel xxxModel);
}

@Service("xxxService")
public class XXXServiceImpl implements XXXService {

    @Override
    public XXXModel get(String uid, String password, Integer headless) {
        prepare(headless);
        String cookies = "";
        try {
            // the password arrives Base64-encoded in the URL path
            cookies = getCookies(uid, new String(Base64.decode(password)));
        } catch (Exception e) {
            e.printStackTrace();
        } finally {
            // always shut the browser down, even if the login failed
            itsdown();
        }
        return new XXXModel(cookies);
    }

    @Override
    public void post(XXXModel xxxModel) {
    }

    private String testUrl;
    private WebDriver driver;

    public void prepare(Integer headless) {
        // backslashes in Windows paths must be escaped in Java string literals
        System.setProperty(
                "webdriver.chrome.driver",
                "D:\\XXX\\chrome\\Chrome-bin\\chromedriver.exe");
        testUrl = "https://xxxx/login";
        ChromeOptions options = new ChromeOptions();
        options.setBinary("D:\\XXX\\chrome\\Chrome-bin\\chrome.exe");
        options.setHeadless(headless != 0);
        // let a failed browser start propagate instead of continuing with a null driver
        driver = new ChromeDriver(options);
        driver.get(testUrl);
    }

    public String getCookies(String uid, String password) {
        /*try {
            Thread.sleep(3000);
        } catch (InterruptedException e) {
            e.printStackTrace();
        }*/
        // wait up to 5 seconds for the login form instead of sleeping for a fixed time
        (new WebDriverWait(driver, 5)).until(
                ExpectedConditions.visibilityOfElementLocated(By.id("password"))
        );
        //WebElement uidE = driver.findElement(By.id("uid"));
        WebElement passwordE = driver.findElement(By.id("password"));
        JavascriptExecutor jsExecutor = (JavascriptExecutor) driver;
        try {
            //jsExecutor.executeScript("document.getElementById('password').setAttribute('value', '"+password+"')");
            passwordE.sendKeys(password);
            jsExecutor.executeScript("document.getElementById('uid').setAttribute('value', '" + uid + "')");
            jsExecutor.executeScript("submitForm()");
        } catch (Exception e) {
            e.printStackTrace();
        }
        // wait for an element that only appears after a successful login
        (new WebDriverWait(driver, 5)).until(
                ExpectedConditions.visibilityOfElementLocated(By.className("head_searchBtn"))
                /* new ExpectedCondition …
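For the calling side, the `GET /r/{uid}/{password}/{headless}` endpoint expects the password Base64-encoded and `headless` as an integer. The following is a rough sketch of how a caller might assemble that URL, not the original project's code. The class `CookieServiceUrlSketch`, the base URL used in the usage note, and the choice of `java.util.Base64` with percent-encoded path segments are all assumptions. The service's own `Base64` class is not identified in the listing.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;
import java.nio.charset.StandardCharsets;
import java.util.Base64;

// Hypothetical client-side helper for building the cookie service URL.
final class CookieServiceUrlSketch {

    // Builds <baseUrl>/r/<uid>/<base64(password)>/<0|1>.
    static String buildUrl(String baseUrl, String uid, String password, boolean headless) {
        // Assumption: the service decodes standard (not URL-safe) Base64.
        String encodedPassword = Base64.getEncoder()
                .encodeToString(password.getBytes(StandardCharsets.UTF_8));
        return baseUrl + "/r/" + encodeSegment(uid)
                + "/" + encodeSegment(encodedPassword)
                + "/" + (headless ? 1 : 0);
    }

    // Percent-encodes characters such as '/', '+' and '=' that standard Base64 can produce.
    private static String encodeSegment(String value) {
        try {
            return URLEncoder.encode(value, "UTF-8");
        } catch (UnsupportedEncodingException e) {
            // UTF-8 is always available on a conforming JVM
            throw new IllegalStateException(e);
        }
    }
}
```

For example, `CookieServiceUrlSketch.buildUrl("http://localhost:8080", "alice", "secret", true)` yields `http://localhost:8080/r/alice/c2VjcmV0/1`. Whether a server accepts an encoded slash (`%2F`) inside a path segment depends on its configuration, so a real deployment may need a different encoding or a POST body instead.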
References:
https://stackoverflow.com/questions/35776826/how-to-specify-the-chrome-binary-location-via-the-selenium-server-standalone-com
https://stackoverflow.com/questions/45500606/set-chrome-browser-binary-through-chromedriver-in-python
https://stackoverflow.com/questions/47396547/how-to-set-the-geo-location-through-code
https://stackoverflow.com/questions/22130109/cant-use-chrome-driver-for-selenium
https://stackoverflow.com/questions/20349844/how-chromedriverservice-is-useful-in-selenium-automation
https://webcache.googleusercontent.com/search?q=cache:9Q8V7fW2DrUJ:https://xiaojingjing.iteye.com/blog/2382701+&cd=1&hl=en&ct=clnk&gl=sg
https://webcache.googleusercontent.com/search?q=cache:rjkU_qxcMkQJ:https://zhuanlan.zhihu.com/p/30644530+&cd=10&hl=en&ct=clnk&gl=sg
https://stackoverflow.com/questions/49788257/what-is-default-location-of-chromedriver-and-for-installing-chrome-on-windows
https://stackoverflow.com/questions/16689426/how-to-set-google-chrome-in-webdriver
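Copyright notice: The content of this article was submitted by internet users, and the copyright belongs to the original author. This site does not own the copyright and assumes no corresponding legal liability. If you find content on this site that appears to be plagiarized or misrepresented, please contact us at jiasou666@gmail.com; once verified, this site will remove the infringing content within 24 hours.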