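A Case Study of an HBase MultiActionResultTooLarge Problem
Overview:
A user reported an HBase query problem. The queries are issued as get lists, and whenever a single get list holds more than 25 entries the query fails and the client returns MultiActionResultTooLarge.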
2020-09-09 16:33:00,607Z+0000|INFO|custom-tomcat-51||||Https| requestId=05a89c1b-cecf-4693-a8d6-a319b6621cff|com.xxx.xxx.xxx.hbase.HBaseOperations.get(HBaseOperations.java:429)|(1078409497)get batch rows, tableName: DETAIL.
2020-09-09 16:33:00,622Z+0000|WARN|hconnection-0x39b61605-shared--pool1-t8272||||||org.apache.hadoop.hbase.client.AsyncProcess$AsyncRequestFutureImpl.logNoResubmit(AsyncProcess.java:1313)|(1078409512)#1, table=DETAIL, attempt=1/1 failed=13ops, last exception: org.apache.hadoop.hbase.MultiActionResultTooLarge: org.apache.hadoop.hbase.MultiActionResultTooLarge: Max size exceeded CellSize: 132944 BlockSize: 109051904
Symptoms:
1. On the customer's HBase cluster, 69 rows were written and then queried with htable.get(list). If the list held more than 25 entries, the query hit a MultiActionResultTooLarge exception. After roughly one hour, lists larger than 25 no longer triggered the exception.
2. On first analysis, seeing the MultiActionResultTooLarge error, we assumed the BlockSize of the query had exceeded the 100 MB server-side threshold, which is controlled by the parameter hbase.server.scanner.max.result.size. However, the user reported that the whole table occupied only a few hundred KB of storage.
3. The customer's table schema is:
COLUMN FAMILIES DESCRIPTION
{NAME => 'CF1', BLOOMFILTER => 'ROW', VERSIONS => '1000', IN_MEMORY => 'false',
KEEP_DELETED_CELLS => 'FALSE', DATA_BLOCK_ENCODING => 'NONE', TTL => '2147472000 SECONDS (24855 DAYS)', COMPRESSION => 'SNAPPY',
MIN_VERSIONS => '0', BLOCKCACHE => 'true', BLOCKSIZE => '65536', REPLICATION_SCOPE => '0'}
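Analysis:
1. On the surface, the request fails with MultiActionResultTooLarge. The code that raises it fires when context.getResponseCellSize(), or context.getResponseBlockSize() plus context.getResponseExceptionSize(), exceeds the quota, and the quota here is the 100 MB configured by hbase.server.scanner.max.result.size. In the log above CellSize is only 132944 while BlockSize is 109051904 (about 104 MB), so it is the block-size term that crosses the limit: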
if (context != null
    && context.isRetryImmediatelySupported()
    && (context.getResponseCellSize() > maxQuotaResultSize
      || context.getResponseBlockSize() + context.getResponseExceptionSize()
      > maxQuotaResultSize)) {
  // We're storing the exception since the exception and reason string won't
  // change after the response size limit is reached.
  if (sizeIOE == null) {
    // We don't need the stack un-winding do don't throw the exception.
    // Throwing will kill the JVM's JIT.
    //
    // Instead just create the exception and then store it.
    sizeIOE = new MultiActionResultTooLarge("Max size exceeded"
        + " CellSize: " + context.getResponseCellSize()
        + " BlockSize: " + context.getResponseBlockSize());
    // Only report the exception once since there's only one request that
    // caused the exception. Otherwise this number will dominate the exceptions count.
    rpcServer.getMetrics().exception(sizeIOE);
  }
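To make the size condition concrete, here is a minimal sketch that applies the same comparison to the two numbers from the client log. It is not HBase code: the class and method names are hypothetical, the null and isRetryImmediatelySupported checks are left out, the exception size is not logged and is assumed to be 0, and the quota is assumed to be the 100 MB default mentioned above.

```java
public class QuotaCheckSketch {
    // Assumption: the quota equals the 100 MB value of
    // hbase.server.scanner.max.result.size described in the text.
    static final long MAX_QUOTA_RESULT_SIZE = 100L * 1024 * 1024;

    /** Same size comparison as the excerpt above, without the context checks. */
    static boolean exceedsQuota(long cellSize, long blockSize,
                                long exceptionSize, long maxQuota) {
        return cellSize > maxQuota || blockSize + exceptionSize > maxQuota;
    }

    public static void main(String[] args) {
        long cellSize = 132_944L;       // "CellSize" from the client log
        long blockSize = 109_051_904L;  // "BlockSize" from the client log
        long exceptionSize = 0L;        // not logged; assumed 0 for this sketch

        System.out.println("cell size over quota:  "
            + (cellSize > MAX_QUOTA_RESULT_SIZE));                  // false
        System.out.println("block size over quota: "
            + (blockSize + exceptionSize > MAX_QUOTA_RESULT_SIZE)); // true
        System.out.println("request rejected:      "
            + exceedsQuota(cellSize, blockSize, exceptionSize,
                           MAX_QUOTA_RESULT_SIZE));                 // true
    }
}
```

The cell data itself is tiny, which matches the user's statement that the table is only a few hundred KB. Only the accounted block size is over the limit.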
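2. Next we looked at why the accounted response size could exceed 100 MB. As the code below shows, for every cell in the query Result it adds the cell's estimated heap size to the response cell size, and it adds the length of the cell's backing value array to the response block size, unless that array is the same one seen for the previous cell.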
/**
 * Method to account for the size of retained cells and retained data blocks.
 * @return an object that represents the last referenced block from this response.
 */
Object addSize(RpcCallContext context, Result r, Object lastBlock) {
  if (context != null && r != null && !r.isEmpty()) {
    for (Cell c : r.rawCells()) {
      context.incrementResponseCellSize(CellUtil.estimatedHeapSizeOf(c));
      // We're using the last block being the same as the current block as
      // a proxy for pointing to a new block. This won't be exact.
      // If there are multiple gets that bounce back and forth
      // Then it's possible that this will over count the size of
      // referenced blocks. However it's better to over count and
      // use two RPC's than to OOME the RegionServer.
      byte[] valueArray = c.getValueArray();
      if (valueArray != lastBlock) {
        context.incrementResponseBlockSize(valueArray.length);
        lastBlock = valueArray;
      }
    }
  }
  return lastBlock;
}
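The over-counting that the comment in addSize warns about can be reproduced outside HBase. The sketch below is a stand-alone simulation, not HBase code: FakeCell and accountedBlockSize are hypothetical names, and the 2 MiB backing-array size is an assumption (we believe it is the default MSLAB chunk size, but the original analysis does not confirm it). The pattern of 52 cells bouncing between two arrays is made up, chosen only because it lands on the BlockSize from the log. The real access pattern is unknown.

```java
import java.util.ArrayList;
import java.util.List;

public class BlockSizeAccountingSketch {
    /** Hypothetical stand-in for a Cell: only the backing array matters here. */
    static final class FakeCell {
        final byte[] backingArray;
        FakeCell(byte[] backingArray) { this.backingArray = backingArray; }
    }

    /** Same rule as addSize: count the whole backing array whenever it
     *  differs (by reference) from the one seen for the previous cell. */
    static long accountedBlockSize(List<FakeCell> cells) {
        long blockSize = 0;
        byte[] lastBlock = null;
        for (FakeCell c : cells) {
            if (c.backingArray != lastBlock) {
                blockSize += c.backingArray.length;
                lastBlock = c.backingArray;
            }
        }
        return blockSize;
    }

    public static void main(String[] args) {
        int chunk = 2 * 1024 * 1024;   // assumed backing-array size (2 MiB)
        byte[] a = new byte[chunk];
        byte[] b = new byte[chunk];

        // 52 cells that alternate between the two arrays: every cell counts.
        List<FakeCell> alternating = new ArrayList<>();
        for (int i = 0; i < 52; i++) {
            alternating.add(new FakeCell(i % 2 == 0 ? a : b));
        }
        // The same 52 cells grouped by array: each array counts once.
        List<FakeCell> grouped = new ArrayList<>();
        for (int i = 0; i < 52; i++) {
            grouped.add(new FakeCell(i < 26 ? a : b));
        }

        System.out.println(accountedBlockSize(alternating)); // 109051904
        System.out.println(accountedBlockSize(grouped));     // 4194304
    }
}
```

Only 4 MiB of memory is actually referenced in both cases, yet the alternating order is accounted as roughly 104 MB. This is consistent with a tiny table tripping a 100 MB limit, although it does not prove that this is what happened on the customer's cluster.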
3. At this point we suspected that the large number of versions on the user's table was the cause. The user confirmed that their workload does repeatedly update the same row and that the table's VERSIONS is 1000, but the problem persisted after VERSIONS was changed from 1000 to 1. In testing, queries recovered once the data was flushed manually, so we suspected that the problematic data had not yet been persisted to HDFS and was still in the WAL or the memstore.
4. Searching the community for the keyword "MultiActionResultTooLarge" led us to https://issues.apache.org/jira/browse/HBASE-23158, whose symptoms are exactly the same as the problem we hit. At the time of writing the issue is in the Unresolved state. The check is a code-level guard that HBase added to protect against big scans, and the issue notes that while a cell is still in the memstore, the array measured by this code can be very large.
5. Get lists are a common operation, so this guard alone should not make them fail every time. Looking at the test patch attached to the issue, we noticed that it lowers the client retry count in order to reproduce the problem. Going back to the customer's error log, we saw that the number of attempts was only 1. Once we raised the retry count slightly, the problem no longer occurred.
org.apache.hadoop.hbase.client.AsyncProcess$AsyncRequestFutureImpl.logNoResubmit(AsyncProcess.java:1313)|(1078409512)#1, table=DETAIL, attempt=1/1 failed=13ops, last exception: org.apache.hadoop.hbase.MultiActionResultTooLarge
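The workaround is to raise the client retry count slightly. When only a single attempt is allowed, the client does not resend the request to the server after hitting certain exceptions, which easily leads to intermittent failures. As for the failure that occurs when only one attempt is allowed, that needs the HBase community to work out a good fix together.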
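Why one extra attempt is enough can be illustrated with a small stand-alone simulation. It is a sketch under stated assumptions, not the HBase client or server: serveOnce and failedAfter are hypothetical names, the server is modelled as serving gets until the accounted size passes the quota and failing the rest, and the client is modelled as resubmitting only the failed gets in a fresh request. The list size of 39 and the 4 MiB accounted per get are made-up numbers, chosen so that the first attempt ends with 13 failed operations as in the log. In a real client the attempt limit would typically come from a setting such as hbase.client.retries.number, which the analysis above does not name.

```java
import java.util.ArrayList;
import java.util.List;

public class RetrySketch {
    /** Hypothetical server: serves actions until the accounted size has
     *  passed the quota, then fails every remaining action in the request. */
    static List<Integer> serveOnce(List<Integer> actions,
                                   long costPerAction, long quota) {
        List<Integer> failed = new ArrayList<>();
        long accounted = 0;              // a fresh counter for each request
        for (int action : actions) {
            if (accounted > quota) {
                failed.add(action);      // would carry MultiActionResultTooLarge
            } else {
                accounted += costPerAction;
            }
        }
        return failed;
    }

    /** Hypothetical client: resubmits only the failed actions,
     *  for at most maxAttempts requests in total. */
    static int failedAfter(int actionCount, int maxAttempts,
                           long costPerAction, long quota) {
        List<Integer> pending = new ArrayList<>();
        for (int i = 0; i < actionCount; i++) {
            pending.add(i);
        }
        for (int attempt = 1;
             attempt <= maxAttempts && !pending.isEmpty(); attempt++) {
            pending = serveOnce(pending, costPerAction, quota);
        }
        return pending.size();
    }

    public static void main(String[] args) {
        long quota = 100L * 1024 * 1024;  // 100 MB, as described in the text
        long cost = 4L * 1024 * 1024;     // assumed accounted bytes per get

        System.out.println(failedAfter(39, 1, cost, quota)); // 13
        System.out.println(failedAfter(39, 2, cost, quota)); // 0
    }
}
```

With a single attempt the 13 leftover gets surface as an error. With two attempts they fit comfortably into the second request, because the size accounting starts from zero for each request.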