[Spark API] countApprox, countApproxDistinct, countApproxDistinctByKey

User contribution · 871 · 2025-04-04

/**
 * Approximate version of count() that returns a potentially incomplete result
 * within a timeout, even if not all tasks have finished.
 *
 * The confidence is the probability that the error bounds of the result will
 * contain the true value. That is, if countApprox were called repeatedly
 * with confidence 0.9, we would expect 90% of the results to contain the
 * true count. The confidence must be in the range [0,1] or an exception will
 * be thrown.
 *
 * @param timeout maximum time to wait for the job, in milliseconds
 * @param confidence the desired statistical confidence in the result
 * @return a potentially incomplete result, with error bounds
 */


Approximate version of count(): it returns a potentially incomplete result within the given timeout, even if not all tasks have finished.

The confidence is the probability that the error bounds of the result contain the true value. That is, if countApprox were called repeatedly with confidence 0.9, we would expect 90% of the results to contain the true count. The confidence must lie in the range [0, 1], or an exception will be thrown.

@param timeout: the maximum time to wait for the job, in milliseconds


@param confidence: the desired statistical confidence in the result

@return: a potentially incomplete result, with error bounds

// java
public static PartialResult<BoundedDouble> countApprox(long timeout, double confidence)
public static PartialResult<BoundedDouble> countApprox(long timeout)
// scala
def countApprox(timeout: Long): PartialResult[BoundedDouble]
def countApprox(timeout: Long, confidence: Double): PartialResult[BoundedDouble]
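To build intuition for what a PartialResult with error bounds means, here is a hypothetical sketch (this is not Spark's internal CountEvaluator): counts from the partitions that finished in time are extrapolated to all partitions, with a crude confidence interval from the sample spread. All names and the z-score choice are assumptions for illustration.

```java
import java.util.List;

// Illustrative only: extrapolate a partial count to a bounded estimate,
// the way countApprox conceptually turns finished tasks into [low, high] bounds.
public class BoundedCountSketch {
    /** Returns {low, estimate, high} from counts of finished partitions. */
    static double[] estimate(List<Long> finished, int totalPartitions) {
        long sum = 0;
        for (long c : finished) sum += c;
        double mean = (double) sum / finished.size();   // avg count per finished partition
        double est = mean * totalPartitions;            // scale up to all partitions
        double var = 0;
        for (long c : finished) var += (c - mean) * (c - mean);
        double sd = Math.sqrt(var / finished.size());
        // ~90% interval using a normal approximation (z = 1.645); crude on purpose
        double margin = 1.645 * sd * totalPartitions / Math.sqrt(finished.size());
        return new double[]{est - margin, est, est + margin};
    }

    public static void main(String[] args) {
        // 3 of 4 partitions finished with these per-partition counts
        double[] r = estimate(List.of(10L, 12L, 11L), 4);
        System.out.printf("low=%.1f mean=%.1f high=%.1f%n", r[0], r[1], r[2]);
    }
}
```

With confidence 0.9 the real API promises only that roughly 90% of such intervals contain the true count; a wider interval (higher confidence) is the price of stopping early.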

/**
 * Return approximate number of distinct elements in the RDD.
 *
 * The algorithm used is based on streamlib's implementation of "HyperLogLog in Practice:
 * Algorithmic Engineering of a State of The Art Cardinality Estimation Algorithm", available
 * here.
 *
 * @param relativeSD Relative accuracy. Smaller values create counters that require more space.
 *                   It must be greater than 0.000017.
 */

Returns the approximate number of distinct elements in the RDD.

The algorithm used is based on streamlib's implementation of "HyperLogLog in Practice: Algorithmic Engineering of a State of The Art Cardinality Estimation Algorithm".

@param relativeSD: relative accuracy. Smaller values create counters that require more space. It must be greater than 0.000017.

// java
public static long countApproxDistinct(double relativeSD)
// scala
def countApproxDistinct(relativeSD: Double): Long
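The connection between relativeSD and space can be made concrete. With m = 2^p HyperLogLog registers, the standard error is roughly 1.054 / sqrt(m), so the precision p is obtained by inverting that relation. The exact constant and rounding below are an assumption for illustration, not Spark's verbatim source:

```java
// Sketch: derive HyperLogLog precision p (registers = 2^p) from relativeSD,
// using the standard-error model relativeSD ≈ 1.054 / sqrt(2^p).
public class HllPrecision {
    static int precision(double relativeSD) {
        // invert relativeSD = 1.054 / sqrt(2^p)  =>  p = 2 * log2(1.054 / relativeSD)
        int p = (int) Math.ceil(2.0 * Math.log(1.054 / relativeSD) / Math.log(2.0));
        return Math.max(p, 4); // keep at least a handful of registers
    }

    public static void main(String[] args) {
        for (double sd : new double[]{0.1, 0.05, 0.01, 0.001}) {
            int p = precision(sd);
            System.out.printf("relativeSD=%.3f -> p=%d (%d registers)%n", sd, p, 1 << p);
        }
    }
}
```

Under this model, relativeSD = 0.01 already needs 2^14 = 16384 registers, which is why smaller values "create counters that require more space". It also explains the documented lower bound: at p = 32 the achievable error is about 1.054 / sqrt(2^32) ≈ 0.000016, so relativeSD must stay above 0.000017.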

import com.google.common.collect.Lists;
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.JavaSparkContext;
import scala.Tuple2;

public class CountApproxDistinct {
    public static void main(String[] args) {
        System.setProperty("hadoop.home.dir", "E:\\hadoop-2.7.1");
        SparkConf sparkConf = new SparkConf().setMaster("local").setAppName("Spark_DEMO");
        JavaSparkContext sc = new JavaSparkContext(sparkConf);
        // Example 1: six distinct tuples, estimated at two accuracies
        JavaPairRDD<String, String> javaPairRDD1 = sc.parallelizePairs(Lists.newArrayList(
                new Tuple2<>("cat", "11"), new Tuple2<>("dog", "22"),
                new Tuple2<>("cat", "33"), new Tuple2<>("pig", "44"),
                new Tuple2<>("duck", "55"), new Tuple2<>("cat", "66")), 3);
        System.out.println(javaPairRDD1.countApproxDistinct(0.1));
        System.out.println(javaPairRDD1.countApproxDistinct(0.001));
        sc.stop();
    }
}

5
19/03/20 15:56:03 INFO DAGScheduler: Job 0 finished: countApproxDistinct at CountApproxDistinct.java:23, took 0.773368 s
19/03/20 15:56:03 INFO SparkContext: Starting job: countApproxDistinct at CountApproxDistinct.java:24
19/03/20 15:56:03 INFO DAGScheduler: Got job 1 (countApproxDistinct at CountApproxDistinct.java:24) with 3 output partitions
19/03/20 15:56:03 INFO DAGScheduler: ResultStage 1 (countApproxDistinct at CountApproxDistinct.java:24) finished in 0.469 s
19/03/20 15:56:03 INFO DAGScheduler: Job 1 finished: countApproxDistinct at CountApproxDistinct.java:24, took 0.521162 s
6
19/03/20 15:56:03 INFO SparkContext: Invoking stop() from shutdown hook

/**
 * Return approximate number of distinct values for each key in this RDD.
 */

Returns the approximate number of distinct values for each key in this RDD.

It applies to key-value (tuple) RDDs and is similar to countApproxDistinct, but the return type differs: for each key it estimates the number of distinct values associated with that key, and the result pairs each key with that estimated count.

The parameter relativeSD controls the accuracy of the estimate; smaller values mean higher accuracy (at the cost of more space).

// java
public JavaPairRDD<K, Long> countApproxDistinctByKey(double relativeSD, Partitioner partitioner)
public JavaPairRDD<K, Long> countApproxDistinctByKey(double relativeSD, int numPartitions)
public JavaPairRDD<K, Long> countApproxDistinctByKey(double relativeSD)
// scala
def countApproxDistinctByKey(relativeSD: Double): RDD[(K, Long)]
def countApproxDistinctByKey(relativeSD: Double, numPartitions: Int): RDD[(K, Long)]
def countApproxDistinctByKey(relativeSD: Double, partitioner: Partitioner): RDD[(K, Long)]
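For intuition, here is the exact computation that countApproxDistinctByKey approximates, written with plain collections (a hypothetical reference implementation, not Spark code). Spark replaces the per-key HashSet with a small HyperLogLog sketch so memory per key stays bounded:

```java
import java.util.*;

// Exact distinct-values-per-key: what the approximate API estimates.
public class ExactDistinctByKey {
    static Map<String, Integer> distinctByKey(List<Map.Entry<String, String>> pairs) {
        Map<String, Set<String>> seen = new HashMap<>();
        for (Map.Entry<String, String> p : pairs) {
            // collect the distinct values observed under each key
            seen.computeIfAbsent(p.getKey(), k -> new HashSet<>()).add(p.getValue());
        }
        Map<String, Integer> counts = new HashMap<>();
        seen.forEach((k, v) -> counts.put(k, v.size()));
        return counts;
    }

    public static void main(String[] args) {
        List<Map.Entry<String, String>> data = List.of(
            Map.entry("cat", "11"), Map.entry("dog", "22"), Map.entry("cat", "33"),
            Map.entry("pig", "44"), Map.entry("duck", "55"), Map.entry("cat", "66"));
        System.out.println(distinctByKey(data)); // cat maps to 3 distinct values
    }
}
```

On the same six tuples used in the example below, "cat" has three distinct values and the other keys one each, which matches the (cat,3), (dog,1), (pig,1), (duck,1) output of the approximate version at small relativeSD.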

import com.google.common.collect.Lists;
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.api.java.function.VoidFunction;
import scala.Tuple2;

public class CountApproxDistinctByKey {
    public static void main(String[] args) {
        System.setProperty("hadoop.home.dir", "E:\\hadoop-2.7.1");
        SparkConf sparkConf = new SparkConf().setMaster("local").setAppName("Spark_DEMO");
        JavaSparkContext sc = new JavaSparkContext(sparkConf);
        JavaPairRDD<String, String> javaPairRDD1 = sc.parallelizePairs(Lists.newArrayList(
                new Tuple2<>("cat", "11"), new Tuple2<>("dog", "22"),
                new Tuple2<>("cat", "33"), new Tuple2<>("pig", "44"),
                new Tuple2<>("duck", "55"), new Tuple2<>("cat", "66")), 3);
        JavaPairRDD<String, Long> javaPairRDD = javaPairRDD1.countApproxDistinctByKey(0.01);
        javaPairRDD.foreach(new VoidFunction<Tuple2<String, Long>>() {
            public void call(Tuple2<String, Long> stringLongTuple2) throws Exception {
                System.out.println(stringLongTuple2);
            }
        });
        sc.stop();
    }
}

19/03/20 16:09:48 INFO Executor: Running task 2.0 in stage 3.0 (TID 11)
(duck,1)
(cat,3)
19/03/20 16:09:48 INFO ShuffleBlockFetcherIterator: Getting 2 non-empty blocks out of 3 blocks
19/03/20 16:09:48 INFO ShuffleBlockFetcherIterator: Started 0 remote fetches in 0 ms
19/03/20 16:09:48 INFO Executor: Finished task 2.0 in stage 3.0 (TID 11). 1009 bytes result sent to driver
19/03/20 16:09:48 INFO TaskSetManager: Finished task 2.0 in stage 3.0 (TID 11) in 15 ms on localhost (executor driver) (3/3)
19/03/20 16:09:48 INFO TaskSchedulerImpl: Removed TaskSet 3.0, whose tasks have all completed, from pool
19/03/20 16:09:48 INFO DAGScheduler: ResultStage 3 (foreach at CountApproxDistinct.java:28) finished in 0.062 s
19/03/20 16:09:48 INFO DAGScheduler: Job 2 finished: foreach at CountApproxDistinct.java:28, took 0.317332 s
(dog,1)
(pig,1)


Copyright notice: this article was contributed by a web user, and the copyright belongs to the original author; this site does not own the copyright and assumes no corresponding legal liability. If you find content on this site that appears plagiarized or inaccurate, please contact jiasou666@gmail.com; after verification, the infringing content will be removed within 24 hours.

