2025-04-02
Journeys in big data statistics
School of Mathematical Sciences, University of Nottingham, UK
Abstract
The realm of big data is a very wide and varied one. We discuss old, new, small and big data, with some of the important challenges including dealing with highly-structured and object-oriented data. In many applications the objective is to discern patterns and learn from large datasets of historical data. We shall discuss such issues in some transportation network applications in non-academic settings, which are naturally applicable to other situations. Vital aspects include dealing with logistics, coding and choosing appropriate statistical methodology, and we provide a summary and checklist for wider implementation.
Keywords: Big data; Object-oriented data; Transport; Networks
1. A new natural resource
We will be the first to admit that it is difficult to keep up. How can you expect someone who is trained in dealing with datasets of n = 30 observations with p = 3 variables to suddenly cope with a 100 K-fold increase of n = 3 000 000 observations and p = 300 000 for example, or even worse? Everything has to change. Summarizing a dataset becomes a major computational challenge and p-values take on a ludicrous role where everything is significant. Yet dealing with a wide range of sizes of datasets has become vital for the modern statistician.
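As a minimal sketch of why p-values take on such a different role at large n, consider a two-sided z-test of a sample mean against zero. The effect size of 0.01 standard deviations and the function name below are illustrative assumptions, not values from any dataset discussed here.

```python
import math

def two_sided_p_value(mean_diff, sd, n):
    """Two-sided z-test p-value for a sample mean that differs from zero."""
    z = mean_diff / (sd / math.sqrt(n))
    # For a standard normal Z, P(|Z| > |z|) equals erfc(|z| / sqrt(2)).
    return math.erfc(abs(z) / math.sqrt(2))

# Hypothetical, practically negligible effect: 0.01 standard deviations.
p_small_n = two_sided_p_value(0.01, 1.0, 30)
p_large_n = two_sided_p_value(0.01, 1.0, 3_000_000)

print(f"n = 30:        p = {p_small_n:.3f}")
print(f"n = 3,000,000: p = {p_large_n:.3g}")
```

The same tiny effect is nowhere near significant with 30 observations and overwhelmingly significant with three million, which is why statistical significance alone says little about practical importance at this scale.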
Virginia Rometty, chairman, president and chief executive officer of IBM said the following at Northwestern University’s 157th commencement ceremony in 2015: What steam was to the 18th century, electricity to the 19th and hydrocarbons to the 20th, data will be to the 21st century. That’s why I call data a new natural resource.
The need to make sense of the huge rich seams of data being produced underlines the great importance of Statistics, Mathematical and Computational Sciences in today’s society. But what is ‘new’ about data? Data has been used for centuries, for example data collected on the first bloom of cherry blossoms in Kyoto, Japan starting in 800AD and now highlighting climate change (Aono, 2017); Gauss’ meridian arc measurements in 1799 used to define the metre (Stigler, 1981); and Florence Nightingale’s 1859 mortality data and graphical rose diagram presentation on causes of death in the Crimean War leading to modern nursing practice (Nightingale, 1859). All of these old, small datasets are at the core of important issues for mankind, so it is not the data or its importance but the size, structure and ubiquity of data that is new.
Many of the challenges in the new world of Statistics in the Age of Big Data are of a different nature from traditional scenarios. Statisticians are used to dealing with bias and uncertainty, but how can this be handled when datasets are so large and collected in the wild without traditional sampling protocols? What to do with all the data is an important question. The last 20 years have seen an explosion of statistical methodology to handle large p, often with sparsity assumptions (Hastie et al., 2015). Large n used to be the realm of careful asymptotic theory or thought experiments, but in reality one often does encounter large n now in practice.
Two possible routes to practical inference are conditioning and sampling. Conditioning on a small window of values of a subset of covariates will very quickly reduce the size of data available as the number of covariates increases, due to the curse of dimensionality. Such small subsets of the dataset can be used to estimate predictive distributions conditional on the values of the covariates, leading to useful predictions. We give some further detail below in a case study from the transport industry. Sampling sensibly on the other hand is a more difficult task. Although it is straightforward to sample at random of course, given the inherent biases in most big data one needs to carry out sampling to counteract the bias in the data collection.
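The effect of conditioning on a window of covariate values can be illustrated with a small simulation. This is only a sketch under simplifying assumptions: the covariates are taken to be independent and uniform on [0, 1], the window is centred at 0.5, and the function name is hypothetical. Under those assumptions a window of width 0.2 in each of d covariates retains roughly a fraction 0.2^d of the data.

```python
import random

def count_in_window(n, d, half_width, seed=0):
    """Count simulated observations whose d covariates all fall in the window."""
    rng = random.Random(seed)
    kept = 0
    for _ in range(n):
        # Keep the observation only if every covariate is within
        # half_width of the window centre 0.5.
        if all(abs(rng.random() - 0.5) <= half_width for _ in range(d)):
            kept += 1
    return kept

n = 100_000
for d in (1, 2, 4, 8):
    kept = count_in_window(n, d, half_width=0.1)
    expected = n * 0.2 ** d
    print(f"d = {d}: kept {kept} of {n} (expected about {expected:.1f})")
```

Even with 100,000 observations, conditioning on eight covariates leaves essentially nothing to estimate a predictive distribution from, so in practice the conditioning set has to be chosen with care.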
A further aspect of the avalanche of new data being available is that it is often highly-structured. For example, large quantities of medical images are routinely collected each day in hospitals around the world, each containing highly complicated structured information. The emerging area of Object Oriented Data Analysis (Marron and Alonso, 2014) provides a new way of thinking of statistical analysis for such data. Examples of object data include functions, images, shapes, manifolds, dynamical systems, and trees. The main aims of multivariate analysis extend more generally to object data, e.g. defining a distance between objects, estimation of a mean, summarizing variability, reducing dimension to important components, specifying distributions of objects, carrying out hypothesis tests, prediction, classification and clustering.
From Marron and Alonso (2014), in any study an important consideration is to decide what are the atoms (most basic parts) of the data. A key question is ‘what should be the data objects?’, and the answer will then lead to appropriate methodology for statistical analysis. The subject is fast developing following initial definitions in Wang and Marron (2007), and a recent summary with discussion is given by Marron and Alonso (2014) with applications to Spanish human mortality functional data, shapes, trees and medical images. One of the key aspects of object data analysis is that registration of the objects must be considered as part of the analysis. In addition the identifiability of models, choice of regularization and whether to marginalize or optimize as part of the inference are important aspects of object data analysis, as they are in statistical shape analysis (Dryden and Mardia, 2016).
It is obvious that the realm of big data is a very wide and varied one. In some realms the difficulties lie with truly astronomical quantities of data which are not even feasibly stored for future retrieval, for which online algorithm development is a key area of research; whereas in other realms the challenge is in discerning patterns and learning from large datasets of historical data. We shall discuss the latter, in generality, below for what can loosely be thought of as transportation network applications in non-academic settings. Many of the approaches and recommendations discussed below are naturally applicable to other applications, such as a general practice of data retention, while others related to origin–destination filtering are clearly more specific to transportation problems.
2. Case study: transportation big data
The classification of problems into different areas of interest can be greatly beneficial in allowing techniques of particular relevance to all problems in a particular area to be discussed as one. The contemporary challenge we shall now discuss surrounds the use of statistics in real-world infrastructure problems that can arise for public or mass transportation, such as train travel, bus travel, or similar networked transportation methods. Collaborations between universities and businesses up and down the country already exist, and will continue to grow in the coming years for trying to share best practices and perform statistical analysis on datasets harvested by businesses about their customers, to either improve customer experience or to improve business efficiency. We concern ourselves here with the challenges one will meet in embedding good practice and developing useful models for exploitation of data in businesses where perhaps even the initial data handling task has so far seemed daunting.
Studying transportation systems as networked queues has been one of the most natural approaches, born out of the queueing theory literature of previous decades. Courtesy of advances in computing, attempts are now made to ‘solve’, or at least approximately solve, larger and larger network problems. Much of the focus in recent years lies with proposing online algorithms for live traffic management. With big data, opportunities arise to try and optimize these local dynamic decision problems: of re-routing a vehicle; skipping stops (if permitted); or allocating platforms, all in light of a wealth of additional statistical information. Approaches to dynamic resource allocation laid out in Glazebrook et al. (2014) would often benefit from a serious statistical analysis to first properly understand the dynamics of a network-based model, so that when formulating the problem in a queueing framework an appropriate level of confidence can be placed on the stochastic quantities. In particular, if you were to consider traffic management decisions on a railway surrounding the choice of platforms or use of signals outside a busy station, an effective algorithm for allocating the resource that is the station platform at a particular time can only function with a well-calibrated cost function which accounts for knock-on effects of such a decision. Possessing years of historical data, during which a wealth of such decisions have been made and their consequences mapped, leads us very naturally to first want to perform some robust statistical analyses.
For statistical analyses, the natural starting point for a statistician is gaining access to the appropriate historical data. There are already two large data types of interest: customer-centric journey counts and vehicle journeys. In the world of buses, for example, it is estimated that over 5 billion passenger journeys occur each year in the UK (Department of Transport, UK, 2016), and the alternative approach lies with vehicle logging data. For our discussion we shall concern ourselves more with logged vehicle movements, such as the approximately three million individual train movements which are logged on a given day. For both buses and trains there is far more information available than just logged movements of stops, departures, and waypoint visiting. These can include signal changes, platform assignments, detours, engine types, vehicle capacities or raw passenger numbers. A few million daily datapoints is certainly not as large as some datasets; however, the ability to store and then later quickly access and filter a database over a considerable time period can still become a non-trivial challenge. The typical format of vehicle data is broken down across the regions of a country, but sometimes even individual journeys may span more than one region. Considerable data cleaning may also be required to remove duplication, to collate, and at times even to resolve issues of contradictory data logged by different systems or network operators. Each logged message can also typically contain tens of covariates indicating information ranging from the current time and present location of a vehicle, to its previous location, intended destination, top speed, personal capacity, and even properties like the engine type.
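The cleaning step described above might look something like the following sketch. The message fields (`journey_id`, `location`, `event`, `time`, `source`) are hypothetical and do not correspond to any real logging format; the point is only to separate exact duplicates, which can be dropped safely, from contradictory records, which need a resolution rule.

```python
from collections import defaultdict

def clean_messages(messages):
    """Drop exact duplicates and report events logged with conflicting times."""
    seen = set()
    unique = []
    times_by_event = defaultdict(set)
    for msg in messages:
        event_key = (msg["journey_id"], msg["location"], msg["event"])
        record = event_key + (msg["time"],)
        if record in seen:
            continue  # same event and time already logged, e.g. by another system
        seen.add(record)
        unique.append(msg)
        times_by_event[event_key].add(msg["time"])
    # An event reported with more than one time is contradictory and is
    # returned separately so that it can be resolved by hand or by rule.
    conflicts = {k: sorted(t) for k, t in times_by_event.items() if len(t) > 1}
    return unique, conflicts

# Hypothetical messages logged by two systems for one journey.
messages = [
    {"journey_id": "J1", "location": "A", "event": "depart", "time": "08:00", "source": "sys1"},
    {"journey_id": "J1", "location": "A", "event": "depart", "time": "08:00", "source": "sys2"},
    {"journey_id": "J1", "location": "B", "event": "arrive", "time": "08:30", "source": "sys1"},
    {"journey_id": "J1", "location": "B", "event": "arrive", "time": "08:34", "source": "sys2"},
]
unique, conflicts = clean_messages(messages)
print(len(unique), "messages kept")
print("conflicting events:", conflicts)
```

Here the duplicated departure is collapsed into one record, while the two disagreeing arrival times are both kept and flagged rather than silently discarding one of them.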
Thus the first challenge is to place the data into a file structure robust to future data collection, but readily accessible for planned statistical analyses. When this data comfortably runs into the hundreds of gigabytes this is not a small issue. We imagine that in many real-world scenarios, simply putting in place the procedures to store large quantities of log data which are readily available contemporaneously, but which are not intended for immediate use, is already a positive step for a company towards future-proofing itself against technological change across a range of sectors. In some domains such as social media, there exist a range of companies offering archiving and filtered searching facilities as a service, generally for marketing or research purposes. However, reliance on outsourcing is likely not the preferred route in many industries, especially given the rate of attrition among some of these third-party service providers. Taking these database creation exercises in-house, and embedding the requisite expertise for later retrieval, is already a positive benefit of big data thinking.
When it comes to handling large datasets some of the go-to tools of the research statistician such as the R software (R Core Team, 2017) need to be handled with care. Whilst packages exist to manage hardware issues such as large memory footprints, at every stage consideration needs to be given to repeated database access and careful filtering to ensure that analysis is only performed on sets of data which are not excessively large. In transport applications, this often means filtering journeys by their origin–destination pair, or for spatial analysis, aggregating datapoints over small geographical cells, for a large number of different cells. In removing all messages not directly related to vehicles travelling between a specific Origin–Destination (OD) pair one can concentrate on identifying recurrent behaviours and patterns. Unfortunately, in transport networks there may be other vehicles which interact with vehicles on this particular route but which do not share the same OD pair. Thus a combination of filtering approaches may be required. Many natural approaches to transport, therefore, try to model the characteristics of each individual piece of infrastructure, as a function of a range of up to fif ?or so covariates of vehicles that pass through it. This can then be used either for infrastructure assessment, or as part of live vehicle prediction modelling. Of more prevalent use in the business world are data visualization tools like Tableau for creating slick graphics for presentations. With big data, however, there is very often the need to have already performed some careful database filtering and covariate selection to ensure appropriately-sized and relevant datasets are used to create the visualizations.
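The two filtering strategies mentioned above, by origin–destination pair and by small geographical cell, can be sketched as follows. The record fields (`origin`, `destination`, `lat`, `lon`), the station labels and the cell size are all illustrative assumptions, and in a real setting the filtering would more likely be pushed into the database query than done in memory.

```python
import math
from collections import Counter

def filter_by_od(messages, origin, destination):
    """Keep only messages from vehicles travelling on the given OD pair."""
    return [m for m in messages
            if m["origin"] == origin and m["destination"] == destination]

def aggregate_by_cell(messages, cell_size_deg):
    """Count messages falling in each square latitude/longitude cell."""
    counts = Counter()
    for m in messages:
        cell = (math.floor(m["lat"] / cell_size_deg),
                math.floor(m["lon"] / cell_size_deg))
        counts[cell] += 1
    return counts

# Hypothetical logged messages; "A", "B" and "C" are placeholder stations.
messages = [
    {"origin": "A", "destination": "B", "lat": 52.95, "lon": -1.15},
    {"origin": "A", "destination": "B", "lat": 52.93, "lon": -1.17},
    {"origin": "C", "destination": "B", "lat": 52.92, "lon": -1.46},
    {"origin": "A", "destination": "B", "lat": 52.63, "lon": -1.12},
]

od_messages = filter_by_od(messages, "A", "B")
cell_counts = aggregate_by_cell(od_messages, cell_size_deg=0.1)
print(len(od_messages), "messages on the A to B pair")
print(dict(cell_counts))
```

Note that the vehicle travelling from C to B is dropped by the OD filter even though it may interact with the A to B traffic, which is exactly the limitation that motivates combining filtering approaches.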
A number of academic challenges arise when trying to use such enormous quantities of data to make single predictions, or evaluations, for example in trying to predict the future lateness of a particular vehicle’s on-going journey. Identification of explanatory variables, along with appreciation of the physical mechanisms will often drive an appropriate model choice to distil the enormous datasets into appropriate predictions or summary data. Deciding whether to chain the use of linear models across a network of (assumed) independent locations, or to attempt kernel regression methods to weight a much larger number of proximate locations for the same purpose is not always easily determined. Indeed the most flexible approach obviously lies in maintaining the ability to try as many plausible avenues as possible, without having to make compromises to accommodate the potentially overwhelmingly large numbers of datapoints that may arise without appropriate data filtering.
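To make the kernel regression option concrete, the following is a minimal sketch of a Nadaraya-Watson estimator with a Gaussian kernel, one common form of kernel regression and not necessarily the one used in any particular application. The one-dimensional positions (distance along a route in km), the lateness values and the bandwidth are hypothetical.

```python
import math

def kernel_weighted_prediction(target, observations, bandwidth):
    """Nadaraya-Watson estimate at `target` using a Gaussian kernel.

    `observations` is a list of (position, value) pairs; observations at
    positions close to `target` receive the largest weights.
    """
    numerator = 0.0
    denominator = 0.0
    for position, value in observations:
        weight = math.exp(-0.5 * ((position - target) / bandwidth) ** 2)
        numerator += weight * value
        denominator += weight
    if denominator == 0.0:
        raise ValueError("no observation carries weight; increase the bandwidth")
    return numerator / denominator

# Hypothetical observed lateness (minutes) at positions along a route (km).
observations = [(0.0, 2.0), (10.0, 4.0), (20.0, 6.0)]
prediction = kernel_weighted_prediction(10.0, observations, bandwidth=5.0)
print(f"predicted lateness at 10 km: {prediction:.2f} minutes")
```

The bandwidth controls how many proximate locations contribute meaningfully: a small value approaches a purely local estimate, while a large value approaches the overall average, and choosing between these is part of the modelling decision described above.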
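Copyright notice: The content of this article was submitted by internet users and the copyright belongs to the original author. This site does not own the copyright and assumes no corresponding legal liability. If you find content on this site that appears to be plagiarized or misrepresented, please contact us at jiasou666@gmail.com; once verified, this site will remove the infringing content within 24 hours.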