Sampling: Sampling is a method used to select, manipulate, and analyze a subset of data. It refers to a technique that picks a chosen number of samples from the data items in a dataset for analysis.
Advantages of sampling: sampling offers convenience, the collection of intensive and thorough information, suitability when resources are limited, and better rapport
- low cost
- less time consumption
- high data accuracy
Disadvantages of sampling: the main difficulties of sampling lie in the selection, estimation, and administration of samples
- need for subject specific knowledge
- changeability of sampling units
Simple random sample: A simple random sample is a subset of a statistical population in which each member of the population has an equal probability of being chosen. An example of a simple random sample would be the names of 5 employees being chosen from a company of 25 employees. In this case, the population is all 25 employees, and the sample is random because each employee has an equal chance (5 in 25) of being chosen.
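The employee example can be sketched in a few lines of Python. This is only an illustration: the employee names are hypothetical placeholders, and `simple_random_sample` is a helper name introduced here, not something from the cited sources. It relies on the standard library's `random.sample`, which draws without replacement.

```python
import random

# Hypothetical population: 25 placeholder employee names (not from the source).
employees = [f"employee_{i:02d}" for i in range(1, 26)]


def simple_random_sample(population, k, seed=None):
    """Draw k distinct items; every item has the same chance of selection."""
    rng = random.Random(seed)
    # random.sample draws without replacement, so no item appears twice.
    return rng.sample(population, k)


sample = simple_random_sample(employees, 5, seed=0)
print(sample)
```

Passing a seed makes the draw repeatable; leaving it as `None` gives a different sample on each run.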
Simple random sampling is not the best methodology here, since it removes the same fraction of points everywhere and can therefore eliminate most of the points in sparse regions. It is better to undersample the regions where data objects are too dense while keeping most or all of the data objects from sparse regions. Sampling is used because the process of exploring the data can be very time-consuming and/or expensive, depending on the data size. Sampling is the main technique used in data mining for selecting a subset of relevant data from a large data set to explore.
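A small experiment can make the effect on sparse regions concrete. The sketch below uses made-up data (a dense cluster of 10,000 points and a sparse group of 20 points, both assumptions for illustration) and counts how many sparse points survive a 1% simple random sample.

```python
import random

rng = random.Random(42)

# Hypothetical 2-D data: one dense cluster and one small sparse group.
dense = [(rng.gauss(0, 1), rng.gauss(0, 1), "dense") for _ in range(10_000)]
sparse = [(rng.uniform(8, 12), rng.uniform(8, 12), "sparse") for _ in range(20)]
data = dense + sparse

# Simple random sample without replacement, keeping about 1% of the points.
k = len(data) // 100
srs = rng.sample(data, k)

# Count how many of the 20 sparse points made it into the sample.
kept_sparse = sum(1 for point in srs if point[2] == "sparse")
print(f"sample size: {k}, sparse points kept: {kept_sparse} of {len(sparse)}")
```

On average only about 0.2 of the 20 sparse points are expected to survive (100 × 20 / 10,020), so the sparse region will usually vanish from the display altogether.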
Discuss the advantages and disadvantages of using sampling to reduce the number of data objects that need to be displayed.
Would simple random sampling (without replacement) be a good approach to sampling? Why or why not?
Exploring a complete data set is usually time-consuming and costly, depending largely on the size of the data. Sampling simplifies the process and reduces the time and cost involved (Devkota para. 7). Sampling is often the main technique used in data mining when analysts want to select a subset of relevant data from a large data set. The selected sample should be representative of the whole data set.
Apart from that, because a sample is smaller than the full data set, sampling allows one to apply more sophisticated algorithms, which can improve the accuracy of the results (Devkota para. 7). This flexibility makes it easier for the data analyst to reach the desired results while keeping accuracy high.
Further, sampling helps in avoiding monotony, because one does not need to repeat the same query over every individual data item. In addition, sampling can produce detailed information about the data while using only a small amount of resources.
On the disadvantages, there is a high chance of bias in the choice of sampling method, since the choice depends on the judgment of the person making it, and different people prefer different approaches (Devkota para. 8). Choosing the wrong sampling technique can be disastrous, as it may render the whole analysis invalid.
Besides, selecting the proper sample size is usually difficult, because the sample must be representative of the whole data set for the results to be reliable (Devkota para. 8). In addition, during selection one may exclude some crucial data because the data appear homogeneous, which reduces the accuracy and reliability of the results.
Despite being the most popular sampling design, simple random sampling without replacement is not the ideal approach here. Because it removes the same fraction of points everywhere, it usually eliminates most of the points in sparse regions, which have few points to begin with, and this reduces accuracy. For better results, one should undersample the regions where data are dense and keep most or all of the data items in sparse regions. This will improve the accuracy and reliability of the results.
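One possible way to undersample dense regions while preserving sparse ones is to lay a grid over the data and cap the number of points kept per cell. The sketch below is an illustration under stated assumptions, not a method taken from the cited sources: the function name `density_aware_sample`, the grid approach, the `cell_size` and `max_per_cell` parameters, and the test data are all hypothetical.

```python
import random
from collections import defaultdict


def density_aware_sample(points, cell_size, max_per_cell, seed=None):
    """Cap the number of points kept per grid cell.

    Cells holding at most max_per_cell points are kept whole (sparse regions);
    fuller cells are randomly thinned to max_per_cell points (dense regions).
    """
    rng = random.Random(seed)
    cells = defaultdict(list)
    for x, y in points:
        # Assign each point to the grid cell that contains it.
        cells[(int(x // cell_size), int(y // cell_size))].append((x, y))
    kept = []
    for members in cells.values():
        if len(members) <= max_per_cell:
            kept.extend(members)
        else:
            kept.extend(rng.sample(members, max_per_cell))
    return kept


# Hypothetical data: a dense cluster plus a small sparse group far away from it.
rng = random.Random(1)
dense = [(rng.gauss(0, 1), rng.gauss(0, 1)) for _ in range(10_000)]
sparse = [(rng.uniform(8, 12), rng.uniform(8, 12)) for _ in range(20)]

reduced = density_aware_sample(dense + sparse, cell_size=1.0, max_per_cell=25, seed=1)
sparse_set = set(sparse)
sparse_kept = sum(1 for point in reduced if point in sparse_set)
print(f"kept {len(reduced)} of {len(dense) + len(sparse)} points; "
      f"sparse points kept: {sparse_kept} of {len(sparse)}")
```

The cap of 25 points per cell is arbitrary. A smaller cap thins the dense cluster more aggressively, while any cell holding fewer points than the cap is left untouched, which is what keeps the sparse region visible.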