Engineering LibreTexts

Data wrangling


      Efficient data collection can be critical to a project's success.

      This page introduces methods of data wrangling.

      In the first example, the download URL is copied from a web page and passed to wget.

       

      Open the page at the following link:

      http://bioinformaticstoolspw.us/downloadPage-V2.html

      A simple list of files available for download is shown.

      In this example, the aim is to download the data file named dataFile-53.dat.

      Copy the link address using the method for your specific OS. On Windows it can be done as follows:

      1. Right-click the link of interest (dataFile-53.dat).
      2. In the menu, select Copy link location.
      3. Paste the link into an editor to confirm that it is in your clipboard.
      4. Change to the directory where you want to put the data, for example ‘Downloads’ (or make a project directory).
      5. Type wget, then paste the link from the clipboard.

      It should look like this:

      wget http://bioinformaticstoolspw.us/files/dataFile-53.dat

      Running the above command produces output similar to the following:

      --2020-07-20 12:17:50--  http://bioinformaticstoolspw.us/files/dataFile-53.dat
      Resolving bioinformaticstoolspw.us (bioinformaticstoolspw.us)... 162.241.217.126
      Connecting to bioinformaticstoolspw.us (bioinformaticstoolspw.us)|162.241.217.126|:80... connected.
      HTTP request sent, awaiting response... 200 OK
      Length: 34
      Saving to: ‘dataFile-53.dat’

      dataFile-53.dat                100%[=================================================>]      34  --.-KB/s    in 0s

      2020-07-20 12:17:50 (799 KB/s) - ‘dataFile-53.dat’ saved [34/34]

       

      The destination filename can be changed as follows:

      wget http://bioinformaticstoolspw.us/files/dataFile-53.dat -O ~/Downloads/new-name-dataFile-53.dat

      In the above example, the ‘-O’ switch (a capital letter O, not a zero) sets the target filename.
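As a minimal sketch of the renaming step, the target path can be derived from the URL instead of being typed by hand. The helper name rename_target and the DEST_DIR variable are hypothetical and used only for illustration; the "new-name-" prefix mirrors the example above. The command is printed rather than run, so it can be inspected first.

```shell
#!/usr/bin/env bash
# Sketch: build the destination path from the URL, then print the wget command.

URL="http://bioinformaticstoolspw.us/files/dataFile-53.dat"
DEST_DIR="$HOME/Downloads"   # assumed destination directory

# rename_target is a hypothetical helper, not a wget feature.
rename_target() {
  # $1 = URL, $2 = destination directory
  printf '%s/new-name-%s\n' "$2" "$(basename "$1")"
}

TARGET="$(rename_target "$URL" "$DEST_DIR")"

# Printed, not executed; remove the echo to perform the download.
echo wget "$URL" -O "$TARGET"
```

Removing the echo turns the last line into the same kind of command as the one shown above.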

      ----------------------

      This manual method is fine for a small collection of small files.

      The next example uses a script to download a larger set of files.

      The aim is to download only files 54 through 60.

      First, open a new script file in your preferred editor, then write and test the following for-loop in it.

      vim wgetInForloop.sh

      or

      nano wgetInForloop.sh

      for N in {54..60}
      do
        echo "$N"
        exit   # testing safeguard: stop after the first pass
      done
      
      

      Save it and open another terminal to run it.

      bash ./wgetInForloop.sh
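The loop header relies on brace expansion, which is a bash feature and may not be available in other shells. A quick way to check what the expression expands to, sketched here by calling bash explicitly, is:

```shell
# Show the list that the loop will iterate over.
# bash is invoked explicitly in case the default shell is a different one.
bash -c 'echo {54..60}'
# prints: 54 55 56 57 58 59 60
```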

      If it runs without error (it should print only 54, because the exit ends the script during the first pass), return to the terminal with the editor and make the following modifications.

      for N in {54..60}
      do
        echo wget http://bioinformaticstoolspw.us/files/dataFile-$N.dat
        exit   # testing safeguard: stop after the first pass
      done
      

      Again, save the edits and run the script in the other terminal.

      bash ./wgetInForloop.sh

      If it runs without errors and prints what looks like a functional wget command, remove the echo and comment out the exit.

      Once saved, the script automates the data download.
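If the file range needs to change often, one possible variant takes the first and last file numbers as arguments. In bash, brace expansion happens before variable expansion, so a form like {$START..$END} does not produce a number range; this sketch uses a counter loop instead. The names download_range and DRY_RUN are hypothetical additions and not part of the script above. With DRY_RUN left at 1, the commands are only printed.

```shell
#!/usr/bin/env bash
# Sketch of a parameterised variant of wgetInForloop.sh.
# BASE_URL comes from the example above; download_range and DRY_RUN
# are hypothetical names introduced for illustration.

BASE_URL="http://bioinformaticstoolspw.us/files"
DRY_RUN="${DRY_RUN:-1}"   # 1 = only print the commands, 0 = run wget

download_range() {
  # $1 = first file number, $2 = last file number
  n="$1"
  while [ "$n" -le "$2" ]
  do
    url="$BASE_URL/dataFile-$n.dat"
    if [ "$DRY_RUN" = "1" ]; then
      echo wget "$url"
    else
      # Report a failed download but carry on with the next file.
      wget "$url" || echo "download failed: $url" >&2
    fi
    n=$((n + 1))
  done
}

download_range 54 60
```

Running it with DRY_RUN=0 set in the environment performs the downloads; the default only lists the commands, which matches the echo-first habit used throughout this page.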

      -----------------------------

      A similar method can be used to efficiently process the data.

      Start a new script called processData.sh.

      for FILE in ./dataFile-*.dat
      do
        ls -l "$FILE"
        exit   # testing safeguard: stop after the first file
      done
      

      If it runs without errors, the output of ‘ls -l’ shows that the variable holds a data file, ready for further processing appropriate to the data type.
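What the further processing looks like depends on the data. As one hedged illustration, the sketch below replaces ‘ls -l’ with a line count for each file. The function name summarize_files and the choice of wc -l are assumptions made for this example only. It also skips the case where no file matches the pattern, because bash then leaves the pattern unexpanded by default.

```shell
#!/usr/bin/env bash
# Sketch of a possible next step for processData.sh: count lines per file.
# summarize_files is a hypothetical name; real processing depends on the data type.

summarize_files() {
  # $1 = directory holding the dataFile-*.dat files (defaults to the current one)
  dir="${1:-.}"
  for FILE in "$dir"/dataFile-*.dat
  do
    # If nothing matches, the pattern itself is the loop value; skip it.
    [ -e "$FILE" ] || continue
    # tr removes the padding that some wc implementations add.
    lines="$(wc -l < "$FILE" | tr -d ' ')"
    echo "$(basename "$FILE"): $lines lines"
  done
}

summarize_files .
```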

       

       

       


      Data wrangling is shared under a GNU General Public License 3.0 license and was authored, remixed, and/or curated by LibreTexts.
