Outline:
This step-by-step tutorial is a risk-free paragliding tandem flight into the cloud. We will guide you through the process of leveraging the cloud to analyse your RADseq data.
For an evolutionary biologist, restriction site-associated DNA sequencing (RADseq) and new genomic tools are great opportunities, but the fast-changing field can be very challenging to follow. The most important tool for RADseq, after your brain and the sequencer, is your computer. RADseq analysis can be very demanding on personal computers: there may not be enough CPU and/or memory, the analysis can crash or, worse, your computer becomes solely dedicated to this task and nothing else. If you’re lucky enough to have access to a large, up-to-date university computer cluster this might be a good solution, but clusters come with their share of problems.
We chose Amazon Elastic Compute Cloud (EC2) because: (i) it’s relatively easy, there’s extensive documentation and a big community to ask for help; (ii) it’s relatively cheap to use machines with lots of CPU and memory. Using the cloud offers 2 more advantages:
Assumptions
Although you can go through this tutorial with copy-and-paste commands, knowing very basic Terminal commands will be useful:
Recommendations:
In the beginning most students will feel lost and very intimidated by everything related to computers, software and RADseq analysis. Commonly, the end result is that your scientific workflow is impeded. If you’re at the Master’s or PhD level and all the RADseq stuff is overwhelming and you’re not sleeping at night, consider:
you should focus first on biology, second on bioinformatics (a computer is the main tool you’ll likely use in your future) and, if you have time, the wet-lab part. Now, don’t get me wrong, I think the wet-lab is the most important step, but unless you want to become a wet-lab technician, my advice is to stay away as much as humanly possible from the wet-lab and leave it to the experts.
asking for help:
Thierry Gosselin
Used for fine-tuning the workflow, running preliminary analyses with small data sets and making ready-to-publish figures. Here are some specs that will help make the RADseq pipeline run smoothly:
Understanding Amazon Elastic Cloud Compute (EC2) Instances
Amazon EC2 instances come in different CPU and memory configurations (instance types and prices). For molecular biologists, 3 purchasing options are of interest: spot, on-demand and reserved instances.
Spot Instances allow you to name your own price for Amazon EC2 computing capacity. Yes, you read correctly: no need to check the web site or the documentation. You simply bid on unused Amazon EC2 instances and run your instances for as long as your bid remains higher than the current Spot Price. You specify the maximum hourly price that you are willing to pay to run a particular instance type. The Spot Price fluctuates based on supply and demand for instances, but customers will never pay more than the maximum price they have specified. If the Spot Price moves higher than a customer’s maximum price, the customer’s instance will be shut down by Amazon EC2 (see how to manage interruptions). Other than those differences, Spot Instances perform exactly the same as On-Demand or Reserved Instances. This pricing model provides the most cost-effective option for obtaining compute capacity for interruption-tolerant tasks.
On-Demand Instances let you pay the specified hourly rate for the instances you use with no long-term commitments or upfront payments. This type of purchasing option is recommended for applications that cannot be interrupted.
Reserved Instances let you make a low, one-time, upfront payment for an instance, reserve it for a one or three year term, and pay a significantly lower hourly rate for that instance. For applications that have steady state needs, Reserved Instances can provide savings of up to 70% compared to using On-Demand Instances. This is probably the least interesting option for biologists.
Do you think Google Cloud might be a better solution for you? See their compute engine here.
Here is the instance I recommend for the best price/performance ratio: the Amazon EC2 i3.8xlarge.
Linux and macOS users can follow the guidelines in this tutorial to install GBS/RADseq software on their personal computer. The commands below will help you get the required Amazon tools to access Amazon cloud services from your computer.
The shell startup script and PATH to programs
To make it a little easier to talk to your computer, each time you open the Terminal a shell startup script tells your computer where to look for programs. The path to your programs can be modified in your shell startup script. When your computer is searching for programs, it looks into these paths:
# In your Terminal
echo $PATH
The output varies depending on the computer OS and version (no worries, see below).
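For example, on many systems the output looks something like this (the exact directories and their order vary):
/usr/local/bin:/usr/bin:/bin:/usr/sbin:/sbin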
Use the pwd (print working directory) command to know exactly where you are!
The name of the shell startup file differs across platforms. It’s usually called .bash_profile. Filenames beginning with . are hidden, and not all text editors are configured to show those files by default.
Find your shell startup script with the following command:
# In your Terminal
ls -al ~ | grep profile
If this returns nothing (blank), you don’t have a shell startup script. Create one with this command:
touch $HOME/.bash_profile
# $HOME points to your home directory; no sudo needed, the file belongs to you
To modify it, you can use BBEdit to open and modify hidden items (using the option Show hidden items on the open file screen). With Linux, use Vi! On most Unix systems, nano (or sudo nano for system files) will get the job done.
After modifying your shell startup script, always reload it by running:
# Terminal
source $HOME/.bash_profile
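For example, to make a custom directory of programs searchable, you could append it to your PATH in your .bash_profile (a sketch; $HOME/programs is a made-up directory, use your own):
# In your .bash_profile
export PATH=$PATH:$HOME/programs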
Universal Command Line Interface for Amazon Web Services
cd ~/Downloads #the bundled installer doesn't support installing to paths that contain spaces
curl "https://s3.amazonaws.com/aws-cli/awscli-bundle.zip" -o "awscli-bundle.zip"
unzip awscli-bundle.zip
sudo ./awscli-bundle/install -i /usr/local/aws -b /usr/local/bin/aws
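Verify the installation:
aws --version # prints the installed AWS CLI version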
A few more steps before using Amazon Web Services (AWS)
Go to http://aws.amazon.com and click Sign In to the Console.
Follow the on-screen instructions.
Note: If you have used Amazon services before (e.g. to buy books) use the same username and password, the process will be fast!
To drive Amazon’s computers you need 2 keys: an access key and a secret key. Instructions on how to get your keys:
Your 2 keys are stored in the security credentials section under your name in the upper right corner on the amazon console. Although you can retrieve your access key ID from the Your Security Credentials page, you can’t retrieve your Secret Access Key. Therefore, if you can’t find your Secret Access Key, you’ll need to create a new one before using CLI tools.
If you don’t want to specify your access keys every time you issue a command, using the --aws-access-key and --aws-secret-key (or -O and -W) options, you have 2 options:
1. Add them to your shell startup script (the .bash_profile file discussed above):
# copy/paste the 2 lines:
export AWS_ACCESS_KEY=your-aws-access-key-id
export AWS_SECRET_KEY=your-aws-secret-key
2. Or configure the AWS CLI:
aws configure
You will be prompted for four values: access key ID, secret access key, default region name (e.g. us-east-1) and default output format.
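Behind the scenes, aws configure writes your keys to a credentials file in your home directory; it looks like this (the values below are placeholders):
# ~/.aws/credentials
[default]
aws_access_key_id = your-aws-access-key-id
aws_secret_access_key = your-aws-secret-key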
Make sure everything is properly configured and that your computer now talks Amazon language:
aws ec2 describe-regions # this will output different regions available
To help you solve computer related problems and start focusing back on biology:
Steps before requesting/starting an Instance
I’m sparing your sanity by skipping most of the details; if you really want the information, click on the links to the documentation.
The code below will create a key pair named radseq_keys and save the private key in the working directory:
aws ec2 create-key-pair --key-name radseq_keys --query 'KeyMaterial' --output text > radseq_keys.pem
chmod 400 radseq_keys.pem # restrict permissions, required by ssh
Get the description of your key pairs with: aws ec2 describe-key-pairs
Create a Virtual Private Cloud (VPC):
aws ec2 create-vpc --cidr-block 10.0.0.0/16
Write down the vpc id that starts with vpc-:
vpc="vpc-0493d12477c2c2b51"
Modify the VPC to enable DNS hostnames:
aws ec2 modify-vpc-attribute --vpc-id $vpc --enable-dns-support
aws ec2 modify-vpc-attribute --vpc-id $vpc --enable-dns-hostnames
Create a subnet in the VPC:
aws ec2 create-subnet --availability-zone us-east-1a --cidr-block 10.0.0.0/16 --vpc-id $vpc
Write down the subnet id starting with subnet-:
subnet="subnet-0dee7de0007fac906"
Create an internet gateway:
aws ec2 create-internet-gateway
Write down the gateway id starting with igw-:
igw="igw-06f188634be4cb4ab"
Internet gateways documentation
Attach the internet gateway to the VPC
aws ec2 attach-internet-gateway --internet-gateway-id $igw --vpc-id $vpc
Get the route tables id:
aws ec2 describe-route-tables
Write down the route table id starting with rtb-:
rtb="rtb-0b8297dd315ded4ff"
Modify the route table to allow the connection:
aws ec2 create-route --route-table-id $rtb --destination-cidr-block 0.0.0.0/0 --gateway-id $igw
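You can confirm the route was added:
aws ec2 describe-route-tables --route-table-ids $rtb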
1. Create a security group named radseq:
aws ec2 create-security-group --description "radseq analysis" --group-name radseq --vpc-id $vpc
Write down the security group ID, it starts with sg-:
sg="sg-0c6ffc9bb018f9005"
To delete a security group, use aws ec2 delete-security-group. To get the description of your security groups, use aws ec2 describe-security-groups.
2. Communication with the instances
Communication with your instance through port 22 needs to be open. This is reserved for Secure Shell (SSH) and if you’re planning on using RStudio, port 8787 also needs to be open.
Communication security depends on whether you plan on using your instance from a single computer/device or several.
Use your external IP address: e.g. if you plan on using the instance from your office computer only, associate your external IP address for increased security. Several ways to get your external IP address:
In the Terminal, use this code: curl -s checkip.dyndns.org | sed -e 's/.*Current IP Address: //' -e 's/<.*$//'
Google search my ip address (the first result displayed will be your external, or public, IP address).
Append the /32 suffix to your IP address:
# for port 22
aws ec2 authorize-security-group-ingress --group-id $sg --protocol tcp --port 22 --cidr 203.0.113.0/32 # enable SSH from your IP address only (203.0.113.0 is a placeholder, use yours)
# for port 8787
aws ec2 authorize-security-group-ingress --group-id $sg --protocol tcp --port 8787 --cidr 203.0.113.0/32 # enable RStudio Server access from your IP address only
Use 0.0.0.0/0 if you’re planning to use the instance from different computers: e.g. if you plan on using another computer or a phone to check the state of the computations, use these commands instead:
# for port 22
aws ec2 authorize-security-group-ingress --group-id $sg --protocol tcp --port 22 --cidr 0.0.0.0/0
# for port 8787
aws ec2 authorize-security-group-ingress --group-id $sg --protocol tcp --port 8787 --cidr 0.0.0.0/0
To describe the security group:
aws ec2 describe-security-groups --group-ids $sg
Uploading your RADseq data in the cloud
After trying to download your Illumina lanes to your computer or university computer cluster, you finally understand the meaning of the omic jargon: avalanche, deluge and tsunami.
Moving biological data around is a challenge and it isn’t going away; technologies in the -omic fields are constantly evolving and producing more data… Accessibility, expandability, redundancy and reliability: this is where Amazon Simple Storage Service (S3) comes into play.
Use Amazon S3 as:
Notes:
Get the most out of Amazon S3 with the free applications:
Linux users will find FileZilla very useful.
The Terminal can be very powerful and useful for uploading large files to your new Amazon S3 bucket. The Amazon tools installed on your computer earlier can take advantage of Amazon’s multipart upload, as shown below.
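For example, creating a bucket and uploading a compressed lane looks like this (bucket and file names below are placeholders; aws s3 cp switches to multipart upload automatically for large files):
aws s3 mb s3://my-radseq-data # bucket names are global, pick a unique one
aws s3 cp lane1_R1.fastq.gz s3://my-radseq-data/ # upload
aws s3 ls s3://my-radseq-data/ # list the bucket content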
Further Amazon S3 readings:
Below, we show how to start a spot instance using the terminal (the command line). You could also do it with the console, but it’s much faster with the command line. Before requesting a Spot Instance, we will get the Pricing History of the instance over the last month to help make a price decision.
This is copied/pasted from the Amazon documentation:
In the Terminal, use the code below:
s="--start-time 2019-09-15T09:45:00" # Start-time UTC format
e="--end-time 2019-10-15T09:45:00" # End-time UTC format
t="--instance-types i3.8xlarge" # Instance-type
aws ec2 describe-spot-price-history $s $e $t > spot.price.txt
# To get the date and time in UTC format in R: format(Sys.time(), "%Y-%m-%dT%H:%M:%S-0400")
In R, we can easily generate boxplots showing the spot price statistics by regions and os:
require(tidyverse)
readr::read_tsv(
file = "spot.price.txt",
col_names = c("REGION", "INSTANCES", "OS", "SPOT_PRICE", "DATE"),
col_types = "_cccnc"
) %>%
ggplot2::ggplot(data = ., ggplot2::aes(x = OS, y = SPOT_PRICE)) +
ggplot2::geom_boxplot() +
ggplot2::labs(y = "EC2 Spot Price ($)", x = "OS", title = "Spot price history") +
ggplot2::theme(axis.text.x = ggplot2::element_text(angle = 90, hjust = 1)) +
ggplot2::facet_grid(. ~ REGION, scales = "free")
This is a json file to help get things done faster during the request:
touch ec2_radseq_specification.json
nano ec2_radseq_specification.json
Copy/paste in the nano editor:
{
"ImageId": "ami-09abbab8663956c8a",
"InstanceType": "i3.8xlarge",
"KeyName": "radseq_keys",
"NetworkInterfaces": [
{
"DeviceIndex": 0,
"SubnetId": "subnet-0dee7de0007fac906",
"Groups": ["sg-0c6ffc9bb018f9005"],
"AssociatePublicIpAddress": true
}
]
}
Write the file and exit: ^x (control-x), y (for yes) and Enter (the keyboard key…)
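Optionally, run a quick sanity check on the file before using it (assumes Python is available, as it is on Amazon Linux and macOS):
python -m json.tool ec2_radseq_specification.json # prints the file if the json is valid, an error otherwise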
z="--availability-zone-group us-east-1a" # availability zone
n="--instance-count 1" # number of spot instances
p="--spot-price 0.99" # maximum price
r="--type one-time" # request type 'one-time|persistent'
s="--launch-specification file://ec2_radseq_specification.json"
aws ec2 request-spot-instances $z $n $p $r $s
Get the description and status of your spot instance requests:
aws ec2 describe-spot-instance-requests
To cancel a request, write down its id and use:
request_id="your-spot-instance-request-id"
aws ec2 cancel-spot-instance-requests --spot-instance-request-ids $request_id
Further reading: Introduction to spot instances
For the SSH connection to work, provide the information for:
Description of your instance
Look in Amazon console for approval of your spot instance request or use the code below to get the description of your instance:
aws ec2 describe-instances
Write down the instance public DNS:
instance="ec2-3-85-86-52.compute-1.amazonaws.com"
Write down the instance id
instance_id="i-0f9207c5a0dc76303"
This info is also available in the browser console
ssh -i "radseq_keys.pem" ec2-user@$instance
The output:
Are you sure you want to continue connecting (yes/no)?
# answer... yes
nproc # number of processing units available
lscpu # detailed CPU architecture info
The i3.8xlarge instance comes with 4 NVMe SSD drives (instance-store volumes). Use the lsblk command to view your available disk devices and where they are mounted. Note that the output of lsblk removes the /dev/ prefix from full device paths.
# format each drive:
sudo mkfs -t ext4 /dev/nvme0n1
sudo mkfs -t ext4 /dev/nvme1n1
sudo mkfs -t ext4 /dev/nvme2n1
sudo mkfs -t ext4 /dev/nvme3n1
# mount each drive on its mounting point:
sudo mount /dev/nvme0n1 /media/ebs_1
sudo mount /dev/nvme1n1 /media/ebs_2
sudo mount /dev/nvme2n1 /media/ebs_3
sudo mount /dev/nvme3n1 /media/ebs_4
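Check that the four drives are mounted:
df -h # the four /media/ebs_* mount points should appear in the list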
You need to configure s3fs, the software used to mount your s3 bucket on your Amazon instance.
sudo nano /etc/passwd-s3fs # edit the password file used by s3fs
s3-bucket-name:AccessKey:SecretKey # add your information, keeping the ':' separators
ctrl-o # to write the change to the file
ctrl-x # to exit nano editor
sudo chmod 640 /etc/passwd-s3fs # set permissions
sudo chown ec2-user:ec2-user /etc/passwd-s3fs # change ownership
s3fs -o allow_other gbs_data /media/s3/ # mount the s3 bucket 'gbs_data'
Test your mount:
sudo touch /media/s3/testing # will save the file 'testing' to your bucket
ls -l /media/s3 # will show the content of your bucket and your testing file
Automatically mount the s3 drive on instance reboot
Create a small script, move it into /usr/sbin, give ownership to root:root and make it +x (executable):
echo "s3fs bucket_name -o allow_other /media/s3" >> automount-s3
sudo mv automount-s3 /usr/sbin
sudo chown root:root /usr/sbin/automount-s3
sudo chmod +x /usr/sbin/automount-s3
sudo reboot # to reboot the EC2 Instance
After reboot is complete, connect to your instance and run the executable file
instance="ec2-54-87-41-96.compute-1.amazonaws.com" # public DNS
keypair_path="radseq_keypair.pem" # path to your key pair
ssh -i $keypair_path ec2-user@$instance # to start the connection
sudo -i # become root and keep sudo privilege
sudo /usr/sbin/automount-s3
df -h # see if disk is mounted and command successful
If you need more disk space:
Create a 1 TB EBS drive:
SIZE=1000
TYPE=standard
ZONE=us-east-1a
aws ec2 create-volume --size $SIZE --volume-type $TYPE --availability-zone $ZONE
Describe the volumes you have:
aws ec2 describe-volumes
Attach the 1TB EBS drive to a running or stopped instance:
I="--instance-id i-611a8642" # ID of the instance
EBS="--volume-id vol-e859faa4" # ID of Amazon EBS volume
D="--device /dev/xvdf" # the device name
aws ec2 attach-volume $I $EBS $D
After the 1 TB EBS volume is attached, you’ll have to format it, make a mounting point and mount the EBS volume.
sudo mkfs -t ext4 /dev/xvdf # format (same device name used during attach)
sudo mkdir /media/ebs_5 # create a new mounting point (ebs_1 to ebs_4 are already used)
sudo mount /dev/xvdf /media/ebs_5 # mount
To detach a volume from an instance, use these commands:
I="--instance-id i-611a8642" # ID of the instance
EBS="--volume-id vol-e859faa4" # ID of Amazon EBS volume
D="--device /dev/xvdf" # the device name
aws ec2 detach-volume $I $EBS $D
Note the difference between S3 and EBS drives: S3 is object storage, accessed over the network and independent of any instance, while an EBS volume is block storage attached to a single instance at a time.
Don’t forget to terminate your instance when your analyses are completed!
aws ec2 terminate-instances --instance-ids $instance_id # modify to match your instance id
Now, I guess you can’t wait to try Stacks… Check that everything is working properly:
populations # to test your installation!
You might have to give the full path if the above command doesn’t work:
/usr/local/bin/populations
You are ready to start analyzing your RADseq data!
tar -czvf stacks_output.tar.gz path/of/stacks_output/files
Then transfer the results to your s3 bucket to keep access to your data when the instance is shut down, as shown below.
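For example, with the AWS CLI (the bucket name is a placeholder):
aws s3 cp stacks_output.tar.gz s3://my-radseq-data/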
You can use our public AMI, already loaded with RADseq software + R + RStudio. To access the RStudio server you need to configure the security group to allow communication with port 8787. This is explained in the security group section above.
In your favourite web browser, type the public IP address of your instance with the port 8787 at the end:
54.172.183.211:8787
# These credentials will be asked:
# Username: rstudio
# Password: rstudio
Time required: ~ 15min.
Spot instances can’t be used to generate an AMI because the instance needs to be stopped and re-started easily. This is best accomplished with an On-Demand Instance.
For this example, let’s use the Amazon Linux 2 AMI (HVM). The image comes with 8GB of disk space, which will likely not be enough to install most of the genomic software you want. To increase the size we need the information called block-device-mappings.
ID of the image:
ami="ami-0b69ea66ff7391e80"
Additional info required (generated above):
subnet="subnet-0dee7de0007fac906"
sg="sg-0c6ffc9bb018f9005"
Get the info on block-device-mappings:
aws ec2 describe-images --image-id $ami --output json
We need: the device name (/dev/xvda), the snapshot id (snap-0e4c15b8cba3e8ae6) and the volume type (gp2).
Generate a json file to help get things done faster during the request:
touch ec2_radseq_mapping.json
nano ec2_radseq_mapping.json
Modify the values below based on the image description obtained above, then copy/paste in the nano editor:
[
{
"DeviceName": "/dev/xvda",
"Ebs": {
"DeleteOnTermination": true,
"SnapshotId": "snap-0e4c15b8cba3e8ae6",
"VolumeSize": 30,
"VolumeType": "gp2",
"Encrypted": false
}
}
]
To exit nano: with your keyboard: ^x (control-x), y and enter
t="--instance-type i3.8xlarge" # instance type
k="--key-name radseq_keys" # name of YOUR keypair
n="--count 1" # number of instances
i="--associate-public-ip-address" # add a public IP address
aws ec2 run-instances --image-id $ami $t $k $n $i --security-group-ids $sg --subnet-id $subnet --block-device-mappings file://ec2_radseq_mapping.json # the subnet pins the availability zone, so no zone option is needed here
aws ec2 describe-instances
Write down the instance public DNS (starts with ec2- and ends with compute-1.amazonaws.com) and the instance id (starts with i-):
instance="ec2-54-159-198-106.compute-1.amazonaws.com"
instance_id="i-0f9207c5a0dc76303"
ssh -i "radseq_keys.pem" ec2-user@$instance
To this question (if you have it):
Are you sure you want to continue connecting (yes/no)?
Answer: yes
Become root user and keep your sudo privilege active throughout the session:
sudo -i
For libraries, software/dependencies and system updates, run the following commands:
yum update -y
Note that Red Hat/Fedora-based distros use yum while Debian/Ubuntu distros use apt-get natively.
Allow access to the Extra Packages for Enterprise Linux (EPEL):
sudo yum install -y https://dl.fedoraproject.org/pub/epel/epel-release-latest-7.noarch.rpm
Also install these dependencies:
yum install -y automake autoconf bzip2-devel curl-devel xerces-c expat-devel fuse fuse-devel freetype* gcc gcc-c++ gcc-gfortran gettext gettext-devel git gsl-devel java-1.8.0-openjdk-devel libcurl-devel libpng-devel libstdc++-devel libX11-devel libxml2-devel libXt-devel libtool mailcap mesa-libGLU-devel mysql mysql-devel numpy openssh-* openssl-devel python-devel python-magic python-pip readline-devel sqlite-devel subversion swig texinfo-tex trickle unixODBC-devel unzip zlib-devel
The i3.8xlarge instance comes with 4 NVMe SSD drives (instance-store volumes).
Before accessing those drives you need to create mounting points and give them proper permissions:
sudo mkdir /media/ebs_1
sudo mkdir /media/ebs_2
sudo mkdir /media/ebs_3
sudo mkdir /media/ebs_4
# give proper permission:
sudo chown -R ec2-user:root /media/ebs_1
sudo chown -R ec2-user:root /media/ebs_2
sudo chown -R ec2-user:root /media/ebs_3
sudo chown -R ec2-user:root /media/ebs_4
Use the lsblk command to view your available disk devices. Note that the output of lsblk removes the /dev/ prefix from full device paths.
# format each drive:
sudo mkfs -t ext4 /dev/nvme0n1
sudo mkfs -t ext4 /dev/nvme1n1
sudo mkfs -t ext4 /dev/nvme2n1
sudo mkfs -t ext4 /dev/nvme3n1
# mount each drive on its mounting point:
sudo mount /dev/nvme0n1 /media/ebs_1
sudo mount /dev/nvme1n1 /media/ebs_2
sudo mount /dev/nvme2n1 /media/ebs_3
sudo mount /dev/nvme3n1 /media/ebs_4
If you plan on using a s3 bucket, you need to install s3fs tool and configure it:
cd /home/ec2-user
git clone https://github.com/s3fs-fuse/s3fs-fuse.git
cd s3fs-fuse
./autogen.sh
./configure --prefix=/usr --with-openssl && make -j99 && sudo make install
cd ..
sudo rm -R s3fs-fuse*
Open the fuse configuration file:
sudo nano /etc/fuse.conf
Find the line #user_allow_other and remove the # in front of user_allow_other. Exit with ^x, answer Y to Save modified buffer? (Answering "No" will DISCARD changes) and hit enter.
sudo chmod 640 /etc/fuse.conf
Allow s3fs to use your s3 bucket:
sudo touch /etc/passwd-s3fs
sudo nano /etc/passwd-s3fs
s3-bucket-name:aws_access_key_id:aws_secret_access_key
e.g. if the bucket name is radseq, this is what’s in the file:
radseq:TKUDJHLS3YPQSHWIKDBF:sLLDoi9uSBoBD0g8ECH3ZDTL9Onio2Ky5CdLm87C
Exit with ^x, answer y, hit enter
sudo chmod 640 /etc/passwd-s3fs
sudo chown ec2-user:root /etc/passwd-s3fs
sudo mkdir -p /media/s3
sudo chown -R ec2-user:root /media/s3
s3fs -o allow_other radseq /media/s3 # mount the bucket 'radseq'
sudo umount /media/s3/ # to unmount it, when needed
df -h # check that the mounts are in place
Congratulations! You now have a mounted s3 bucket and 4 EBS volumes mounted on your instance.
At this point your instance is ready to work. If you’re OK with what was installed and just want to go with that, the next step is saving your AMI. You can also continue installing software before saving your AMI. Below are 2 additional and optional sections:
Time required: < 5min
If the goal is to share your AMI (make it available to the public) always remove sensitive data:
As an example, the following command, run as root, locates the root user’s and other users’ shell history files on disk and deletes them:
find /root/.*history /home/*/.*history -exec rm -f {} \;
Remove the s3cfg configuration file and the s3fs password file:
sudo rm /root/.s3cfg /etc/passwd-s3fs
# From your instance:
sudo umount /media/s3/
sudo umount /media/ebs_1/
sudo umount /media/ebs_2/
sudo umount /media/ebs_3/
sudo umount /media/ebs_4/
#On your computer:
aws ec2 stop-instances --instance-ids $instance_id
# On your computer
aws ec2 create-image --instance-id $instance_id --name "radseq" --description "radseq analysis image"
Documentation to create an Amazon EBS-backed AMI
It can take several minutes to generate the image. To get the description of your image, use this command:
aws ec2 describe-images --owners self
Write down the ami id:
ami="ami-09abbab8663956c8a"
Congratulations, you now have a private AMI associated with your Amazon AWS account!
To make your AMI public, modify the AMI’s launch permissions by adding all
to the launch permission attribute:
aws ec2 modify-image-attribute --image-id $ami --launch-permission "Add=[{Group=all}]"
To verify the launch permissions of the AMI:
aws ec2 describe-image-attribute --image-id $ami --attribute launchPermission
From a stopped instance’s root volume:
aws ec2 create-snapshot --volume-id vol-1234567890abcdef0 --description "for RADseq analysis image"
aws ec2 start-instances --instance-ids $instance_id
Below are just a few of the software packages you can install and run on Amazon EC2. Time required: ~ 10min.
1. Make sure you have /usr/local/bin
in your PATH
:
sudo nano /home/ec2-user/.bash_profile
Add /usr/local/bin at the end of the line starting with PATH.
2. Add LD_LIBRARY_PATH
Also add this line in the file:
export LD_LIBRARY_PATH=/usr/local/lib/
Save and Exit: ^x, answer y, hit enter
source /home/ec2-user/.bash_profile
Inside my .bash_profile file, it looks like:
# .bash_profile
# Get the aliases and functions
if [ -f ~/.bashrc ]; then
. ~/.bashrc
fi
# User specific environment and startup programs
PATH=$PATH:$HOME/.local/bin:$HOME/bin:/usr/local/bin
export PATH
export LD_LIBRARY_PATH=/usr/local/lib/
3. Install software in /usr/local/bin
wget http://cmpg.unibe.ch/software/BayeScan/files/BayeScan2.1.zip
unzip BayeScan2.1.zip
sudo cp BayeScan2.1/binaries/BayeScan2.1_linux64bits /usr/local/bin/bayescan
sudo chmod 777 /usr/local/bin/bayescan
sudo rm -R BayeScan*
bayescan # to test
git clone https://github.com/vcftools/vcftools.git
cd vcftools
./autogen.sh
./configure && make -j99 && sudo make install
cd ..
sudo rm -R vcftools
vcftools # to test
wget http://s3.amazonaws.com/plink2-assets/plink2_linux_x86_64_20191020.zip
unzip plink2* -d plink
sudo cp plink/plink2 /usr/local/bin/plink # password required
sudo rm -R plink* # to remove the folder and zip file
plink # to test
wget http://cmpg.unibe.ch/software/fastsimcoal2/downloads/fsc26_linux64.zip
unzip fsc26_linux64.zip
sudo mv fsc26_linux64/fsc26 /usr/local/bin/fsc26
sudo chmod 777 /usr/local/bin/fsc26
sudo rm -R fsc26*
fsc26 # to test
To install Stacks, you’ll first need to install these dependencies:
git clone https://github.com/lh3/bwa.git
cd bwa
make -j99
sudo mv bwa /usr/local/bin/bwa
cd ..
sudo rm -R bwa
git clone https://github.com/samtools/htslib.git
cd htslib
autoheader
autoconf
./configure --prefix=/usr/local && make -j99 && sudo make install
cd ..
sudo rm -R htslib*
Note that this will install htslib files in: /usr/local/lib, /usr/local/include, /usr/local/bin and /usr/local/share. To change where files are installed use: ./configure --prefix=/where/to/install (remember to add this directory to your PATH).
git clone https://github.com/samtools/samtools.git
cd samtools*
autoheader
autoconf -Wno-syntax
./configure --with-htslib=/usr/local && make -j99 && sudo make install
cd ..
sudo rm -R samtools*
samtools # to test
git clone https://github.com/samtools/bcftools.git
cd bcftools*
autoheader
autoconf
./configure && make -j99 && sudo make install
cd ..
sudo rm -R bcftools*
wget http://catchenlab.life.illinois.edu/stacks/source/stacks-2.41.tar.gz # you want to use version 1.48 instead? Just change to the version number you want.
tar -xvf stacks*
cd stacks-*
./configure && make -j99 && sudo make install
cd ..
sudo rm -R stacks* # to remove the folder and gz file
populations # to test your installation!
Time required: ~ 45min.
Amazon always has an outdated R version. Consequently, you need to install the latest version:
cd /home/ec2-user
wget https://cran.r-project.org/src/base/R-latest.tar.gz
tar -xvf R-latest.tar.gz
cd R-*
./configure --with-x=yes --enable-R-shlib=yes --with-cairo=yes && make -j99 && sudo make install #take a coffee break!
cd ..
sudo rm -R R-*
Test the installation by typing R in the Terminal. Quit with q() and answer n to the prompt Save workspace image? [y/n/c]:
In the file /home/ec2-user/.bash_profile (using sudo nano /home/ec2-user/.bash_profile as shown above, or a text editor), add these lines:
export PATH=/usr/local/bin:$PATH
export RSTUDIO_WHICH_R=/usr/local/bin/R
Don’t forget to refresh with:
source /home/ec2-user/.bash_profile
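Check that the new R is the one found first (this configure installs R in /usr/local/bin by default):
which R # should print /usr/local/bin/R
R --version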
This will allow you to use RStudio through your favourite internet browser…
cd /home/ec2-user
wget https://download2.rstudio.org/server/centos6/x86_64/rstudio-server-rhel-1.2.5001-x86_64.rpm
sudo yum install -y rstudio-server-rhel-1.2.5001-x86_64.rpm
sudo rm -R rstudio-server*
Configure:
sudo adduser rstudio
Change the password (if you’re not already at the root level, type sudo -i first):
passwd rstudio
Grant user access to the rstudio home folder:
sudo chmod -R ugo+rwx /home/rstudio/
We need to set the library path for RStudio:
sudo nano /etc/rstudio/rserver.conf
Add this line:
rsession-ld-library-path=/usr/lib64/:/usr/local/lib/
Save and Exit: ^x, answer y, hit enter
To test that RStudio server works, go to your browser and use the public IP of the instance followed by :8787
Get the public ip:
# from the Terminal ON YOUR COMPUTER or a new Terminal window:
aws ec2 describe-instances
Or from Amazon EC2 console…
In the browser:
54.159.198.106:8787
You should see:
Enter the credentials:
When everything is configured properly, RStudio running in your browser should look like this:
Restarting RStudio Server
If you ever need to restart the server:
sudo rstudio-server restart
1. Create a .Renviron file:
#In Terminal
cd /home/ec2-user
touch .Renviron
sudo nano /home/ec2-user/.Renviron
2. Inside /home/ec2-user/.Renviron, add these lines:
PKG_CONFIG_PATH="/usr/local/lib/pkgconfig"
LD_LIBRARY_PATH="/usr/local/lib/:/usr/lib64:/usr/local/lib64/R/lib::/lib:/usr/local/lib64:/usr/lib/jvm/java-1.8.0-openjdk-1.8.0.222.b10-0.amzn2.0.1.x86_64/jre/lib/amd64/server"
INCLUDE_PATH="/usr/include/sys/stat.h:/usr/include/sys/signal.h"
Save and Exit nano with: ^x, answer y, hit enter
You could also do all this inside RStudio on your browser:
3. Make the file available for both users, ec2-user and rstudio, because you might want to use R directly from the Terminal or inside RStudio…
sudo cp /home/ec2-user/.Renviron /home/rstudio/.Renviron
4. Permissions:
chown ec2-user:rstudio /home/ec2-user/.Renviron
chown ec2-user:rstudio /home/rstudio/.Renviron
chmod 777 /home/ec2-user/.Renviron
chmod 777 /home/rstudio/.Renviron
The next steps are totally up to you.
Note that for Linux, using radiator, dartR, adegenet or any packages that rely on the sf package is more complicated: you first need to install the dependencies through the terminal:
udunits
cd /home/ec2-user/
curl -O ftp://ftp.unidata.ucar.edu/pub/udunits/udunits-2.2.26.tar.gz
tar -xvf udunits*
cd udunits*
./configure && make -j99 && sudo make install
sudo ldconfig
cd ..
sudo rm -R udunits*
Using sudo nano /home/ec2-user/.bash_profile, add this line:
export UDUNITS2_XML_PATH="/usr/local/share/udunits/udunits2.xml"
Refresh:
source /home/ec2-user/.bash_profile
wget http://download.osgeo.org/proj/proj-6.1.0.tar.gz
tar zxf proj*
cd proj*
./configure --libdir=/usr/lib64 && make -j99 && sudo make install
cd ..
sudo rm -R proj*
wget http://download.osgeo.org/gdal/CURRENT/gdal-3.0.1.tar.gz
tar zxf gdal*
cd gdal*
./configure --prefix=/usr/local --libdir=/usr/lib64 --with-proj=/usr/local && make -j99 && sudo make install #time for coffee...
cd ..
sudo rm -R gdal*
git clone https://git.osgeo.org/gitea/geos/geos.git
cd geos*
./autogen.sh
./configure --libdir=/usr/lib64 && make -j99 && sudo make install
cd ..
sudo rm -R geos*
source /home/ec2-user/.bash_profile
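You can quickly verify which versions were installed (these helper scripts are installed by the libraries themselves):
gdal-config --version
geos-config --version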
After those dependencies, I usually go to RStudio in my browser and run these:
if (!require("devtools")) install.packages("devtools") # quite long on Linux
install.packages("tidyverse") # long
install.packages("adegenet") # not too bad
devtools::install_github("thierrygosselin/grur") # quite long on Linux
devtools::install_github("thierrygosselin/assigner")
# No need to install radiator because it's a dependency of grur and assigner...
install.packages("BiocManager")
BiocManager::install("SeqVarTools")
BiocManager::install("SNPRelate")
If you want to install dartR, follow these instructions, because the direct CRAN install won’t work:
BiocManager::install("qvalue")
devtools::install_github("green-striped-gecko/PopGenReport")
devtools::install_github("green-striped-gecko/dartR")
My vignette RADseq genomics in R has sections on:
To know how much time was taken to execute a command, shell script or any other program, use time in front of the command.
If you ever accidentally close the Terminal application or get disconnected over SSH, the state of all your processes will be lost unless you use the screen commands!
Important Screen commands:
screen -h # Display options
screen -list # List open screens
screen -r sessionID # Reattach to screen session with sessionID
screen -S sessionID # Create new screen session with name 'sessionID'
screen -D sessionID # Force detach session (if cannot re-attach)
ctrl-a ctrl-d # Detach from current session
ctrl-a ctrl-c # Create new window in session
ctrl-a ctrl-n # Go to next window
ctrl-a ctrl-p # Go to previous window
ctrl-a ctrl-a # Go to previously selected window
ctrl-a ? # Display help
screen -S sessionID
# press return
# and enter your command or script
time populations ...
# you can close the terminal window, go home and re-open the terminal window ...
screen -ls
screen -r sessionID