Lotus Base blog

L. japonicus Gifu v1.2 genome

Mon, 20 Apr 2020 00:00:00 +0200

The Lotus japonicus Gifu genome assembly v1.2 is now officially released, and an associated pre-publication manuscript has been submitted to bioRxiv. Datasets on Lotus Base have therefore been restructured to support incoming data from the new genome assembly.

Note to users with elevated privileges who have access to the v1.1 assembly preview:

v1.1 contains gene predictions that have since been reworked, and it uses a deprecated nomenclature, so there is no one-to-one mapping between gene IDs in v1.1 and v1.2. We strongly encourage you to use the v1.2 data going forward.

Gifu data available today

The following tools on the site have been updated to allow access to L. japonicus Gifu data.

Genome browser

The Gifu genome v1.2 is now accessible as a new dataset on our JBrowse implementation, the Genome Browser. At the time of writing, the new genome contains the following datasets:

Gene model with human readable annotations, GO annotations, and InterPro domain predictions (GFF3 file is available for download, see below)
Non-coding RNAs
Genome gaps
Repeats

Expression Atlas (ExpAt)

The ExpAt tool has now been updated with new RNAseq data mapped to the Gifu genome by Dugald Reid. Here is a sample expression heatmap produced using the candidate genes published in the manuscript:

List of genes included in the heatmap:

LotjaGi3g1v0307700, LjCCaMK
LotjaGi2g1v0343300, LjCyclops
LotjaGi1g1v0643700, LjErn1
LotjaGi3g1v0414350, LjNin
LotjaGi1g1v0257100, LjNsp2
LotjaGi4g1v0343900, LjNf-yb1
LotjaGi5g1v0106700, LjNf-ya1
LotjaGi1g1v0001500, LjNin
LotjaGi3g1v0512000, LjHar1

View

The View tool is intended as a replacement for Transcript Explorer (TrEx), and gives you a quick overview of individual transcripts, genes, GO annotations and more. For example, if you are interested in all the data associated with the gene LjNin (LotjaGi1g1v0001500), you can search for it on the View page, or access the link directly.

Downloadable data

All Gifu-related downloadable data can be accessed from our data page. The newly published files are:

FASTA files for the genome assembly, coding sequences, and predicted protein sequences
GFF3 file for Gifu predicted gene annotations, containing human readable annotations, GO annotations, and InterPro domain predictions
Gene Ontology file

Future roadmap

In the next few months, we will be gradually mapping additional data to the Gifu genome:

LORE1 insertion data

Post-mortem on downtime experienced by BLAST-related tools

Wed, 21 Nov 2018 00:00:00 +0100

Over the weekend of November 17–18, 2018 and the working week that followed, all BLAST-related tools on Lotus Base were inaccessible. The affected modules were:

Lotus BLAST, which runs SequenceServer v1.0.9 as a Passenger app
The Sequence Retrieval tool (SeqRet), which relies on being able to sniff out BLAST database metadata by executing the blastdbcmd binary

Diagnosing the issue

The issue was two-fold:

SequenceServer was running as a Phusion Passenger app initialized by an arbitrarily named user via the PassengerUser option in the httpd.conf file. The user should have been apache, so that the processes spawned by Apache will have the correct read permissions to access all app binaries.
The Sequence Retrieval tool calls an internal API endpoint which relies on being able to execute the blastdbcmd binary. However, since the binary belongs to a different user group, the API will fail and return an empty array, which causes PHP to throw an error when attempting to display BLAST database-related metadata.

What was done to fix it?

Updating the read permissions for the BLAST binaries and changing the PassengerUser for SequenceServer fixed the issue.
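A quick way to confirm this class of problem is to check, as the user the web server runs under, whether a binary is visible, readable, and executable. The sketch below uses only the Python standard library; the function name and the path in the usage comment are placeholders for illustration, not the actual setup on the Lotus Base server.

```python
import os


def diagnose_binary(path):
    """Report whether the current user can see, read, and execute `path`."""
    return {
        "exists": os.path.exists(path),
        "readable": os.access(path, os.R_OK),
        "executable": os.access(path, os.X_OK),
    }


# Hypothetical usage, run as the same user as the web server process:
#   diagnose_binary("/path/to/blastdbcmd")
```

If `executable` comes back `False` for the web server's user but `True` for your own account, a user or group mismatch of the kind described above is a likely cause.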

GateKeeper—Migrating away from IP-based controlled access

Mon, 12 Dec 2016 00:00:00 +0100

We are announcing a change in user access to controlled, internal data available to CARB members, with immediate effect. Traditionally, we have offered access to CARB members based on their IP address (and VPN connection). This strategy worked fine for quite a while, as there was no need to fine-tune access to internal data.

However, in light of the changing personnel in the lab, coupled with collaborators to whom we wish to grant access to sensitive data, detecting IP addresses to restrict data access will become overwhelmingly tedious and unreliable.

Ensuring continued, uninterrupted access

Therefore, all CARB users will be required to register for a Lotus Base account in order to access internal files, if they have not done so already. Features that are affected by this change are:

The current system administrator (Terry, terry@mbg.au.dk) will be responsible for adding verified/validated CARB members into a user group that has exclusive access to internal data. He will be notified when new CARB members have registered for accounts, and will perform necessary validation with new users before granting access.

Registering for an account does not automatically grant access to internally-available resources.

If you are a CARB collaborator who wishes to have access to internal data, please do not hesitate to reach out to us.

How are you affected?

Pre-existing CARB users with accounts with Lotus Base will not see any service disruptions—you have been automatically migrated over to the new controlled access system. To access internally available data, simply remember to log in. Users that do not have an account with Lotus Base, however, are strongly encouraged to register. Terry will keep in touch with you once you have registered for an account.

Your security is our priority

In order to prevent session hijacking, we recycle user sessions frequently. This means that you might be logged off within 24 hours of logging in, unless you have explicitly asked to be logged in for a week when signing in. You are encouraged not to save your login credentials on public terminals.

If you suspect your account has been compromised, or you have misplaced your login credentials, you can reset your password and regain control over your account.

Using InterProScan like a pro

Mon, 28 Nov 2016 00:00:00 +0100

Biologists are often challenged with this question when working with proteins:

Now… what does your protein do?

Domain prediction—best friend or worst nightmare?

People want to know everything about your-favourite-protein-1 (YFP1). What does it look like? What are the predicted domains? Do these domains have any functions and processes associated with them? Are they located in specific parts of the cell?

A very simplified pipeline would be as follows:

Check the amino acid sequence of YFP1 against various domain prediction programs
Obtain domain and/or structural predictions of YFP1
Infer biological function, molecular processes, and/or cellular components from said domain predictions

However, there are many domain prediction algorithms out there, and an overwhelming number of them use Hidden Markov Models. These algorithms, such as PANTHER, Phobius, Pfam, SuperFamily, and TMHMM, offer simple web interfaces that allow end-users to submit single sequences (or a small number of them, at best). EMBL-EBI offers InterPro, which integrates all of these prediction algorithms, but again only allows single-sequence submission from its web interface.

There appears to be no simple way of submitting a set of protein sequences to multiple prediction algorithms, at least not through a web interface. If you are willing to dive into the world of command line interfaces, things start to look a bit better.

This article is based on my experience with InterPro, and on my work using the RESTful services made available by EMBL-EBI to offer comprehensive Lotus data to legume researchers around the world.

Example use case: Lotus Base

As the principal developer and designer behind Lotus Base, I have worked on performing predictions on the entire set of predicted proteins using the most recently published Lotus japonicus genome, meaning 50,000+ predicted proteins in total that have to be parsed. The screenshot below shows an example of how I have pulled protein-specific data from a MySQL database built from InterProScan’s domain prediction results, and merged the data with additional metadata obtained with the EB-eye REST service.

Domain predictions for my-favourite-gene, the flagellin receptor LjFls2. Domain prediction graph made using d3.js.

Of course, Lotus Base presents itself as a rather extreme use case due to the large volume of predicted proteins analysed. However, the methods described below would be just as applicable to a researcher who, say, has obtained a list of proteins that are significantly enriched in one biological sample compared to another. The first step towards unraveling the functions of these proteins, based on their gene ontology predictions, would be to obtain their domain predictions.

InterProScan vs InterPro RESTful service

You have two options from here on. If you are blessed with access to a computing cluster running on Linux, you can download and install a local version of InterProScan, and run it on FASTA files containing n sequences, in parallel or in a queue. The second, less handy option (but also the most accessible one) is to take advantage of the RESTful service provided by EMBL-EBI. The latter can be run on any computer, although preferably one running Unix/Linux (because that’s what my code will be running on). The only drawback is that EMBL-EBI’s fair use agreement only allows you to run InterProScan on 30 sequences at any one time.

Both services will give you the most up-to-date domain predictions, and both require you to re-run your proteins when additional datasets are included. When InterProScan includes additional prediction algorithms, you can simply select to run said algorithms, instead of the entire set, on your sequences, and join the output with existing predictions.

Option A: Using EMBL-EBI’s InterPro REST service

Using the REST service provided by EMBL-EBI is a way to perform domain predictions on your protein(s) of interest without needing to invest in an expensive computing cluster, or to obtain access to one. For this part of the tutorial to work, you will need to ensure that Python3 is installed (InterPro provides a Python2 client library, but that is not covered in this section).

Explode FASTA file into individual sequence files

As the InterPro REST service only accepts single sequences, the easiest way is to split a multi-sequence FASTA file into individual sequence files. If your FASTA files are formatted such that each entry takes up two lines, one for the header and one for the sequence, you can do something like:

split -l 2 /path/to/your/fasta/file

However, this is often not the case, as FASTA files are recommended to be broken into lines no longer than 60 characters. If that is the case, you might want to rely on BioPython to do the parsing for you:
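As a dependency-free stand-in for the BioPython approach, here is a minimal sketch that reads a line-wrapped FASTA file with the standard library and writes one two-line file per record. BioPython's `SeqIO.parse(path, "fasta")` would take the place of the hand-written reader. The function names and the `seq_000001.fa` output naming are assumptions made for illustration.

```python
from pathlib import Path


def read_fasta(path):
    """Yield (header, sequence) pairs from a FASTA file, joining wrapped lines."""
    header, chunks = None, []
    with open(path) as handle:
        for line in handle:
            line = line.strip()
            if not line:
                continue
            if line.startswith(">"):
                if header is not None:
                    yield header, "".join(chunks)
                header, chunks = line[1:], []
            elif header is not None:
                chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def explode_fasta(path, out_dir):
    """Write each record to its own two-line FASTA file and return the paths."""
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    written = []
    for index, (header, sequence) in enumerate(read_fasta(path), start=1):
        target = out_dir / f"seq_{index:06d}.fa"
        target.write_text(f">{header}\n{sequence}\n")
        written.append(target)
    return written
```

Numbering the output files, rather than naming them after the headers, avoids problems with headers that contain characters that are not valid in file names.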

Hand individual FASTA file off to the REST service

We can then iterate through these individual FASTA files and pass them to InterPro’s REST service. InterPro has provided us with various clients to interface with their REST service; I have chosen to work with their Python3 client. I did not modify their client script, with the exception of commenting out the line that prints the status in the function, so that my console will not be crowded with printouts.

It is important to respect the 30 sequences per batch limit of the InterPro REST service. Therefore, we will use a simple bash script that, while iterating through all individual FASTA files, stops after 30 files until the results from all 30 jobs have been returned:
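To illustrate the batching logic, here is a rough Python sketch (not the bash script referred to above). It starts one process per FASTA file, at most 30 at a time, and waits for the whole batch to finish before starting the next. How the client is invoked is left to a caller-supplied `build_command` function, because the exact client script name and options are not given here; the command in the usage comment is purely hypothetical.

```python
import subprocess

BATCH_SIZE = 30  # per-batch limit quoted in the text


def batched(items, size):
    """Yield successive lists of at most `size` items."""
    for start in range(0, len(items), size):
        yield items[start:start + size]


def run_in_batches(fasta_files, build_command, batch_size=BATCH_SIZE):
    """Run one process per file, at most `batch_size` concurrently.

    The next batch starts only after every process in the current batch
    has exited. Returns a list of (file, exit_code) tuples.
    """
    results = []
    for batch in batched(list(fasta_files), batch_size):
        procs = [(f, subprocess.Popen(build_command(f))) for f in batch]
        for f, proc in procs:
            results.append((f, proc.wait()))
    return results


# Hypothetical usage; replace the command with however you call your client:
#   run_in_batches(paths, lambda f: ["python3", "client.py", str(f)])
```

Collecting the exit codes makes it easy to re-submit only the sequences whose jobs failed.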

If you want to obtain other output formats, remember to modify the option. In my experience, each batch (of 30 sequences) takes around 2 minutes to complete.

The major drawback of this method is that it is slow if you are attempting to scan an entire collection of predicted proteins/transcripts: at around 2 minutes per batch of 30, 50,000 sequences would need roughly 1,700 batches, or more than two days of back-to-back submissions. Use InterProScan on a computing cluster if at all possible.

Option B: InterProScan on a computing cluster

Installing InterProScan

Follow the published instructions on installing InterProScan. I ran into small hiccups, such as accidentally using an outdated version of Java (≤1.7) and having a dated GCC library. Loading up-to-date versions ensured that both InterProScan and the bundled BLAST+ package could be executed properly.

Adding proprietary algorithms

Note that InterProScan does not come with Phobius, SignalP, and TMHMM preinstalled. You will have to request the compiled binaries of these algorithms, and upload them to their respective folders in the directory.

If you are unable to get hold of these libraries, you will have to retrieve the output of these algorithms via the InterPro REST service.

A hitch with SignalP is that it assumes a fixed directory for loading the library files. This causes a fatal error where FASTA.pm cannot be loaded. Remember to update the environment so that it points to the signalp directory (it will load libraries from the subfolder automagically).

# full path to the signalp-4.1 directory on your system (mandatory)
BEGIN {
    $ENV{SIGNALP} = '<path/to/interproscan>/bin/signalp/4.1';
}

Check that all prediction algorithms are loaded

After you’ve done that, ensure that file is properly updated with the file paths of your binaries for the added libraries. After that, proceed to run without any arguments. It will print out all the algorithms that were detected and loaded correctly. Ensure that none is left behind—InterProScan will inform you if any of them has failed to load.

Depending on the number of sequences you want to submit per batch, you will have to update

Getting your FASTA files ready

You would want to process FASTA files in batches instead of all in one go. I have decided to split a unified FASTA file that contains all 50,000+ amino acid sequences into files containing 500 entries each. If your FASTA files are formatted such that each entry takes up two lines, one for the header and one for the sequence, you can do something like:

split -l 1000 /path/to/your/fasta/file

…assuming that you want 500 entries per file. However, this is often not the case, as FASTA files are recommended to be broken into lines no longer than 60 characters. If that is the case, you might want to rely on BioPython to do the parsing for you. The first step is to create a filtered FASTA file that is formatted such that each entry occupies two lines, generating a file. The second step is to batch parse this filtered file using itertools, to create batches of FASTA files containing 500 entries (i.e. 1000 lines) each:
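The two steps can be sketched in one pass as follows. This is a minimal, standard-library-only illustration: a small hand-written reader stands in for BioPython's parser, and `itertools.islice` takes 500 records at a time from it. The function names and the `batch_0001.fa` output naming are assumptions.

```python
from itertools import islice
from pathlib import Path


def read_fasta(path):
    """Yield (header, sequence) pairs, joining line-wrapped sequences."""
    header, chunks = None, []
    with open(path) as handle:
        for line in handle:
            line = line.strip()
            if not line:
                continue
            if line.startswith(">"):
                if header is not None:
                    yield header, "".join(chunks)
                header, chunks = line[1:], []
            elif header is not None:
                chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def write_batches(path, out_dir, batch_size=500):
    """Split a FASTA file into files of `batch_size` two-line entries each."""
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    records = read_fasta(path)  # a generator, consumed batch by batch
    written = []
    while True:
        batch = list(islice(records, batch_size))
        if not batch:
            break
        target = out_dir / f"batch_{len(written) + 1:04d}.fa"
        with open(target, "w") as out:
            for header, sequence in batch:
                out.write(f">{header}\n{sequence}\n")
        written.append(target)
    return written
```

Because the reader is a generator, only one batch is held in memory at a time, which matters little for 50,000 proteins but keeps the approach usable for much larger files.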

Submit your jobs iteratively to the computing cluster

In this case, I am using SLURM for batch job submission. I will not go into details on how the job submission is done, as it is highly dependent on the configuration of individual clusters. The actual command is quite simple:

/path/to/interproscan.sh \
-i /path/to/fasta.fa -dp -iprlookup --goterms --pathways

Note that I have turned off precalculated match lookup using the flag because the computing cluster I am on blocks external connections for security reasons. Moreover, the Lotus japonicus proteins are yet to be submitted to UniProt, so it is highly unlikely that we will find many matching proteins in the public database.

Here is an example of a batch job submission template you can use:
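A minimal sketch of what such a template could look like, rendered from Python so that one script can be generated per batch file. The `#SBATCH` directives and their values are assumptions (12 cores and 24 GB mirror the figures quoted under Performance), and your cluster may well require different or additional options such as a partition or account. The InterProScan command is the one shown above.

```python
from pathlib import Path

SBATCH_TEMPLATE = """#!/bin/bash
#SBATCH --job-name={job_name}
#SBATCH --cpus-per-task={cpus}
#SBATCH --mem={memory}
#SBATCH --output={job_name}.%j.log

{interproscan} \\
-i {fasta} -dp -iprlookup --goterms --pathways
"""


def render_job(fasta, interproscan="/path/to/interproscan.sh",
               cpus=12, memory="24G"):
    """Return the text of an sbatch script for one FASTA batch file."""
    job_name = f"ipr_{Path(fasta).stem}"
    return SBATCH_TEMPLATE.format(
        job_name=job_name,
        cpus=cpus,
        memory=memory,
        interproscan=interproscan,
        fasta=fasta,
    )
```

Each rendered script can be written to disk and handed to `sbatch`; `%j` in the log file name is replaced by SLURM with the job ID.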

Boom! Run it and wait for magic to happen.

Performance

In the case of Lotus Base and our collection of predicted transcripts, we scanned 49,598 sequences in batches of 1,000, creating 50 jobs. The jobs were run with 24 GB of memory allocated over 12 cores, on nodes equipped with Intel “Sandy Bridge” E5-2670 (2.67GHz) or “Haswell” E5-2680v3 (2.5GHz) CPUs. After normalizing for processor speeds and library sizes, the real time consumed per job was 2.50±0.28h (CPU time: 4.32±0.35h).

Parsing InterPro/InterProScan outputs

The file that contains all the juicy data is the TSV file, which you can easily import into a relational database such as MySQL. The InterProScan wiki documents what each column in the TSV file contains.

For Lotus Base, I simply imported the TSV file as-is into a MySQL table, and used statements to merge transcript metadata from additional tables we have. It’s as simple as that!
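If you would rather inspect the TSV output in a script before loading it into a database, a minimal Python sketch could look like this. The column labels are my own names for the 15 columns described in the InterProScan documentation (the last four are only present when the corresponding lookup options are used), so check them against the documentation for your InterProScan version.

```python
import csv

# Labels are illustrative; the order follows the InterProScan TSV description.
COLUMNS = [
    "protein_accession", "md5", "length", "analysis",
    "signature_accession", "signature_description",
    "start", "stop", "score", "status", "date",
    "interpro_accession", "interpro_description",
    "go_terms", "pathways",
]


def parse_interproscan_tsv(path):
    """Return one dict per match, padding rows that lack optional columns."""
    rows = []
    with open(path, newline="") as handle:
        reader = csv.reader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)
        for fields in reader:
            if not fields:
                continue
            fields = fields + [""] * (len(COLUMNS) - len(fields))
            row = dict(zip(COLUMNS, fields))
            for key in ("length", "start", "stop"):
                row[key] = int(row[key])
            # GO terms are pipe-separated; "-" marks an empty field.
            row["go_terms"] = [
                term for term in row["go_terms"].split("|")
                if term and term != "-"
            ]
            rows.append(row)
    return rows
```

Scores are left as strings because some analyses report `-` instead of a number.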

This article is also published on Medium.com.

Introducing ExpAt, the Lotus japonicus Expression Atlas

Fri, 26 Aug 2016 00:00:00 +0200

Expression data from the model legume Lotus japonicus, while publicly available through other online resources, are fragmented across sources that lack accessibility and options for analysis and visualisation. Here we introduce ExpAt, the L. japonicus Expression Atlas, which offers features that empower legume researchers without the need for extensive knowledge in computation or skills in data visualisation.

Give ExpAt a try

What does ExpAt do?

ExpAt is a tool that allows you to query for the expression levels of your genes/transcripts of interest. It generates almost publication-ready, vector-based graphics based on the retrieved expression data, and presents them in a line graph and a heatmap. The line graph feature will be turned off when too many genes/transcripts are used in a single search, as it then offers little insight into the expression patterns. You may export all the relevant data and charts using the “export data” option that appears above the charts.

Here is an example of an unmodified ExpAt chart, showing a line graph and a clustered heatmap with one dendrogram on each axis:

Here is an example of an ExpAt chart that was slightly tweaked in Adobe Illustrator and used in a publication (Mun et al., in review):

Data transformation

To ease quick visual comparison across genes with significantly different levels of absolute expression, measured either in (1) reads per kilobase of transcript per million mapped reads (RPKM) for RNAseq datasets, or (2) arbitrary Affymetrix units for Affymetrix microarray datasets, we included two ways to transform the expression levels: normalisation and standardisation.

Data normalisation is simply the rescaling of expression values to the domain $[0, 1]$: the lowest log-transformed expression level, $(\log_{10} x)_{\min}$, is subtracted from the log-transformed sample expression level $\log_{10} x_s$, and the result is divided by the difference between the log-transformed maximum and minimum expression levels. Expression values are $\log_{10}$-transformed prior to normalisation in order to allow comparison of extreme values.

$x^\prime_s = \frac{(\log_{10} x_s) - (\log_{10} x)_{\min}}{(\log_{10} x)_{\max} - (\log_{10} x)_{\min}}$
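As a worked illustration of the normalisation formula, here is a minimal Python sketch for a single row of expression values. It assumes all values are strictly positive, since the logarithm of zero is undefined and the way ExpAt treats zero expression is not described here; returning zeros for a constant row is likewise my own choice.

```python
import math


def normalise(values):
    """Rescale log10-transformed values to the range [0, 1]."""
    logs = [math.log10(v) for v in values]
    lowest, highest = min(logs), max(logs)
    if highest == lowest:
        # A constant row has no spread to rescale.
        return [0.0 for _ in logs]
    return [(x - lowest) / (highest - lowest) for x in logs]
```

For example, expression values of 1, 10, and 100 have log-transformed values of 0, 1, and 2, and are therefore mapped to 0, 0.5, and 1.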

Meanwhile, data standardisation rescales the expression levels on a per-row basis, across conditions, to have a mean of zero and a standard deviation of one. This is performed by subtracting the average expression level across all samples, $\mu$, from the sample expression level $x_s$, and dividing the difference by the sample standard deviation computed across all samples, $\sigma$. This strategy is, however, erroneously labelled as “normalisation” in some studies.

$x^\prime_s = \frac{x_s - \mu}{\sigma}$
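The standardisation formula can be sketched in the same way. This minimal example uses the sample standard deviation (with an $n - 1$ denominator) from Python's `statistics` module, following the wording above; whether ExpAt uses exactly this estimator, and how it treats rows with no variation, are assumptions.

```python
import statistics


def standardise(values):
    """Rescale one row of expression values to mean 0 and standard deviation 1."""
    mean = statistics.mean(values)
    sd = statistics.stdev(values)  # sample standard deviation, needs >= 2 values
    if sd == 0:
        # A constant row cannot be scaled; every value sits at the mean.
        return [0.0 for _ in values]
    return [(x - mean) / sd for x in values]
```

For example, a row with values 1, 2, and 3 has a mean of 2 and a sample standard deviation of 1, so it is transformed to -1, 0, and 1.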

Clustering

ExpAt offers the possibility to cluster your expression level data by gene/transcript identifiers and/or conditions/samples. This two-dimensional data is referred to as a matrix: when the matrix has a dimension of $1 \times n$ or $n \times 1$, k-means clustering is used; when it is larger than that, hierarchical agglomerative clustering is used. Clustering parameters can be modified on the fly, and the heatmap and/or line graphs will be updated accordingly.

Dataset availability

We have integrated several publicly-available datasets:

The L. japonicus gene expression atlas (LjGEA)¹⁻⁵, with probe identifiers mapped to L. japonicus v3.0 proteins using NCBI BLAST, and
The early Lotus root responses to germinating spore exudates from arbuscular mycorrhizal fungi⁶.

Should you want to add your own expression data to the ExpAt tool, feel free to reach out to us via the contact form.

Citation

If you have used ExpAt for data transformation, analysis (k-means or hierarchical clustering), and/or visualisation, we ask that you cite Lotus Base⁷, and the relevant publications that generated the said dataset.

References

Verdier, J., Torres-Jerez, I., Wang, M., Andriankaja, A., Allen, S. N., He, J., Tang, Y., Murray, J. D., and Udvardi, M. K. (2013). Establishment of the Lotus japonicus gene expression atlas (LjGEA) and its use to explore legume seed maturation. Plant J, 74(2):351–362.
Díaz, P., Betti, M., Sánchez, D. H., Udvardi, M. K., Monza, J., and Márquez, A. J. (2010). Deficiency in plastidic glutamine synthetase alters proline metabolism and transcriptomic response in Lotus japonicus under drought stress. New Phytol, 188(4):1001–1013.
Guether, M., Balestrini, R., Hannah, M., He, J., Udvardi, M. K., and Bonfante, P. (2009). Genome-wide reprogramming of regulatory networks, transport, cell wall and membrane biogenesis during arbuscular mycorrhizal symbiosis in Lotus japonicus. New Phytol, 182(1):200–212.
Høgslund, N., Radutoiu, S., Krusell, L., Voroshilova, V., Hannah, M. A., Goffard, N., Sanchez, D. H., Lippold, F., Ott, T., Sato, S., Tabata, S., Liboriussen, P., Lohmann, G. V., Schauser, L., Weiller, G. F., Udvardi, M. K., and Stougaard, J. (2009). Dissection of symbiosis and organ development by integrated transcriptome analysis of Lotus japonicus mutant and wild-type plants. PLoS One, 4(8):e6556.
Sanchez, D. H., Lippold, F., Redestig, H., Hannah, M. A., Erban, A., Krämer, U., Kopka, J., and Udvardi, M. K. (2008). Integrative functional genomics of salt acclimatization in the model legume Lotus japonicus. Plant J, 53(6):973–987.
Giovannetti, M., Mari, A., Novero, M., and Bonfante, P. (2015). Early Lotus japonicus root transcriptomic responses to symbiotic and pathogenic fungal exudates. Front Plant Sci, 6:480.
Mun, T., Bachmann, A., Gupta, V., Stougaard, J., and Andersen, S. U. (under review). Lotus Base, an integrated information portal for Lotus japonicus.

Mapping transcripts across Lotus genome versions

Wed, 24 Aug 2016 00:00:00 +0200

The availability of various versions of the L. japonicus genome, while an important resource in legume research, makes it difficult for users to map annotated genes and/or transcripts from one version to another. The Transcript Mapper (TRAM) tool can be used for exactly this purpose, and we currently support versions 2.5 and 3.0. As with all the other toolkits provided with Lotus Base, TRAM provides deep-linking to other tools for your convenience.

Redesigned LORE1 search form, and pan-version TREX searches

Sat, 05 Mar 2016 00:00:00 +0100

Lotus Base was originally conceived as a very simple web interface for searching for, and ordering, LORE1 lines, but over the years it gradually evolved into a fully-fledged Lotus japonicus online resource. Therefore, it is not surprising that the LORE1 search form is one of the most antiquated and complicated components of the site, and one which we never really got around to upgrading.

Now the LORE1 line search page has been revamped and brought up to date with the cleaner style of the site in general. To improve user experience, we have removed the step-form-like search flow, which complicated the decidedly simple purpose of the form anyway—to search for LORE1 lines of interest.

In other news, we have enabled pan-Lj-genome-version Transcript Explorer (TREX) searches. Although the form defaults to the latest version of the genome (at the time of writing, this would be v3.0), it is possible to select from other versions, either in a standalone or combinatory manner, of all hitherto published L. japonicus genomes. Do note that due to the way the genome is assembled, genome coordinates are not preserved across versions. For example, position 65,536 on chromosome 1 in v2.5 will not be position 65,536 on chromosome 1 in v3.0.

JBrowse updated to v1.12.0

Tue, 01 Mar 2016 00:00:00 +0100

Just upgraded the Lotus japonicus #genome browser to the latest version of JBrowse, v1.12.0 https://t.co/ttAO0z7O8o—thank you, @usejbrowse!
— LotusBase (@lotusbase) March 1, 2016

We have successfully upgraded our JBrowse installation, used to power visualization of the L. japonicus genome, to version 1.12.0. The new version comes with several exciting features, such as:

Added ability to open a new genome in FASTA format from the browser. Also supports indexed FASTA.

Support for inline reference sequence configurations.

Source: http://jbrowse.org/jbrowse-1-12-0/

Lotus Base Soft Launch

Wed, 27 Jan 2016 00:00:00 +0100

After a few years of continuous development and a total of 160,000+ lines of code written, we are soft-launching Lotus Base, an online resource platform for the model legume Lotus japonicus.

What is Lotus Base?

Lotus Base aims to be a one-stop online resource for everything concerning the model legume Lotus japonicus. We are currently hosting both v2.5 and v3.0 of the L. japonicus genome, as well as other databases associated with it, such as LORE1 insertions, protein sequences, coding sequences, mRNA libraries and more.

LORE1 lines ordering

For now we are sticking to the old site for processing LORE1 orders, due to delays in implementing the new order system. We will let you know once this migration is complete. Otherwise, the ordering protocol is the same as usual.

Made by users, for users

Lotus Base is developed in-house at the Centre for Carbohydrate Recognition and Signalling, Aarhus University, Denmark. Before this public release, we tested the site extensively with the help of our own users, so the user experience of the site is tailored to the needs of researchers like us.

We have an extensive set of tools for the end-user, and have integrated various third-party, open-source projects to make most of our resources easily available to the public. For example, we are using SequenceServer as a wrapper for NCBI BLAST binaries to power our customized BLAST, as well as JBrowse for our genome browser. We will update these tools on a regular basis whenever we see fit.

We have some tools that are only available for internal access, marked with a “Closed Beta” message if you ever encounter them. If you want to gain access to them, please contact us.

Help us make Lotus Base better

If you have encountered any technical issues with using the site, feel free to let us know via our issue tracker. We have a small but dedicated team of developers working on the project.

Stay updated

Subscribe to the Lotus Base newsletter to stay updated with the most recent news.

Change in v3.0 gene nomenclature

Thu, 13 Aug 2015 00:00:00 +0200

Version 4 of the Lotus genome and its gene accession IDs are due for release, and we expect coordinates to change drastically. To pre-empt a possible clash in the namespace of gene accessions, we have implemented a change in version 3.0 gene accessions for Lotus japonicus with immediate effect.

For example, the old accession ID for the gene “ATP synthase D chain-related protein” is Lj1g2536050.1. With the updated nomenclature where the version number is appended after the chromosome name, the new accession ID for the same gene will be Lj1g3v2536050.1.

A quick way to convert your existing gene IDs, should you want to search them against our databases, would be to append 3v after the Lj[…]g[…] text in your gene accession ID so that it becomes Lj[…]g3v[…]. The databases and site features affected by this update in nomenclature are:

LORE1 search
BLAST databases
Gene annotations (≥v3.0)
Genic and exonic insertions databases (≥v3.0)
Expression Atlas (ExpAt) databases
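If you have a long list of old-style accessions, the conversion rule described above can be sketched in Python as follows. The regular expression is an assumption about what pre-change IDs look like, based on the example Lj1g2536050.1; anything that does not match, including IDs that already carry the 3v marker, is returned unchanged.

```python
import re

# "Lj", a chromosome label, "g", then digits with an optional ".N" suffix.
OLD_ID = re.compile(r"^(Lj\w+?g)(\d+(?:\.\d+)?)$")


def to_v3_accession(gene_id):
    """Insert "3v" after the Lj[...]g prefix of an old-style accession."""
    match = OLD_ID.match(gene_id.strip())
    if match is None:
        return gene_id
    prefix, rest = match.groups()
    return f"{prefix}3v{rest}"
```

Applied to the example above, `to_v3_accession("Lj1g2536050.1")` gives `Lj1g3v2536050.1`.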