Splitting A Huge Textfile By Regular Expressions

Hi!

I have a fasta file with biological DNA sequences.
FASTA files are built like this:
>This_is_a_FASTA_header
TTTATATATAGACGATGACGATGACA
>The_next_sequence_begins
GGGCACAGTAGCAGA
>And_another
TGCGAGAGGTAGTAGAT

In my case, every header line (starting with ">") carries one of 360 indices directly after the ">":
>001_blabla
....
>360_blabla

I want to split my big combined FASTA file into 360 separate files, each holding the sequences that share the same index.

Thank you very much!
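
A minimal Python sketch of one way to do the split, assuming every header starts with ">" followed by a three-digit index and an underscore (as in ">001_blabla"). The function name, the regular expression and the "<index>.fasta" output naming are illustrative assumptions, not something given in the question. It keeps one open output file per index, so with 360 indices the per-process open-file limit needs to be above that.

```python
import re
from pathlib import Path

# Assumption: headers look like ">001_blabla" (three digits, then "_").
HEADER_INDEX = re.compile(r"^>(\d{3})_")


def split_fasta_by_index(fasta_path, out_dir):
    """Write each record to <out_dir>/<index>.fasta according to its header index."""
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    handles = {}      # index -> open output file
    current = None    # output file of the record being read
    try:
        with open(fasta_path) as src:
            for line in src:
                if line.startswith(">"):
                    match = HEADER_INDEX.match(line)
                    if match is None:
                        raise ValueError(f"header without index: {line.rstrip()}")
                    index = match.group(1)
                    if index not in handles:
                        handles[index] = open(out_dir / f"{index}.fasta", "w")
                    current = handles[index]
                # Sequence lines follow their header into the same file.
                if current is not None:
                    current.write(line)
    finally:
        for handle in handles.values():
            handle.close()
    return sorted(handles)
```

Calling `split_fasta_by_index("combined.fasta", "by_index")` (both names hypothetical) would read the input once, line by line, so the whole file never has to fit in memory.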


Similar Content



Grep: Find Files That Do Not Have Multiple Different Strings

Hi all,

I'm trying to identify files that do not have matches for certain strings. FYI, these are files of DNA sequences and I'm trying to find those that are NOT sampled for any species by my group of interest (e.g., genes that are specific to that group of organisms).

I tried this code but it's actually yielding a list of files that DO match for my regexp.
Code:
for FILENAME in *.fas
do
grep -q -L ">PBAH" $FILENAME && grep -q -L ">SKOW" $FILENAME && grep -q -L ">CGRA" $FILENAME && echo $FILENAME
done

Basically, I want to go through and find files that do not contain ">PBAH", ">SKOW" or ">CGRA". Any assistance would be greatly appreciated!

Best,
Kevin
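
One way to express "contains none of these strings" without depending on grep exit codes is a small Python sketch like the one below. The function name and the default "*.fas" pattern are assumptions for illustration; the markers are the ones from the post.

```python
from pathlib import Path


def files_without_markers(directory, markers, pattern="*.fas"):
    """Return the names of files in `directory` that contain none of `markers`."""
    missing = []
    for path in sorted(Path(directory).glob(pattern)):
        text = path.read_text()
        # Keep the file only if every marker is absent.
        if not any(marker in text for marker in markers):
            missing.append(path.name)
    return missing
```

For the case described above, the call would be `files_without_markers(".", [">PBAH", ">SKOW", ">CGRA"])`.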

Removing Multiple Lines From Cell Data In A .csv File

I am trying to process some .csv files with Linux as follows:

Some fields have data with newline characters embedded, like so:

"Bob Smith
531 Pennsylvania Avenue
Washington, DC"

(I verified the existence of the " via Wordpad. The file is too large to easily edit in Wordpad to get all the data for each row on a single line).

What Linux command would I use on the files to get the data in each cell onto one line?

I have tried:

1. awk -v RS="" '{gsub (/\n/,"")}1' file > newfile

but the cell data was still being read in as if "531 Pennsylvania Avenue" was a brand new row in the CSV file.

2. Command 1 followed by awk -v RS="" '{gsub (/\r/,"")}1' newfile > finalFile

but that resulted in all of the data in the file being put onto a single line.

3. awk -v RS="" '{gsub (/\r\n/,"")}1' file > newFile

But that result was the same as attempt number 2.

How can I preprocess the file so that:

"Bob Smith
531 Pennsylvania Avenue
Washington, DC"

is read as a single field on a single line as part of the row it should be associated with, like

"Bob Smith 531 Pennsylvania Avenue Washington, DC"
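
Because the line breaks sit inside quoted fields, a CSV-aware parser is probably easier than a line-based tool. The following is a rough Python sketch using the standard `csv` module; the function name and file arguments are hypothetical, and it assumes the file is ordinary comma-separated text with double-quoted fields. Output rows are written with "\n" line endings.

```python
import csv


def flatten_embedded_newlines(src_path, dst_path):
    """Copy a CSV file, replacing line breaks inside fields with single spaces."""
    rows = 0
    # newline="" lets the csv module handle \r\n and \n inside quoted fields.
    with open(src_path, newline="") as src, open(dst_path, "w", newline="") as dst:
        writer = csv.writer(dst, lineterminator="\n")
        for row in csv.reader(src):
            # splitlines() copes with \n, \r and \r\n alike.
            writer.writerow([" ".join(field.splitlines()) for field in row])
            rows += 1
    return rows
```

Fields that contain commas, such as the address above, are still written in quotes, so the row structure is preserved.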

Converting Multiple Gedit Files To Windows Versions

I have a few score (>50) files in FASTA format. They work fine on Linux,
but I have to send them to a colleague who uses Windows, and the files don't open properly in Notepad or WordPad. Using "Save As" to a Windows format does the trick,

but I don't want to convert all of them manually.

Is there a way to convert multiple files and save them in a format of my choosing from, say, the terminal?
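
If the "Windows format" here means Windows line endings (CRLF instead of LF), which is an assumption on my part, a batch conversion could be sketched in Python as follows. The function name, the "*.fasta" pattern and the separate output directory are illustrative choices; the originals are left untouched.

```python
from pathlib import Path


def convert_to_crlf(src_dir, dst_dir, pattern="*.fasta"):
    """Write copies of matching files into dst_dir with CRLF line endings."""
    dst_dir = Path(dst_dir)
    dst_dir.mkdir(parents=True, exist_ok=True)
    converted = []
    for path in sorted(Path(src_dir).glob(pattern)):
        data = path.read_bytes()
        # Normalise first so lines that already end in CRLF are not doubled.
        unix = data.replace(b"\r\n", b"\n")
        (dst_dir / path.name).write_bytes(unix.replace(b"\n", b"\r\n"))
        converted.append(path.name)
    return converted
```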

Need Help In Bash Scripting

I have two files which have exactly the same number of lines.
I want the first line of the first file to be the filename of a new file, and the content of this new file to be the first line of the second file.
Then the second line of the first file should be the filename of another new file, whose content is the second line of the second file,
then the third line of the first file names a third new file, whose content is the third line of the second file,
and so on...
I am trying to do it with a for loop, but I am not able to combine two for loops.
This is what I have done
Code:
IFS=$'\n'
var=$(sed 's/\"http\(.*\)\/\(.*\).wav\"\,\".*/\2/g' 1797.csv) # filenames of all files
var2=$(sed 's/\"http\(.*\)\/\(.*\).wav\"\,\"\(.*\)\"$/\3/g' 1797.csv) # contents of all files
for j in $var;
do
#Here I do not know how to use $var2
done

Please help.
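
The need for two loops goes away if both files are read in lockstep. Here is a small Python sketch of that idea, assuming the names and the contents are already available as two plain text files with one entry per line (the function name and arguments are hypothetical):

```python
from pathlib import Path


def write_paired_files(names_path, contents_path, out_dir):
    """Create one file per line: line N of names_path names it, line N of contents_path fills it."""
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    created = 0
    with open(names_path) as names, open(contents_path) as contents:
        # zip() walks both files together and stops at the shorter one.
        for name, content in zip(names, contents):
            name = name.strip()
            if not name:
                continue  # skip blank filename lines
            (out_dir / name).write_text(content)
            created += 1
    return created
```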

Need Kernel Header File

Hi folks,

I'm trying to install the drivers for my "legacy" nvidia graphics card.
I've downloaded the file from nvidia's website to install the driver, but during the process I get this message...

Kernel header file ' /lib/modules/3.13.0-37-generic/build/include/linux/version.h ' does not exist .

The most likely reason is the kernel source files in ' /lib/modules/3.13.0-37-generic/build ' have not been configured.

Anyone know how to configure this file? I've been working on this for a few days now...it's getting old!

Thanks for your help.

Joe

How Can I Print A Specific Range Of Numbers From A File

hello,

I am trying to build a table from some files. I used this to count how many "RD_" fields there are in my file:
Code:
grep -o 'RD_' $f|grep -c 'RD_'
For example, I got 5 "RD_" fields. Now I want to print that many fields (5) from another file, starting from the 2nd field. I did it manually like this:
Code:
awk 'NR==1{print"{"$2","$3","$4","$5","$6","0.0000",""0.0000""}"","}' $file
I want the two steps to work together and be a bit more automatic, something like this (pseudocode):
Code:
awk 'NR==1{print"{"$2"to "$5"," append zeros to make it total 7 fields"}"","}' $file


Your comments would be appreciated.
Thanks a lot
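
A possible way to join the two steps, sketched in Python rather than awk. The function names are hypothetical, and the sketch assumes whitespace-separated fields, a total width of 7 and "0.0000" as the padding value, as in the manual example above.

```python
def count_marker(path, marker="RD_"):
    """Count occurrences of `marker` in a file (like grep -o ... | grep -c)."""
    with open(path) as handle:
        return handle.read().count(marker)


def format_row(line, count, total=7, filler="0.0000"):
    """Take `count` fields starting at the 2nd one and pad with `filler` up to `total`."""
    fields = line.split()
    values = fields[1:1 + count]
    values += [filler] * (total - len(values))
    return "{" + ",".join(values) + "},"
```

With these, `format_row(first_line, count_marker(f))` would build the first table row, where `first_line` is line 1 of the second file.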

Python Ftplib

Hello all,

Please help me with Python's ftplib. I am trying to copy files from my Linux machine to a Windows server using ftplib. Everything works well, but I am only able to copy files from the directory the script itself is in. How do I copy files from a different directory? I always get a "file not found" error message. Here's my code:

Code:
tester_name = str (socket.gethostname())
def upload(ftp, file):
    ext = os.path.splitext(file)[1]
    if ext in (".txt", ".htm", ".html"):
        ftp.storlines("STOR " + file, open(file))
    else:
        ftp.storbinary("STOR " + file, open(file, "rb"), 1024)



parse_source_path = ('/path/to/where/i/go/')
parse_source_file_list = os.listdir(parse_source_path)

ftp = ftplib.FTP("server_IP")
ftp.login("username", "pass")

folder_list = []

ftp.dir(folder_list.append)

if str(tester_name) not in str(folder_list) :
    ftp.mkd("%s"%tester_name)
    ftp.cwd("%s"%tester_name)
    for files in parse_source_file_list :
        print files
        upload(ftp, files)


else :
    print "later"
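
A likely cause is that `os.listdir` returns bare file names, so `open(file)` looks for them in the script's working directory. Below is a hedged Python 3 sketch that joins each name with its source directory for the local `open` and uses only the base name on the server. `upload_directory` is a hypothetical helper, and `ftp` is assumed to be a connected `ftplib.FTP` object already changed into the target folder.

```python
import os

TEXT_EXTENSIONS = (".txt", ".htm", ".html")


def upload(ftp, local_path):
    """Upload one local file; the remote name is the file's base name."""
    remote_name = os.path.basename(local_path)
    ext = os.path.splitext(remote_name)[1]
    # Python 3's storlines() and storbinary() both expect a binary-mode file.
    with open(local_path, "rb") as handle:
        if ext in TEXT_EXTENSIONS:
            ftp.storlines("STOR " + remote_name, handle)
        else:
            ftp.storbinary("STOR " + remote_name, handle, 1024)


def upload_directory(ftp, source_dir):
    """Upload every regular file found directly inside source_dir."""
    uploaded = []
    for name in sorted(os.listdir(source_dir)):
        local_path = os.path.join(source_dir, name)  # full path, not just the name
        if os.path.isfile(local_path):
            upload(ftp, local_path)
            uploaded.append(name)
    return uploaded
```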

Extract Middle Of File - How To Strip Header/footer

I have a log file with a header (which I can skip with awk), and a footer, which I need to find a way to remove. The goal is to extract the middle lines from a file. Specifically, there is a header (1 line) and a footer (1 line).
The only way I can figure out how to do this is if I already know how many lines are in the file to begin with. For example, if the file looks like this:
line 1 (header)
line 2 (interesting line)
line 3 (interesting line)
line 4 (footer)
I just want to extract the middle "interesting lines" without the header/footer lines.
I can't use grep to remove the header/footer, because I don't know what those lines will contain, only that they exist and are exactly 1 line each. In general, I don't know how many lines are in the file.
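
Dropping the first and last line does not require knowing the line count: it is enough to hold each line back by one step and never emit the final one. A minimal Python sketch of that idea (the function name is hypothetical):

```python
def middle_lines(lines):
    """Yield every line except the first and the last, in a single pass."""
    iterator = iter(lines)
    next(iterator, None)             # discard the header
    previous = next(iterator, None)  # first candidate line
    for line in iterator:
        # `previous` is followed by another line, so it cannot be the footer.
        yield previous
        previous = line
    # The last value of `previous` is the footer and is never yielded.
```

Since an open file is iterable line by line, `middle_lines(open("logfile"))` would work for a log of any length, with "logfile" standing in for the real path.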

Need Help Cat Multiple Files To One File

I am currently running a system simulation on multiple files.
I have a computer algorithm written in perl to run "system" simulations for all the files I need

What I am trying to do is put multiple files into one file. The only problem is that it's not doing exactly what I need it to do.

Example:

I am running "cat txt0.txt txt1.txt txt2.txt txt3.txt > allfiles.txt"

I need it to read as

txt0.txt
txt1.txt
txt2.txt
txt3.txt

Instead, it's taking the contents of each txt file and putting them all together, so the result looks like this:


fdfasdfqwdefdfefdkfkkkkkkkkkkkkkkkfsdfasdxfewqfe..........

all clustered together

You get the picture?

I am really confused how to get this to work, there are over 100 files that need to go into a single file.
That way when I run it through the perl algorithm I created, I can do it in one shot.
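
If the run-together text comes from input files that do not end with a newline (an assumption, since the post does not say), one option is to add the missing newline while concatenating. A small Python sketch with a hypothetical function name:

```python
def concatenate(paths, dst_path):
    """Join files into dst_path, making sure each file's content ends with a newline."""
    with open(dst_path, "w") as dst:
        for path in paths:
            with open(path) as src:
                text = src.read()
            dst.write(text)
            # Add the separator only when the file did not already end with one.
            if text and not text.endswith("\n"):
                dst.write("\n")
```

For more than 100 inputs, the list of paths could come from something like `sorted(glob.glob("txt*.txt"))`, bearing in mind that a plain sort puts "txt10.txt" before "txt2.txt".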

Libre Calc Exceeding Limit

I am trying to open a 100 MB FASTA file in LibreOffice Calc. It hangs for several minutes and finally reports that the file exceeded the row limit. Is there any way for me to view this file in a spreadsheet?